diff --git a/README b/README deleted file mode 100644 index 3f53ba2..0000000 --- a/README +++ /dev/null @@ -1,71 +0,0 @@ -SMAZ - compression for very small strings ------------------------------------------ - -Smaz is a simple compression library suitable for compressing very short -strings. General purpose compression libraries will build the state needed -for compressing data dynamically, in order to be able to compress every kind -of data. This is a very good idea, but not for a specific problem: compressing -small strings will not work. - -Smaz instead is not good for compressing general purpose data, but can compress -text by 40-50% in the average case (works better with English), and is able to -perform a bit of compression for HTML and urls as well. The important point is -that Smaz is able to compress even strings of two or three bytes! - -For example the string "the" is compressed into a single byte. - -To compare this with other libraries, think that like zlib will usually not be able to compress text shorter than 100 bytes. - -COMPRESSION EXAMPLES --------------------- - -'This is a small string' compressed by 50% -'foobar' compressed by 34% -'the end' compressed by 58% -'not-a-g00d-Exampl333' enlarged by 15% -'Smaz is a simple compression library' compressed by 39% -'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49% -'this is an example of what works very well with smaz' compressed by 49% -'1000 numbers 2000 will 10 20 30 compress very little' compressed by 10% - -In general, lowercase English will work very well. It will suck with a lot -of numbers inside the strings. 
Other languages are compressed pretty well too, -the following is Italian, not very similar to English but still compressible -by smaz: - -'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33% -'Mi illumino di immenso' compressed by 37% -'L'autore di questa libreria vive in Sicilia' compressed by 28% - -It can compress URLS pretty well: - -'http://google.com' compressed by 59% -'http://programming.reddit.com' compressed by 52% -'http://github.com/antirez/smaz/tree/master' compressed by 46% - -USAGE ------ - -The lib consists of just two functions: - - int smaz_compress(char *in, int inlen, char *out, int outlen); - -Compress the buffer 'in' of length 'inlen' and put the compressed data into -'out' of max length 'outlen' bytes. If the output buffer is too short to hold -the whole compressed string, outlen+1 is returned. Otherwise the length of the -compressed string (less then or equal to outlen) is returned. - - int smaz_decompress(char *in, int inlen, char *out, int outlen); - -Decompress the buffer 'in' of length 'inlen' and put the decompressed data into -'out' of max length 'outlen' bytes. If the output buffer is too short to hold -the whole decompressed string, outlen+1 is returned. Otherwise the length of the -compressed string (less then or equal to outlen) is returned. This function will -not automatically put a nul-term at the end of the string if the original -compressed string didn't included a nulterm. - - -CREDITS -------- - -Small was writte by Salvatore Sanfilippo and is released under the BSD license. Check the COPYING file for more information. diff --git a/README.md b/README.md new file mode 100644 index 0000000..9269ab6 --- /dev/null +++ b/README.md @@ -0,0 +1,139 @@ +Smaz +========================================= + +Compression for very small strings +---------------------------------- + +Smaz is a simple compression library suitable for compressing very short +strings. 
General purpose compression libraries will build the state needed +for compressing data dynamically, in order to be able to compress every kind +of data. This is a very good idea in general, but it fails for one specific problem: +compressing small strings. + +Smaz instead is not good for compressing general purpose data, but can compress +text by 40-50% in the average case (works better with English), and is able to +perform a bit of compression for HTML and URLs as well. The important point is +that Smaz is able to compress even strings of two or three bytes! + +For example the string "the" is compressed into a single byte. + +To compare this with other libraries, consider that a library like zlib will usually not be +able to compress text shorter than 100 bytes. + +Compression Examples +-------------------- + +* 'This is a small string' compressed by 50% +* 'foobar' compressed by 34% +* 'the end' compressed by 58% +* 'not-a-g00d-Exampl333' enlarged by 15% +* 'Smaz is a simple compression library' compressed by 39% +* 'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49% +* 'this is an example of what works very well with smaz' compressed by 49% +* '1000 numbers 2000 will 10 20 30 compress very little' compressed by 10% + +In general, lowercase English will work very well. It will suck with a lot +of numbers inside the strings. 
Other languages are compressed pretty well too; +the following is Italian, not very similar to English but still compressible +by smaz: + +* 'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33% +* 'Mi illumino di immenso' compressed by 37% +* 'L'autore di questa libreria vive in Sicilia' compressed by 28% + +It can compress URLs pretty well: + +* 'http://google.com' compressed by 59% +* 'http://programming.reddit.com' compressed by 52% +* 'http://github.com/antirez/smaz/tree/master' compressed by 46% + +Usage +----- + +**Compression:** + +The compression function is: + +```cpp +int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen); +``` + +This compresses the buffer 'in' of length 'inlen' and puts the compressed data into +'out' of max length 'outlen' bytes. If the output buffer is too short to hold +the whole compressed string, outlen+1 is returned. Otherwise the length of the +compressed string (less than or equal to outlen) is returned. + +The first parameter is the lookup trie used for compression. The default one can be generated with: + +```cpp +struct SmazBranch *smaz_build_trie(); +``` + +Alternatively, you can provide a custom codebook with: + +```cpp +struct SmazBranch *smaz_build_custom_trie(char *codebook[254]); +``` + +*Note:* If you are using a custom codebook, be sure not to have any entries exceeding +11 characters in length. 
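The length limit in the note above can be checked before a custom trie is built. The following is a minimal sketch and not part of Smaz: `first_bad_codebook_entry` and `MAX_ENTRY_LEN` are hypothetical names introduced for illustration, and the limit of 11 is taken directly from the note.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: 11 is the maximum entry length stated in the README note. */
#define MAX_ENTRY_LEN 11

/* Hypothetical helper, not part of the Smaz API.
 * Returns the index of the first entry that is NULL or longer than
 * MAX_ENTRY_LEN characters, or -1 if every entry is acceptable. */
int first_bad_codebook_entry(char *codebook[], int count) {
    int i;
    for (i = 0; i < count; i++) {
        if (codebook[i] == NULL || strlen(codebook[i]) > MAX_ENTRY_LEN) {
            return i;
        }
    }
    return -1;
}
```

Calling this with the codebook and a count of 254 before `smaz_build_custom_trie` would flag an oversized entry early, rather than leaving it to be discovered while the trie is being built.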
+ +The original reference implementation of Smaz compression is included for testing +and benchmarking comparison purposes: + +```cpp +int smaz_compress_ref(char *in, int inlen, char *out, int outlen); +``` + +**Decompression:** + +To decompress with the default codebook: + +```cpp +int smaz_decompress(char *in, int inlen, char *out, int outlen); +``` + +Or if you are using a custom codebook: + +```cpp +int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen); +``` + +These decompress the buffer 'in' of length 'inlen' and put the decompressed data into +'out' of max length 'outlen' bytes. If the output buffer is too short to hold +the whole decompressed string, outlen+1 is returned. Otherwise the length of the +decompressed string (less than or equal to outlen) is returned. These functions will +not automatically put a null terminator at the end of the string if the original +string didn't include one. + +smaz_test +--------- + +smaz_test.c contains some simple tests and comparative benchmarks between the reference +implementation and the trie implementation. + +The provided makefile should take care of compilation. Running the tests will take up +about a gigabyte of RAM, as some tests pre-generate large numbers of strings. + + +Trie speed improvement +---------------------- + +These are just some rough numbers measured on my machine. + +For very compressible data, the new implementation appears ~2.2x faster than the +reference implementation. + +Basic English strings should see something around a 2.6x speed improvement. + +For random textual strings you can get somewhere around a 4.9x speed increase. + + +Credits +------- + +Smaz was written by Salvatore Sanfilippo and is released under the 3-clause BSD license. +Check the COPYING file for more information. + +Trie-based implementation by Richard Johnson, released under the same BSD license. 
+ diff --git a/TODO b/TODO deleted file mode 100644 index fe6a5ae..0000000 --- a/TODO +++ /dev/null @@ -1,4 +0,0 @@ -const-correct the source code -release the ruby script to build new specialized dictionaries -play well against currupted input in verbatim 253/254 codes memcpy() -play with some form of entropy coding like Huffman or range coding diff --git a/smaz.c b/smaz.c index aa674c3..9f18681 100644 --- a/smaz.c +++ b/smaz.c @@ -1,4 +1,9 @@ +#include #include +#include +#include + +#include "smaz.h" /* Our compression codebook, used for compression */ static char *Smaz_cb[241] = { @@ -76,7 +81,224 @@ static char *Smaz_rcb[254] = { "e, ", " it", "whi", " ma", "ge", "x", "e c", "men", ".com" }; -int smaz_compress(char *in, int inlen, char *out, int outlen) { +#define SMAZ_END_LETTER 'z' + +void smaz_free_trie(struct SmazBranch *t) { + /* + if (t->children != NULL) { + int x = 0; + for (x = 0; x < SMAZ_LETTER_COUNT; x++) { + if (t->children[x] != NULL) { + smaz_free_trie(t->children[x]); + } + } + } + if (t->shortcut != NULL) { + free(t->shortcut); + } + */ + free(t); +} + + +void smaz_add_to_branch(struct SmazBranch *t, char *remEntry, int value, struct SmazBranch *g_trie, int *g_branch_counter) { + int entryLen; + entryLen = strlen(remEntry); + + if (t->use_shortcut == 0) { + t->shortcut_length = entryLen; + memcpy(t->shortcut, remEntry, entryLen); + t->value = value; + t->use_shortcut = 1; + return; + } + + if (entryLen == 0 && t->shortcut_length == 0) { + t->value = value; + return; + } else { + int smallestLen = entryLen; + int x; + + if (smallestLen > t->shortcut_length) { + smallestLen = t->shortcut_length; + } + + for (x = 0; x < smallestLen && t->shortcut[x] == remEntry[x]; x++) { } + + if (x < t->shortcut_length) { + int tkey; + struct SmazBranch *newTBranch; + + tkey = (int)t->shortcut[x]; + + *g_branch_counter = *g_branch_counter+1; + assert(*g_branch_counter < 256); + newTBranch = &g_trie[*g_branch_counter]; + memcpy( + newTBranch->children, + 
t->children, + SMAZ_LETTER_COUNT * sizeof(struct SmazBranch *) + ); + memset(t->children, 0, SMAZ_LETTER_COUNT * sizeof(struct SmazBranch *)); + + newTBranch->value = t->value; + + memcpy(&newTBranch->shortcut[0], &t->shortcut[x+1], (t->shortcut_length - x)); + + newTBranch->shortcut_length = strlen(newTBranch->shortcut); + newTBranch->use_shortcut = 1; + + t->children[tkey] = newTBranch; + t->shortcut[x] = 0; + t->shortcut_length = strlen(t->shortcut); + t->value = -1; + } else { + /* the value of t remains */ + } + if (x < entryLen) { + /* we can assign the v to a child */ + int vkey; + char *vtail; + + vkey = remEntry[x]; + vtail = (char *)calloc((entryLen - x + 1), sizeof(char)); + memcpy(vtail, &remEntry[x+1], (entryLen - x)); + + if (t->children[vkey] == NULL) { + struct SmazBranch *newVBranch; + *g_branch_counter = *g_branch_counter+1; + assert(*g_branch_counter < 256); + newVBranch = &g_trie[*g_branch_counter]; + newVBranch->value = -1; + t->children[vkey] = newVBranch; + } + smaz_add_to_branch(t->children[vkey], vtail, value, g_trie, g_branch_counter); + free(vtail); + } else { + /* the value of v now takes up the position */ + t->value = value; + } + } +} + +struct SmazBranch *smaz_build_custom_trie(char *codebook[254]) { + struct SmazBranch *trie; + int x; + + int *g_branch_counter = 0; + struct SmazBranch *g_trie; + + g_trie = (struct SmazBranch *)calloc(256, sizeof(struct SmazBranch)); + g_trie[0].value = -1; + g_branch_counter = (int *)calloc(1, sizeof(int)); + *g_branch_counter = 1; + + for (x = 0; x < 254; x++) { + smaz_add_to_branch(&g_trie[0], codebook[x], x, g_trie, g_branch_counter); + } + + free(g_branch_counter); + + return g_trie; +} + +struct SmazBranch *smaz_build_trie() { + return smaz_build_custom_trie(Smaz_rcb); +} + +int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen) { + int verblen = 0, _outlen = outlen; + char verb[256], *_out = out; + + while(inlen) { + int needed = 0; + char *flush = NULL; + int 
length = 0; + struct SmazBranch *branch = NULL; + int remaining_length = inlen; + + branch = trie; + while (remaining_length--) { + unsigned char nextChar; + struct SmazBranch **children; + struct SmazBranch *tmpBranch; + char *shortcut; + int shortcut_length; + + nextChar = in[length]; + if (nextChar > SMAZ_END_LETTER) { + break; + } + children = branch->children; + if (!(children && children[nextChar])) { + break; + } + + tmpBranch = children[nextChar]; + shortcut = tmpBranch->shortcut; + shortcut_length = tmpBranch->shortcut_length; + length++; + if (shortcut) { + if (length <= inlen && memcmp(shortcut, in+length, shortcut_length)) { + length--; + break; + } + length += shortcut_length; + } + branch = tmpBranch; + } + if (branch->value >= 0 && length <= inlen) { + /* Match found, prepare a verbatim bytes flush if needed */ + if (verblen) { + needed = (verblen == 1) ? 2 : 2+verblen; + flush = out; + out += needed; + outlen -= needed; + } + /* Emit the byte */ + if (outlen <= 0) return _outlen+1; + out[0] = branch->value; + out++; + outlen--; + inlen -= length; + in += length; + goto out; + } + + /* Match not found - add the byte to the verbatim buffer */ + verb[verblen] = in[0]; + verblen++; + inlen--; + in++; +out: + /* Prepare a flush if we reached the flush length limit, and there + * is not already a pending flush operation. */ + if (!flush && (verblen == 256 || (verblen > 0 && inlen == 0))) { + needed = (verblen == 1) ? 
2 : 2+verblen; + flush = out; + out += needed; + outlen -= needed; + if (outlen < 0) return _outlen+1; + } + /* Perform a verbatim flush if needed */ + if (flush) { + if (verblen == 1) { + flush[0] = (signed char)254; + flush[1] = verb[0]; + } else { + flush[0] = (signed char)255; + flush[1] = (signed char)(verblen-1); + memcpy(flush+2,verb,verblen); + } + flush = NULL; + verblen = 0; + } + } + return out-_out; +} + +int smaz_compress_ref(char *in, int inlen, char *out, int outlen) { unsigned int h1,h2,h3=0; int verblen = 0, _outlen = outlen; char verb[256], *_out = out; @@ -105,6 +327,7 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) { * prepare a verbatim bytes flush if needed */ if (verblen) { needed = (verblen == 1) ? 2 : 2+verblen; + /*printf("Verb good: %d\n", verblen);*/ flush = out; out += needed; outlen -= needed; @@ -112,6 +335,7 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) { /* Emit the byte */ if (outlen <= 0) return _outlen+1; out[0] = slot[slot[0]+1]; + /*printf("Value: %d\n", *(unsigned char *)(&slot[slot[0]+1]));*/ out++; outlen--; inlen -= j; @@ -155,6 +379,9 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) { } int smaz_decompress(char *in, int inlen, char *out, int outlen) { + return smaz_decompress_custom(Smaz_rcb, in, inlen, out, outlen); +} +int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen) { unsigned char *c = (unsigned char*) in; char *_out = out; int _outlen = outlen; @@ -179,7 +406,7 @@ int smaz_decompress(char *in, int inlen, char *out, int outlen) { inlen -= 2+len; } else { /* Codebook entry */ - char *s = Smaz_rcb[*c]; + char *s = cb[*c]; int len = strlen(s); if (outlen < len) return _outlen+1; diff --git a/smaz.h b/smaz.h index ce9c35d..8288570 100644 --- a/smaz.h +++ b/smaz.h @@ -1,7 +1,23 @@ #ifndef _SMAZ_H #define _SMAZ_H -int smaz_compress(char *in, int inlen, char *out, int outlen); +#define SMAZ_LETTER_COUNT ('z'+1) + +struct SmazBranch { + 
struct SmazBranch *children[SMAZ_LETTER_COUNT]; + char shortcut[12]; + int use_shortcut; + int shortcut_length; + int value; +}; + +struct SmazBranch *smaz_build_trie(); +struct SmazBranch *smaz_build_custom_trie(char *codebook[254]); +void smaz_free_trie(struct SmazBranch *t); + +int smaz_compress_ref(char *in, int inlen, char *out, int outlen); +int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen); int smaz_decompress(char *in, int inlen, char *out, int outlen); +int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen); #endif diff --git a/smaz_test.c b/smaz_test.c index 47c02d6..f74fcdc 100644 --- a/smaz_test.c +++ b/smaz_test.c @@ -1,48 +1,310 @@ #include #include #include +#include #include "smaz.h" -int main(void) { - char in[512]; - char out[4096]; - char d[4096]; - int comprlen, decomprlen; - int j, ranlen; - int times = 1000000; - char *strings[] = { - "This is a small string", +void hexDump (char *desc, void *addr, int len) { + int i; + unsigned char buff[17]; + unsigned char *pc = addr; + + if (desc != NULL) { + printf ("%s:\n", desc); + } + + for (i = 0; i < len; i++) { + + if ((i % 16) == 0) { + if (i != 0) { + printf (" %s\n", buff); + } + printf (" %04x ", i); + } + + printf (" %02x", pc[i]); + + if ((pc[i] < 0x20) || (pc[i] > 0x7e)) { + buff[i % 16] = '.'; + } else { + buff[i % 16] = pc[i]; + } + buff[(i % 16) + 1] = '\0'; + } + + while ((i % 16) != 0) { + printf (" "); + i++; + } + + printf (" %s\n", buff); +} + +int g_seed = 0; + +int fastrand() { + g_seed = (214013 * g_seed + 2531011); + return (g_seed >> 16) & 0x7FFF; +} + +char *strings[] = { + "ht", "foobar", "the end", + "nojQfTh", + "http://google.com", + "try it against urls", + "Mi illumino di immenso", + "http://programming.reddit.com", + "This is a small string", "not-a-g00d-Exampl333", + "/media/hdb1/music/Alben/The Bla", + "and now a few italian sentences:", "Smaz is a simple compression library", - "Nothing is more 
difficult, and therefore more precious, than to be able to decide", - "this is an example of what works very well with smaz", + "http://github.com/antirez/smaz/tree/master", + "L'autore di questa libreria vive in Sicilia", "1000 numbers 2000 will 10 20 30 compress very little", - "and now a few italian sentences:", + "this is an example of what works very well with smaz", "Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura", - "Mi illumino di immenso", - "L'autore di questa libreria vive in Sicilia", - "try it against urls", - "http://google.com", - "http://programming.reddit.com", - "http://github.com/antirez/smaz/tree/master", - "/media/hdb1/music/Alben/The Bla", + "Nothing is more difficult, and therefore more precious, than to be able to decide", + "QtZpZuMhlzfgHFEGA.Kja/hsIayllFSAMFDl.fQ/bJdzfzCvxdclaIbzzWyhbOhCj.nydSJSbmPUzhOHYqszMhvIBqqsSluQkxLbcUuRVXmhS.CrCIBPpKXEPbyhLDLJNn.pVGFEdFmKDC VLAk.LWDqLOlmhyvviIzBOBWsWGQpIPJjftiEd updeZIZjBVrOmDPGJmcZZ CziiEeAhtvkUnYdaFuvKGvdmQnmGaZVtWCpaxpVozEWjc/HyGQFMaiMqjzKYmgPGzSxsFPuCjP JcHUinZvLWVPTSarCUUYQmSGGyPYfeXCEunngaxFxPleyZjNtClHCRdYdsxWkiopaZqU.kaINJmZiUmp", + "oTnBmdtIaFEFHFpgqkGlYdCtqIXFTKIPfJsdIotaZ/oUGWaKHmBzzMQyKteDKLXHedxalAfHzAQTgesqyLzo/.rjxQzbWZPzUbqdnuceRejfVz/xpDBfCGUUdlLYkSyt.uRv.dQaJEW.bPsJrQWjNBbKLFbdLmauPiCdEVHgXKIZazGSriVrjQs.H.itMHFJDuajeCqOtKZFJdyUtEEqbbj.s.FQkAyXHdjHoQxDWvnFfgBMLXtFKJZvnRMiUfAgMJbH/TsXzMSKdlOHkxAJPWD//QbmuNyQWAHVIevtohUfRbCktvHfSuopjQSTWl/fpV/tNMCCSWOINMGptyRBZNobtdL.KMzKqvnnu.A.jWgOMtLrrHpcCB.GLIREreLBK.BsYABRttLHo/QhDrZNSzJPZQR.nPJEJvHMX/sO/H.tksygrsDlCIzyJMR.O.scMfNcfKufJrbeJYcALDfxRYHKTPLmmUeTe", NULL }; - j=0; +void test_compress_small_out_buff() { + char out[4096]; + struct SmazBranch *trie; + int comprlen = 0; + /* skip over the first test string that will give us only 1 byte */ + int j = 1; + + trie = smaz_build_trie(); while(strings[j]) { - int comprlevel; + comprlen = smaz_compress( + trie, + strings[j], + strlen(strings[j]), + out, + j + ); + if 
(comprlen != j+1) { + printf("Error: Expected return size: %d, got %d\n", j+1, comprlen); + exit(1); + } + j++; + } + + smaz_free_trie(trie); + + printf("TEST PASSED :)\n"); +} + +void test_null_term() { + char comp_out[256]; + char decomp_out[256]; + char no_null_str[4] = "test"; + char null_term_str[] = "test"; /* implicit null here */ + int comprlen = 0; + int decomprlen = 0; + struct SmazBranch *trie; + + trie = smaz_build_trie(); + comprlen = smaz_compress( + trie, + no_null_str, + 4, + comp_out, + 256 + ); + decomprlen = smaz_decompress( + comp_out, + comprlen, + decomp_out, + 256 + ); + if (decomprlen != 4) { + printf("Error: Expected return size: %d, got %d\n", 4, decomprlen); + exit(1); + } + if (decomp_out[3] != 't') { + printf( + "Error: Incorrect final char on string: %c, expected %c\n", + decomp_out[3], + 't' + ); + exit(1); + } + comprlen = smaz_compress( + trie, + null_term_str, + strlen(null_term_str)+1, /* include the null terminator this time. */ + comp_out, + 256 + ); + decomprlen = smaz_decompress( + comp_out, + comprlen, + decomp_out, + 256 + ); + if (decomprlen != 5) { + printf("Error: Expected return size: %d, got %d\n", 5, decomprlen); + hexDump("out", &decomp_out, decomprlen); + exit(1); + } + if (decomp_out[4] != 0) { + printf( "Error: Incorrect final char on string: %c, expected \\0\n", + decomp_out[4] + ); + exit(1); + } + + smaz_free_trie(trie); + + printf("TEST PASSED :)\n"); +} + +void bench_trie_smaz() { + FILE *infile; + char *in; + char *comp_out; + char *de_comp_out; + long numbytes; + int num_loops = 1000; + + infile = fopen("war_of_the_worlds.txt", "r"); + if (infile == NULL) { + printf("Missing war of the worlds text, you can download the text here: http://www.gutenberg.org/ebooks/36 and save it as war_of_the_worlds.txt\n"); + exit(1); + } + + fseek(infile, 0L, SEEK_END); + numbytes = ftell(infile); + printf("Processing %d bytes, %d times\n", (int)numbytes, num_loops); + fseek(infile, 0L, SEEK_SET); + in = 
(char*)calloc(numbytes, sizeof(char)); + comp_out = (char*)calloc(numbytes, sizeof(char)); + de_comp_out = (char*)calloc(numbytes, sizeof(char)); + numbytes = fread(in, sizeof(char), numbytes, infile); + fclose(infile); + + { + struct timeval t1, t2; + int x; + struct SmazBranch *trie; + + trie = smaz_build_trie(); + + gettimeofday(&t1, NULL); + for (x = 0; x < num_loops; x++) { + smaz_compress( + trie, + in, + numbytes, + comp_out, + numbytes + ); + } + gettimeofday(&t2, NULL); + printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec); + printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec); + + smaz_free_trie(trie); + } + + free(in); + free(comp_out); + free(de_comp_out); +} + +void bench_old_smaz() { + FILE *infile; + char *in; + char *comp_out; + char *de_comp_out; + long numbytes; + int num_loops = 1000; + + infile = fopen("war_of_the_worlds.txt", "r"); + if (infile == NULL) { + printf("Missing war of the worlds text, you can download the text here: http://www.gutenberg.org/ebooks/36 and save it as war_of_the_worlds.txt\n"); + exit(1); + } + + fseek(infile, 0L, SEEK_END); + numbytes = ftell(infile); + printf("Processing %d bytes, %d times\n", (int)numbytes, num_loops); + fseek(infile, 0L, SEEK_SET); + in = (char*)calloc(numbytes, sizeof(char)); + comp_out = (char*)calloc(numbytes, sizeof(char)); + de_comp_out = (char*)calloc(numbytes, sizeof(char)); + numbytes = fread(in, sizeof(char), numbytes, infile); + fclose(infile); + + { + struct timeval t1, t2; + int x; + + gettimeofday(&t1, NULL); + for (x = 0; x < num_loops; x++) { + smaz_compress_ref( + in, + numbytes, + comp_out, + numbytes + ); + } + gettimeofday(&t2, NULL); + printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec); + printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec); + } + + free(in); + free(comp_out); + free(de_comp_out); +} + +void test_strings() { + char out[4096]; + char out_good[4096]; + char 
d[4096]; + int comprlen, decomprlen; + struct SmazBranch *trie; + int j = 0; + + trie = smaz_build_trie(); + + while(strings[j]) { + int comprlevel, comprlen_good; + + comprlen = smaz_compress( + trie, + strings[j], + strlen(strings[j]), + out, + sizeof(out) + ); + + comprlen_good = smaz_compress_ref( + strings[j], + strlen(strings[j]), + out_good, + sizeof(out_good) + ); - comprlen = smaz_compress(strings[j],strlen(strings[j]),out,sizeof(out)); comprlevel = 100-((100*comprlen)/strlen(strings[j])); decomprlen = smaz_decompress(out,comprlen,d,sizeof(d)); - if (strlen(strings[j]) != (unsigned)decomprlen || + + if (comprlen != comprlen_good || + strlen(strings[j]) != (unsigned)decomprlen || memcmp(strings[j],d,decomprlen)) { printf("BUG: error compressing '%s'\n", strings[j]); + hexDump("in", strings[j], strlen(strings[j])); + hexDump("out bad", &out, comprlen); + hexDump("out good", &out_good, comprlen_good); exit(1); } if (comprlevel < 0) { @@ -52,26 +314,262 @@ int main(void) { } j++; } + + smaz_free_trie(trie); + + printf("TEST PASSED :)\n"); +} + +void test_random() { + char in[512]; + char out[4096]; + char d[4096]; + int comprlen, decomprlen; + int j, ranlen = 0; + int times = 1000000; + struct SmazBranch *trie; + + g_seed = 0; + trie = smaz_build_trie(); + printf("Encrypting and decrypting %d test strings...\n", times); while(times--) { char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. 
"; - ranlen = random() % 512; + ranlen = fastrand() % 512; + /*printf("doing %d\n", times);*/ for (j = 0; j < ranlen; j++) { if (times & 1) - in[j] = charset[random() % (sizeof(charset)-1)]; + in[j] = charset[fastrand() % (sizeof(charset)-1)]; else - in[j] = (char)(random() & 0xff); + in[j] = (char)(fastrand() & 0xff); } - comprlen = smaz_compress(in,ranlen,out,sizeof(out)); - decomprlen = smaz_decompress(out,comprlen,d,sizeof(out)); + comprlen = smaz_compress(trie, in,ranlen,out,sizeof(out)); + /*comprlen = smaz_compress_ref(in,ranlen,out,sizeof(out));*/ + decomprlen = smaz_decompress(out,comprlen,d,sizeof(out)); if (ranlen != decomprlen || memcmp(in,d,ranlen)) { - printf("Bug! TEST NOT PASSED\n"); + printf("Bug! TEST NOT PASSED: %d\n", 1000000-times); + hexDump("in", &in, ranlen); + hexDump("out bad", &out, comprlen); + comprlen = smaz_compress_ref(in,ranlen,out,sizeof(out)); + hexDump("out good", &out, comprlen); exit(1); } - /* printf("%d -> %d\n", comprlen, decomprlen); */ } + + smaz_free_trie(trie); + printf("TEST PASSED :)\n"); +} + +void bench_random_old_smaz() { + char in[512]; + char out[4096]; + int j, ranlen = 0; + int times = 1000000; + struct timeval t1, t2; + + g_seed = 0; + + printf("Encrypting and decrypting %d test strings...\n", times); + gettimeofday(&t1, NULL); + + while(times--) { + char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. 
"; + ranlen = fastrand() % 512; + /*printf("doing %d\n", times);*/ + + for (j = 0; j < ranlen; j++) { + if (times & 1) + in[j] = charset[fastrand() % (sizeof(charset)-1)]; + else + in[j] = (char)(fastrand() & 0xff); + } + smaz_compress_ref(in,ranlen,out,sizeof(out)); + } + gettimeofday(&t2, NULL); + + printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec); + printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec); +} + +void bench_random_trie() { + char in[512]; + char out[4096]; + int j, ranlen = 0; + int times = 1000000; + struct SmazBranch *trie; + struct timeval t1, t2; + + g_seed = 0; + + trie = smaz_build_trie(); + + printf("Encrypting and decrypting %d test strings...\n", times); + gettimeofday(&t1, NULL); + + while(times--) { + char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. "; + ranlen = fastrand() % 512; + /*printf("doing %d\n", times);*/ + + for (j = 0; j < ranlen; j++) { + if (times & 1) + in[j] = charset[fastrand() % (sizeof(charset)-1)]; + else + in[j] = (char)(fastrand() & 0xff); + } + smaz_compress(trie, in,ranlen,out,sizeof(out)); + } + gettimeofday(&t2, NULL); + + printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec); + printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec); + + smaz_free_trie(trie); +} + +/* Reverse compression codebook, used for decompression */ +static char *Smaz_rcb[254] = { +" ", "the", "e", "t", "a", "of", "o", "and", "i", "n", "s", "e ", "r", " th", +" t", "in", "he", "th", "h", "he ", "to", "\r\n", "l", "s ", "d", " a", "an", +"er", "c", " o", "d ", "on", " of", "re", "of ", "t ", ", ", "is", "u", "at", +" ", "n ", "or", "which", "f", "m", "as", "it", "that", "\n", "was", "en", +" ", " w", "es", " an", " i", "\r", "f ", "g", "p", "nd", " s", "nd ", "ed ", +"w", "ed", "http://", "for", "te", "ing", "y ", "The", " c", "ti", "r ", "his", +"st", " in", "ar", "nt", ",", " to", "y", "ng", " h", "with", "le", "al", 
"to ", +"b", "ou", "be", "were", " b", "se", "o ", "ent", "ha", "ng ", "their", "\"", +"hi", "from", " f", "in ", "de", "ion", "me", "v", ".", "ve", "all", "re ", +"ri", "ro", "is ", "co", "f t", "are", "ea", ". ", "her", " m", "er ", " p", +"es ", "by", "they", "di", "ra", "ic", "not", "s, ", "d t", "at ", "ce", "la", +"h ", "ne", "as ", "tio", "on ", "n t", "io", "we", " a ", "om", ", a", "s o", +"ur", "li", "ll", "ch", "had", "this", "e t", "g ", "e\r\n", " wh", "ere", +" co", "e o", "a ", "us", " d", "ss", "\n\r\n", "\r\n\r", "=\"", " be", " e", +"s a", "ma", "one", "t t", "or ", "but", "el", "so", "l ", "e s", "s,", "no", +"ter", " wa", "iv", "ho", "e a", " r", "hat", "s t", "ns", "ch ", "wh", "tr", +"ut", "/", "have", "ly ", "ta", " ha", " on", "tha", "-", " l", "ati", "en ", +"pe", " re", "there", "ass", "si", " fo", "wa", "ec", "our", "who", "its", "z", +"fo", "rs", ">", "ot", "un", "<", "im", "th ", "nc", "ate", "><", "ver", "ad", +" we", "ly", "ee", " n", "id", " cl", "ac", "il", "