diff --git a/README b/README
deleted file mode 100644
index 3f53ba2..0000000
--- a/README
+++ /dev/null
@@ -1,71 +0,0 @@
-SMAZ - compression for very small strings
------------------------------------------
-
-Smaz is a simple compression library suitable for compressing very short
-strings. General purpose compression libraries will build the state needed
-for compressing data dynamically, in order to be able to compress every kind
-of data. This is a very good idea, but not for a specific problem: compressing
-small strings will not work.
-
-Smaz instead is not good for compressing general purpose data, but can compress
-text by 40-50% in the average case (works better with English), and is able to
-perform a bit of compression for HTML and urls as well. The important point is
-that Smaz is able to compress even strings of two or three bytes!
-
-For example the string "the" is compressed into a single byte.
-
-To compare this with other libraries, think that like zlib will usually not be able to compress text shorter than 100 bytes.
-
-COMPRESSION EXAMPLES
---------------------
-
-'This is a small string' compressed by 50%
-'foobar' compressed by 34%
-'the end' compressed by 58%
-'not-a-g00d-Exampl333' enlarged by 15%
-'Smaz is a simple compression library' compressed by 39%
-'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
-'this is an example of what works very well with smaz' compressed by 49%
-'1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%
-
-In general, lowercase English will work very well. It will suck with a lot
-of numbers inside the strings. Other languages are compressed pretty well too,
-the following is Italian, not very similar to English but still compressible
-by smaz:
-
-'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33%
-'Mi illumino di immenso' compressed by 37%
-'L'autore di questa libreria vive in Sicilia' compressed by 28%
-
-It can compress URLS pretty well:
-
-'http://google.com' compressed by 59%
-'http://programming.reddit.com' compressed by 52%
-'http://github.com/antirez/smaz/tree/master' compressed by 46%
-
-USAGE
------
-
-The lib consists of just two functions:
-
- int smaz_compress(char *in, int inlen, char *out, int outlen);
-
-Compress the buffer 'in' of length 'inlen' and put the compressed data into
-'out' of max length 'outlen' bytes. If the output buffer is too short to hold
-the whole compressed string, outlen+1 is returned. Otherwise the length of the
-compressed string (less then or equal to outlen) is returned.
-
- int smaz_decompress(char *in, int inlen, char *out, int outlen);
-
-Decompress the buffer 'in' of length 'inlen' and put the decompressed data into
-'out' of max length 'outlen' bytes. If the output buffer is too short to hold
-the whole decompressed string, outlen+1 is returned. Otherwise the length of the
-compressed string (less then or equal to outlen) is returned. This function will
-not automatically put a nul-term at the end of the string if the original
-compressed string didn't included a nulterm.
-
-
-CREDITS
--------
-
-Small was writte by Salvatore Sanfilippo and is released under the BSD license. Check the COPYING file for more information.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9269ab6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,139 @@
+Smaz
+=========================================
+
+Compression for very small strings
+----------------------------------
+
+Smaz is a simple compression library suitable for compressing very short
+strings. General purpose compression libraries will build the state needed
+for compressing data dynamically, in order to be able to compress every kind
+of data. This is a very good idea in general, but it fails for one specific
+problem: compressing very small strings.
+
+Smaz instead is not good for compressing general purpose data, but can compress
+text by 40-50% in the average case (works better with English), and is able to
+perform a bit of compression for HTML and urls as well. The important point is
+that Smaz is able to compress even strings of two or three bytes!
+
+For example the string "the" is compressed into a single byte.
+
+For comparison, a library like zlib will usually not be able to compress text
+shorter than 100 bytes.
+
+Compression Examples
+--------------------
+
+* 'This is a small string' compressed by 50%
+* 'foobar' compressed by 34%
+* 'the end' compressed by 58%
+* 'not-a-g00d-Exampl333' enlarged by 15%
+* 'Smaz is a simple compression library' compressed by 39%
+* 'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
+* 'this is an example of what works very well with smaz' compressed by 49%
+* '1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%
+
+In general, lowercase English will work very well. It will suck with a lot
+of numbers inside the strings. Other languages are compressed pretty well too.
+The following is Italian, not very similar to English but still compressible
+by smaz:
+
+* 'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33%
+* 'Mi illumino di immenso' compressed by 37%
+* 'L'autore di questa libreria vive in Sicilia' compressed by 28%
+
+It can compress URLs pretty well:
+
+* 'http://google.com' compressed by 59%
+* 'http://programming.reddit.com' compressed by 52%
+* 'http://github.com/antirez/smaz/tree/master' compressed by 46%
+
+Usage
+-----
+
+**Compression:**
+
+The compression function is:
+
+```cpp
+int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen);
+```
+
+This compresses the buffer 'in' of length 'inlen' and puts the compressed data into
+'out' of max length 'outlen' bytes. If the output buffer is too short to hold
+the whole compressed string, outlen+1 is returned. Otherwise the length of the
+compressed string (less than or equal to outlen) is returned.
+
+The first parameter is the lookup trie used for compression. The default one can be generated with:
+
+```cpp
+struct SmazBranch *smaz_build_trie();
+```
+
+Alternatively, you can provide a custom codebook with:
+
+```cpp
+struct SmazBranch *smaz_build_custom_trie(char *codebook[254]);
+```
+
+*Note:* If you are using a custom codebook, be sure not to have any entries exceeding
+11 characters in length.
+
+The original reference implementation of Smaz compression is included for testing
+and benchmarking comparison purposes:
+
+```cpp
+int smaz_compress_ref(char *in, int inlen, char *out, int outlen);
+```
+
+**Decompression:**
+
+To decompress with the default codebook:
+
+```cpp
+int smaz_decompress(char *in, int inlen, char *out, int outlen);
+```
+
+Or if you are using a custom codebook:
+
+```cpp
+int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen);
+```
+
+These decompress the buffer 'in' of length 'inlen' and put the decompressed data into
+'out' of max length 'outlen' bytes. If the output buffer is too short to hold
+the whole decompressed string, outlen+1 is returned. Otherwise the length of the
+decompressed string (less than or equal to outlen) is returned. These functions will
+not automatically put a null terminator at the end of the string if the string
+that was originally compressed didn't include one.
+
+smaz_test
+---------
+
+smaz_test.c contains some simple tests and comparative benchmarks between the reference
+implementation and the trie implementation.
+
+The provided makefile should take care of compilation. Running the tests will take up
+about a gigabyte of RAM, as some tests pre-generate large numbers of strings.
+
+
+Trie speed improvement
+----------------------
+
+These are just some rough numbers generated by my machine.
+
+For very compressible data, the new implementation appears ~2.2x faster than the
+reference implementation.
+
+Basic English strings should see roughly a 2.6x speed improvement.
+
+For random textual strings you can get somewhere around a 4.9x speed increase.
+
+
+Credits
+-------
+
+Smaz was written by Salvatore Sanfilippo and is released under the 3-clause BSD license.
+Check the COPYING file for more information.
+
+Trie-based implementation by Richard Johnson, released under the same BSD license.
+
diff --git a/TODO b/TODO
deleted file mode 100644
index fe6a5ae..0000000
--- a/TODO
+++ /dev/null
@@ -1,4 +0,0 @@
-const-correct the source code
-release the ruby script to build new specialized dictionaries
-play well against currupted input in verbatim 253/254 codes memcpy()
-play with some form of entropy coding like Huffman or range coding
diff --git a/smaz.c b/smaz.c
index aa674c3..9f18681 100644
--- a/smaz.c
+++ b/smaz.c
@@ -1,4 +1,9 @@
+#include <assert.h>
#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+
+#include "smaz.h"
/* Our compression codebook, used for compression */
static char *Smaz_cb[241] = {
@@ -76,7 +81,224 @@ static char *Smaz_rcb[254] = {
"e, ", " it", "whi", " ma", "ge", "x", "e c", "men", ".com"
};
-int smaz_compress(char *in, int inlen, char *out, int outlen) {
+#define SMAZ_END_LETTER 'z'
+
+void smaz_free_trie(struct SmazBranch *t) {
+ /*
+ if (t->children != NULL) {
+ int x = 0;
+ for (x = 0; x < SMAZ_LETTER_COUNT; x++) {
+ if (t->children[x] != NULL) {
+ smaz_free_trie(t->children[x]);
+ }
+ }
+ }
+ if (t->shortcut != NULL) {
+ free(t->shortcut);
+ }
+ */
+ free(t);
+}
+
+
+void smaz_add_to_branch(struct SmazBranch *t, char *remEntry, int value, struct SmazBranch *g_trie, int *g_branch_counter) {
+ int entryLen;
+ entryLen = strlen(remEntry);
+
+ if (t->use_shortcut == 0) {
+ t->shortcut_length = entryLen;
+ memcpy(t->shortcut, remEntry, entryLen);
+ t->value = value;
+ t->use_shortcut = 1;
+ return;
+ }
+
+ if (entryLen == 0 && t->shortcut_length == 0) {
+ t->value = value;
+ return;
+ } else {
+ int smallestLen = entryLen;
+ int x;
+
+ if (smallestLen > t->shortcut_length) {
+ smallestLen = t->shortcut_length;
+ }
+
+ for (x = 0; x < smallestLen && t->shortcut[x] == remEntry[x]; x++) { }
+
+ if (x < t->shortcut_length) {
+ int tkey;
+ struct SmazBranch *newTBranch;
+
+ tkey = (int)t->shortcut[x];
+
+ *g_branch_counter = *g_branch_counter+1;
+ assert(*g_branch_counter < 256);
+ newTBranch = &g_trie[*g_branch_counter];
+ memcpy(
+ newTBranch->children,
+ t->children,
+ SMAZ_LETTER_COUNT * sizeof(struct SmazBranch *)
+ );
+ memset(t->children, 0, SMAZ_LETTER_COUNT * sizeof(struct SmazBranch *));
+
+ newTBranch->value = t->value;
+
+ memcpy(&newTBranch->shortcut[0], &t->shortcut[x+1], (t->shortcut_length - x));
+
+ newTBranch->shortcut_length = strlen(newTBranch->shortcut);
+ newTBranch->use_shortcut = 1;
+
+ t->children[tkey] = newTBranch;
+ t->shortcut[x] = 0;
+ t->shortcut_length = strlen(t->shortcut);
+ t->value = -1;
+ } else {
+ /* the value of t remains */
+ }
+ if (x < entryLen) {
+ /* we can assign the v to a child */
+ int vkey;
+ char *vtail;
+
+ vkey = remEntry[x];
+ vtail = (char *)calloc((entryLen - x + 1), sizeof(char));
+ memcpy(vtail, &remEntry[x+1], (entryLen - x));
+
+ if (t->children[vkey] == NULL) {
+ struct SmazBranch *newVBranch;
+ *g_branch_counter = *g_branch_counter+1;
+ assert(*g_branch_counter < 256);
+ newVBranch = &g_trie[*g_branch_counter];
+ newVBranch->value = -1;
+ t->children[vkey] = newVBranch;
+ }
+ smaz_add_to_branch(t->children[vkey], vtail, value, g_trie, g_branch_counter);
+ free(vtail);
+ } else {
+ /* the value of v now takes up the position */
+ t->value = value;
+ }
+ }
+}
+
+struct SmazBranch *smaz_build_custom_trie(char *codebook[254]) {
+ struct SmazBranch *trie;
+ int x;
+
+ int *g_branch_counter = 0;
+ struct SmazBranch *g_trie;
+
+ g_trie = (struct SmazBranch *)calloc(256, sizeof(struct SmazBranch));
+ g_trie[0].value = -1;
+ g_branch_counter = (int *)calloc(1, sizeof(int));
+ *g_branch_counter = 1;
+
+ for (x = 0; x < 254; x++) {
+ smaz_add_to_branch(&g_trie[0], codebook[x], x, g_trie, g_branch_counter);
+ }
+
+ free(g_branch_counter);
+
+ return g_trie;
+}
+
+struct SmazBranch *smaz_build_trie() {
+ return smaz_build_custom_trie(Smaz_rcb);
+}
+
+int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen) {
+ int verblen = 0, _outlen = outlen;
+ char verb[256], *_out = out;
+
+ while(inlen) {
+ int needed = 0;
+ char *flush = NULL;
+ int length = 0;
+ struct SmazBranch *branch = NULL;
+ int remaining_length = inlen;
+
+ branch = trie;
+ while (remaining_length--) {
+ unsigned char nextChar;
+ struct SmazBranch **children;
+ struct SmazBranch *tmpBranch;
+ char *shortcut;
+ int shortcut_length;
+
+ nextChar = in[length];
+ if (nextChar > SMAZ_END_LETTER) {
+ break;
+ }
+ children = branch->children;
+ if (!(children && children[nextChar])) {
+ break;
+ }
+
+ tmpBranch = children[nextChar];
+ shortcut = tmpBranch->shortcut;
+ shortcut_length = tmpBranch->shortcut_length;
+ length++;
+ if (shortcut) {
+ if (length <= inlen && memcmp(shortcut, in+length, shortcut_length)) {
+ length--;
+ break;
+ }
+ length += shortcut_length;
+ }
+ branch = tmpBranch;
+ }
+ if (branch->value >= 0 && length <= inlen) {
+ /* Match found, prepare a verbatim bytes flush if needed */
+ if (verblen) {
+ needed = (verblen == 1) ? 2 : 2+verblen;
+ flush = out;
+ out += needed;
+ outlen -= needed;
+ }
+ /* Emit the byte */
+ if (outlen <= 0) return _outlen+1;
+ out[0] = branch->value;
+ out++;
+ outlen--;
+ inlen -= length;
+ in += length;
+ goto out;
+ }
+
+ /* Match not found - add the byte to the verbatim buffer */
+ verb[verblen] = in[0];
+ verblen++;
+ inlen--;
+ in++;
+out:
+ /* Prepare a flush if we reached the flush length limit, and there
+ * is not already a pending flush operation. */
+ if (!flush && (verblen == 256 || (verblen > 0 && inlen == 0))) {
+ needed = (verblen == 1) ? 2 : 2+verblen;
+ flush = out;
+ out += needed;
+ outlen -= needed;
+ if (outlen < 0) return _outlen+1;
+ }
+ /* Perform a verbatim flush if needed */
+ if (flush) {
+ if (verblen == 1) {
+ flush[0] = (signed char)254;
+ flush[1] = verb[0];
+ } else {
+ flush[0] = (signed char)255;
+ flush[1] = (signed char)(verblen-1);
+ memcpy(flush+2,verb,verblen);
+ }
+ flush = NULL;
+ verblen = 0;
+ }
+ }
+ return out-_out;
+}
+
+int smaz_compress_ref(char *in, int inlen, char *out, int outlen) {
unsigned int h1,h2,h3=0;
int verblen = 0, _outlen = outlen;
char verb[256], *_out = out;
@@ -105,6 +327,7 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) {
* prepare a verbatim bytes flush if needed */
if (verblen) {
needed = (verblen == 1) ? 2 : 2+verblen;
+ /*printf("Verb good: %d\n", verblen);*/
flush = out;
out += needed;
outlen -= needed;
@@ -112,6 +335,7 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) {
/* Emit the byte */
if (outlen <= 0) return _outlen+1;
out[0] = slot[slot[0]+1];
+ /*printf("Value: %d\n", *(unsigned char *)(&slot[slot[0]+1]));*/
out++;
outlen--;
inlen -= j;
@@ -155,6 +379,9 @@ int smaz_compress(char *in, int inlen, char *out, int outlen) {
}
int smaz_decompress(char *in, int inlen, char *out, int outlen) {
+ return smaz_decompress_custom(Smaz_rcb, in, inlen, out, outlen);
+}
+int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen) {
unsigned char *c = (unsigned char*) in;
char *_out = out;
int _outlen = outlen;
@@ -179,7 +406,7 @@ int smaz_decompress(char *in, int inlen, char *out, int outlen) {
inlen -= 2+len;
} else {
/* Codebook entry */
- char *s = Smaz_rcb[*c];
+ char *s = cb[*c];
int len = strlen(s);
if (outlen < len) return _outlen+1;
diff --git a/smaz.h b/smaz.h
index ce9c35d..8288570 100644
--- a/smaz.h
+++ b/smaz.h
@@ -1,7 +1,23 @@
#ifndef _SMAZ_H
#define _SMAZ_H
-int smaz_compress(char *in, int inlen, char *out, int outlen);
+#define SMAZ_LETTER_COUNT ('z'+1)
+
+struct SmazBranch {
+ struct SmazBranch *children[SMAZ_LETTER_COUNT];
+ char shortcut[12];
+ int use_shortcut;
+ int shortcut_length;
+ int value;
+};
+
+struct SmazBranch *smaz_build_trie();
+struct SmazBranch *smaz_build_custom_trie(char *codebook[254]);
+void smaz_free_trie(struct SmazBranch *t);
+
+int smaz_compress_ref(char *in, int inlen, char *out, int outlen);
+int smaz_compress(struct SmazBranch *trie, char *in, int inlen, char *out, int outlen);
int smaz_decompress(char *in, int inlen, char *out, int outlen);
+int smaz_decompress_custom(char *cb[254], char *in, int inlen, char *out, int outlen);
#endif
diff --git a/smaz_test.c b/smaz_test.c
index 47c02d6..f74fcdc 100644
--- a/smaz_test.c
+++ b/smaz_test.c
@@ -1,48 +1,310 @@
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
+#include <sys/time.h>
#include "smaz.h"
-int main(void) {
- char in[512];
- char out[4096];
- char d[4096];
- int comprlen, decomprlen;
- int j, ranlen;
- int times = 1000000;
- char *strings[] = {
- "This is a small string",
+void hexDump (char *desc, void *addr, int len) {
+ int i;
+ unsigned char buff[17];
+ unsigned char *pc = addr;
+
+ if (desc != NULL) {
+ printf ("%s:\n", desc);
+ }
+
+ for (i = 0; i < len; i++) {
+
+ if ((i % 16) == 0) {
+ if (i != 0) {
+ printf (" %s\n", buff);
+ }
+ printf (" %04x ", i);
+ }
+
+ printf (" %02x", pc[i]);
+
+ if ((pc[i] < 0x20) || (pc[i] > 0x7e)) {
+ buff[i % 16] = '.';
+ } else {
+ buff[i % 16] = pc[i];
+ }
+ buff[(i % 16) + 1] = '\0';
+ }
+
+ while ((i % 16) != 0) {
+ printf (" ");
+ i++;
+ }
+
+ printf (" %s\n", buff);
+}
+
+int g_seed = 0;
+
+int fastrand() {
+ g_seed = (214013 * g_seed + 2531011);
+ return (g_seed >> 16) & 0x7FFF;
+}
+
+char *strings[] = {
+ "ht",
"foobar",
"the end",
+ "nojQfTh",
+ "http://google.com",
+ "try it against urls",
+ "Mi illumino di immenso",
+ "http://programming.reddit.com",
+ "This is a small string",
"not-a-g00d-Exampl333",
+ "/media/hdb1/music/Alben/The Bla",
+ "and now a few italian sentences:",
"Smaz is a simple compression library",
- "Nothing is more difficult, and therefore more precious, than to be able to decide",
- "this is an example of what works very well with smaz",
+ "http://github.com/antirez/smaz/tree/master",
+ "L'autore di questa libreria vive in Sicilia",
"1000 numbers 2000 will 10 20 30 compress very little",
- "and now a few italian sentences:",
+ "this is an example of what works very well with smaz",
"Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura",
- "Mi illumino di immenso",
- "L'autore di questa libreria vive in Sicilia",
- "try it against urls",
- "http://google.com",
- "http://programming.reddit.com",
- "http://github.com/antirez/smaz/tree/master",
- "/media/hdb1/music/Alben/The Bla",
+ "Nothing is more difficult, and therefore more precious, than to be able to decide",
+ "QtZpZuMhlzfgHFEGA.Kja/hsIayllFSAMFDl.fQ/bJdzfzCvxdclaIbzzWyhbOhCj.nydSJSbmPUzhOHYqszMhvIBqqsSluQkxLbcUuRVXmhS.CrCIBPpKXEPbyhLDLJNn.pVGFEdFmKDC VLAk.LWDqLOlmhyvviIzBOBWsWGQpIPJjftiEd updeZIZjBVrOmDPGJmcZZ CziiEeAhtvkUnYdaFuvKGvdmQnmGaZVtWCpaxpVozEWjc/HyGQFMaiMqjzKYmgPGzSxsFPuCjP JcHUinZvLWVPTSarCUUYQmSGGyPYfeXCEunngaxFxPleyZjNtClHCRdYdsxWkiopaZqU.kaINJmZiUmp",
+ "oTnBmdtIaFEFHFpgqkGlYdCtqIXFTKIPfJsdIotaZ/oUGWaKHmBzzMQyKteDKLXHedxalAfHzAQTgesqyLzo/.rjxQzbWZPzUbqdnuceRejfVz/xpDBfCGUUdlLYkSyt.uRv.dQaJEW.bPsJrQWjNBbKLFbdLmauPiCdEVHgXKIZazGSriVrjQs.H.itMHFJDuajeCqOtKZFJdyUtEEqbbj.s.FQkAyXHdjHoQxDWvnFfgBMLXtFKJZvnRMiUfAgMJbH/TsXzMSKdlOHkxAJPWD//QbmuNyQWAHVIevtohUfRbCktvHfSuopjQSTWl/fpV/tNMCCSWOINMGptyRBZNobtdL.KMzKqvnnu.A.jWgOMtLrrHpcCB.GLIREreLBK.BsYABRttLHo/QhDrZNSzJPZQR.nPJEJvHMX/sO/H.tksygrsDlCIzyJMR.O.scMfNcfKufJrbeJYcALDfxRYHKTPLmmUeTe",
NULL
};
- j=0;
+void test_compress_small_out_buff() {
+ char out[4096];
+ struct SmazBranch *trie;
+ int comprlen = 0;
+ /* skip over the first test string that will give us only 1 byte */
+ int j = 1;
+
+ trie = smaz_build_trie();
while(strings[j]) {
- int comprlevel;
+ comprlen = smaz_compress(
+ trie,
+ strings[j],
+ strlen(strings[j]),
+ out,
+ j
+ );
+ if (comprlen != j+1) {
+ printf("Error: Expected return size: %d, got %d\n", j+1, comprlen);
+ exit(1);
+ }
+ j++;
+ }
+
+ smaz_free_trie(trie);
+
+ printf("TEST PASSED :)\n");
+}
+
+void test_null_term() {
+ char comp_out[256];
+ char decomp_out[256];
+ char no_null_str[4] = "test";
+ char null_term_str[] = "test"; /* implicit null here */
+ int comprlen = 0;
+ int decomprlen = 0;
+ struct SmazBranch *trie;
+
+ trie = smaz_build_trie();
+ comprlen = smaz_compress(
+ trie,
+ no_null_str,
+ 4,
+ comp_out,
+ 256
+ );
+ decomprlen = smaz_decompress(
+ comp_out,
+ comprlen,
+ decomp_out,
+ 256
+ );
+ if (decomprlen != 4) {
+ printf("Error: Expected return size: %d, got %d\n", 4, decomprlen);
+ exit(1);
+ }
+ if (decomp_out[3] != 't') {
+ printf(
+ "Error: Incorrect final char on string: %c, expected %c\n",
+ decomp_out[3],
+ 't'
+ );
+ exit(1);
+ }
+ comprlen = smaz_compress(
+ trie,
+ null_term_str,
+ strlen(null_term_str)+1, /* include the null terminator this time. */
+ comp_out,
+ 256
+ );
+ decomprlen = smaz_decompress(
+ comp_out,
+ comprlen,
+ decomp_out,
+ 256
+ );
+ if (decomprlen != 5) {
+ printf("Error: Expected return size: %d, got %d\n", 5, decomprlen);
+ hexDump("out", &decomp_out, decomprlen);
+ exit(1);
+ }
+ if (decomp_out[4] != 0) {
+ printf( "Error: Incorrect final char on string: %c, expected \\0\n",
+ decomp_out[4]
+ );
+ exit(1);
+ }
+
+ smaz_free_trie(trie);
+
+ printf("TEST PASSED :)\n");
+}
+
+void bench_trie_smaz() {
+ FILE *infile;
+ char *in;
+ char *comp_out;
+ char *de_comp_out;
+ long numbytes;
+ int num_loops = 1000;
+
+ infile = fopen("war_of_the_worlds.txt", "r");
+ if (infile == NULL) {
+ printf("Missing war of the worlds text, you can download the text here: http://www.gutenberg.org/ebooks/36 and save it as war_of_the_worlds.txt\n");
+ exit(1);
+ }
+
+ fseek(infile, 0L, SEEK_END);
+ numbytes = ftell(infile);
+ printf("Processing %d bytes, %d times\n", (int)numbytes, num_loops);
+ fseek(infile, 0L, SEEK_SET);
+ in = (char*)calloc(numbytes, sizeof(char));
+ comp_out = (char*)calloc(numbytes, sizeof(char));
+ de_comp_out = (char*)calloc(numbytes, sizeof(char));
+ numbytes = fread(in, sizeof(char), numbytes, infile);
+ fclose(infile);
+
+ {
+ struct timeval t1, t2;
+ int x;
+ struct SmazBranch *trie;
+
+ trie = smaz_build_trie();
+
+ gettimeofday(&t1, NULL);
+ for (x = 0; x < num_loops; x++) {
+ smaz_compress(
+ trie,
+ in,
+ numbytes,
+ comp_out,
+ numbytes
+ );
+ }
+ gettimeofday(&t2, NULL);
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+
+ smaz_free_trie(trie);
+ }
+
+ free(in);
+ free(comp_out);
+ free(de_comp_out);
+}
+
+void bench_old_smaz() {
+ FILE *infile;
+ char *in;
+ char *comp_out;
+ char *de_comp_out;
+ long numbytes;
+ int num_loops = 1000;
+
+ infile = fopen("war_of_the_worlds.txt", "r");
+ if (infile == NULL) {
+ printf("Missing war of the worlds text, you can download the text here: http://www.gutenberg.org/ebooks/36 and save it as war_of_the_worlds.txt\n");
+ exit(1);
+ }
+
+ fseek(infile, 0L, SEEK_END);
+ numbytes = ftell(infile);
+ printf("Processing %d bytes, %d times\n", (int)numbytes, num_loops);
+ fseek(infile, 0L, SEEK_SET);
+ in = (char*)calloc(numbytes, sizeof(char));
+ comp_out = (char*)calloc(numbytes, sizeof(char));
+ de_comp_out = (char*)calloc(numbytes, sizeof(char));
+ numbytes = fread(in, sizeof(char), numbytes, infile);
+ fclose(infile);
+
+ {
+ struct timeval t1, t2;
+ int x;
+
+ gettimeofday(&t1, NULL);
+ for (x = 0; x < num_loops; x++) {
+ smaz_compress_ref(
+ in,
+ numbytes,
+ comp_out,
+ numbytes
+ );
+ }
+ gettimeofday(&t2, NULL);
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+ }
+
+ free(in);
+ free(comp_out);
+ free(de_comp_out);
+}
+
+void test_strings() {
+ char out[4096];
+ char out_good[4096];
+ char d[4096];
+ int comprlen, decomprlen;
+ struct SmazBranch *trie;
+ int j = 0;
+
+ trie = smaz_build_trie();
+
+ while(strings[j]) {
+ int comprlevel, comprlen_good;
+
+ comprlen = smaz_compress(
+ trie,
+ strings[j],
+ strlen(strings[j]),
+ out,
+ sizeof(out)
+ );
+
+ comprlen_good = smaz_compress_ref(
+ strings[j],
+ strlen(strings[j]),
+ out_good,
+ sizeof(out_good)
+ );
- comprlen = smaz_compress(strings[j],strlen(strings[j]),out,sizeof(out));
comprlevel = 100-((100*comprlen)/strlen(strings[j]));
decomprlen = smaz_decompress(out,comprlen,d,sizeof(d));
- if (strlen(strings[j]) != (unsigned)decomprlen ||
+
+ if (comprlen != comprlen_good ||
+ strlen(strings[j]) != (unsigned)decomprlen ||
memcmp(strings[j],d,decomprlen))
{
printf("BUG: error compressing '%s'\n", strings[j]);
+ hexDump("in", strings[j], strlen(strings[j]));
+ hexDump("out bad", &out, comprlen);
+ hexDump("out good", &out_good, comprlen_good);
exit(1);
}
if (comprlevel < 0) {
@@ -52,26 +314,262 @@ int main(void) {
}
j++;
}
+
+ smaz_free_trie(trie);
+
+ printf("TEST PASSED :)\n");
+}
+
+void test_random() {
+ char in[512];
+ char out[4096];
+ char d[4096];
+ int comprlen, decomprlen;
+ int j, ranlen = 0;
+ int times = 1000000;
+ struct SmazBranch *trie;
+
+ g_seed = 0;
+ trie = smaz_build_trie();
+
printf("Encrypting and decrypting %d test strings...\n", times);
while(times--) {
char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. ";
- ranlen = random() % 512;
+ ranlen = fastrand() % 512;
+ /*printf("doing %d\n", times);*/
for (j = 0; j < ranlen; j++) {
if (times & 1)
- in[j] = charset[random() % (sizeof(charset)-1)];
+ in[j] = charset[fastrand() % (sizeof(charset)-1)];
else
- in[j] = (char)(random() & 0xff);
+ in[j] = (char)(fastrand() & 0xff);
}
- comprlen = smaz_compress(in,ranlen,out,sizeof(out));
- decomprlen = smaz_decompress(out,comprlen,d,sizeof(out));
+ comprlen = smaz_compress(trie, in,ranlen,out,sizeof(out));
+ /*comprlen = smaz_compress_ref(in,ranlen,out,sizeof(out));*/
+ decomprlen = smaz_decompress(out,comprlen,d,sizeof(out));
if (ranlen != decomprlen || memcmp(in,d,ranlen)) {
- printf("Bug! TEST NOT PASSED\n");
+ printf("Bug! TEST NOT PASSED: %d\n", 1000000-times);
+ hexDump("in", &in, ranlen);
+ hexDump("out bad", &out, comprlen);
+ comprlen = smaz_compress_ref(in,ranlen,out,sizeof(out));
+ hexDump("out good", &out, comprlen);
exit(1);
}
- /* printf("%d -> %d\n", comprlen, decomprlen); */
}
+
+ smaz_free_trie(trie);
+
printf("TEST PASSED :)\n");
+}
+
+void bench_random_old_smaz() {
+ char in[512];
+ char out[4096];
+ int j, ranlen = 0;
+ int times = 1000000;
+ struct timeval t1, t2;
+
+ g_seed = 0;
+
+ printf("Compressing %d test strings...\n", times);
+ gettimeofday(&t1, NULL);
+
+ while(times--) {
+ char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. ";
+ ranlen = fastrand() % 512;
+ /*printf("doing %d\n", times);*/
+
+ for (j = 0; j < ranlen; j++) {
+ if (times & 1)
+ in[j] = charset[fastrand() % (sizeof(charset)-1)];
+ else
+ in[j] = (char)(fastrand() & 0xff);
+ }
+ smaz_compress_ref(in,ranlen,out,sizeof(out));
+ }
+ gettimeofday(&t2, NULL);
+
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+}
+
+void bench_random_trie() {
+ char in[512];
+ char out[4096];
+ int j, ranlen = 0;
+ int times = 1000000;
+ struct SmazBranch *trie;
+ struct timeval t1, t2;
+
+ g_seed = 0;
+
+ trie = smaz_build_trie();
+
+ printf("Compressing %d test strings...\n", times);
+ gettimeofday(&t1, NULL);
+
+ while(times--) {
+ char charset[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz/. ";
+ ranlen = fastrand() % 512;
+ /*printf("doing %d\n", times);*/
+
+ for (j = 0; j < ranlen; j++) {
+ if (times & 1)
+ in[j] = charset[fastrand() % (sizeof(charset)-1)];
+ else
+ in[j] = (char)(fastrand() & 0xff);
+ }
+ smaz_compress(trie, in,ranlen,out,sizeof(out));
+ }
+ gettimeofday(&t2, NULL);
+
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+
+ smaz_free_trie(trie);
+}
+
+/* Reverse compression codebook, used for decompression */
+static char *Smaz_rcb[254] = {
+" ", "the", "e", "t", "a", "of", "o", "and", "i", "n", "s", "e ", "r", " th",
+" t", "in", "he", "th", "h", "he ", "to", "\r\n", "l", "s ", "d", " a", "an",
+"er", "c", " o", "d ", "on", " of", "re", "of ", "t ", ", ", "is", "u", "at",
+" ", "n ", "or", "which", "f", "m", "as", "it", "that", "\n", "was", "en",
+" ", " w", "es", " an", " i", "\r", "f ", "g", "p", "nd", " s", "nd ", "ed ",
+"w", "ed", "http://", "for", "te", "ing", "y ", "The", " c", "ti", "r ", "his",
+"st", " in", "ar", "nt", ",", " to", "y", "ng", " h", "with", "le", "al", "to ",
+"b", "ou", "be", "were", " b", "se", "o ", "ent", "ha", "ng ", "their", "\"",
+"hi", "from", " f", "in ", "de", "ion", "me", "v", ".", "ve", "all", "re ",
+"ri", "ro", "is ", "co", "f t", "are", "ea", ". ", "her", " m", "er ", " p",
+"es ", "by", "they", "di", "ra", "ic", "not", "s, ", "d t", "at ", "ce", "la",
+"h ", "ne", "as ", "tio", "on ", "n t", "io", "we", " a ", "om", ", a", "s o",
+"ur", "li", "ll", "ch", "had", "this", "e t", "g ", "e\r\n", " wh", "ere",
+" co", "e o", "a ", "us", " d", "ss", "\n\r\n", "\r\n\r", "=\"", " be", " e",
+"s a", "ma", "one", "t t", "or ", "but", "el", "so", "l ", "e s", "s,", "no",
+"ter", " wa", "iv", "ho", "e a", " r", "hat", "s t", "ns", "ch ", "wh", "tr",
+"ut", "/", "have", "ly ", "ta", " ha", " on", "tha", "-", " l", "ati", "en ",
+"pe", " re", "there", "ass", "si", " fo", "wa", "ec", "our", "who", "its", "z",
+"fo", "rs", ">", "ot", "un", "<", "im", "th ", "nc", "ate", "><", "ver", "ad",
+" we", "ly", "ee", " n", "id", " cl", "ac", "il", "", "rt", " wi", "div",
+"e, ", " it", "whi", " ma", "ge", "x", "e c", "men", ".com"
+};
+
+void bench_random_compressible() {
+ char **in;
+ char out[2048];
+ int x = 0;
+ int times = 500000;
+ struct SmazBranch *trie;
+ struct timeval t1, t2;
+
+ g_seed = 0;
+ trie = smaz_build_trie();
+
+ printf("Creating compressible strings\n");
+ in = (char **)calloc(times+1, sizeof(char *));
+ for (x = 0; x < times; x++) {
+ int charlen = 0;
+ in[x] = (char *)calloc(1024, sizeof(char));
+ /* 7 being the longest possible string */
+ while (charlen < (1024 - 7)) {
+ char *val = Smaz_rcb[fastrand() % 254];
+ memcpy(in[x]+charlen, val, strlen(val));
+ charlen += strlen(val);
+ }
+ }
+
+ printf("Compressing %d test strings...\n", times);
+ gettimeofday(&t1, NULL);
+
+ /* Iterate with x so that 'times' stays intact for the free loop below. */
+ for (x = times - 1; x >= 0; x--) {
+ smaz_compress(
+ trie,
+ in[x],
+ strlen(in[x]),
+ out,
+ sizeof(out)
+ );
+ }
+ gettimeofday(&t2, NULL);
+
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+
+ for (x = 0; x < times; x++) {
+ free(in[x]);
+ }
+ free(in);
+ smaz_free_trie(trie);
+}
+
+void bench_random_compressible_old() {
+ char **in;
+ char out[2048];
+ int x = 0;
+ int times = 500000;
+ struct timeval t1, t2;
+
+ g_seed = 0;
+
+ printf("Creating compressible strings\n");
+ in = (char **)calloc(times+1, sizeof(char *));
+ for (x = 0; x < times; x++) {
+ int charlen = 0;
+ in[x] = (char *)calloc(1024, sizeof(char));
+ /* 7 being the longest possible string */
+ while (charlen < (1024 - 7)) {
+ char *val = Smaz_rcb[fastrand() % 254];
+ memcpy(in[x]+charlen, val, strlen(val));
+ charlen += strlen(val);
+ }
+ }
+
+ printf("Compressing %d test strings...\n", times);
+ gettimeofday(&t1, NULL);
+
+ /* Iterate with x so that 'times' stays intact for the free loop below. */
+ for (x = times - 1; x >= 0; x--) {
+ smaz_compress_ref(
+ in[x],
+ strlen(in[x]),
+ out,
+ sizeof(out)
+ );
+ }
+ gettimeofday(&t2, NULL);
+
+ printf("time = %u.%06u\n", (unsigned int)t1.tv_sec, (unsigned int)t1.tv_usec);
+ printf("time = %u.%06u\n", (unsigned int)t2.tv_sec, (unsigned int)t2.tv_usec);
+
+ for (x = 0; x < times; x++) {
+ free(in[x]);
+ }
+ free(in);
+}
+
+int main(void) {
+
+ printf("\n\nTesting result when using too small a buffer:\n-------------\n");
+ test_compress_small_out_buff();
+ printf("\n\nTesting null terminators stay there:\n-------------\n");
+ test_null_term();
+ printf("\n\nTesting a bunch of predefined strings:\n-------------\n");
+ test_strings();
+ printf("\n\nTesting a bunch of randomly generated strings:\n-------------\n");
+ test_random();
+ printf("\n\nBenchmarking old smaz with very compressible data:\n-------------\n");
+ bench_random_compressible_old();
+ printf("\n\nBenchmarking new smaz with very compressible data:\n-------------\n");
+ bench_random_compressible();
+ printf("\n\nBenchmarking old smaz on war of the worlds:\n-------------\n");
+ bench_old_smaz();
+ printf("\n\nBenchmarking new smaz on war of the worlds:\n-------------\n");
+ bench_trie_smaz();
+ printf("\n\nBenchmarking old smaz on random data:\n-------------\n");
+ bench_random_old_smaz();
+ printf("\n\nBenchmarking new smaz on random data:\n-------------\n");
+ bench_random_trie();
+ printf("\n\nDone.\n");
+
return 0;
}