As outlined in the original post, I wrote a text compression routine using huffman encoding on words which achieved a higher compression ratio than gzip while being more usable for decompression of segments. I’ve finally packaged the code together under the MIT license. I ask that those who find it useful credit me with the library and submit back changes (though neither is required under the license). Anyway, on to the details of the library:
- High compression ratios for text content.
- Fast decompression of segments of the whole without decompressing the entire content.
- Provides a basis for doing searches while bypassing the string reconstruction.
Some design notes:
- The compression technique requires some metadata storage as outlined in the comments. One is an NSData object and the others are integer arrays. They are not all raw data in order to provide the user with the flexibility in method of storage.
- The compressed words list is not serialized using a property list because it doesn’t provide as high of a compression ratio.
- The decompression doesn’t perform validity checks on its input data. This is done to enhance the speed of decompression.
Note: I have been continuing work on this library in my own uses and have not been updating it here. Given the little interest here, I’ll only update it upon request. If you want the latest version contact me (leaving a comment here will work), and I’ll update it. The code for the text compression can be downloaded here: Text Compression.zip