Abstract:
We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression time is close to that of Compress and approximately half that of Gzip, and decompression time is lower than that of Gzip and one third of that of Compress. We present three algorithms to search the compressed text. They allow a large number of variations over the basic word and phrase search capability, such as sets of characters, arbitrary regular expressions, and approximate matching. Separators and stopwords can be discarded at search time without significantly increasing the cost. When searching for simple words, experiments show that running our algorithms on a compressed text is twice as fast as running the best existing software on the uncompressed version of the same text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique on inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed at all times, decompressing only for display purposes.

1. INTRODUCTION

In this article we present an efficient compression technique for natural language texts that allows fast and flexible searching of words and phrases.
To search for simple words and phrases, the patterns are compressed, and the search proceeds without any decoding of the compressed text. Searching for words and phrases that match complex expressions and/or allow errors can be done on the compressed text at almost the same cost as simple searches. The reduced size of the compressed text makes the overall searching time much smaller than on plain uncompressed text. The compression and decompression speeds and the amount of compression achieved are very good when compared to well-known algorithms in the literature [Ziv and Lempel 1977; 1978]. The compression scheme presented in this article is a variant of the word-based Huffman code [Bentley et al. 1986; Moffat 1989; Witten et al. 1999]. The Huffman codeword assigned to each text word is a sequence of whole bytes, and the Huffman tree has degree either 128 (which we call "tagged Huffman code") or 256 (which we call "plain Huffman code"), instead of 2. In tagged Huffman coding each byte uses seven bits for the Huffman code and one bit to signal the beginning of a codeword. As we show later, using bytes instead of bits does not significantly degrade the amount of compression. In practice, byte processing is much faster than bit processing because bit shifts and masking operations are not necessary at compression, decompression, and search times. Decompression can start at any point in the compressed file. In particular, the compression scheme allows fast decompression of fragments that contain the search results, which is an important feature in information retrieval systems. Notice that our compression scheme is designed for large natural language texts (at least 1MB) to achieve an attractive amount of compression. Also, the search algorithms are word-oriented, as the pattern is a sequence of elements to be matched to a sequence of text words.
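To illustrate the tagging idea, the sketch below (illustrative only, not the paper's implementation; the function names are invented) packs 7-bit symbols into bytes whose high bit is set on the first byte of a codeword and clear on continuation bytes. This flag bit is what lets decompression resynchronize at an arbitrary byte offset:

```python
def pack_codeword(symbols):
    """Pack a sequence of 7-bit symbols into tagged bytes.

    The high bit of each byte is 1 for the first byte of a codeword
    and 0 for continuation bytes, mimicking tagged Huffman coding.
    """
    out = bytearray()
    for i, s in enumerate(symbols):
        assert 0 <= s < 128  # only 7 bits of payload per byte
        flag = 0x80 if i == 0 else 0x00
        out.append(flag | s)
    return bytes(out)

def sync_to_codeword(data, pos):
    """Advance pos to the start of the next codeword (flag bit set),
    showing that decoding can resume from any byte offset."""
    while pos < len(data) and not (data[pos] & 0x80):
        pos += 1
    return pos
```

Because the flag bit unambiguously marks codeword boundaries, a byte-level string matcher can never report a match that starts inside a codeword, which is what makes direct search on the compressed text safe.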
Each pattern element can be a simple word or a complex expression, and the search can be exact or can allow errors in the match. In this context, we present three search algorithms.
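As a minimal sketch of exact word and phrase search without decoding (the codebook and its byte codewords below are invented for illustration; a real system would derive them from the word-based Huffman code), the pattern is compressed with the same codebook as the text and then matched byte-for-byte against the compressed text:

```python
# Hypothetical codebook: each word maps to a tagged byte codeword
# (high bit set on the first byte, as in tagged Huffman coding).
CODEBOOK = {
    "fast": b"\x81",
    "search": b"\x82",
    "on": b"\x83",
    "compressed": b"\x84\x01",
    "text": b"\x85",
}

def compress(words):
    """Concatenate the codewords of the given words."""
    return b"".join(CODEBOOK[w] for w in words)

compressed_text = compress("fast search on compressed text".split())
compressed_pattern = compress(["compressed", "text"])

# Any sequential byte-matching algorithm works; bytes.find stands in
# for Boyer-Moore or KMP here. The tag bit rules out false matches
# that would otherwise start in the middle of a codeword.
position = compressed_text.find(compressed_pattern)
```

The same byte-level matching machinery extends to the complex and approximate searches described later, since those too reduce to scanning the compressed byte stream.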