You are here
COMPRESSED PATTERN MATCHING FOR TEXT AND IMAGES
 Date Issued:
 2005
 Abstract/Description:
 The amount of information that we are dealing with today is being generated at an everincreasing rate. On one hand, data compression is needed to efficiently store, organize the data and transport the data over the limitedbandwidth network. On the other hand, efficient information retrieval is needed to speedily find the relevant information from this huge mass of data using available resources. The compressed pattern matching problem can be stated as: given the compressed format of a text or an image and a pattern string or a pattern image, report the occurrence(s) of the pattern in the text or image with minimal (or no) decompression. The main advantages of compressed pattern matching versus the naïve decompressthensearch approach are: First, reduced storage cost. Since there is no need to decompress the data or there is only minimal decompression required, the disk space and the memory cost is reduced. Second, less search time. Since the size of the compressed data is smaller than that of the original data, a searching performed on the compressed data will result in a shorter search time. The challenge of efficient compressed pattern matching can be met from two inseparable aspects: First, to utilize effectively the full potential of compression for the information retrieval systems, there is a need to develop searchaware compression algorithms. Second, for data that is compressed using a particular compression technique, regardless whether the compression is searchaware or not, we need to develop efficient searching techniques. This means that techniques must be developed to search the compressed data with no or minimal decompression and with not too much extra cost. Compressed pattern matching algorithms can be categorized as either for text compression or for image compression. Although compressed pattern matching for text compression has been studied for a few years and many publications are available in the literature, there is still room to improve the efficiency in terms of both compression and searching. None of the search engines available today make explicit use of compressed pattern matching. Compressed pattern matching for image compression, on the other hand, has been relatively unexplored. However, it is getting more attention because lossless compression has become more important for the everincreasing large amount of medical images, satellite images and aerospace photos, which requires the data to be losslessly stored. Developing efficient information retrieval techniques from the losslessly compressed data is therefore a fundamental research challenge. In this dissertation, we have studied compressed pattern matching problem for both text and images. We present a series of novel compressed pattern matching algorithms, which are divided into two major parts. The first major work is done for the popular LZW compression algorithm. The second major work is done for the current lossless image compression standard JPEGLS. Specifically, our contributions from the first major work are: 1. We have developed an "almostoptimal" compressed pattern matching algorithm that reports all pattern occurrences. An earlier "almostoptimal" algorithm reported in the literature is only capable of detecting the first occurrence of the pattern and the practical performance of the algorithm is not clear. We have implemented our algorithm and provide extensive experimental results measuring the speed of our algorithm. We also developed a faster implementation for socalled "simple patterns". The simple patterns are patterns that no unique symbol appears more than once. The algorithm takes advantage of this property and runs in optimal time. 2. We have developed a novel compressed pattern matching algorithm for multiple patterns using the AhoCorasick algorithm. The algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total size of all patterns, t is the size of the LZW trie and r is the number of occurrences of the patterns. The algorithm is particularly efficient when being applied on archival search if the archives are compressed with a common LZW trie. All the above algorithms have been implemented and extensive experiments have been conducted to test the performance of our algorithms and to compare with the best existing algorithms. The experimental results show that our compressed pattern matching algorithm for multiple patterns is competitive among the best algorithms and is practically the fastest among all approaches when the number of patterns is not very large. Therefore, our algorithm is preferable for general string matching applications. LZW is one of the most efficient and popular compression algorithms used extensively and both of our algorithms require no modification on the compression algorithm. Our work, therefore, has great economical and market potential Our contributions from the second major work are: 1 We have developed a new global context variation of the JPEGLS compression algorithm and the corresponding compressed pattern matching algorithm. Comparing to the original JPEGLS, the global context variation is searchaware and has faster encoding and decoding speeds. The searching algorithm based on the globalcontext variation requires partial decompression of the compressed image. The experimental results show that it improves the search speed by about 30% comparing to the decompressthensearch approach. Based on our best knowledge, this is the first twodimensional compressed pattern matching work for the JPEGLS standard. 2 We have developed a twopass variation of the JPEGLS algorithm and the corresponding compressed pattern matching algorithm. The twopass variation achieves searchawareness through a common compression technique called semistatic dictionary. Comparing to the original algorithm, the compression of the new algorithm is equally well but the encoding takes slightly longer. The searching algorithm based on the twopass variation requires no decompression at all and therefore works in the fully compressed domain. It runs in time O(nc+mc+nm+m^2) with extra space O(n+m+mc), where n is the number of columns of the image, m is the number of rows and columns of the pattern, nc is the compressed image size and mc is the compressed pattern size. The algorithm is the first known twodimensional algorithm that works in the fully compressed domain.
Title:  COMPRESSED PATTERN MATCHING FOR TEXT AND IMAGES. 
21 views
9 downloads 

Name(s): 
Tao, Tao, Author Mukherjee, Amar, Committee Chair University of Central Florida, Degree Grantor 

Type of Resource:  text  
Date Issued:  2005  
Publisher:  University of Central Florida  
Language(s):  English  
Abstract/Description:  The amount of information that we are dealing with today is being generated at an everincreasing rate. On one hand, data compression is needed to efficiently store, organize the data and transport the data over the limitedbandwidth network. On the other hand, efficient information retrieval is needed to speedily find the relevant information from this huge mass of data using available resources. The compressed pattern matching problem can be stated as: given the compressed format of a text or an image and a pattern string or a pattern image, report the occurrence(s) of the pattern in the text or image with minimal (or no) decompression. The main advantages of compressed pattern matching versus the naïve decompressthensearch approach are: First, reduced storage cost. Since there is no need to decompress the data or there is only minimal decompression required, the disk space and the memory cost is reduced. Second, less search time. Since the size of the compressed data is smaller than that of the original data, a searching performed on the compressed data will result in a shorter search time. The challenge of efficient compressed pattern matching can be met from two inseparable aspects: First, to utilize effectively the full potential of compression for the information retrieval systems, there is a need to develop searchaware compression algorithms. Second, for data that is compressed using a particular compression technique, regardless whether the compression is searchaware or not, we need to develop efficient searching techniques. This means that techniques must be developed to search the compressed data with no or minimal decompression and with not too much extra cost. Compressed pattern matching algorithms can be categorized as either for text compression or for image compression. Although compressed pattern matching for text compression has been studied for a few years and many publications are available in the literature, there is still room to improve the efficiency in terms of both compression and searching. None of the search engines available today make explicit use of compressed pattern matching. Compressed pattern matching for image compression, on the other hand, has been relatively unexplored. However, it is getting more attention because lossless compression has become more important for the everincreasing large amount of medical images, satellite images and aerospace photos, which requires the data to be losslessly stored. Developing efficient information retrieval techniques from the losslessly compressed data is therefore a fundamental research challenge. In this dissertation, we have studied compressed pattern matching problem for both text and images. We present a series of novel compressed pattern matching algorithms, which are divided into two major parts. The first major work is done for the popular LZW compression algorithm. The second major work is done for the current lossless image compression standard JPEGLS. Specifically, our contributions from the first major work are: 1. We have developed an "almostoptimal" compressed pattern matching algorithm that reports all pattern occurrences. An earlier "almostoptimal" algorithm reported in the literature is only capable of detecting the first occurrence of the pattern and the practical performance of the algorithm is not clear. We have implemented our algorithm and provide extensive experimental results measuring the speed of our algorithm. We also developed a faster implementation for socalled "simple patterns". The simple patterns are patterns that no unique symbol appears more than once. The algorithm takes advantage of this property and runs in optimal time. 2. We have developed a novel compressed pattern matching algorithm for multiple patterns using the AhoCorasick algorithm. The algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total size of all patterns, t is the size of the LZW trie and r is the number of occurrences of the patterns. The algorithm is particularly efficient when being applied on archival search if the archives are compressed with a common LZW trie. All the above algorithms have been implemented and extensive experiments have been conducted to test the performance of our algorithms and to compare with the best existing algorithms. The experimental results show that our compressed pattern matching algorithm for multiple patterns is competitive among the best algorithms and is practically the fastest among all approaches when the number of patterns is not very large. Therefore, our algorithm is preferable for general string matching applications. LZW is one of the most efficient and popular compression algorithms used extensively and both of our algorithms require no modification on the compression algorithm. Our work, therefore, has great economical and market potential Our contributions from the second major work are: 1 We have developed a new global context variation of the JPEGLS compression algorithm and the corresponding compressed pattern matching algorithm. Comparing to the original JPEGLS, the global context variation is searchaware and has faster encoding and decoding speeds. The searching algorithm based on the globalcontext variation requires partial decompression of the compressed image. The experimental results show that it improves the search speed by about 30% comparing to the decompressthensearch approach. Based on our best knowledge, this is the first twodimensional compressed pattern matching work for the JPEGLS standard. 2 We have developed a twopass variation of the JPEGLS algorithm and the corresponding compressed pattern matching algorithm. The twopass variation achieves searchawareness through a common compression technique called semistatic dictionary. Comparing to the original algorithm, the compression of the new algorithm is equally well but the encoding takes slightly longer. The searching algorithm based on the twopass variation requires no decompression at all and therefore works in the fully compressed domain. It runs in time O(nc+mc+nm+m^2) with extra space O(n+m+mc), where n is the number of columns of the image, m is the number of rows and columns of the pattern, nc is the compressed image size and mc is the compressed pattern size. The algorithm is the first known twodimensional algorithm that works in the fully compressed domain.  
Identifier:  CFE0000471 (IID), ucf:46366 (fedora)  
Note(s): 
20050501 Ph.D. Engineering and Computer Science, School of Computer Science Doctorate This record was generated from author submitted information. 

Subject(s): 
compression pattern matching JPEGLS LZW information retrieval compressed pattern matching 

Persistent Link to This Record:  http://purl.flvc.org/ucf/fd/CFE0000471  
Restrictions on Access:  public  
Host Institution:  UCF 