loading...
Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results
Krakow, Poland September 04-September 08
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DEXA.2006.8017th International Conference on Data ...
 This Article 
 
PDF
HTML
 
 Share 
   
 Bibliographic References 
   
 Add to: 
 
Digg
Furl
Spurl
Blink
Simpy
Google
Del.icio.us
Y!MyWeb
 
 Search 
   
Witold Litwin, Dauphine University, France
Riad Mokadem, Dauphine University, France
Philippe Rigaux, Dauphine University, France
Schwarz Thomas, Santa Clara University, California
We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for the records stored once and searched many times in a database or a file, especially those organized in a Scalable Distributed Data Structure, (SDDS), over a grid or a structured P2P net. We presume that the records are encoded into their cumulative algebraic signatures, providing incidental confidentiality of stored data. The search starts with pre-processing the pattern, calculating the logarithmic algebraic signature (LAS) of the pattern and the LASs of every n-gram in it. The value of n \geqslant 1 is a parameter that one may tune. The search attempts to match the LASs of n-grams in the pattern towards dynamically calculated LASs, sampled over n-grams in the records. A mismatch generates a shift of up to K-n symbols towards next sample, where K is the pattern length. The whole process is parallel over the SDDS servers and does not require any local decoding. For an M-symbol long record, the unsuccessful search, measured as number of match attempts, costs O ((M-K) / (Kn+ 1)). The 2-grams should typically suffice, leading to O ((M-K) / (K-1)). We show that the algorithm particularly efficient for larger strings and records, i.e., with e-documents or DNA data. Preliminary results show then the n-gram search about (K - n + 1) faster than our previous algorithms and among the fastest known, e.g., probably often faster than Boyer-Moore. % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaeyyzImlaaa!37B3! \[ \geqslant \]
Index Terms:
SDDS, grid, structured P2P, scalable distributed pattern matching, algebraic signatures.
Citation:
Witold Litwin, Riad Mokadem, Philippe Rigaux, Schwarz Thomas, "Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results," dexa, pp.251-255, 17th International Conference on Database and Expert Systems Applications (DEXA'06), 2006
Usage of this product signifies your acceptance of the Terms of Use.


Suggestions