compression - Trying to figure out a basic pattern finding algorithm -


long story short, have data need find patterns in. data (each character represents immutable chunk): dabababacdacdacdab

i want able break data following kinds of chunks:

d (repeated 1x) ab (repeated 3x) (1x) cda (3x) b (1x)

i'm familiar basic run length encoding, don't know how when length of "thing" may variable (in example, cda 3 chunks of data, d 1 chunk). making sense? appreciated.

the main difficulty here "greediness" of algorithm , ambiguity can produce. given example, rule tell program not represent string as:

d x 1 ababab x 1 <<<<<----- here's problem cda x 1 b x 1

or worse yet: d x 1 abababcdab x 1

you see mean. you need establish set of rules algo follow, , code kinda write itself. t gain insight on this, might try looking @ of regular expression parsing code in grep.c, although may more advanced need.

as start, consider algorithm that: 1. limits how far ahead scan (i.e. set maximum substring length) 2. favors longer substrings, subject #1

for example, take "aaaaaaaaaaaaaaaa" (which 16 a's). be: x 16 aa x 8 aaaa x 4 aaaaaaaa x 2 aaaaaaaaaaaaaaaa x 1

if set maximum match length of 4, , favor longest match, answer unambiguous: aaaa x 4. can write algorithm.


Comments

Popular posts from this blog

linux - Using a Cron Job to check if my mod_wsgi / apache server is running and restart -

actionscript 3 - TweenLite does not work with object -

jQuery Ajax Render Fragments OR Whole Page -