compression - Trying to figure out a basic pattern finding algorithm -
long story short, have data need find patterns in. data (each character represents immutable chunk): dabababacdacdacdab
i want able break data following kinds of chunks:
d (repeated 1x) ab (repeated 3x) (1x) cda (3x) b (1x)
i'm familiar basic run length encoding, don't know how when length of "thing" may variable (in example, cda 3 chunks of data, d 1 chunk). making sense? appreciated.
the main difficulty here "greediness" of algorithm , ambiguity can produce. given example, rule tell program not represent string as:
d x 1 ababab x 1 <<<<<----- here's problem cda x 1 b x 1
or worse yet: d x 1 abababcdab x 1
you see mean. you need establish set of rules algo follow, , code kinda write itself. t gain insight on this, might try looking @ of regular expression parsing code in grep.c, although may more advanced need.
as start, consider algorithm that: 1. limits how far ahead scan (i.e. set maximum substring length) 2. favors longer substrings, subject #1
for example, take "aaaaaaaaaaaaaaaa" (which 16 a's). be: x 16 aa x 8 aaaa x 4 aaaaaaaa x 2 aaaaaaaaaaaaaaaa x 1
if set maximum match length of 4, , favor longest match, answer unambiguous: aaaa x 4. can write algorithm.
Comments
Post a Comment