hola all
i have checked the papers by dr Gonnet et al, and want
to clear up a number of things re: the algorithms that
monsieur champs rightly draws our attention to. i
believe these points should attenuate somewhat the enthusiasm for the dream, idealist tho i am.
1) the speedups pertaining to regular expressions achieved by Gonnet et al distinguish
static
from
dynamic text. this is particularly important for us, as the speedups attained are attained because the text is 'preprocessed', or indexed, into a patricia trie, prior to regex matching.
these speedups can be significant if the text you are using is (like a genome) static. if you are dealing with dynamic text, i assume any speedups you might have achieved lost in the frequent reindexing that would be required for the methods of gonnet et al to work. gonnet's example is that of matching against the OED; think to yourself, how many of my regexen are used against this corpus. if the answer is yes, then vote for the new dream by all means. if your answers are more varied, you may prefer to read... point 2:
2) regardless of the time aspect of the problem in 1) the patricia tries used take up significant amounts of memory. i quote from Gonnet: 'for some applications, patricia tries still need too much space'. Do you want to build this unpredictability into perl's robust, pretty fast, hand-tailored regex engine? i would rather not, even if someone else does all the hard work ;-).and also...
3) the paper explicitly states that ' finding an algorithm with logarithmic search time for any RE query is still an open problem'. so was
monsiuer champs's friend was wrong? undoubtedly not - for the special text (singular) he considered. the metrics to decide whether *your* text will also do this are... alas, not there, at the mo. to the best of my knowledge. the worst case for the algorithm based on searching a patricia trie (i distinguish this from text) is linear. the algorithm performs less well in the presence of repeats - interesting, given the prevalence of repeats in genomic data. and <gasp>, lastly...
4) i think it's worth distinguishing the above from approximate matching techniques, some of which are, to the best of my knowledge, *not* very fast (esp. the dynamic prog ramming based ones, which give you the best results imho).
nb: i think it's great that the subject above was brought up, it's interesting & thought provoking. i find myself
compelled to be conservative with what i know & love, however. match that! ;-)
...wufnik
-- in the world of the mules there are no rules --
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.