The main problem I have with these solutions is that the search are carried out serially. And in the the earlier thread (heh, I can't find it either, PTAV doesn't work and I've pounded on Super Search too long) on the subject, someone pointed out the dangers of replacement ordering. Mapping woman to girl and man to boy just might leave you with woboy.

update: found the original thread: multiple (different) substitutions on same $foo, thanks to jcwren fixing a bug in PTAV.

There are algorithms that perform single-pass (modulo setup) searching for multiple targets in a string. They can provide dramatic gains when you're matching many patterns, and they solve the ordering problem above. I remember Aho-Corasick and Knuth-Morris-Pratt from school, and a search on the Web turned up one I hadn't heard of, namely Commentz-Walter. (Google for these terms, with additional terms, including pattern string multiple search).

I think Sedgewick covers the first two algorithms (but my copy's at work). Otherwise a reference implementation appears to be available via ftp here: http://compilers.iecc.com/comparch/article/94-06-032.

From what I recall, these algorithms only search, they don't replace. As you scan the string, you copy over the unmatched runs to the result string, and each time you hit a match you figure out what to replace it with. This part is tricky. If you're searching a for a regexp, you can't use what you matched as a key to a hash lookup, to find what you want to replace. That is, for a hypothetical string rice flies like sand and performing the following replacements (bear with me on the mangled syntax):

/(i[a-z]*e)/ => "ubb$1" /([a-z])([ld])/ => "${1}o$2"

The resulting string would (if I am not mistaken) become rubbice folubbies lubbike sanod.

You can't use a hash lookup here, because the regexp /(i[a-z]*e)/ matches ice, ie and ike. In other words, meta-characters prevent you from using the result as a key. If you don't use meta-characters, then you can. Otherwise you need to determine the index within the list of sought patterns instead and then apply the replacement.

I investigated this approach some years ago. I needed to replace several hundred patterns in 5-10 megabyte files. At the time no modules existed on CPAN to do this. I lost interest in the approach due to time constraints (I just suffered the pain of figuring out the ordering and applying the replacements serially), but if such a module does exist (and has an elegant, intuitive interface -- this is definitely hard), I'd love to hear about it.


In reply to Re: Chaining string ops (multiple searches in linear time) by grinder
in thread Chaining string ops by traveler

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.