Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Chaining string ops (multiple searches in linear time)

by grinder (Bishop)
on Aug 28, 2003 at 20:10 UTC ( [id://287517]=note: print w/replies, xml ) Need Help??


in reply to Chaining string ops

The main problem I have with these solutions is that the search are carried out serially. And in the the earlier thread (heh, I can't find it either, PTAV doesn't work and I've pounded on Super Search too long) on the subject, someone pointed out the dangers of replacement ordering. Mapping woman to girl and man to boy just might leave you with woboy.

update: found the original thread: multiple (different) substitutions on same $foo, thanks to jcwren fixing a bug in PTAV.

There are algorithms that perform single-pass (modulo setup) searching for multiple targets in a string. They can provide dramatic gains when you're matching many patterns, and they solve the ordering problem above. I remember Aho-Corasick and Knuth-Morris-Pratt from school, and a search on the Web turned up one I hadn't heard of, namely Commentz-Walter. (Google for these terms, with additional terms, including pattern string multiple search).

I think Sedgewick covers the first two algorithms (but my copy's at work). Otherwise a reference implementation appears to be available via ftp here: http://compilers.iecc.com/comparch/article/94-06-032.

From what I recall, these algorithms only search, they don't replace. As you scan the string, you copy over the unmatched runs to the result string, and each time you hit a match you figure out what to replace it with. This part is tricky. If you're searching a for a regexp, you can't use what you matched as a key to a hash lookup, to find what you want to replace. That is, for a hypothetical string rice flies like sand and performing the following replacements (bear with me on the mangled syntax):

/(i[a-z]*e)/ => "ubb$1" /([a-z])([ld])/ => "${1}o$2"

The resulting string would (if I am not mistaken) become rubbice folubbies lubbike sanod.

You can't use a hash lookup here, because the regexp /(i[a-z]*e)/ matches ice, ie and ike. In other words, meta-characters prevent you from using the result as a key. If you don't use meta-characters, then you can. Otherwise you need to determine the index within the list of sought patterns instead and then apply the replacement.

I investigated this approach some years ago. I needed to replace several hundred patterns in 5-10 megabyte files. At the time no modules existed on CPAN to do this. I lost interest in the approach due to time constraints (I just suffered the pain of figuring out the ordering and applying the replacements serially), but if such a module does exist (and has an elegant, intuitive interface -- this is definitely hard), I'd love to hear about it.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://287517]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2024-04-19 06:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found