PerlMonks
Re: Regexp and OCR
by dk (Chaplain) on Jun 22, 2009 at 12:43 UTC ( [id://773596] )
I'm thinking aloud, so apologies if this is uninteresting, but here are my thoughts. In your example, the input can be, for example, "Crjstjan" and the regex matching it "[CL]r[ij]s[it][ij]a[hn]", which I assume burdens the regex engine somewhat, since it has to try the alternatives at every position. I wonder whether it would be faster to make a pre-run on $text so that it is first transformed into a text containing the alternatives verbatim, e.g.
make all instances of "Cristian" and "Crjstjan" look like
"<CL>r<ij>s<it><ij>a<hn>" (<> instead of [] just for the sake of visual difference) in the first place.
Possibly that helps with the speed, and if it does, then things may become interesting. First, the regex matching the names can then be a simple concatenation of lexemes like the one above. Second, the second regex run might not be needed at all: a trivial hash replacement, keyed on the transformed word, would be enough.
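A minimal sketch of that hash replacement, under some loud assumptions: the names `@classes`, `normalize`, `fix_names` and `%name_for` are made up for illustration, the text is processed word by word, and the confusion classes are disjoint, which means "<ij>" and "<it>" from the example above are merged into a single "<ijt>" class so that each character maps to exactly one token.

```perl
use strict;
use warnings;

# Hypothetical, disjoint confusion classes (an assumption, not taken from
# the original data): every character in a class becomes the same token.
my @classes = ( 'CL', 'ijt', 'hn' );

my %token_for;
for my $class (@classes) {
    $token_for{$_} = "<$class>" for split //, $class;
}

# Rewrite one word so that each ambiguous character becomes its class token.
sub normalize {
    my ($word) = @_;
    my $out = '';
    for my $ch ( split //, $word ) {
        $out .= exists $token_for{$ch} ? $token_for{$ch} : $ch;
    }
    return $out;
}

# Known names, keyed by their normalized form.
my %name_for;
$name_for{ normalize($_) } = $_ for qw(Cristian);

# Replace every word whose normalized form is a known name;
# all other words are left untouched.
sub fix_names {
    my ($text) = @_;
    $text =~ s{(\w+)}{
        my $word = $1;
        my $key  = normalize($word);
        exists $name_for{$key} ? $name_for{$key} : $word;
    }ge;
    return $text;
}

print fix_names("Dear Crjstjan, hello"), "\n";    # prints "Dear Cristian, hello"
```

The only regex left is the trivial `\w+` tokenizer; the per-name work is a single hash lookup. Whether this is actually faster than the character-class regex would need to be benchmarked on real input.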
(I know it is naive; I've seen that you match sentences, not individual words, but still.) Again, if each set of alternatives consists of at most 4 characters, I'm thinking that instead of composing them into an "<ab>" structure, one could turn them into a single Unicode character, for example pack("U1", (ord("a") << 8) + ord("b")), and thus possibly gain some extra milliseconds.
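A minimal sketch of the packing idea for a two-character class; `class_to_char` is a hypothetical helper name, and the sketch assumes both characters are plain ASCII.

```perl
use strict;
use warnings;

# Pack a two-character class such as "ab" into a single character:
# the first character goes into the high byte, the second into the low byte.
sub class_to_char {
    my ($class) = @_;
    my ( $first, $second ) = split //, $class;
    return pack( 'U', ( ord($first) << 8 ) + ord($second) );
}

my $packed = class_to_char('ab');
printf "U+%04X\n", ord $packed;    # prints "U+6162"
```

With 8 bits per character this fits two characters comfortably. Three or four characters would produce values above 0x10FFFF, the top of the Unicode range, so larger classes would probably need a narrower encoding per character or a lookup table of class numbers instead.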