PerlMonks  

Re: Regexp and OCR

by dk (Chaplain)
on Jun 22, 2009 at 12:43 UTC ( [id://773596] )


in reply to Regexp and OCR

I'm thinking aloud here, so apologies if this is uninteresting. In your example the input can be, e.g., "Crjstjan", and the regex matching it is "[CL]r[ij]s[it][ij]a[hn]", which I assume burdens the regex engine somewhat, since it has to try the alternatives at every position. I wonder whether it would be faster to make a pre-pass over $text that rewrites it with the alternations spelled out verbatim, i.e. to turn every instance of "Cristian" and "Crjstjan" into "<CL>r<ij>s<it><ij>a<hn>" (<> instead of [] just for the sake of visual difference) in the first place.

If that does help with speed, things may become interesting. First, the regex matching the names can then be a simple concatenation of lexemes like the one above. Second, the second regex run may not be needed at all: a trivial hash replacement would be enough, something along these lines:

my %replace = ( "<CL>r<ij>s<it><ij>a<hn>" => "Cristian", ... );
# match whole tokens, angle brackets included (\w alone stops at "<")
$text =~ s/([\w<>]+)/exists($replace{$1}) ? $replace{$1} : $1/ge;

(I know it is naive; I've seen that you match sentences, not individual words, but still.)
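
To make the idea concrete, here is a minimal sketch of the pre-pass plus hash lookup. Everything in it is an assumption of mine rather than tested production code: the names normalize() and fix_names(), the sample names, and above all the group list. Since "i" shows up in both [ij] and [it], a per-character rewrite cannot tell which class is meant, so the sketch merges the two into one group <ijt>. That accepts a few more misreadings than the original regex would. It also normalizes word by word, so that words not found in the hash keep their original spelling.

```perl
use strict;
use warnings;

# Hypothetical confusion groups. 'i' occurs in both [ij] and [it] in the
# original regex, so the two are merged: each character must map to
# exactly one token for a context-free rewrite to work.
my @groups = qw(CL ijt hn);

my %token;
for my $g (@groups) {
    $token{$_} = "<$g>" for split //, $g;
}
my $chars     = join '', sort keys %token;
my $ambiguous = qr/([$chars])/;

# Rewrite every ambiguous character as its group token.
sub normalize {
    my ($s) = @_;
    $s =~ s/$ambiguous/$token{$1}/g;
    return $s;
}

# Derive the lookup keys from the clean names instead of typing them by hand.
my %replace;
$replace{ normalize($_) } = $_ for qw(Cristian Julian);

# Replace each word whose normalized form is a known key; leave the rest alone.
sub fix_names {
    my ($text) = @_;
    $text =~ s{(\w+)}{
        my $word = $1;
        my $key  = normalize($word);
        exists $replace{$key} ? $replace{$key} : $word;
    }ge;
    return $text;
}

print fix_names("Crjstjan and Juljah paid; Martin did not."), "\n";
# prints: Cristian and Julian paid; Martin did not.
```

Two clean names that normalize to the same key would silently overwrite each other in %replace, so a real name list would need a collision check.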

Also, if the alternations consist of at most 4 characters each, I'm thinking that instead of composing them into an "<ab>" structure, one could fold each into a single Unicode character, e.g. pack("U1", (ord("a") << 8) + ord("b")), and thus possibly gain some extra milliseconds.
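
A small sketch of that packing, with a hypothetical helper group_char() and non-overlapping groups that I picked for illustration. One caveat: with 8 bits per character, two ASCII characters stay below 0x8000 and are safe, but a third character already pushes the value past U+10FFFF, the highest Unicode code point. So I would only rely on this for two-character groups unless the characters are mapped to smaller numbers first.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a two-character group such as "ij" into one
# code point, high byte first, as in the pack() expression above.
sub group_char {
    my ($group) = @_;
    my ($hi, $lo) = map { ord } split //, $group;
    return pack("U", ($hi << 8) + $lo);
}

my $packed = group_char("ij");
printf "U+%04X, length %d\n", ord($packed), length($packed);
# prints: U+696A, length 1

# Map every member of a group to that group's single character.
my %token;
for my $g (qw(CL ij hn)) {
    $token{$_} = group_char($g) for split //, $g;
}

(my $key = "Crjstjan") =~ s/([CLijhn])/$token{$1}/g;
print length($key), "\n";
# prints: 8
```

The normalized key is then exactly as long as the word itself, so hash keys and any regex built from them stay short.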

Replies are listed 'Best First'.
Re^2: Regexp and OCR
by sflitman (Hermit) on Jun 27, 2009 at 22:12 UTC
    I'd suggest coding up the two approaches, running them with Benchmark, and posting the results. For my problem, this wouldn't work because the OCR text is already variable, so replacing Cristian in that text would be much the same problem as identifying Cristian as account number 14222 in the first place.
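
A comparison along those lines could be set up roughly as follows, using cmpthese from the core Benchmark module. The sample text, the sub names by_regex() and by_hash(), and the merged <ijt> group are all made up for the sketch, and timings will depend heavily on the real data, so treat any result as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample text; real OCR output would be larger and messier.
my $sample = join ' ', ('Dear Crjstjan, payment from Lristiah received.') x 200;

# Approach 1: the character-class regex, run straight on the raw text.
sub by_regex {
    my ($text) = @_;
    $text =~ s/\b[CL]r[ij]s[it][ij]a[hn]\b/Cristian/g;
    return $text;
}

# Approach 2: normalize each word, then look it up in a hash.
# 'i' is in both [ij] and [it], so the groups are merged into <ijt>.
my %token = (
    C => '<CL>',  L => '<CL>',
    i => '<ijt>', j => '<ijt>', t => '<ijt>',
    h => '<hn>',  n => '<hn>',
);
my %replace = ('<CL>r<ijt>s<ijt><ijt>a<hn>' => 'Cristian');

sub by_hash {
    my ($text) = @_;
    $text =~ s{(\w+)}{
        my $word = $1;
        (my $key = $word) =~ s/([CLhijnt])/$token{$1}/g;
        exists $replace{$key} ? $replace{$key} : $word;
    }ge;
    return $text;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { by_regex($sample) },
    hash  => sub { by_hash($sample) },
});
```

With a single name the regex has very little to do, so the interesting case is probably a long list of names, where approach 1 needs one pattern (or one big alternation) per name and approach 2 still does one hash lookup per word.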

    SSF
