Re: Regexp and OCR

I'm thinking aloud, so apologies if this is uninteresting, but here are my thoughts. In your example the input can be, for instance, "Crjstjan", and the regex matching it is "[CL]r[ij]s[it][ij]a[hn]", which I assume burdens the regex engine somewhat, since it has to try the alternatives at each position. I wonder whether it would be faster to make a pre-run on $text first, so that it is transformed into a text containing the alternatives verbatim, e.g. make all instances of "Cristian" and "Crjstjan" look like "<CL>r<ij>s<it><ij>a<hn>" (<> instead of [] just for the sake of visual difference).
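To make the pre-run concrete, here is a minimal sketch. It rests on one assumption that is not in your pattern: the confusion classes must be disjoint, otherwise a character like "i" (which appears in both [ij] and [it]) would have no unique token. So I merge those two into a hypothetical class "ijt", which also makes the canonical form slightly more permissive than the original regex. Note that the pass rewrites every word in the text, not only the names.

```perl
use strict;
use warnings;

# Hypothetical confusion classes, assumed disjoint so that every
# character maps to exactly one token. [ij] and [it] from the
# original pattern overlap on "i", so they are merged here.
my @classes = qw(CL ijt hn);

# Map each confusable character to its class token, e.g. "j" => "<ijt>".
my %token_for;
for my $class (@classes) {
    $token_for{$_} = "<$class>" for split //, $class;
}

# Body of a character class that matches any confusable character.
my $confusable = join '', map { quotemeta($_) } keys %token_for;

# Replace each confusable character by its token in a single pass.
# s///g does not rescan the inserted tokens, so "<CL>" is left alone.
sub normalize {
    my ($text) = @_;
    $text =~ s/([$confusable])/$token_for{$1}/g;
    return $text;
}

print normalize("Cristian"), "\n";   # <CL>r<ijt>s<ijt><ijt>a<hn>
print normalize("Crjstjan"), "\n";   # <CL>r<ijt>s<ijt><ijt>a<hn>
```

Both spellings end up as the same string, which is what would make a plain string comparison or hash lookup possible afterwards.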

Possibly that helps with the speed, and if it does, things may become interesting. First, the regex matching the names can then be a simple concatenation of lexemes like the one above. Second, the second regex run might not be needed at all: a trivial hash replacement would be enough, something along these lines:

my %replace = (
 "<CL>r<ij>s<it><ij>a<hn>" => "Christian",
 ...
);
# \w alone would stop at "<" and ">", so allow them inside a token
$text =~ s/([\w<>]+)/exists($replace{$1}) ? $replace{$1} : $1/ge;

(I know it is naive; I've seen that you match sentences, not individual words, but still.)

Again, if the alternatives consist of at most 4 characters, I'm thinking that instead of composing them into the "<ab>" structure, one could pack them into a single Unicode character, e.g. pack("U1", (ord("a") << 8) + ord("b")), and thus possibly gain some extra milliseconds.
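A small sketch of that packing, assuming a class of exactly two ASCII characters (class_char is a name I made up for illustration). With 8-bit shifts, two ASCII characters stay well inside the Unicode range; three or more would exceed U+10FFFF, so larger classes would need a different encoding.

```perl
use strict;
use warnings;

# Assumption: each class holds exactly two ASCII characters, so the
# combined value stays below 0x8000 and is a valid code point.
sub class_char {
    my ($first, $second) = @_;
    return pack("U", (ord($first) << 8) + ord($second));
}

my $ij = class_char("i", "j");

# The whole class is now one character as far as the regex engine
# and length() are concerned.
printf "U+%04X, length %d\n", ord($ij), length($ij);   # U+696A, length 1
```

The resulting characters are not meant to be readable, they only serve as compact, unique stand-ins for the "<ij>" style tokens.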

Re^2: Regexp and OCR by sflitman (Hermit) on Jun 27, 2009 at 22:12 UTC
I'd suggest coding up the two, running them with Benchmark, and posting the results. For my problem, this wouldn't work because the OCR text is already variable, so replacing Cristian in that text would kind of be the same problem as identifying Cristian as account number 14222 in the first place. SSF
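One possible shape for such a comparison, using cmpthese from the core Benchmark module. Everything here is an assumption for illustration: the sample text, the sub names, and the merged class "ijt" (needed so that each character maps to a single token). It only measures the toy case and says nothing about real OCR input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input; real OCR text would be far noisier.
my $sample = join ' ', ('Crjstjan and Lristian met Cristian') x 500;

# Approach 1: match the name directly with character classes.
sub with_classes {
    my $text = $sample;
    my $n = () = $text =~ /\b[CL]r[ij]s[it][ij]a[hn]\b/g;
    return $n;
}

# Approach 2: normalize first (assumed disjoint classes CL, ijt, hn),
# then look whole tokens up in a hash.
my %token_for = (
    C => '<CL>',  L => '<CL>',
    i => '<ijt>', j => '<ijt>', t => '<ijt>',
    h => '<hn>',  n => '<hn>',
);
my %replace = ('<CL>r<ijt>s<ijt><ijt>a<hn>' => 'Christian');

sub with_prepass {
    my $text = $sample;
    $text =~ s/([CLijthn])/$token_for{$1}/g;
    my $n = 0;
    while ($text =~ /([\w<>]+)/g) {
        $n++ if exists $replace{$1};
    }
    return $n;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    classes => \&with_classes,
    prepass => \&with_prepass,
});
```

The normalization cost is deliberately included in the second sub, since the pre-run has to be paid for somewhere.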