I'm thinking aloud here, so apologies if this is uninteresting. In your example the input can be "Crjstjan", matched by the regex "[CL]r[ij]s[it][ij]a[hn]", which I assume burdens the regex engine somewhat, since it has to try the alternatives at each position. I wonder whether it would be faster to make a preliminary pass over $text that rewrites it with the alternatives spelled out verbatim, e.g. turning all instances of "Cristian" and "Crjstjan" into "<CL>r<ij>s<it><ij>a<hn>" (<> instead of [] just for the sake of visual difference).
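A minimal sketch of such a pre-run, to make the idea concrete. The %class_of table and the canonicalize() name are my own assumptions for illustration; the real confusion classes would come from whatever the OCR actually mixes up.

```perl
use strict;
use warnings;

# Hypothetical confusion table: each character maps to the class token
# it should be rewritten as. Characters not listed are left alone.
my %class_of = (
    C => '<CL>', L => '<CL>',
    i => '<ij>', j => '<ij>',
    t => '<it>',
    h => '<hn>', n => '<hn>',
);

# Build one character class from the table keys.
my $set = join '', map { quotemeta } sort keys %class_of;

# Single pass over the text; s///g does not rescan the inserted tokens,
# so the letters inside "<ij>" are not rewritten again.
sub canonicalize {
    my ($text) = @_;
    $text =~ s/([$set])/$class_of{$1}/g;
    return $text;
}

print canonicalize("Cristian"), "\n";
print canonicalize("Crjstjan"), "\n";
```

Both spellings come out as the same token string, so later lookups only need to know one form. Note that this rewrites every listed character in the whole text, not just the ones inside names, so whether it is actually faster overall would have to be measured.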

Possibly that helps with the speed, and if it does, things may become interesting. First, the regex matching the names can then be a simple concatenation of lexemes like the one above. Second, a second regex run might not be needed at all; a trivial hash replacement would be enough, something along these lines:

my %replace = (
 "<CL>r<ij>s<it><ij>a<hn>" => "Christian",
 # ...
);
# The tokens contain < and >, which \w does not match, so allow them explicitly.
$text =~ s/([\w<>]+)/exists($replace{$1}) ? $replace{$1} : $1/ge;

(I know it is naive; I've seen that you match sentences, not individual words, but still.)

Again, if the alternatives consist of at most 4 characters each, I'm thinking that instead of composing them into the "<ab>" structure, one could pack each of them into a single Unicode character, e.g. pack("U", (ord("a") << 8) + ord("b")), and thus possibly gain some extra milliseconds.
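Here is a rough sketch of that packing, limited to two-character ASCII classes. The function names are made up for illustration, and chr() stands in for pack("U", ...). With an 8-bit shift per character, three or four characters would exceed the largest Unicode code point (0x10FFFF), so longer classes would need a different encoding.

```perl
use strict;
use warnings;

# Pack a two-character ASCII class such as "ij" into one character:
# first character in the high byte, second in the low byte.
sub class_to_char {
    my ($class) = @_;
    my ($hi, $lo) = map { ord } split //, $class;
    return chr(($hi << 8) + $lo);
}

# Reverse the packing, mostly useful for debugging.
sub char_to_class {
    my ($char) = @_;
    my $code = ord $char;
    return chr($code >> 8) . chr($code & 0xFF);
}

my $packed = class_to_char("ij");
printf "U+%04X\n", ord $packed;        # i is 0x69, j is 0x6A, so U+696A
print char_to_class($packed), "\n";    # ij
```

The packed characters land among real code points (U+696A is a CJK ideograph), so this only works if the input text can never contain those characters itself.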

In reply to Re: Regexp and OCR by dk
in thread Regexp and OCR by sflitman
