... There are in the order of 20,000 texts and 50,000-100,000 patterns,

In general it's much faster to match one large regex once than to match many small regexes one after another.

That means you could try to combine groups of $x of your original regexes into a single regex each.

Now it seems you have to know which regex matched, which means you have to distinguish them. In perl 5.10.0 or above you can use named captures. If you can't require such a new perl version, you can try something like this instead:

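use re 'eval';    # required wherever the assembled string is interpolated into a regex

our $which_matched;

# Takes name => pattern pairs and returns one alternation as a string.
# Each alternative ends in a code block that records its name in $which_matched.
sub assemble_regex {
    my %regexes = @_;
    return join '|',
           map { qq[(?:$regexes{$_})(?{\$which_matched='$_'})] }
           keys %regexes;
}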

This assumes that the keys in %regexes contain no single quotes and no trailing backslash, because each key is pasted into a single-quoted string inside the code block.
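A minimal sketch of how the assembled pattern might be used, so the moving parts are visible in one place. The helper is repeated here with an interpolating qq[] so the example runs on its own; the pattern names, the sample texts and the classify() wrapper are made up for illustration, and `use re 'eval'` is there because the code blocks reach the regex through an interpolated string.

```perl
use strict;
use warnings;
use re 'eval';    # (?{ ... }) comes from an interpolated string

our $which_matched;

# Build one alternation; each branch records its own name on success.
sub assemble_regex {
    my %regexes = @_;
    return join '|',
           map { qq[(?:$regexes{$_})(?{\$which_matched='$_'})] }
           sort keys %regexes;    # sorted only to make the order predictable
}

# Hypothetical patterns, keyed by the name we want reported.
my $pattern = assemble_regex(number => '\d+', word => '[a-z]+');
my $re      = qr/$pattern/;

# Hypothetical wrapper: returns the name of the branch that matched, or undef.
sub classify {
    my ($text) = @_;
    local $which_matched;
    return $text =~ $re ? $which_matched : undef;
}

for my $text ('abc', '42', '!!!') {
    my $name = classify($text);
    print "$text: ", (defined $name ? $name : 'no match'), "\n";
}
# prints:
# abc: word
# 42: number
# !!!: no match
```

Compiling the combined pattern once with qr// and reusing it for every text keeps the assembly cost out of the matching loop.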

If many of the patterns are constant strings, consider upgrading to perl 5.10.0 - it greatly speeds up matching of many constant alternatives.
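If upgrading is an option, the named-capture route mentioned earlier could look roughly like this. It is only a sketch: the pattern names, sample texts and the matching_name() helper are invented for the example, and it assumes every key is a valid capture name (letters, digits and underscores, not starting with a digit).

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need perl 5.10.0 or later

# Hypothetical patterns, keyed by the name we want reported.
my %regexes = (
    number => '\d+',
    word   => '[a-z]+',
);

# Wrap each pattern in a named group and join them into one alternation.
my $combined = join '|',
    map { "(?<$_>$regexes{$_})" } sort keys %regexes;
my $re = qr/$combined/;

# Hypothetical helper: returns the name of the group that matched, or undef.
sub matching_name {
    my ($text) = @_;
    return undef unless $text =~ $re;
    # Only groups that took part in the match have a defined value in %+.
    my ($name) = grep { defined $+{$_} } keys %+;
    return $name;
}

say matching_name('abc') // 'no match';    # prints "word"
say matching_name('42')  // 'no match';    # prints "number"
say matching_name('!!!') // 'no match';    # prints "no match"
```

Unlike the code-block version, this needs no `use re 'eval'` and no global variable, at the price of restricting what the keys may look like.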

In reply to Re: Efficient regex matching with qr//; Can I do better? by moritz
in thread Efficient regex matching with qr//; Can I do better? by kruppy
