comment on

Rata,

ok, it's clear that solution 1) has exponential runtime and space requirements (size of the charset doesn't even matter - (well until JavaFan points out that a charset of 1 char would...) )

I basically went for solution 2) before, but only with some half-baked heuristic constructs, so probably a more high-level description of the problem is in place: I'm working on a chatbot demo which has to catch some phrases.

Thats a simple pattern matching just before any parsing, and delivers decent speed. And the whole point of this discussion and my endeavour is, that the set of the regular expressions gets extended and it is not feasible (at least I haven't come up with a feasible way) to sort them by "do-what-i-mean" manually.

Doing a trivial "sort by length" has actually worked very well for a set of up to 500-600 rules. But now I start getting false matches ONLY because some "more generic" regular expressions are handled before the more specific ones.

Under the assumption, that (a|b|c) is more generic than (a), it is clear, that the simple "by length" ordering will result in the wrong order, because I want to process the "specific" rules first, and the "generic" last.

What has helped:

removing e.g. named matches from length considerations, making e.g. (?<name>...) 9 chars shorter
Having an alternation like (a|b|c) being considered as just having length of 1, same for character classes
actually subtracting a length of 1 for every wildcard.

Maybe the best solution could be to give every rule an identifier (which has happened anyway), and add some information "try this before ruleX". Which is starting to look like a parser. :-)

Bye
PetaMem All Perl: MT, NLP, NLU

In reply to Re^2: Analysis of Regular Expressions by PetaMem
in thread Analysis of Regular Expressions by PetaMem

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.