Samn has asked for the wisdom of the Perl Monks concerning the following question:
Replies are listed 'Best First'.
Re: Regex simplification
by Popcorn Dave (Abbot) on Aug 26, 2002 at 03:05 UTC
For the example text you posted, this does work.
What it is saying is: look for a single dash surrounded by spaces. Then the next alphanumeric characters, up until the next space, are stored in $1. That's where the parentheses come in with your match. If you have more than one set of parens, then your matches are stored in $2, $3, etc. I am going on the assumption here that all your data is in that format. If not, then hopefully that will give you a start in the right direction. Good luck!
Some people fall from grace. I prefer a running start...
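To illustrate the capture behaviour described above, here is a minimal sketch. The sample line and the variable names are assumptions made up for illustration (the line format is modeled on the regexes quoted elsewhere in this thread); this is not the snippet from the original reply.

```perl
use strict;
use warnings;

# Hypothetical sample line, modeled on the format discussed in this thread.
my $line = '<!-- USER 12345 - samn -->';

# A dash surrounded by spaces, then capture the run of word characters
# that follows, up to the next space.
my $user;
if ($line =~ / - (\w+) /) {
    $user = $1;    # first (and only) set of parentheses
    print "user: $user\n";
}

# With more than one set of parentheses, the captures land in $1, $2, ...
my ($id, $name);
if ($line =~ /(\d+) - (\w+) /) {
    ($id, $name) = ($1, $2);
    print "id: $id, name: $name\n";
}
```

Copying $1 and $2 into named variables right after the match is a defensive habit: the capture variables are only meaningful after a successful match and are replaced by the next successful one.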
by Cody Pendant (Prior) on Aug 26, 2002 at 03:22 UTC
because you might get false-positive matches the other way. --
by Thelonius (Priest) on Aug 26, 2002 at 03:53 UTC
This is good, but \S is shorter than [^\s], so:
Although, to get a little closer to the original specification, I'd put:
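As a quick check of the \S shorthand mentioned above, the following sketch runs both spellings against the same line. The sample line and both patterns are assumptions for illustration, not the regexes from the original reply.

```perl
use strict;
use warnings;

# Hypothetical sample line in the format discussed in this thread.
my $line = '<!-- USER 12345 - samn -->';

# A match in list context returns the captured groups.
my ($long)  = $line =~ / - ([^\s]+) /;   # negated character class
my ($short) = $line =~ / - (\S+) /;      # same set of characters, shorter

print "both capture: $short\n" if $long eq $short;
```

\S and [^\s] match exactly the same characters, so the choice between them is purely about readability.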
Re: Regex simplification
by Django (Pilgrim) on Aug 26, 2002 at 04:29 UTC
To gain performance, I would use non-backtracking subpatterns, "(?> )".
I've also specified the pattern as exactly as possible, because a more specific pattern fails earlier on non-matching input and thus speeds up the engine.
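A minimal sketch of what non-backtracking (atomic) subpatterns could look like for this kind of data. The sample lines, the anchored pattern, and the variable names are assumptions for illustration, not the regex from the original reply.

```perl
use strict;
use warnings;

# Hypothetical lines; the second one is malformed on purpose.
my @lines = (
    '<!-- USER 12345 - samn -->',
    '<!-- USER 12345 samn -->',
);

# (?>\d+) and (?>\S+) are atomic groups: once they have matched, the engine
# will not give characters back to try other ways of matching the rest.
my $re = qr/^<!-- USER (?>\d+) - ((?>\S+)) -->/;

my @users;
for my $line (@lines) {
    push @users, $1 if $line =~ $re;
}
print "@users\n";
```

On the malformed line the engine matches the digits, fails to find " - " after them, and gives up on that starting position without retrying shorter runs of digits.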
Re: Regex simplification
by ehdonhon (Curate) on Aug 26, 2002 at 04:39 UTC
One thing that might speed up your code is to use a compiled regex:
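A minimal sketch of a precompiled regex using qr//, assuming the line format discussed in this thread (the sample data and variable names are made up for illustration). The gain is mostly seen when the pattern is built from a variable; a constant literal pattern is already compiled only once.

```perl
use strict;
use warnings;

# Hypothetical input lines.
my @lines = (
    '<!-- USER 12345 - samn -->',
    'some other line',
    '<!-- user 67890 - django -->',
);

# Compile once, outside the loop, and reuse the compiled object.
my $user_re = qr/<!-- USER \d+ - (\S+) -->/i;

my @users;
for my $line (@lines) {
    if ($line =~ $user_re) {
        push @users, $1;
    }
}
print join(', ', @users), "\n";
```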
Re: Regex simplification
by Arien (Pilgrim) on Aug 26, 2002 at 08:27 UTC
Extracting the matching lines from an array of lines with the Perl function grep (as opposed to the program) is no more complicated than this:

my @matches = grep /PATTERN/, @lines;

Now, since you will be extracting the usernames from these matches as well, you might as well do that while matching, as explained by Popcorn Dave.

Don't use "dot star" (.*) in your regex (although some regexes above do), because it will cause unnecessary backtracking. Dot matches anything but a newline by default, and the star indicates "zero or more of the preceding". So when the engine gets to "dot star" while trying to match a line, it first matches to the end of the line and then gives back, character by character, whatever is necessary for an overall match. Things get worse when "dot star" makes more appearances in the regex.

As far as the regex goes, it seems from your code that this will do just fine:

/<!-- USER \d+ - (\S+) -->/i

That is, match <!-- USER followed by a space, some number, a space, a minus, a space, one or more occurrences of a non-whitespace character, a space, and finally -->. All this case-insensitively.

Although non-backtracking subpatterns admittedly will help you somewhat in making your code faster, I would not use them if they're not really needed: they would just obscure what is happening.

Putting it all together, you would end up with something like this:
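A minimal sketch of how those pieces could fit together. The sample lines and variable names are assumptions made up for illustration; only the regex is taken from the text above.

```perl
use strict;
use warnings;

# Hypothetical input; in practice @lines would come from the file being parsed.
my @lines = (
    '<!-- USER 12345 - samn -->',
    '<p>some unrelated markup</p>',
    '<!-- user 67890 - django -->',
);

# Whole lines that match, via grep.
my @matches = grep { /<!-- USER \d+ - \S+ -->/i } @lines;

# Usernames only: capture while matching.
my @users;
foreach my $line (@lines) {
    push @users, $1 if $line =~ /<!-- USER \d+ - (\S+) -->/i;
}

print scalar(@matches), " matching lines\n";
print "user: $_\n" for @users;
```

If only the usernames are needed, the grep step can be dropped, since the loop already skips the lines that do not match.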
You may see people doing the same thing like this:

my @users = map { /<!-- USER \d+ - (\S+) -->/i ? $1 : () } @lines;

What is happening here is that for each element of @lines you check whether the line matches your regex. If so, you add the value of $1 (the username) to the list of @users; if not, you add an empty list (i.e. nothing) to @users. This might come in handy when reading other people's code.

Hope this helps.

-- Arien

Edit: Also, if you know that what you are looking for can only appear at the start of the line, you can speed things up by anchoring your regex (using ^) like this:

/^<!-- USER \d+ - (\S+) -->/i
Re: Regex simplification
by mephit (Scribe) on Aug 26, 2002 at 20:00 UTC
Anyway, that's my (Not-So-)Good Idea for the day.
Update: I just ran some benchmarks on a few of the methods suggested. Here's my code and results:
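For anyone who wants to run a comparison of their own, here is a minimal sketch using the core Benchmark module. The sample data, the three candidate regexes, and the iteration count are assumptions chosen for illustration; this is not the benchmark from the original reply, and no timings are claimed here because they depend on the machine and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: 300 lines, 200 of which match.
my @lines = (
    '<!-- USER 12345 - samn -->',
    '<p>some unrelated markup</p>',
    '<!-- USER 67890 - django -->',
) x 100;

# Each candidate extracts the usernames and returns how many it found.
my %candidates = (
    dot_star => sub {
        my @u = map { /<!-- USER .* - (.*) -->/i ? $1 : () } @lines;
        return scalar @u;
    },
    plain => sub {
        my @u = map { /<!-- USER \d+ - (\S+) -->/i ? $1 : () } @lines;
        return scalar @u;
    },
    anchored => sub {
        my @u = map { /^<!-- USER \d+ - (\S+) -->/i ? $1 : () } @lines;
        return scalar @u;
    },
);

# Sanity check first: a fast regex that finds the wrong thing is no use.
for my $name (sort keys %candidates) {
    printf "%-8s finds %d users\n", $name, $candidates{$name}->();
}

# Run each candidate 2000 times and print a comparison table.
cmpthese(2_000, \%candidates);
```

cmpthese prints a rate for each candidate and the percentage differences between them. Real input is worth using in place of the sample lines, since the cost of backtracking depends heavily on what the non-matching lines look like.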
-- There are 10 kinds of people -- those that understand binary, and those that don't.