Samn has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regex simplification
by Popcorn Dave (Abbot) on Aug 26, 2002 at 03:05 UTC
    If you're going to do this with a regex, you'll need to use the capturing parentheses of the match operator (m//).

    For the example text you posted, this does work.

    #!/usr/bin/perl -w
    use strict;

    my $line = '<!-- USER 20 - donkey_pusher_6 -->';
    print $1 if $line =~ m/\s-{1}\s(\w.+)\s/i;

    What it is saying is: look for a single dash surrounded by whitespace. Then a word character followed by everything up to the last whitespace on the line is stored in $1 (the .+ is greedy, so here it stops at the space before -->). That's where the parentheses come in with your match. If you have more than one set of parens, the later matches are stored in $2, $3, etc.
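
    To illustrate the point about several sets of parens, here is a minimal sketch with two capture groups. The pattern is only an assumption about the data format, based on the one sample line; the variable names $id and $name are made up for the example.

    ```perl
    use strict;
    use warnings;

    my $line = '<!-- USER 20 - donkey_pusher_6 -->';

    # Two capture groups: the numeric id lands in $1, the name in $2.
    my ($id, $name);
    if ($line =~ m/USER\s+(\d+)\s+-\s+(\S+)/) {
        ($id, $name) = ($1, $2);
        print "id=$id name=$name\n";
    }
    ```

    Copying $1 and $2 into named variables right after the match keeps them from being overwritten by a later successful match.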

    I am going on the assumption here that all your data is in that format. If not, then hopefully that will give you a start in the right direction.

    Good luck!

    Some people fall from grace. I prefer a running start...

      I'd do it like this:
      $line = '<!-- USER 20 - donkey_pusher_6 -->'; print $1 if $line =~ m/<!-- USER \d+ - ([^\s]+)/i;
      because you might get false-positive matches the other way.
      --
      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
        print $1 if $line =~ m/<!-- USER \d+ - ([^\s]+)/i;
        This is good, but \S is shorter than [^\s], so:
        print $1 if $line =~ m/<!-- USER \d+ - (\S+)/i;
        Although to get a little closer to the original specification, I'd put:
        my $user = undef;
        for (@site) {
            if (/<!--.*USER.* (\S+) -->/) {
                print "joe:\n";
                $user = $1;
                last;
            }
        }
        print "user = $user\n";
Re: Regex simplification
by Django (Pilgrim) on Aug 26, 2002 at 04:29 UTC

    To gain performance I would use the non-backtracking subpatterns "(?> )". From the Camel's third edition (p. 206):
    "...if you're going to fail, it's best to fail quickly and get on with your life."

    my @users;
    foreach (@site) {
        / ^ (?>\s*) <!-- (?>\s+) USER (?>\s+) (?>\d+) (?>\s+) - (?>\s+)
          (\S+?) (?>\s+) --> (?>\s*) $ /ix
            and push @users, $1;
    }
    print @users;

    I've also specified the pattern as exactly as possible, because this will also fail earlier and thus speed up the engine.
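
    For anyone who hasn't met "(?> )" before, here is a small sketch of what "non-backtracking" means in practice. The string 'aaab' and the variable names are made up for the demonstration and have nothing to do with the OP's data.

    ```perl
    use strict;
    use warnings;

    my $text = 'aaab';

    # Ordinary group: a+ first grabs "aaa", then gives one "a" back so "ab" can match.
    my $plain = $text =~ /^(?:a+)ab/ ? 1 : 0;

    # Atomic group: a+ grabs "aaa" and never gives anything back, so "ab" cannot match.
    my $atomic = $text =~ /^(?>a+)ab/ ? 1 : 0;

    print "plain=$plain atomic=$atomic\n";
    ```

    The atomic version fails as soon as the rest of the pattern can't match, without retrying shorter runs of "a". That is the "fail quickly" behaviour, but it also means an atomic group can reject a string the ordinary group would accept.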

Re: Regex simplification
by ehdonhon (Curate) on Aug 26, 2002 at 04:39 UTC

    One thing that might speed up your code is to use a compiled regex:

    my $search = qr/--\s*USER\s+\d+\s*-\s*(\w+)/;
    foreach $line (@site) {
        next unless ( $line =~ $search );
        $user = $1;
    }
Re: Regex simplification
by Arien (Pilgrim) on Aug 26, 2002 at 08:27 UTC

    Extracting the lines that match from an array of lines using the Perl function grep (as opposed to the program) is no more complicated than this:

    my @matches = grep /PATTERN/, @lines;

    Now, since you will be extracting the usernames from these matches as well, you might as well do that while matching, as explained by Popcorn Dave.

    Don't use "dot star" (.*) in your regex (although some regexes above do), because it will cause unnecessary backtracking. Dot matches anything but a newline by default, and the star means "zero or more of the preceding". So when the engine gets to "dot star" while matching a line, it first matches all the way to the end of the line, and then gives back characters one at a time until the rest of the pattern can match. Things get worse when "dot star" appears more than once in the regex.
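
    Besides the extra work, a greedy "dot star" can also end up capturing from a different part of the line than you meant. A small sketch, assuming a made-up line that carries a second comment after the one we want:

    ```perl
    use strict;
    use warnings;

    # Hypothetical line with a second comment after the one we want.
    my $line = '<!-- USER 20 - donkey_pusher_6 --> <!-- note -->';

    # Greedy .* runs to the end of the line, then backs off only as far as needed,
    # so the capture ends up in the last comment.
    my ($greedy) = $line =~ /<!--.*USER.* (\S+) -->/;

    # The explicit pattern has nothing to over-match.
    my ($exact) = $line =~ /<!-- USER \d+ - (\S+) -->/;

    print "greedy=$greedy exact=$exact\n";
    ```

    Here the greedy pattern captures "note", while the explicit one captures "donkey_pusher_6".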

    As far as the regex goes, it seems from your code that this will do just fine:

    /<!-- USER \d+ - (\S+) -->/i

    That is, match <!-- USER followed by a space, some number, a space, a minus, a space, one or more occurrences of a non-whitespace character, a space, and finally -->. All this case-insensitively.

    Although non-backtracking subpatterns admittedly will help you somewhat in making your code faster, I would not use them if they're not really needed: they would just obscure what is happening.

    Putting it all together, you would end up with something like this:

    my @users; foreach (@lines) { /<!-- USER \d+ - (\S+) -->/i and push @users, $1; }

    You may see people doing the same thing like this:

    my @users = map { /<!-- USER \d+ - (\S+) -->/i ? $1 : () } @lines;

    What is happening here is that for each element of @lines you check whether the line matches your regex. If so, you add the value of $1 (the username) to the list of @users; if not, you add an empty list (i.e. nothing) to @users. This might come in handy when reading other people's code.
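
    To see that the two forms give the same result, here is a small runnable sketch. The contents of @lines are made up for the example, and the names @users_loop and @users_map exist only to keep the two results apart.

    ```perl
    use strict;
    use warnings;

    # Made-up sample input; only two lines carry a USER comment.
    my @lines = (
        '<!-- USER 20 - donkey_pusher_6 -->',
        '<p>some other markup</p>',
        '<!-- USER 21 - mule_driver -->',
    );

    # foreach version: push the capture only when the line matches.
    my @users_loop;
    foreach (@lines) {
        /<!-- USER \d+ - (\S+) -->/i and push @users_loop, $1;
    }

    # map version: a non-matching line contributes the empty list.
    my @users_map = map { /<!-- USER \d+ - (\S+) -->/i ? $1 : () } @lines;

    print "@users_loop\n";
    print "@users_map\n";
    ```

    Both print statements give the same two usernames; the line without a USER comment adds nothing in either version.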

    Hope this helps.

    — Arien

    Edit: Also, if you know what you are looking for can only appear at the start of the line you can speed things up by anchoring your regex (using ^) like this:

    /^<!-- USER \d+ - (\S+) -->/i
Re: Regex simplification
by mephit (Scribe) on Aug 26, 2002 at 20:00 UTC
    Hmm, isn't substr usually faster than a regex? If so, how about the following approach:

    • Use rindex to find the indices of the last and second-to-last spaces, as the OP requires.
    • Find the difference between those indices to get the length of the desired string, and use that value (along with the index of the second-to-last space) in a substr call to get the required data.
    Well, I'm sure that it would work, but would it be faster? I'll probably benchmark this myself sometime when I have the time to create data and code to test.

    Anyway, that's my (Not-So-)Good Idea for the day.

    Update: I just ran some benchmarks on a few of the methods suggested. Here's my code and results:

    use Benchmark;    # provides timethese()

    my $str = '<!-- USER 20 - donkey_pusher_6 -->';
    my $data;
    my $re = qr/--\s*USER\s+\d+\s*-\s*(\w+)/;
    my ($start, $end);

    sub by_re_noback {
        ($data) = ($str =~ / ^ (?>\s*) <!-- (?>\s+) USER (?>\s+) (?>\d+) (?>\s+) - (?>\s+)
                             (\S+?) (?>\s+) --> (?>\s*) $ /ix);
    }

    sub by_re {
        # As posted, this matches against $line, which is never set in this
        # script ($str was presumably intended), so the "re" timing below is
        # not comparable to the others.
        ($data) = ($line =~ m/<!-- USER \d+ - (\S+)/i);
    }

    sub by_re_comp {
        ($data) = ($str =~ $re);
    }

    sub by_substr {
        $end   = rindex($str, ' ');              # last space
        $start = rindex($str, ' ', $end - 1);    # second-to-last space
        # Length excludes the trailing space (the posted code used $end - $start).
        $data  = substr($str, $start + 1, $end - $start - 1);
    }

    timethese (100000, {
        subst     => \&by_substr,
        re_comp   => \&by_re_comp,
        re        => \&by_re,
        re_noback => \&by_re_noback,
    });

    --results--
    Benchmark: timing 100000 iterations of re, re_comp, re_noback, subst...
           re:  1 wallclock secs ( 0.46 usr +  0.00 sys =  0.46 CPU) @ 217391.30/s (n=100000)
      re_comp:  4 wallclock secs ( 4.35 usr +  0.00 sys =  4.35 CPU) @ 22988.51/s (n=100000)
    re_noback:  6 wallclock secs ( 6.27 usr +  0.00 sys =  6.27 CPU) @ 15948.96/s (n=100000)
        subst:  1 wallclock secs ( 1.40 usr +  0.00 sys =  1.40 CPU) @ 71428.57/s (n=100000)

    --

    There are 10 kinds of people -- those that understand binary, and those that don't.