Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

scan, match and extract

by seismofish (Novice)
on May 19, 2022 at 11:13 UTC ( [id://11143998]=perlquestion: print w/replies, xml ) Need Help??

seismofish has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, Monks

I frequently have to scan files for lines matching various patterns and extract data from those lines. I use code like

my $re_dn = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link down +}; my $re_dn = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link up, +(\d+)Mbps, ([^,]+), lpa (\w+)}; while(<>) { if ( /$qr_dn/ ) { my( $mon, $day, $time, $interface ) = ($1,$2,$3,$4); # process next; } if ( /$qr_up/ ) { my( $mon, $day, $time, $interface, $rate, $duplex, $lpa ) += ($1,$2,$3,$4,$5,$6,$7); # process next; } }
but I hate constructs like ($1,$2,$3,$4,$5,$6,$7)

I'd love to know a more elegant idiom. Will someone enlighten me?

Replies are listed 'Best First'.
Re: scan, match and extract
by Fletch (Bishop) on May 19, 2022 at 11:26 UTC

    Perhaps named references and using $+{foo} instead? Not getting a good direct link but search for “named capture group” in perlre e.g  (?<NAME>pattern), and %+ in perlvar.

    Edit: Easier to get the direct link not on ipad: named capture groups.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: scan, match and extract
by hippo (Bishop) on May 19, 2022 at 13:05 UTC

    Here is an SSCCE showing the storage of the capture groups in an array. Then you don't need a list of numbered scalars.

    use strict; use warnings; use Test::More tests => 2; my $re_dn = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link down +}; my $re_up = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link up, +(\d+)Mbps, ([^,]+), lpa (\w+)}; my @log = ( 'May 19 13:58:01 foo: eth0: link down', 'May 19 13:58:11 foo: eth0: link up, 100Mbps, bar, lpa quux' ); for (@log) { my @matches; if (@matches = /$re_dn/) { my ( $mon, $day, $time, $interface ) = @matches; is $time, '13:58:01'; next; } if (@matches = /$re_up/) { my ( $mon, $day, $time, $interface, $rate, $duplex, $lpa ) = @ +matches; is $time, '13:58:11'; next; } }

    Increase the number of tests as you wish to confirm that all the groups are returning the correct results.


    🦛

      When coding like this, I usually omit @matches and proceed directly with assigning to the scalars in list context. I think the following code is equivalent. I guess this is just a fine style point. In any event, the need to explicitly reference $1,$2,$3,etc. is rare.

      use strict; use warnings; use Test::More tests => 2; my $re_dn = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link down +}; my $re_up = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link up, +(\d+)Mbps, ([^,]+), lpa (\w+)}; my @log = ( 'May 19 13:58:01 foo: eth0: link down', 'May 19 13:58:11 foo: eth0: link up, 100Mbps, bar, lpa quux' ); for (@log) { if (my ( $mon, $day, $time, $interface ) = /$re_dn/) { is $time, '13:58:01'; next; } if (my ( $mon, $day, $time, $interface, $rate, $duplex, $lpa ) = / +$re_up/) { is $time, '13:58:11'; next; } }
      Update:
      It could be that something other than a "one step" regex might be appropriate?. It appears that ": link down" or ": link up" are key things, and I am guessing that that there a heck of a lot of lines in this file that don't have those key words. It could be that if speed is important, do a strcmp() index() looking for either of those phrases on the line, or just ": link". If the line contains one of those strings, it could also be that a simple split on whitespace+,[\s;,] would be faster than a regex. Or at least easier to write.

      I parse a lot of printouts that are just intended for humans to read. There is kind of an "art" to that. And the result is always inherently unstable because there is no defined spec. This looks to be a lot more constrained because this appears to be some kind of standard program generated log.

      Update 2:
      In my opinion, the names $re_dn and $re_up are one step "too clever". I probably would have used $re_link_down and $re_link_up. Using some extra characters in the name costs essentially nothing in compile speed. And I guess if you have a really fancy editor or IDE with auto complete, nothing much in typing either. This is a style issue. I mention it because details like this can matter when you re-visit code that you wrote X years ago.

      use strict; use warnings; my @log = ( 'May 18 12:01:01 bar: eht1 BS', 'May 19 13:58:01 foo: eth0: link down', 'May 19 13:58:11 foo: eth0: link up, 100Mbps, bar, lpa quux' ); foreach my $line (@log) { next unless (index($line, ": link")>=0); # skips BS # this is very fast # even running twice for # link up or link down is +fast print "$line\n"; # focus on link mentioning lines here } #May 19 13:58:01 foo: eth0: link down #May 19 13:58:11 foo: eth0: link up, 100Mbps, bar, lpa quux
Re: scan, match and extract
by LanX (Saint) on May 19, 2022 at 11:46 UTC
    I would combine the already suggested named capture groups with interpolated regexes.

    my $month = qr{(?<month>\w+)}; # ...etc my $re_dn = qr{^$month (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link dow +n}; my $re_dn = qr{^$month (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link up, + (\d+)Mbps, ([^,]+), lpa (\w+)};

    also consider using the /x modifier for better readability.

    if you are concerned about DRY and error avoidance, you can also use a pattern hash and automatically wrap the names later

    # sketch, totally untested # define groups my %p = ( month => '\w+', ... ); # wrap named groups later while ( my ($k,$v) = each %p ) { $v = qr{(?<$k>)$v}; } # compose patterns my $re_dn = qr{^$p{month} $p{day} ...};

    on a side note: you can also chain elsif to avoid always writing next in every if block

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: scan, match and extract
by tybalt89 (Monsignor) on May 19, 2022 at 14:35 UTC
    my( $mon, $day, $time, $interface, $rate, $duplex, $lpa ) = @{^CAPTURE +};
Re: scan, match and extract
by kcott (Archbishop) on May 19, 2022 at 14:43 UTC

    G'day seismofish,

    Welcome to the Monastery (at least with respect to posting).

    You can use constants to avoid:

    • The need to reference capture variables; e.g. ($1,$2,$3,$4,$5,$6,$7).
    • The need to declare and define a list of variables; e.g. ( $mon, $day, $time, $interface, $rate, $duplex, $lpa ).
    • The use of named captures requiring a multitude of ?<NAME> added to your regexes.

    Here's an example:

    #!/usr/bin/env perl use strict; use warnings; use constant { MON => 0, DAY => 1, TIME => 2, INTERFACE => 3, RATE => 4, DUPLEX => 5, LPA => 6, }; use Test::More tests => 12; my $re_dn = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link down +}; my $re_up = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2}) .*: eth(\d): link up, +(\d+)Mbps, ([^,]+), lpa (\w+)}; my @logs = ( 'May 19 13:58:01 foo: eth0: link down', 'May 19 13:58:11 foo: eth0: link up, 100Mbps, bar, lpa quux', '# some comment in the logfile', ); for (@logs) { my @matches; if (@matches = /$re_dn/) { is $matches[MON], 'May'; is $matches[DAY], 19; is $matches[TIME], '13:58:01'; is $matches[INTERFACE], 0; } elsif (@matches = /$re_up/) { is $matches[MON], 'May'; is $matches[DAY], 19; is $matches[TIME], '13:58:11'; is $matches[INTERFACE], 0; is $matches[RATE], 100; is $matches[DUPLEX], 'bar'; is $matches[LPA], 'quux'; } else { is 0+@matches, 0; } }

    Output:

    1..12 ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 ok 7 ok 8 ok 9 ok 10 ok 11 ok 12

    — Ken

Re: scan, match and extract
by BillKSmith (Monsignor) on May 19, 2022 at 20:48 UTC
    Matching parts of the line separately has several advantages. No matching is duplicated. Individual regexes are shorter and extract fewer parameters. Uninteresting lines are discarded sooner. All variables are properly scoped.
    use strict; use warnings; my $time_stamp = qr{^(\w+) (\d+) (\d{2}:\d{2}:\d{2})}; while (<DATA>) { next if !(my($mon, $day, $time) = /$time_stamp/); next if !(my($interface, $updown) = $' =~ /.*:\seth(\d): link (up| +down)/); if ( $updown eq 'down' ) { #... # Process down print $time, "\n"; } else { my( $rate, $duplex, $lpa ) = $' =~ /,(\d+)Mbps, ([^,]+), lpa ( +\w+)/; next if !defined($lpa); # Invalid 'up' probably never happens. #... # Process up print $lpa, "\n"; } } __DATA__ May 03 10:15:21 some text: eth3: link down May 04 10:16:23 some text: eth3: link up,562Mbps, foos, lpa fum no time

    UPDATE: Refer to @LAST_MATCH_START and @LAST_MATCH_END for documentation on $'.

    Bill

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11143998]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (4)
As of 2024-04-24 21:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found