Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I have a data file with the following format:

    3642 01:19:55 01-Jan-2001 134 51909 51750 1.509667E-07

Now to extract the time, date and last column, I thought I'd use:

    while (<DATA>) {
        /^(\d+) (\d\d:\d\d:\d\d) (\d\d-\D+-\d{4}) (\d+) (\d+\s+\d+) (\d.\d(E[+-])\d+) /;
        print "$2 $3 $6\n";
    }

But I can't get the regex to match the "E-\d\d" term!!! Please help! BTW, is this a more efficient means to get data, or is split better?

Regards, Stacy.

Replies are listed 'Best First'.
Re: Regular expression
by Masem (Monsignor) on May 01, 2001 at 17:30 UTC
    Make it easier on yourself, and use split to grab each item, since you already have the spaces there:
    my ( $id, $time, $date, $col1, $col2, $col3, $col4 ) = split / /, $line;
    Then you can concentrate on any other checks that you want to do to make sure the number is valid. Note that in your expression to get the E number, you are using '.', which you need to escape if you want to match a decimal point; otherwise it will simply match any character. Try something like: \d\.\d*E[-+]\d\d (particularly if this is coming from Fortran or C output code).
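A minimal sketch of the split-then-validate approach described above, using the sample record from the question. The variable names and the `^`/`$` anchors are assumptions added for illustration; the pattern itself is the one suggested in this reply.

```perl
use strict;
use warnings;

# Sample record from the question; the field names are illustrative only.
my $line = "3642 01:19:55 01-Jan-2001 134 51909 51750 1.509667E-07";

# split ' ' splits on runs of whitespace and skips leading whitespace,
# so it also copes with columns padded by more than one space.
my ($id, $time, $date, $col1, $col2, $col3, $value) = split ' ', $line;

# Check that the last field looks like exponent notation.
# The decimal point is escaped, and \d* allows any number of digits after it.
my $looks_valid = defined $value && $value =~ /^\d\.\d*E[-+]\d\d$/;

print "$time $date $value\n" if $looks_valid;
```

With the sample record this prints the time, the date and the last column. A record whose last field fails the check is silently skipped, which may or may not be what you want.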


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Regular expression
by Anonymous Monk on May 01, 2001 at 17:40 UTC
    I was asking about efficiency between split and using a regex. Here is the stuff:

    Using split:

        while (<DATA>) {
            ($time, $date, $data1, $data2) = (split)[1,2,6,7];
            print "$time $date $data1 $data2", "\n";
        }

    the code took: 26 wallclock secs ( 2.17 usr + 0.55 sys = 2.72 CPU)

    Using a regex:

        while (<DATA>) {
            /^(\d+) (\d\d:\d\d:\d\d) (\d\d-\w+-\d{4}) (\d+) (\d+\s+\d+) (\d.\d*E[+-]\d+) (\d.\d*E[+-]\d+)/;
            print "$2 $3 $6 $7\n";
        }

    the code took: 25 wallclock secs ( 1.98 usr + 0.73 sys = 2.71 CPU)

    Much of a muchness really.

    Regards, Stacy.
      Hmmm... If I print to a file instead of STDOUT, I get the code down to 13-14 seconds for both methods ... Regards, Stacy.
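The timings above include reading and printing, which seems to dominate (printing to a file halved the run time for both methods). One way to compare only the extraction step is the core Benchmark module. The following is a rough sketch, not the code that produced the numbers above: the sample line with two exponent columns and the sub names `by_split` and `by_regex` are made up for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical record with two trailing exponent columns, to match
# the (split)[1,2,6,7] slice used in the timing test.
my $line = "3642 01:19:55 01-Jan-2001 134 51909 51750 1.509667E-07 2.250000E-03";

# Extract time, date and the last two columns with split.
sub by_split {
    my ($rec) = @_;
    my @fields = split ' ', $rec;
    return @fields[1, 2, 6, 7];
}

# Extract the same four fields with a single anchored regex.
# In list context a successful match returns the captured groups.
sub by_regex {
    my ($rec) = @_;
    return $rec =~ /^\d+\s+(\S+)\s+(\S+)\s+\d+\s+\d+\s+\d+\s+(\S+)\s+(\S+)/;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    split => sub { my @f = by_split($line) },
    regex => sub { my @f = by_regex($line) },
});
```

The resulting rates depend on the machine and the Perl version, so treat any difference as indicative only.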
Re: Regular expression
by suaveant (Parson) on May 01, 2001 at 17:47 UTC
    No one really told you why your query for the last number failed, but Masem sorta did... The reason it failed was that \d.\d matches one digit, then any one character, then one digit. As Masem said, you need \. to match a literal . instead of any one character, but you also need \d* or \d+ after the \. to match more than one digit. You were only matching the 5 after the . instead of 509667, so the regex couldn't match the E.
                    - Ant
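To see the failure described above in isolation, here is a small sketch that tests just the last column. It reduces the original capture group to its essentials, and the `^` anchor stands in for the preceding fields that pin the match position in the full regex; both are simplifications made for illustration.

```perl
use strict;
use warnings;

my $value = "1.509667E-07";

# Original idea: one digit, any character, one digit, then E.
# After "1.5" the next character is "0", not "E", so this fails.
my $original = $value =~ /^\d.\dE[+-]\d+$/ ? "match" : "no match";

# Fixed: escaped decimal point, and \d+ to consume all digits after it.
my $fixed = $value =~ /^\d\.\d+E[+-]\d+$/ ? "match" : "no match";

print "original: $original\n";   # original: no match
print "fixed:    $fixed\n";      # fixed:    match
```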
      So what if there is a '-' in front of the data in the last column:
      /^(\d+) (\d\d:\d\d:\d\d) (\d\d-\w+-\d{4}) (\d+) (\d+\s+\d+) ([-]\d+\.\d+E[-+]\d+)/;
      It doesn't work...
        -? allows for one or no -

        ? is one or none, * is zero or more, + is 1 or more
                        - Ant
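A short sketch of the optional-sign fix: `[-]` on its own demands exactly one minus, while `-?` accepts one or none. The test values and the `$number` variable are made up for illustration.

```perl
use strict;
use warnings;

# -? makes the leading minus optional.
my $number = qr/-?\d+\.\d+E[-+]\d+/;

# Hypothetical last-column values, with and without a leading minus.
for my $value ("1.509667E-07", "-1.509667E-07") {
    my $result = $value =~ /^$number$/ ? "match" : "no match";
    print "$value: $result\n";
}
```

Both values match. Dropping the pattern into the full regex in place of the last capture group should then handle either sign.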

Re: Regular expression
by DeaconBlues (Monk) on May 01, 2001 at 19:55 UTC

    Depending on the context of parsing the log, you might find it easier to use AWK!! Woohoo!

    Something like:

    { print $2"|"$3"|"$7 }

    You would run something like this

    awk '{print $2"|"$3"|"$7}' web.log | perl parselog.pl

    Then your perl would be something like this

    while (<>) {
        chomp;
        my ($time, $date, $expo) = split /\|/;
        print "$time, $date, $expo\n";
    }

    I have recently started using AWK to parse through delimited files. It's nice. Sorry about suggesting a non-perl solution. :-) I think it might be *NIX only too, but I am not sure.

Re: Regular expression
by le (Friar) on May 01, 2001 at 17:34 UTC
    Maybe this will work:
    while (<DATA>) {
        print "$1 $2 $3\n" if /^\S+\s+(\S+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s+(\S+)$/;
    }
    Might not be too efficient.
Re: Regular expression
by converter (Priest) on May 01, 2001 at 19:07 UTC
    If all your records have the same format, split is probably the way to go:
    while (<DATA>) { (undef, $time, $date, undef, undef, undef, $num) = split; }