batcater98 has asked for the wisdom of the Perl Monks concerning the following question:

I have a flat file with many rows of data. I need to skip most of the lines, but extract data from the others. Below is an example of a line that I need to pull data from, along with the data I need. Does anyone have an idea of what to match on and how to pull that data?

DATA:

e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

Every line of interest looks like this. Only the information I need will change; the other items I can key from will remain the same.

I need to key off of .dat and successfully: when I find those two on the same line, I want to extract the fields below using a regex. Other lines should be skipped.

From the example data I would want to pull:

beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes

Each of these fields will change, but the surrounding data will not. Ideas?

I could see keying off of the //'s and the -' or ('s...

Replies are listed 'Best First'.
Re: Regex Help pulling Data from a string
by chargrill (Parson) on Dec 21, 2006 at 20:56 UTC

    There are probably CPAN modules lying around that can parse logfiles for you, but my first inclination would be to key off spaces (using split):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @records;

    for my $logfile_line ( <DATA> ) {
        # Skip anything that is not a successful transfer line
        next unless $logfile_line =~ m/logfiles.*\.log.*successfully/;

        # Split on whitespace and pick the fields of interest by position
        my ( $logfile, $day, $date, $time, $datfile, $size, $units ) =
            ( split( /\s/, $logfile_line ) )[ 0, 2, 3, 4, 9, 14, 15 ];

        $logfile =~ s/.*\\(\w+)\.log/$1/;   # strip path and .log extension
        $datfile =~ s/.*\\(\w+)\.dat/$1/;   # strip path and .dat extension
        $units   =~ s/\)$//;                # drop the trailing paren

        push @records, [ $logfile, "$day $date $time", $datfile, "$size $units" ];
    }

    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

    Or something similar. Note that you're left with an AoA (array of arrays), and that this is completely untested.

    Update: Tested slightly, corrected list indices and scrubbed data.

    Update2: Missed the part about skipping lines that don't match. Adjust the "next unless" to suit, since I don't know what non-matching lines will/could possibly look like.
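
    As a small follow-on, here is a sketch of turning that array of arrays into the comma-separated lines the question asks for. The sample record is hard-coded for illustration, and record_to_line is a made-up helper name, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical sample of the array of arrays built by the split approach.
my @records = (
    [ 'beardstownbase', 'Thu 22Jun06 08:07:19', '58bn5904', '859216 bytes' ],
);

# Turn one record (an array reference) into a comma-separated line.
sub record_to_line {
    my ($record) = @_;
    return join ',', @$record;
}

print record_to_line($_), "\n" for @records;
```

    A plain join is enough here because none of the fields can contain a comma; if that ever changes, a CSV module would be the safer choice.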



    --chargrill
    s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
Re: Regex Help - Large regex example, and larger Parse::RecDescent attempt
by imp (Priest) on Dec 22, 2006 at 06:39 UTC
    The appropriate solution to this problem depends on how precise the pattern matching needs to be. How much post-extraction processing you are willing to do matters as well, e.g. do you need '58bn5904' or are you content with 'd:\data\58bn5904.dat'.
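
    To illustrate the trade-off, here is a rough sketch of the "loose match, then clean up" end of the spectrum. The sub names extract_dat_path and dat_basename are invented for this example, and the pattern assumes the path never contains whitespace.

```perl
use strict;
use warnings;

# Loose pattern: only require "Sent file <path>.dat successfully"
# and capture the whole path as-is.
sub extract_dat_path {
    my ($line) = @_;
    return $line =~ /Sent file (\S+\.dat) successfully/ ? $1 : undef;
}

# Post-extraction step: reduce the full path to the bare file name.
sub dat_basename {
    my ($path) = @_;
    (my $name = $path) =~ s/^.*\\//;   # drop everything up to the last backslash
    $name =~ s/\.dat$//;               # drop the extension
    return $name;
}

my $line = 'e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)';

my $path = extract_dat_path($line);
print "$path\n";                   # full path, no post-processing
print dat_basename($path), "\n";   # bare name after post-processing
```

    The loose pattern is easier to read but will accept lines a stricter pattern would reject, so which one fits depends on how much you trust the log format.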

    To give you an idea of how ugly the regex could become:


    use strict;
    use warnings;

    # Example line:
    # e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)
    # Desired:
    # beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes

    my $re_date = qr<
        (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)
        \s
        \d{1,2}                                               # Day of month
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)   # Month
        \d{2}                                                 # Two digit year
        \s
        \d{2}:\d{2}:\d{2}
    >x;

    my $pattern = qr<
        e:\\logfiles\\(.*?)\.log    # Capture(1) log file basename
        \s
        \[\d+\]                     # Bracketed number
        \s
        ($re_date)                  # Capture(2) date
        \s
        -
        \s
        \(\d+\)                     # number in parens
        \s
        Sent \s file
        \s
        d:\\data\\(.*?)\.dat        # Capture(3) file basename
        \s
        successfully
        \s
        \(
        [0-9.]+ \s [A-Z]b /sec
        [ ] - [ ]
        (\d+ \s bytes)              # Capture(4) bytes text
        \)
    >x;

    while (my $line = <DATA>) {
        if ($line =~ /$pattern/) {
            my ($logfile, $date, $file_basename, $bytes) = ($1, $2, $3, $4);
            printf "(%s) (%s) (%s) (%s)\n", $logfile, $date, $file_basename, $bytes;
        }
    }

    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

    I have been meaning to learn Parse::RecDescent for ages, so tonight I took some time to try and solve your problem with it. It is likely the wrong tool for this job, and definitely a poor implementation - I would welcome any feedback from people with stronger parse-fu.

    use strict;
    use warnings;
    use Parse::RecDescent;
    $::RD_HINT=5;

    my $grammar = <<'GRAMMAR';
    {
        use strict;
        use warnings;
    }

    logfile : 'e:\\logfiles\\' /[-A-Za-z0-9_.]+/
        { $item[2] }

    date : m{
        (?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)
        \s
        \d\d
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
        \d\d
        }x

    time : /\d{2}:\d{2}:\d{2}/

    sentfile: <skip:''> 'd:\\data\\' /[-A-Za-z0-9_]+/ '.dat'
        { $item[3] }

    rate : /\d+\.\d [A-Za-z]+\/sec/

    bytecount : /\d+ bytes/

    parse : logfile /\[\d+\]/ date time /- \(\d+\) Sent file / sentfile
            <skip:'[- \t()]*'> ( /successfully/ rate ) bytecount
        { [ @item{qw(logfile date time sentfile bytecount)}] }
    GRAMMAR

    # Expect: beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes

    my $parser = Parse::RecDescent->new($grammar);

    use Data::Dumper;
    while (my $line = <DATA>) {
        last unless $line =~ /\S/;
        my @fields = $parser->parse($line);
        if (@fields) {
            print Dumper \@fields;
        }
    }

    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/se

    Output:

    $VAR1 = [
              [
                'beardstownbase.log',
                'Thu 22Jun06',
                '08:07:19',
                '58bn5904',
                '859216 bytes'
              ]
            ];
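
    The dumped structure is one array reference sitting inside @fields, with the '.log' suffix still attached and the date and time as separate items. A possible way to get from there to the requested line is sketched below; fields_to_line is a hypothetical helper, and the data is copied by hand from the output above rather than produced by the parser.

```perl
use strict;
use warnings;

# Hand-copied from the Dumper output: parse() returned one array
# reference, which ended up as the only element of @fields.
my @fields = (
    [ 'beardstownbase.log', 'Thu 22Jun06', '08:07:19', '58bn5904', '859216 bytes' ],
);

# Build the comma-separated line from one parse result.
sub fields_to_line {
    my ($result) = @_;
    my ( $logfile, $date, $time, $sentfile, $bytecount ) = @$result;
    $logfile =~ s/\.log$//;    # works on a copy, the input is untouched
    return join ',', $logfile, "$date $time", $sentfile, $bytecount;
}

print fields_to_line( $fields[0] ), "\n";
```

    Assigning the result of parse() to a scalar instead of an array would avoid the extra level of nesting in the first place.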