Regex Help pulling Data from a string

batcater98 has asked for the wisdom of the Perl Monks concerning the following question:

I have a flat file with rows and rows of data. I need to skip most of the lines in this file, but extract data from others. Below is an example of a line that I need to pull data from, along with the data I need. Does anyone have an idea of what to match on and how to pull that data?

DATA:

e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

Every such line will look like this. Only the information I need will change; the other items to key on will remain the same.

I need to key off of the following: .dat and successfully. When I find those two on the same line, I want to extract the following using a regex. Other lines should be skipped.

From the example data I would want to pull:

beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes

knowing that each of these fields will change, but the surrounding data will not. Ideas?

I could see keying off of the //'s and the -'s or ('s...


Replies are listed 'Best First'.
Re: Regex Help pulling Data from a string by chargrill (Parson) on Dec 21, 2006 at 20:56 UTC
There are probably CPAN modules lying around that can parse logfiles for you, but my first inclination would be to key off spaces (using split):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @records;

    for my $logfile_line ( <DATA> ) {
        # Skip lines that are not successful transfers recorded in a .log file.
        next unless $logfile_line =~ m/logfiles.*\.log.*successfully/;

        # Pick the interesting whitespace-separated fields by position.
        my ( $logfile, $day, $date, $time, $datfile, $size, $units ) =
            ( split( /\s/, $logfile_line ) )[ 0, 2, 3, 4, 9, 14, 15 ];

        $logfile =~ s/.*\\(\w+)\.log/$1/;   # keep only the log file basename
        $datfile =~ s/.*\\(\w+)\.dat/$1/;   # keep only the .dat file basename
        $units   =~ s/\)$//;                # drop the trailing ")"

        push @records, [ $logfile, "$day $date $time", $datfile, "$size $units" ];
    }

    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

Or something similar. Note that you're left with an AoA (array of arrays). And that this is completely untested.

Update: Tested slightly, corrected list indices and scrubbed data.

Update2: Missed the part about skipping lines that don't match. Adjust the "next unless" to suit, since I don't know what non-matching lines will/could possibly look like.

--chargrill
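As a possible way to "adjust the next unless to suit", here is a minimal sketch of the same split approach with the filter keyed on `.dat` and `successfully`, as the question asks, and with the fields joined into the comma-separated form the question shows. The helper name `fields_from_line` and the second sample line are hypothetical, and the sketch assumes that every matching line has the same field layout as the one example given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the four wanted fields for a matching line,
# or an empty list for a line that should be skipped.
sub fields_from_line {
    my ($line) = @_;

    # Skip anything that does not mention ".dat" followed by "successfully".
    return unless $line =~ /\.dat\b.*\bsuccessfully\b/;

    # Assumes the field positions of the single example line in the question.
    my ( $logfile, $day, $date, $time, $datfile, $size, $units ) =
        ( split /\s+/, $line )[ 0, 2, 3, 4, 9, 14, 15 ];

    # Reduce both paths to bare names and drop the trailing ")".
    $logfile =~ s/.*\\([^\\]+)\.log$/$1/;
    $datfile =~ s/.*\\([^\\]+)\.dat$/$1/;
    $units   =~ s/\)$//;

    return ( $logfile, "$day $date $time", $datfile, "$size $units" );
}

my @sample_lines = (
    # The example line from the question.
    'e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)',
    # A made-up line that should be skipped (not from the original post).
    'e:\logfiles\beardstownbase.log [3] some other message',
);

for my $line (@sample_lines) {
    my @fields = fields_from_line($line) or next;
    print join( ',', @fields ), "\n";
}
```

For the example line this should print `beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes`. Relying on fixed positions is fragile if any field can contain spaces, which is where a regex may be the safer choice.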
Re: Regex Help - Large regex example, and larger Parse::RecDescent attempt by imp (Priest) on Dec 22, 2006 at 06:39 UTC
The appropriate solution to this problem depends on how precise the pattern matching needs to be. How much post-extraction processing you are willing to do matters as well, e.g. do you need '58bn5904' or are you content with 'd:\data\58bn5904.dat'?

To give you an idea of how ugly the regex could become:

    use strict;
    use warnings;

    # Example line:
    # e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)
    # Desired:
    # beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes

    my $re_date = qr<
        (?:Sun|Mon|Tue|Wed|Thu|Fri|Sat)
        \s
        \d{1,2}     # Day of month
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) # Month
        \d{2}       # Two digit year
        \s
        \d{2}:\d{2}:\d{2}
    >x;

    my $pattern = qr<
        e:\\logfiles\\(.*?)     # Capture(1) filename
        \s
        \[\d+\]                 # Bracketed number
        \s
        ($re_date)              # Capture(2) date
        \s - \s
        \(\d+\)                 # number in parens
        \s Sent \s file \s
        d:\\data\\(.*?)\.dat    # Capture(3) file basename
        \s successfully \s
        \(
        [0-9.]+ \s [A-Z]b /sec
        [ ] - [ ]
        (\d+ \s bytes)          # Capture(4) bytes text
        \)
        $
    >x;

    while (my $line = <DATA>) {
        if ($line =~ /$pattern/) {
            my ($logfile, $date, $file_basename, $bytes) = ($1,$2,$3,$4);
            printf "(%s) (%s) (%s) (%s)\n", $logfile, $date, $file_basename, $bytes;
        }
    }
    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)

I have been meaning to learn Parse::RecDescent for ages, so tonight I took some time to try and solve your problem with it. It is likely the wrong tool for this job, and definitely a poor implementation. I would welcome any feedback from people with stronger parse-fu.
    use strict;
    use warnings;
    use Parse::RecDescent;
    $::RD_HINT=5;

    my $grammar = <<'GRAMMAR';
    {
        use strict;
        use warnings;
    }

    logfile : 'e:\\logfiles\\' /[-A-Za-z0-9_.]+/
        { $item[2] }

    date : m{
            (?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)
            \s
            \d\d
            (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
            \d\d
        }x

    time : /\d{2}:\d{2}:\d{2}/

    sentfile: <skip:''> 'd:\\data\\' /[-A-Za-z0-9_]+/ '.dat'
        { $item[3] }

    rate : /\d+\.\d [A-Za-z]+\/sec/

    bytecount : /\d+ bytes/

    parse : logfile /\[\d+\]/ date time /- \(\d+\) Sent file / sentfile
            <skip:'[- \t()]*'> ( /successfully/ rate ) bytecount
        { [ @item{qw(logfile date time sentfile bytecount)}] }
    GRAMMAR

    # Expect: beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes
    my $parser = Parse::RecDescent->new($grammar);

    use Data::Dumper;
    while (my $line = <DATA>) {
        last unless $line =~ /\S/;
        my @fields = $parser->parse($line);
        if (@fields) {
            print Dumper \@fields;
        }
    }
    __DATA__
    e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/se

Output:

    $VAR1 = [
              [
                'beardstownbase.log',
                'Thu 22Jun06',
                '08:07:19',
                '58bn5904',
                '859216 bytes'
              ]
            ];
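At the other end of the precision scale discussed in this reply, a looser regex can anchor only on the fixed words and punctuation and use named captures to produce the comma-separated line directly. This is a rough sketch, not a replacement for the stricter pattern: the names `$loose` and `parse_sent_line` are hypothetical, it assumes Perl 5.10 or later for `(?<name>...)` and `%+`, and it will accept lines that the strict pattern would reject.

```perl
use strict;
use warnings;

# Loose pattern: relies only on ".log", "Sent file", ".dat successfully"
# and "<digits> bytes)" appearing in that order.
my $loose = qr{
    \\ (?<logfile> [^\\\s]+ ) \.log \s+                    # log file basename
    \[ \d+ \] \s+                                          # bracketed number
    (?<stamp> \w{3} \s \d{1,2} \w{3} \d{2} \s [\d:]{8} )   # day, date and time
    .*? \b Sent \s file \s+
    \S* \\ (?<datfile> [^\\\s]+ ) \.dat \s+ successfully \b
    .*? (?<size> \d+ \s bytes ) \)                         # byte count before ")"
}x;

# Hypothetical helper: returns the comma-separated record, or undef
# (an empty list in list context) when the line should be skipped.
sub parse_sent_line {
    my ($line) = @_;
    return unless $line =~ $loose;
    return join ',', @+{qw(logfile stamp datfile size)};
}

# The example line from the question.
my $example = 'e:\logfiles\beardstownbase.log [3] Thu 22Jun06 08:07:19 - (006415) Sent file d:\data\58bn5904.dat successfully (25.0 Kb/sec - 859216 bytes)';

my $record = parse_sent_line($example);
print "$record\n" if defined $record;
```

For the example line this should print `beardstownbase,Thu 22Jun06 08:07:19,58bn5904,859216 bytes`. The trade-off is the one described above: the looser the pattern, the less it tells you when the log format drifts.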