in reply to perl regex extraction from a large input file

Although PerlMonks is not a code-writing service, sometimes it is simply easier to explain what you have to do by writing the code, especially when the requirements are not entirely clear.

I understand that you have one large file, with data in a certain format, which you have to parse in order to produce some kind of summarized results, i.e. the number of passed and failed "Case-Url" and "Req-URL" items.

Every time you hear "large file", you should think of reading/parsing/handling the file on a record-by-record basis. That will minimize your memory requirements.

It also means that you have to determine the record format, or more specifically, the record delimiter. Sometimes the record delimiter is as simple as a CR/LF; sometimes it is longer. In this case it is "__________________________________________________________\n"

What you have to do is read the file record by record. Fortunately, Perl can do that easily: all you have to do is tell Perl what the record delimiter is by assigning it to the $/ (input record separator) variable.

Then you can read the file one record at a time and, through the magic of regular expressions, extract the data you need and update the variables that count what was found.

The following is one of the ways to do this:

use Modern::Perl;
use Data::Dump qw/dump/;

local $/ = "__________________________________________________________\n";

my %results;
while ( my $record = <DATA> ) {
    my $pass = $record =~ m/\Q***Passed***\E/ ? 'Passed' : 'Failed';
    for my $line ( split /\n/, $record ) {
        next unless $line =~ /^\[/;
        my ( $case_req, $url ) = split /\s+-\s+/, $line;
        $results{$pass}{$case_req}{$url}++;
    }
}

say dump(%results);

__DATA__
Execution start time 09/13/2013 02:43:55 pm
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Passed***
__________________________________________________________
[Case-Url] - www.yahoo.com
[Req-URL ] - www.msn.com
***Passed***
__________________________________________________________
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Failed***
Output:
(
  "Passed",
  {
    "[Case-Url]" => { "www.google.com" => 1, "www.yahoo.com" => 1 },
    "[Req-URL ]" => { "www.msn.com" => 1, "www.qtp.com" => 1 },
  },
  "Failed",
  {
    "[Case-Url]" => { "www.google.com" => 1 },
    "[Req-URL ]" => { "www.qtp.com" => 1 },
  },
)

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Replies are listed 'Best First'.
Re^2: perl regex extraction from a large input file
by hdb (Monsignor) on Sep 14, 2013 at 09:51 UTC

    Storing the "Case-Url" and the "Req-URL" values in separate hashes loses the link between the two. I don't know whether that is important or not. A rather simple approach should do in this case:

    while (<DATA>) {
        print "$1 - " if /\[.*\] - (.*)/i;
        print "$1\n"  if /[*]+([^*]+)/;
    }
      Yes indeed. That is the problem with "fuzzy" requirements: there may (or may not) be certain data elements that need to be taken into account.

      CountZero


        You could store the data in an array rather than a hash; the only change is to the one line:

        push @{$results{$pass}{$case_req}}, $url;