Although PerlMonks is not a code-writing service, sometimes it is easier to explain what you have to do by simply writing the code, especially when your requirements are not entirely clear.

As I understand it, you have one large file with data in a certain format, which you have to parse in order to get some kind of summarized results, i.e. the number of passed and failed "Case-URL" and "Req-URL" items.

Every time you hear "large file" you should think of reading, parsing and handling the file on a record-by-record basis. That keeps only one record in memory at a time and so minimizes your memory requirements.

It also means that you have to determine the record format and, in particular, the record delimiter. Sometimes the record delimiter is as simple as a newline (LF or CR/LF); sometimes it is longer. In this case it is "__________________________________________________________\n".

What you have to do is read the file record by record. Fortunately Perl makes that easy: all you have to do is tell Perl what the record delimiter is by assigning it to the $/ variable (the input record separator).
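
As a minimal sketch of how $/ changes what the readline operator returns: the sample text, the "----" delimiter and the variable names below are made up for illustration, and an in-memory filehandle stands in for your real file. It uses only core Perl.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample data; an in-memory filehandle stands in for the real file.
my $text = "first record\n----\nsecond record\n----\nthird record\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
{
    # Limit the change to $/ to this block so other reads are unaffected.
    local $/ = "----\n";
    while ( my $record = <$fh> ) {
        chomp $record;                   # chomp removes the current value of $/
        $record =~ s/\A\s+|\s+\z//g;     # trim leftover surrounding whitespace
        push @records, $record;
    }
}
close $fh;

say scalar(@records), ' records read';
say "[$_]" for @records;
```

Note that chomp strips whatever $/ currently holds, so it removes the whole delimiter line rather than just a trailing newline.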

Then you can read the file one record at a time, use regular expressions to extract the data you need, and update your counters with what you found.

The following is one of the ways to do this:

#
use Modern::Perl;
use Data::Dump qw/dump/;

local $/ = "__________________________________________________________\n";

my %results;

while ( my $record = <DATA> ) {
    my $pass = $record =~ m/\Q***Passed***\E/ ? 'Passed' : 'Failed';
    for my $line ( split /\n/, $record ) {
        next unless $line =~ /^\[/;
        my ( $case_req, $url ) = split /\s+-\s+/, $line;
        $results{$pass}{$case_req}{$url}++;
    }
}
say dump(%results);

__DATA__
Execution start time 09/13/2013 02:43:55 pm

[Case-Url] - www.google.com


[Req-URL ] - www.qtp.com


***Passed***
__________________________________________________________

[Case-Url] - www.yahoo.com


[Req-URL ] - www.msn.com


***Passed***

__________________________________________________________

[Case-Url] - www.google.com


[Req-URL ] - www.qtp.com


***Failed***

Output:

(
  "Passed",
  {
    "[Case-Url]" => { "www.google.com" => 1, "www.yahoo.com" => 1 },
    "[Req-URL ]" => { "www.msn.com" => 1, "www.qtp.com" => 1 },
  },
  "Failed",
  {
    "[Case-Url]" => { "www.google.com" => 1 },
    "[Req-URL ]" => { "www.qtp.com" => 1 },
  },
)
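
If what you need in the end is just the number of passed and failed items per label, the nested hash can be collapsed into totals. The following is only a sketch: it starts from a hash literal copied from the output shown above, and %totals is a name I made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Same structure as the output shown above: status => label => URL => count.
my %results = (
    Passed => {
        '[Case-Url]' => { 'www.google.com' => 1, 'www.yahoo.com' => 1 },
        '[Req-URL ]' => { 'www.msn.com'    => 1, 'www.qtp.com'   => 1 },
    },
    Failed => {
        '[Case-Url]' => { 'www.google.com' => 1 },
        '[Req-URL ]' => { 'www.qtp.com'    => 1 },
    },
);

# %totals is a hypothetical helper hash: status => label => total number of items.
my %totals;
for my $status ( sort keys %results ) {
    for my $label ( sort keys %{ $results{$status} } ) {
        my $sum = 0;
        $sum += $_ for values %{ $results{$status}{$label} };    # add up per-URL counts
        $totals{$status}{$label} = $sum;
        say "$status $label: $sum";
    }
}
```

With the sample data this reports two passed items and one failed item for each of the two labels.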

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

In reply to Re: perl regex extraction from a large input file by CountZero
in thread perl regex extraction from a large input file by Anonymous Monk
