Parsing text sections

betacentauri has asked for the wisdom of the Perl Monks concerning the following question:

O Ye Monks, I plea to be enlighten'd

I need to parse a set of files written by report programs. Each file has some sections delimited by before- and after- lines. Every section's contents will receive a different treatment. An example is given below. What I want to have is a *very* simple method to extend the parser, so as to account for future file formats. While future formats may have different section selector strategies, I want to have this pretty general case set down first.

----------------------------------------------------------------------
+-----
@--- Callsites: 2 ----------------------------------------------------
+-----
----------------------------------------------------------------------
+-----
 ID Lev File/Address        Line Parent_Funct             MPI_Call
  1   0 0x8048ad5                [unknown]                Reduce
  2   0 0x8048a3b                [unknown]                Bcast
----------------------------------------------------------------------
+-----
@--- Aggregate Time (top twenty, descending, milliseconds) -----------
+-----
----------------------------------------------------------------------
+-----
Call                 Site       Time    App%    MPI%     COV
Bcast                   2        9.5   24.71   65.75    0.59
Reduce                  1       4.95   12.87   34.25    1.35
----------------------------------------------------------------------
+-----
[download]

What I have devised up to now is something like the attached. That should parse the presented sample. Adding parsers should be a matter of writing proper RE for start and end lines, and then deciding what to do with each line belonging to the section. This parsing template would go in some config file or whatever.

As you can see, delimiter lines must not enter into process(). While my foreach loop may not be the most clever thing, it sorta works for me (suggestions gladly accepted).

#!/usr/bin/perl -w

package Parser;
use strict;

sub new {
        bless {};
}

sub add {
    my $self = shift;
    my $hash = shift;
    push @{$self->{parsers}}, $hash;
}

sub parse {
     my $self = shift;
    my $file = shift;
    open FILE,'<',$file || die "Can't open $file";
    
    while(<FILE>) {
        foreach my $p (@{$self->{parsers}}) {
            if(/$p->{start}/ ... /$p->{end}/) {
                 $p->{process}($_) unless 
                    (/$p->{start}/ || /$p->{end}/);
            }
        }
    }
    close FILE;
}

1;

sub parse1 {
    my $text = shift;
    print "Into parse1\n";
    print $text;
}

sub parse2 {
    my $text = shift;
    print "Into parse2\n";
    print $text;
}

my $p = Parser::new();

$p->add({ start => "^ ID Lev File/Address",
    end => "---",
    process => \&parse1});

$p->add({ start => "^Call *Site *Time *App",
    end => "---",
    process => \&parse2});

$p->parse($file);
[download]

Now, this works OK when only one parser is add()ed. However, with two parsers, I get the following output for the sample enclosed:

Into parse2
 ID Lev File/Address        Line Parent_Funct             MPI_Call
Into parse1
  1   0 0x8048ad5                [unknown]                Reduce
Into parse2
  1   0 0x8048ad5                [unknown]                Reduce
Into parse1
  2   0 0x8048a3b                [unknown]                Bcast
Into parse2
  2   0 0x8048a3b                [unknown]                Bcast
Into parse1
Bcast                   2         20   35.85   64.45    0.76
Into parse2
Bcast                   2         20   35.85   64.45    0.76
Into parse1
Reduce                  1         11   19.78   35.55    1.31
Into parse2
Reduce                  1         11   19.78   35.55    1.31
[download]

So:

This is obviously not the proper way to do it. Can somebody point out for me why I get every line processed by both subs?
Feel free to correct or comment on any enhancements to logic or performance issues
Is there a Perl module doing what I need (I bet there must be yet I wasn't able to find it)?

Thank you very much in advance

Comment on Parsing text sections Select or Download Code

Replies are listed 'Best First'.
Re: Parsing text sections by GrandFather (Saint) on Jun 27, 2010 at 12:24 UTC
If I understand correctly what you are trying to achieve then the following may help: use strict; use warnings; package Parser; my %parsers; sub new { my ($class) = @_; return bless {}, $class; } sub parse { my ($self, $file) = @_; local $/ = "------------------------------------------------------ +---------------------\n\@---"; while (my $record = <$file>) { chomp $record; $record =~ s/-+$//; next if ! length $record; next if $record !~ s/^\s*(\w+)([^-]+)[-\n]+//; my ($type, $tail) = ($1, $2); die "Can't deal with $type record. Header is '$type$tail'\n" if ! exists $parsers{$type}; $parsers{$type}->($self, $record); } } sub registerParser { my ($type, $parser) = @_; $parsers{$type} = $parser; } package Callsites; use parent -norequire, 'ParserBase'; Parser::registerParser (__PACKAGE__, \&Callsites::parser); sub parser { my ($host, $record) = @_; print __PACKAGE__, " parser processing:\n$record\n"; } package Aggregate; Parser::registerParser (__PACKAGE__, \&Aggregate::parser); sub parser { my ($host, $record) = @_; print __PACKAGE__, " parser is processing:\n$record\n"; } package main; my $data = <<'D'; ---------------------------------------------------------------------- +----- @--- Callsites: 2 ---------------------------------------------------- +----- ---------------------------------------------------------------------- +----- ID Lev File/Address Line Parent_Funct MPI_Call 1 0 0x8048ad5 [unknown] Reduce 2 0 0x8048a3b [unknown] Bcast ---------------------------------------------------------------------- +----- @--- Aggregate Time (top twenty, descending, milliseconds) ----------- +----- ---------------------------------------------------------------------- +----- Call Site Time App% MPI% COV Bcast 2 9.5 24.71 65.75 0.59 Reduce 1 4.95 12.87 34.25 1.35 ---------------------------------------------------------------------- +----- D my $parser = Parser->new (); open my $inFile, '<', \$data; $parser->parse ($inFile); close $inFile; [download] Prints: `Callsites parser processing: ID Lev File/Address Line Parent_Funct MPI_Call 1 0 0x8048ad5 [unknown] Reduce 2 0 0x8048a3b [unknown] Bcast Aggregate parser is processing: Call Site Time App% MPI% COV Bcast 2 9.5 24.71 65.75 0.59 Reduce 1 4.95 12.87 34.25 1.35` [download] Note the use of perlvar $/ to break the input into records. I've assumed that the record type can be inferred from the first word of the record header. If that's not the case you'll have to generate a unique parser identifier for each record type in some other fashion. The important element however is for the parser class to use a lookup hash for the parser types and for each parser type to register itself. Of course the implementations for the different parser types could be in separate modules - I've lumped em all together in one file to make the sample code more cohesive. Passing $host into the parser implementation subs allows parsed information to be passed back in some fashion. I'm a little unclear about what you want to do with the parsed data,but I'm sure you can figure that out. True laziness is hard work	[reply] [d/l] [select]
Re: Parsing text sections by jwkrahn (Abbot) on Jun 27, 2010 at 14:11 UTC
`sub parse { my $self = shift; my $file = shift; open FILE,'<',$file \|\| die "Can't open $file";` [download] Your program will only die if the file name is omitted when you call `parse()` or if the file name is either `""` or `"0"` because of the high precedence of the `\|\|` operator. You need to either use the low precedence `or` operator: `open FILE,'<',$file or die "Can't open $file";` [download] Or use parentheses: `open( FILE,'<',$file ) \|\| die "Can't open $file";` [download] You should also include the `$!` variable in your error message so you know why the program failed: `open FILE,'<',$file or die "Can't open $file: $!";` [download]	[reply] [d/l] [select]
Re: Parsing text sections by toolic (Bishop) on Jun 27, 2010 at 14:08 UTC
Grandfather has provided a specific solution. I do not know if CPAN offers a solution because this does not look like a standard format. Consider this radically different approach. Store your data in a standard format, such as XML, then: parse the data using an XML parser module from CPAN (such as XML::Twig) generate reports from the parsed data, using whatever custom format you need	[reply]
Re: Parsing text sections by Marshall (Canon) on Jun 28, 2010 at 13:37 UTC
Here is yet another way of doing the parsing... #!/usr/bin/perl -w use strict; while (<DATA>) { next unless /^@/; my $header = ($_ =~ m/([\w:\s]+)/)[0]; #tweak as necessary $header =~ s/\s$//; $header =~ s/^\s//; my @lines = get_data_4_header($header); print " lines for \"$header\" \n"; print @lines; print "\n"; } sub get_data_4_header { my $header = shift; #not used here, but could be if needed.. my @result=(); <DATA>; #throw away a line (i.e., the next --- line) while ( ($_ = <DATA>) !~ m/^-/) { push @result,$_; } return @result; } =Prints: lines for "Callsites: 2" ID Lev File/Address Line Parent_Funct MPI_Call 1 0 0x8048ad5 [unknown] Reduce 2 0 0x8048a3b [unknown] Bcast lines for "Aggregate Time" Call Site Time App% MPI% COV Bcast 2 9.5 24.71 65.75 0.59 Reduce 1 4.95 12.87 34.25 1.35 =cut __DATA__ ------------------------------------------------------------------- @--- Callsites: 2 ------------------------------------------------- ------------------------------------------------------------------ ID Lev File/Address Line Parent_Funct MPI_Call 1 0 0x8048ad5 [unknown] Reduce 2 0 0x8048a3b [unknown] Bcast ------------------------------------------------------------------ @--- Aggregate Time (top twenty, descending, milliseconds) --- ----------------------------------------------------------------- Call Site Time App% MPI% COV Bcast 2 9.5 24.71 65.75 0.59 Reduce 1 4.95 12.87 34.25 1.35 ---------------------------------------------------------------- [download]	[reply] [d/l]
Re^2: Parsing text sections by betacentauri (Sexton) on Jul 04, 2010 at 11:14 UTC
Dear Monks, Thank you very much for your useful responses. One marginal question: in the interest of terseness, is there a preferred, canonical way to perform the extraction of a set of lines, omitting delimiter lines such as I am intending to do? Just reading them and then throwing them away doesn't look too perlish. I see recipes in online books but not this particular case.	[reply]
Re^3: Parsing text sections by Marshall (Canon) on Jul 05, 2010 at 10:32 UTC
Just reading them and then throwing them away doesn't look too perlish. The data file is stored on disk as just a linear sequence of bytes. The disk/file system doesn't know anything about "skip the first 2 lines" or "skip blank lines" or "skip delimiter lines". All it knows about is reading bytes sequentially from the file. So one way or the other, the header lines and \n indicating blank lines and the delimiters have to be read from the disk. It is not possible to "not read the delimiter lines". Somebody has to decide which lines to "throw away" and that somebody is the user. The only question is what kind of technique and/or Perl module that the user wants to use. There isn't a single "right" answer. That's why you got a couple of responses with different ways. There are some, what I consider "less good" ways which weren't offered as possibilities. The basic job is to decide whether you are inside the record or not? This means that there has to be some state information to know that a new record has started and when that record has ended. One way is like I showed Re: Parsing text sections, call a subroutine when the record starts and have that subroutine finish reading the record. The fact that you are in the subroutine means that a record has started. A flag like "INSIDE_RECORD?", true/false is not needed as it is implicit by the fact of being in the "finish the record" subroutine. This is a common coding pattern for this task and would be seen in other languages like C. I didn't show the code for calling the sub-parser, but obviously you would call that based upon what I called the "header" (the record type info from @ line). BTW, it wasn't needed here, but if what "ends the record" is the start of a new record, instead of "unreading" that line in various ways, another way is to set a "noread" flag: while ($noread && ($line=<IN>)). This keeps $line for another iteration of the loop. If you are designing the format, avoiding this "start of new record means end of previous record" saves grief. In this particular case having records separated only by an "----@ type" line would have made the record parsing more problematic. You should note that regexes in Perl can be variables!! This is way cool and applicable to all techniques. The second way is to use flags to indicate whether or not you are inside the record. You can do the logic for this yourself which I would consider a "not as good" way. Or as Grandfather did, use the triple dot, or "flip-flop" operator. Read his node about it: Flipin good, or a total flop?. Read the other posts on how to exclude the lines that trigger the record in various ways. This very special Perl operator essentially sets up flags for you to keep track of where you are. This is a cool critter and it takes some experimentation to understand it. If you read carefully the above, you will see that it also keeps track of the line number within the record which can sometimes be very helpful. So this was a long post to say: Yes, all the lines have to be read from the file and the "bad ones" thrown away. This node shows 2 ways to do that, one of which is very Perl specific. Which way you prefer is up to you and often depends upon hard to quantify factors like who is going to be maintaining this code?, etc.	[reply]