thezip has asked for the wisdom of the Perl Monks concerning the following question:

Hello good Monks,

I am having difficulty parsing a huge data file. Since it is a huge file, I can only read line-by-line (with the exception of the small buffer I'm using).

The rule I am trying to implement has these preconditions (please refer to the @sourcedata array below):
  1. A literal "1" occurs at the beginning of the line, (and is the only char on that line)
  2. ... and is immediately followed by *any* number of newlines
  3. ... and is terminated by the literal "_____ 2"

IFF these conditions are met, then the newlines between the "1" and "_____ 2" lines are removed. Under any other condition, everything is printed as-is (including the buffer).
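For a small input that fits in memory (which, as noted, the real file does not), the rule can be sketched as a single multiline substitution. The sample string below is made up for illustration; the streaming approaches in the replies are what the huge-file case actually needs:

```perl
use strict;
use warnings;

# Hypothetical small sample: a lone "1", blank lines, then "_____ 2";
# the first "1" block ends in "b" and so must be left alone.
my $text = "\n1\n\n\nb\n\n1\n\n\n\n\n_____ 2\n\n\n\n";

# Collapse the run of blank lines only when it is bracketed by
# a lone "1" line and a "_____ 2" line (/m makes ^ and $ match
# at internal line boundaries).
(my $out = $text) =~ s/^1\n\n+(?=_____ 2$)/1\n/mg;

print $out;
```

The lookahead keeps the "_____ 2" line in place while the matched blank lines are dropped, so only the newlines between the two markers disappear.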

For some reason, I cannot execute the loop where I am buffering the intermediate newlines.

Also please note that, for ease of this discussion, I have described this question in terms of arrays rather than file I/O -- this is not germane to the solution I am seeking.

Here's some sample data

#!/usr/bin/perl
use strict;
use warnings;

# NOTE: The @sourcedata array is the representation of
# the data as if it were read from a file by:
#
#     open(FH, $sourcefilename) || die ...
#     my @sourcedata = <FH>;
#     close FH;
#
# Since the source file is huge, I need to process the
# file line by line

# NOTE: I updated this array to reflect an array of lines
my @sourcedata = (
    "\n",
    "1\n",
    "\n",
    "\n",
    "b\n",
    "\n",
    "1\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "_____ 2\n",
    "\n",
    "\n",
    "\n"
);

# The desired result of processing a small data sample:
my @desiredoutput = qq(
1


b

1
_____ 2



);  # NOTE: preceding newlines have been collapsed

Here's my source code...

my @buffer = ();
my $length = scalar @sourcedata;

for (my $I = 0; $I < $length; $I++) {
    my $line = $sourcedata[$I];

    if ($line =~ /^1$/) {
        push(@buffer, $line);
        $I++;
        $line = $sourcedata[$I];

        # Here's the loop I can't seem to execute:
        while ($line =~ /^\n$/ && $I != $length) {
            print "Buffering...\n";
            push(@buffer, $line);
            $I++;
            last if $I == $length;
            $line = $sourcedata[$I];
        }

        if ($line =~ /_____ 2/) {
            # print only the first and last items in the buffer,
            # effectively removing the empty lines
            print shift(@buffer), pop(@buffer);
            print $line;
        }
        else {
            print join(@buffer);
        }
    }
    else {
        print $line;
    }
    @buffer = ();
}
Where do you want *them* to go today?

Replies are listed 'Best First'.
Re: Implementing a parsing rule in a huge data file
by BrowserUk (Patriarch) on Dec 08, 2006 at 01:48 UTC

    Update: simplified a bit more.

    A bit simpler than the other suggestions, and it works.

    use strict;

    while( <DATA> ) {
        if( /^1$/ ) {
            my $n = 0;
            ++$n while defined( $_ = <DATA> ) and /^\n$/;
            $n = 1 if /_____ 2/;
            print '1', "\n" x $n;
        }
        print;
    }

    __DATA__

    1


    b

    1




    _____ 2

    There are two keys to the simplicity.

    1. You don't have to only read lines at the top of the loop.
    2. There is no need to buffer the newlines, just count them.
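The counting idea can be exercised against an in-memory filehandle (a sketch along these lines, not BrowserUk's exact code -- the sample string and variable names here are made up, and a "1" line immediately followed by another "1" is not handled):

```perl
use strict;
use warnings;

# Made-up sample; an in-memory filehandle stands in for the huge file.
my $data = "\n1\n\n\n\n\n_____ 2\nb\n";
open my $fh, '<', \$data or die "open: $!";

my $out = '';
while ( my $line = <$fh> ) {
    if ( $line =~ /^1$/ ) {    # a lone "1" on the line
        my $n = 0;
        my $next;
        # Count blank lines instead of buffering them.
        ++$n while defined( $next = <$fh> ) and $next =~ /^\n\z/;
        # Terminator found: the counted blank lines are dropped.
        $n = 0 if defined $next and $next =~ /^_____ 2$/;
        $out .= "1\n" . ( "\n" x $n );    # replay what was counted
        $out .= $next if defined $next;
    }
    else {
        $out .= $line;
    }
}
print $out;
```

Because only a counter is carried between lines, memory use stays constant no matter how long the run of blank lines is.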

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I like this solution -- fewer parts to maintain and easier to look at six months down the road.

      Thanks!

      Where do you want *them* to go today?
Re: Implementing a parsing rule in a huge data file
by ikegami (Patriarch) on Dec 08, 2006 at 00:41 UTC

    A simple state machine will do. You are looking for either the start tag or for the end tag, so process each line in one of those contexts.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $START = "1";
    my $END   = "2";

    my $looking_for_end = 0;
    my @buffer;

    while (<DATA>) {
        if ($looking_for_end) {
            # Start tag -> Abort and restart
            if ($_ eq "$START\n") {
                print(@buffer);
                @buffer = ($_);
            }
            # End tag -> Remove newlines
            elsif (substr($_, -(length($END)+1)) eq "$END\n") {
                print(grep { $_ ne "\n" } @buffer);
                print($_);
                @buffer = ();
                $looking_for_end = 0;
            }
            # Newline -> Add to buffer (in case we abort)
            elsif ($_ eq "\n") {
                push(@buffer, $_);
            }
            # Unexpected -> Abort
            else {
                print(@buffer);
                print($_);
                @buffer = ();
                $looking_for_end = 0;
            }
        }
        else {
            # Start tag -> Start buffering and looking for end tag.
            if ($_ eq "$START\n") {
                @buffer = ($_);
                $looking_for_end = 1;
            }
            else {
                print($_);
            }
        }
    }

    __DATA__

    1


    b

    1




    _____ 2
Re: Implementing a parsing rule in a huge data file
by derby (Abbot) on Dec 08, 2006 at 00:39 UTC

    First, by using qq(), your @sourcedata array is only 1 element. From what I can tell, you need to start buffering when you see "1" and, if you see "1" again, just output the buffer. If you see "_____ 2", trim and then output the buffer. You'll probably want to empty the buffer after the loop too.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # NOTE: The @sourcedata array is the representation of
    # the data as if it were read from a file by:
    #
    #     open(FH, $sourcefilename) || die ...
    #     my @sourcedata = <FH>;
    #     close FH;
    #
    # Since the source file is huge, I need to process the
    # file line by line

    # NOTE: newlines are significant
    my $sourcedata = qq(
1


b

1




_____ 2
);

    my @sourcedata = split( /\n/, $sourcedata );

    my $buffer = [];
    my $in_1   = 0;

    foreach my $line ( @sourcedata ) {
        if( $line =~ /^1$/ ) {
            if( $in_1 ) {
                # already in 1 so just output buffer
                out_buffer( $buffer );
                $buffer = [];
            }
            else {
                $in_1 = 1;
            }
        }
        if( $line =~ /^_____ 2$/ && $in_1 ) {
            trim_buffer( $buffer );
            out_buffer( $buffer );
            $buffer = [];
            $in_1 = 0;
        }
        if( $in_1 ) {
            push( @$buffer, $line );
        }
        else {
            print $line, "\n";
        }
    }

    # empty buffer
    out_buffer( $buffer );

    # subs
    sub out_buffer {
        my $buffer = shift;
        foreach my $b ( @$buffer ) {
            print $b, "\n";
        }
    }

    sub trim_buffer {
        my $buffer = shift;
        my $len = scalar( @$buffer ) - 1;
        for( my $i = $len; $i >= 0; $i-- ) {
            if( ! $buffer->[$i] ) {
                splice( @$buffer, $i, 1 );
            }
        }
    }
    -derby
Re: Implementing a parsing rule in a huge data file
by andyford (Curate) on Dec 08, 2006 at 00:06 UTC