thezip has asked for the wisdom of the Perl Monks concerning the following question:

Hello good Monks,

I am having difficulty parsing a huge data file. Since it is a huge file, I can only read line-by-line (with the exception of the small buffer I'm using).

The rule I am trying to implement has these preconditions (please refer to the @sourcedata array below):
  1. A literal "1" occurs at the beginning of the line, (and is the only char on that line)
  2. ... and is immediately followed by *any* number of newlines
  3. ... and is terminated by the literal "_____ 2"

IFF these conditions are met, then the newlines between the "1" and "_____ 2" lines are removed. Under any other condition, everything is printed as-is (including the buffer).
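For a small input that fits in memory (which, as noted, the real file does not), the rule can be sketched as a single multiline substitution. The sample string below is made up for illustration; the streaming approaches in the replies are what the huge-file case actually needs:

```perl
use strict;
use warnings;

# Hypothetical small sample: a lone "1", blank lines, then "_____ 2";
# the first "1" block ends in "b" and so must be left alone.
my $text = "\n1\n\n\nb\n\n1\n\n\n\n\n_____ 2\n\n\n\n";

# Collapse the run of blank lines only when it is bracketed by
# a lone "1" line and a "_____ 2" line (/m makes ^ and $ match
# at internal line boundaries).
(my $out = $text) =~ s/^1\n\n+(?=_____ 2$)/1\n/mg;

print $out;
```

The lookahead keeps the "_____ 2" line in place while the matched blank lines are dropped, so only the newlines between the two markers disappear.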

For some reason, I cannot execute the loop where I am buffering the intermediate newlines.

Also please note that, for ease of this discussion, I have described this question in terms of arrays rather than file I/O -- this is not germane to the solution I am seeking.

Here's some sample data

#!/usr/bin/perl
use strict;
use warnings;

# NOTE: The @sourcedata array is the representation of
# the data as if it were read from a file by:
#
#     open(FH, $sourcefilename) || die ...
#     my @sourcedata = <FH>;
#     close FH;
#
# Since the source file is huge, I need to process the
# file line by line

# NOTE: I updated this array to reflect an array of lines
my @sourcedata = (
    "\n",
    "1\n",
    "\n",
    "\n",
    "b\n",
    "\n",
    "1\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "_____ 2\n",
    "\n",
    "\n",
    "\n"
);

# The desired result of processing a small data sample:
my @desiredoutput = qq(
1


b

1
_____ 2



);  # NOTE: preceding newlines have been collapsed

Here's my source code...

my @buffer = ();
my $length = scalar @sourcedata;

for (my $I = 0; $I < $length; $I++) {
    my $line = $sourcedata[$I];

    if ($line =~ /^1$/) {
        push(@buffer, $line);
        $I++;
        $line = $sourcedata[$I];

        # Here's the loop I can't seem to execute:
        while ($line =~ /^\n$/ && $I != $length) {
            print "Buffering...\n";
            push(@buffer, $line);
            $I++;
            last if $I == $length;
            $line = $sourcedata[$I];
        }

        if ($line =~ /_____ 2/) {
            # print only the first and last items in the buffer,
            # effectively removing the empty lines
            print shift(@buffer), pop(@buffer);
            print $line;
        }
        else {
            print join(@buffer);
        }
    }
    else {
        print $line;
    }
    @buffer = ();
}
Where do you want *them* to go today?

Replies are listed 'Best First'.
Re: Implementing a parsing rule in a huge data file
by BrowserUk (Patriarch) on Dec 08, 2006 at 01:48 UTC

    Update: simplified a bit more.

    A bit simpler than the other suggestions, and it works.

    use strict;

    while( <DATA> ) {
        if( /^1$/ ) {
            my $n = 0;
            ++$n while defined( $_ = <DATA> ) and /^\n$/;
            $n = 1 if /_____ 2/;
            print '1', "\n" x $n;
        }
        print;
    }

    __DATA__

    1


    b

    1




    _____ 2

    There are two keys to the simplicity.

    1. You don't have to only read lines at the top of the loop.
    2. There is no need to buffer the newlines, just count them.
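The counting idea can be exercised against an in-memory filehandle (a sketch along these lines, not BrowserUk's exact code -- the sample string and variable names here are made up, and a "1" line immediately followed by another "1" is not handled):

```perl
use strict;
use warnings;

# Made-up sample; an in-memory filehandle stands in for the huge file.
my $data = "\n1\n\n\n\n\n_____ 2\nb\n";
open my $fh, '<', \$data or die "open: $!";

my $out = '';
while ( my $line = <$fh> ) {
    if ( $line =~ /^1$/ ) {    # a lone "1" on the line
        my $n = 0;
        my $next;
        # Count blank lines instead of buffering them.
        ++$n while defined( $next = <$fh> ) and $next =~ /^\n\z/;
        # Terminator found: the counted blank lines are dropped.
        $n = 0 if defined $next and $next =~ /^_____ 2$/;
        $out .= "1\n" . ( "\n" x $n );    # replay what was counted
        $out .= $next if defined $next;
    }
    else {
        $out .= $line;
    }
}
print $out;
```

Because only a counter is carried between lines, memory use stays constant no matter how long the run of blank lines is.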

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I like this solution -- fewer parts to maintain and easier to look at six months down the road.

      Thanks!

      Where do you want *them* to go today?
Re: Implementing a parsing rule in a huge data file
by ikegami (Patriarch) on Dec 08, 2006 at 00:41 UTC

    A simple state machine will do. You are looking for either the start tag or for the end tag, so process each line in one of those contexts.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $START = "1";
    my $END   = "2";

    my $looking_for_end = 0;
    my @buffer;

    while (<DATA>) {
        if ($looking_for_end) {
            # Start tag -> Abort and restart
            if ($_ eq "$START\n") {
                print(@buffer);
                @buffer = ($_);
            }
            # End tag -> Remove newlines
            elsif (substr($_, -(length($END)+1)) eq "$END\n") {
                print(grep { $_ ne "\n" } @buffer);
                print($_);
                @buffer = ();
                $looking_for_end = 0;
            }
            # Newline -> Add to buffer (in case we abort)
            elsif ($_ eq "\n") {
                push(@buffer, $_);
            }
            # Unexpected -> Abort
            else {
                print(@buffer);
                print($_);
                @buffer = ();
                $looking_for_end = 0;
            }
        }
        else {
            # Start tag -> Start buffering and looking for end tag.
            if ($_ eq "$START\n") {
                @buffer = ($_);
                $looking_for_end = 1;
            }
            else {
                print($_);
            }
        }
    }

    __DATA__

    1


    b

    1




    _____ 2
Re: Implementing a parsing rule in a huge data file
by derby (Abbot) on Dec 08, 2006 at 00:39 UTC

    First, by using qq(), your @sourcedata array is only 1 element. From what I can tell, you need to start buffering when you see "1" and, if you see "1" again, just output the buffer. If you see "_____ 2", trim and then output the buffer. You'll probably want to empty the buffer after the loop too.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # NOTE: The @sourcedata array is the representation of
    # the data as if it were read from a file by:
    #
    #     open(FH, $sourcefilename) || die ...
    #     my @sourcedata = <FH>;
    #     close FH;
    #
    # Since the source file is huge, I need to process the
    # file line by line

    # NOTE: newlines are significant
    my $sourcedata = qq(
1


b

1




_____ 2
);

    my @sourcedata = split( /\n/, $sourcedata );

    my $buffer = [];
    my $in_1   = 0;

    foreach my $line ( @sourcedata ) {
        if( $line =~ /^1$/ ) {
            if( $in_1 ) {
                # already in 1 so just output buffer
                out_buffer( $buffer );
                $buffer = [];
            }
            else {
                $in_1 = 1;
            }
        }
        if( $line =~ /^_____ 2$/ && $in_1 ) {
            trim_buffer( $buffer );
            out_buffer( $buffer );
            $buffer = [];
            $in_1 = 0;
        }
        if( $in_1 ) {
            push( @$buffer, $line );
        }
        else {
            print $line, "\n";
        }
    }

    # empty buffer
    out_buffer( $buffer );

    # subs
    sub out_buffer {
        my $buffer = shift;
        foreach my $b ( @$buffer ) {
            print $b, "\n";
        }
    }

    sub trim_buffer {
        my $buffer = shift;
        my $len = scalar( @$buffer ) - 1;
        for( my $i = $len; $i >= 0; $i-- ) {
            if( ! $buffer->[$i] ) {
                splice( @$buffer, $i, 1 );
            }
        }
    }
    -derby
Re: Implementing a parsing rule in a huge data file
by andyford (Curate) on Dec 08, 2006 at 00:06 UTC