Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am reading in a file formatted like so:

====header info====
10 to 50 line of text and
numbers with irregular
formatting
====header info====
10 to 50 lines of text and...

(This continues for thousands of lines.)

How can I define a record separator to read in everything between "====header info====" and the last line before the next "====header info====" as one record?

Here's an example:-

my @records;
{
    local $/ = /====/;    # obviously this won't work...
    open FILE1, "$ARGV[0]" or die "Cannot open data file.\n$!";
    while ( my $record = <FILE1> ) {
        my @chunks = split /\n/, $record;
        push @records, [@chunks];
    }
    close FILE1;
}

I then need to do stuff with @records.

What's the best way to define a quick-and-easy record separator that uses a regular expression?

Thanks,
-Chris

Replies are listed 'Best First'.
Re: Record separator question
by Limbic~Region (Chancellor) on Jan 20, 2004 at 23:40 UTC
    Chris,
    Take a look at perldoc perlvar.

    Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)

    Ok - so my suggestion would be to slurp in the entire thing and then split on your regex.

    local $/ = undef;
    my $record = <INPUT>;
    my @chunks = split /regex/, $record;
    for my $chunk ( @chunks ) {
        my @lines = split /\n/, $chunk;
    }
    Hope that helps - L~R

    Update: If for some reason you need to keep the header information - do something like this:
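The snippet that followed this update is not present here. As a minimal sketch of one way to keep the header with each record (an assumption, not necessarily what was originally posted), you can split on a zero-width lookahead so that the header line stays at the front of each chunk. The sample text and the variable names below are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped file contents.
my $text = join '',
    "====header info====\n", "first line\n", "second line\n",
    "====header info====\n", "third line\n";

# Match a whole header line; /m lets ^ match at the start of any line.
my $header = qr/^====header info====\n/m;

# The lookahead consumes nothing, so each chunk still begins with its header.
my @chunks = split /(?=$header)/, $text;

for my $chunk (@chunks) {
    my ( $head, @lines ) = split /\n/, $chunk;
    print "header: $head, body lines: ", scalar(@lines), "\n";
}
```

Because the pattern matches without consuming any text, nothing is thrown away by the split; dropping the header later is just a matter of ignoring $head.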

Re: Record separator question
by Zaxo (Archbishop) on Jan 20, 2004 at 23:50 UTC

    If you define

    {
        local $/ = '====';
        open local(*FILE1), '<', shift or die $!;
    then you can discard a first (empty) read, and get the header info and data in two further reads,
        local $_ = <FILE1>;    # discard the first (empty) read
        while (<FILE1>) {
            my @header_data = header_extract($_);
            my $data = <FILE1>;    # second read
            my @data = data_extract($data);
            # ...
        }
        close FILE1 or die $!;
        redo if @ARGV;
    }

    After Compline,
    Zaxo

Re: Record separator question
by Roger (Parson) on Jan 21, 2004 at 00:18 UTC
    I would keep my (business) logic as simple as possible:
    use strict;
    use warnings;

    while ($_ = getrecord()) {
        print "----RECORD----\n$_\n";
    }

    my $saved_header;

    sub getrecord {
        my $text = $saved_header;
        $saved_header = '';
        while (<DATA>) {
            if (/^====[^=]+====$/) {
                $saved_header = $_, last if $text;
            }
            $text .= $_;
        }
        return $text;
    }

    __DATA__
    ====header info====
    0
    10 to 50 line of text and
    numbers with ==== irregular ====
    formatting
    ====header info====
    10 to 50 lines of text and...

Re: Record separator question
by Cody Pendant (Prior) on Jan 20, 2004 at 23:56 UTC

    I tend to do stuff like this by thinking of two "modes".

    If $_ =~ m/^====/ then we're in header_mode.

    If not, we're in other_mode.

    In header_mode, we do X, in other_mode we do Y.

    It's kind of clunky, but it sorts out the logic wonderfully in my head.
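A minimal sketch of the two-mode idea, assuming "X" means starting a new record and "Y" means appending the line to the current one (both are my guesses at what you would want here). The sample lines and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice these lines would come from the file.
my @input = (
    "====header info====\n", "alpha\n", "beta\n",
    "====header info====\n", "gamma\n",
);

my @records;    # one array ref of lines per record

for my $line (@input) {
    chomp( my $copy = $line );
    if ( $copy =~ /^====/ ) {
        # header mode: start a fresh record
        push @records, [];
    }
    elsif (@records) {
        # other mode: add the line to the record that is currently open
        push @{ $records[-1] }, $copy;
    }
}

print scalar(@records), " records\n";
```

Lines that appear before the first header are silently ignored in this sketch; whether that is right depends on your data.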



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
    =~y~b-v~a-z~s; print
Re: Record separator question
by graff (Chancellor) on Jan 21, 2004 at 02:43 UTC
    Bear in mind that when $/ is some particular string, perl reads input data up to and including that string (or until EOF, if the string does not occur). The record separator, whatever it is, is retained as the last component of the record just read in. Note also that "chomp" works by removing $/ from the end of a string (if the string happens to contain $/ at the end).

    Given that records in your file begin with the line "====header info====\n", you could set this whole string as your record separator, and just accept the fact that the first "record" you read in will contain nothing else but this line, and that all subsequent records will have this line as the end of the record string, not the beginning.

    Something like the following would do what you want, assuming that you're okay with actually removing these record separators and keeping just the stuff in between:

    $/ = "====header info====\n";
    while (<>) {
        chomp;
        next unless ( /\S/ );
        ....
    }
    (I tried this out on your sample text, and it did the right thing, even with the last record, which did not have "====header info====" at the end.)
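For anyone who wants to try this without a data file, here is a self-contained sketch of the same approach. It reads from an in-memory filehandle, and the sample text and the @records array are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle.
my $text = join '',
    "====header info====\n", "alpha\n", "beta\n",
    "====header info====\n", "gamma\n";

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = "====header info====\n";
    while ( my $record = <$fh> ) {
        chomp $record;                  # removes a trailing $/, if present
        next unless $record =~ /\S/;    # skip the empty first "record"
        push @records, [ split /\n/, $record ];
    }
}
close $fh;

print scalar(@records), " records\n";
```

Using local inside a bare block keeps the changed $/ from leaking into the rest of the program.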
Re: Record separator question
by duff (Parson) on Jan 21, 2004 at 03:58 UTC

    I'm not exactly sure what you want because I can interpret your text several ways. Here's a couple of snippets of code though. The first assumes that you really want something like this:

    ====header info====
    header
    header
    ====header info====
    data
    data
    ====header info====
    header
    header
    ====header info====
    data
    ....

    ... and so on. While the second assumes that you just want the stuff that's in between the ====header info==== lines while discarding the header lines themselves. The second one is what I believe most people interpret your text to mean, but I thought I'd mention the first one just in case (also, it's a rare chance that I get to use the ... (yes, that's 3 dots!) flip-flop operator ;-)

    #!/usr/bin/perl
    # snippet number 1

    while (<DATA>) {
        if (/^====header/.../^====header/) {
            print "header: $_";
            next;
        }
        print "data: $_";
        next;
    }
    __DATA__
    ====header info====
    10 to 50 line of text and
    numbers with irregular
    formatting
    ====header info====
    10 to 50 lines of text and...
    More text
    more text
    ====header info====
    10 to 50 line of text and
    numbers with irregular
    formatting
    ====header info====
    10 to 50 lines of text and...
    More text
    more text
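On why snippet number 1 needs three dots rather than two: the sketch below runs both forms of the flip-flop over the same made-up input, where "H" stands in for a header line. The input and the array names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: "H" stands in for a header line.
my @lines = ( "H", "a", "H", "b" );

my ( @two_dot, @three_dot );
for (@lines) {
    # ".." tests the right operand on the same line that made the left true,
    # so a header line opens and closes the range at once.
    if ( /^H/ .. /^H/ ) { push @two_dot, $_ }

    # "..." waits until the next line before testing the right operand.
    if ( /^H/ ... /^H/ ) { push @three_dot, $_ }
}

print "two dots:   @two_dot\n";
print "three dots: @three_dot\n";
```

With this input the two-dot form selects only the two "H" lines, while the three-dot form selects "H", "a" and the second "H", which is the header-to-header span that snippet number 1 relies on.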
    #!/usr/bin/perl
    # snippet number 2

    my (@records,@tmp);
    while (<DATA>) {
        chomp;
        if (/^====header/) {
            next unless @tmp;
            push @records, [ @tmp ];
            @tmp = ();
            next;
        }
        push @tmp, $_;
    }
    push @records, [ @tmp ] if @tmp;
    print "@$_\n" for @records;
    __DATA__
    ====header info====
    10 to 50 line of text and
    numbers with irregular
    formatting
    ====header info====
    10 to 50 lines of text and...
    More text
    more text
    ====header info====
    10 to 50 line of text and
    numbers with irregular
    formatting
    ====header info====
    10 to 50 lines of text and...
    More text
    more text