Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract data from a text file in the following format:

anything anything DATE anything
anything anything SOMETHING
anything anything anything SOMETHINGELSE

And format it in the following way:
SOMETHING,DATE,SOMETHINGELSE
My problem is I have to KEY my search off of SOMETHING and then grab the DATE from the line above and SOMETHINGELSE from the line below it.... help? pretty please!

-clueless

Replies are listed 'Best First'.
Re: Lines above and Below a matched string
by BrowserUk (Patriarch) on Jun 26, 2003 at 18:48 UTC

    If the file is small, slurp the whole file into a scalar and use a regex of the general form

    # Slurp the whole file: @ARGV names the file, $/ is undef inside the block
    my $data = do { local (@ARGV, $/) = 'text.file'; <> };
    print "$2,$1,$3\n" if $data =~ m[(DATE).*?(SOMETHING).*?(SOMETHINGELSE)]s;

    The /s modifier allows . to match newlines, so the .*? will span lines. If you expect to find more than one occurrence, add /g and change the if to a while.
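
    A minimal sketch of the /g variant described above, not taken from the thread. It assumes the placeholders carry a numeric suffix (DATE1, SOMETHING1, ...) as in the sample data further down, and it swaps .*? for [^\n]* so that the three captures must sit on three consecutive lines. The helper name extract_records is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one
# "SOMETHING,DATE,SOMETHINGELSE" string per match in the slurped text.
sub extract_records {
    my ($data) = @_;
    my @rows;
    # [^\n]* cannot cross a newline, so each capture is pinned to its own
    # line; /g inside a while loop resumes after the previous match.
    while ($data =~ /(DATE\d+)[^\n]*\n[^\n]*(SOMETHING\d+)[^\n]*\n[^\n]*(SOMETHINGELSE\d+)/g) {
        push @rows, "$2,$1,$3";
    }
    return @rows;
}

# Assumed sample input; DATE2 has no SOMETHING on the following line.
my $data = join "\n",
    'anything anything DATE1 anything',
    'anything anything SOMETHING1',
    'anything anything anything SOMETHINGELSE1',
    'anything anything DATE2 anything',
    'anything anything DATE3 anything',
    'anything anything SOMETHING2',
    'anything anything anything SOMETHINGELSE2';

print "$_\n" for extract_records($data);
```

    On this sample the DATE2 line is skipped because the line after it does not contain a SOMETHING, so two rows are printed: SOMETHING1,DATE1,SOMETHINGELSE1 and SOMETHING2,DATE3,SOMETHINGELSE2.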

    If your file is too big to slurp, then you could use a sliding buffer on the file. See Re: split and sysread() for some sample code.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Lines above and Below a matched string
by gjb (Vicar) on Jun 26, 2003 at 19:30 UTC

    Just keep track of the last three lines you read, and match against the one before the last you read.

    my $line1 = <>; chomp($line1);
    my $line2 = <>; chomp($line2);
    while (<>) {
        chomp($_);
        my $line3 = $_;
        # $line2 is the middle line of the current three-line window
        if ($line2 =~ /(SOMETHING)/) {
            my $something = $1;
            # list-context matches, so a failed match cannot reuse a stale $1
            my ($date)          = $line1 =~ /(DATE)/;
            my ($somethingElse) = $line3 =~ /(SOMETHINGELSE)/;
            print "$something,$date,$somethingElse\n";
        }
        # slide the window down one line
        $line1 = $line2;
        $line2 = $line3;
    }

    Note: this is untested code. It will fail on files shorter than three lines. And no, it's not elegant, but it's simple.

    Hope this helps, -gjb-

    Update: This is a sliding buffer just as in BrowserUK's approach, but this should work for large files too.
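
    The same three-line window can also be kept in an array, which sidesteps the short-file problem noted above. This is only a sketch under assumptions: the helper name scan_window is hypothetical, the placeholders are assumed to carry a numeric suffix as in the sample data elsewhere in the thread, and the // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads any filehandle line by
# line and never holds more than three lines in memory.
sub scan_window {
    my ($fh) = @_;
    my (@window, @rows);
    while (my $line = <$fh>) {
        chomp $line;
        push @window, $line;
        shift @window if @window > 3;   # drop the oldest line
        next if @window < 3;            # not enough context yet
        # Key on the middle line; skip this window if it does not match.
        my ($something) = $window[1] =~ /(SOMETHING\d+)/ or next;
        my ($date)      = $window[0] =~ /(DATE\d+)/;
        my ($else)      = $window[2] =~ /(SOMETHINGELSE\d+)/;
        # A missing neighbour becomes an empty field rather than a warning.
        push @rows, join ',', $something, $date // '', $else // '';
    }
    return @rows;
}

# Assumed sample input, read through an in-memory filehandle.
my $sample = <<'END';
anything anything DATE1 anything
anything anything SOMETHING1
anything anything anything SOMETHINGELSE1
anything anything DATE2 anything
anything anything DATE3 anything
anything anything SOMETHING2
anything anything anything SOMETHINGELSE2
END
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for scan_window($fh);
```

    A file with fewer than three lines simply yields no rows. On the sample above the sketch prints SOMETHING1,DATE1,SOMETHINGELSE1 and SOMETHING2,DATE3,SOMETHINGELSE2.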

Re: Lines above and Below a matched string
by svsingh (Priest) on Jun 26, 2003 at 19:57 UTC
    Here's something a little different. I know you want to match on SOMETHING, but since you'll also need to match on DATE and SOMETHINGELSE (I'm assuming), why not match all three in sequence?

    To me, this is a little simpler than the buffer ideas. It probably has some flaws, but it seems to be working.

    my $tmp;
    while ($tmp = <DATA>) {
        if ($tmp =~ /(DATE\d)/) {
            my %h = ();
            $h{'date'} = $1;
            $tmp = <DATA>;
            if ($tmp =~ /(SOMETHING\d)/) {
                $h{'match'} = $1;
                $tmp = <DATA>;
                if ($tmp =~ /(SOMETHINGELSE\d)/) {
                    $h{'more'} = $1;
                    print "$h{'match'},$h{'date'},$h{'more'}\n";
                } else {
                    redo;
                }
            } else {
                redo;
            }
        }
    }
    __DATA__
    anything anything DATE1 anything
    anything anything SOMETHING1
    anything anything anything SOMETHINGELSE1
    anything anything DATE2 anything
    anything anything DATE3 anything
    anything anything SOMETHING2
    anything anything anything SOMETHINGELSE2
    anything anything DATE4 anything
    anything anything SOMETHING3
    anything anything DATE5 anything
    anything anything SOMETHING4
    anything anything anything SOMETHINGELSE3

    That returns:

    SOMETHING1,DATE1,SOMETHINGELSE1
    SOMETHING2,DATE3,SOMETHINGELSE2
    SOMETHING4,DATE5,SOMETHINGELSE3

    Hope this helps.

Re: Lines above and Below a matched string
by Itatsumaki (Friar) on Jun 26, 2003 at 18:37 UTC

    One way is to read your file in chunks of three lines. If line 2 matches the format you need, then you go ahead and process lines 1 & 3, otherwise go to the next chunk of three.

    -Tats
      Kinda yes and kinda no. If line 2 doesn't match, you then have to make line three the first line of your buffer and read in two more lines.

      -derby

      derby pointed out the problem with my approach. I think BrowserUK's idea of sliding buffers is probably best, but you can salvage my approach by opening three file-handles and having them in different frames. In code:

      open(IN1, "<$infile") or die "Can't open $infile: $!";
      open(IN2, "<$infile") or die "Can't open $infile: $!";
      open(IN3, "<$infile") or die "Can't open $infile: $!";
      my $temp;
      $temp = <IN2>;                  # remove one dummy line from 2nd file-handle
      $temp = <IN3>; $temp = <IN3>;   # remove two dummy lines from 3rd file-handle
      # Now process all three file-handles separately
      # in chunks of three lines

      And there was a recent node on how to read a file in chunks of n lines, that may help.

      -Tats