RP: Finding the second line an item appears on

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I am still having trouble with this...

I made a post yesterday (read file twice) asking if anyone could help me read each line in a file until I found a word for the second time (the word will never appear on the same line twice).

I had an excellent response and spent most of yesterday and today implementing each suggestion and playing around with variants. A small flaw in all of the suggestions, due to the wording of the question I submitted, is that they do cope with a second table title, but only if there are two titles and not just one.

The words I am searching for are table titles. I find the first table with 'formulae' as a title and process it, extracting the data I require. There is a problem I overlooked: I had only hard copies of the files I'm writing the script for, and failed to notice that tables too long for one page continue on the next, with the table title and column headers repeated above the continuation.
So in some files the script has to process two tables with the same title and column headers to get all the data. The files contain other tables as well, but they have different titles and I only need the formulae table, which is why I chose to search for 'formulae'.

The original code I had was simple and only found one table title (I was assuming that any table exceeding the page length would just continue on the next page, without another title and set of column headers). I now need to cater for tables that continue on the next page (i.e. two tables with the same title).


while (<>) { last if /formulae/ } # stop at the first title (or at end of input)

<> for (1..3); # skips 3 lines from title

# read input file
while (<>) {

    last if /^\+/;

    ## extract data from table ##

}

Thanks for all your help and the fantastic response i had yesterday, it helped a lot in other parts of the script i needed to write. Steve.

update (broquaint): added link to previous thread


Replies are listed 'Best First'.

Re: RP: Finding the second line an item appears on
by tachyon (Chancellor) on Oct 28, 2003 at 11:57 UTC

One very useful trick with Perl is to set the input record separator so that instead of reading line by line you read RECORD by RECORD, where each record may have many lines. Once you have discrete records you can generally manipulate them very easily. Using a data set like the one you showed yesterday and setting the input record separator to TWO newlines (only seen between tables) we get:

# set input record separator to end of table string
$/ = "\n\n";

my %hash;

# now we are reading data a table at a time
while(<DATA>) {
    my ( undef, $header, $data ) = split /\+\-+\+\s*/, $_;
    $header =~ s/\s*\|\s*//g;
    my @data = $data =~ m/\|([^\|]+)\|/g;
    push @{$hash{$header}}, @data;
}

use Data::Dumper;
print Dumper \%hash;

__DATA__
+---------+
| formula |
+---------+
| dat1    |
| dat2    |
| dat3    |
+---------+

+---------+
| formula |
+---------+
| dat4    |
| dat5    |
| dat6    |
+---------+

+---------+
| flubber |
+---------+
| dat11   |
| dat22   |
| dat33   |
+---------+

+---------+
| dubber  |
+---------+
| dat111  |
| dat222  |
| dat333  |
+---------+

+---------+
| dubber  |
+---------+
| dat1111 |
| dat2222 |
| dat3333 |
+---------+

__END__
$VAR1 = {
          '' => [],
          'flubber' => [
                         ' dat11   ',
                         ' dat22   ',
                         ' dat33   '
                       ],
          'formula' => [
                         ' dat1    ',
                         ' dat2    ',
                         ' dat3    ',
                         ' dat4    ',
                         ' dat5    ',
                         ' dat6    '
                       ],
          'dubber' => [
                        ' dat111  ',
                        ' dat222  ',
                        ' dat333  ',
                        ' dat1111 ',
                        ' dat2222 ',
                        ' dat3333 '
                      ]
        };
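
If only one title matters, the same record-at-a-time idea can be narrowed down. The following is a minimal sketch, not part of the code above: rows_for_title and $sample are hypothetical names, it assumes tables are separated by blank lines and have a single column as in the sample data, and it splits a string already in memory instead of setting $/. Continuation tables with the same title are merged in file order, and the padding around each value is trimmed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect the data cells of every
# table whose title equals $title, merging continuation tables in order.
sub rows_for_title {
    my ($text, $title) = @_;
    my @rows;

    # Assumption: tables are separated by one or more blank lines.
    for my $record (split /\n{2,}/, $text) {
        # Split on the +-----+ border lines: empty leading field, title, body.
        my (undef, $header, $body) = split /\+-+\+\s*/, $record;
        next unless defined $header && defined $body;

        # Strip the pipes and padding around the title.
        $header =~ s/^\s*\|\s*//;
        $header =~ s/\s*\|\s*$//;
        next unless $header eq $title;

        # Assumption: one cell per row; trim the padding around each value.
        for my $cell ($body =~ /\|([^|\n]+)\|/g) {
            (my $value = $cell) =~ s/^\s+|\s+$//g;
            push @rows, $value;
        }
    }
    return @rows;
}

my $sample = <<'END_TABLES';
+---------+
| formula |
+---------+
| dat1    |
| dat2    |
+---------+

+---------+
| flubber |
+---------+
| dat11   |
+---------+

+---------+
| formula |
+---------+
| dat3    |
+---------+
END_TABLES

print join(',', rows_for_title($sample, 'formula')), "\n";   # prints dat1,dat2,dat3
```

Because every record is checked against the title, it does not matter whether the table appears once, twice, or more often in the file.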

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
