Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I have a problem with data I am trying to retrieve from a text file.

The contents of the files look like the sample below, with the data I require in a table.
The text before and after the table is just general information, but it varies from file to file. There can be other tables after the one I am trying to read, but these must be ignored.
I feel the only way to get this data is to search for the words 'Bus Timetable', as this is not repeated anywhere else in the files. But to get the data I require, I need to look at the fourth line after 'Bus Timetable'. I then need to somehow get the FROM, DEPT, TO and ARR values in the order they appear in the table, in the following format: FROMDEPT:TOARR (e.g. twn0900:Apt1011). I think it can stop retrieving data when it sees the next +, as this is on the last line of the table.

---TEXT BEFORE---
+---------------+
| Bus Timetable |
+---------------+----+------------+-----------+------+
| BUS SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
+--------------------+------------+-----------+------+
| C4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |
| C6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
| B2  susx    mon 20 | cly   1034 | btn  1118 | no   |
| A0  xxxxx   xxx xx | xxx   xxxx | xxx  xxxx | xxx  |
+--------------------+------------+-----------+------+
---TEXT AFTER---

I have noticed that the file can continue the table on the next page, in which case it repeats the table name and the column headings as before. So I would like the program to read all tables with the 'Bus Timetable' heading but ignore any other tables and text.

So far I have not written any code, as I am not sure how to do this; I have been learning Perl for only 4 weeks. I think I understand what I want the code to do, and that it is possible to do what I have asked. I know how to search for a word, but I do not know how to look a number of lines past the match and then read the table. I guess it should count the '|' characters and take the data between the second and the fourth one?

If anyone can help me I will be very thankful.

Thank you all so much, Kwun Chang.

Replies are listed 'Best First'.
Re: read text table
by ctilmes (Vicar) on Oct 20, 2003 at 10:44 UTC
    The task depends highly on the type of data you can see -- based on what you have shown, there are very specific columns in which you find the data you want, so unpack would be a useful extractor. (You also might consider split or regular expressions).
    #!/usr/bin/perl
    use warnings;
    use strict;

    do { $_ = <> } until /Bus Timetable/;

    <> for (1..3); # skip titles 3 lines

    while (<>) {
        last if /^\+/;
        my ($from, $dept, $to, $arr) = unpack('x23a3x3a4x3a3x2a4', $_);
        print "$from$dept:$to$arr\n";
    }
Re: read text table
by Skeeve (Parson) on Oct 20, 2003 at 10:40 UTC
    Maybe something like this will help...
    # Warning: untested code...
    while (<>) {
        # Find the timetable and count the +---...---+ lines
        # (the title is spelled 'Bus Timetable' in the file, and the match is case-sensitive)
        if ($hit = /Bus Timetable/ .. $eot_count==3) {
            if ($hit==1 || $hit=~/e0/i) {
                # reset counters
                $eot_count=0;
                $line=0;
            }
            elsif (/\+\-{20}\+\-{12}\+\-{11}\+\-{6}\+/) {
                # count +----...---+ lines
                ++$eot_count;
            }
            # count lines inside the table
            if ($eot_count==2) {
                if (++$line==4) {
                    # found fourth line...
                }
            }
        }
    }
Re: read text table
by allolex (Curate) on Oct 20, 2003 at 11:00 UTC

    This isn't very pretty, but it demonstrates the use of unpack to solve your problem:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $template = 'x2A4A8A4A3x2A6A5x2A5A5x2A4';
    my ($bus,$service,$day,$time,$from,$departure,$to,$arrival,$avail);

    while (<DATA>) {
        # Keep only data rows: a '|' followed by a bus code such as C4.
        # (The pipe must be escaped; /^|/ matches every line, and short
        # lines would then run off the end of the unpack template.)
        next unless /^\| [A-Z]\d /;
        ($bus,$service,$day,$time,$from,$departure,$to,$arrival,$avail) = unpack($template,$_);
        print "Bus $bus of the $service service, arrives at $to at $arrival, from $from, leaving at $departure on $day. ($avail)\n";
    }

    __DATA__
    ---TEXT BEFORE---
    +---------------+
    | Bus Timetable |
    +---------------+----+------------+-----------+------+
    | BUS SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
    +--------------------+------------+-----------+------+
    | C4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |
    | C6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
    | B2  susx    mon 20 | cly   1034 | btn  1118 | no   |
    | A0  xxxxx   xxx xx | xxx   xxxx | xxx  xxxx | xxx  |
    +--------------------+------------+-----------+------+
    ---TEXT AFTER---

    Output

    Bus C4 of the metro service, arrives at Apt at 1011, from twn, leaving at 0900 on mon. (yes)
    Bus C6 of the intl service, arrives at XTR at 1426, from LDN, leaving at 1000 on mon. (no)
    Bus B2 of the susx service, arrives at btn at 1118, from cly, leaving at 1034 on mon. (no)
    Bus A0 of the xxxxx service, arrives at xxx at xxxx, from xxx, leaving at xxxx on xxx. (xxx)

    You can find more information on unpack by doing perldoc -f unpack at your command line.
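
    One detail worth seeing in isolation is the difference between the two template letters used in this thread: the lowercase a keeps a field's padding, while the uppercase A strips trailing spaces. A minimal sketch, assuming a 10-character slice of a table row; the variable names are made up for the example:

```perl
use strict;
use warnings;

my $cell = 'twn   0900';    # FROM padded to 6 characters, then DEPT

my ($from_a, $dept_a) = unpack 'a6a4', $cell;    # 'a' keeps the padding
my ($from_A, $dept_A) = unpack 'A6A4', $cell;    # 'A' strips trailing spaces

print "[$from_a][$dept_a]\n";    # prints [twn   ][0900]
print "[$from_A][$dept_A]\n";    # prints [twn][0900]
```

    With A there is no need for a separate whitespace cleanup, provided the field widths in the template match the file.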

    Hope that helps.

    --
    Allolex

    Update 2003-10-20 15:46:09 CEST: This node inspired me to look at all of the options in perldoc -f pack and handle the delimiters as null bytes.


Re: read text table
by Roger (Parson) on Oct 20, 2003 at 11:20 UTC
    Here's a simple program to do what you have described. It is overkill in that it captures all the data from the table, but you could do other things with the captured data, I suppose. :-)
    use strict;
    use Data::Dumper;

    # qw() rather than qw// so that the 24/7 heading can be used as a key
    my @cols = qw( BUS SERVICE DAY tm FROM DEPT TO ARR 24/7 );
    my @table = ();
    my $capture;

    while (<DATA>) {
        chomp;
        $capture=1, next if (/Bus Timetable/);
        $capture=0, last if length($_) == 0 && $capture;   # blank line ends the table
        if ($capture) {
            next if /^(?:\| BUS|\+)/;  # ignore lines beginning with + or | BUS
            s/\|//g;                   # strip out | characters
            s/^\s+//g;                 # strip leading whitespace
            my @rec = split /\s+/, $_;
            my %rec = map { $cols[$_] => $rec[$_] } 0 .. $#cols;
            push @table, \%rec;
        }
    }

    # debug print out
    print Dumper(\@table);

    # to print out from:to pairs
    foreach (@table) {
        printf "%s%s:%s%s\n", $_->{FROM}, $_->{DEPT}, $_->{TO}, $_->{ARR};
    }

    __DATA__
    ---TEXT BEFORE---

    +-----------------+
    | Train Timetable |
    +-----------------+--+------------+-----------+------+
    | TRN SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
    +--------------------+------------+-----------+------+
    | T4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |
    | T6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
    | T2  susx    mon 20 | cly   1034 | btn  1118 | no   |
    | T0  xxxxx   xxx xx | xxx   xxxx | xxx  xxxx | xxx  |
    +--------------------+------------+-----------+------+

    +---------------+
    | Bus Timetable |
    +---------------+----+------------+-----------+------+
    | BUS SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
    +--------------------+------------+-----------+------+
    | C4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |
    | C6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
    | B2  susx    mon 20 | cly   1034 | btn  1118 | no   |
    | A0  xxxxx   xxx xx | xxx   xxxx | xxx  xxxx | xxx  |
    +--------------------+------------+-----------+------+

    ---TEXT AFTER---
    The output is as required -
    $VAR1 = [
              {
                'DEPT' => '0900',
                'FROM' => 'twn',
                'ARR' => '1011',
                'DAY' => 'mon',
                'SERVICE' => 'metro',
                '24/7' => 'yes',
                'tm' => '15',
                'TO' => 'Apt',
                'BUS' => 'C4'
              },
              ...
    ...
    twn0900:Apt1011
    LDN1000:XTR1426
    cly1034:btn1118
    xxxxxxx:xxxxxxx
Re: read text table
by Anonymous Monk on Oct 20, 2003 at 11:31 UTC
    Hi ctilmes, I have tried your code and it works great for the table I have given, but maybe I should have said that the entries in the table can be of any length, so the destination can be town instead of twn, and the time is sometimes 9am instead of 0900. Could you also explain the regular expressions you have used, as I don't understand them too well?

    Thank you for your replies, Kwun

      If you want 4 character towns, change the unpack line to this:
      my ($from, $dept, $to, $arr) = unpack('x23a4x2a4x3a4x1a4', $_);
      unpack is different from regular expressions. In the template, x followed by a number skips that many characters, and a followed by a number captures a string of that many characters:
      x23: skip 23 characters
      a4: get 4 characters
      x2: skip 2
      a4: get 4
      ...
      The 4 a4 unpack directives capture the 4 parts of the string you are interested in. Add this to strip the extra spaces:
      s/\s+//g for ($from, $dept, $to, $arr);
      Here's the whole thing:
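
      Putting the pieces above together might look like the following sketch. It assumes the rest is unchanged from the first script; the helper name timetable_pairs and the in-memory sample are made up for the example and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read the first 'Bus Timetable' from a filehandle
# and return a list of "FROMDEPT:TOARR" strings.
sub timetable_pairs {
    my ($fh) = @_;
    my @pairs;

    # Skip everything up to and including the table title.
    while (my $line = <$fh>) {
        last if $line =~ /Bus Timetable/;
    }

    # Skip the border, the column headings and the second border.
    for (1 .. 3) { my $skipped = <$fh> }

    while (my $line = <$fh>) {
        last if $line =~ /^\+/;    # the closing border ends the table
        my ($from, $dept, $to, $arr) = unpack 'x23a4x2a4x3a4x1a4', $line;
        s/\s+//g for ($from, $dept, $to, $arr);    # drop the padding
        push @pairs, "$from$dept:$to$arr";
    }
    return @pairs;
}

# Sample input held in a string so the sketch runs on its own.
my $sample = <<'END';
---TEXT BEFORE---
+---------------+
| Bus Timetable |
+---------------+----+------------+-----------+------+
| BUS SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
+--------------------+------------+-----------+------+
| C4  metro   mon 15 | town  0900 | Apt  1011 | yes  |
| C6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
+--------------------+------------+-----------+------+
---TEXT AFTER---
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print "$_\n" for timetable_pairs($fh);
```

      The template still fixes every field at four characters in a fixed position, so longer entries such as 'manchester' need a different approach.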
        Thanks, that's excellent.

        Thanks also to everyone else who has replied; you've helped me a lot.

Re: read text table
by Anonymous Monk on Oct 20, 2003 at 13:11 UTC
    ctilmes, I have found a small problem in the code: when a field is empty, it skips it and takes the next entry in the row for that field instead. For example, in the first row below it will give me mon for the SERVICE field. Do you have any idea how to solve this?

    +---------------+
    | Bus Timetable |
    +---------------+----+-----------------+------------+------+
    | BUS SERVICE DAY tm | FROM       DEPT | TO    ARR  | 24/7 |
    +--------------------+-----------------+------------+------+
    | C4          mon 15 | twn        0900 | Apt   1011 | yes  |
    |     intl    mon    | LoDoN      1000 | XTR   1426 |      |
    | B2  sussex      20 | cly        1034 | brgtn 1118 | no   |
    | A0                 | manchester 1212 | xxx   xxxx | xxx  |
    +--------------------+-----------------+------------+------+
      Hello Kwun,

      The following code will work PROVIDED there is always an entry in the FROM, DEPT, TO, ARR fields. It looks for the second pipe symbol, (|), and gets the rest of the line after it. Hope this helps.

      Chris

      #!/usr/bin/perl
      use strict;
      use warnings;

      while (<DATA>) {
          next unless /Bus Timetable/;
          <DATA> for 1..3;
          my $line;
          until (($line = <DATA>) =~ /^\+/) {
              # Find index of the first '|' after the first one.
              my $idx = index($line, "|", 1);
              # From that position (+2), capture the remaining string
              $line = substr($line, $idx+2);
              # Remove the '|' characters
              $line =~ s/\|//g;
              # Capture the first four fields
              my ($fr, $dep, $to, $arr) = split ' ', $line;
              print "$fr$dep:$to$arr\n";
          }
      }

      __DATA__
      ---TEXT BEFORE---
      +---------------+
      | Bus Timetable |
      +---------------+----+------------+-----------+------+
      | BUS SERVICE DAY tm | FROM  DEPT | TO   ARR  | 24/7 |
      +--------------------+------------+-----------+------+
      | C4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |
      | C6  intl    mon 45 | LDN   1000 | XTR  1426 | no   |
      | B2  susx    mon 20 | cly   1034 | btn  1118 | no   |
      | A0  xxxxx   xxx xx | xxx   xxxx | xxx  xxxx | xxx  |
      +--------------------+------------+-----------+------+
      ---TEXT AFTER---
      +---------------+
      | Bus Timetable |
      +---------------+----+-----------------+------------+------+
      | BUS SERVICE DAY tm | FROM       DEPT | TO    ARR  | 24/7 |
      +--------------------+-----------------+------------+------+
      | C4          mon 15 | twn        0900 | Apt   1011 | yes  |
      |     intl    mon    | LoDoN      1000 | XTR   1426 |      |
      | B2  sussex      20 | cly        1034 | brgtn 1118 | no   |
      | A0                 | manchester 1212 | xxx   xxxx | xxx  |
      +--------------------+-----------------+------------+------+

      **** RESULTS *******
      ***********************
      twn0900:Apt1011
      LDN1000:XTR1426
      cly1034:btn1118
      xxxxxxx:xxxxxxx
      twn0900:Apt1011
      LoDoN1000:XTR1426
      cly1034:brgtn1118
      manchester1212:xxxxxxx
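
      The idea from the original question, taking the text between the second and the fourth '|', can also be written with split. A minimal sketch, under the same assumption that FROM, DEPT, TO and ARR are always present, and the further assumption that none of them contains a space; row_to_pair is a hypothetical helper name, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: turn one table row into "FROMDEPT:TOARR".
# Returns undef for borders, titles and rows with a missing value.
sub row_to_pair {
    my ($row) = @_;

    # Splitting on '|' gives: '', first cell, FROM/DEPT cell, TO/ARR cell, ...
    my @cells = split /\|/, $row;
    return undef unless @cells >= 4;

    # split ' ' ignores the padding, so the column width does not matter.
    my ($from, $dept) = split ' ', $cells[2];
    my ($to,   $arr)  = split ' ', $cells[3];
    return undef unless defined $dept && defined $arr;

    return "$from$dept:$to$arr";
}

print row_to_pair('| C4  metro   mon 15 | twn   0900 | Apt  1011 | yes  |'), "\n";
# prints twn0900:Apt1011
print row_to_pair('|     intl    mon    | LoDoN      1000 | XTR   1426 |      |'), "\n";
# prints LoDoN1000:XTR1426
```

      Because each cell is handled separately, empty BUS, SERVICE, DAY, tm or 24/7 entries do not shift the result. A town name containing a space would still be split in the wrong place.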