in reply to Re: Module for parsing tables from plain text document
in thread Module for parsing tables from plain text document

There are lots of ways to do it if I'm willing to count characters for each table I need to deal with. What I'd like is something that looks at a table and uses heuristics to figure out the column widths and names. In the first instance, all the tables I'm dealing with are machine generated, so the columns are unlikely to change within a table, but they do change between tables.

Optimising for fewest keystrokes only makes sense when transmitting to Pluto or beyond

Re^3: Module for parsing tables from plain text document
by NERDVANA (Priest) on Jan 09, 2023 at 03:58 UTC
    I wrote something similar for PDF once, and also wrote Data::TableReader, but I never got around to making PDF one of its decoders. For PDF, it made sense to look at the starting X coordinates of text segments, and to identify a column wherever roughly as many text fragments start at a given X as the estimated number of lines. Plain text has less granularity, so if I were going to write this for text, I would iterate over the lines, keeping a history of which character columns hold whitespace, and at EOF or the first blank line, see which runs of whitespace lasted from the first line to the last. Concatenate adjacent whitespace columns, and then report the spans in between as the data columns.
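    A minimal sketch of that whitespace-run heuristic, assuming the table block has already been read into memory (the function name and the sample table here are made up for illustration):

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Return ([start, width], ...) for each detected data column: any
    # character column that is whitespace on every line is treated as
    # part of a separator; the runs in between are the data columns.
    sub detect_columns {
        my @lines = @_;
        my $max = 0;
        $max = length $_ > $max ? length $_ : $max for @lines;

        # A column stays "blank" until some line has a non-space in it.
        # Columns past a short line's end count as blank, as they should.
        my @blank = (1) x $max;
        for my $line (@lines) {
            my @chars = split //, $line;
            for my $i (0 .. $#chars) {
                $blank[$i] = 0 if $chars[$i] ne ' ';
            }
        }

        # Collect maximal runs of non-blank columns.
        my @cols;
        my $start;
        for my $i (0 .. $max) {    # one past the end flushes the last run
            if ($i < $max && !$blank[$i]) {
                $start //= $i;
            }
            elsif (defined $start) {
                push @cols, [ $start, $i - $start ];
                undef $start;
            }
        }
        return @cols;
    }

    my @table = (
        'Name      Lat  Long',
        'Auckland  -36   174',
        'Blenheim  -41   173',
    );
    for my $col (detect_columns(@table)) {
        printf "column at %d, width %d\n", @$col;
    }
    ```

    This only handles the simple case; nested or overlapping headers like the ones in the OP's table would still defeat it, which is why a real decoder would want more heuristics on top.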

    It would be really awesome if you wanted to contribute a Decoder for Data::TableReader :-)

      Could you please show an example of how to parse the OP's table?

      I find this example particularly challenging, since

      • it has nested columns
      • there are multiple subdivided header captions
      • "Longitude" in particular overlaps the "empty column" that delimits its data entries below

      Cheers Rolf
      (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
      Wikisyntax for the Monastery

        Let me throw in some assumptions here (having to deal with quite a few other text-based formats at work):

        • You can skip the headers, because they are standardized across files
        • Everything before the first number (or minus sign) is the location name
        • Data columns always contain a value
        • The only column that can contain spaces is the location name

        That means we can just collapse runs of spaces. We have to handle the location name specially, but after that we can use split to recover the columns:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Data::Dumper;
        use Carp;

        my @sites;

        open(my $ifh, '<', 'eclipse.txt') or croak($!);

        # Skip header
        for(1..5) {
            my $tmp = <$ifh>;
        }

        while((my $line = <$ifh>)) {
            chomp $line;
            next if($line eq ''); # Ignore empty lines

            my %entry;
            $line =~ s/\ +/ /g; # Collapse spaces
            if($line =~ /^(.*?)\s[-\d]/) {
                $entry{location} = $1;

                # Remove location name
                $line =~ s/^.*?\s([-\d])/$1/;

                # Split along spaces
                my @parts = split/\ /, $line;
                foreach my $name (qw[long1 long2 lat1 lat2 elevation h m s PA Alt]) {
                    $entry{$name} = shift @parts;
                }
                push @sites, \%entry;
            }
        }
        close $ifh;

        print Dumper(\@sites);

        That results in an array of hashes:

        $VAR1 = [
                  {
                    's' => '59',
                    'elevation' => '0',
                    'long2' => '45.',
                    'lat2' => '55.',
                    'lat1' => '-36',
                    'location' => 'Auckland',
                    'm' => '33',
                    'h' => '4',
                    'long1' => '174',
                    'PA' => '313',
                    'Alt' => '13'
                  },
                  {
                    'h' => '4',
                    'm' => '40',
                    'PA' => '326',
                    'Alt' => '11',
                    'long1' => '173',
                    'lat2' => '35.',
                    'long2' => '55.',
                    's' => '34',
                    'elevation' => '30',
                    'location' => 'Blenheim',
                    'lat1' => '-41'
                  },
                  {
                    'h' => '4',
                    'm' => '42',
                    'PA' => '327',
                    'Alt' => '9',
                    'long1' => '175',
                    'lat2' => '35.',
                    'long2' => '25.',
                    's' => '28',
                    'elevation' => '0',
                    'location' => 'Cape Palliser',
                    'lat1' => '-41'
                  },
                  ...

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP