in reply to Module for parsing tables from plain text document
I've got a chunk of code that does this, but I've not turned it into a module because it's a bit temperamental. Rather, the code isn't temperamental, but the problem keeps changing for different projects. Consequently, for each project I find myself either tweaking the code a bit or the table a bit to make it load up.
I'm at work right now, so I don't have it handy, but I can dig it up this evening if you want it. The gist of it, though, is to:

1. find the table start and the column header lines,
2. build the column descriptions (the starting position and width of each column), and
3. parse the table, one record per data line.
To simplify the first two tasks, I tweak the data and add a line of dashes to the table (the automatic method I used to use was too finicky). While reading the file, I keep the lines immediately before the dash bar (back to the nearest blank line) and build the field keys from those.
The ugly bit is the number of special cases I wind up with for different projects. If you leave the special cases out, it's all fairly straightforward:
$ cat pm_11149401.pl
#!env perl

use strict;
use warnings;
use Data::Dumper;

### Find the table start and column header lines
my ($dashes, @tmp);
while (<DATA>) {
    # We've found the end of the column headings when we find a line of dashes and
    # blanks with at least eight sequential dashes
    $dashes=$_, last if /^[-\s]*-{8}[-\s]+$/;
    push @tmp, $_;

    # The data we'll build the column headers / keys from is only from lines
    # immediately before the dash bar
    @tmp=(), next if /^\s*$/;
}
die "No dash bar found!" unless defined $dashes;

### Build the column descriptions
# First need the starting position and width of each column
my $col=0;
my @coldefs;
while ($dashes ne '' and $dashes =~ /^(\s*)(-*)/) {
    # skip blanks
    $col += length($1);
    if (length $2) {
        push @coldefs, { beg=>$col, len=>length($2) };
        $col += length($2);
    }
    $dashes = substr($dashes, length($1)+length($2));
}

# Build the column keys
for my $tmp (@tmp) {
    for my $ar (@coldefs) {
        my $chunk = substr($tmp, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $chunk =~ s/[^-a-zA-Z0-9_]+/_/g;
        $ar->{key} .= $chunk;
    }
}

# Parse the table
my @records;
while (<DATA>) {
    last if /^\s*$/;
    my $hr = {};
    for my $ar (@coldefs) {
        my $chunk = substr($_, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $hr->{$ar->{key}} = $chunk;
    }
    push @records, $hr;
}

print Dumper(\@records);

__DATA__
Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                              1st Contact
Site               Longitude Latitude   Elvn   U.T.      PA Alt
                     o   '     o   '       m   h  m  s    o   o
-----------------  --------  --------- ------ --------   --- --
Auckland           174 45.   -36 55.        0  4 33 59   313 13
Blenheim           173 55.   -41 35.       30  4 40 34   326 11
Cape Palliser      175 25.   -41 35.        0  4 42 28   327  9
Cape Reinga        172 45.   -34 25.       50  4 30 11   307 17
Carterton          175 35.   -41 5.         0  4 40 35   324 10
Dannevirke         176 5.    -40 15.      200   4 39 9   321 10
East Cape          178 35.   -37 45.        0  4 37 58   315 10
Featherston        175 25.   -41 5.        40  4 40 36   325 10
Gisborne           178 5.    -38 45.        0  4 38 29   317 10
Great Barrier Is   175 25.   -36 15.        0  4 34 15   312 13

$ perl pm_11149401.pl
$VAR1 = [
          {
            'Elvnm' => '0',
            'lto' => '13',
            'Longitudo_' => '174 45.',
            'Site' => 'Auckland',
            '1st_ContU_T_h_m_s' => '4 33 59',
            'Latitudeo_' => '-36 55.',
            'PAo' => '313'
          },
          {
            'lto' => '11',
            'Elvnm' => '30',
            'Site' => 'Blenheim',
            'Longitudo_' => '173 55.',
            'PAo' => '326',
            'Latitudeo_' => '-41 35.',
            '1st_ContU_T_h_m_s' => '4 40 34'
          },
          {
            'Elvnm' => '0',
            'lto' => '9',
            'Site' => 'Cape Palliser',
            'Longitudo_' => '175 25.',
            '1st_ContU_T_h_m_s' => '4 42 28',
            'PAo' => '327',
            'Latitudeo_' => '-41 35.'
          },
          {
            'Site' => 'Cape Reinga',
            'Longitudo_' => '172 45.',
            'PAo' => '307',
            'Latitudeo_' => '-34 25.',
            '1st_ContU_T_h_m_s' => '4 30 11',
            'lto' => '17',
            'Elvnm' => '50'
          },
          {
            'Latitudeo_' => '-41 5.',
            'PAo' => '324',
            '1st_ContU_T_h_m_s' => '4 40 35',
            'Longitudo_' => '175 35.',
            'Site' => 'Carterton',
            'lto' => '10',
            'Elvnm' => '0'
          },
          {
            'Longitudo_' => '176 5.',
            'Site' => 'Dannevirke',
            '1st_ContU_T_h_m_s' => '4 39 9',
            'Latitudeo_' => '-40 15.',
            'PAo' => '321',
            'Elvnm' => '200',
            'lto' => '10'
          },
          {
            'Elvnm' => '0',
            'lto' => '10',
            '1st_ContU_T_h_m_s' => '4 37 58',
            'PAo' => '315',
            'Latitudeo_' => '-37 45.',
            'Longitudo_' => '178 35.',
            'Site' => 'East Cape'
          },
          {
            'Longitudo_' => '175 25.',
            'Site' => 'Featherston',
            '1st_ContU_T_h_m_s' => '4 40 36',
            'Latitudeo_' => '-41 5.',
            'PAo' => '325',
            'Elvnm' => '40',
            'lto' => '10'
          },
          {
            'lto' => '10',
            'Elvnm' => '0',
            'PAo' => '317',
            'Latitudeo_' => '-38 45.',
            '1st_ContU_T_h_m_s' => '4 38 29',
            'Site' => 'Gisborne',
            'Longitudo_' => '178 5.'
          },
          {
            'PAo' => '312',
            'Latitudeo_' => '-36 15.',
            '1st_ContU_T_h_m_s' => '4 34 15',
            'Longitudo_' => '175 25.',
            'Site' => 'Great Barrier Is',
            'lto' => '13',
            'Elvnm' => '0'
          }
        ];
The special cases, though, are where things get ugly, because I keep tweaking them for each project. There's a 'translation table' at the start that lets me map each incoming column name to a better one, as well as tie it to a function reference that parses the raw string into a better format. Another version somewhere has a control-break handler that lets you specify key columns, so that when the key values are blank it attaches those rows to the previous record as 'sub records', and so on.
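To give a flavour of it, here's roughly the shape of that translation table. Mind you, this is a from-memory sketch rather than the real code: the names (%xlat, parse_dm, parse_hms, translate) and the choice of output fields are made up for the example. The idea is just to map each ugly auto-generated key to a friendlier name plus a code ref that massages the raw string:

use strict;
use warnings;

# Sketch of the 'translation table' idea: map the auto-generated column keys
# to nicer names and attach a code ref that cleans up the value.
# (The names and output fields here are invented for illustration.)
my %xlat = (
    'Site'              => { name => 'site',             fix => sub { $_[0] } },
    'Longitudo_'        => { name => 'longitude',        fix => \&parse_dm },
    'Latitudeo_'        => { name => 'latitude',         fix => \&parse_dm },
    'Elvnm'             => { name => 'elevation_m',      fix => sub { 0 + $_[0] } },
    '1st_ContU_T_h_m_s' => { name => 'first_contact_ut', fix => \&parse_hms },
    'PAo'               => { name => 'position_angle',   fix => sub { 0 + $_[0] } },
    'lto'               => { name => 'altitude',         fix => sub { 0 + $_[0] } },
);

# "174 45." or "-36 55." -> decimal degrees
sub parse_dm {
    my ($deg, $min) = split ' ', $_[0];
    return $deg + ($deg < 0 ? -1 : 1) * $min / 60;
}

# "4 33 59" -> seconds after 0h UT
sub parse_hms {
    my ($h, $m, $s) = split ' ', $_[0];
    return 3600 * $h + 60 * $m + $s;
}

# Applied to each raw record produced by the table parser
sub translate {
    my ($raw) = @_;
    my %nice;
    for my $key (keys %$raw) {
        my $t = $xlat{$key} or next;    # silently drop columns we don't care about
        $nice{ $t->{name} } = $t->{fix}->( $raw->{$key} );
    }
    return \%nice;
}

And the control-break version, again with invented names, boils down to folding rows whose key column is blank into the previous record:

# Rough idea only: the real version lets you name the key columns;
# 'Site' is hard-coded here to keep the sketch short.
my @grouped;
for my $rec (@records) {
    if (length $rec->{Site}) {
        # key column present: start a new record
        push @grouped, $rec;
    }
    elsif (@grouped) {
        # key column blank: treat the row as a continuation of the previous record
        push @{ $grouped[-1]{sub_records} }, $rec;
    }
}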
I've never created and published a module before, but even if I had, I'd still be reluctant to try to build this thing out because of the ugly cases that keep coming up. But on the off chance that you might find it useful enough, I'll dig one of them up for you.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re^2: Module for parsing tables from plain text document
by GrandFather (Saint) on Jan 11, 2023 at 09:53 UTC