in reply to Module for parsing tables from plain text document
I've got a chunk of code that does this, but I've not turned it into a module because it's a bit temperamental. Rather, the code isn't temperamental, but the problem keeps changing for different projects. Consequently, for each project I find myself either tweaking the code a bit or the table a bit to make it load up.
I'm at work right now, so I don't have it handy, but I can dig it up this evening if you want it. The gist of it, though, is to:

1. find the table start and the column header lines,
2. build the column descriptions (the starting position and width of each column), and
3. parse the table, one record per data line.
To simplify the first two tasks, I tweak the data and add a line of dashes to the table (the automatic method I used to use was too finicky). While reading the file, I keep the lines immediately before the dash bar (back to the nearest blank line) and build the field keys from those.
The ugly bit is the number of special cases I wind up with for different projects. If you leave the special cases out, it's all fairly straightforward:
$ cat pm_11149401.pl
#!env perl

use strict;
use warnings;
use Data::Dumper;

### Find the table start and column header lines
my ($dashes, @tmp);
while (<DATA>) {
    # We've found the end of the column headings when we find a line of dashes and
    # blanks with at least eight sequential dashes
    $dashes=$_, last if /^[-\s]*-{8}[-\s]+$/;
    push @tmp, $_;

    # The data we'll build the column headers / keys from is only from lines
    # immediately before the dash bar
    @tmp=(), next if /^\s*$/;
}
die "No dash bar found!" unless defined $dashes;

### Build the column descriptions
# First need the starting position and width of each column
my $col=0;
my @coldefs;
while ($dashes ne '' and $dashes =~ /^(\s*)(-*)/) {
    # skip blanks
    $col += length($1);
    if (length $2) {
        push @coldefs, { beg=>$col, len=>length($2) };
        $col += length($2);
    }
    $dashes = substr($dashes, length($1)+length($2));
}

# Build the column keys
for my $tmp (@tmp) {
    for my $ar (@coldefs) {
        my $chunk = substr($tmp, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $chunk =~ s/[^-a-zA-Z0-9_]+/_/g;
        $ar->{key} .= $chunk;
    }
}

# Parse the table
my @records;
while (<DATA>) {
    last if /^\s*$/;
    my $hr = {};
    for my $ar (@coldefs) {
        my $chunk = substr($_, $ar->{beg}, $ar->{len});
        $chunk =~ s/(^\s+|\s+$)//g;
        $hr->{$ar->{key}} = $chunk;
    }
    push @records, $hr;
}

print Dumper(\@records);

__DATA__
Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                              1st Contact
Site               Longitude Latitude   Elvn   U.T.      PA Alt
                     o   '     o   '       m   h  m  s    o   o
-----------------  --------  --------- ------ --------   --- --
Auckland           174 45.   -36 55.        0  4 33 59   313 13
Blenheim           173 55.   -41 35.       30  4 40 34   326 11
Cape Palliser      175 25.   -41 35.        0  4 42 28   327  9
Cape Reinga        172 45.   -34 25.       50  4 30 11   307 17
Carterton          175 35.   -41 5.         0  4 40 35   324 10
Dannevirke         176 5.    -40 15.      200   4 39 9   321 10
East Cape          178 35.   -37 45.        0  4 37 58   315 10
Featherston        175 25.   -41 5.        40  4 40 36   325 10
Gisborne           178 5.    -38 45.        0  4 38 29   317 10
Great Barrier Is   175 25.   -36 15.        0  4 34 15   312 13

$ perl pm_11149401.pl
$VAR1 = [
          {
            'Elvnm' => '0',
            'lto' => '13',
            'Longitudo_' => '174 45.',
            'Site' => 'Auckland',
            '1st_ContU_T_h_m_s' => '4 33 59',
            'Latitudeo_' => '-36 55.',
            'PAo' => '313'
          },
          {
            'lto' => '11',
            'Elvnm' => '30',
            'Site' => 'Blenheim',
            'Longitudo_' => '173 55.',
            'PAo' => '326',
            'Latitudeo_' => '-41 35.',
            '1st_ContU_T_h_m_s' => '4 40 34'
          },
          {
            'Elvnm' => '0',
            'lto' => '9',
            'Site' => 'Cape Palliser',
            'Longitudo_' => '175 25.',
            '1st_ContU_T_h_m_s' => '4 42 28',
            'PAo' => '327',
            'Latitudeo_' => '-41 35.'
          },
          {
            'Site' => 'Cape Reinga',
            'Longitudo_' => '172 45.',
            'PAo' => '307',
            'Latitudeo_' => '-34 25.',
            '1st_ContU_T_h_m_s' => '4 30 11',
            'lto' => '17',
            'Elvnm' => '50'
          },
          {
            'Latitudeo_' => '-41 5.',
            'PAo' => '324',
            '1st_ContU_T_h_m_s' => '4 40 35',
            'Longitudo_' => '175 35.',
            'Site' => 'Carterton',
            'lto' => '10',
            'Elvnm' => '0'
          },
          {
            'Longitudo_' => '176 5.',
            'Site' => 'Dannevirke',
            '1st_ContU_T_h_m_s' => '4 39 9',
            'Latitudeo_' => '-40 15.',
            'PAo' => '321',
            'Elvnm' => '200',
            'lto' => '10'
          },
          {
            'Elvnm' => '0',
            'lto' => '10',
            '1st_ContU_T_h_m_s' => '4 37 58',
            'PAo' => '315',
            'Latitudeo_' => '-37 45.',
            'Longitudo_' => '178 35.',
            'Site' => 'East Cape'
          },
          {
            'Longitudo_' => '175 25.',
            'Site' => 'Featherston',
            '1st_ContU_T_h_m_s' => '4 40 36',
            'Latitudeo_' => '-41 5.',
            'PAo' => '325',
            'Elvnm' => '40',
            'lto' => '10'
          },
          {
            'lto' => '10',
            'Elvnm' => '0',
            'PAo' => '317',
            'Latitudeo_' => '-38 45.',
            '1st_ContU_T_h_m_s' => '4 38 29',
            'Site' => 'Gisborne',
            'Longitudo_' => '178 5.'
          },
          {
            'PAo' => '312',
            'Latitudeo_' => '-36 15.',
            '1st_ContU_T_h_m_s' => '4 34 15',
            'Longitudo_' => '175 25.',
            'Site' => 'Great Barrier Is',
            'lto' => '13',
            'Elvnm' => '0'
          }
        ];
The special cases, though, are where things get ugly, because I keep tweaking them for each project. There's a 'translation table' at the start that lets me map each incoming column name to a better one, as well as tie it to a function reference that parses the raw string into a better format. Another version somewhere has a control-break handler that lets you specify key columns, so that when the key values are blank it attaches those rows to the previous record as 'sub records', and so on.
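To give a flavour of it, here's roughly the shape of that translation table. Mind you, this is a from-memory sketch rather than the real code: the names (%xlat, parse_dm, parse_hms, translate) and the choice of output fields are made up for the example. The idea is just to map each ugly auto-generated key to a friendlier name plus a code ref that massages the raw string:

use strict;
use warnings;

# Sketch of the 'translation table' idea: map the auto-generated column keys
# to nicer names and attach a code ref that cleans up the value.
# (The names and output fields here are invented for illustration.)
my %xlat = (
    'Site'              => { name => 'site',             fix => sub { $_[0] } },
    'Longitudo_'        => { name => 'longitude',        fix => \&parse_dm },
    'Latitudeo_'        => { name => 'latitude',         fix => \&parse_dm },
    'Elvnm'             => { name => 'elevation_m',      fix => sub { 0 + $_[0] } },
    '1st_ContU_T_h_m_s' => { name => 'first_contact_ut', fix => \&parse_hms },
    'PAo'               => { name => 'position_angle',   fix => sub { 0 + $_[0] } },
    'lto'               => { name => 'altitude',         fix => sub { 0 + $_[0] } },
);

# "174 45." or "-36 55." -> decimal degrees
sub parse_dm {
    my ($deg, $min) = split ' ', $_[0];
    return $deg + ($deg < 0 ? -1 : 1) * $min / 60;
}

# "4 33 59" -> seconds after 0h UT
sub parse_hms {
    my ($h, $m, $s) = split ' ', $_[0];
    return 3600 * $h + 60 * $m + $s;
}

# Applied to each raw record produced by the table parser
sub translate {
    my ($raw) = @_;
    my %nice;
    for my $key (keys %$raw) {
        my $t = $xlat{$key} or next;    # silently drop columns we don't care about
        $nice{ $t->{name} } = $t->{fix}->( $raw->{$key} );
    }
    return \%nice;
}

And the control-break version, again with invented names, boils down to folding rows whose key column is blank into the previous record:

# Rough idea only: the real version lets you name the key columns;
# 'Site' is hard-coded here to keep the sketch short.
my @grouped;
for my $rec (@records) {
    if (length $rec->{Site}) {
        # key column present: start a new record
        push @grouped, $rec;
    }
    elsif (@grouped) {
        # key column blank: treat the row as a continuation of the previous record
        push @{ $grouped[-1]{sub_records} }, $rec;
    }
}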
I've never created and published a module before, but even if I had, I'd still be reluctant to try to build this thing out because of the ugly cases that keep coming up. But on the off chance that you might find it useful enough, I'll dig one of them up for you.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
Re^2: Module for parsing tables from plain text document
by GrandFather (Saint) on Jan 11, 2023 at 09:53 UTC