GrandFather:

I've got a chunk of code that does this, but I've not turned it into a module because it's a bit temperamental. Rather, the code isn't temperamental, but the problem keeps changing for different projects. Consequently, for each project I find myself either tweaking the code a bit or the table a bit to make it load up.

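I'm at work right now, so I don't have it handy, but I can dig it up this evening if you want it. The gist of it, though, is to:

  • Find the start of the table and its column header lines
  • Work out the starting position and width of each column
  • Build the column keys from the header lines
  • Parse the table rows into records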

To simplify the first two tasks, I tweak the data and add a line of dashes to the table (the automatic method I used to use is too finicky). While reading the file, I keep the lines immediately before the dash bar (back to the most recent blank line) to build the field keys.

The ugly bit is the number of special cases I wind up with for different projects. If you leave the special cases out, it's all fairly straightforward:

    $ cat pm_11149401.pl
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Data::Dumper;

    ### Find the table start and column header lines
    my ($dashes, @tmp);
    while (<DATA>) {
        # We've found the end of the column headings when we find a line of dashes and
        # blanks with at least eight sequential dashes
        $dashes=$_, last if /^[-\s]*-{8}[-\s]+$/;
        push @tmp, $_;

        # The data we'll build the column headers / keys from is only from lines
        # immediately before the dash bar
        @tmp=(), next if /^\s*$/;
    }
    die "No dash bar found!" unless defined $dashes;

    ### Build the column descriptions

    # First need the starting position and width of each column
    my $col=0;
    my @coldefs;
    while ($dashes ne '' and $dashes =~ /^(\s*)(-*)/) {
        # skip blanks
        $col += length($1);
        if (length $2) {
            push @coldefs, { beg=>$col, len=>length($2) };
            $col += length($2);
        }
        $dashes = substr($dashes, length($1)+length($2));
    }

    # Build the column keys
    for my $tmp (@tmp) {
        for my $ar (@coldefs) {
            # A short header line may end before this column starts; skip it
            # rather than trigger a "substr outside of string" warning
            next if $ar->{beg} > length $tmp;
            my $chunk = substr($tmp, $ar->{beg}, $ar->{len});
            $chunk =~ s/(^\s+|\s+$)//g;
            $chunk =~ s/[^-a-zA-Z0-9_]+/_/g;
            $ar->{key} .= $chunk;
        }
    }

    # Parse the table
    my @records;
    while (<DATA>) {
        last if /^\s*$/;
        my $hr = {};
        for my $ar (@coldefs) {
            my $chunk = substr($_, $ar->{beg}, $ar->{len});
            $chunk =~ s/(^\s+|\s+$)//g;
            $hr->{$ar->{key}} = $chunk;
        }
        push @records, $hr;
    }

    print Dumper(\@records);

    __DATA__
    Annular-Total Eclipse of 2023 Apr 20 - multisite predictions

                                                1st Contact
    Site              Longitude Latitude  Elvn    U.T.     PA Alt
                        o  '      o  '      m    h  m  s    o   o
    ----------------- -------- --------- ------ --------   --- --
    Auckland          174 45.  -36 55.        0  4 33 59   313 13
    Blenheim          173 55.  -41 35.       30  4 40 34   326 11
    Cape Palliser     175 25.  -41 35.        0  4 42 28   327  9
    Cape Reinga       172 45.  -34 25.       50  4 30 11   307 17
    Carterton         175 35.  -41 5.         0  4 40 35   324 10
    Dannevirke        176 5.   -40 15.      200   4 39 9   321 10
    East Cape         178 35.  -37 45.        0  4 37 58   315 10
    Featherston       175 25.  -41 5.        40  4 40 36   325 10
    Gisborne          178 5.   -38 45.        0  4 38 29   317 10
    Great Barrier Is  175 25.  -36 15.        0  4 34 15   312 13

    $ perl pm_11149401.pl
    $VAR1 = [
              {
                'Elvnm' => '0',
                'lto' => '13',
                'Longitudo_' => '174 45.',
                'Site' => 'Auckland',
                '1st_ContU_T_h_m_s' => '4 33 59',
                'Latitudeo_' => '-36 55.',
                'PAo' => '313'
              },
              {
                'lto' => '11',
                'Elvnm' => '30',
                'Site' => 'Blenheim',
                'Longitudo_' => '173 55.',
                'PAo' => '326',
                'Latitudeo_' => '-41 35.',
                '1st_ContU_T_h_m_s' => '4 40 34'
              },
              {
                'Elvnm' => '0',
                'lto' => '9',
                'Site' => 'Cape Palliser',
                'Longitudo_' => '175 25.',
                '1st_ContU_T_h_m_s' => '4 42 28',
                'PAo' => '327',
                'Latitudeo_' => '-41 35.'
              },
              {
                'Site' => 'Cape Reinga',
                'Longitudo_' => '172 45.',
                'PAo' => '307',
                'Latitudeo_' => '-34 25.',
                '1st_ContU_T_h_m_s' => '4 30 11',
                'lto' => '17',
                'Elvnm' => '50'
              },
              {
                'Latitudeo_' => '-41 5.',
                'PAo' => '324',
                '1st_ContU_T_h_m_s' => '4 40 35',
                'Longitudo_' => '175 35.',
                'Site' => 'Carterton',
                'lto' => '10',
                'Elvnm' => '0'
              },
              {
                'Longitudo_' => '176 5.',
                'Site' => 'Dannevirke',
                '1st_ContU_T_h_m_s' => '4 39 9',
                'Latitudeo_' => '-40 15.',
                'PAo' => '321',
                'Elvnm' => '200',
                'lto' => '10'
              },
              {
                'Elvnm' => '0',
                'lto' => '10',
                '1st_ContU_T_h_m_s' => '4 37 58',
                'PAo' => '315',
                'Latitudeo_' => '-37 45.',
                'Longitudo_' => '178 35.',
                'Site' => 'East Cape'
              },
              {
                'Longitudo_' => '175 25.',
                'Site' => 'Featherston',
                '1st_ContU_T_h_m_s' => '4 40 36',
                'Latitudeo_' => '-41 5.',
                'PAo' => '325',
                'Elvnm' => '40',
                'lto' => '10'
              },
              {
                'lto' => '10',
                'Elvnm' => '0',
                'PAo' => '317',
                'Latitudeo_' => '-38 45.',
                '1st_ContU_T_h_m_s' => '4 38 29',
                'Site' => 'Gisborne',
                'Longitudo_' => '178 5.'
              },
              {
                'PAo' => '312',
                'Latitudeo_' => '-36 15.',
                '1st_ContU_T_h_m_s' => '4 34 15',
                'Longitudo_' => '175 25.',
                'Site' => 'Great Barrier Is',
                'lto' => '13',
                'Elvnm' => '0'
              }
            ];
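As an aside, the column-description step can also be written more compactly with Perl's match-offset arrays `@-` and `@+`. This is only a sketch of an alternative, not what the script above does; the sub name `coldefs_from_dashes` is made up for the example, and it should produce the same `beg`/`len` pairs for a dash bar.

```perl
use strict;
use warnings;

# Return a list of { beg => offset, len => width } hashes, one per run of
# dashes in the dash bar. (Hypothetical helper, named for this sketch only.)
sub coldefs_from_dashes {
    my ($dashes) = @_;
    my @coldefs;
    while ($dashes =~ /(-+)/g) {
        # $-[1] and $+[1] are the start and end offsets of capture group 1
        push @coldefs, { beg => $-[1], len => $+[1] - $-[1] };
    }
    return @coldefs;
}

# Small demonstration on a three-column dash bar
my @defs = coldefs_from_dashes("----- ---  --\n");
print join(', ', map { "beg $_->{beg} len $_->{len}" } @defs), "\n";
```

The trade-off is that the loop in the script keeps a running `$col` and so is easy to extend with extra bookkeeping, while the offset-array version is shorter but relies on knowing what `@-` and `@+` hold.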

The special cases, though, are where I end up tweaking the things that drive me crazy. There's a 'translation table' at the start that lets me map each incoming column name to a better one, as well as tie it to a function reference that parses the resulting string into a better format. Another version somewhere has a control-break handler that lets you specify key columns, so that when the key values are blank it makes 'sub records', and so on.
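To give a feel for those two special cases, here is a rough sketch. It is not the code I use: the table `%xlate`, the subs `dm_to_decimal`, `hms_to_seconds`, `translate` and `group_by_keys`, the nicer column names, and the `sub_records` field are all invented for illustration, and the control-break part is just one guess at how "blank key columns make sub records" could work.

```perl
use strict;
use warnings;

# Hypothetical translation table: raw column key => [ nicer name, parser ].
# The raw keys are the ones generated from the header lines; everything on
# the right-hand side is made up for this sketch.
my %xlate = (
    'Longitudo_'        => [ longitude        => \&dm_to_decimal   ],
    'Latitudeo_'        => [ latitude         => \&dm_to_decimal   ],
    'Elvnm'             => [ elevation_m      => sub { $_[0] + 0 } ],
    '1st_ContU_T_h_m_s' => [ first_contact_ut => \&hms_to_seconds  ],
);

# "174 45." (degrees, minutes) -> decimal degrees, keeping the sign
sub dm_to_decimal {
    my ($deg, $min) = split ' ', $_[0];
    $min = 0 unless defined $min;
    $min =~ s/\.$//;                       # drop the trailing dot
    my $sign = $deg =~ /^-/ ? -1 : 1;
    return $sign * (abs($deg) + $min / 60);
}

# "4 33 59" (hours, minutes, seconds) -> seconds
sub hms_to_seconds {
    my ($h, $m, $s) = split ' ', $_[0];
    return $h * 3600 + $m * 60 + $s;
}

# Rename and parse every field; columns without an entry pass through as-is
sub translate {
    my ($rec) = @_;
    my %out;
    for my $raw (keys %$rec) {
        my ($name, $parse) = @{ $xlate{$raw} // [ $raw, sub { $_[0] } ] };
        $out{$name} = $parse->($rec->{$raw});
    }
    return \%out;
}

# Control-break sketch: a row whose key columns are all blank is attached as
# a sub record to the most recent row that did have key values
sub group_by_keys {
    my ($keys, @rows) = @_;
    my @groups;
    for my $row (@rows) {
        my $has_key = grep { defined $row->{$_} && $row->{$_} ne '' } @$keys;
        if ($has_key or !@groups) {
            push @groups, { %$row, sub_records => [] };
        }
        else {
            push @{ $groups[-1]{sub_records} }, $row;
        }
    }
    return @groups;
}

my $rec = translate({
    'Site'              => 'Auckland',
    'Longitudo_'        => '174 45.',
    'Latitudeo_'        => '-36 55.',
    '1st_ContU_T_h_m_s' => '4 33 59',
});
printf "%s: lon %.2f, lat %.2f, first contact at %d s\n",
    @$rec{qw(Site longitude latitude first_contact_ut)};
```

Keeping the rename and the parser side by side in one table means a new project usually only needs the table edited, which is about as far as I'd trust a general-purpose version to go.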

I've never created and published a module before, but if I had, I'd still be reluctant to try to build this thing out because of the ugly cases that keep coming up. But on the off chance that you might find it useful enough, I'll dig one of them up for you.

...roboticus

When your only tool is a hammer, all problems look like your thumb.


In reply to Re: Module for parsing tables from plain text document by roboticus
in thread Module for parsing tables from plain text document by GrandFather
