in reply to Re^3: Module for parsing tables from plain text document
in thread Module for parsing tables from plain text document

Could you please show an example how to parse the OP's table?

I find this example particularly challenging, since

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
Wikisyntax for the Monastery


Re^5: Module for parsing tables from plain text document
by cavac (Prior) on Jan 13, 2023 at 14:03 UTC

    Let me throw in some assumptions here (having to deal with quite a few other text-based formats at work):

    • You can skip the headers, because they are standardized across files
    • Everything before the first number (or minus sign) is the location name
    • Data columns always contain a value
    • The only column that can contain spaces is the location name

    That means we can just collapse spaces. We have to handle the location name specially, but after that we can use split to recover the columns:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use Data::Dumper;
    use Carp;

    my @sites;

    open(my $ifh, '<', 'eclipse.txt') or croak($!);

    # Skip header
    for(1..5) {
        my $tmp = <$ifh>;
    }

    while((my $line = <$ifh>)) {
        chomp $line;
        next if($line eq ''); # Ignore empty lines

        my %entry;
        $line =~ s/\ +/ /g; # Collapse spaces
        if($line =~ /^(.*?)\s[-\d]/) {
            $entry{location} = $1;

            # Remove location name
            $line =~ s/^.*?\s([-\d])/$1/;

            # Split along spaces
            my @parts = split/\ /, $line;
            foreach my $name (qw[long1 long2 lat1 lat2 elevation h m s PA Alt]) {
                $entry{$name} = shift @parts;
            }
            push @sites, \%entry;
        }
    }
    close $ifh;

    print Dumper(\@sites);

    That results in an array of hashes:

    $VAR1 = [
              {
                's' => '59',
                'elevation' => '0',
                'long2' => '45.',
                'lat2' => '55.',
                'lat1' => '-36',
                'location' => 'Auckland',
                'm' => '33',
                'h' => '4',
                'long1' => '174',
                'PA' => '313',
                'Alt' => '13'
              },
              {
                'h' => '4',
                'm' => '40',
                'PA' => '326',
                'Alt' => '11',
                'long1' => '173',
                'lat2' => '35.',
                'long2' => '55.',
                's' => '34',
                'elevation' => '30',
                'location' => 'Blenheim',
                'lat1' => '-41'
              },
              {
                'h' => '4',
                'm' => '42',
                'PA' => '327',
                'Alt' => '9',
                'long1' => '175',
                'lat2' => '35.',
                'long2' => '25.',
                's' => '28',
                'elevation' => '0',
                'location' => 'Cape Palliser',
                'lat1' => '-41'
              },
    ...
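    Once you have that structure, downstream use is straightforward. As a small follow-on sketch (not from the original post): if we assume the long1/long2 and lat1/lat2 pairs are degrees and arc minutes, one entry can be turned into decimal coordinates like this:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical entry, shaped like one element of @sites above
my %entry = (
    location => 'Auckland',
    long1 => '174', long2 => '45.',
    lat1  => '-36', lat2  => '55.',
);

# Combine degrees (long1/lat1) and minutes (long2/lat2) into decimal
# degrees; the sign of the degrees part decides the sign of the minutes.
sub to_decimal {
    my ($deg, $min) = @_;
    my $sign = $deg < 0 ? -1 : 1;
    return $deg + $sign * $min / 60;
}

printf "%s: lat %.4f, long %.4f\n", $entry{location},
    to_decimal($entry{lat1}, $entry{lat2}),
    to_decimal($entry{long1}, $entry{long2});
# prints "Auckland: lat -36.9167, long 174.7500"
```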

    PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP

        I stand corrected.

        As for those advanced heuristics, my first instinct would be to look into the "Open/Import" functionality of all those open source spreadsheet tools like LibreOffice. Those developers have spent the last few decades writing software that can make sense of user-provided, badly formatted data files.

        As far as I'm concerned, those self-"learning" AI/heuristics/statistics tools might be somewhat interesting for occasional hobby use. But I wouldn't consider them for production use. If something goes wrong (e.g. "a bug happens"), it's easy enough to debug (and verify/certify) a handcrafted parser. If an AI goes wrong, all you can do is tweak the training data, retrain the model and pray to a $DEITY of your choice that

        1. this has fixed the current problem
        2. the change in your training data hasn't introduced new problems

        Advanced statistics (including what we commonly refer to as AI) is an amazing tool by itself. But when it goes wrong, you basically have to find an error (or omission) in what boils down to a formula with possibly tens of millions of variables. I mean, winning a Nobel prize is nothing to sneer at, but I'm not sure how one would do it on a typical IT department budget ;-)

        The point of the OP is that he wanted an "AI/heuristic/statistical tool" to make those assumptions for him.

        Maybe someone should ask Github Copilot.