parsing CSV

younggrasshopper13 has asked for the wisdom of the Perl Monks concerning the following question:

Re: parsing CSV by GrandFather (Saint) on Oct 07, 2016 at 03:01 UTC
You don't show what the page data may look like, so I assume that you know how to wrangle it into raw CSV. Given that, you can match the data up by stuffing it into a hash:

    use strict;
    use warnings;
    use Text::CSV;

    my $page1 = <<PG1CSV;
    1,23
    2,10
    3,23
    PG1CSV

    my $page2 = <<PG2CSV;
    1,younggrasshopper13
    2,GrandFather
    4,Mr. Unknown
    PG2CSV

    my $csv = Text::CSV->new();
    my %idData;

    # Page 1: id,size. Assume the name is missing until page 2 supplies it.
    open my $pg1In, '<', \$page1;
    while (my $row = $csv->getline($pg1In)) {
        $idData{$row->[0]}{size} = $row->[1];
        $idData{$row->[0]}{name} = '-- missing --';
    }
    close $pg1In;

    # Page 2: id,name. Keep any size already stored for the id.
    open my $pg2In, '<', \$page2;
    while (my $row = $csv->getline($pg2In)) {
        $idData{$row->[0]}{name} = $row->[1];
        $idData{$row->[0]}{size} //= '-- missing --';
    }
    close $pg2In;

    for my $id (sort keys %idData) {
        print "$id: $idData{$id}{name} size $idData{$id}{size}\n";
    }

Prints:

    1: younggrasshopper13 size 23
    2: GrandFather size 10
    3: -- missing -- size 23
    4: Mr. Unknown size -- missing --

Premature optimization is the root of all job security
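The second loop above relies on Perl's defined-or assignment (`//=`), which assigns only when the left-hand side is undefined, so a size already recorded from page 1 is not overwritten by the placeholder. A minimal sketch of that behaviour using only core Perl (it needs Perl 5.10 or later; the hash contents are made-up sample values, not data from the thread):

```perl
use strict;
use warnings;

# Hypothetical sample data: id 1 was seen on page 1, id 4 was not.
my %idData = (1 => {size => 23});

for my $id (1, 4) {
    # Assign the placeholder only if no size has been stored yet.
    $idData{$id}{size} //= '-- missing --';
}

print "$_: $idData{$_}{size}\n" for sort keys %idData;
```

Id 1 keeps its stored size, while id 4 (created on the fly by autovivification) receives the placeholder.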
Re^2: parsing CSV by younggrasshopper13 (Novice) on Oct 07, 2016 at 03:15 UTC
The first web page has two columns: storage size and customer ID. It looks like this:

    512.45,c100
    6734, c200
    5653.2, c300

The second web page has no column names, is a little messy, and looks like this:

    c100, Joe Shmo c200, Jack Black c300, Cinderella c400, Barack Obama c500, Cruella Deville

The second page is just line after line of customer data and names, with no columns.
Re^3: parsing CSV by GrandFather (Saint) on Oct 07, 2016 at 03:34 UTC
The code is pretty much the same, except that the second page data gets newlines inserted in front of the ID codes and we do a little cleanup to remove whitespace from both ends of each field:

    use strict;
    use warnings;
    use Text::CSV;

    my $page1 = <<PG1CSV;
    512.45,c100
    6734, c200
    5653.2, c300
    PG1CSV

    my $csv = Text::CSV->new();
    my %idData;

    # Page 1: size,id
    open my $pg1In, '<', \$page1;
    while (my $row = $csv->getline($pg1In)) {
        s/^\s+|\s+$//g for @$row;    # Trim leading and trailing whitespace
        $idData{$row->[1]}{size} = $row->[0];
        $idData{$row->[1]}{name} = '-- missing --';
    }
    close $pg1In;

    my $page2 = <<PG2CSV;
    c100, Joe Shmo c200, Jack Black c300, Cinderella c400, Barack Obama c500, Cruella Deville
    PG2CSV

    $page2 =~ s/\b(?=\w+,)/\n/g;    # Insert newlines in front of id codes

    # Page 2: id,name
    open my $pg2In, '<', \$page2;
    while (my $row = $csv->getline($pg2In)) {
        next if !$row->[0];          # Skip blank lines
        s/^\s+|\s+$//g for @$row;    # Trim leading and trailing whitespace
        $idData{$row->[0]}{name} = $row->[1];
        $idData{$row->[0]}{size} //= '-- missing --';
    }
    close $pg2In;

    for my $id (sort keys %idData) {
        print "$id: $idData{$id}{name} size $idData{$id}{size}\n";
    }

Prints:

    c100: Joe Shmo size 512.45
    c200: Jack Black size 6734
    c300: Cinderella size 5653.2
    c400: Barack Obama size -- missing --
    c500: Cruella Deville size -- missing --
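The newline-insertion substitution is the least obvious step in the reply above, so here it is in isolation. This is only a sketch using core Perl (the `/r` modifier needs Perl 5.14 or later); the variable names are illustrative, and the sample line is a shortened copy of the data posted earlier in the thread:

```perl
use strict;
use warnings;

# Sample run-on line in the shape shown earlier in the thread.
my $line = 'c100, Joe Shmo c200, Jack Black c300, Cinderella';

# \b       : a word boundary
# (?=\w+,) : lookahead for a run of word characters followed by a comma
# Nothing is consumed, so a newline is inserted in front of each id.
(my $split = $line) =~ s/\b(?=\w+,)/\n/g;

# Trim each record and drop the empty one created before the first id.
my @records = grep { length } map { s/^\s+|\s+$//gr } split /\n/, $split;

print "$_\n" for @records;
```

One caveat with this pattern: any single word directly followed by a comma is treated as the start of a record, so a name written as "Shmo, Joe" would also trigger a split. The real data may need a stricter pattern for the ID.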
Re^4: parsing CSV by younggrasshopper13 (Novice) on Oct 07, 2016 at 05:01 UTC
Re^5: parsing CSV by GrandFather (Saint) on Oct 07, 2016 at 06:23 UTC
Re^4: parsing CSV by younggrasshopper13 (Novice) on Oct 08, 2016 at 02:15 UTC
Re^5: parsing CSV by AnomalousMonk (Archbishop) on Oct 08, 2016 at 03:13 UTC
Re: parsing CSV by GrandFather (Saint) on Oct 09, 2016 at 10:04 UTC
This doesn't seem to be getting very far very fast. The following puts all the pieces together, albeit using data from Re^2: parsing CSV rather than the real data. The parsing and cleanup will no doubt need to be different for the real data. This just pulls out the first two pre tags from one page rather than fetching two pages and doing whatever is needed to pull out the interesting content.

    use strict;
    use warnings;
    use MIME::Lite;
    use LWP::Simple;
    use Text::CSV;
    use HTML::TreeBuilder;

    # Fetch the "pages"
    my $content = get("http://perlmonks.org/?node_id=1173447");
    die "Couldn't get it!" unless defined $content;

    # Parse pages and clean up content
    my $root = HTML::TreeBuilder->new_from_content($content);
    my ($page1, $page2) = map {$_->as_text()} $root->find_by_tag_name('pre');

    s/\[download\]//g for $page1, $page2;
    s/\n\+//g for $page1, $page2;

    # Process page 1
    my $csv = Text::CSV->new();
    my %idData;

    open my $pg1In, '<', \$page1;
    while (my $row = $csv->getline($pg1In)) {
        s/^\s+|\s+$//g for @$row;
        $idData{$row->[1]}{size} = $row->[0];
        $idData{$row->[1]}{name} = '-- missing --';
    }
    close $pg1In;

    # Process page 2
    $page2 =~ s/\b(?=\w+,)/\n/g;    # Insert newlines in front of id codes

    open my $pg2In, '<', \$page2;
    while (my $row = $csv->getline($pg2In)) {
        next if !$row->[0];    # Skip blank lines
        s/^\s+|\s+$//g for @$row;
        $idData{$row->[0]}{name} = $row->[1];
        $idData{$row->[0]}{size} //= '-- missing --';
    }
    close $pg2In;

    # Generate output string
    my $output;

    for my $id (sort keys %idData) {
        $output .= "$id: $idData{$id}{name} size $idData{$id}{size}\n";
    }

    # Build the email
    my $msg = MIME::Lite->new(
        From    => 'me@myhost.com',
        To      => 'you@yourhost.com',
        Cc      => 'some@other.com, some@more.com',
        Subject => "Here's the data you wanted",
        Data    => $output
    );

    # and "send" it (use just '$msg->send()' in the next line to really send it)
    print $msg->as_string();

Prints:

    Content-Disposition: inline
    Content-Transfer-Encoding: 8bit
    Content-Type: text/plain
    MIME-Version: 1.0
    X-Mailer: MIME::Lite 3.030 (F2.85; T2.13; A2.16; B3.15; Q3.13)
    Date: Sun, 9 Oct 2016 22:55:39 +1300
    From: me@myhost.com
    To: you@yourhost.com
    Cc: some@other.com, some@more.com
    Subject: Here's the data you wanted

    c100: Joe Shmo size 512.45
    c200: Jack Black size 6734
    c300: Cinderella size 5653.2
    c400: Barack Obama size -- missing --
    c500: Cruella Deville size -- missing --

I suggest you leave the print line in until the body of the email looks right, and only then change it to the send line.
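The `s/\n\+//g` cleanup in the script above is easy to miss. The sketch below shows that one step on its own, using core Perl only. It assumes that a `+` at the start of a line in the scraped text is the site's marker for a wrapped long line rather than real data; the sample string and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical scraped text: one long line that was wrapped, with the
# continuation marked by a leading '+'.
my $scraped = "c400, Barack Obama c5\n+00, Cruella Deville\n";

# Remove each newline that is immediately followed by '+', which rejoins
# the wrapped pieces. Ordinary newlines are left alone.
(my $joined = $scraped) =~ s/\n\+//g;

print $joined;
```

If the real pages can contain lines that legitimately begin with `+`, this substitution would glue them onto the previous line, so it is worth checking against the actual content first.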