in reply to Extracting data-structure from HTML using Web::Scraper

Thanks to everyone who posted a solution. I learned a lot by reading through the different approaches to the problem.

I also ended up working out a solution using nothing but Web::Scraper (one of my requirements), and wanted to post it here:

use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
<html>
<body>
<h4 class="bla">July 12</h4>
<p>Tim</p>
<p>Jon</p>
<h4 class="bla">July 13</h4>
<p>James</p>
<p>Eric</p>
<p>Jerry</p>
<p>Susie</p>
<h4 class="bla">July 14</h4>
<p>Kami</p>
<p>Darryl</p>
</body>
</html>
};

my $names = scraper {
    # Run the callback once per date heading
    process 'h4.bla', 'names[]' => sub {
        my $elem  = shift;
        my $date  = $elem->as_text;
        my @names = ();
        # Select each <p> whose nearest preceding <h4> sibling holds this date
        for my $node ($elem->parent->findnodes(
            "//p[preceding-sibling::h4[1][. = '$date']]"
        )) {
            push @names, $node->as_text;
        }
        return { $date => \@names };
    };
};

my $res = $names->scrape( $sample );
print Dumper $res;

That will output the following:

$VAR1 = {
          'names' => [
                       {
                         'July 12' => [
                                        'Tim',
                                        'Jon'
                                      ]
                       },
                       {
                         'July 13' => [
                                        'James',
                                        'Eric',
                                        'Jerry',
                                        'Susie'
                                      ]
                       },
                       {
                         'July 14' => [
                                        'Kami',
                                        'Darryl'
                                      ]
                       }
                     ]
        };
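If a single hash keyed by date is easier to work with than a list of one-key hashes, the result can be flattened afterwards. This is only a rough sketch using core Perl: the structure is hard-coded to match the output above so it runs without Web::Scraper, and %by_date is just a name I made up for the example.

```perl
use strict;
use warnings;

# Same shape as the 'names' value printed above, hard-coded for the sketch.
my $res = {
    names => [
        { 'July 12' => [ 'Tim', 'Jon' ] },
        { 'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ] },
        { 'July 14' => [ 'Kami', 'Darryl' ] },
    ],
};

# Flatten the list of one-key hashes into one hash keyed by date.
# %by_date is a hypothetical name, not part of the original script.
my %by_date;
for my $entry (@{ $res->{names} }) {
    my ($date) = keys %$entry;          # each entry holds exactly one date
    $by_date{$date} = $entry->{$date};  # a repeated date would overwrite the earlier one
}

# Print one line per date, sorted by the date string.
for my $date (sort keys %by_date) {
    print "$date: ", join(', ', @{ $by_date{$date} }), "\n";
}
```

With the sample data this prints one line per date, e.g. "July 12: Tim, Jon". Note that a plain hash loses the document order of the headings, which is why the loop sorts the keys.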

Again, thanks to everyone for the help, you guys are awesome!

Re^2: Extracting data-structure from HTML using Web::Scraper
by Anonymous Monk on Jul 15, 2012 at 23:38 UTC

    nothing but Web::Scraper

    :) but findnodes is not Web::Scraper, it's all XML::LibXML or HTML::TreeBuilder::XPath (: