in reply to Extracting data-structure from HTML using Web::Scraper

Hello windowbreaker,

The following code does what you want, given the sample HTML fragment you supplied:

#! perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

my $sample = q{
<h4 class="bla">July 12</h4>
<p>Tim</p>
<p>Jon</p>
<h4 class="bla">July 13</h4>
<p>James</p>
<p>Eric</p>
<p>Jerry</p>
<p>Susie</p>
<h4 class="date">July 14</h4>
<p>Kami</p>
<p>Darryl</p>
};

# Preprocess: start a new <div class="foo"> before every <h4> that begins a
# line, closing the previous group first...
$sample =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;
# ...then drop the first </div>, which has no matching opening tag.
$sample =~ s{</div>}{};

# Each wrapper <div> now holds one date heading and its names.
my $names = scraper {
    process '//div[contains(@class, "foo")]', 'groups[]' => scraper {
        process 'h4', 'date'    => 'TEXT';
        process 'p',  'names[]' => 'TEXT';
    };
};

my $temp = $names->scrape($sample);

# Reshape into a list of { date => [ names ] } hashes.
my @res;
push @res, { $_->{'date'} => $_->{'names'} } for @{ $temp->{'groups'} };

print Dumper(\@res);

Output:

$VAR1 = [
          {
            'July 12' => [
                           'Tim',
                           'Jon'
                         ]
          },
          {
            'July 13' => [
                           'James',
                           'Eric',
                           'Jerry',
                           'Susie'
                         ]
          },
          {
            'July 14' => [
                           'Kami',
                           'Darryl'
                         ]
          }
        ];

The problem with the above is that for real HTML input, with nested nodes, comments, etc., the preprocessing logic quickly becomes so complicated as to render this whole approach impractical.
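To make that concrete, here is a small sketch using a made-up fragment (my own, not taken from your post) that contains one commented-out heading and one heading nested inside a <div>. It applies the same preprocessing substitution as above and reports what was touched:

```perl
#! perl
use strict;
use warnings;

# Hypothetical input: a commented-out heading, and a real heading that
# does not start its line because it is nested inside a <div>.
my $html = q{
<!--
<h4 class="bla">July 11</h4>
-->
<div><h4 class="bla">July 12</h4>
<p>Tim</p></div>
};

# Same substitution as the preprocessing step above, applied to a copy.
my $processed = $html;
my $n = $processed =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;

printf "substitutions: %d\n", $n || 0;
print $processed;
```

As far as I can tell this makes a single substitution, and it lands inside the comment: the commented-out "July 11" heading gets wrapper markup, while the real "July 12" heading is missed because it does not begin a line. Each such case would need its own special handling in the regex.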

Bottom line: Web::Scraper is probably just the wrong tool for this job. :-( ++Anonymous Monk for the answers below.
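For what it's worth, here is a minimal sketch of one way to avoid the preprocessing altogether. It is my own illustration, not necessarily what the answers below propose. It assumes HTML::TreeBuilder is available (it should already be installed, since Web::Scraper builds on it), walks the h4 and p elements in document order, and files each p under the most recent h4. The name group_by_heading is hypothetical.

```perl
#! perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

# Hypothetical helper: returns a list of { heading => [ paragraph texts ] }
# hashes, grouping each <p> under the nearest preceding <h4>.
sub group_by_heading {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my (@res, $names);
    # In list context, look_down returns all matching elements in document order.
    for my $el ($tree->look_down(_tag => qr/^(?:h4|p)$/)) {
        my $text = $el->as_trimmed_text;
        if ($el->tag eq 'h4') {
            $names = [];                       # start a new group
            push @res, { $text => $names };
        }
        elsif ($names) {
            push @$names, $text;               # <p> before any <h4> is skipped
        }
    }

    $tree->delete;                             # release the parse tree
    return \@res;
}

# Small sample in the same shape as the fragment above.
my $sample = q{
<h4 class="bla">July 12</h4>
<p>Tim</p>
<p>Jon</p>
<h4 class="date">July 14</h4>
<p>Kami</p>
};

print Dumper(group_by_heading($sample));
```

Because the grouping happens on the parsed tree rather than on the raw text, comments and extra nesting around the headings should not need any special treatment, though I have only thought this through for simple fragments like the one above.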

Athanasius <°(((><contra mundum