Re: Extracting data-structure from HTML using Web::Scraper

The following code does what you want, given the sample HTML fragment you supplied:

#! perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

my $sample = q{
                   <h4 class="bla">July 12</h4>
                   <p>Tim</p>
                   <p>Jon</p>
                   <h4 class="bla">July 13</h4>
                   <p>James</p>
                   <p>Eric</p>
                   <p>Jerry</p>
                   <p>Susie</p>
                   <h4 class="date">July 14</h4>
                   <p>Kami</p>
                   <p>Darryl</p>
              };

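# Preprocess: wrap each <h4> and the <p> elements that follow it in a
# <div class="foo">, so that every date/names group gets its own container.
$sample =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;
$sample =~ s{</div>}{};    # drop the stray </div> emitted before the first group
$sample .= "</div>\n";     # close the last group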

my $names = scraper
            {
                process '//div[contains(@class, "foo")]', 'groups[]' =>
                    scraper { process 'h4', 'date'    => 'TEXT';
                              process 'p',  'names[]' => 'TEXT'; };
            };

my   $temp = $names->scrape($sample);
my   @res;
push @res, { $_->{'date'} => $_->{'names'} } for @{ $temp->{'groups'} };
print Dumper(\@res);

Output:

$VAR1 = [
          {
            'July 12' => [
                           'Tim',
                           'Jon'
                         ]
          },
          {
            'July 13' => [
                           'James',
                           'Eric',
                           'Jerry',
                           'Susie'
                         ]
          },
          {
            'July 14' => [
                           'Kami',
                           'Darryl'
                         ]
          }
        ];

The problem with the above is that for real HTML input, with nested nodes, comments, etc., the preprocessing logic quickly becomes so complicated as to render this whole approach impractical.
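
For comparison, here is a minimal sketch of grouping the <p> elements under their preceding <h4> by walking the parsed tree instead of rewriting the HTML text. It is not Web::Scraper; it assumes HTML::TreeBuilder is installed and that each <h4> and its <p> elements are siblings under the same parent, as in the sample. The variable names are only illustrative.

```perl
#! perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;    # assumption: not used in the original post

my $sample = q{
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
};

my $tree = HTML::TreeBuilder->new_from_content($sample);

my @res;
for my $h4 ($tree->look_down(_tag => 'h4')) {
    my $date = $h4->as_trimmed_text;
    my @names;

    # In list context, right() returns all following siblings of the node.
    for my $sib ($h4->right) {
        next unless ref $sib;         # skip bare text nodes
        last if $sib->tag eq 'h4';    # the next heading starts a new group
        push @names, $sib->as_trimmed_text if $sib->tag eq 'p';
    }
    push @res, { $date => \@names };
}
$tree->delete;

print Dumper(\@res);
```

This is intended to build the same array of single-key hashes as the code above. It still breaks down if a heading and its names live in different containers, so treat it as a starting point rather than a general solution.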

~~Bottom line: Web::Scraper is probably just the wrong tool for this job. :-(~~ ++Anonymous Monk for the answers below.

Athanasius <°(((>< contra mundum
