Re: Extracting data-structure from HTML using Web::Scraper

Thanks to everyone who posted a solution. I learned a lot by reading through the different approaches to the problem.

I also ended up working out a solution using nothing but Web::Scraper (one of my requirements), and wanted to post it here:

use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
<html>
<body>
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="bla">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
</body>
</html>
};

my $names = scraper {
    process 'h4.bla', 'names[]' => sub {
        my $elem = shift;
        my $date = $elem->as_text;
        my @names = ();
        for my $node ($elem->parent->findnodes( "//p[preceding-sibling::h4[1][. = '$date']]" )) {
            push @names, $node->as_text;
        }
        return { $date => \@names };
    };
};

my $res = $names->scrape( $sample );
print Dumper $res;

That will output the following:

$VAR1 = {
          'names' => [
                       {
                         'July 12' => [
                                        'Tim',
                                        'Jon'
                                      ]
                       },
                       {
                         'July 13' => [
                                        'James',
                                        'Eric',
                                        'Jerry',
                                        'Susie'
                                      ]
                       },
                       {
                         'July 14' => [
                                        'Kami',
                                        'Darryl'
                                      ]
                       }
                     ]
        };
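
If a single hash keyed by date is easier to work with than a list of one-key hashes, the result can be merged afterwards. This is only a minimal sketch: the `$res` literal is a hand-written stand-in for what `scrape` returned above, and `%by_date` and `@dates` are names made up for illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in for the structure returned by $names->scrape( $sample ).
my $res = {
    names => [
        { 'July 12' => [ 'Tim',   'Jon' ] },
        { 'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ] },
        { 'July 14' => [ 'Kami',  'Darryl' ] },
    ],
};

# Merge the one-key hashes into one hash, and remember the order in which
# the dates appeared, since a plain hash does not preserve it.
my ( %by_date, @dates );
for my $entry ( @{ $res->{names} } ) {
    for my $date ( keys %$entry ) {
        $by_date{$date} = $entry->{$date};
        push @dates, $date;
    }
}

for my $date (@dates) {
    print "$date: ", join( ', ', @{ $by_date{$date} } ), "\n";
}
```

Note that this assumes every date heading is unique; a repeated date would overwrite the earlier entry in `%by_date`.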

Again, thanks to everyone for the help, you guys are awesome!


Re^2: Extracting data-structure from HTML using Web::Scraper by Anonymous Monk on Jul 15, 2012 at 23:38 UTC
nothing but Web::Scraper :) but findnodes is not Web::Scraper, it's all XML::LibXML or HTML::TreeBuilder::XPath (: