windowbreaker has asked for the wisdom of the Perl Monks concerning the following question:

I need to process the following HTML using Web::Scraper, and produce a data structure (see below).

The HTML looks like this:

<h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
<h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
<h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>

I would like to create the following data structure (an AoH), though any suitable data structure which associates each name with the proper date would do.

[
  { 'July 12' => [ 'Tim', 'Jon' ] },
  { 'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ] },
  { 'July 14' => [ 'Kami', 'Darryl' ] },
]

I know I can accomplish this with other modules, but I need to be able to do this with the Web::Scraper module, if at all possible.

I am starting off by trying to figure out how to do it for just one of the dates, July 12. I figured once I get that working I'll do the same thing for all the dates, which is ultimately what I need.

What I've got so far is this:

my $names = scraper {
    process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
        process 'p', 'name' => 'TEXT';
    };
};

I know my first XPath is finding the correct h4 tag, but the problem is that the p tags I need are its siblings, not its children/descendants, so the expression 'p' in the nested scraper construct is not finding any p tags.
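To see the structural issue on its own, here is a minimal standalone sketch of mine (not part of the original question), using HTML::TreeBuilder::XPath, the HTML/XPath module Web::Scraper normally builds on (as the last reply below also notes): a descendant search from the matched h4 finds nothing, while the sibling axis does reach the p tags but also over-reaches past the next h4.

# Not from the original post: a standalone illustration of the sibling problem,
# using HTML::TreeBuilder::XPath directly rather than Web::Scraper.
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content(q{
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p>
});

my ($h4) = $tree->findnodes('//h4[@class="bla" and . = "July 12"]');

# Looking *inside* the h4, as the nested scraper does, finds no p tags at all.
my @children = $h4->findnodes('.//p');
print scalar(@children), " p descendants\n";              # 0 p descendants

# The sibling axis does reach them, but it also runs past the next h4.
my @siblings = $h4->findnodes('following-sibling::p');
print join(', ', map { $_->as_text } @siblings), "\n";    # Tim, Jon, James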

My full script looks like this

use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
<h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
<h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
<h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
};

my $names = scraper {
    process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
        process 'p', 'name' => 'TEXT';
    };
};

my $res = $names->scrape( $sample );

print Dumper $res;

That outputs the following:

$VAR1 = {
          'dates' => [
                       {}
                     ]
        };

Any help with this problem would be appreciated.

Re: Extracting data-structure from HTML using Web::Scraper
by Athanasius (Archbishop) on Jul 14, 2012 at 07:15 UTC

    Hello windowbreaker,

    The following code does what you want, given the sample HTML fragment you supplied:

    #! perl
    use strict;
    use warnings;

    use Data::Dumper;
    use Web::Scraper;

    my $sample = q{
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
    };

    # Preprocess
    $sample =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;
    $sample =~ s{</div>}{};

    my $names = scraper {
        process '//div[contains(@class, "foo")]', 'groups[]' => scraper {
            process 'h4', 'date'    => 'TEXT';
            process 'p',  'names[]' => 'TEXT';
        };
    };

    my $temp = $names->scrape($sample);

    my @res;
    push @res, { $_->{'date'} => $_->{'names'} } for @{ $temp->{'groups'} };

    print Dumper(\@res);

    Output:

    $VAR1 = [
              {
                'July 12' => [
                               'Tim',
                               'Jon'
                             ]
              },
              {
                'July 13' => [
                               'James',
                               'Eric',
                               'Jerry',
                               'Susie'
                             ]
              },
              {
                'July 14' => [
                               'Kami',
                               'Darryl'
                             ]
              }
            ];

    The problem with the above is that for real HTML input, with nested nodes, comments, etc., the preprocessing logic quickly becomes so complicated as to render this whole approach impractical.
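    To make that fragility concrete, here is a small hypothetical example of mine (not from the original reply): the preprocessing regex is anchored to the start of a line, so an h4 that shares a line with other markup is never wrapped in a grouping div, and the '//div[contains(@class, "foo")]' selector then finds no groups at all.

    # Hypothetical illustration (not from the original reply) of where the
    # line-anchored preprocessing regex falls down on messier HTML.
    use strict;
    use warnings;

    my $real_world = q{<div><!-- schedule --><h4 class="bla">July 12</h4><p>Tim</p></div>};

    (my $copy = $real_world) =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;

    # The h4 does not start a line, so the substitution never fires.
    print $copy eq $real_world ? "h4 not wrapped\n" : "h4 wrapped\n";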

    Bottom line: Web::Scraper is probably just the wrong tool for this job. :-( ++Anonymous Monk for the answers below.

    Athanasius <°(((><  contra mundum

Re: Extracting data-structure from HTML using Web::Scraper
by Anonymous Monk on Jul 14, 2012 at 07:17 UTC

    Any help with this problem would be appreciated.

    One ugly way

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Web::Scraper;
    use Data::Dump;

    my $sample = q{
    <html><body>
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
    </body></html>
    };

    my $names = scraper {
        process q{//h4 | //p}, 'h4p[]', scraper {
            process q{//h4}, 'date',   'text';
            process q{//p},  'person', 'text';
        };
    };

    my $res = $names->scrape( $sample );
    dd $res;

    {
        my @root;
        for my $tag ( @{ $res->{h4p} } ){
            if( $$tag{date} ){
                pop @root;            # remove previous key
                push @root, {}, $$tag{date};
            }
            if( $$tag{person} ){
                push @{ $root[-2]->{
                            $root[-1] # key
                        } }, $$tag{person};
            }
        }
        pop @root if not ref $root[-1];
        dd \@root;
    }

    __END__
    {
      h4p => [
        { date => "July 12" },
        { person => "Tim" },
        { person => "Jon" },
        { date => "July 13" },
        { person => "James" },
        { person => "Eric" },
        { person => "Jerry" },
        { person => "Susie" },
        { date => "July 14" },
        { person => "Kami" },
        { person => "Darryl" },
      ],
    }
    [
      { "July 12" => ["Tim", "Jon"] },
      { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
      { "July 14" => ["Kami", "Darryl"] },
    ]

      With help from http://cpansearch.perl.org/src/MIYAGAWA/Web-Scraper-0.36/t/04_callback.t, a slight improvement:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Web::Scraper;
      use Data::Dump;

      my $sample = q{
      <html><body>
      <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
      <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
      <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
      </body></html>
      };

      sub scrap {
          my @root;
          my $names = scraper {
              process q{//h4 | //p}, sub {
                  if( $_->tag eq 'h4' ){
                      pop @root;
                      push @root, {}, $_->as_trimmed_text;
                  }
                  if( $_->tag eq 'p' ){
                      push @{ $root[-2]->{
                                  $root[-1] # key
                              } }, $_->as_trimmed_text;
                  }
              };
          };
          $names->scrape( @_ );
          pop @root if not ref $root[-1];
          return \@root;
      }

      dd scrap( \$sample );

      __END__
      [
        { "July 12" => ["Tim", "Jon"] },
        { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
        { "July 14" => ["Kami", "Darryl"] },
      ]

      Same with xsh

      The output

      $ xsh --html --quiet --non-interactive --load pm981742.xsh
      <?xml version="1.0" standalone="yes"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
      <html>
      <body>
      <h4 class="bla">July 12</h4>
      <p>Tim</p>
      <p>Jon</p>
      <h4 class="bla">July 13</h4>
      <p>James</p>
      <p>Eric</p>
      <p>Jerry</p>
      <p>Susie</p>
      <h4 class="date">July 14</h4>
      <p>Kami</p>
      <p>Darryl</p>
      </body>
      </html>
      {
        "July 12" => ["Tim", "Jon"],
        "July 13" => ["James", "Eric", "Jerry", "Susie"],
        "July 14" => ["Kami", "Darryl"],
      }

      The xsh script (xml shell script)

      open pm981742.xml;
      ls --indent /;
      for //body/* {
          $text = string(text());
          if( name() = "h4" ){ $key = $text; }
          if( name() = "p"  ){ perl { push @{ $hash{$key} }, $text; }; }
      }
      perl {
          use Data::Dump;
          dd \%hash;
          undef %hash;
          undef $key;
      };

        Since both Web::Scraper and xsh depend on XML::LibXML, you could use straight XML::LibXML. It's pretty much like xsh (same logic), but perhaps more verbose and less shelly :)

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use Data::Dump;
        use XML::LibXML 1.94;

        my $sample = q{
        <html><body>
        <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
        <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
        <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
        </body></html>
        };

        my $xml = XML::LibXML->load_xml( string => $sample );

        my @root;
        for my $element ( $xml->findnodes("//body/*") ){
            if( $element->tagName eq 'h4' ){
                pop @root;
                push @root, {}, $element->textContent;
            }
            if( $element->tagName eq 'p' ){
                push @{ $root[-2]->{
                            $root[-1] # key
                        } }, $element->textContent;
            }
        }
        pop @root if not ref $root[-1];
        dd \@root;

        __END__
        [
          { "July 12" => ["Tim", "Jon"] },
          { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
          { "July 14" => ["Kami", "Darryl"] },
        ]

      And XML::Twig, since the logic is the same:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump;
      use XML::Twig;

      my $sample = q{
      <html><body>
      <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
      <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
      <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
      </body></html>
      };

      my @root;
      my $xml = XML::Twig->new(
          twig_handlers => {
              '//body/h4' => sub {
                  dd $_->path;
                  pop @root;
                  push @root, {}, $_->text;
              },
              '//body/p' => sub {
                  dd $_->path;
                  push @{ $root[-2]->{
                              $root[-1] # key
                          } }, $_->text;
              },
          },
      );
      $xml->xparse( $sample );
      pop @root if not ref $root[-1];
      dd \@root;

      __END__
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      "/html/body/p"
      "/html/body/p"
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      [
        { "July 12" => ["Tim", "Jon"] },
        { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
        { "July 14" => ["Kami", "Darryl"] },
      ]
Re: Extracting data-structure from HTML using Web::Scraper
by windowbreaker (Sexton) on Jul 15, 2012 at 21:16 UTC

    Thanks to everyone who posted a solution. I learned a lot by reading through the different approaches to the problem.

    I also ended up working out a solution using nothing but Web::Scraper (one of my requirements), and wanted to post it here:

    use strict;
    use warnings;
    use Web::Scraper;
    use Data::Dumper;

    my $sample = q{
    <html>
    <body>
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="bla">July 14</h4> <p>Kami</p> <p>Darryl</p>
    </body>
    </html>
    };

    my $names = scraper {
        process 'h4.bla', 'names[]' => sub {
            my $elem  = shift;
            my $date  = $elem->as_text;
            my @names = ();
            for my $node ($elem->parent->findnodes(
                "//p[preceding-sibling::h4[1][. = '$date']]"
            )) {
                push @names, $node->as_text;
            }
            return { $date => \@names };
        };
    };

    my $res = $names->scrape( $sample );

    print Dumper $res;

    That will output the following:

    $VAR1 = {
              'names' => [
                           {
                             'July 12' => [
                                            'Tim',
                                            'Jon'
                                          ]
                           },
                           {
                             'July 13' => [
                                            'James',
                                            'Eric',
                                            'Jerry',
                                            'Susie'
                                          ]
                           },
                           {
                             'July 14' => [
                                            'Kami',
                                            'Darryl'
                                          ]
                           }
                         ]
            };
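    One small follow-up sketch (mine, not from the thread): the result above still carries the 'names' wrapper key, so getting the bare AoH exactly as shown in the original question is just a matter of dereferencing it.

    # Sketch (not from the thread): unwrap the 'names' key to get the bare AoH
    # that the question asked for.
    my $aoh = $res->{names};    # [ { 'July 12' => [...] }, ... ]
    print Dumper $aoh;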

    Again, thanks to everyone for the help, you guys are awesome!

      nothing but Web::Scraper

      :) but findnodes is not Web::Scraper, it's all XML::LibXML or HTML::TreeBuilder::XPath (: