windowbreaker has asked for the wisdom of the Perl Monks concerning the following question:

I need to process the following HTML using Web::Scraper, and produce a data structure (see below).

The HTML looks like this:

<h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
<h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
<h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>

I would like to create the following data structure (an AoH), though any suitable data structure which associates each name with the proper date would do.

[
  { 'July 12' => [ 'Tim', 'Jon' ] },
  { 'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ] },
  { 'July 14' => [ 'Kami', 'Darryl' ] },
]

I know I can accomplish this with other modules, but I need to be able to do this with the Web::Scraper module, if at all possible.

I am starting off by trying to figure out how to do it for just one of the dates, July 12. I figured once I get that working I'll do the same thing for all the dates, which is ultimately what I need.

What I've got so far is this:

my $names = scraper {
    process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
        process 'p', 'name' => 'TEXT';
    };
};

I know my first XPath is finding the correct h4 tag, but the problem is that the p tags I need are its siblings, not its children/descendants, so the expression 'p' in the nested scraper construct is not finding any p tags.
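To see the structural issue on its own, here is a minimal standalone sketch of mine (not part of the original question), using HTML::TreeBuilder::XPath, the HTML/XPath module Web::Scraper normally builds on (as the last reply below also notes): a descendant search from the matched h4 finds nothing, while the sibling axis does reach the p tags but also over-reaches past the next h4.

# Not from the original post: a standalone illustration of the sibling problem,
# using HTML::TreeBuilder::XPath directly rather than Web::Scraper.
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new_from_content(q{
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p>
});

my ($h4) = $tree->findnodes('//h4[@class="bla" and . = "July 12"]');

# Looking *inside* the h4, as the nested scraper does, finds no p tags at all.
my @children = $h4->findnodes('.//p');
print scalar(@children), " p descendants\n";              # 0 p descendants

# The sibling axis does reach them, but it also runs past the next h4.
my @siblings = $h4->findnodes('following-sibling::p');
print join(', ', map { $_->as_text } @siblings), "\n";    # Tim, Jon, James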

My full script looks like this

use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
<h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
<h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
<h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
};

my $names = scraper {
    process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
        process 'p', 'name' => 'TEXT';
    };
};

my $res = $names->scrape( $sample );

print Dumper $res;

That outputs the following:

$VAR1 = {
          'dates' => [
                       {}
                     ]
        };

Any help with this problem would be appreciated.

Re: Extracting data-structure from HTML using Web::Scraper
by Athanasius (Archbishop) on Jul 14, 2012 at 07:15 UTC

    Hello windowbreaker,

    The following code does what you want, given the sample HTML fragment you supplied:

    #! perl
    use strict;
    use warnings;

    use Data::Dumper;
    use Web::Scraper;

    my $sample = q{
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
    };

    # Preprocess
    $sample =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;
    $sample =~ s{</div>}{};

    my $names = scraper {
        process '//div[contains(@class, "foo")]', 'groups[]' => scraper {
            process 'h4', 'date'    => 'TEXT';
            process 'p',  'names[]' => 'TEXT';
        };
    };

    my $temp = $names->scrape($sample);

    my @res;
    push @res, { $_->{'date'} => $_->{'names'} } for @{ $temp->{'groups'} };

    print Dumper(\@res);

    Output:

    $VAR1 = [
              {
                'July 12' => [
                               'Tim',
                               'Jon'
                             ]
              },
              {
                'July 13' => [
                               'James',
                               'Eric',
                               'Jerry',
                               'Susie'
                             ]
              },
              {
                'July 14' => [
                               'Kami',
                               'Darryl'
                             ]
              }
            ];

    The problem with the above is that for real HTML input, with nested nodes, comments, etc., the preprocessing logic quickly becomes so complicated as to render this whole approach impractical.
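    To make that fragility concrete, here is a small hypothetical example of mine (not from the original reply): the preprocessing regex is anchored to the start of a line, so an h4 that shares a line with other markup is never wrapped in a grouping div, and the '//div[contains(@class, "foo")]' selector then finds no groups at all.

    # Hypothetical illustration (not from the original reply) of where the
    # line-anchored preprocessing regex falls down on messier HTML.
    use strict;
    use warnings;

    my $real_world = q{<div><!-- schedule --><h4 class="bla">July 12</h4><p>Tim</p></div>};

    (my $copy = $real_world) =~ s{ ( ^ \s* < \s* h4 ) }{</div><div class="foo">\n$1}gmx;

    # The h4 does not start a line, so the substitution never fires.
    print $copy eq $real_world ? "h4 not wrapped\n" : "h4 wrapped\n";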

    Bottom line: Web::Scraper is probably just the wrong tool for this job. :-( ++Anonymous Monk for the answers below.

    Athanasius <°(((><  contra mundum

Re: Extracting data-structure from HTML using Web::Scraper
by Anonymous Monk on Jul 14, 2012 at 07:17 UTC

    Any help with this problem would be appreciated.

    One ugly way

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Web::Scraper;
    use Data::Dump;

    my $sample = q{
    <html><body>
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
    </body></html>
    };

    my $names = scraper {
        process q{//h4 | //p}, 'h4p[]', scraper {
            process q{//h4}, 'date',   'text';
            process q{//p},  'person', 'text';
        };
    };

    my $res = $names->scrape( $sample );
    dd $res;

    {
        my @root;
        for my $tag ( @{ $res->{h4p} } ){
            if( $$tag{date} ){
                pop @root;            # remove previous key
                push @root, {}, $$tag{date};
            }
            if( $$tag{person} ){
                push @{ $root[-2]->{
                            $root[-1] # key
                        } }, $$tag{person};
            }
        }
        pop @root if not ref $root[-1];
        dd \@root;
    }

    __END__
    {
      h4p => [
        { date => "July 12" },
        { person => "Tim" },
        { person => "Jon" },
        { date => "July 13" },
        { person => "James" },
        { person => "Eric" },
        { person => "Jerry" },
        { person => "Susie" },
        { date => "July 14" },
        { person => "Kami" },
        { person => "Darryl" },
      ],
    }
    [
      { "July 12" => ["Tim", "Jon"] },
      { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
      { "July 14" => ["Kami", "Darryl"] },
    ]

      With help from http://cpansearch.perl.org/src/MIYAGAWA/Web-Scraper-0.36/t/04_callback.t, a slight improvement:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Web::Scraper;
      use Data::Dump;

      my $sample = q{
      <html><body>
      <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
      <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
      <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
      </body></html>
      };

      sub scrap {
          my @root;
          my $names = scraper {
              process q{//h4 | //p}, sub {
                  if( $_->tag eq 'h4' ){
                      pop @root;
                      push @root, {}, $_->as_trimmed_text;
                  }
                  if( $_->tag eq 'p' ){
                      push @{ $root[-2]->{
                                  $root[-1] # key
                              } }, $_->as_trimmed_text;
                  }
              };
          };
          $names->scrape( @_ );
          pop @root if not ref $root[-1];
          return \@root;
      }

      dd scrap( \$sample );

      __END__
      [
        { "July 12" => ["Tim", "Jon"] },
        { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
        { "July 14" => ["Kami", "Darryl"] },
      ]

      Same with xsh

      The output

      $ xsh --html --quiet --non-interactive --load pm981742.xsh
      <?xml version="1.0" standalone="yes"?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
      <html>
      <body>
      <h4 class="bla">July 12</h4>
      <p>Tim</p>
      <p>Jon</p>
      <h4 class="bla">July 13</h4>
      <p>James</p>
      <p>Eric</p>
      <p>Jerry</p>
      <p>Susie</p>
      <h4 class="date">July 14</h4>
      <p>Kami</p>
      <p>Darryl</p>
      </body>
      </html>
      {
        "July 12" => ["Tim", "Jon"],
        "July 13" => ["James", "Eric", "Jerry", "Susie"],
        "July 14" => ["Kami", "Darryl"],
      }

      The xsh script (xml shell script)

      open pm981742.xml;
      ls --indent /;
      for //body/* {
          $text = string(text());
          if( name() = "h4" ){ $key = $text; }
          if( name() = "p"  ){ perl { push @{ $hash{$key} }, $text; }; }
      }
      perl {
          use Data::Dump;
          dd \%hash;
          undef %hash;
          undef $key;
      };

        Since both Web::Scraper and xsh depend on XML::LibXML, you could use straight XML::LibXML. It's pretty much like xsh (same logic), but perhaps more verbose and less shelly :)

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use Data::Dump;
        use XML::LibXML 1.94;

        my $sample = q{
        <html><body>
        <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
        <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
        <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
        </body></html>
        };

        my $xml = XML::LibXML->load_xml( string => $sample );

        my @root;
        for my $element ( $xml->findnodes("//body/*") ){
            if( $element->tagName eq 'h4' ){
                pop @root;
                push @root, {}, $element->textContent;
            }
            if( $element->tagName eq 'p' ){
                push @{ $root[-2]->{
                            $root[-1] # key
                        } }, $element->textContent;
            }
        }
        pop @root if not ref $root[-1];
        dd \@root;

        __END__
        [
          { "July 12" => ["Tim", "Jon"] },
          { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
          { "July 14" => ["Kami", "Darryl"] },
        ]

      And XML::Twig, since the logic is the same:

      #!/usr/bin/perl --
      use strict;
      use warnings;
      use Data::Dump;
      use XML::Twig;

      my $sample = q{
      <html><body>
      <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
      <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
      <h4 class="date">July 14</h4> <p>Kami</p> <p>Darryl</p>
      </body></html>
      };

      my @root;
      my $xml = XML::Twig->new(
          twig_handlers => {
              '//body/h4' => sub {
                  dd $_->path;
                  pop @root;
                  push @root, {}, $_->text;
              },
              '//body/p' => sub {
                  dd $_->path;
                  push @{ $root[-2]->{
                              $root[-1] # key
                          } }, $_->text;
              },
          },
      );
      $xml->xparse( $sample );
      pop @root if not ref $root[-1];
      dd \@root;

      __END__
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      "/html/body/p"
      "/html/body/p"
      "/html/body/h4"
      "/html/body/p"
      "/html/body/p"
      [
        { "July 12" => ["Tim", "Jon"] },
        { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
        { "July 14" => ["Kami", "Darryl"] },
      ]
Re: Extracting data-structure from HTML using Web::Scraper
by windowbreaker (Sexton) on Jul 15, 2012 at 21:16 UTC

    Thanks to everyone who posted a solution. I learned a lot by reading through the different approaches to the problem.

    I also ended up working out a solution using nothing but Web::Scraper (one of my requirements), and wanted to post it here:

    use strict;
    use warnings;
    use Web::Scraper;
    use Data::Dumper;

    my $sample = q{
    <html>
    <body>
    <h4 class="bla">July 12</h4> <p>Tim</p> <p>Jon</p>
    <h4 class="bla">July 13</h4> <p>James</p> <p>Eric</p> <p>Jerry</p> <p>Susie</p>
    <h4 class="bla">July 14</h4> <p>Kami</p> <p>Darryl</p>
    </body>
    </html>
    };

    my $names = scraper {
        process 'h4.bla', 'names[]' => sub {
            my $elem  = shift;
            my $date  = $elem->as_text;
            my @names = ();
            for my $node ($elem->parent->findnodes(
                "//p[preceding-sibling::h4[1][. = '$date']]"
            )) {
                push @names, $node->as_text;
            }
            return { $date => \@names };
        };
    };

    my $res = $names->scrape( $sample );

    print Dumper $res;

    That will output the following:

    $VAR1 = {
              'names' => [
                           {
                             'July 12' => [
                                            'Tim',
                                            'Jon'
                                          ]
                           },
                           {
                             'July 13' => [
                                            'James',
                                            'Eric',
                                            'Jerry',
                                            'Susie'
                                          ]
                           },
                           {
                             'July 14' => [
                                            'Kami',
                                            'Darryl'
                                          ]
                           }
                         ]
            };
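    One small follow-up sketch (mine, not from the thread): the result above still carries the 'names' wrapper key, so getting the bare AoH exactly as shown in the original question is just a matter of dereferencing it.

    # Sketch (not from the thread): unwrap the 'names' key to get the bare AoH
    # that the question asked for.
    my $aoh = $res->{names};    # [ { 'July 12' => [...] }, ... ]
    print Dumper $aoh;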

    Again, thanks to everyone for the help, you guys are awesome!

      nothing but Web::Scraper

      :) but findnodes is not Web::Scraper, it's all XML::LibXML or HTML::TreeBuilder::XPath (: