Re^3: Extracting data-structure from HTML using Web::Scraper

Since both Web::Scraper and xsh depend on XML::LibXML, you could use XML::LibXML directly. The logic is pretty much the same as with xsh, just more verbose and less shelly :)


#!/usr/bin/perl --
use strict; use warnings;
use Data::Dump;
use XML::LibXML 1.94;

my $sample = q{
<html><body>
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
</body></html>
};


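# The sample is well-formed, so the XML parser can load it as-is
my $xml = XML::LibXML->load_xml(string => $sample );

# @root holds one hashref per date; while looping, the last slot
# temporarily holds the current date string (the key being filled)
my @root;

for my $element ( $xml->findnodes("//body/*") ){
    if( $element->tagName eq 'h4' ){
        # new date: drop the previous key, start a fresh hash + key
        pop @root;
        push @root, {}, $element->textContent;
    }
    if( $element->tagName eq 'p' ){
        # append the name under the current date (autovivifies the array)
        push @{
            $root[-2]->{
                $root[-1] # key
            }
        } , $element->textContent;
    }
}

# remove the leftover key string, leaving only hashrefs
pop @root if not ref $root[-1];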

dd \@root;

__END__
[
  { "July 12" => ["Tim", "Jon"] },
  { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
  { "July 14" => ["Kami", "Darryl"] },
]
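
If the pop/push bookkeeping with the key sitting at the end of @root feels fiddly, here is a rough variant of the same single pass that keeps a reference to the current name list instead. It is only a sketch: group_by_heading and the "stray" paragraph in the sample are made up for illustration, and it assumes the input is well-formed enough for load_xml.

```perl
#!/usr/bin/perl --
use strict; use warnings;
use XML::LibXML;

# Hypothetical sample: same shape as above, plus a <p> before any <h4>
my $sample = q{<html><body>
    <p>stray</p>
    <h4>July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4>July 13</h4>
    <p>James</p>
</body></html>};

# group_by_heading is an illustrative helper, not part of XML::LibXML
sub group_by_heading {
    my ($doc) = @_;
    my ( @groups, $names );    # $names: arrayref for the current heading
    for my $el ( $doc->findnodes('//body/*') ) {
        my $tag = $el->nodeName;
        if ( $tag eq 'h4' ) {
            # start a new group and remember where its names go
            $names = [];
            push @groups, { $el->textContent => $names };
        }
        elsif ( $tag eq 'p' and $names ) {
            # paragraphs seen before the first heading are skipped
            push @$names, $el->textContent;
        }
    }
    return \@groups;
}

my $groups = group_by_heading( XML::LibXML->load_xml( string => $sample ) );

for my $group (@$groups) {
    my ($date) = keys %$group;
    print "$date: @{ $group->{$date} }\n";
}
```

One difference from the script above: a heading with no paragraphs after it ends up with an empty array here, where the original leaves an empty hash.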