Re^2: Extracting data-structure from HTML using Web::Scraper

Same with xsh

The output

$ xsh --html --quiet --non-interactive --load pm981742.xsh
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
  </body>
</html>

{
  "July 12" => ["Tim", "Jon"],
  "July 13" => ["James", "Eric", "Jerry", "Susie"],
  "July 14" => ["Kami", "Darryl"],
}

The xsh script (xml shell script)

open pm981742.xml;

ls --indent /;

for //body/* {
    $text = string(text());
    if(  name() = "h4" ){
        $key = $text;
    }
    if( name() = "p" ){
        perl {
            push @{
                $hash{$key}
            }, $text;
        };
    }
}

perl {
    use Data::Dump;
    dd \%hash;
    undef %hash;
    undef $key;
};



Replies are listed 'Best First'.

Re^3: Extracting data-structure from HTML using Web::Scraper
by Anonymous Monk on Jul 14, 2012 at 07:58 UTC

Since both Web::Scraper and xsh depend on XML::LibXML, you could use XML::LibXML directly. The logic is pretty much the same as in the xsh script, but perhaps more verbose and less shelly :)


#!/usr/bin/perl --
use strict; use warnings;
use Data::Dump;
use XML::LibXML 1.94;

my $sample = q{
<html><body>
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
</body></html>
};


my $xml = XML::LibXML->load_xml(string => $sample );
my @root;

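# @root is used as a small stack: while scanning, its last item is the
# current key (heading text) and the item before it is that key's hash.
for my $element ( $xml->findnodes("//body/*") ){
    if( $element->tagName eq 'h4' ){
        pop @root;    # drop the previous key (a no-op on the first heading)
        push @root, {}, $element->textContent;    # new hash, then new key
    }
    if( $element->tagName eq 'p' ){
        push @{
            $root[-2]->{
                $root[-1] # key
            }
        } , $element->textContent;
    }
}

# remove the trailing key so only hashrefs remain
pop @root if not ref $root[-1];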

dd \@root;

__END__
[
  { "July 12" => ["Tim", "Jon"] },
  { "July 13" => ["James", "Eric", "Jerry", "Susie"] },
  { "July 14" => ["Kami", "Darryl"] },
]
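
The version above gives a list of one-key hashes, while the xsh script builds a single hash keyed by date. If you want the single hash from plain XML::LibXML, here is a minimal sketch. The sub name group_by_heading and the variable names are made up for illustration, and it assumes an XML::LibXML recent enough to have load_html. I used load_html with recover => 1 on the assumption that real-world pages are not always well-formed XML; for the sample in this thread, load_xml works too.

```perl
#!/usr/bin/perl --
use strict; use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): collect the text of each <p>
# under the text of the closest preceding <h4>.
sub group_by_heading {
    my( $html ) = @_;
    # Assumption: the input may be sloppy HTML, so use the HTML parser
    # in recover mode instead of load_xml.
    my $doc = XML::LibXML->load_html( string => $html, recover => 1 );
    my( %names_for, $key );
    for my $element ( $doc->findnodes('//body/*') ){
        my $name = $element->nodeName;
        if( $name eq 'h4' ){
            $key = $element->textContent;
            $names_for{$key} = [];
        }
        elsif( $name eq 'p' and defined $key ){
            # paragraphs before the first heading are skipped
            push @{ $names_for{$key} }, $element->textContent;
        }
    }
    return \%names_for;
}

my $sample = q{
<html><body>
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
</body></html>
};

my $result = group_by_heading( $sample );
for my $date ( sort keys %$result ){
    print "$date: @{ $result->{$date} }\n";
}
```

Note that a hash does not remember the order of the headings (hence the sort), which is one reason to prefer the array of hashes when document order matters.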