windowbreaker has asked for the wisdom of the Perl Monks concerning the following question:
I need to process the following HTML using Web::Scraper, and produce a data structure (see below).
The HTML looks like this:
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
I would like to create the following data structure (AoH), though any suitable data structure which associates each name with the proper date would do.
    [
        { 'July 12' => [ 'Tim', 'Jon' ] },
        { 'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ] },
        { 'July 14' => [ 'Kami', 'Darryl' ] },
    ]
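To make the target shape concrete, here is a minimal sketch in plain Perl of how such an AoH could be assembled once the headings and names are available in document order. It does not use Web::Scraper at all: the `@nodes` list of `[tag, text]` pairs and the `group_by_date` helper are hypothetical names invented for illustration, standing in for whatever a scraping pass over the `h4` and `p` elements might return.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical input: [tag, text] pairs in document order, as a
# single pass over the h4 and p elements might produce them.
my @nodes = (
    [ h4 => 'July 12' ], [ p => 'Tim' ],   [ p => 'Jon' ],
    [ h4 => 'July 13' ], [ p => 'James' ], [ p => 'Eric' ],
    [ p  => 'Jerry' ],   [ p => 'Susie' ],
    [ h4 => 'July 14' ], [ p => 'Kami' ],  [ p => 'Darryl' ],
);

# Fold the flat list into an array of single-key hashes:
# each h4 opens a new group, each p joins the most recent group.
sub group_by_date {
    my @pairs = @_;
    my @grouped;
    for my $pair (@pairs) {
        my ( $tag, $text ) = @$pair;
        if ( $tag eq 'h4' ) {
            push @grouped, { $text => [] };
        }
        elsif ( $tag eq 'p' && @grouped ) {
            # Each group hash holds exactly one key, so this
            # fetches the name list of the current date.
            my ($names) = values %{ $grouped[-1] };
            push @$names, $text;
        }
    }
    return \@grouped;
}

my $grouped = group_by_date(@nodes);
print Dumper($grouped);
```

Because every `p` is attached to the most recently seen `h4`, the association depends only on document order, not on the names being children of the heading.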
I know I can accomplish this with other modules, but I need to be able to do this with the Web::Scraper module, if at all possible.
I am starting off trying to figure out how to do it specifically for one of the dates, July 12. I figured once I get that I'll try to do the same things for all the dates, which is ultimately what I need.
What I've got so far is this:
    my $names = scraper {
        process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
            process 'p', 'name' => 'TEXT';
        };
    };
I know my first XPath expression is finding the correct h4 tag, but the problem is that the p tags I need are its siblings, not its children/descendants, so the expression 'p' in the nested scraper construct is not finding any 'p' tags.
My full script looks like this:
    use strict;
    use warnings;
    use Web::Scraper;
    use Data::Dumper;

    my $sample = q{
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
    };

    my $names = scraper {
        process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
            process 'p', 'name' => 'TEXT';
        };
    };

    my $res = $names->scrape( $sample );
    print Dumper $res;
That outputs the following:
    $VAR1 = {
              'dates' => [
                           {}
                         ]
            };
Any help with this problem would be appreciated.
Replies are listed 'Best First'.

Re: Extracting data-structure from HTML using Web::Scraper
by Athanasius (Archbishop) on Jul 14, 2012 at 07:15 UTC

Re: Extracting data-structure from HTML using Web::Scraper
by Anonymous Monk on Jul 14, 2012 at 07:17 UTC
by Anonymous Monk on Jul 14, 2012 at 07:27 UTC
by Anonymous Monk on Jul 14, 2012 at 07:41 UTC
by Anonymous Monk on Jul 14, 2012 at 07:58 UTC
by Anonymous Monk on Jul 14, 2012 at 08:27 UTC

Re: Extracting data-structure from HTML using Web::Scraper
by windowbreaker (Sexton) on Jul 15, 2012 at 21:16 UTC
by Anonymous Monk on Jul 15, 2012 at 23:38 UTC