I need to process the following HTML using Web::Scraper, and produce a data structure (see below).

The HTML looks like this:

<h4 class="bla">July 12</h4>
<p>Tim</p>
<p>Jon</p>
<h4 class="bla">July 13</h4>
<p>James</p>
<p>Eric</p>
<p>Jerry</p>
<p>Susie</p>
<h4 class="date">July 14</h4>
<p>Kami</p>
<p>Darryl</p>

I would like to create the following data structure (AoH), though any suitable data structure that associates each name with the proper date would do.

[
  {
     'July 12' => [ 'Tim', 'Jon' ]
  },
  {
    'July 13' => [ 'James', 'Eric', 'Jerry', 'Susie' ]
  },
  {
    'July 14' => [ 'Kami', 'Darryl' ]
  },
]

I know I can accomplish this with other modules, but I need to be able to do this with the Web::Scraper module, if at all possible.

I am starting off trying to figure out how to do it specifically for one of the dates, July 12. I figured once I get that I'll try to do the same things for all the dates, which is ultimately what I need.

What I've got so far is this:

my $names = scraper {
  process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
    process 'p', 'name' => 'TEXT';
  };
};

I know my first XPath expression finds the correct h4 tag, but the problem is that the p tags I need are its siblings, not its children/descendants, so the expression 'p' in the nested scraper construct does not find any 'p' tags.

My full script looks like this:

use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
};

my $names = scraper {
    process '//h4[@class="bla" and . = "July 12"]', 'dates[]' => scraper {
        process 'p', 'name' => 'TEXT';
    };
};

my $res = $names->scrape( $sample );
print Dumper $res;

That outputs the following:

$VAR1 = {
          'dates' => [
                       {}
                     ]
        };

Any help with this problem would be appreciated.
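
One possible direction, offered only as a sketch: since the p tags are siblings of each h4 rather than descendants, a nested scraper may not be the natural fit. Instead, a single process rule could collect every h4 and p in document order, and plain Perl could then group each name under the most recent date. The XPath `//*[self::h4 or self::p]`, the `items` key, and the `group_by_heading` helper are my own assumptions, not something from the code above; the sketch also assumes that a code reference passed to process receives the matched element, and that matching on tag name alone (ignoring the class attribute) is acceptable for this page.

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

my $sample = q{
    <h4 class="bla">July 12</h4>
    <p>Tim</p>
    <p>Jon</p>
    <h4 class="bla">July 13</h4>
    <p>James</p>
    <p>Eric</p>
    <p>Jerry</p>
    <p>Susie</p>
    <h4 class="date">July 14</h4>
    <p>Kami</p>
    <p>Darryl</p>
};

# Collect every h4 and p, in document order, as { tag => ..., text => ... }.
# Assumption: class is ignored, because the sample uses both "bla" and "date".
my $flat = scraper {
    process '//*[self::h4 or self::p]', 'items[]' => sub {
        my $elem = shift;
        return +{ tag => $elem->tag, text => $elem->as_text };
    };
};

# Hypothetical helper: each h4 opens a new { date => [names] } entry,
# and each p that follows is pushed onto the most recent entry.
sub group_by_heading {
    my @items = @_;
    my @grouped;
    for my $item (@items) {
        if ( $item->{tag} eq 'h4' ) {
            push @grouped, +{ $item->{text} => [] };
        }
        elsif (@grouped) {
            # Each entry has exactly one key, so this is its name list.
            my ($names) = values %{ $grouped[-1] };
            push @$names, $item->{text};
        }
        # A p that appears before any h4 is silently skipped.
    }
    return \@grouped;
}

my $res     = $flat->scrape($sample);
my $grouped = group_by_heading( @{ $res->{items} || [] } );
print Dumper $grouped;
```

If those assumptions hold, $grouped should have the AoH shape requested above. The trade-off is that the grouping happens after the scrape instead of inside a nested scraper block.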

In reply to Extracting data-structure from HTML using Web::Scraper by windowbreaker
