Re: Extracting specific <p>content</p> using WWW::Mechanize

Hi mserino,

While you could certainly get at an individual element of the returned list by index, that would be cumbersome on your example page, which has more than 600 <p> tags. I think you would be better served by searching for a match in the paragraphs you want. Here's an example:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;
use Data::Dumper;

use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/roger-altman");

for ( $mech->look_down(_tag => 'p') ) { 
  next unless $_->as_trimmed_text =~ m/a haircut in California/;
  say $_->as_HTML;
}

__END__

Output:

<p>The election occurred and about a week later Reagan was getting a haircut in California and he came out of the barbershop. There had been some minor event that day and he was asked what he thought of the latest developments in Iran and he said, <q>I don&rsquo;t talk about barbarians.</q>

It's no problem to build up a list of paragraphs, push them onto your output, and then go get some other content. You can definitely do it all in one program. Hope this helps!
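
As a rough sketch of collecting several paragraphs into one output list: the version below parses an inline, made-up HTML snippet with HTML::TreeBuilder so it runs without a network connection. The snippet, the @wanted patterns and the @output array are assumptions for illustration only; with WWW::Mechanize::TreeBuilder applied, the same look_down call should work on $mech directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

use HTML::TreeBuilder;

# Hypothetical stand-in for the fetched page, so the sketch runs offline.
my $html = <<'HTML';
<html><body>
<p>Reagan was getting a haircut in California.</p>
<p>An unrelated paragraph.</p>
<p>Later he was asked about Iran.</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical patterns, one per paragraph you want to keep.
my @wanted = ( qr/a haircut in California/, qr/asked about Iran/ );

# Walk every <p> once and keep the text of those matching any pattern.
my @output;
for my $p ( $tree->look_down( _tag => 'p' ) ) {
  my $text = $p->as_trimmed_text;
  next unless grep { $text =~ $_ } @wanted;
  push @output, $text;
}

say for @output;
$tree->delete;
```

Pushing the trimmed text (or as_HTML, if you want the markup) onto an array keeps the matching step separate from whatever you do with the results afterwards.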

The way forward always starts with a minimal test.

Replies are listed 'Best First'.

Re^2: Extracting specific <p>content</p> using WWW::Mechanize
by mserino (Initiate) on Sep 10, 2015 at 14:02 UTC

This was extremely helpful, thank you! I wasn't aware of look_down and had I known of its existence from the start it would have made life so much easier.

I'm having some trouble grabbing content from the aside, however. This could be caused by the HTML structure of the website, but I'm hoping I'm wrong and that a workaround exists. I'm trying to grab the "Interviewers" list, which lies inside a div with the ID "innercontent". The problem is that even when I give it a string match, it returns everything inside the innercontent div. Here's what I have.


#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
use feature qw/ say /;
use Data::Dumper;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-a$

# introduction
for ( $mech->look_down(_tag => "div", id => 'introduction') ) {
  next unless $_->as_trimmed_text =~ m/Publicly released transcripts/;
  say $_->as_HTML;
}

# interviewers
for ( $mech->look_down(_tag => "div", id => 'innercontent') ) {
  next unless $_->as_trimmed_text =~ m/Interviewers:/;
  say $_->as_HTML;
}

# interview
my @list = $mech->find('dl');
foreach ( @list ) {
  print $_->as_HTML();
}
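
One likely reason the whole div comes back: look_down(_tag => "div", id => 'innercontent') returns the div element itself, so the regex is tested against all of the div's text and as_HTML then prints the entire div. A minimal sketch of narrowing the search follows, assuming a made-up fragment in which the "Interviewers:" label is an h3 followed by a ul. The real page's markup may well differ, so the tag names here are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;

use HTML::TreeBuilder;

# Hypothetical markup: the real structure of the innercontent div may differ.
my $html = <<'HTML';
<html><body>
<div id="innercontent">
  <h3>Interviewers:</h3>
  <ul><li>First Interviewer</li><li>Second Interviewer</li></ul>
  <h3>Other heading</h3>
  <ul><li>Unrelated item</li></ul>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find the enclosing div first (scalar context returns the first match).
my $div = $tree->look_down( _tag => 'div', id => 'innercontent' )
  or die "no innercontent div found";

# Then search inside it for the smallest element that carries the label.
my $label = $div->look_down(
  _tag => 'h3',
  sub { $_[0]->as_trimmed_text =~ m/Interviewers:/ },
);

# Assumption: the list is the first <ul> among the label's right-hand siblings.
my $list;
if ($label) {
  ($list) = grep { ref $_ && $_->tag eq 'ul' } $label->right;
}

my @interviewers =
  $list ? map { $_->as_trimmed_text } $list->look_down( _tag => 'li' ) : ();

say for @interviewers;
```

The same idea applies whatever the real tags turn out to be: match on the innermost element that contains the label, then step to its siblings or children, rather than matching on the outer div.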