Extracting specific <p>content</p> using WWW::Mechanize

mserino has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks :)

I'm working on a script that will extract content from a webpage. So far WWW::Mechanize seems to be the go-to tool for this. I'm a bit lost, though: there are a few things I'd like to extract into a single HTML document, and I'm not sure how to tell one paragraph from another, i.e., I only care about the content in a specific paragraph tag.

Here's what I have so far.

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/roger-altman");

# find all <dl> tags
my @list = $mech->find('dl');

foreach ( @list ) {
    print $_->as_HTML();
}

The code above works fine. The problem is that it extracts only the interview, which was originally all we thought we needed. Now I also want to extract the "introduction" content (which is in a paragraph tag) and some sidebar information.

I understand that I can search for p's instead of dl's, but how can I write my script so that it picks out only the nth occurrence, so that the output contains just the paragraph I'm looking for?

Lastly, is this all doable in one script? Is it possible to tack the introduction onto the dl HTML that the code above generates?

Advice, guidance, help is greatly appreciated. Thank you!

Re: Extracting specific <p>content</p> using WWW::Mechanize by 1nickt (Canon) on Sep 09, 2015 at 19:43 UTC
Hi mserino,

While you could certainly get at an individual element of the list returned, it would be cumbersome to access it that way in your example page, where there are more than 600 `<p>` tags. I think you would be better served searching for a match in the paragraphs you want. Here's an example:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/ say /;
use Data::Dumper;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/roger-altman");

for ( $mech->look_down(_tag => 'p') ) {
    next unless $_->as_trimmed_text =~ m/a haircut in California/;
    say $_->as_HTML;
}
__END__

Output:

<p>The election occurred and about a week later Reagan was getting a haircut in California and he came out of the barbershop. There had been some minor event that day and he was asked what he thought of the latest developments in Iran and he said, <q>I don’t talk about barbarians.</q>

No problem to build up a list of paragraphs and push them onto your output, then get some other content. You can definitely do it in one program.

Hope this helps!

The way forward always starts with a minimal test.
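As a minimal sketch of the two approaches discussed above (indexing into the list of `<p>` elements versus matching on text), plus collecting several pieces into one output, something like the following might work. It uses HTML::TreeBuilder directly on an invented HTML snippet so that it runs without network access; the snippet, its structure, and the variable names are assumptions for illustration, not the real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented stand-in for the real page (an assumption, not its actual markup).
my $html = <<'HTML';
<html><body>
<div id="introduction"><p>Publicly released transcripts are edited.</p></div>
<p>First unrelated paragraph.</p>
<p>Reagan was getting a haircut in California that day.</p>
<dl><dt>Q</dt><dd>An interview answer.</dd></dl>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Approach 1: take the nth <p> by position (0-based, so [2] is the third).
my @paragraphs = $tree->look_down(_tag => 'p');
my $third      = $paragraphs[2];

# Approach 2: keep only the paragraphs whose text matches a pattern.
my @matches = grep { $_->as_trimmed_text =~ /a haircut in California/ } @paragraphs;

# Gather the wanted pieces in order, then print them as one chunk of HTML.
my @output = map { $_->as_HTML } ( @matches, $tree->find('dl') );
print @output;
```

Indexing is brittle if the page layout changes, which is one reason matching on text (or on an id or class, where the page provides one) tends to hold up better.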
Re^2: Extracting specific <p>content</p> using WWW::Mechanize by mserino (Initiate) on Sep 10, 2015 at 14:02 UTC
This was extremely helpful, thank you! I wasn't aware of look_down, and had I known of its existence from the start it would have made life so much easier.

I'm having some trouble grabbing content from the aside, however. This could be caused by the HTML structure of the website, but I'm hoping I'm wrong and that a workaround exists. I'm trying to grab the "Interviewers" list, which lies inside a div with the ID "innercontent". The problem is that even when I give it a string match, it returns everything inside the innercontent div. Here's what I have.

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;
use feature qw/ say /;
use Data::Dumper;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-a$

# introduction
for ( $mech->look_down(_tag => "div", id => 'introduction') ) {
    next unless $_->as_trimmed_text =~ m/Publicly released transcripts/;
    say $_->as_HTML;
}

# interviewers
for ( $mech->look_down(_tag => "div", id => 'innercontent') ) {
    next unless $_->as_trimmed_text =~ m/Interviewers:/;
    say $_->as_HTML;
}

# interview
my @list = $mech->find('dl');

foreach ( @list ) {
    print $_->as_HTML();
}
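One plausible reading of the behaviour described above: the regex is tested against the text of the whole innercontent div, so when it matches, the whole div is what gets printed. A possible workaround, sketched here under assumptions, is to find the div first and then search inside it for the smaller element that holds the text. The HTML below is invented, and the choice of `p` as the inner tag is a guess; the real sidebar may use a different element.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sidebar markup (an assumption; the real page may differ).
my $html = <<'HTML';
<html><body>
<div id="innercontent">
  <p>Some other sidebar text.</p>
  <p><strong>Interviewers:</strong> A. Example, B. Sample</p>
  <p>More sidebar text.</p>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $div = $tree->look_down(_tag => 'div', id => 'innercontent');

# Search within the div only; the coderef criterion receives each
# candidate element as $_[0] and keeps it when it returns true.
my @hits = $div->look_down(
    _tag => 'p',
    sub { $_[0]->as_trimmed_text =~ /Interviewers:/ },
);

print $_->as_HTML for @hits;
```

Because the text test now runs per `<p>` rather than per div, only the paragraph containing "Interviewers:" is kept.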