in reply to Dump a Web PAge to a text File

If you are looking to extract information from a web page, then you should be using the HTML structure of the page to help you find the information you are looking for. If you flatten that structure to plain text, then you will find it harder to parse the page.

My advice is to go to CPAN and download an HTML parser module such as HTML::TreeBuilder or HTML::TokeParser::Simple; both come highly recommended. Of the two, my preference is HTML::TreeBuilder.

Also, so that you know the structure of your web page, I suggest you install a GUI HTML tree inspector such as Firebug, or use the Inspect Element tool in Google Chrome, to show you where the elements you are looking for sit in the HTML structure.

With these two tools, you can very easily drill into the structure of an HTML page, and find exactly what you need.

To take your example, where you are trying to extract links from a web page: once you have fetched the page, you can make a list of its links to other pages with the following:

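    use HTML::TreeBuilder;

    # $html must hold the page source as a string, not a file name.
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # extract_links returns an array reference, so dereference it.
    my @links = @{ $tree->extract_links('a') };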

extract_links gives you an entry for every link on the page, each one an array reference whose first element is the URL. The problem is that on most web pages that means hundreds of links. With HTML::TreeBuilder you can drill into the structure of a page to find the part you are interested in, and then extract just the links you need from that part of the page.
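As a rough sketch of that drilling-down idea, the snippet below parses a small inline page instead of a fetched one. The markup, the "nav" and "content" ids and the URLs are all made up for illustration; only look_down and extract_links come from HTML::TreeBuilder itself.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: a navigation block plus the block we care about.
my $html = <<'HTML';
<html><body>
  <div id="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
  <div id="content">
    <p>See <a href="/docs/intro">the intro</a> and
       <a href="/docs/faq">the FAQ</a>.</p>
  </div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Narrow the search to one subtree before extracting links.
my $content = $tree->look_down( _tag => 'div', id => 'content' )
    or die "No content div found\n";

# extract_links returns an array reference; each entry is itself an
# array reference of [ $url, $element, $attribute_name, $tag_name ].
my @content_links = map { $_->[0] } @{ $content->extract_links('a') };

print "$_\n" for @content_links;

# Free the tree once we are finished with it.
$tree->delete;
```

Only the two links inside the content div are printed; the navigation links are never visited because the search starts from the subtree rather than from the whole page.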

For example, to fetch the monk picture at the top right of each PerlMonks web page:

    use 5.010;
    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    my $fetch_result = $ua->get("http://www.perlmonks.org");
    my $tree = HTML::TreeBuilder->new_from_content($fetch_result->content);
    my $banner_row = $tree->look_down( '_tag' => 'tr', class => 'bannerrow' );
    my $img_objects = $banner_row->extract_links('img');

    # Skip over the advert image.
    my @monksImage = grep { $_->[0] =~ m/perlmonks/ } @$img_objects;

    # Prove the script worked.
    say "Today's image points to: " . $monksImage[0][0];
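The script above assumes the banner row and the image are always there. look_down returns nothing when no element matches, so a more defensive variant might look like the sketch below. The helper name banner_image_url and the sample markup (including both image URLs) are invented for illustration, so that it runs without touching the network.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the first image URL in the banner row that
# matches $pattern, or undef if the row or a matching image is missing.
sub banner_image_url {
    my ( $html, $pattern ) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my $url;
    if ( my $row = $tree->look_down( _tag => 'tr', class => 'bannerrow' ) ) {
        # Keep only the URL from each [ $url, $element, $attr, $tag ] entry.
        ($url) = grep { $_ =~ $pattern }
                 map  { $_->[0] } @{ $row->extract_links('img') };
    }
    $tree->delete;
    return $url;
}

# Made-up markup shaped like a banner row with an advert and a monk image.
my $sample = <<'HTML';
<html><body><table>
  <tr class="bannerrow">
    <td><img src="http://ads.example.com/banner.gif"></td>
    <td><img src="http://www.perlmonks.org/hypothetical/monk.gif"></td>
  </tr>
</table></body></html>
HTML

my $found   = banner_image_url( $sample, qr/perlmonks/ );
my $missing = banner_image_url( '<html><body><p>No banner</p></body></html>',
                                qr/perlmonks/ );

print defined $found   ? "Found: $found\n"   : "No image found\n";
print defined $missing ? "Found: $missing\n" : "No image found\n";
```

Returning undef rather than dying leaves it to the caller to decide whether a missing banner is an error or just a page with a different layout.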

Re^2: Dump a Web PAge to a text File
by Anonymous Monk on Mar 14, 2011 at 10:53 UTC