If you want to extract information from a web page, use the HTML structure of the page to help you find it. If you flatten that structure to plain text, the page becomes harder to parse.

My advice is to go to CPAN and download an HTML parser module such as HTML::TreeBuilder or HTML::TokeParser::Simple. Both come highly recommended. Of the two, my preference is for HTML::TreeBuilder.
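For comparison, here is a minimal sketch of the token-based style that HTML::TokeParser::Simple offers. The sample markup is made up for illustration, and the sketch assumes the module is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample markup, used only for illustration.
my $html = '<p><a href="/one">One</a> and <a href="/two">Two</a></p>';

# Parse from a string rather than a file or URL.
my $parser = HTML::TokeParser::Simple->new( string => $html );

# Walk the token stream and collect the href of every <a> start tag.
my @hrefs;
while ( my $token = $parser->get_token ) {
    next unless $token->is_start_tag('a');
    my $href = $token->get_attr('href');
    push @hrefs, $href if defined $href;
}

print "$_\n" for @hrefs;
```

The token stream is flat, so you have to track your own position in the page. That is the main reason I prefer the tree that HTML::TreeBuilder gives you.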

Also, to see how your web page is structured, I suggest using a GUI HTML tree inspector such as Firebug or the Inspect Element tool in Google Chrome. It will show you where the elements you are looking for sit in the HTML structure.

With these two tools, you can very easily drill into the structure of an HTML page, and find exactly what you need.
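If you have no GUI inspector to hand, HTML::TreeBuilder can also print an outline of the tree it has parsed. This is a minimal sketch, assuming a small made-up page; the markup and the variable names are only for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, used only for illustration.
my $html = '<html><body><table><tr class="bannerrow">'
         . '<td><img src="/monk.png"></td></tr></table></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# dump() writes an indented outline of the tree, one element per line.
# Here it goes to an in-memory filehandle so the text can be reused;
# calling $tree->dump with no argument prints straight to STDOUT.
open my $fh, '>', \my $outline or die "Cannot open in-memory handle: $!";
$tree->dump($fh);
close $fh;

print $outline;

# Free the tree once it is no longer needed.
$tree->delete;
```

Each line of the outline shows a start tag with its attributes, which is usually enough to work out what to pass to look_down.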

To take your example, you are trying to extract links from a web page. Once you have fetched the page, you can make a list of its links to other pages with the following:

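# $html holds the fetched page source as a string.
my $tree = HTML::TreeBuilder->new_from_content($html);
# extract_links returns an array reference; each entry is [ $url, $element, $attribute, $tag ].
my @links = @{ $tree->extract_links('a') };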

Now @links covers every link on the page. The problem is that on most web pages that means hundreds of links. With HTML::TreeBuilder you can instead drill into the structure of the page to find the part you are interested in, and then extract just what you need from that part.
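The difference between the two approaches can be sketched offline as follows. The page, the div ids and the variable names are all made up for illustration; only look_down and extract_links come from HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample page, used only for illustration.
my $html = <<'HTML';
<html><body>
  <div id="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
  <div id="content"><a href="/node/1">First</a> <a href="/node/2">Second</a></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Every link on the page: take the URL (first field) of each entry.
my @all = map { $_->[0] } @{ $tree->extract_links('a') };

# Only the links inside the part of the page we care about.
my $content = $tree->look_down( _tag => 'div', id => 'content' )
    or die "content div not found";

my @wanted;
for my $entry ( @{ $content->extract_links('a') } ) {
    # Each entry is [ $url, $element, $attribute_name, $tag_name ].
    my ( $url, $element, $attr, $tag ) = @$entry;
    push @wanted, $url;
}

print "All links:     @all\n";
print "Content links: @wanted\n";

# Free the tree once it is no longer needed.
$tree->delete;
```

Calling extract_links on a sub-element rather than on the whole tree is what keeps the result down to the handful of links you actually want.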

For example, to fetch the monk picture at the top right of each PerlMonks web page:

use 5.010;
use warnings;
use strict;
use HTML::TreeBuilder;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
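my $fetch_result = $ua->get("http://www.perlmonks.org");
die "Fetch failed: " . $fetch_result->status_line
    unless $fetch_result->is_success;
my $tree = HTML::TreeBuilder->new_from_content($fetch_result->decoded_content);

# Find the table row that holds the page banner.
my $banner_row = $tree->look_down( '_tag' => 'tr', class => 'bannerrow' )
    or die "No banner row found on the page";
my $img_objects = $banner_row->extract_links('img');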

# Skip over the advert image.
my @monksImage = grep{$_->[0] =~ m/perlmonks/} @$img_objects;

# Prove the script worked.
say "Today's image points to: ".$monksImage[0][0];

In reply to Re: Dump a Web PAge to a text File by chrestomanci
in thread Dump a Web PAge to a text File by shayak
