shayak has asked for the wisdom of the Perl Monks concerning the following question:

I want the complete URL of a particular file so I can download it. How can I convert the web page that contains that URL into a text file and then extract the information from it?

Replies are listed 'Best First'.
Re: Dump a Web Page to a text File
by Corion (Patriarch) on Mar 14, 2011 at 08:26 UTC

    See WWW::Mechanize, or just URI.

    I'm not sure where your concrete problem is, as you don't tell us where you have problems and don't show any code. URI makes it easy to construct a URL given the base URL and the relative path. WWW::Mechanize makes it easy to navigate web pages and extract information from them.
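
    A minimal sketch of the URI approach; the base URL and relative path here are invented examples, not taken from any real site:

    ```perl
    use strict;
    use warnings;
    use URI;

    # Hypothetical page the link was found on, and a relative href from its HTML.
    my $base = 'http://www.example.com/downloads/index.html';
    my $rel  = '../files/report.pdf';

    # new_abs() resolves a relative link against the page it was found on.
    my $abs = URI->new_abs( $rel, $base );
    print $abs->as_string, "\n";   # http://www.example.com/files/report.pdf
    ```

    Feed it whatever href you scraped from the page and you get back a complete URL you can hand to LWP for the actual download.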

      I want to download an arbitrary file from a website. So I wanted to go to a particular website and then download a particular file from it. For that I need to convert the web page into a text file to extract the URI of the file.

        Consider using the WWW::Mechanize module. The documentation is clear and it provides the functionality you require to solve your current problem.

        You've been advised several times now to look at a module such as WWW::Mechanize, which has great documentation. For some reason you aren't telling us why you don't want to take this advice. If you did take it, you'd find it an easy solution to your current problem.

        Update: fixed link to WWW::Mechanize, thanks Corion.

        Update 2: My mistake, updated reply.

Re: Dump a Web Page to a text File
by chrestomanci (Priest) on Mar 14, 2011 at 10:26 UTC

    If you are looking to extract information from a web page, then you should be using the HTML structure of the page to help you find the information you are looking for. If you flatten that structure to plain text, then you will find it harder to parse the page.

    My advice is to go to CPAN and download an HTML parser module such as HTML::TreeBuilder or HTML::TokeParser::Simple; both come highly recommended. Of the two, my preference is HTML::TreeBuilder.

    Also, so that you know what the structure of your web page is, I suggest you install a GUI HTML tree inspector such as Firebug, or use the Inspect Element tool in Google Chrome, to show you where the elements you are looking for sit in the HTML structure.

    With these two tools, you can very easily drill into the structure of an HTML page, and find exactly what you need.

    To take your example: you are trying to extract links from a web page. You can fetch a page and make a list of links to other pages with the following:

    my $tree  = HTML::TreeBuilder->new_from_content($html_content);
    # extract_links() returns an array ref; dereference it to get a flat list.
    my @links = @{ $tree->extract_links('a') };

    Now the array @links will contain all the links on the page. The problem is that on most web pages that will return hundreds of links. With HTML::TreeBuilder you can drill into the structure of a page to find the parts you are interested in, and then extract just the stuff you need from that part of the page.
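
    For instance, a simple grep over the returned tuples can narrow hundreds of links down to just the files you care about. The HTML fragment below is an invented example; each tuple returned by extract_links() is [url, element, attribute, tag]:

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Invented page fragment; in practice this would come from LWP or WWW::Mechanize.
    my $html = <<'HTML';
    <html><body>
    <a href="/about.html">About</a>
    <a href="/files/data.csv">Download data</a>
    <a href="/files/report.pdf">Download report</a>
    </body></html>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # extract_links('a') returns an array ref of [url, element, attribute, tag] tuples.
    my $links = $tree->extract_links('a');

    # Keep only the URLs that point at PDF files.
    my @pdf_urls = map { $_->[0] } grep { $_->[0] =~ /\.pdf\z/ } @$links;
    print "$_\n" for @pdf_urls;   # /files/report.pdf

    $tree->delete;                # free the parse tree
    ```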

    For example, to fetch the monk picture at the top right of each perl monks web page:

    use 5.010;
    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use LWP::UserAgent;

    my $ua           = LWP::UserAgent->new;
    my $fetch_result = $ua->get("http://www.perlmonks.org");
    my $tree         = HTML::TreeBuilder->new_from_content( $fetch_result->content );
    my $banner_row   = $tree->look_down( '_tag' => 'tr', class => 'bannerrow' );
    my $img_objects  = $banner_row->extract_links('img');

    # Skip over the advert image.
    my @monksImage = grep { $_->[0] =~ m/perlmonks/ } @$img_objects;

    # Prove the script worked.
    say "Today's image points to: " . $monksImage[0][0];
Re: Dump a Web Page to a text File
by Anonymous Monk on Mar 14, 2011 at 08:24 UTC
Re: Dump a Web Page to a text File
by Anonymous Monk on Nov 09, 2013 at 14:12 UTC

    I also have the same problem, i.e., I want to create a text file from a particular web page. Can you suggest something innovative?

      How did the solutions that were presented not already solve your problem?