shayak has asked for the wisdom of the Perl Monks concerning the following question:

I want the complete URL of a particular file so I can download it. How can I convert the web page that contains that URL into a text file and then extract the information from it?

Replies are listed 'Best First'.
Re: Dump a Web Page to a text File
by Corion (Patriarch) on Mar 14, 2011 at 08:26 UTC

    See WWW::Mechanize, or just URI.

    I'm not sure where your concrete problem is, as you don't tell us where you have problems and don't show any code. URI makes it easy to construct a URL given the base URL and the relative path. WWW::Mechanize makes it easy to navigate web pages and extract information from them.
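
    A minimal sketch of the URI approach; the base URL and relative path here are invented examples, not taken from any real site:

    ```perl
    use strict;
    use warnings;
    use URI;

    # Hypothetical page the link was found on, and a relative href from its HTML.
    my $base = 'http://www.example.com/downloads/index.html';
    my $rel  = '../files/report.pdf';

    # new_abs() resolves a relative link against the page it was found on.
    my $abs = URI->new_abs( $rel, $base );
    print $abs->as_string, "\n";   # http://www.example.com/files/report.pdf
    ```

    Feed it whatever href you scraped from the page and you get back a complete URL you can hand to LWP for the actual download.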

      I want to download an arbitrary file from a website. So I wanted to go to a particular website and then download a particular file from it. For that I need to convert the web page into a text file to extract the URI of the file.

        Consider using the WWW::Mechanize module. The documentation is clear and it provides the functionality you require to solve your current problem.

        You've been advised several times now to look at a module such as WWW::Mechanize, which has great documentation. For some reason you aren't telling us why you don't want to take this advice. If you did take it, you'd find it an easy solution to your current problem.

        Update: fixed link to WWW::Mechanize, thanks Corion.

        Update 2: My mistake, updated reply.

Re: Dump a Web Page to a text File
by chrestomanci (Priest) on Mar 14, 2011 at 10:26 UTC

    If you are looking to extract information from a web page, then you should be using the HTML structure of the page to help you find the information you are looking for. If you flatten that structure to plain text, then you will find it harder to parse the page.

    My advice is to go to CPAN and download an HTML parser module such as HTML::TreeBuilder or HTML::TokeParser::Simple; both come highly recommended. Of the two, my preference is HTML::TreeBuilder.

    Also, so that you know what the structure of your web page is, I suggest you install a GUI HTML tree inspector such as Firebug, or use the Inspect Element tool in Google Chrome, to show you where the elements you are looking for sit in the HTML structure.

    With these two tools, you can very easily drill into the structure of an HTML page, and find exactly what you need.

    To take your example: you are trying to extract links from a web page. You can fetch a page and make a list of links to other pages with the following:

    my $tree  = HTML::TreeBuilder->new_from_content($html_content);
    # extract_links() returns an array ref; dereference it to get a flat list.
    my @links = @{ $tree->extract_links('a') };

    Now the array @links will contain all the links on the page. The problem is that on most web pages that will return hundreds of links. With HTML::TreeBuilder you can drill into the structure of a page to find the parts you are interested in, and then extract just the stuff you need from that part of the page.
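
    For instance, a simple grep over the returned tuples can narrow hundreds of links down to just the files you care about. The HTML fragment below is an invented example; each tuple returned by extract_links() is [url, element, attribute, tag]:

    ```perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Invented page fragment; in practice this would come from LWP or WWW::Mechanize.
    my $html = <<'HTML';
    <html><body>
    <a href="/about.html">About</a>
    <a href="/files/data.csv">Download data</a>
    <a href="/files/report.pdf">Download report</a>
    </body></html>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # extract_links('a') returns an array ref of [url, element, attribute, tag] tuples.
    my $links = $tree->extract_links('a');

    # Keep only the URLs that point at PDF files.
    my @pdf_urls = map { $_->[0] } grep { $_->[0] =~ /\.pdf\z/ } @$links;
    print "$_\n" for @pdf_urls;   # /files/report.pdf

    $tree->delete;                # free the parse tree
    ```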

    For example, to fetch the monk picture at the top right of each perl monks web page:

    use 5.010;
    use warnings;
    use strict;
    use HTML::TreeBuilder;
    use LWP::UserAgent;

    my $ua           = LWP::UserAgent->new;
    my $fetch_result = $ua->get("http://www.perlmonks.org");
    my $tree         = HTML::TreeBuilder->new_from_content( $fetch_result->content );
    my $banner_row   = $tree->look_down( '_tag' => 'tr', class => 'bannerrow' );
    my $img_objects  = $banner_row->extract_links('img');

    # Skip over the advert image.
    my @monksImage = grep { $_->[0] =~ m/perlmonks/ } @$img_objects;

    # Prove the script worked.
    say "Today's image points to: " . $monksImage[0][0];
Re: Dump a Web Page to a text File
by Anonymous Monk on Mar 14, 2011 at 08:24 UTC
Re: Dump a Web Page to a text File
by Anonymous Monk on Nov 09, 2013 at 14:12 UTC

    I also have the same problem, i.e., I want to create a text file from a particular web page. Can you suggest something innovative?

      How did the solutions that were presented not already solve your problem?