cheetahpimp has asked for the wisdom of the Perl Monks concerning the following question:

OK, no one laugh at the new guy, OK? Thanks. It's like this: I'm working on a script for data mining... but instead of working on an intranet, it works on the open web for a defined number of hops. The problem I'm running into, in my limited knowledge, is this: how do I get the source (not the URL) of a web page? I need to get the URLs off the page, and the easiest way (as far as I can see) is to start with the source and pull them out of that... Help... please?

Replies are listed 'Best First'.
Re: web page source?
by Falkkin (Chaplain) on Feb 22, 2001 at 06:48 UTC
    To get the source, I'd get LWP::Simple from CPAN. The code to get your source would then be a simple 2-liner:
    use LWP::Simple;
    my $source = get("http://whatever.url.you/want/to/view.html");
    You only need the "use" directive once in your program; use the get() command every time you need to get the source of a page.

    Writing an HTML parser by hand is very non-trivial... I'd look at HTML::Parser (again, at CPAN) and see if that'll make your life easier. I've not really used HTML::Parser before, but, by looking at the documentation and playing around for the last 15 minutes, it appears you'd want to do something like the following:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::Parser;

    my $source = get("http://www.perlmonks.org");
    my $parser = HTML::Parser->new();

    # call &function for every start tag, passing the tag name and its attributes
    $parser->handler( start => \&function, 'tagname, attr' );
    $parser->parse($source);

    sub function {
        my ($tag_name, $attr_ref) = @_;
        if ($tag_name eq 'a') {
            my %attr = %$attr_ref;
            print $attr{href}, "\n";
        }
    }
Re: web page source?
by DarkGoth (Acolyte) on Feb 22, 2001 at 13:48 UTC

    Hi,

    As other monks have said, you can use LWP::Simple from CPAN ...
    What I wanted to give you is another bit of example code which shows how to use it ...

    This script allows you to GET the source even if the page is password-protected (and you have the login/password):
    #!/usr/bin/perl
    use LWP::Simple qw(get);

    sub Get_Page {
        my ($url, %option) = @_;
        ## INTRODUCE LOGIN AND PASS IF THEY ARE SUPPLIED
        if (exists($option{login})) {
            $url =~ s/http:\/\///si;
            $url = "http://$option{'login'}:$option{'pass'}\@$url";
        }
        return get($url);
    }

    ## THE URL YOU WANT THE SOURCE CODE OF
    my $url = 'http://www.pipo.com/index.html';

    ## THE CALL OF KTHULU
    my $contenu = Get_Page("http://anon.free.anonymizer.com/$url");

    ## PRINT THE SOURCE CODE
    print $contenu;
Re: web page source?
by BlueLines (Hermit) on Feb 22, 2001 at 07:47 UTC
    You might have luck with wget, which is a super cool program for fetching entire web sites. It does recursive gets (you said you needed this) and lets you specify how many levels deep you are willing to go. You could call it from a script, then use Perl to parse the output.
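
    Not to put words in BlueLines' mouth, but a rough sketch of that approach from Perl might look like the following. It assumes wget is on your path; the URL, depth, and output directory are just placeholders:

    #!/usr/bin/perl -w
    use strict;
    use File::Find;

    # Placeholder starting point -- substitute your own URL and depth.
    my $url   = 'http://www.example.com/';
    my $depth = 2;

    # -r follows links recursively, -l caps the depth, -P drops
    # everything under ./mirror (all standard wget options).
    system('wget', '-r', '-l', $depth, '-P', 'mirror', $url) == 0
        or die "wget failed: $?\n";

    # Walk the saved files and slurp each HTML page for parsing.
    find(sub {
        return unless /\.html?$/;
        open my $fh, '<', $_ or return;
        local $/;
        my $source = <$fh>;
        # ... hand $source to HTML::Parser / HTML::TokeParser here ...
        close $fh;
    }, 'mirror');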

    BlueLines

    Disclaimer: This post may contain inaccurate information, be habit forming, cause atomic warfare between peaceful countries, speed up male pattern baldness, interfere with your cable reception, exile you from certain third world countries, ruin your marriage, and generally spoil your day. No batteries included, no strings attached, your mileage may vary.
Re: web page source?
by unixwzrd (Beadle) on Feb 22, 2001 at 14:16 UTC
    Just had to do this myself. I had to duplicate a web site I'm maintaining so I would have a development copy. I didn't have access to their server, so I got a copy of W3MIR, also available on CPAN.

    Did the trick just fine. It has several module dependencies, but they are well documented in the INSTALL document.

    Mike

    "The two most common elements in the universe are hydrogen... and stupidity."
    Harlan Ellison
Re: web page source?
by Desdinova (Friar) on Feb 23, 2001 at 00:08 UTC
    I just wrote something like this. LWP::Simple is great for getting the page. To split out the URLs I used HTML::TokeParser, which is great for walking through a document and grabbing the URLs. In my code I needed both the text and the URL. This is a chunk of the program as a demo; it will just print a list of links from a web site.
    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;
    use LWP::Simple;

    my $page = get('http://web.site.here/file.html');
    unless (defined($page)) { die "Unable to retrieve page: $!\n"; }

    my @links;    # multi-dimensional array to hold links
    my $cnt = 0;
    my $p = HTML::TokeParser->new(\$page);

    while (my $token = $p->get_tag("a")) {
        my $url  = $token->[1]{href} || "-";
        my $text = $p->get_trimmed_text("/a");
        $links[$cnt][0] = $text;
        $links[$cnt][1] = $url;
        $cnt++;
    }

    # sample of accessing the links array
    $cnt = 0;
    my $size = @links;
    while ($cnt < $size) {
        print "Text:$links[$cnt][0]\tURL:$links[$cnt][1]\n";
        $cnt++;
    }
    exit();
    P.S. I am always open to suggestions. I'm still pretty new.
      For this question, HTML::SimpleLinkExtor is a better solution. My code is actually from a script that ends up doing some parsing on the text part to extract a date and then sorting the array by date. I chopped it up to make the meaning easier to get.
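
      A minimal sketch with that module might look like this (untested; it assumes the module in question is HTML::SimpleLinkExtor from CPAN, and the URL is just a placeholder):

      #!/usr/bin/perl -w
      use strict;
      use LWP::Simple;
      use HTML::SimpleLinkExtor;

      # Placeholder URL -- substitute the page you actually want.
      my $page = get('http://www.example.com/')
          or die "Unable to retrieve page\n";

      my $extor = HTML::SimpleLinkExtor->new();
      $extor->parse($page);

      # links() returns every link-type attribute it finds (a, img,
      # frame, ...); a() narrows that down to the href of anchor tags.
      print "$_\n" for $extor->a;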