in reply to Program that will grep website for specified keyword

Actually, I wrote a program to do something similar for a class I took, although I was pulling URLs from pages - links to news stories, in fact.

Amel has got you going down the right road with his suggestions. I used LWP::Simple to get the web pages, but be warned that you're going to pull all the graphics and everything else with you. Since my program did what I wanted at the time, I didn't look for a way to pull just the source, which is what I believe you want. It was suggested to me on here, though, that I use a system call to run lynx and get only the text.
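For the lynx route, a minimal sketch might look like the following. It assumes lynx is installed and on the PATH; -dump is the lynx option that prints the rendered page as plain text. The sub names fetch_text and matching_lines are made up for illustration and are not from any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run "lynx -dump" and return the rendered text.
# Assumes lynx is installed and on the PATH.
sub fetch_text {
    my ($url) = @_;
    # The list form of open avoids passing the URL through a shell.
    open(my $fh, '-|', 'lynx', '-dump', $url)
        or die "Cannot run lynx: $!\n";
    local $/;                 # slurp mode: read everything at once
    my $text = <$fh>;
    close($fh) or die "lynx exited with status $?\n";
    return $text;
}

# Hypothetical helper: return the lines of $text that contain $keyword
# (case-insensitive, keyword treated as a literal string).
sub matching_lines {
    my ($text, $keyword) = @_;
    return grep { /\Q$keyword\E/i } split /\n/, $text;
}

# Usage: perl lynxgrep.pl URL KEYWORD
if (@ARGV == 2) {
    my ($url, $keyword) = @ARGV;
    print "$_\n" for matching_lines(fetch_text($url), $keyword);
}
```

Because lynx renders the page first, the keyword is matched against the visible text rather than against the HTML tags.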

Pages that are graphics-intensive will obviously take longer to download, depending on their size. Also be aware that you may have to deal with frames, and if you're going to take that into account, then Amel is right: you're probably looking at some kind of recursion or something along those lines.
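The frame recursion could be sketched roughly as below. This is only one way to do it, assuming the HTML::LinkExtor and URI modules are installed; fetch_with_frames is a hypothetical name, and the fetcher is passed in as a code ref so that any way of getting a page can be plugged in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: collect a page plus every frame it references.
# $fetch is a code ref that takes a URL and returns HTML (or undef).
# %$seen guards against frames that point back at each other.
# Returns a list of [url, html] pairs.
sub fetch_with_frames {
    my ($url, $fetch, $seen) = @_;
    $seen ||= {};
    return () if $seen->{$url}++;

    my $html = $fetch->($url);
    return () unless defined $html;

    my @pages = ([$url, $html]);
    my $parser = HTML::LinkExtor->new;   # no callback: links are stored
    $parser->parse($html);
    $parser->eof;

    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'frame' || $tag eq 'iframe';
        next unless defined $attr{src};
        # Resolve relative frame URLs against the current page.
        my $abs = URI->new_abs($attr{src}, $url)->as_string;
        push @pages, fetch_with_frames($abs, $fetch, $seen);
    }
    return @pages;
}
```

With LWP::Simple loaded, something like fetch_with_frames('http://www.perlmonks.org/', \&get) would return the frameset page followed by each frame's source, ready to be searched.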

Hope that points you in a useful direction. :)

Good luck!

Re: Re: In need of guidance....
by Mr. Muskrat (Canon) on Apr 24, 2002 at 21:47 UTC
    "...but be warned that you're going to pull all the graphics and everything else with you."
    Why? I would just ignore all URLs if they are inside <IMG> tags...
    Update: (Or look only for those URLs that are in <a> tags.)
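A rough sketch of the "only <a> tags" idea, assuming the HTML::LinkExtor module (part of the HTML-Parser distribution) is installed. The sub name anchor_links is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return only the href values found in <a> tags,
# skipping <img> and every other link-bearing tag.
sub anchor_links {
    my ($html) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback: links are stored
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ($parser->links) {
        # Each entry is [tag, attribute => value, ...].
        my ($tag, %attr) = @$link;
        push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
    }
    return @urls;
}
```

The URLs come back exactly as written in the page, so relative links would still need to be resolved against the page's own URL before fetching them.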
    Matthew Musgrove
    Who says that programmers can't work in the Marketing Department?
    Or is that who says that Marketing people can't program?
      Well, you're going to grab a full web page using LWP::Simple, which includes the graphics. At least that is what I discovered; I could be wrong. Check the LWP module docs to make sure, but as I recall, calling get('http://www.myhost.com') will pull everything, whereas lynx pulls just the text. The trade-off is that you're down to using a system call to run lynx instead of using the module.
        LWP::Simple does not do that unless you tell it to...
        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple;

        my $res_code = getstore('http://www.perlmonks.org/', 'index.html');
        die "Download failed! Response code is $res_code.\n" if $res_code != 200;
        # continue processing here
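If the page does not need to be saved to disk, the keyword search from the original question could be sketched with get() instead of getstore(). This is an illustration only; keyword_count is a hypothetical helper, and the URL and keyword are taken from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Hypothetical helper: count case-insensitive occurrences of a literal
# keyword in a string.
sub keyword_count {
    my ($content, $keyword) = @_;
    my $count = () = $content =~ /\Q$keyword\E/gi;
    return $count;
}

# Usage: perl keyword.pl URL KEYWORD
if (@ARGV == 2) {
    my ($url, $keyword) = @ARGV;
    # get() makes one request and returns that document's content,
    # or undef on failure.
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    printf "%s appears %d time(s) in %s\n",
        $keyword, keyword_count($html, $keyword), $url;
}
```

Note that this searches the raw HTML source, so a keyword that appears inside a tag or attribute is counted too.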

        Matthew Musgrove
        Who says that programmers can't work in the Marketing Department?
        Or is that who says that Marketing people can't program?
        Thank you all for the help.