psykosmily has asked for the wisdom of the Perl Monks concerning the following question:

I recently had the idea to write a program that will grep a specified website for a specified keyword. I'm a novice and thought that I might be able to receive some guidance in the best method I might use for such a program. Does anyone have an idea or suggestion they might be willing to share with me?

Edit kudra, 2002-05-01 Changed title


Re: In need of guidance....
by dsb (Chaplain) on Apr 24, 2002 at 19:29 UTC
    Look into the LWP library, specifically LWP::Simple. If you are planning on doing this recursively over the whole site rather than just one page, you will also want to read up on references and stacks and the like.

    To summarize, a list of things to be processed usually comes in one of two types: First In, First Out (a queue) or Last In, First Out (a stack). I believe most recursive processing uses the latter type.

    # from the command line
    $ perldoc perlref
    $ perldoc LWP::Simple
    $ perldoc perldata
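
    For the single-page case, a minimal sketch along those lines might look like this (the URL and keyword come straight from the command line):

    # Fetch a single page with LWP::Simple and print the lines that
    # contain a keyword.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my ($url, $keyword) = @ARGV;
    my $html = get($url);
    die "Couldn't fetch $url\n" unless defined $html;

    my $line = 0;
    for (split /\n/, $html) {
        $line++;
        print "line $line: $_\n" if /\Q$keyword\E/i;
    }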

    Update - Edited stack types. Thanks for the catch Mr. Muskrat ;0)



    Amel
      A stack is First In Last Out. (Think the tray or plate stack in a cafeteria. The last one put on the stack is the first one removed.) Example from either the Programming Perl 3rd Ed or Perl Cookbook... don't remember which. You 'push' onto a stack and 'pop' off of it.

      A queue is First In First Out. You 'unshift' onto the front of a queue and 'pop' off the back of it.

      *** NOTE: of course you are free to do as you like. So you could push and shift, or unshift and pop or whatever you want...

      Updated: Okay, I looked it up and I was on the right track: "When you push and pop (or shift and unshift) an array, it's a stack; when you push and shift (or unshift and pop) an array, it's a queue." -- Programming Perl, 3rd Edition; Chapter 9 Data Structures, Pg. 268
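
      In Perl terms, the difference looks like this:

      my @stack;
      push @stack, 'a', 'b', 'c';
      print pop @stack;      # prints 'c' -- last in, first out (a stack)

      my @queue;
      push @queue, 'a', 'b', 'c';
      print shift @queue;    # prints 'a' -- first in, first out (a queue)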

      ----------------
      Matthew Musgrove
      Who says that programmers can't work in the Marketing Department?
      Or is that who says that Marketing people can't program?
Re: In need of guidance....
by Stegalex (Chaplain) on Apr 24, 2002 at 20:00 UTC
    If you're on Linux and have shell access to the server, you could just run fgrep -Ri string from the document root.

    But if not, here's a hokey little program called sgrep (short for site grep).
    #!/usr/bin/perl -w
    # grep through a website's HTML files on disk for each string given
    # on the command line.
    use strict;
    use File::Find;

    foreach my $term (@ARGV) {
        print "\n$term:\n";
        # Pass $term to the wanted routine via a closure; File::Find does
        # not hand extra hash entries to the callback itself.
        find({ wanted => sub { each_file($term) }, follow => 0 },
             "/iputils/ns-home/docs");
    }

    sub each_file {
        my ($term)   = @_;
        my $filename = $File::Find::name;
        return if (-d $filename) || (! -r $filename) || (! -T $filename);
        open my $fh, '<', $filename or return;
        my $line = 0;
        while (<$fh>) {
            chomp;
            $line++;
            print "$filename line $line\n" if /$term/;
        }
        close $fh;
    }
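
    Assuming you save it as sgrep and make it executable, you'd run it with one or more search terms as arguments:

    $ ./sgrep keyword1 keyword2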


    ~~~~~~~~~~~~~~~
    I like chicken.
      Oops, I forgot to mention that the part that says /iputils/ns-home/docs should be replaced with your web server's document root directory.

      ~~~~~~~~~~~~~~~
      I like chicken.
Re: In need of guidance....
by Molt (Chaplain) on Apr 25, 2002 at 09:59 UTC

    I'm going to be radical and suggest you look at WWW::Robot. This module is intended to walk an entire site, pulling down the data and letting you do what you wish with it.

    I'd also suggest splitting the program into two parts. The first part pulls down all the data and stores it locally; the second greps the local copy. That way you don't have to wait for the data to be fetched all over again if you decide to grep for something else or if you have a bug in the code, and the website maintainer doesn't begin to hate you for eating silly amounts of bandwidth by fetching the site multiple times.

    You can use File::Find to simplify the second part too.

    Also make sure you obey the robot exclusion rules, and have a delay between getting consecutive URLs so you don't give the server a good kicking.
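
    A rough sketch of the first stage, using LWP::RobotUA (which obeys robots.txt and waits between requests) and HTML::LinkExtor rather than WWW::Robot itself; the start URL and mirror directory are just example values:

    # Stage one: crawl a site politely and save its HTML pages locally.
    use strict;
    use warnings;
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use URI;
    use File::Path qw(mkpath);

    my $start  = 'http://www.example.com/';    # example start URL
    my $mirror = './mirror';                   # local copy goes here
    my $host   = URI->new($start)->host;

    my $ua = LWP::RobotUA->new('sitegrep/0.1', 'you@example.com');
    $ua->delay(10 / 60);                       # wait 10 seconds between requests

    my (%seen, @queue);
    push @queue, $start;                       # a queue: push on the back, shift off the front
    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $resp = $ua->get($url);
        next unless $resp->is_success and $resp->content_type eq 'text/html';

        # Save the page under the mirror directory for grepping later.
        my $path = URI->new($url)->path;
        $path = '/' if $path eq '';
        $path .= 'index.html' if $path =~ m{/$};
        my ($dir) = "$mirror$path" =~ m{^(.*)/};
        mkpath($dir) unless -d $dir;
        open my $out, '>', "$mirror$path" or next;
        print $out $resp->content;
        close $out;

        # Queue up any links that stay on the same host.
        my $extor = HTML::LinkExtor->new(undef, $url);   # base URL makes links absolute
        $extor->parse($resp->content);
        $extor->eof;
        for my $link ($extor->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' and defined $attr{href};
            my $abs = URI->new($attr{href});
            next unless $abs->scheme and $abs->scheme =~ /^https?$/;
            push @queue, $abs->as_string if $abs->host eq $host;
        }
    }

    The second stage is then just something like Stegalex's sgrep above (or plain grep -r) pointed at ./mirror.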

      Thank you very much, this was exactly what I was looking for.
Re: In need of guidance....
by Popcorn Dave (Abbot) on Apr 24, 2002 at 20:02 UTC
    Actually, I wrote a program to do something similar for a class I took, although I was pulling URLs - links to news stories, actually - from pages.

    Amel has got you going down the right road with his suggestions. I used LWP::Simple to get the web pages, but be warned that you're going to pull all the graphics and everything else with you. Since my program did what I wanted at the time, I didn't look for a way to pull just the source, which is what I believe you want to do, but it was suggested to me here that I use a system call to lynx to get only the text (there's a rough sketch of that at the end of this reply).

    Obviously, the bigger and more graphics-intensive the page you're searching, the longer it will take to download. Also be aware that you may have to deal with frames; if you want to take those into account, then Amel is right: you're probably looking at some kind of recursion or something along those lines.

    Hope that points you in a useful direction. :)

    Good luck!
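
    A rough sketch of the lynx route mentioned above (it shells out, so lynx has to be installed; the URL and keyword are placeholders):

    # lynx -dump writes the rendered text of a page (no markup, no graphics)
    # to stdout; grep that for the keyword.
    my $url     = 'http://www.example.com/';
    my $keyword = 'perl';
    my @lines   = `lynx -dump $url`;
    print grep { /\Q$keyword\E/i } @lines;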

      "...but be warned that you're going to pull all the graphics and everything else with you."
      Why? I would just ignore all URLs that are inside <IMG> tags...
      Update: (Or look only for those URLs that are in <a> tags.)
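
      A quick sketch of that with HTML::LinkExtor (the URL is a placeholder):

      # Print only the <a href="..."> links from a page, ignoring <img> tags.
      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use HTML::LinkExtor;

      my $url  = 'http://www.example.com/';
      my $html = get($url);
      die "Couldn't fetch $url\n" unless defined $html;

      my $extor = HTML::LinkExtor->new(undef, $url);   # base URL makes links absolute
      $extor->parse($html);
      $extor->eof;
      for my $link ($extor->links) {
          my ($tag, %attr) = @$link;
          print "$attr{href}\n" if $tag eq 'a' and defined $attr{href};
      }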
      Matthew Musgrove
      Who says that programmers can't work in the Marketing Department?
      Or is that who says that Marketing people can't program?
        Well, you're going to grab a full web page using LWP::Simple, which includes the graphics. At least that is what I discovered; I could be wrong. Check the LWP module docs to make sure, but as I recall, using get('http://www.myhost.com') will pull everything, whereas lynx just pulls the text - but then you're down to making a system call to lynx rather than using the module.