Category: Web
Author/Contact Info

Timm Murray
tmurray.NOSPAM *at* wumpus-cave *dot* NOSPAM.com

Description:

We needed to map our entire site so we would have a better idea of what we need to keep and what we need to throw away. Our web site has grown out of control over the years, and there are portions of it that have never been touched by any of our current staff. We had set aside at least a month for manually mapping out the entire site. I was sufficiently interested in automating this that I spent my free time coming up with the code below. And that's how Perl helped shave the major work of a one-month project down to a few hours.

Update: Forgot to mention--this does not respect robots.txt at all (for the problem I needed to solve, that's just the way it needed to be). Please, be kind to the owners of sites you intend to run this on. This can generate a lot of noise for webmasters.
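For anyone who does want to honor robots.txt, here is a minimal sketch of how the check could be bolted on with WWW::RobotRules. The agent name and the spot where the rules object would be consulted are my own assumptions, not part of the script below:

use WWW::RobotRules;
use LWP::Simple qw( get );

# Same style of constant as the script below; www.perlmonks.org is
# just the example host the script already uses
sub SITES () { qw( www.perlmonks.org ) }

# One rules object, seeded from each allowed host's robots.txt
my $rules = WWW::RobotRules->new('WebSiteMapper/1.0');
foreach my $host (SITES) {
        my $robots_url = "http://$host/robots.txt";
        my $robots_txt = get($robots_url);
        $rules->parse($robots_url, $robots_txt) if defined $robots_txt;
}

# Then, near the top of get_page(), before fetching:
#   return unless $rules->allowed($page_to_load);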

Update 2:

  • Added 'javascript' to the list of disallowed schemes, as suggested by Vautrin.
  • Fixed a bug where if 'www.site.com' was allowed, then it would also index 'www.site.com.uk' if it was linked to.
  • Fixed a bug where the URI object was being put into the links list instead of the string form, which was confusing YAML.
  • The list of links for each page is now output when DEBUG is on.
#!/usr/bin/perl
# Web Site Mapper
# Copyright (C) 2004  Timm Murray
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#

use strict;
use warnings;
use WWW::Mechanize;
use URI;
use YAML;
#use Data::Dumper 'Dumper';

# Hosts the spider is allowed to index
sub SITES ()               { qw( www.perlmonks.org ) }
# URI schemes that should never be followed (see Update 2)
sub DISALLOWED_SCHEMES ()  { qw( mailto javascript ) }
# Set to 1 for progress messages on STDERR
sub DEBUG ()               { 1 }

sub get_page
{
        my ($data, $uri) = @_;

        my $page_to_load = $uri->canonical;

        warn "Indexing $page_to_load\n" if DEBUG;

        # Don't process pages that have already been loaded
        if(exists $data->{$page_to_load}) {
                warn "\t$page_to_load already indexed\n" if DEBUG;
                return;
        }
        # Don't process pages that aren't listed in the sites above
        unless( grep { lc($uri->host) eq lc($_) } SITES) {
                warn "\t$page_to_load not in allowed sites\n" if DEBUG
+;
                return;
        }

        # autocheck => 0 keeps Mechanize from dying on HTTP errors;
        # the status of every fetch is recorded below instead
        my $mech = WWW::Mechanize->new( autocheck => 0 );
        $mech->get( $page_to_load );

        $data->{$page_to_load}{status} = $mech->status;

        if($mech->success) {
                warn "\tResponse successful\n" if DEBUG;
                $data->{$page_to_load}{content_type} = $mech->ct;
                warn "\tContent-type: " .
                        $data->{$page_to_load}{content_type} .
                        "\n" if DEBUG;
                $data->{$page_to_load}{title} = $mech->title;
                warn "\tTitle: " . $data->{$page_to_load}{title} .
                        "\n" if DEBUG;

                my @links = map { $_->url_abs } $mech->links;
                $data->{$page_to_load}{links} = [];
                warn "\tLinks: " . join("\n", map "\t\t$_", @links) . 
+"\n" 
                        if DEBUG;

                foreach my $link (@links) {
                        my $uri = URI->new($link);
                        next if grep { $uri->scheme eq $_ } DISALLOWED_SCHEMES;
                        my $url = $uri->canonical->as_string;
                        warn "\tFollowing $url\n" if DEBUG;
                        push @{ $data->{$page_to_load}{links} }, $url;

                        get_page( $data, $uri );
                }
        }
        else {
                warn "\tResponse unsuccessful\n" if DEBUG;
        }
}


{
        my $data = { };
        foreach my $site (SITES) {
                my $start_page = 'http://' . $site;
                my $uri = URI->new($start_page);
                get_page( $data, $uri );
        }

        print Dump($data);
        #print Dumper($data);
}
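
For reference, the structure that ends up in the YAML dump is keyed on canonical URL, with the HTTP status, content type, title, and outgoing links recorded per page. A sketch of the shape, with made-up URLs:

# Roughly what $data looks like by the time Dump() is called
# (the example.com URLs are hypothetical, just to show the shape)
my $data = {
        'http://www.example.com/' => {
                status       => 200,
                content_type => 'text/html',
                title        => 'Example Front Page',
                links        => [
                        'http://www.example.com/about.html',
                        'http://www.example.com/contact.html',
                ],
        },
};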
•Re: Web Site Mapper
by merlyn (Sage) on Feb 16, 2004 at 08:14 UTC

      Perhaps I overlooked something in WWW::Robot's documentation, but it appears to me that it wouldn't quite work for what he seems to have wanted to do: as he said in the update, for his purposes he needed to ignore robots.txt rules, and I couldn't see any way to turn that off.

      And on the second point, the sub returns (almost) immediately when it is called on a page it has already hit, so it should stop once it has exhausted all the pages it hasn't already indexed and finished returning from what is likely a rather long list of links it has already visited.

      Perhaps it might be more efficient to check whether a link already exists before invoking the sub's foreach?

        Perhaps it might be more efficient to check whether a link already exists before invoking the sub's foreach?

        More efficient, perhaps. However, in my case I needed to search a number of virtual hosts which may end up linking to each other's front pages. If you check whether a page exists before invoking the spider sub, that code has to be duplicated in the initial call. By doing the check at the beginning of the sub, there is only one place where it happens.

        In any case, I doubt the efficiency advantage matters. This program isn't CPU-intensive. It will more likely be limited by your connection to the target domains and (probably to a lesser extent) by the memory overhead of recursive calls and the data structure.

        Update: I did notice that CPU usage spikes at the end of the program, when YAML::Dump is called. But that's only done once. (I was worried for a moment that it had managed to get itself stuck in an infinite loop and started eating massive system resources *g*).

        ----
        : () { :|:& };:

        Note: All code is untested, unless otherwise stated

Re: Web Site Mapper
by Vautrin (Hermit) on Feb 16, 2004 at 15:21 UTC

    You may want to change:

    sub DISALLOWED_SCHEMES ()  { qw( mailto ) }

    to:

    sub DISALLOWED_SCHEMES () { qw( mailto javascript ) }

    Otherwise your script will try to spider javascript functions. This means you're also going to need to change:

    next if grep { $uri->scheme eq $_ } DISALLOWED_SCHEMES;

    to something like:

    my @disallowed = DISALLOWED_SCHEMES;
    foreach my $scheme (@disallowed) {
        next MAINLOOP if ($uri->scheme eq $scheme);
    }

    I forget whether or not a next works in a for loop, and don't really have time to look it up, so I used MAINLOOP and assumed the loop we're nexting is named.
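
    For reference, a labeled outer loop is what makes that next MAINLOOP work; a minimal, self-contained sketch of the idea (the link list and the loop body are just placeholders):

    use URI;

    my @links = ('http://example.com/', 'mailto:someone@example.com');
    MAINLOOP: foreach my $link (@links) {
        my $uri = URI->new($link);
        foreach my $scheme (qw( mailto javascript )) {
            # jumps to the next $link, not just the next $scheme
            next MAINLOOP if $uri->scheme eq $scheme;
        }
        print "would follow $link\n";
    }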




      I could see how adding javascript to the disallowed schemes would be a good idea, but I don't see what the foreach addition would provide over the current grep solution. Seems like a lot more code to do the exact same thing.

      BTW--In the foreach case, you're right about needing an explicit label.

      ----
      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

        Oops, you're right. I was looking over the code quickly on my way to work, saw $_, and added my 2 cents.

Re: Web Site Mapper
by exussum0 (Vicar) on Feb 16, 2004 at 15:24 UTC
    use constant saves you from writing out your constants explicitly as subs. 'Sides, the syntax makes them look more variable-like.

    use constant A => 1;


    Play that funky music white boy..

      I used to use constant all the time, but not anymore. It requires a relatively recent version of perl, and it loads up a lot of code that isn't really necessary for simple cases. Further, it is rather clunky to use with anything more complex than a simple scalar constant. For instance, to handle the SITES constant above, I would need to change it to an arrayref and modify the site-checking code to dereference the array. That's extra noise with little appreciable gain. Using a subroutine with an empty prototype (which is what constant does anyway) doesn't add a lot of extra code, and it avoids the problems constant creates.

      ----
      : () { :|:& };:

      Note: All code is untested, unless otherwise stated

        Version 1.0 of constant came out in 1997 so it's not that new :)

        Yes, switching from sub XYZ() to constant requires code changes. But for new development it makes your intentions a little clearer: the reader can see what the sub is there for, and that you aren't insane.

        My version of constant.pm has about 100 lines of perl, with warnings::register using about 25. Not too much for some code clarity if you ask me.


        Play that funky music white boy..

        For instance, to handle the SITES constant above, I would need to change it to an arrayref and modify the site-checking code to dereference the array.

        What are you talking about? You can simply
        use constant SITES => qw(foo.com bar.com baz.com);
        and it'll do what you'd expect.
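
        A quick sanity check of how the script's host test would read with that change; the foo/bar/baz hosts are made up, but nothing else in get_page() would have to change, since SITES still returns a plain list:

        use strict;
        use warnings;
        use constant SITES => qw( foo.com bar.com baz.com );
        use URI;

        my $uri = URI->new('http://bar.com/index.html');
        # same grep as in get_page(): SITES expands to the whole list
        print "allowed\n" if grep { lc($uri->host) eq lc($_) } SITES;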

        Makeshifts last the longest.