S_Shrum has asked for the wisdom of the Perl Monks concerning the following question:

This ? is in coorespondence with my task of creating a site indexing script (look at this node for more background).

What I am trying to do
======================

With the aid of HTML::LinkExtor, I am retrieving a list of links (converted to absolute URLs) from a primer (root) page. These links are stored into 3 hashes:

%all_links for all the links,
%distinct_links for a list of links with duplicates removed, and lastly
%constrained_links for links that conform to a user defined string (mainly used for regex things like "^http://www.shrum.net" as to control the traversal scope).

Aside: Thinking this over again, I should probably just use 1 hash and have a 'occurances' key and increment that for dupes. Anyways...

The first page will fill the hashes with some links to start tranversing therefore I am going to need to loop through them, right!?.

Here's the catch...as the traversal begins, additional links from the respective pages returned by the links will be added to the hashes.

How should I attack this beast? How can I set up a loop that will deal with a growing hash to address all the entries?

TIA.

======================
Sean Shrum
http://www.shrum.net

  • Comment on How do I...? - Looping on a growing hash

Replies are listed 'Best First'.
Re: How do I...? - Looping on a growing hash
by dws (Chancellor) on Mar 21, 2002 at 01:50 UTC
    Why keep both %all_links and %distinct_links?   $links{$aLink}++; handles both cases. All you need is   keys %links to give you distinct links. And you don't need to iterate over hashs. All you need is something like
    foreach $link ( @extracted_links ) { $links{$link}++; $constrainted_linkes{$link}++ if ( 0 < grep { $link =~ m/$_/ } @constraints; }
    The rest is details.

      I was figuring on being able to report on how many times a particular URL came up. I just changed my code to deal with 1 hash, like this: %all_links = ( URL_STRING = ( occurances, visit, title, content, traversed,) );

      Code-wise, I will check if the URL needs to be traversed via a positive value is in the VISIT key (VISIT eq to the value returned by INDEX function) if the URL_STRING contains constraining factors.

      I would then need to traverse those pages that contain a positive VISIT value and have a undefined or (if I preset the value to -1) a negative TRAVERSED value.

      Note: Keying off the TITLE and/or CONTENT values might cause a inifinate loop if a page has no title or body content (why a page would have no title or body is beyond me but it's a possible issue).

      I know I left some of these specifics out of the original post so I humblily bend over and shout: "Thank you sir, may I have another!" ;-}

      PS: Thinking ahead: can I increment the value of a key like this: $all_links{URL_STRING}{occurances}++;

      ...or do I have to do this:

      $occurances = $all_links{URL_STRING}{occurances} + 1; $all_links{URL_STRING}{occurances} = $occurances;

      TIA (again) ;)

      ======================
      Sean Shrum
      http://www.shrum.net

        I just changed my code to deal with 1 hash, ...

        By using one top-level hash, you're buying into a set of intermediate-level data structure problems that you could avoid by using several (5?) top-level hashes, one per "purpose". I recommend you do the latter.

Re: How do I...? - Looping on a growing hash
by demerphq (Chancellor) on Mar 27, 2002 at 16:55 UTC
    Your problem in a nutshell is Breadth First Traversal.

    Here is a bit of perly pseudo code that perhaps you will find useful.

    # the fact that the below looks like perl should not be # taken a sign that it is perl. It certainly ain't tested perl sub getlinks { my $rootpage=shift; my (@queue,%seen,%visited)=($rootpage); $seen{$rootpage}++; while (@queue) { my $this_page=shift @queue; # dont revisit this page next if $visited{$this_page}++; my @found=get_all_links_from_page($this_page); !$seen{$_}++ && !visited{$_} && push @queue,$_ foreach @found; # additional post processing of link } return $results_of_postprocessing }
    BTW this is the basic structure of a breadth first traversal of an arbitrary data structure. You might want to research the subject a bit on your own...

    HTH

    Yves / DeMerphq
    ---
    Writing a good benchmark isnt as easy as it might look.