in reply to Creating loop on undefined hash key value

Well, first, it won't work to mix your types like this:
$links{$url}{html} = $html; ... $links{$new_url} = 1;
The value at $links{$some_key} cannot simultaneously be the number one and a subhash reference. I suggest you change it to:
$links{$new_url}{visited} = 0;
Then when you actually visit the node, change that 0 to a 1, and put the HTML in there.
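
A minimal sketch of that layout, reusing the %links, $url, $new_url, and $html names from your code:

    # queue a newly discovered URL -- every value is a hashref
    $links{$new_url}{visited} = 0;

    # later, when the page is actually fetched
    $links{$url}{visited} = 1;
    $links{$url}{html}    = $html;

    # walk whatever hasn't been visited yet
    for my $link ( grep { !$links{$_}{visited} } keys %links ) {
        # fetch $link, store its HTML, mark it visited ...
    }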

However, if all you're writing is a link checker or recursive web walker, you'd be about the 492nd person to do it this month. I suggest you save lots of time and look at WWW::Robot or WWW::SimpleRobot or any of my columns on that subject.
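
If it helps, WWW::SimpleRobot is driven by a visit callback; roughly like this, going from its synopsis (check the module's docs for the exact option names; the URL below is just a placeholder):

    use WWW::SimpleRobot;

    my $robot = WWW::SimpleRobot->new(
        URLS           => [ 'http://www.example.com/' ],
        FOLLOW_REGEX   => '^http://www\.example\.com/',
        DEPTH          => 2,
        VISIT_CALLBACK => sub {
            my ( $url, $depth, $html, $links ) = @_;
            print "visited $url (depth $depth)\n";
        },
    );
    $robot->traverse;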

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Re: •Re: Creating loop on undefined hash key value
by S_Shrum (Pilgrim) on Nov 23, 2002 at 18:29 UTC

    Quick background: I'm building a simple site index builder (no 'use lingua'). Maybe this time I'll actually finish it.

    I am familiar (a little) with Robot and SimpleRobot, but to my knowledge and past experience they do not handle dynamic, script-generated pages, which is how I need this to work.

    Robot and SimpleRobot will not read a page with a URL like:

    http://www.someserver.com/cgi-bin/template.pl?content=foo.htm

    I'm working with a subset of code written by Rob_au from his SiteRobot.pm (you've seen it before). His code works great (it returns script URLs) but it passes back only a single-dimensional array of the page URLs... I wanted to build on this and retrieve more about each page: title, body, creation date, etc. Rob's code already gathers this information, but it only uses it for validity checking and then throws it to the wind.

    My twist was to use a hash instead of his array, since no duplicate keys can be created... therefore the listing of URLs would be unique. I thought about doing an AoH but that's too messy (I'd have to build in duplicate checking, code to pull the hashes out of the array, etc.). À la K.I.S.S.
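
    Roughly what I have in mind (the field names here are just examples):

        # keyed by URL, so duplicate entries simply cannot happen
        $links{$url} ||= {};                  # first time the URL is seen
        $links{$url}{title}   = $title;       # filled in once the page is fetched
        $links{$url}{body}    = $body;
        $links{$url}{created} = $creation_date;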

    So now that you see my quandary a bit more clearly, is there any more information you can provide?

    TIA

    ======================
    Sean Shrum
    http://www.shrum.net

      It shouldn't be difficult to modify Robot or SimpleRobot to /not/ filter out URLs that carry GET parameters. Be careful, though, as there are times you definitely /don't/ want to follow such links, such as when they cause voting, etc., to occur.

      There's probably a line that searches for a ? in the url, and rejects it. It's probably even commented.
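
      Purely as an illustration of the sort of thing to hunt for (I haven't looked at the actual source, so the real line will differ):

          # hypothetical filter inside the robot module
          next if $url =~ /\?/;    # reject URLs carrying a query string

          # relaxed: follow query-string URLs, but still skip ones that
          # look state-changing (voting, deleting, logging out, ...)
          next if $url =~ m{(?:vote|delete|logout)}i;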


      Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).