methosb has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am new to Perl; I am just doing my first Perl unit at uni. I am having trouble getting my head around a question I have to do, and I am sure there is a simple way. We are making a basic command-line web browser. We have a txt file that we search through, get the words from, and match against the "Title" header. It is basically a simple filter: if one of the words, like 'porn', is found, the user is shown a message and the script closes. What I am having trouble with is an extension of that. We have to have multiple files, and within the files there can be URLs to the other filter files. So the script has to go through the first file supplied; then, when it comes across a URL, it has to check that the URL is valid (as we did earlier in the script) and then check the words in the file stored at that URL against the Title header as well. I just can't get my head around how to check every single URL inside the txt files. Does anyone have suggestions on how to approach this? Thanks.

Replies are listed 'Best First'.
Re: Searching through links
by matija (Priest) on Apr 23, 2004 at 12:32 UTC
    If you need to extract links from an HTML file, read the documentation of HTML::LinkExtor - that module should do what you need.
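
    For illustration, a minimal sketch of pulling href links out of an HTML document with HTML::LinkExtor (the HTML is assumed to arrive on STDIN here; adapt the input to however your script fetches pages):

        use strict;
        use warnings;
        use HTML::LinkExtor;

        my $html = do { local $/; <STDIN> };   # slurp an HTML document

        my @urls;
        my $parser = HTML::LinkExtor->new(
            sub {
                my ( $tag, %attr ) = @_;
                # collect only <a href="..."> links
                push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
            }
        );
        $parser->parse($html);
        $parser->eof;

        print "$_\n" for @urls;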

    As for checking all URLs from a TXT (i.e. not HTML) file, just read them in (in a loop) and check them in turn. If a new file gives you more URLs to check, simply push them onto the end of the array you use to hold the URLs you still have to check.

    Be sure to keep a hash of all the URLs already checked, and skip adding them to the "URLs to check" array - that way, you will avoid traversing a loop of links like A->B->C->A.
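
    The same idea as a minimal sketch, assuming the filter files are plain text fetched with LWP::Simple and that the URLs inside them can be picked out with a simple regex (adjust both assumptions to whatever the assignment actually specifies):

        use strict;
        use warnings;
        use LWP::Simple qw(get);

        my @to_check = @ARGV;   # seed URLs, e.g. taken from the command line
        my %seen;               # URLs already handled, so A->B->C->A can't loop forever

        while ( my $url = shift @to_check ) {
            next if $seen{$url}++;          # skip anything already visited

            my $text = get($url);           # fetch the filter file
            next unless defined $text;      # skip URLs that don't fetch

            # ... here, check the file's words against the Title header ...

            # naive pattern for URLs embedded in the plain-text file;
            # push any new ones onto the end of the work queue
            while ( $text =~ m{(https?://\S+)}g ) {
                push @to_check, $1;
            }
        }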

    Hope this helps, and good luck with your homework.