yacoubean has asked for the wisdom of the Perl Monks concerning the following question:

Holy Monks,

I have a 'website' that is documentation for a largish ColdFusion Intranet application I'm in charge of. Each page of the site is documentation for one page of the ColdFusion app. So there are 135 ColdFusion pages and thus 135 HTML doc pages. I would like to add a section to each HTML doc page that displays every ColdFusion page that links to the particular CFM page documented. I already have a section on each HTML page that shows all of the pages that each CFM page links to, but not which pages link to each individual page. An example would probably help:

ColdFusion pages:
a.cfm, b.cfm, c.cfm

Docs pages:
a.html, b.html, c.html

In a.html there's a section that says that a.cfm links to b.cfm and c.cfm. The Perl code I want would tell me which pages link to b.cfm, in this case a.cfm. Repeat that operation for all 135 HTML pages.

I already know enough Perl to be able to open and parse HTML files. So my plan is to make a script that will search the 135 HTML files for all pages that link to a particular page and spit that list out. But it doesn't seem very elegant, because I'd be searching the 135 pages 135 times, once for each page. I don't want you to hold my hand and write all the code for me, I'm just looking for general advice to point me in the right direction before I embark on this journey.
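
For the parsing step, a minimal sketch of the kind of helper such a script needs, using HTML::LinkExtor (find_links is an illustrative name, and it assumes the links of interest are ordinary <a href> anchors):

use strict;
use warnings;
use HTML::LinkExtor;

# Illustrative helper: return every href found in one HTML doc page.
sub find_links {
    my $file = shift;
    my $parser = HTML::LinkExtor->new;
    $parser->parse_file($file);
    my @hrefs;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;    # e.g. ('a', href => 'b.cfm')
        push @hrefs, $attr{href} if $tag eq 'a' and defined $attr{href};
    }
    return @hrefs;
}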

Janitored by Arunbear - retitled from 'I've climbed the mountain to speak with the Oracle.', as per Monastery guidelines

Re: Checking links between web-pages
by radiantmatrix (Parson) on Oct 13, 2004 at 16:02 UTC
    You needn't repetitively scan if you store a map inside a hash structure:

    # like this...
    # Set up the link-map
    my %linkmap;
    my @pages = glob('*.cfm');
    foreach my $page (@pages) {
        my @links_to = find_links($page);    # your sub
        for (@links_to) {
            $linkmap{$page}{$_} = undef;
        }
    }

    # print out what links to each file
    foreach my $page (@pages) {
        print "What links to $page: " . join(';', what_links_to($page)) . "\n";
    }

    # sub to find what links to something
    sub what_links_to {
        my $dest = shift;    # i.e. 'a.cfm' for what links to 'a.cfm'
        my @links_to;
        for (keys %linkmap) {
            next if $_ eq $dest;    # skip $dest's own outbound links
            push @links_to, $_ if exists $linkmap{$_}{$dest};
        }
        return @links_to;
    }

    So you only scan the files once, and then check the hash repeatedly. Much faster.
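
    A minimal sketch of a further refinement, assuming the %linkmap structure above (%linked_by is an illustrative name): since what_links_to scans every key of %linkmap on each call, you can invert the map once and make every lookup a single hash access.

    # Invert the forward map: %linked_by maps each page to the set of
    # pages that link to it.
    my %linked_by;
    for my $page (keys %linkmap) {
        for my $target (keys %{ $linkmap{$page} }) {
            $linked_by{$target}{$page} = undef;
        }
    }

    # Now "what links to b.cfm?" is a direct lookup:
    my @linkers = keys %{ $linked_by{'b.cfm'} || {} };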

    radiantmatrix
    require General::Disclaimer;
      Yes, this is a very good idea. I was actually thinking a hash would be good, but I've been afraid to try it because I have yet to use hashes and they kind of scare me. ;)
      But I think I'll use this opportunity to break into them.
Re: Checking links between web-pages
by Random_Walk (Prior) on Oct 13, 2004 at 15:32 UTC

    If I grok what you mean

    When Page 3 links to Page 7 you wish to ensure Page 7 also links back to Page 3...

    Pass 1: examine all pages and build a hash keyed by linked page, where each value is an arrayref holding the pages that link to it. For instance, if on page 3 you find a link to page 7, do push @{$hash{7}}, 3 (checking first that the key for 7 exists and creating it if required).

    Pass 2: go through the hash keys; the array attached to each key is the list of all pages that link to that page.

    update

    Oh, how dumb I am: sequential integer page numbers and I am babbling on about a hash... Replace the above with push @{$array[7]}, 3 and s/hash/array/g.

    Here is a rough framework

    # something like this....
    # @Pages is a list of integer page numbers
    # get_pages_linked_from($) does what it looks like
    my @Link_Store;
    foreach my $page (@Pages) {
        my @Links = get_pages_linked_from($page);
        foreach (@Links) {
            $Link_Store[$_] = [] unless $Link_Store[$_];
            push @{ $Link_Store[$_] }, $page;
        }
    }

    # assuming you start on page 1
    for my $i (1 .. $#Link_Store) {
        my $linkers = join ", ", @{ $Link_Store[$i] || [] };
        print "Page $i linked from $linkers\n";
    }

    Cheers,
    R.

      Thanks for trying to help, but that's not exactly what I am trying to do. I just want to know which pages link to a given page, not to create back links.
Re: Checking links between web-pages
by pizza_milkshake (Monk) on Oct 13, 2004 at 15:12 UTC
    The route I would take would be to maintain two database tables: one listing every file in the project, the other recording which files link to which others. You could build this once and generate your documentation once, so browsing wouldn't generate any load. Then you'd probably want to run a script once per night which checks the modification time of each doc, scans any that changed, and if necessary updates the DB and regenerates the static pages.
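
    A rough sketch of that idea, assuming DBD::SQLite is available; the table and column names, and the find_links helper, are illustrative only:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=links.db", "", "",
                           { RaiseError => 1 });
    $dbh->do("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, mtime INTEGER)");
    $dbh->do("CREATE TABLE IF NOT EXISTS links (from_file TEXT, to_file TEXT)");

    my $upd = $dbh->prepare("INSERT OR REPLACE INTO files (name, mtime) VALUES (?, ?)");
    my $add = $dbh->prepare("INSERT INTO links (from_file, to_file) VALUES (?, ?)");

    for my $file (glob '*.html') {
        my $mtime = (stat $file)[9];
        # a fuller version would skip files whose stored mtime is unchanged
        $upd->execute($file, $mtime);
        $dbh->do("DELETE FROM links WHERE from_file = ?", undef, $file);
        $add->execute($file, $_) for find_links($file);    # illustrative sub
    }

    # "Who links to b.cfm?" becomes a single query:
    my $who = $dbh->selectcol_arrayref(
        "SELECT from_file FROM links WHERE to_file = ?", undef, 'b.cfm');
    print "Linked to by: @$who\n";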

    perl -e"\$_=qq/nwdd\x7F^n\x7Flm{{llql0}qs\x14/;s/./chr(ord$&^30)/ge;print"

      This is a good idea. I can populate the tables with code, and then use the tables to create my 'linked to by whom' section on each page.
Re: Checking links between web-pages
by TedPride (Priest) on Oct 13, 2004 at 20:07 UTC
    my %hash;
    my ($from, $to, @to);
    foreach (<DATA>) {
        chomp($_);
        ($from, $to) = split(/ -> /, $_);
        @to = split(/ /, $to);
        foreach (@to) {
            $hash{$_}{$from} = ();
        }
    }
    foreach (sort keys %hash) {
        print "$_ <-";
        $to = $hash{$_};
        foreach (sort keys %$to) {
            print " $_";
        }
        print "\n";
    }
    __DATA__
    page1.html -> page2.html page3.html
    page2.html -> page3.html page4.html
    page3.html -> page4.html
    page4.html -> page1.html
    You'll need to modify the first loop for your page loading and link extraction algorithm, but this should work fine for matching links with only one run-through per page.
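
    For reference, run against the sample __DATA__ above, this should print:

    page1.html <- page4.html
    page2.html <- page1.html
    page3.html <- page1.html page2.html
    page4.html <- page2.html page3.html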