Finding files on a website that have been orphaned

aarestad has asked for the wisdom of the Perl Monks concerning the following question:

We have an old website that has accumulated a lot of cruft, and I have been asked to write a script that finds it. Specifically, the "cruft" I seek is any file that nothing within the website links to. Say I am working with http://example.com/it/, and the files are located on the Unix box itself at /www/docs/it/. I want to extract all the links from any .htm and .html files underneath the it/ subweb and store them. Then I'm thinking I should turn those into absolute pathnames. Finally, I should check each file recursively underneath the /www/docs/it/ directory and report any file that I have no reference to.

Now here's my problem: I don't have admin rights to the Unix box, so I can only install Perl modules like HTML::LinkExtractor locally on my Windoze box with Cygwin. That means I need some way to get directory listings remotely... Any insight?

(ps: Sorry I don't have any code to start with - I feel bad about this - but this one is a real stumper.)

Replies are listed 'Best First'.
Re: Finding files on a website that have been orphaned by simonm (Vicar) on Dec 05, 2003 at 23:48 UTC
If you decide to work this from another angle, you might also ask your hosting provider for a recent month's worth of access logs and for the output of running "ls -lR" from your root directory, and then use text-processing techniques to look for files that haven't been served.
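As a rough sketch of that text-processing step: the code below assumes the listing is GNU-style "ls -lR" output taken from the document root and that the logs are in Common Log Format. Both are assumptions, since the thread does not say what the host would actually supply. The function names are made up for illustration, and only core Perl is used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the lines of an "ls -lR" listing into file paths relative to
# the directory the listing was started from.
sub files_from_ls_lR {
    my @lines = @_;
    my $dir = '.';
    my @files;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^-/) {
            # Assumed layout: perms links owner group size month day time-or-year name
            my @f = split ' ', $line, 9;
            next unless @f == 9;
            my $path = "$dir/$f[8]";
            $path =~ s{^\./}{};
            push @files, $path;
        }
        elsif ($line =~ /^(.+):$/) {
            $dir = $1;    # a line like "./old:" starts a new directory block
        }
    }
    return @files;
}

# Collect every path requested with a 2xx or 3xx status (assumes Common Log Format).
sub requested_paths {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        next unless $line =~ m{"(?:GET|HEAD|POST) (/\S*) HTTP/[\d.]+" (\d{3})};
        my ($path, $status) = ($1, $2);
        next unless $status =~ /^[23]/;
        $path =~ s/[?#].*//;    # drop any query string or fragment
        $path =~ s{^/}{};       # make it relative, like the listing
        $seen{$path} = 1;
    }
    return %seen;
}

# Files in the listing that never show up in the log.
sub unserved_files {
    my ($files, $seen) = @_;
    return grep { !$seen->{$_} } @$files;
}

# Hypothetical usage: perl unserved.pl listing.txt access.log
if (@ARGV == 2) {
    my ($listing, $log) = @ARGV;
    open my $ls_fh,  '<', $listing or die "Cannot open $listing: $!";
    open my $log_fh, '<', $log     or die "Cannot open $log: $!";
    my @files = files_from_ls_lR(<$ls_fh>);
    my %seen  = requested_paths(<$log_fh>);
    print "$_\n" for unserved_files(\@files, \%seen);
}
```

Some caveats: a request for a directory such as /it/ is not credited to its index file, URL-encoded names (spaces as %20, for example) are not decoded, and a file missing from one month of logs is only "not served recently", which is not quite the same as "not linked".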
Re: Finding files on a website that have been orphaned by waswas-fng (Curate) on Dec 05, 2003 at 23:15 UTC
Here is what I would do if stuck without modules on the Unix side. Use the Cygwin Perl install with HTML::LinkExtractor and URI to create a list of relative file paths on the webserver, and save that as a text file. Move the text file over to the Unix box, where you write a script that loads the text file into a hash and then walks the files in the DocumentRoot with File::Find. Have a sub that massages $File::Find::name to match the relative paths in the text file, and then look the result up in the hash. If you have a match, the file is active. If you do not, the file is not. If you must run this more than once, get your lazy admin to install the modules you need on the server. =) -Waswas
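The Unix-side half of this could look roughly like the sketch below. It uses only core modules (File::Find and File::Spec), so nothing needs installing on the server. It assumes the list file holds one path per line, relative to the DocumentRoot (for example "it/index.html"); the file format, the sub names and the script name are all illustrative and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Spec;

# Load the list of linked paths (one per line) into a hash for fast lookup.
sub load_linked {
    my ($list_file) = @_;
    open my $fh, '<', $list_file or die "Cannot open $list_file: $!";
    my %linked;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;     # the list was probably written on Windows
        $line =~ s{^/+}{};    # tolerate a leading slash
        next unless length $line;
        $linked{$line} = 1;
    }
    close $fh;
    return \%linked;
}

# Walk the document root and return every regular file that is not in the hash.
sub find_orphans {
    my ($docroot, $linked) = @_;
    my @orphans;
    find({
        no_chdir => 1,
        wanted   => sub {
            return unless -f $File::Find::name;
            # Convert the full name to a path relative to the document root
            # so it can be compared with the entries in the list.
            my $rel = File::Spec->abs2rel($File::Find::name, $docroot);
            push @orphans, $rel unless $linked->{$rel};
        },
    }, $docroot);
    my @sorted = sort @orphans;
    return @sorted;
}

# Hypothetical usage: perl orphans.pl linked.txt /www/docs
if (@ARGV == 2) {
    my ($list_file, $docroot) = @ARGV;
    print "$_\n" for find_orphans($docroot, load_linked($list_file));
}
```

The comparison is an exact string match, so the paths in the list need the same case and form as the names on disk. Files that are reached only from scripts, stylesheets or form actions will be reported as orphans unless the extraction step records those references too.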
Re: Finding files on a website that have been orphaned by bradcathey (Prior) on Dec 06, 2003 at 15:28 UTC
Along the lines of simonm's suggestion: if you have FTP permissions, use a code generator like GoLive to create a new site from the server. It will pull down all the files and show you all the broken links, or "cruft." It's a crap shoot but worth a try. We have done it and it has worked, but it requires a bit of "analog" work (god forbid). Still, by the time you install modules and all the rest to make it digital, that route could take longer. FWIW. Good luck. -Brad "A little yeast leavens the whole dough."