Scripts to recursively reading in HTML files

mwhiting has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I need a short Perl subroutine that will recursively change the file paths in an HTML file to new file paths.

My HTML pages are designed to normally call graphics & server-side includes, etc. from subfolders (as usual). But I have a Perl program that needs to open one of these HTML pages, read its lines, and print them as the output of the Perl program.

The catch then, is that the references in the HTML are all wrong (images, links, server side includes, etc), since the Perl operates from the usual /cgi-bin folder, not from the usual HTML folder. The file structure is very standard, such as follows:

public_html
    /cgi-bin
    /images
    /stylesheets
Further to that, the file references in the related SSI files in /stylesheets will be all wrong too.

I'm sure I've seen references in places to short scripts that will do this, and I'm looking to pick one of them up instead of writing it myself. In general it just needs to read an .html file line by line, try to match a text fragment & add in the "../" text. But it also needs to work recursively, identifying and opening the SSIs, and making the same substitutions on them. I suppose this is the complicating factor that is making me look for a script - the reading and substituting itself is straightforward.

Does anyone know of an existing script that will do this?

Alternatively, is there a different way to make this work? The real problem is that the server is working with /cgi-bin as the default directory, instead of the public_html folder above it, as it would with any regular websurfing.

Thank you so much.

Replies are listed 'Best First'.
Re: Scripts to recursively reading in HTML files by dragonchild (Archbishop) on Aug 31, 2004 at 17:07 UTC
You're going to want to use HTML::Parser and File::Spec. Hooking those two modules together will get you about 90% of the way there.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
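
A minimal sketch of what hooking those two modules together might look like, offered as one possible reading of the suggestion rather than dragonchild's own code. The names `fix_ref`, `rewrite_file` and `%IS_REF` are made up for illustration. It assumes includes are written as `<!--#include file="..." -->` (the `virtual=` form is not handled), that inlining the processed include is what is wanted, and that HTML::Parser is installed from CPAN. Tag and attribute names come back lowercased, and every attribute value is re-emitted in double quotes.

```perl
use strict;
use warnings;
use HTML::Parser ();                       # CPAN distribution, not core Perl
use HTML::Entities qw(encode_entities);    # ships with HTML::Parser
use File::Spec ();
use File::Basename qw(dirname);

# Attributes assumed to hold file references (illustrative, not exhaustive).
my %IS_REF = map { $_ => 1 } qw(src href background);

# Prefix a relative reference; leave URLs with a scheme, root-relative
# paths and bare fragments untouched.
sub fix_ref {
    my ($ref, $prefix) = @_;
    return $ref if $ref =~ m{^(?:[a-z][a-z0-9+.\-]*:|/|\#)}i;
    return $prefix . $ref;
}

# Return the contents of $file with references prefixed and
# <!--#include file="..." --> directives replaced by the processed include.
sub rewrite_file {
    my ($file, $prefix, $active) = @_;
    $active ||= {};
    die "include loop at $file\n" if $active->{$file};
    $active->{$file} = 1;

    my $out    = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tags: rebuild the tag, fixing reference attributes on the way.
        start_h => [ sub {
            my ($tag, $attr, $order) = @_;
            $out .= "<$tag";
            for my $name (@$order) {
                if ($name eq '/') { $out .= ' /'; next }    # XHTML-style <br />
                my $value = $attr->{$name};
                $value = fix_ref($value, $prefix) if $IS_REF{$name};
                $out .= sprintf ' %s="%s"', $name, encode_entities($value);
            }
            $out .= '>';
        }, 'tagname, attr, attrseq' ],
        # Comments: recurse into SSI includes, pass everything else through.
        comment_h => [ sub {
            my ($text) = @_;
            if ($text =~ /^<!--\s*#include\s+file="([^"]+)"\s*-->$/) {
                my $inc = File::Spec->catfile(dirname($file), $1);
                $out .= rewrite_file($inc, $prefix, $active);
            }
            else {
                $out .= $text;
            }
        }, 'text' ],
        # Text, end tags, declarations: copy unchanged.
        default_h => [ sub { $out .= $_[0] }, 'text' ],
    );
    $parser->parse_file($file) or die "cannot read $file: $!\n";

    delete $active->{$file};
    return $out;
}
```

From a script in /cgi-bin, a call such as `print rewrite_file('../index.html', '../');` (the path is hypothetical) would then emit the page with its relative references pointing one level up.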
Re: Scripts to recursively reading in HTML files by CountZero (Bishop) on Aug 31, 2004 at 18:11 UTC
In the spirit of TIMTOWTDI, I would turn the problem upside down. Rather than going through all your web-pages and SSI-scripts and looking for filenames, I'd first make a list of all files which are (or are likely to be) referenced in my web-pages. Then, armed with this list, it is easy to do a straight search and replace for every filename on every line in your web-pages. You may even be able to collate your list of filenames automatically by recursively searching your web-pages tree. You will have to watch out that you do not change filenames in the text itself, but it is my experience that you will find few or no filenames in the text of business web-pages.
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Scripts to recursively reading in HTML files by ikegami (Patriarch) on Aug 31, 2004 at 17:06 UTC
"The real problem is that the server is working with /cgi-bin as the default directory, instead of the public_html folder above it, as it would with any regular websurfing"
Have you looked into the possibility of moving your script outside of cgi-bin and into public_html? In my experience, just about every shared-hosting Apache server is configured to run CGI scripts anywhere, if they have the .cgi (sometimes .pl) extension. If yours is not, have you tried adding `AddHandler cgi-script .cgi` to the server's configuration files or to a .htaccess in public_html? While this may not be the answer you were expecting, it will save you lots of work.
Update: Rephrased a bit to clarify in response to sintadil's comment about default settings. He also mentioned the security consequences of my suggestion.
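
For what it's worth, the suggestion above might look like this as an .htaccess file. This is a sketch that assumes the host's main configuration permits these overrides (`AllowOverride` covering `Options` and `FileInfo`). The `Options +ExecCGI` line is not in the post; Apache normally needs it before it will run CGI scripts outside a ScriptAlias directory.

```apache
# .htaccess in public_html (sketch; only works if the host allows these overrides)
Options +ExecCGI
AddHandler cgi-script .cgi
```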
Re^2: Scripts to recursively reading in HTML files by sintadil (Pilgrim) on Aug 31, 2004 at 17:34 UTC
"Most Apache servers (dunno about others) are configured to run CGI scripts anywhere, if they have the .cgi (sometimes .pl) extension."
I just checked both a 1.x and a 2.x default configuration file, and neither of them is set up that way. The lines in the files exist, yes, but they're commented out. This default may be the case with some Linux or other vendor-supplied default configuration files, but it's not the case with the Apache-supplied defaults. Also, doing that sort of wildcarding is a bad idea, IMO, because it can lead to unintended execution of files.
Re: Scripts to recursively reading in HTML files by Anonymous Monk on Sep 01, 2004 at 00:25 UTC
Why don't you just put a leading forward slash on all the hrefs in the HTML files? href="/stylesheets/blah.css" will reference www.siteurl.com/stylesheets/blah.css no matter where it's referenced from, whereas href="stylesheets/blah.css" will reference blah.css relative to the current path. Try it out :)
Re^2: Scripts to recursively reading in HTML files by mwhiting (Beadle) on Sep 01, 2004 at 01:19 UTC
Actually that's a good, simple idea. It occurred to me that a 'redirect' statement in .htaccess should also be able to fix this. If I redirect all calls to cgi-bin/images to /images, shouldn't this do it too? I'm partly asking because I've tried it and the server doesn't seem to be able to follow through the redirection and find the file. The .htaccess file is being read, because if I put garbage in it the server chokes on it. Here's the line I've used in the .htaccess file that I've put in the /cgi-bin folder: `Redirect /images http://www.mysite.com/images/` I believe this should redirect any calls to cgi-bin/images (which doesn't exist) to the correct folder. I've experimented with different arrangements of slashes, and with putting the full website address in the first parameter too - no success. The webpages that I've read which describe the use of this line make it sound like this is all it requires. Is it possible the server is disallowing an HTTP call to the cgi-bin folder, since it is not a folder for web-browsing? Can my .htaccess file overcome this too?
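
One possible reason the line above never fires, offered as an assumption rather than a diagnosis: per the mod_alias documentation, the first argument to `Redirect` is matched against the URL path as seen from the site root, even when the directive sits in a subdirectory's .htaccess. Requests for /cgi-bin/images/... would therefore not match `/images`. A sketch of the variant that would match, reusing the hostname from the post:

```apache
# .htaccess in /cgi-bin (sketch; assumes the host honours .htaccess in that folder)
Redirect /cgi-bin/images http://www.mysite.com/images
```

Keeping the trailing slashes consistent on both arguments (both with, or both without) avoids doubled or missing slashes in the redirected URL.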
Re^3: Scripts to recursively reading in HTML files by snookmz (Beadle) on Sep 01, 2004 at 03:19 UTC
I was the Anonymous Monk before (forgot to log in). I'm not sure why you'd want to put a 'redirect' into your Apache config. Explicitly linking hrefs to the base URL (forward slashes before links, e.g. href="/images/somegif.jpg") will solve your problem. You say you've tried all manner of slashes in your links; can you show me what's not working? The leading forward slash should work for all HTML/image/CSS references, but maybe not for SSIs (I don't use SSIs, so I'm not sure how they'd react).
Re: Scripts to recursively reading in HTML files by wfsp (Abbot) on Aug 31, 2004 at 17:15 UTC
Hi, If I've understood you correctly, would it be worth considering using the output of the CGI script to create an HTML file in the same directory as the original file and then redirecting to it? I'm thinking this might help with the links problem.
Re^2: Scripts to recursively reading in HTML files by mwhiting (Beadle) on Aug 31, 2004 at 17:52 UTC
I had thought of this as an approach, but I didn't like the look of the blank screen/redirect process. It seems a bit cumbersome & distracting for a business's website. I was hoping to do it by some means that wouldn't appear unusual at all to the user. Likewise, running CGI from the main HTML level hasn't been the standard so far, so I'd rather not do that. Has no one heard of a script to do this?
Re^3: Scripts to recursively reading in HTML files by wfsp (Abbot) on Aug 31, 2004 at 18:09 UTC
Then it looks like dragonchild's suggestion is the best bet. Also, I have found `URI` useful for building/rebuilding relative links. On that last point, perhaps using absolute addresses (starting with /) might help? In any event, if you're parsing HTML, _don't_ try to do it with regexes! If anyone has heard of a script, I'd like to see it! Best of luck.
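
A small sketch of how `URI` might be used for this, under the assumption that the URL the page normally lives at and the URL of the CGI script are both known. The function names and the example.com URLs below are hypothetical; `URI` is a CPAN module, not core Perl.

```perl
use strict;
use warnings;
use URI ();    # CPAN module, not part of core Perl

# Resolve $link as a browser would if the page lived at $page_url, and
# return the root-relative path (starting with "/"). Any query string or
# fragment on the link is dropped by ->path.
sub root_relative {
    my ($link, $page_url) = @_;
    return URI->new_abs($link, $page_url)->path;
}

# Same resolution, but expressed relative to the URL the CGI script
# is served from.
sub relative_to_script {
    my ($link, $page_url, $script_url) = @_;
    return URI->new_abs($link, $page_url)->rel($script_url)->as_string;
}
```

For example, `root_relative('images/logo.gif', 'http://www.example.com/index.html')` returns `/images/logo.gif`, which is the "absolute address" form mentioned above and works regardless of which folder serves the page.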
Re: Scripts to recursively reading in HTML files by cosimo (Hermit) on Sep 01, 2004 at 11:32 UTC
In some cases, when you have structured your HTML in a good (for any meaning of "good" :-) way, you can just add an extra `BASE` tag in the `HEAD` section, like this: `<HTML> <HEAD> <BASE HREF="http://mywww.mydomain.org/"> ... </HEAD> ... <I>proceed with normal html page</I> ...` This will automatically make sure your browser loads relative URLs inside your page as if the starting path were `"/"` (root of your webserver). Probably you can also omit the domain part (http://...) and leave only the `"/"`. Hope this helps!
Re: Scripts to recursively reading in HTML files by vladdrak (Monk) on Sep 01, 2004 at 07:21 UTC
It's probably best to use an .htaccess-like solution if you can. If you decide to do a find & replace, or are curious as to how this might be done for something else:
Unix: `perl -pi.bak -e 's{src="}{src="../}g' $(find /basepath -name '*.html')`
Windows: `for /f "delims=" %i in ('dir /s /b d:\basepath ^| findstr html$') do perl -pi.bak -e "s{src=\x22}{src=\x22../}g" "%i"`
This will perform the substitution on all HTML files underneath the specified base directory, and make a backup file (extension .bak) for each edited file. You'll probably want to season to taste, and practice on a backup a couple of times.
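
The same find & replace can also be sketched as a single script that runs unchanged on Unix and Windows, using only core modules. This is an illustration under stated assumptions, not a drop-in tool: `add_prefix` and `fix_tree` are made-up names, only double-quoted `src="..."` values are handled, and running it twice would add the prefix twice.

```perl
use strict;
use warnings;
use File::Find qw(find);

# Prefix relative src="..." values with "../"; values that start with a
# URL scheme or with "/" are left alone.
sub add_prefix {
    my ($html) = @_;
    $html =~ s{\b(src=")(?![a-z][a-z0-9+.\-]*:|/)}{$1../}gi;
    return $html;
}

# Rewrite every .htm/.html file under $base, keeping the original
# alongside it as <name>.bak.
sub fix_tree {
    my ($base) = @_;
    find(sub {
        my $name = $_;    # File::Find has already chdir'd into the file's directory
        return unless -f $name && $name =~ /\.html?\z/i;

        open my $in, '<', $name or die "cannot read $File::Find::name: $!\n";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;

        rename $name, "$name.bak" or die "cannot back up $File::Find::name: $!\n";
        open my $out, '>', $name or die "cannot write $File::Find::name: $!\n";
        print {$out} add_prefix($html);
        close $out or die "cannot close $File::Find::name: $!\n";
    }, $base);
}
```

Calling `fix_tree('/basepath')` corresponds to the one-liners above; as with them, practice on a copy first.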