kevin4truth has asked for the wisdom of the Perl Monks concerning the following question:

Hello:

I have a Web site that uses PHP to serve pages for display. A typical URL might be something like: http://www.mysite.org/?page=some_html_page

I want to create a Perl script that copies the Web site content to a temp directory structure. The script will then go through all the URLs inside the many HTML files in that temp dir and change the above URL pattern to a relative link like "/pages/some_html_page.htm", save the file, and move on to the next file in the directory until all the files contain relative URL links.

The goal is to create a localized version (tar.zip) of the Web site so people can download it, unzip it, and browse locally on their computers without having a local Web server installed.

I used Perl many years ago, so there's lots of rust to break off with me.

Can someone please point me in the right direction as far as possible pre-existing code or something I can build on to get started?

Thanks Much,

Kevin

Replies are listed 'Best First'.
Re: Need direction on mass find/replacement in HTML files.
by WizardOfUz (Friar) on Apr 20, 2010 at 17:03 UTC

    Maybe you can use Wget for this? Take a look at the --mirror and the --convert-links options.

      Thanks, but I already tried that with great pain; it didn't properly parse and localize the PHP queries.
Re: Need direction on mass find/replacement in HTML files.
by starX (Chaplain) on Apr 20, 2010 at 18:42 UTC
    I've written things like this before that do something similar with files in a local directory. In those cases, it was pretty simple to set up a stack of directories to investigate separate from the list of files I was building, and just keep processing until the stack was empty. I think your problem is similar.

    It might look something like this:

    while (@url_stack) {
        my $url = pop @url_stack;
        open URL, '<', $url or die "Can't open $url: $!";
        while (<URL>) {
            my @words = split / /, $_;    # split a line into words
            # For each word in the line, see if it's a URL. If it is,
            # push it onto the stack and substitute the local path.
            foreach my $word (@words) {
                if ($word =~ m{^http://}) {
                    push @url_stack, $word;
                    $word =~ s/$remote_path/$local_path/;
                }
            }
            # join all the words together into a new line
            my $output_line = join ' ', @words;
            # write that line into the local version of the file
            print LOCAL_VERSION $output_line;
        }
        close URL;
    }

    As this is intended to be a pseudo-code snippet, I'm obviously leaving a lot out, like opening the output file, &c, but I think the basic premise is sound.

    That said, I'm sure there is an easier way to do it. w3mir, for example. You also might look into wget options to make sure you're not missing something in there. Good luck!

      Thanks StarX for your help. The process takes place on the server itself, so I don't need to pull the content to a client, therefore wget doesn't fit the bill.

      I want to use a regex for substitution on the URLs in the files, but I have the following issue:

      I want to globally change the following: <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

      into: <a href="pages/contacts.htm"><font color="#269BD5">

      You'll notice that the match would be http://www.mysite.org/?page= but I also need to add ".htm" to the end of "contacts" so it becomes contacts.htm. This part of the URL is variable, so how can I use a regex replace to match the above and also append ".htm" to that variable part?

      Here are a few dummy URLs for example so you can see the pattern and the variable too.

      <a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">

      change to: <a href="pages/newsletter.htm"><font color="#269BD5">

      <a href="http://www.mysite.org/?page=faq">

      change to: <a href="pages/faq.htm">

      So, again the script needs to replace all the full absolute URL links with nothing and replace the PHP "?page=" with just the variable page name (i.e. contacts) plus the ".htm"

      Is there a combination of Perl code and/or regex that can do this? Any help would be greatly appreciated!

        To parse HTML and manipulate URIs, it might be worth using modules that can do the heavy lifting for you.

        This presses HTML::TreeBuilder and URI into service.

#! /usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;
use URI;

my $t = HTML::TreeBuilder->new_from_file(*DATA)
    or die qq{TB->new failed: $!\n};

my @anchors = $t->look_down(_tag => q{a});

for my $anchor (@anchors) {
    my $href = $anchor->attr(q{href});
    my $uri  = URI->new($href);
    my $host = $uri->host;
    next unless $host eq q{www.mysite.org};
    my %query_form = $uri->query_form;
    next unless exists $query_form{page};
    my $replace = sprintf(q{pages/%s.htm}, $query_form{page});
    $anchor->attr(q{href}, $replace);
}

print $t->as_HTML(q{}, q{ }, {p => 0});

__DATA__
<html><head><title>mysite</title></head>
<body>
<p><a href="http://www.mysite.org/?page=contacts">text</a></p>
<p><a href="http://www.mysite.org/?page=newsletter">text</a></p>
<p><a href="http://www.mysite.org/?page=faq">text</a></p>
</body>
</html>
<html>
 <head>
  <title>mysite</title>
 </head>
 <body>
  <p><a href="pages/contacts.htm">text</a></p>
  <p><a href="pages/newsletter.htm">text</a></p>
  <p><a href="pages/faq.htm">text</a></p>
 </body>
</html>
        See also HTML::Element.
        Something like this?
        s{http://www\.mysite\.org/\?page=([^"]+)}{pages/$1.htm}g