in reply to Re: Need direction on mass find/replacement in HTML files.
in thread Need direction on mass find/replacement in HTML files.

Thanks StarX for your help. The process takes place on the server itself, so I don't need to pull the content to a client; wget therefore doesn't fit the bill.

I want to use a regex for substitution on the URLs in the files, but I have the following issue:

I want to globally change the following: <a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">

into: <a href="pages/contacts.htm"><font color="#269BD5">

You'll notice that the match would be http://www.mysite.org/?page= but I also need to add ".htm" to the end of "contacts" so it becomes contacts.htm. This part of the URL is variable, so how can I use a regex replace to match the above and also add ".htm" to the end of that variable part?

Here are a few dummy URLs for example so you can see the pattern and the variable too.

<a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">

change to: <a href="pages/newsletter.htm"><font color="#269BD5">

<a href="http://www.mysite.org/?page=faq">

change to: <a href="pages/faq.htm">

So, again, the script needs to replace the absolute URL prefix together with the PHP "?page=" with "pages/", keeping just the variable page name (e.g. contacts) and appending ".htm".

Is there a combination of Perl code and/or regex that can do this? Any help would be greatly appreciated!

Re^3: Need direction on mass find/replacement in HTML files.
by wfsp (Abbot) on Apr 29, 2010 at 16:58 UTC
    To parse HTML and manipulate a URI it might be worth considering using modules that can do the heavy lifting for you.

    This presses HTML::TreeBuilder and URI into service.

    #! /usr/bin/perl
    use strict;
    use warnings;

    use HTML::TreeBuilder;
    use URI;

    my $t = HTML::TreeBuilder->new_from_file(*DATA)
      or die qq{TB->new failed: $!\n};

    # collect every <a> element in the document
    my @anchors = $t->look_down(_tag => q{a});
    for my $anchor (@anchors){
      my $href = $anchor->attr(q{href});
      next unless defined $href;        # skip anchors without an href

      my $uri = URI->new($href);
      next unless $uri->can(q{host});   # relative and mailto: links have no host
      my $host = $uri->host;
      next unless defined $host and $host eq q{www.mysite.org};

      # only rewrite links that carry a "page" query parameter
      my %query_form = $uri->query_form;
      next unless exists $query_form{page};

      my $replace = sprintf(q{pages/%s.htm}, $query_form{page});
      $anchor->attr(q{href}, $replace);
    }
    print $t->as_HTML(q{}, q{ }, {p => 0});

    __DATA__
    <html><head><title>mysite</title></head>
    <body>
    <p><a href="http://www.mysite.org/?page=contacts">text</a></p>
    <p><a href="http://www.mysite.org/?page=newsletter">text</a></p>
    <p><a href="http://www.mysite.org/?page=faq">text</a></p>
    </body>
    </html>

    Output:

    <html>
    <head>
    <title>mysite</title>
    </head>
    <body>
    <p><a href="pages/contacts.htm">text</a></p>
    <p><a href="pages/newsletter.htm">text</a></p>
    <p><a href="pages/faq.htm">text</a></p>
    </body>
    </html>
    See also HTML::Element.
Re^3: Need direction on mass find/replacement in HTML files.
by choroba (Cardinal) on Apr 29, 2010 at 16:11 UTC
    Something like this?
    s%http://www\.mysite\.org/\?([^=]+)=(.*?)"%${1}s/$2.htm"%g
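
For anyone who wants to try the regex route on the sample links from the question, here is a minimal sketch. It assumes every link has exactly the form http://www.mysite.org/?page=NAME with the closing quote straight after the name, and it hard-codes the "page" parameter instead of capturing it. The helper name rewrite_links is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# rewrite_links() is a hypothetical helper name, not something from the thread.
sub rewrite_links {
    my ($html) = @_;

    # \Q...\E quotes the dots and the question mark in the literal prefix.
    # $1 is the page name: one or more characters that are neither a
    # double quote nor an ampersand, followed directly by the closing quote.
    # Links that carry extra query parameters therefore stay untouched.
    $html =~ s{\Qhttp://www.mysite.org/?page=\E([^"&]+)"}{pages/$1.htm"}g;

    return $html;
}

# Sample input taken from the question.
my $sample = <<'END_HTML';
<a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">
<a href="http://www.mysite.org/?page=faq">
END_HTML

print rewrite_links($sample);
# prints:
# <a href="pages/contacts.htm"><font color="#269BD5">
# <a href="pages/faq.htm">
```

The same substitution can be run over files in place with perl's -p and -i switches (for example -i.bak to keep backups). A regex like this only sees the exact quoting and spacing it was written for, so the parser-based reply in this thread is the safer choice for messy markup.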