I've written similar tools before that work on files in a local directory. In those cases it was simple to keep a stack of directories still to be investigated, separate from the list of files I was building, and to keep processing until the stack was empty. I think your problem is similar.
It might look something like this:
use LWP::Simple qw(get);

my @url_stack = @ARGV;    # starting URLs
my %seen;                 # URLs already processed, to avoid loops
while (@url_stack) {
    my $url = pop @url_stack;
    next if $seen{$url}++;
    my $content = get($url);
    next unless defined $content;
    for my $line (split /^/, $content) {
        my @words = split / /, $line;    # split a line into words
        # For each word in the line, see if it's a URL. Push
        # it to the stack and substitute the local path if it
        # is
        foreach my $word (@words) {
            if ($word =~ m{^http://}) {
                push @url_stack, $word;
                # remote_path and local_path are placeholders
                $word =~ s/remote_path/local_path/;
            }
        }
        # join all the words together into a new line
        my $output_line = join ' ', @words;
        # write that line into the local version of the file.
        print LOCAL_VERSION $output_line;
    }
}
This is intended as a pseudo-code snippet, so I'm obviously leaving a lot out, like opening the output file, etc., but I think the basic premise is sound.
That said, I'm sure there is an easier way to do it. w3mir, for example. You also might look into wget options to make sure you're not missing something in there. Good luck!
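For reference, the local-directory version of the stack idea described at the top might look roughly like the sketch below. It is only an illustration under assumptions: the helper name collect_files is made up for this example, and it uses nothing beyond core Perl.

```perl
use strict;
use warnings;
use File::Spec;

# collect_files is a hypothetical helper name, not code from the post above.
# @dir_stack holds directories still to investigate; @files is the separate
# list of files being built.
sub collect_files {
    my ($root) = @_;
    my @dir_stack = ($root);
    my @files;
    while (@dir_stack) {
        my $dir = pop @dir_stack;
        my $dh;
        unless (opendir $dh, $dir) {
            warn "Cannot open $dir: $!\n";
            next;
        }
        # no_upwards drops the "." and ".." entries
        my @entries = File::Spec->no_upwards(readdir $dh);
        closedir $dh;
        for my $entry (@entries) {
            my $path = File::Spec->catfile($dir, $entry);
            if (-d $path && !-l $path) {
                push @dir_stack, $path;    # investigate later; symlinked dirs are skipped
            }
            elsif (-f $path) {
                push @files, $path;
            }
        }
    }
    my @sorted = sort @files;
    return @sorted;
}

# Print every file below the directory given on the command line, if any.
if (@ARGV) {
    print "$_\n" for collect_files($ARGV[0]);
}
```

Because the stack is processed until it is empty, no recursion is needed, and the same loop shape carries over to URLs if you also remember which ones you have already visited.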
Re^2: Need direction on mass find/replacement in HTML files.
by kevin4truth (Initiate) on Apr 29, 2010 at 15:36 UTC
Thanks, StarX, for your help. The process takes place on the server itself, so I don't need to pull the content to a client, and wget doesn't fit the bill.
I want to use a regex for substitution on the URLs in the files, but I have the following issue:
I want to globally change the following:
<a href="http://www.mysite.org/?page=contacts"><font color="#269BD5">
into:
<a href="pages/contacts.htm"><font color="#269BD5">
You'll notice that the part to match is http://www.mysite.org/?page= but I also need to add ".htm" to the end of "contacts" so it becomes contacts.htm. That part of the URL is variable, so how can I use a regex replace to match the above and also add ".htm" to the end of the variable part?
Here are a few dummy URLs for example so you can see the pattern and the variable too.
<a href="http://www.mysite.org/?page=newsletter"><font color="#269BD5">
change to: <a href="pages/newsletter.htm"><font color="#269BD5">
<a href="http://www.mysite.org/?page=faq">
change to: <a href="pages/faq.htm">
So, again, the script needs to strip the absolute part of every such link and replace the PHP "?page=" query with "pages/", the variable page name (e.g. contacts), and ".htm".
Is there a combination of Perl code and/or regex that can do this? Any help would be greatly appreciated!
#! /usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Parse the sample document that follows __DATA__.
my $t = HTML::TreeBuilder->new_from_file(*DATA)
    or die qq{TB->new failed: $!\n};

# Rewrite every <a> whose href is http://www.mysite.org/?page=NAME.
my @anchors = $t->look_down(_tag => q{a});
for my $anchor (@anchors){
    my $href = $anchor->attr(q{href});
    next unless defined $href;          # anchor without an href
    my $uri = URI->new($href);
    next unless $uri->can(q{host});     # relative links have no host method
    my $host = $uri->host;
    next unless defined $host and $host eq q{www.mysite.org};
    my %query_form = $uri->query_form;
    next unless exists $query_form{page};
    my $replace = sprintf(q{pages/%s.htm}, $query_form{page});
    $anchor->attr(q{href}, $replace);
}
print $t->as_HTML(q{}, q{ }, {p => 0});
__DATA__
<html><head><title>mysite</title></head>
<body>
<p><a href="http://www.mysite.org/?page=contacts">text</a></p>
<p><a href="http://www.mysite.org/?page=newsletter">text</a></p>
<p><a href="http://www.mysite.org/?page=faq">text</a></p>
</body>
</html>
Output:
<html>
<head>
<title>mysite</title>
</head>
<body>
<p><a href="pages/contacts.htm">text</a></p>
<p><a href="pages/newsletter.htm">text</a></p>
<p><a href="pages/faq.htm">text</a></p>
</body>
</html>
See also HTML::Element.
s%http://www\.mysite\.org/\?([^=]+)=(.*?)"%$1s/$2.htm"%g
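A capture group is what lets a substitution keep the variable page name and append ".htm" to it. The sketch below wraps that idea in a small function so it is easy to try on a string first. The helper name localize_links is invented for this example, and the host and the "pages/" target are taken from the sample links in this thread; it is one possible approach, not a tested migration script.

```perl
use strict;
use warnings;

# localize_links is a hypothetical helper name used only for illustration.
sub localize_links {
    my ($html) = @_;
    # \? matches the literal question mark. ([^"&]+) captures the page name
    # up to the closing quote (or a further query parameter), and $1 puts it
    # back in the replacement, followed by ".htm".
    $html =~ s{href="http://www\.mysite\.org/\?page=([^"&]+)"}{href="pages/$1.htm"}g;
    return $html;
}

# When file names are given, slurp each file whole and print the rewritten HTML.
if (@ARGV) {
    local $/;    # undef record separator: each read returns a whole file
    while (my $html = <>) {
        print localize_links($html);
    }
}
```

Saved under a name of your choice (say localize.pl, also an assumption), it could be run as "perl localize.pl page.html > page.new.html" so the original file stays untouched until you have checked the result. Links to other hosts, or links without a page parameter, are left alone.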