I've been trying to develop a system to hit sites and get the HTML from them, then parse/simplify it, removing HTML tables and replacing images with text (some of you've been very helpful with that already).
The procedure (and I know I should use a module, but I need the making-code experience more than I need the using-module experience right now) is as follows:
- get their html with LWP::Simple
- strip out chunks I know I can't use like SCRIPT
- encode the HTML I want to keep, a few basic tags, into a temporary format -- {{{TAG}}}
- remove all the other HTML
- put the tags I want to keep back
- export the HTML to a file
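Boiled down, the steps above look roughly like the sketch below. The `simplify` sub, the sample string and the shortened keeper list are made up for illustration and are not part of the script further down, which works on whole fetched pages; the fetch and file-writing steps are left out here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the real script): the same steps,
# applied to a string instead of a fetched page.
sub simplify {
    my ($html, @keepers) = @_;

    # strip chunks that are never wanted
    $html =~ s/<script[^>]*>.*?<\/script>//sgi;

    # encode the tags to keep as {{{tag}}}; \b stops "b" matching "br"
    my $keep = join '|', map { quotemeta } @keepers;
    $html =~ s/<(\/?(?:$keep)\b[^>]*)>/{{{$1}}}/gi;

    # remove all the other HTML
    $html =~ s/<[^>]*>//g;

    # put the kept tags back
    $html =~ s/\{\{\{/</g;
    $html =~ s/\}\}\}/>/g;

    return $html;
}

my $in  = '<div><b>bold</b> <font size="2">text</font><script>x()</script></div>';
my $out = simplify($in, qw(b i p));
print "$out\n";    # prints: <b>bold</b> text
```

Doing the keep-list as one alternation is only a convenience for the sketch; looping over the tags one at a time, as in the script, amounts to the same thing.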
This works pretty well, most of the time, but on some pages it doesn't work properly... I go through all those lines of
$html =~ s///
stuff, and when the output stage happens at the end I get the exact same $html that I had at the
$html = join('',@html);
line -- which I don't understand at all.
Even if I step through it going
$html =~ s/something/something else/sgi;
print "your html is currently:\n\n$html";
at each stage, it appears to work! Then I output it, and the variable appears to have reverted to its original state before all the changes. And it only happens with some pages, not all of them.
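A slightly less eyeball-dependent version of that check is to look at what each substitution returns, since `s///` gives back the number of replacements it made (or a false value if it made none). A minimal sketch, assuming a made-up input string rather than a real page:

```perl
use strict;
use warnings;

# Made-up input for illustration only.
my $html = '<p>one</p><script>x()</script><p>two</p>';

# s///g returns the number of substitutions made, false if none
my $scripts = ($html =~ s/<script[^>]*>.*?<\/script>//sgi) || 0;
my $tags    = ($html =~ s/<[^>]*>//g) || 0;

print "removed $scripts script block(s) and $tags tag(s)\n";
print "your html is currently:\n\n$html\n";
```

A count of 0 at a stage that should have matched would at least narrow down which substitution is not doing anything on the problem pages.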
It's obviously complete stupidity on my part, but I'd really appreciate it if you could take a look:
#!/usr/bin/perl -w
use diagnostics;
use CGI::Carp qw(fatalsToBrowser);
use LWP::Simple;
@pages = qw(
page1 http://www.page1.com/
page2 http://www.page2.com/
);
@keepers = qw(b blockquote br i li ol p ul);
# proceed through the array of sites two entries at a time,
# using the name and URL
for($i=0;$i<@pages;$i+=2){
$html = ""; # initialise variable
$pagename = $pages[$i];
$pageurl = $pages[$i+1];
print "accessing $pagename at $pageurl...<BR>\n";
#this is a bit cargo-cult, I got it from someone else's use of LWP::Simple
$doc=get($pageurl);
@html = $doc;
$html = join('',@html);
# non-greedy, so only the first title is captured and removed
($pagetitle) = $html =~ /<TITLE>(.*?)<\/TITLE>/si;
$html =~ s/<TITLE>.*?<\/TITLE>//si;
#kill any script blocks
$html =~ s/<script[^>]*>.*?<\/script>//sgi;
#kill any style blocks
$html =~ s/<style[^>]*>.*?<\/style>//sgi;
#replace images with [image]
$html=~s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[img".((defined $1)?":\"$1\"":"")."]"/sgei;
#temporarily encode the tags we're keeping into {{{tag}}} instead of <tag>
for($j=0;$j<@keepers;$j++){
my $tag = $keepers[$j];
# \b stops "b" from also matching "br" or "blockquote"; [^>]* allows attributes
$html =~ s/<($tag\b[^>]*)>/{{{$1}}}/sgi;
$html =~ s/<\/($tag\b[^>]*)>/{{{\/$1}}}/sgi;
}
#remove any remaining html
$html =~ s/<[^>]*>//sgi;
# re-encode the temporarily encoded tags
$html =~ s/\{{3}/</sgi;
$html =~ s/\}{3}/>/sgi;
#tighten up the code
$html =~ s/\s+/ /g;
#write out the file
print "Writing out the new $pagename.html file...<BR>\n";
open (PAGEOUTPUT, ">/www/db/mysite/mydirectory/$pagename.html") || die "WTF? $!";
print PAGEOUTPUT "<HTML><HEAD>\n<TITLE>$pagename</TITLE>\n</HEAD>\n<BODY>";
if(-e "/www/db/mysite/mydirectory/$pagename.gif"){
print PAGEOUTPUT "<CENTER><IMG SRC=\"$pagename.gif\"></CENTER><BR>";
}
print PAGEOUTPUT "<H1>$pagetitle</H1><BR>";
print PAGEOUTPUT $html;
print PAGEOUTPUT "</BODY></HTML>";
close (PAGEOUTPUT);
print "Finished processing $pagename...<BR><HR><BR>\n";
}