
I've been trying to develop a system to hit sites and get the HTML from them, then parse/simplify it, removing HTML tables and replacing images with text (some of you have been very helpful with that already).

The procedure (and I know I should use a module, but I need the making-code experience more than I need the using-module experience right now) is as follows:

get their html with LWP::Simple
strip out chunks I know I can't use like SCRIPT
encode the HTML I want to keep, a few basic tags, into a temporary format -- {{{TAG}}}
remove all the other HTML
put the tags I want to keep back
export the HTML to a file
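
The steps above can be sketched as a single sub. This is only a minimal sketch, assuming the HTML is already in a string; the sub name simplify_html and the sample input are hypothetical and not taken from the script further down. It differs from that script in one respect: it uses \b after the tag name so that "b" does not also match <br>, and it discards attributes on the kept tags.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): applies the
# strip / encode / strip / decode steps to one HTML string.
sub simplify_html {
    my ($html, @keepers) = @_;

    # Drop blocks whose contents are never wanted.
    $html =~ s/<script\b[^>]*>.*?<\/script>//sgi;
    $html =~ s/<style\b[^>]*>.*?<\/style>//sgi;

    # Encode the tags to keep as {{{tag}}} or {{{/tag}}}, lowercased.
    # \b stops "b" from matching <br> or <body>; attributes are dropped.
    my $keep = join '|', map { quotemeta } @keepers;
    $html =~ s/<(\/?)($keep)\b[^>]*>/{{{$1\L$2\E}}}/gi;

    # Remove every remaining tag, then turn the placeholders back into tags.
    $html =~ s/<[^>]*>//g;
    $html =~ s/\{\{\{/</g;
    $html =~ s/\}\}\}/>/g;

    # Collapse runs of whitespace.
    $html =~ s/\s+/ /g;
    return $html;
}

my $sample = '<table><tr><td><P class="x">Hi <b>there</b><br>'
           . '<script>x()</script></td></tr></table>';
print simplify_html($sample, qw(b blockquote br i li ol p ul)), "\n";
# prints: <p>Hi <b>there</b><br>
```

Note that any literal {{{ or }}} already present in the page text would also be turned into angle brackets by the decode step.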

This works pretty well most of the time, but on some pages it doesn't work properly. I go through all those lines of $html =~ s/// stuff, and when the output stage happens at the end I get the exact same $html that I had at the $html = join('',@html); line, which I don't understand at all. Even if I step through it going

$html =~ s/something/something else/sgi;
print "your html is currently:\n\n$html";

at each stage, it appears to work! Then I output it, and the variable appears to have reverted to its original state before all the changes. And it only happens with some pages, not all of them. It's obviously complete stupidity on my part, but I'd really appreciate it if you could take a look:

#!/usr/bin/perl -w
use diagnostics;
use CGI::Carp qw(fatalsToBrowser);
use LWP::Simple;
@pages = qw(
page1 http://www.page1.com/ 
page2 http://www.page2.com/ 
);

@keepers = qw(b blockquote br i li ol p ul);

# proceed through the array of sites 2 by 2
# using the name and URL

for($i=0;$i<@pages;$i+=2){
    $html = ""; # initialise variable
    $pagename = $pages[$i];
    $pageurl = $pages[$i+1];
    print "accessing $pagename at $pageurl...<BR>\n";
    
    #this is a bit cargo-cult, I got it from someone else's use of LWP::Simple
    $doc=get($pageurl);
    @html = $doc;
    $html = join('',@html);
    
    ($pagetitle) = $html =~ /<TITLE>(.*?)<\/TITLE>/sgi;
    $html =~ s/<TITLE>(.*?)<\/TITLE>//sgi;
    
    #kill any script blocks
    $html =~ s/<script[^>]*>.*?<\/script>//sgi; 
    
    #kill any style blocks
    $html =~ s/<style[^>]*>.*?<\/style>//sgi; 
    
    #replace images with [image]
    $html=~s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[img".((defined $1)?":\"$1\"":"")."]"/sgei;
    
    #temporarily encode the tags we're keeping into {{{tag}}} instead of <tag>
    for($j=0;$j<@keepers;$j++){
        my $tag = $keepers[$j];
        $html =~ s/<($tag[^>]?)>/{{{$1}}}/sgi;
        $html =~ s/<\/($tag[^>]?)>/{{{\/$1}}}/sgi;
    }
    
    #remove any remaining html
    $html =~ s/<[^>]*>//sgi;
    
    # re-encode the temporarily encoded tags
    $html =~ s/\{{3}/</sgi;
    $html =~ s/\}{3}/>/sgi;
    
    #tighten up the code
    $html =~ s/\s+/ /g;
    
    #write out the file
    print "Writing out the new $pagename.html file...<BR>\n";
    open (PAGEOUTPUT, ">/www/db/mysite/mydirectory/$pagename.html") || die "WTF? $!";
    print PAGEOUTPUT "<HTML><HEAD>\n<TITLE>$pagename</TITLE>\n</HEAD>\n<BODY>";
    if(-e "/www/db/mysite/mydirectory/$pagename.gif"){
        print PAGEOUTPUT "<CENTER><IMG SRC=\"$pagename.gif\"></CENTER><BR>";
    }
    print PAGEOUTPUT "<H1>$pagetitle</H1><BR>";
    print PAGEOUTPUT $html;
    print PAGEOUTPUT "</BODY></HTML>";
    close (PAGEOUTPUT);
    print "Finished processing $pagename...<BR><HR><BR>\n";
}
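
One way to narrow down where the string stops changing, offered only as a sketch: in Perl, s/// returns the number of substitutions it made (or a false value when nothing matched), so each stage can report whether it actually did anything. The helper name run_stages, the stage names and the sample input are hypothetical and not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: run named substitution stages in order and record
# how many times each one matched, plus the string length afterwards.
sub run_stages {
    my ($html, @stages) = @_;
    my @report;
    while (my ($name, $code) = splice(@stages, 0, 2)) {
        my $count = $code->(\$html);   # each stage edits the string in place
        push @report, sprintf('%-8s matched %d time(s), length now %d',
                              $name, $count, length $html);
    }
    return ($html, @report);
}

my ($out, @report) = run_stages(
    "<p>Hello</p><script>alert(1)</script>",
    # s/// yields a false value when nothing matched, so fall back to 0.
    script => sub { my $ref = shift; ($$ref =~ s/<script\b[^>]*>.*?<\/script>//sgi) || 0 },
    tags   => sub { my $ref = shift; ($$ref =~ s/<[^>]*>//g) || 0 },
);
print "$_\n" for @report;
print "result: $out\n";
```

A stage that reports zero matches on a problem page is a candidate for the pattern that is not matching that page's markup.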

In reply to Harvesting and Parsing HTML from other sites by hostile17
