I've been trying to develop a system to hit sites and get the HTML from them, then parse/simplify it, removing HTML tables and replacing images with text (some of you have been very helpful with that already).

The procedure (and I know I should use a module, but I need the making-code experience more than I need the using-module experience right now) is as follows:

  1. get their html with LWP::Simple
  2. strip out chunks I know I can't use like SCRIPT
  3. encode the HTML I want to keep, a few basic tags, into a temporary format -- {{{TAG}}}
  4. remove all the other HTML
  5. put the tags I want to keep back
  6. export the HTML to a file
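
Steps 3 to 5 can be tried in isolation on a hard-coded string. This is only a minimal sketch: the sub name simplify_html and the sample markup are made up for illustration, and the pattern assumes that kept tags may carry attributes and that a word boundary should follow the tag name.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: keep a few tags, drop the rest.
sub simplify_html {
    my ($html, @keepers) = @_;

    # Step 3: hide the tags we want to keep as {{{tag}}} / {{{/tag}}}.
    # \b keeps "b" from also matching <br> or <blockquote>.
    foreach my $tag (@keepers) {
        $html =~ s/<(\/?$tag\b[^>]*)>/{{{$1}}}/gi;
    }

    # Step 4: remove every tag that is still in angle brackets.
    $html =~ s/<[^>]*>//g;

    # Step 5: turn the placeholders back into real tags.
    $html =~ s/\{\{\{/</g;
    $html =~ s/\}\}\}/>/g;

    return $html;
}

# Sample markup (made up): a table wrapped around a paragraph we want to keep.
my $sample = '<div><p class="x">Hello <b>world</b></p>'
           . '<table><tr><td>cell</td></tr></table></div>';
print simplify_html($sample, qw(b p)), "\n";
```

With b and p as the keepers, the paragraph and bold tags should survive while the div and table markup is dropped. Any literal {{{ or }}} already in the page text would also be turned into angle brackets, which is a limitation of the placeholder approach.
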

This works pretty well most of the time, but on some pages it doesn't work properly. I go through all those lines of $html =~ s/// stuff, and when the output stage happens at the end I get the exact same $html that I had at the $html = join('',@html); line, which I don't understand at all. Even if I step through it, doing

$html =~ s/something/something else/sgi; print "your html is currently:\n\n$html";

at each stage, it appears to work! Then I output it, and the variable appears to have reverted to its original state before all the changes. And it only happens with some pages, not all of them. It's obviously complete stupidity on my part, but I'd really appreciate it if you could take a look:

#!/usr/bin/perl -w
use strict;
use diagnostics;
use CGI::Carp qw(fatalsToBrowser);
use LWP::Simple;

my @pages = qw(
    page1 http://www.page1.com/
    page2 http://www.page2.com/
);
my @keepers = qw(b blockquote br i li ol p ul);

# proceed through the array of sites 2 by 2
# using the name and URL
for (my $i = 0; $i < @pages; $i += 2) {
    my $pagename = $pages[$i];
    my $pageurl  = $pages[$i+1];
    print "accessing $pagename at $pageurl...<BR>\n";

    # this is a bit cargo-cult, I got it from someone else's use of LWP::Simple
    my $doc = get($pageurl);
    unless (defined $doc) {    # get() returns undef if the fetch fails
        print "Couldn't fetch $pageurl, skipping...<BR>\n";
        next;
    }
    my @html = $doc;
    my $html = join('',@html);

    # grab the title (non-greedy), then remove it from the page
    my ($pagetitle) = $html =~ /<TITLE>(.*?)<\/TITLE>/si;
    $pagetitle = $pagename unless defined $pagetitle;
    $html =~ s/<TITLE>.*?<\/TITLE>//sgi;

    # kill any script blocks
    $html =~ s/<script[^>]*>.*?<\/script>//sgi;

    # kill any style blocks
    $html =~ s/<style[^>]*>.*?<\/style>//sgi;

    # replace images with [img] or [img:"alt text"]
    $html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[img".((defined $1)?":\"$1\"":"")."]"/sgei;

    # temporarily encode the tags we're keeping into {{{tag}}} instead of <tag>
    foreach my $tag (@keepers) {
        # \b stops "b" from also matching <br>; [^>]* allows attributes
        $html =~ s/<($tag\b[^>]*)>/{{{$1}}}/gi;
        $html =~ s/<\/($tag\b[^>]*)>/{{{\/$1}}}/gi;
    }

    # remove any remaining html
    $html =~ s/<[^>]*>//g;

    # re-encode the temporarily encoded tags
    $html =~ s/\{{3}/</g;
    $html =~ s/\}{3}/>/g;

    # tighten up the code
    $html =~ s/\s+/ /g;

    # write out the file
    print "Writing out the new $pagename.html file...<BR>\n";
    open(PAGEOUTPUT, ">/www/db/mysite/mydirectory/$pagename.html") || die "WTF? $!";
    print PAGEOUTPUT "<HTML><HEAD>\n<TITLE>$pagename</TITLE>\n</HEAD>\n<BODY>";
    if (-e "/www/db/mysite/mydirectory/$pagename.gif") {
        print PAGEOUTPUT "<CENTER><IMG SRC=\"$pagename.gif\"></CENTER><BR>";
    }
    print PAGEOUTPUT "<H1>$pagetitle</H1><BR>";
    print PAGEOUTPUT $html;
    print PAGEOUTPUT "</BODY></HTML>";
    close(PAGEOUTPUT);
    print "Finished processing $pagename...<BR><HR><BR>\n";
}
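
One way to see where a page stops changing is to have each stage report how many substitutions it made, since s/// returns that count (or a false value when nothing matched). The following is a minimal sketch, not part of the script above: run_stage is a made-up helper, the sample string stands in for a fetched page, and the replacement is treated as a literal string (no $1 support).

```perl
use strict;
use warnings;

# Hypothetical helper: apply one substitution to the string referenced by
# $html_ref, report the result on STDERR, and return the number of
# replacements made (0 if the pattern never matched).
sub run_stage {
    my ($label, $html_ref, $pattern, $replacement) = @_;
    my $count = ($$html_ref =~ s/$pattern/$replacement/g) || 0;
    print STDERR "$label: $count substitution(s), length now ",
                 length($$html_ref), "\n";
    return $count;
}

# Sample input standing in for a fetched page.
my $html = '<script>var x = 1;</script><p>Some <font size="2">text</font></p>';

run_stage('script blocks', \$html, qr/<script[^>]*>.*?<\/script>/si, '');
run_stage('font tags',     \$html, qr/<\/?font[^>]*>/i,              '');
print "$html\n";
```

A stage that reports 0 on a page where it was expected to match is a good place to start looking; passing the string by reference also makes it explicit that every stage works on the same variable.
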

In reply to Harvesting and Parsing HTML from other sites by hostile17
