Harvesting and Parsing HTML from other sites

by hostile17 (Novice)
on Mar 28, 2001 at 05:04 UTC

hostile17 has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to develop a system to hit sites and get the HTML from them, then parse/simplify it, removing HTML tables and replacing images with text (some of you've been very helpful with that already).

The procedure (and I know I should use a module, but I need the making-code experience more than I need the using-module experience right now) is as follows:

  1. get their html with LWP::Simple
  2. strip out chunks I know I can't use like SCRIPT
  3. encode the HTML I want to keep, a few basic tags, into a temporary format -- {{{TAG}}}
  4. remove all the other HTML
  5. put the tags I want to keep back
  6. export the HTML to a file
This works pretty well, most of the time, but on some pages it doesn't work properly... I go through all those lines of $html =~ s/// stuff, and when the output stage happens at the end I get the exact same $html that I had at the $html = join('',@html); line -- which I don't understand at all. Even if I step through it going
$html =~ s/something/something else/sgi; print "your html is currently:\n\n$html";
at each stage, it appears to work! Then I output it, and the variable appears to have reverted to its original state before all the changes. And it only happens with some pages, not all of them. It's obviously complete stupidity on my part, but I'd really appreciate it if you could take a look:
#!/usr/bin/perl -w
use diagnostics;
use CGI::Carp qw(fatalsToBrowser);
use LWP::Simple;

@pages = qw(
    page1 http://www.page1.com/
    page2 http://www.page2.com/
);
@keepers = qw(b blockquote br i li ol p ul);

# proceed through the array of site 2 by 2
# using the name and URL
for ($i = 0; $i < $pagelength; $i += 2) {
    $html     = "";              # initialise variable
    $pagename = $pages[$i];
    $pageurl  = $pages[$i+1];
    print "accessing $pagename at $pageurl...<BR>\n";

    # this is a bit cargo-cult, I got it from someone else's use of LWP::Simple
    $doc  = get($pageurl);
    @html = $doc;
    $html = join('', @html);

    ($pagetitle) = $html =~ /<TITLE>(.*)<\/TITLE>/sgi;
    $html =~ s/<TITLE>(.*)<\/TITLE>//sgi;

    # kill any script blocks
    $html =~ s/<script[^>]*>.*?<\/script>//sgi;

    # kill any style blocks
    $html =~ s/<style[^>]*>.*?<\/style>//sgi;

    # replace images with [image]
    $html =~ s/<IMG[^>]+?(?:ALT="([^"]*)"[^>]*)?>/"[img".((defined $1)?":\"$1\"":"")."]"/sgei;

    # temporarily encode the tags we're keeping into {{{tag}}} instead of <tag>
    for ($j = 0; $j < $keeperlength; $j++) {
        my $tag = $keepers[$j];
        $html =~ s/<($tag[^>]?)>/{{{$1}}}/sgi;
        $html =~ s/<\/($tag[^>]?)>/{{{\/$1}}}/sgi;
    }

    # remove any remaining html
    $html =~ s/<[^>]*>//sgi;

    # re-encode the temporarily encoded tags
    $html =~ s/\{{3}/</sgi;
    $html =~ s/\}{3}/>/sgi;

    # tighten up the code
    $html =~ s/\s+/ /g;

    # write out the file
    print "Writing out the new $pagename.html file...<BR>\n";
    open (PAGEOUTPUT, ">/www/db/mysite/mydirectory/$pagename.html") || die "WTF? $!";
    print PAGEOUTPUT "<HTML><HEAD>\n<TITLE>$pagename</TITLE>\n</HEAD>\n<BODY>";
    if (-e "/www/db/mysite/mydirectory/$pagename.gif") {
        print PAGEOUTPUT "<CENTER><IMG SRC=\"$pagename.gif\"></CENTER><BR>";
    }
    print PAGEOUTPUT "<H1>$pagetitle</H1><BR>";
    print PAGEOUTPUT $html;
    print PAGEOUTPUT "</BODY></HTML>";
    close (PAGEOUTPUT);
    print "Finished processing $pagename...<BR><HR><BR>\n";
}

Replies are listed 'Best First'.
Re: Harvesting and Parsing HTML from other sites
by marius (Hermit) on Mar 28, 2001 at 09:31 UTC
    First, change your @pages array to a hash. Then you can step through this with a:
    foreach $page (keys %pages) { }
    rather than the cumbersome and obfuscated for(){} loop above.
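
    Something along these lines, say (just a sketch; the %pages name and entries stand in for your page1/page2 pairs):

    my %pages = (
        page1 => 'http://www.page1.com/',
        page2 => 'http://www.page2.com/',
    );

    foreach my $pagename (keys %pages) {
        my $pageurl = $pages{$pagename};
        print "accessing $pagename at $pageurl...<BR>\n";
        # ... fetch, strip and write out as before ...
    }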

    Second, a lot of your regexes don't need the /s modifier. See perldoc perlre for info about that.

    Third, use strict.

    And now for code error issues: I don't see where you set $keeperlength before using it in your nested for(){} loop. Incidentally, your changing of <tag> to {{{tag}}} doesn't account for things like <br />. That's a minor nitpick, though. Other than that, I can't see why it would "revert" back to the original $html variable. Wanna fix these things I've pointed out (or point out my flaws in thinking as the case may be =]) and try it, and if it still doesn't work point us to some pages that do and pages that don't work and we'll continue hammering.

    Good luck!

    -marius
Re: Harvesting and Parsing HTML from other sites
by davorg (Chancellor) on Mar 28, 2001 at 13:41 UTC

    Parsing HTML using regular expressions is generally a very bad idea. You will always come across stuff that breaks your regular expressions eventually.

    You are far better off using a real HTML parser. There is an HTML::Parser module on the CPAN and you'd be better off using that or one of its subclasses. It sounds to me as if HTML::TreeBuilder might be just what you need in this instance.
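
    As a rough sketch of what that might look like (untested, and assuming HTML::TreeBuilder is installed; the URL is just a placeholder):

    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $html = get('http://www.page1.com/') or die "couldn't fetch page";

    # build a real parse tree instead of running s/// over the raw text
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # the title is an element, so no greedy-regex worries
    my $title_elem = $tree->look_down(_tag => 'title');
    my $pagetitle  = $title_elem ? $title_elem->as_text : '';

    # scripts and styles can be deleted as whole elements
    foreach my $tag (qw(script style)) {
        $_->delete for $tree->look_down(_tag => $tag);
    }

    my $text = $tree->as_text;   # plain text of what's left
    $tree->delete;               # free the tree when done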

    --
    <http://www.dave.org.uk>

    "Perl makes the fun jobs fun
    and the boring jobs bearable" - me

      OK two things, thanks very much for your input, it's really appreciated.

      marius, I'm not sure that any of my regexes can get by without the /s modifier, though. Why did you say that? It's almost my one concession to the absolute vagueness of HTML practice that I use it. You're right about keeperlength, I failed to cut and paste that from the actual code, and yes, you're right about a hash being better. Just habit on my part, that "by twos" thing. Thanks.

      davorg, yes, of course you're right. I think there's a psychological reason why people like me want to do it the "hard" way rather than using a module, but I've learnt something from this. I want to put on record my complete stupidity though, which will chime nicely with the "use a module, dummy" refrain.

      My variable $html didn't revert at all. What happened is that I had an HTML file in which someone had foolishly put two titles, really far apart, so that when I did the

      ($pagetitle) = $html =~ /<TITLE>(.*)<\/TITLE>/sgi;

      thing, it actually pulled out nearly the whole document!

      The final output, as you can see, consisted of the title, then the document, but the title, due to the somewhat random HTML, was the document.

      All I can do is apologise for wasting your time and try to get more sleep and be more sensible in future. And use

      ($pagetitle) = $html =~ /<TITLE>(.*?)<\/TITLE>/sgi;

      instead... Your humble servant h17
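
      To illustrate, on a made-up fragment with two titles the greedy version grabs everything between the first opening tag and the last closing tag:

      my $html = "<TITLE>First</TITLE> lots of body text <TITLE>Second</TITLE>";

      my ($greedy)     = $html =~ /<TITLE>(.*)<\/TITLE>/si;   # "First</TITLE> lots of body text <TITLE>Second"
      my ($non_greedy) = $html =~ /<TITLE>(.*?)<\/TITLE>/si;  # "First"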

        hostile17,
        I mentioned the /s modifier due to this, from the perlre page:

        s Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.
        But in re-reading that and your code, I found I was mistaken and you do need it in case your tags span multiple lines. Doh! Ahh well, glad you caught the problem otherwise =]
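
        For instance, on a made-up fragment where the title spans a line break:

        my $html = "<TITLE>My\nPage</TITLE>";

        my ($without_s) = $html =~ /<TITLE>(.*)<\/TITLE>/i;   # no match: "." won't cross the newline
        my ($with_s)    = $html =~ /<TITLE>(.*)<\/TITLE>/si;  # captures "My\nPage"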

        -marius

        Edit: chipmunk 2001-03-30
