in reply to Harvesting and Parsing HTML from other sites

Parsing HTML using regular expressions is generally a very bad idea. Sooner or later you will come across markup that breaks your regular expressions.

You are far better off using a real HTML parser. There is an HTML::Parser module on the CPAN, and you'd be better off using that or one of its subclasses. It sounds to me as if HTML::TreeBuilder might be just what you need in this instance.
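
To make that concrete, here is a minimal sketch of pulling the page title out with HTML::TreeBuilder rather than a regex. The sample markup and the variable names are made up for illustration, and it assumes HTML::TreeBuilder is installed from the CPAN; treat it as a starting point rather than a drop-in fix.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document; in practice $html would hold the fetched page.
my $html = <<'HTML';
<html><head><title>First title</title></head>
<body><p>Some body text.</p></body></html>
HTML

# Parse the whole document into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element (or undef).
my $title_elem = $tree->look_down(_tag => 'title');
my $pagetitle  = $title_elem ? $title_elem->as_text : '';

print "$pagetitle\n";

# Free the tree explicitly once we are done with it.
$tree->delete;
```

Because the parser works on elements rather than on raw text, the case of the tag names and any line breaks inside the title no longer matter to the calling code.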

--
<http://www.dave.org.uk>

"Perl makes the fun jobs fun
and the boring jobs bearable" - me

Re: Re: Harvesting and Parsing HTML from other sites
by hostile17 (Novice) on Mar 28, 2001 at 14:15 UTC
    OK two things, thanks very much for your input, it's really appreciated.

    marius, I'm not sure that any of my regexes can get by without the /s modifier though, so why did you say that? It's almost my one concession to the absolute vagueness of HTML practice that I use it. You're right about keeperlength, I failed to cut and paste that from the actual code, and yes, you're right about a hash being better. Just habit on my part, that "by twos" thing. Thanks.

    davorg, yes, of course you're right. I think there's a psychological reason why people like me want to do it the "hard" way rather than using a module, but I've learnt something from this. I want to put on record my complete stupidity though, which will chime nicely with the "use a module, dummy" refrain.

    My variable $html didn't revert at all. What happened is that I had an HTML file in which someone had foolishly put two titles, really far apart, so that when I did the

    ($pagetitle) = $html =~ /<TITLE>(.*)<\/TITLE>/sgi;

    thing, it actually pulled out nearly the whole document!

    The final output, as you can see, consisted of the title, then the document, but the title, due to the somewhat random HTML, was the document.

    All I can do is apologise for wasting your time and try to get more sleep and be more sensible in future. And use

    ($pagetitle) = $html =~ /<TITLE>(.*?)<\/TITLE>/sgi;

    instead... Your humble servant h17
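
For anyone who wants to see the difference between the two patterns, here is a small sketch using only core Perl. The sample document is invented for illustration (two TITLE elements far apart, as described above), and the /g flag is left off because only the first match is wanted.

```perl
use strict;
use warnings;

# Hypothetical sample document with two TITLE elements far apart.
my $html = <<'HTML';
<HTML><HEAD><TITLE>First title</TITLE></HEAD>
<BODY>
<P>Lots of body text here.</P>
<TITLE>Stray second title</TITLE>
</BODY></HTML>
HTML

# Greedy: .* grabs as much as it can, so the match runs on to the LAST </TITLE>.
my ($greedy) = $html =~ /<TITLE>(.*)<\/TITLE>/si;

# Non-greedy: .*? grabs as little as it can, so the match stops at the FIRST </TITLE>.
my ($lazy) = $html =~ /<TITLE>(.*?)<\/TITLE>/si;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture swallows everything between the first opening tag and the last closing tag, which is how "the title" ended up being most of the document.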

      hostile17,
      I mentioned the /s modifier due to this, from the perlre page:

      s Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which it normally would not match.
      But in re-reading that and your code, I found I was mistaken and you do need it in case your tags span multiple lines. Doh! Ahh well, glad you caught the problem otherwise =]
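
A quick sketch of what /s changes, using only core Perl. The input string is a made-up example of a title that wraps onto a second line.

```perl
use strict;
use warnings;

# Hypothetical input: the title text wraps onto a second line.
my $html = "<TITLE>Harvesting\nand Parsing</TITLE>";

# Without /s, "." will not match the newline, so the pattern fails
# and the capture variable stays undefined.
my ($without_s) = $html =~ /<TITLE>(.*?)<\/TITLE>/i;

# With /s, "." matches the newline too, so the whole title is captured.
my ($with_s) = $html =~ /<TITLE>(.*?)<\/TITLE>/si;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: matched\n"    : "with /s: no match\n";
```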

      -marius

      Edit: chipmunk 2001-03-30