Hi. I have 300 HTML pages in various states of HTML compliance. I want to strip out all the header and footer junk and keep the middle of each document, crappy HTML and all.

Documents look something like this:

<html>
--stuff--
<head>
--more stuff--
</head>
<body>
--still more stuff--
<div class="myBody">
--all the stuff I want, which might include div tags, too--
</div>
--yet more stuff--
</body>
</html>

I've tried a few things. I know that XML::XPath and XML::XPath::XMLParser get me to the right place, and I have an XPath expression that seems to work most of the time. The problem is that I want the tags and everything, exactly as they currently are in the file. When I use methods like findvalue() or string_value(), I get just the text without the tags.

I tried HTML::TokeParser::Simple, but I couldn't see how to do this with it. I'm hoping I don't have to write some loop that iterates over all the tags and text and prints them out bit by bit. I just want to say "keep everything from this point in the tree on down...".

Ideally, I want to do this without first fixing the crappy, non-compliant HTML. I have lots of bare <p> tags used to separate paragraphs (instead of <p>foo</p>). I also have lots of <meta ... > tags instead of <meta ... />. These unclosed tags tend to give XML parsers heartburn. I'll preprocess with tidy if I have to.
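For what it's worth, here is a minimal sketch of one way to say "keep everything from this div on down" without repairing the HTML first. It does not parse the document at all: it finds the opening <div class="myBody"> tag, then counts nested <div> and </div> tags until the depth returns to zero, and copies that slice of the file unchanged. extract_div is a hypothetical helper name, not part of any module. The sketch assumes the class attribute holds exactly one value and that no "<div" text appears inside comments or scripts. A parser-based approach would be more robust, but it generally re-serializes the markup rather than copying it as-is.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first <div class="$class"> element,
# copied character-for-character from $html, or undef if the div is
# missing or never closed.
sub extract_div {
    my ($html, $class) = @_;

    # Locate the opening tag of the wanted div. /g leaves pos($html)
    # just past the tag so the loop below continues from there.
    return
        unless $html =~ m{<div\b[^>]*\bclass\s*=\s*(["'])\Q$class\E\1[^>]*>}gi;
    my $start = $-[0];    # offset where the opening tag begins
    my $depth = 1;        # we are inside one div now

    # Visit every later <div ...> or </div> and track the nesting depth.
    while ($html =~ m{<(/?)div\b[^>]*>}gi) {
        $depth += $1 ? -1 : 1;
        return substr($html, $start, pos($html) - $start) if $depth == 0;
    }
    return;               # ran out of input before the div closed
}

# Sample input with unclosed <p> tags, as in the pages described above.
my $page = <<'HTML';
<html><body>
<p>header junk
<div class="myBody">
<p>first paragraph
<div>nested</div>
<p>second paragraph
</div>
<p>footer junk
</body></html>
HTML

print extract_div($page, 'myBody'), "\n";
```

Because only div tags are counted, the unclosed <p> and <meta> tags don't matter here. The trade-off is that anything looking like a div tag inside a comment or a script would throw the count off.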

Update

I got a good enough result by using XML::XPath, XML::XPath::NodeSet, and XML::Parser. The trick seemed to be disentangling XML::Parser and XML::XPath. That is, I needed to create my own parser object and pass it to XML::XPath. The entire script is 200 lines because of the vagaries of my specific input, but here's what I think is the salient bit that worked:

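use XML::Parser;
use XML::XPath;
# Absolute path to the div I want (specific to my pages)
$m::xpath = '/html/body/table/tr/td/div';
# Build the parser myself so I control its options
my $parser = XML::Parser->new(
  'NoLWP'      => 1,
  'NoExpand'   => 1,
  'Namespaces' => 0);
# $inputfile is set elsewhere in the script
my $XP = XML::XPath->new( filename => $inputfile, parser => $parser );
# Returns the matching nodes as a string, tags included
my $body = $XP->findnodes_as_string($m::xpath);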

I ended up cheating because I discovered that the XPath expression above gets me the right div. There was a bit more uniformity on the pages (at least the pages I cared about) than I realised.

Thanks for all the suggestions.


In reply to Extract Portion of HTML by pacohope
