Okay, that seems to work: HTML::TreeBuilder seems to be more forgiving.

However, $tree->dump gives a lot of information; luckily, as_XML_indented looks more readable again.

Now the next part... extracting the right pieces of information with XPath

Some pieces will be quite easy, for example the title. Others will come from traversing a <TABLE>:
in the left column there is a data description, like 'Author', and in the right column the value, like 'Wall, L.' (sometimes wrapped in <a HREF=...>Author Name</a>, which makes it a bit more complicated, because I only want the text).

My guess is to look for a <td> whose text equals "Author" and then do something with its next sibling?
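
A minimal sketch of that idea, assuming HTML::TreeBuilder::XPath is installed and that each label and its value sit in adjacent <td> cells of the same row. The sample markup and the helper name field are made up for illustration; the real page will differ. Since findvalue returns the string value of the matched node, the text inside an <a> comes along without the tag.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page; the real markup will differ.
my $html = <<'HTML';
<html><head><title>Programming Perl</title></head>
<body>
<TABLE>
  <tr><td>Author</td><td><a HREF="/authors/wall">Wall, L.</a></td></tr>
  <tr><td>Year</td><td>1991</td></tr>
</TABLE>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# Hypothetical helper: return the text of the cell that follows
# the cell whose (whitespace-normalized) text equals $label.
# Assumes $label contains no double quotes.
sub field {
    my ($root, $label) = @_;
    my $xpath = qq{//td[normalize-space(.)="$label"]/following-sibling::td[1]};
    my $value = $root->findvalue($xpath);
    $value =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $value;
}

my $title  = $tree->findvalue('//title');
my $author = field($tree, 'Author');
my $year   = field($tree, 'Year');

print "$title\n$author\n$year\n";

# Free the tree explicitly, as recommended for HTML::TreeBuilder objects.
$tree->delete;
```

If the labels turn out to be in <th> cells, or carry a trailing colon, the XPath predicate would need adjusting accordingly.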


In reply to Re^2: extracting data from HTML by Jurassic Monk
in thread extracting data from HTML by Jurassic Monk
