HTML gives you nothing beyond structure. Whatever you might think of as "semantics" will differ between websites, and even between pages of the same site. For example, this site uses lots of tables for layout, where other pages use tables for presenting data. There are other views of this site that expose the same information as XML, but even there the meat of the information sits inside a single XML tag (doctext, I guess).

You haven't provided a link and I don't know what ARC2 is, but I guess it is a dictionary of RDF triples, which may or may not be applicable to this task.

You will have to parse a lot of HTML pages and extract information from them. I would look at XML::LibXML, as it has convenient XPath support.
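
To give an idea of what that looks like, here is a minimal sketch. The sample markup, the class names "layout" and "data", and the helper extract_rows are all made up for illustration; a real page will need its own XPath expressions, worked out by looking at its source.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page: one table used for layout, one for data.
my $html = <<'HTML';
<html><body>
<table class="layout"><tr><td>Navigation</td><td>Sidebar</td></tr></table>
<table class="data">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>apple</td><td>0.50</td></tr>
  <tr><td>pear</td><td>0.30</td></tr>
</table>
</body></html>
HTML

# Return one array ref of cell texts per data row.
sub extract_rows {
    my ($source) = @_;

    # recover => 2 makes libxml2 tolerate sloppy HTML without warnings.
    my $doc = XML::LibXML->load_html( string => $source, recover => 2 );

    my @rows;
    # Only rows of the "data" table that contain td cells (skips the header row).
    for my $tr ( $doc->findnodes('//table[@class="data"]//tr[td]') ) {
        push @rows, [ map { $_->textContent } $tr->findnodes('./td') ];
    }
    return @rows;
}

my @rows = extract_rows($html);
print join( "\t", @$_ ), "\n" for @rows;
```

The XPath expression is where the page-specific knowledge lives: it is what tells the data table apart from the layout table, and that is exactly the part that differs from site to site.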


In reply to Re: Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ? by Corion
in thread Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ? by hz1039
