in reply to Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ?
There is nothing beyond structure in HTML. Whatever you might consider "semantics" will differ between websites and even between pages of the same site. For example, this site uses lots of tables for layout, whereas other pages might use tables to present data. There are other views of this site that deliver the same information as XML, but even there, the meat of the information is contained within a single XML tag (doctext, I guess).
You haven't provided a link and I don't know what ARC2 is, but I guess it is a dictionary of RDF triples, which may or may not be applicable to this task.
You will have to parse a lot of HTML pages and extract information from them. I would look at XML::LibXML, as it has convenient XPath support.
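As a rough sketch of what that could look like: the snippet below parses an HTML string with XML::LibXML and pulls the rows out of one table via XPath. The sample markup, the `class="data"` attribute and the column layout are made up for illustration; you would have to work out the right XPath expression for each site, or even each page, yourself.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page. In practice the HTML would come from a file
# or an HTTP fetch; the class names here are assumptions for the example.
my $html = <<'HTML';
<html><body>
<table class="layout"><tr><td>navigation</td></tr></table>
<table class="data">
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
</body></html>
HTML

# recover => 2 tells the parser to tolerate broken markup without warnings,
# which real-world HTML usually needs.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# Select only rows of the data table that contain <td> cells, which skips
# the header row and ignores the layout table entirely.
my @rows;
for my $tr ( $doc->findnodes('//table[@class="data"]//tr[td]') ) {
    my @cells = map { $_->textContent } $tr->findnodes('./td');
    push @rows, \@cells;
}

print join( "\t", @$_ ), "\n" for @rows;
```

The point is that the "meaning" lives entirely in the XPath expression you choose, not in the HTML itself, so expect to maintain one such expression per kind of page.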