Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ?

hz1039 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to translate HTML pages into RDF or XML Schema, and not just their structure. For a given HTML page, I would like to extract the semantics and constraints embedded within the page. I expect to divide the work by analyzing forms separately, tables separately, and so forth. As output, RDF(S) or XML Schema that describes the page semantically would be most suitable. Would ARC2 be a good place to start, or can I do this with some trick in Perl? Thanks in advance!

Re: Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ? by Corion (Patriarch) on Dec 07, 2010 at 09:08 UTC
There is nothing beyond structure in HTML. Whatever you might think of as "semantics" will differ between websites and even between pages of the same site. For example, this site uses lots of tables for layout, where other pages might use tables for presenting data. There are other views of this site that present the same information as XML, but even there, the meat of the information is contained within one XML tag (`doctext`, I guess). You haven't provided a link and I don't know what ARC2 is, but I guess it is a dictionary of RDF triples, which may or may not be applicable to this task. Either way, you will have to parse a lot of HTML pages and extract information from them. I would look at XML::LibXML, as it has convenient XPath support.
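As a rough illustration of the XPath route, here is a minimal sketch that pulls the fields and a simple constraint (`maxlength`) out of a form with XML::LibXML. The sample page, the `@fields` record layout, and the choice of attributes are assumptions made up for this example; a real page will need its own XPath expressions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page; a real page would be fetched or read from disk.
my $html = <<'HTML';
<html><body>
<form action="/search" method="get">
  <input type="text" name="q" maxlength="40">
  <input type="submit" name="go" value="Search">
</form>
</body></html>
HTML

# recover => 1 lets libxml2 cope with the sloppy markup common on real pages.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Collect one record per named input, grouped by the form's action.
my @fields;
for my $form ( $doc->findnodes('//form') ) {
    my $action = $form->getAttribute('action') // '';
    for my $input ( $form->findnodes('.//input[@name]') ) {
        push @fields, {
            form      => $action,
            name      => $input->getAttribute('name'),
            type      => $input->getAttribute('type') // 'text',
            maxlength => $input->getAttribute('maxlength'),   # undef if absent
        };
    }
}

for my $f (@fields) {
    printf "%s: %s (%s)%s\n", $f->{form}, $f->{name}, $f->{type},
        defined $f->{maxlength} ? " maxlength=$f->{maxlength}" : '';
}
```

What each record "means" is still up to you; the parser only hands back structure, so any mapping to RDF has to be defined per site.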
Re: Reverse engineering HTML pages to RDF/XML Schema Using ARC2 or Perl ? by chrestomanci (Priest) on Dec 07, 2010 at 09:09 UTC
I think you need to be more specific about what you are trying to do. Can you post a link to a page you are trying to parse, or paste in a short fragment, along with what you are trying to extract? Having said that, if you are trying to parse HTML, then you probably want to use a module such as HTML::TreeBuilder (from CPAN). There have been two threads on this recently: "how to quickly parse 50000 html documents?" and "Parsing HTML files".
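For the HTML::TreeBuilder route, a minimal sketch of reading a data table into rows of cell text might look like the following. The sample table is invented for illustration, and the code assumes the table holds data rather than layout and contains no nested tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment standing in for a real page.
my $html = <<'HTML';
<table>
  <tr><th>Module</th><th>Purpose</th></tr>
  <tr><td>XML::LibXML</td><td>XPath queries</td></tr>
  <tr><td>HTML::TreeBuilder</td><td>Tree parsing</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# One array of trimmed cell texts per <tr>; header and data cells alike.
my @rows;
for my $tr ( $tree->look_down( _tag => 'tr' ) ) {
    my @cells = map { $_->as_trimmed_text }
        $tr->look_down( _tag => qr/^t[dh]$/ );
    push @rows, \@cells;
}

# Free the tree explicitly, as the HTML::TreeBuilder docs recommend.
$tree->delete;

print join( ' | ', @$_ ), "\n" for @rows;
```

Treating the first row as column names is a guess that only holds for some tables, which is why seeing the actual page matters.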