Untemplating

tlhf has asked for the wisdom of the Perl Monks concerning the following question:

When searching the sites listed by Google, it seemed that the world and his dog wanted to explain templating to me: how to get data from a dataset into some sort of styled HTML. Well, I need to do the opposite: use a template, some HTML pages, and some magic to get a nice, clean dataset.

Ok. So at first I was gonna attack the problem with some regexps. But, considering the number of separate sets of data to untemplate, this is an extremely unattractive prospect.

I have a number of HTML pages for each day. Each day has one or more contributions. Also, the contributions have an optional title.

E.g., a contribution like this may appear a few times in a page:

<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b>
<br>
Some Contribution
<p>
</td>
</tr>

Unfortunately, the page HTML isn't always hunky-dory; it tends towards the non-standard kind, which I think sidelines most of the HTML modules. Luckily though, the contributions are all written in the same manner.

Can anyone help? Is there a module already written for something like this? If not, where would I start? Just quotemeta() the template and do some sort of match? But there's more than one match per page. I'm finding myself out of my league here...

tlhf
xxx

Re: Untemplating by Chmrr (Vicar) on Jul 16, 2002 at 03:39 UTC
By far the most common solution to this is to use one of the HTML modules. Yes, you say that the HTML is "non-standard" -- but, truth be told, most HTML out there is, and the HTML-parsing modules know that, and are perfectly able to cope. If they were only able to deal with perfectly well-formed HTML, they'd be called XML-parsing, not HTML-parsing. :)

My personal favorite tool for extracting data from web pages is HTML::TreeBuilder -- in your case, it would be a simple matter of asking for all <td> elements, and grabbing the various pieces of data out of them. You may find the dump method particularly useful in examining what the parser makes of your HTML.

perl -pe '"I lo`+$^X$\"$]!$/"=~m%(.)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'
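For illustration, here is a minimal sketch of the HTML::TreeBuilder approach described above. It is not taken from the thread: the sample page, the helper name extract_contributions, and the assumption that each cell holds an optional title in <b>, then a date in <b>, then the body after the first <br> are all hypothetical, based only on the snippet quoted in the question.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample page modelled on the markup quoted in the question (assumed layout).
# The second row has no title, to exercise the "optional title" case.
my $html = <<'HTML';
<table>
<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b>
<br>
Some Contribution
<p>
</td>
</tr>
<tr><td><b>13/3/2002 09:05</b>
<br>
Another Contribution
<p>
</td>
</tr>
</table>
HTML

# Hypothetical helper: returns one hash ref per contribution cell.
sub extract_contributions {
    my ($page) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($page);
    my @found;
    for my $td ($tree->look_down(_tag => 'td')) {
        # Direct <b> children only; assumed order is [title,] date.
        my @bold = grep { ref $_ && $_->tag eq 'b' } $td->content_list;
        next unless @bold;
        my $date  = pop(@bold)->as_text;
        my $title = @bold ? $bold[0]->as_text : undef;

        # Body = plain text nodes that follow the first <br> in the cell.
        my ($seen_br, $body) = (0, '');
        for my $node ($td->content_list) {
            if (ref $node) {
                $seen_br = 1 if $node->tag eq 'br';
                next;
            }
            $body .= $node if $seen_br;
        }
        $body =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        push @found, { title => $title, date => $date, body => $body };
    }
    $tree->delete;    # release the parse tree
    return @found;
}

for my $c (extract_contributions($html)) {
    printf "%s | %s | %s\n",
        $c->{title} // '(untitled)', $c->{date}, $c->{body};
}
```

If the real pages nest things differently, calling $tree->dump before writing the extraction loop should show how the parser actually arranged the elements.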
Re: Untemplating by grantm (Parson) on Jul 16, 2002 at 08:25 UTC
Another name for this type of activity is 'screen scraping'. One approach that matts advocates for screen scraping HTML is to use XPath. The first issue you'll need to address is that your HTML is probably not well-formed XML. Two approaches that spring to mind are:

- pipe the HTML through HTML Tidy to convert it to XHTML
- process it using XML::LibXML which can read HTML directly

Then you can 'zero in' on a part of the page using an XPath expression like this: `/html/body/table/tr/td[./b]` which would match all td 'nodes' which contain a 'b' tag and occur in a 'tr' in a 'table' in the 'body' of the 'html' document. Once you have selected nodes in this way, you can use XPath to dissect them further, or dump them back out to an XML string (including all child nodes) and do regex matches against that.

See also, the XPath tutorial at zvon.org
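As a rough sketch of the second approach (XML::LibXML reading the HTML directly), the following applies the XPath expression from the reply. The sample page and the helper name contribution_cells are made up for illustration, and it assumes the table sits directly under <body> as that expression requires.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page; only the first row contains <b> tags.
my $html = <<'HTML';
<html><body><table>
<tr><td><b>A Title</b> - <b>12/3/2002 23:11</b>
<br>
Some Contribution
<p>
</td>
</tr>
<tr><td>No bold text here</td></tr>
</table></body></html>
HTML

# Hypothetical helper: parse tolerant HTML and select the matching cells.
sub contribution_cells {
    my ($page) = @_;
    my $doc = XML::LibXML->load_html(
        string  => $page,
        recover => 2,    # keep going on malformed markup, without warnings
    );
    # The XPath expression quoted in the reply above.
    return $doc->findnodes('/html/body/table/tr/td[./b]');
}

for my $td (contribution_cells($html)) {
    # Dissect the selected node further with relative XPath expressions.
    my $first = $td->findvalue('b[1]');
    my $last  = $td->findvalue('b[last()]');
    print "first <b>: $first\n";
    print "last <b>:  $last\n";
    # Or dump the node, children included, for regex matching.
    print $td->toString, "\n";
}
```

If the tables in the real pages are nested more deeply, a looser expression such as `//td[b]` may be a more forgiving starting point.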

