Recommendation on a module for HTML/XML extraction.

stevieb has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise Monks,

I need to extract a bunch of data out of my Facebook archive, and am looking for advice on which module I should be using to do so. I haven't dealt with HTML in years, and I've never dealt with XML. I just need to extract data within certain "class"es, regardless of the tag.

The header of the file looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

...and here's a snip of the body:

<div class="message reply">

<span class="profile fn">Person Name</span>

<abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
<div class="msgbody">

Message body here.

</div>
</div>

I could write something with regex and other trickery to pull the data I need, but I know there are people who have already invented that wheel. I've taken a look at a few XML/HTML parsers, but with so many options I'm unsure which one would suit my basic extraction needs.

Can I get some feedback on which modules will help with this and have an easy-to-use interface (as this is pretty much a one-off)?

Thanks,

-stevieb


Replies are listed 'Best First'.
Re: Recommendation on a module for HTML/XML extraction. by GrandFather (Saint) on Aug 16, 2015 at 10:50 UTC
HTML::TreeBuilder. I've not used it for a few years, but it did sterling work last time I did use it.

Premature optimization is the root of all job security
Re^2: Recommendation on a module for HTML/XML extraction. by bitingduck (Deacon) on Aug 16, 2015 at 18:40 UTC
I've been using HTML::TreeBuilder for years to read music listings from the local weekly for me. Every few years they change their format, and it usually only takes a few minutes to modify the code to get the data I want again. I used HTML::TokeParser before that, which was a slightly easier module to get started with but not as easy to maintain as the source data format changed.
Re: Recommendation on a module for HTML/XML extraction. by tangent (Parson) on Aug 16, 2015 at 13:03 UTC
> I just need to extract data within certain "class"es, regardless of the tag.

Have a look at HTML::TreeBuilder::XPath - once you get to know XPath you'll never look back. This should work for your sample data (slightly modified):

    use HTML::TreeBuilder::XPath;

    my $html = q|<div class="message reply">
    <span class="profile fn">Person Name</span>
    <span class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
    <abbr class="time published" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
    <div class="msgbody">Message body here.</div>
    </div>|;

    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @nodes = $tree->findnodes('//*[@class="time published"]');
    for my $node ( @nodes ) {
        print $node->attr('title'), "\n";
        print $node->as_text, "\n";
    }

Output:

    2012-03-14T21:37:16+0000
    March 14, 2012 at 3:37 pm
    2013-03-17T21:37:16+0000
    March 17, 2013 at 3:37 pm
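One caveat with the exact comparison `@class="time published"`: it matches only when the attribute is precisely that string, so the same classes in a different order would be missed. A common XPath 1.0 idiom tests for a single class token instead. The following is a minimal sketch, not part of tangent's reply; the helper name `has_class` and the reordered `published time` attribute in the sample are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup modelled on the snippet in the question. The first <abbr>
# lists its classes in reverse order on purpose (an assumption for the demo).
my $html = <<'HTML';
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="published time" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</abbr>
  <abbr class="time published" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
  <div class="msgbody">Message body here.</div>
</div>
HTML

# Hypothetical helper: returns an XPath predicate that is true when the
# class attribute contains $token as a whole, space-separated word.
sub has_class {
    my ($token) = @_;
    return qq{contains(concat(" ", normalize-space(\@class), " "), " $token ")};
}

my $tree   = HTML::TreeBuilder::XPath->new_from_content($html);
my $xpath  = '//*[' . has_class('published') . ']';

# Collect the title attribute of every element carrying the class token,
# whatever its tag name and whatever other classes it has.
my @titles = map { $_->attr('title') } $tree->findnodes($xpath);
print "$_\n" for @titles;

$tree->delete;    # free the parse tree
```

Padding both the attribute value and the token with spaces keeps `published` from matching a longer class name such as `unpublished`.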
Re: Recommendation on a module for HTML/XML extraction. by 1nickt (Canon) on Aug 16, 2015 at 01:00 UTC
Decent overview here ...

The way forward always starts with a minimal test.
Re: Recommendation on a module for HTML/XML extraction. by afoken (Chancellor) on Aug 16, 2015 at 16:24 UTC
Despite the name, XML::LibXML can also handle HTML. tangent++ proposed using XPath; that's also supported by XML::LibXML.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
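For comparison, here is a rough sketch of the same extraction with XML::LibXML. It is not from afoken's post; the sample markup is adapted from the question (with the `<abbr>` closed properly), and the record layout with `person`, `time` and `body` keys is an assumption made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup adapted from the question.
my $html = <<'HTML';
<html><body>
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</abbr>
  <div class="msgbody">
    Message body here.
  </div>
</div>
</body></html>
HTML

# load_html uses libxml2's HTML parser; recover => 2 asks it to repair
# malformed markup without printing warnings.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# One hash per message; each lookup is relative to the message's own <div>.
my @messages;
for my $msg ( $doc->findnodes('//div[@class="message reply"]') ) {
    push @messages, {
        person => $msg->findvalue('.//*[@class="profile fn"]'),
        time   => $msg->findvalue('.//*[@class="time published"]/@title'),
        body   => $msg->findvalue('normalize-space(.//*[@class="msgbody"])'),
    };
}

print "$_->{person} :: $_->{time} :: $_->{body}\n" for @messages;
```

Because the HTML parser is used rather than the XML parser, the elements end up without a namespace, so the XPath expressions need no prefix even though the real file declares the XHTML namespace.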
Re: Recommendation on a module for HTML/XML extraction. by Your Mother (Archbishop) on Aug 16, 2015 at 16:13 UTC
Before you go too far down the road you're on, consider researching the Graph API a bit to see if it can do what you want. If it can, it will be much easier and more accurate. I think. Here is a starting point: Search messages FQL on SO (FQL is deprecated but will apparently still work as of now) and Facebook+Graph on metacpan. This is interesting to me so I might play with the problem but I'm also pretty booked up so if I don't come back with some snippets tonight, :P

Update: this path is probably not any easier since it looks like you have to wrap anything you want to do in "an app." Not impossible or super hard but a pain if you already have the HTML.
Re: Recommendation on a module for HTML/XML extraction. by stevieb (Canon) on Aug 16, 2015 at 18:46 UTC
Just to start getting familiar with HTML::TreeBuilder::XPath, I took some time and came up with the following, which seems to work on at least a single chunk of the file. I'll keep playing later, but it looks promising.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Data::Dumper;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file('txt.html');

    my @nodes = $tree->findnodes('//*[@class="message reply"]');

    for (@nodes){
        my $person = $_->findvalue('span[@class="profile fn"]');
        my $time = $_->findvalue('abbr[@class="time published"]/@title');
        my $msg = $_->findvalue('abbr[@class="time published"]/div[@class="msgbody"]');
        print "$person :: $time :: $msg\n";
    }

I truly appreciate all the feedback. Once I get something usable, I will likely look deeper into the suggestions by Your Mother.
Re: Recommendation on a module for HTML/XML extraction. by stevieb (Canon) on Aug 16, 2015 at 15:06 UTC
Thanks for the great feedback, all. I'll review the suggestions this morning.

The purpose of this task is to extract the messages between my girlfriend and me for immigration purposes... I need to show how long we were in communication to prove we had a real relationship prior to her visiting me in Canada for the first time (she's from TX). This, along with phone bills and both of our passport stamps/travel docs/itineraries of our travel back and forth, should satisfy the requirement, I believe.

I don't know if this will be useful to me or anyone else in the future, but I've decided to take a bit of extra time to build it into a module in case the need arises.

-stevieb