stevieb has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise Monks,

I need to extract a bunch of data out of my Facebook archive, and am looking for advice on which module I should be using to do so. I haven't dealt with HTML in years, and I've never dealt with XML. I just need to extract data within certain "class"es, regardless of the tag.

The header of the file looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

...and here's a snip of the body:

<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
  <div class="msgbody"> Message body here. </div>
</div>

I could write something with regex and other trickery to pull the data I need, but I know there are people who have already invented that wheel. I've taken a look at a few XML/HTML parsers, but with all the options I'm unsure which one would suit my basic extraction needs.

Can I get some feedback on which modules will help with this, with an easy to use interface (as this is pretty much a one-off)?

Thanks,

-stevieb

Replies are listed 'Best First'.
Re: Recommendation on a module for HTML/XML extraction.
by GrandFather (Saint) on Aug 16, 2015 at 10:50 UTC

    HTML::TreeBuilder. I've not used it for a few years, but it did sterling work last time I did use it.

    Premature optimization is the root of all job security
      I've been using HTML::TreeBuilder for years to read music listings from the local weekly for me. Every few years they change their format and it usually only takes a few minutes to modify the code to get the data I want again. I used HTML::TokeParser before that, which was a slightly easier module to get started with but not as easy to maintain as the source data format changed.
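      For pulling text out by class regardless of tag, a minimal sketch with plain HTML::TreeBuilder might look like the following. It relies on `look_down` accepting a regex as an attribute criterion; the helper name `text_by_class` and the sample markup are made up for illustration and are not from the original posts.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every element carrying the
# given class token, whatever its tag name.
sub text_by_class {
    my ($html, $class) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # A regex criterion lets us match one token inside a
    # multi-valued class attribute such as "profile fn".
    my @found = $tree->look_down( class => qr/(?:^|\s)\Q$class\E(?:\s|$)/ );
    my @text  = map { $_->as_text } @found;

    $tree->delete;    # release the tree explicitly
    return @text;
}

# Illustrative sample, modelled on the snippet in the question.
my $sample = q|<div class="message reply">
  <span class="profile fn">Person Name</span>
  <div class="msgbody">Message body here.</div>
</div>|;

print "$_\n" for text_by_class($sample, 'msgbody');
```

      Matching on a single token rather than the whole attribute string means the lookup should keep working if the export adds or reorders classes.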
Re: Recommendation on a module for HTML/XML extraction.
by tangent (Parson) on Aug 16, 2015 at 13:03 UTC
    I just need to extract data within certain "class"es, regardless of the tag.
    Have a look at HTML::TreeBuilder::XPath - once you get to know XPath you'll never look back. This should work for your sample data (slightly modified):

    use HTML::TreeBuilder::XPath;

    my $html = q|<div class="message reply">
      <span class="profile fn">Person Name</span>
      <span class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
      <abbr class="time published" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
      <div class="msgbody">Message body here.</div>
    </div>|;

    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
    my @nodes = $tree->findnodes('//*[@class="time published"]');
    for my $node ( @nodes ) {
        print $node->attr('title'), "\n";
        print $node->as_text, "\n";
    }

    Output:
    2012-03-14T21:37:16+0000
    March 14, 2012 at 3:37 pm
    2013-03-17T21:37:16+0000
    March 17, 2013 at 3:37 pm
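    One caveat with `@class="time published"`: it compares the whole attribute string, so an element written as `class="published time"`, or one with an extra class, would be missed. A common XPath 1.0 idiom for matching a single class token is sketched below. The helper `class_xpath` and the second, reordered `abbr` in the sample are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: build an XPath that matches any element whose
# class list contains $token as a whole word.
sub class_xpath {
    my ($token) = @_;
    return qq{//*[contains(concat(" ", normalize-space(\@class), " "), " $token ")]};
}

# Illustrative sample; the second abbr has its classes reordered on purpose.
my $html = q|<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</abbr>
  <abbr class="published time" title="2013-03-17T21:37:16+0000">March 17, 2013 at 3:37 pm</abbr>
</div>|;

my $tree   = HTML::TreeBuilder::XPath->new_from_content($html);
my @titles = map { $_->attr('title') } $tree->findnodes( class_xpath('published') );
$tree->delete;

print "$_\n" for @titles;
```

    Padding the attribute with spaces on both sides is what stops `published` from also matching a longer token such as `unpublished`.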
Re: Recommendation on a module for HTML/XML extraction.
by 1nickt (Canon) on Aug 16, 2015 at 01:00 UTC

    Decent overview here ...

    The way forward always starts with a minimal test.
Re: Recommendation on a module for HTML/XML extraction.
by afoken (Chancellor) on Aug 16, 2015 at 16:24 UTC

    Despite the name, XML::LibXML can also handle HTML. tangent++ proposed to use XPath, that's also supported by XML::LibXML.
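
    As a rough sketch of that approach, the following parses a sample with XML::LibXML's HTML parser and runs XPath queries against it. The sample markup and the hash keys are illustrative assumptions; `recover` and `suppress_errors` are there because real-world exports are often not quite well-formed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative sample, modelled on the snippet in the question.
my $html = q|<html><body><div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</abbr>
  <div class="msgbody">Message body here.</div>
</div></body></html>|;

# recover => 1 lets libxml2 carry on past malformed markup;
# suppress_errors keeps the parser's complaints off STDERR.
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Collect one hash per message block.
my @rows;
for my $msg ( $doc->findnodes('//*[@class="message reply"]') ) {
    push @rows, {
        person => $msg->findvalue('.//*[@class="profile fn"]'),
        time   => $msg->findvalue('.//*[@class="time published"]/@title'),
        body   => $msg->findvalue('normalize-space(.//*[@class="msgbody"])'),
    };
}

print "$_->{person} :: $_->{time} :: $_->{body}\n" for @rows;
```

    The relative paths start with `.//` so each lookup stays inside the current message block instead of searching the whole document again.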

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Recommendation on a module for HTML/XML extraction.
by Your Mother (Archbishop) on Aug 16, 2015 at 16:13 UTC

    Before you go too far down the road you're on, consider researching the Graph API a bit to see if it can do what you want. If it can, it will be much easier and more accurate. I think. Here is a starting point: Search messages FQL on SO (FQL is deprecated but will apparently still work as of now) and Facebook+Graph on metacpan. This is interesting to me so I might play with the problem but I'm also pretty booked up so if I don't come back with some snippets tonight, :P

    Update: this path is probably not any easier since it looks like you have to wrap anything you want to do in "an app." Not impossible or super hard but a pain if you already have the HTML.

Re: Recommendation on a module for HTML/XML extraction.
by stevieb (Canon) on Aug 16, 2015 at 18:46 UTC

    Just to start getting familiar with HTML::TreeBuilder::XPath, I took some time and came up with the following which seems to work on at least a single chunk of the file. I'll keep playing later, but it looks promising.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use Data::Dumper;
    use HTML::TreeBuilder::XPath;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file('txt.html');

    my @nodes = $tree->findnodes('//*[@class="message reply"]');

    for (@nodes){
        my $person = $_->findvalue('span[@class="profile fn"]');
        my $time = $_->findvalue('abbr[@class="time published"]/@title');
        my $msg = $_->findvalue('abbr[@class="time published"]/div[@class="msgbody"]');
        print "$person :: $time :: $msg\n";
    }

    I truly appreciate all the feedback. Once I get something usable, I will likely look deeper into the suggestions by Your Mother.

Re: Recommendation on a module for HTML/XML extraction.
by stevieb (Canon) on Aug 16, 2015 at 15:06 UTC

    Thanks for the great feedback all, I'll review them this morning.

    The purpose of this task is to extract the messages between my girlfriend and me for immigration purposes... I need to show how long we were in communication to prove we have a real relationship prior to her visiting me in Canada for the first time (she's from TX). This, along with phone bills and both of our passport stamps/travel docs/itineraries of our travel back and forth, will suit the requirement I believe.
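
    Once the `title` timestamps are extracted, working out how long the conversation spans could be sketched with core Time::Piece as below. The helpers `to_epoch` and `span` and the two sample timestamps are hypothetical. The sketch assumes every timestamp ends in `+0000`, as in the sample, and dies on anything else instead of guessing.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: turn 2012-03-14T21:37:16+0000 into epoch seconds.
# Assumption: every offset is +0000, so it is dropped, not interpreted.
sub to_epoch {
    my ($stamp) = @_;
    $stamp =~ s/\+0000\z// or die "unexpected offset in '$stamp'";
    return Time::Piece->strptime( $stamp, '%Y-%m-%dT%H:%M:%S' )->epoch;
}

# Hypothetical helper: return (earliest, latest, whole days between them).
sub span {
    my @sorted = sort { $a->[1] <=> $b->[1] } map { [ $_, to_epoch($_) ] } @_;
    my ( $first, $last ) = @sorted[ 0, -1 ];
    return ( $first->[0], $last->[0], int( ( $last->[1] - $first->[1] ) / 86_400 ) );
}

# Illustrative timestamps, deliberately out of order.
my @stamps = ( '2013-03-17T21:37:16+0000', '2012-03-14T21:37:16+0000' );
my ( $from, $to, $days ) = span(@stamps);

print "First: $from\nLast: $to\nDays: $days\n";
```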

    I don't know if this'll be useful to me or anyone else in the future, but I've decided to take a bit of extra time to build it into a module in case there is a need in the future for it.

    -stevieb