stevieb has asked for the wisdom of the Perl Monks concerning the following question:
Hello wise Monks,
I need to extract a bunch of data out of my Facebook archive, and am looking for advice on which module I should be using to do so. I haven't dealt with HTML in years, and I've never dealt with XML. I just need to extract data within certain "class"es, regardless of the tag.
The header of the file looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...and here's a snip of the body:
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
  <div class="msgbody"> Message body here. </div>
</div>
I could write something with regex and other trickery to pull the data I need, but I know there are people who have already invented that wheel. I've taken a look at a few XML/HTML parsers, but with all the options, I'm unsure which one would suit my basic extraction needs.
Can I get some feedback on which modules will help with this and have an easy-to-use interface (as this is pretty much a one-off)?
Thanks,
-stevieb
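For illustration only, here is a minimal sketch of the kind of class-based extraction described above. It assumes HTML::TreeBuilder (from the HTML-Tree distribution on CPAN) is installed; that module choice is an assumption made for this example, not a recommendation taken from the thread. The helper `first_with_class` and the hash keys are hypothetical names. The sample markup is modeled on the snippet in the post, except that the `<abbr>` is closed with `</abbr>` here.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modeled on the snippet in the post (assumption: the real
# archive closes <abbr> with </abbr>).
my $html = <<'HTML';
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</abbr>
  <div class="msgbody"> Message body here. </div>
</div>
HTML

# Hypothetical helper: return the first element at or below $node whose
# class attribute contains $class as a whole word, whatever its tag is.
sub first_with_class {
    my ($node, $class) = @_;
    my ($first) = $node->look_down( class => qr/(?:^|\s)\Q$class\E(?:\s|$)/ );
    return $first;
}

my $tree = HTML::TreeBuilder->new_from_content($html);

my @messages;
# In list context, look_down returns every matching element.
for my $msg ( $tree->look_down( class => qr/(?:^|\s)message(?:\s|$)/ ) ) {
    my $name = first_with_class( $msg, 'fn' );
    my $time = first_with_class( $msg, 'published' );
    my $body = first_with_class( $msg, 'msgbody' );
    push @messages, {
        name => $name ? $name->as_trimmed_text : '',
        time => $time ? $time->attr('title')   : '',   # machine-readable timestamp
        body => $body ? $body->as_trimmed_text : '',
    };
}

printf "%s [%s]: %s\n", @{$_}{qw(name time body)} for @messages;

# HTML::TreeBuilder trees should be freed explicitly when done.
$tree->delete;
```

Matching on the class attribute with a regex rather than on the tag name keeps the lookup independent of the tag, which is the requirement stated in the post. Since the file declares itself XHTML, an XML parser with XPath support might also work, provided the markup is actually well-formed.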
Re: Recommendation on a module for HTML/XML extraction.
by GrandFather (Saint) on Aug 16, 2015 at 10:50 UTC
by bitingduck (Deacon) on Aug 16, 2015 at 18:40 UTC
Re: Recommendation on a module for HTML/XML extraction.
by tangent (Parson) on Aug 16, 2015 at 13:03 UTC
Re: Recommendation on a module for HTML/XML extraction.
by 1nickt (Canon) on Aug 16, 2015 at 01:00 UTC
Re: Recommendation on a module for HTML/XML extraction.
by afoken (Chancellor) on Aug 16, 2015 at 16:24 UTC
Re: Recommendation on a module for HTML/XML extraction.
by Your Mother (Archbishop) on Aug 16, 2015 at 16:13 UTC
Re: Recommendation on a module for HTML/XML extraction.
by stevieb (Canon) on Aug 16, 2015 at 18:46 UTC
Re: Recommendation on a module for HTML/XML extraction.
by stevieb (Canon) on Aug 16, 2015 at 15:06 UTC