Hello wise Monks,
I need to extract a bunch of data out of my Facebook archive, and am looking for advice on which module I should be using to do so. I haven't dealt with HTML in years, and I've never dealt with XML. I just need to extract data within certain "class"es, regardless of the tag.
The header of the file looks like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...and here's a snip of the body:
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
  <div class="msgbody"> Message body here. </div>
</div>
I could write something with regex and other trickery to pull the data I need, but I know there are people who have already invented that wheel. I've taken a look at a few XML/HTML parsers, but with all the options, I'm unsure which one would suit my basic extraction needs.
Can I get some feedback on which modules will help with this and have an easy-to-use interface (as this is pretty much a one-off)?
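For illustration only, the hand-rolled regex route mentioned above might look roughly like the sketch below. It uses core Perl only. The helper name text_by_class and the inlined sample are assumptions made for the example, and it only copes with elements that contain plain text and no nested tags, which is exactly why a real parser is preferable.

```perl
use strict;
use warnings;

# Sample markup inlined for illustration (copied from the snippet above);
# a real archive would be read from a file instead.
my $html = <<'HTML';
<div class="message reply">
  <span class="profile fn">Person Name</span>
  <abbr class="time published" title="2012-03-14T21:37:16+0000">March 14, 2012 at 3:37 pm</span>
  <div class="msgbody"> Message body here. </div>
</div>
HTML

# Hypothetical helper: return the trimmed text of every element whose class
# attribute contains $class as a whole word, whatever the tag name is.
# Fragile by design: it assumes double-quoted attributes and elements that
# hold only text (no nested tags).
sub text_by_class {
    my ($markup, $class) = @_;
    my @found;
    while ($markup =~ m{<\w+\b[^>]*\bclass="[^"]*\b\Q$class\E\b[^"]*"[^>]*>([^<]*)<}g) {
        my $text = $1;
        $text =~ s/^\s+|\s+$//g;    # strip leading and trailing whitespace
        push @found, $text;
    }
    return @found;
}

my ($name) = text_by_class($html, 'profile');
my ($when) = text_by_class($html, 'published');
my ($body) = text_by_class($html, 'msgbody');

print "$name | $when | $body\n";
```

This pulls the three fields out of the sample regardless of tag name, but anything nested, single-quoted, or split across attributes in an unexpected way would break it.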
Thanks,
-stevieb