Check CPAN. Look for HTML::Parser and HTML::TokeParser and all of the modules that are built from them. There are specialized modules for reading info from tables, for extracting URLs, etc.
I know you asked for a "simple way", but you'll have to learn something at some point :-)
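For example, extracting every link from a saved page with HTML::TokeParser might look something like this (an untested sketch; "page.html" is just a placeholder filename):

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new('page.html')
        or die "Can't open page.html: $!";

    # get_tag returns [ $tag, \%attr, \@attrseq, $text ] for start tags
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        print "$href\n" if defined $href;
    }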
Just want to add that if the data is presented in HTML tables, you may find HTML::TableExtract very useful.
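Something along these lines (a rough sketch; the filename and column headings are made up for illustration):

    use strict;
    use warnings;
    use HTML::TableExtract;

    # match the table by its column headings, then dump its rows
    my $te = HTML::TableExtract->new( headers => [ 'Date', 'Amount' ] );
    $te->parse_file('report.html');

    for my $table ($te->tables) {
        for my $row ($table->rows) {
            print join("\t", map { defined $_ ? $_ : '' } @$row), "\n";
        }
    }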
This is just kinda funny ... my current project (which is starting to take too long) is to grab a bunch of data from an email (Lotus Notes) and try to figure out from there a set of rules by which I can replicate a table. So I copied the email into a Notes database on a Domino server, connected to it via a browser, saved the HTML, and then started using HTML::Parser to extract the data (I wish there were an easier way, but Notes is on a Windows box, and, while there is perl on that box, I prefer my Linux environment, and I didn't want to start learning OLE for this ;-}). Then I remembered something I had seen months ago on PM, and installed HTML::TableExtract. What a difference that module made.
Since I just need stuff from a couple of tables, this made it quite trivial. Now I'm massaging it this way and that, and have found a number of inconsistencies in the table because of it.
In the past, I've done something very similar - pulled data out of webpages (again from Domino servers) with HTML::Parser, and loaded it all into a DB2 database. Had I known about HTML::TableExtract at that time, I would probably have saved about 4 hours of work. And it would have been much less fragile.
I can second the use of HTML::TokeParser. I wrote a newspaper crawler to pull stories off of newspaper websites a long time ago using regexes - not the way to go. I rewrote the whole thing using HTML::TokeParser and it made an enormous difference.
In its first incarnation I had it down to 9 rules for parsing 24 papers. Using HTML::TokeParser I was able to parse about 92 sites using 3 rules. Definitely the way to go. :)
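For the curious, a single rule in that style can be as small as this (a hypothetical sketch; the tag, class name, and filename are invented, and real sites will differ):

    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new('story.html')
        or die "Can't open story.html: $!";

    # grab the text of every <h2 class="headline">
    while (my $tag = $p->get_tag('h2')) {
        next unless ($tag->[1]{class} || '') eq 'headline';
        print $p->get_trimmed_text('/h2'), "\n";
    }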
Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Another suggestion would be XML::LibXML. It has a mode for parsing HTML; once you have the document parsed, you can go at it with XPath and do all the other things that can be done with XML. It won’t cope well if your HTML input is more than moderately broken tagsoup, however; personally, I’d run such input through HTML::Tidy first so I could stick with XML::LibXML, but you may prefer HTML::Parser or one of its derived modules instead, in which case HTML::TokeParser::Simple is probably your best bet.
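With a reasonably recent XML::LibXML, the HTML-plus-XPath approach looks something like this (untested; the filename is a placeholder):

    use strict;
    use warnings;
    use XML::LibXML;

    # recover => 1 lets libxml press on through mildly broken markup
    my $dom = XML::LibXML->load_html(
        location => 'page.html',
        recover  => 1,
    );

    # e.g. every link URL in the document
    print $_->value, "\n" for $dom->findnodes('//a/@href');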
Makeshifts last the longest.