Re: The State of Web spidering in Perl

Re^2: The State of Web spidering in Perl by digital_carver (Sexton) on Sep 22, 2013 at 16:49 UTC
I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like `//div[@id='blah']/p` though, do you explicitly maintain state? As for LWP vs Mech, LWP does work for my use case, I just prefer Mech for a few niceties like `autocheck`, auto-delegation of `$mech->content()` to `$response->decoded_content()`, `cookie_jar` defaulting to on, etc.
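For anyone comparing the two, here is a rough sketch of what those niceties look like when spelled out by hand on top of plain LWP. It only approximates the behaviour described above and is not how Mech implements it; `fetch_content` is a hypothetical helper name made up for this example.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

# Plain LWP has no cookie jar unless you ask for one.
my $ua = LWP::UserAgent->new( cookie_jar => HTTP::Cookies->new );

# Hypothetical helper: die on failure (roughly what autocheck gives you)
# and return the decoded body instead of the raw response object.
sub fetch_content {
    my ( $agent, $url ) = @_;
    my $response = $agent->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}
```

With that in place, `my $html = fetch_content( $ua, $url );` either returns decoded text or dies, which is close to the convenience being described.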
Re^3: The State of Web spidering in Perl by Anonymous Monk on Sep 23, 2013 at 00:03 UTC
> I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like `//div[@id='blah']/p` though, do you explicitly maintain state?

You don't -- you might use HTML::Parser if you want to reinvent HTML::Tree. It's like XML::Parser: you might use it if you want to reinvent XML::Twig, but since both Tree and Twig exist and already do a fantastic job, don't waste your time reinventing them :)

And now my linkdump of examples, docs and tutorials: because xml::parser is low level, you should parse html/xml with xpath/twig/dom; Re: How to grab a portion of file with regex (don't); HTML Parser suggestions.

See also the real discouragement, Oh Yes You Can Use Regexes to Parse HTML!, and the real encouragement, Re^2: parsing XML fragments (xml log files) with... a regex; How do I match XML, HTML, or other nasty, ugly things with a regex?; How do I remove HTML from a string?; Re: Parsing webpages.

See also htmltreexpather.pl and xpather.pl.

htmltreexpather.pl: Parsing HTML / Re^4: Parsing HTML, A regex question; NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day; Re: Extracting HTML content between the h tags; Re^2: Help With Online Table Scraper; Re^4: web::scraper using an xpath; ....

xpather.pl: Re: Get Node Value from irregular XML (xpather.pl); Re: Having trouble with siblings; Re^2: XML parsing and Lists; Re: Counting number of child nodes based on element value (typos); Re^3: Extracting specific childnodes (xpath whitespace); Re^3: Extracting specific childnodes (play xmllint --shell ); Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?; Re: How to parse xml with namespase vale in XMl:LibXML? ( XPath error : Undefined namespace prefix ); Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester).

There is a better way :)
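To make the "you don't" concrete, here is a minimal sketch of the XPath from the question evaluated against a tree instead of hand-maintained parser state. It assumes the CPAN module HTML::TreeBuilder::XPath (an XPath-enabled HTML::Tree builder) is installed; the sample HTML is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module, not in core

# Made-up sample document, for illustration only.
my $html = <<'HTML';
<html><body>
  <div id="blah"><p>first</p><p>second</p></div>
  <div id="other"><p>ignored</p></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The tree already records nesting, so the XPath does the matching;
# no state flags or tag stacks are kept by hand.
my @texts = map { $_->as_text } $tree->findnodes(q{//div[@id='blah']/p});

print "$_\n" for @texts;

$tree->delete;    # release the tree when done
```

The same expression could be handed to XML::LibXML or XML::Twig for well-formed XML input; the point is only that the tree libraries already do the bookkeeping.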