PerlMonks  

Re^3: The State of Web spidering in Perl

by Anonymous Monk
on Sep 23, 2013 at 00:03 UTC ( [id://1055207]=note )


in reply to Re^2: The State of Web spidering in Perl
in thread The State of Web spidering in Perl

I'll give HTML::Parser a second look, thanks for the suggestion. How do you match something like //div[@id='blah']/p though, do you explicitly maintain state?

You don't. You might use HTML::Parser if you want to reinvent HTML::Tree. It's like XML::Parser: you might use it if you want to reinvent XML::Twig, but since both Tree and Twig exist and already do a fantastic job, don't waste your time reinventing them :)
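
To make that concrete, here is a minimal sketch of matching //div[@id='blah']/p with a tree-based parser instead of tracking state by hand. It assumes the CPAN module HTML::TreeBuilder::XPath (built on HTML::Tree) is installed; the sample markup and variable names are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module, assumed installed (not core Perl)

# Hypothetical sample markup, invented for this illustration.
my $html = <<'HTML';
<html><body>
  <div id="blah"><p>first</p><p>second</p></div>
  <div id="other"><p>ignored</p></div>
</body></html>
HTML

# Build the whole tree once; the tree records the nesting,
# so there is no state to maintain in callbacks.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# The attribute test goes in square brackets: div[@id="blah"].
my @paragraphs = map { $_->as_text }
    $tree->findnodes('//div[@id="blah"]/p');

print "$_\n" for @paragraphs;

# Free the tree explicitly when done with it.
$tree->delete;
```

With HTML::Parser alone you would have to note in your start/end handlers when you enter and leave the div and when you are inside a p, which is exactly the bookkeeping the tree already does.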

And now my linkdump of examples, docs, and tutorials ... because xml::parser is low level, you should parse html/xml with xpath/twig/dom, Re: How to grab a portion of file with regex (don't),
HTML Parser suggestions
See also the real discouragement Oh Yes You Can Use Regexes to Parse HTML! and the real encouragement Re^2: parsing XML fragments (xml log files) with... a regex
How do I match XML, HTML, or other nasty, ugly things with a regex?
How do I remove HTML from a string?
Re: Parsing webpages

See also htmltreexpather.pl and xpather.pl

htmltreexpather.pl, Parsing HTML / Re^4: Parsing HTML, A regex question, NASA's Astronomy Picture of the Day / Re: NASA's Astronomy Picture of the Day, Re: Extracting HTML content between the h tags, Re^2: Help With Online Table Scraper, Re^4: web::scraper using an xpath, .... HTML Parser suggestions

xpather.pl
Re: Get Node Value from irregular XML (xpather.pl)
Re: Having trouble with siblings
Re^2: XML parsing and Lists
Re: Counting number of child nodes based on element value (typos)
Re^3: Extracting specific childnodes (xpath whitespace)
Re^3: Extracting specific childnodes (play xmllint --shell )
Re: How do i get value of an element if the next elememnt has specific value in XML::LibXML using Xpath?
Re: How to parse xml with namespase vale in XMl:LibXML? ( XPath error : Undefined namespace prefix )
Re^2: How to parse xml with namespase vale in XMl:LibXML? (xmllint --shell setns / xpathtester)

There is a better way :)
