Re^4: Any spider framework?

by bart (Canon)
on Jan 10, 2012 at 08:07 UTC


in reply to Re^3: Any spider framework?
in thread Any spider framework?

In the case of <a name="foo"> it simply won't match, as the regexp includes href.
And what makes you think the regex would limit itself to a single tag? In your example, the "<a" could match in one place while the "href=" is found much further down in the document. In fact, there is no guarantee that this string is a tag attribute at all: it could just as well be in plain HTML text ("PCDATA"), in JavaScript code, or even inside an HTML comment.
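To make the failure concrete, here is a small sketch in Python. The pattern `naive` is only an assumption about the kind of regex under discussion (the original is not quoted in this node), and the sample HTML is made up for illustration.

```python
import re

# Hypothetical naive pattern (an assumption, not the regex from the thread):
# "<a", then anything at all, then href="...".
naive = re.compile(r'<a.*?href="([^"]*)"', re.S)

# One anchor without href, followed by plain text that merely mentions href=.
html = '<a name="foo">anchor</a> <p>plain text mentioning href="not-a-link"</p>'

# The "<a" matches inside the first tag, while "href=" is picked up
# much later, from ordinary text outside any <a> tag.
match = naive.search(html)
spurious = match.group(1) if match else None
print(spurious)  # not-a-link
```

So the `<a name="foo">` tag does not prevent a match; it just becomes the start of a bogus one.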

To be reliable, a parser (actually just a lexer, which could itself be regex based) should first extract whole tags, and you should then test each tag on its own.
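A minimal sketch of that two-step idea, in Python and under loud assumptions: the names `TOKEN`, `A_HREF` and `extract_links` are made up here, attribute values containing ">" are not handled, and `<script>` blocks and comments are simply skipped whole. For real work a proper HTML parser is the safer choice; this only shows the shape of "lex whole tags first, then test each one".

```python
import re

# Step 1, the lexer: comments and <script> blocks are consumed as single
# tokens so that markup-like text inside them is never seen as a tag.
# Assumption: no ">" inside attribute values.
TOKEN = re.compile(r'<!--.*?-->|<script\b.*?</script\s*>|<[^<>]+>',
                   re.S | re.I)

# Step 2, the per-tag test: an <a ...> tag with an href attribute,
# double-quoted, single-quoted, or unquoted.
A_HREF = re.compile(
    r'''<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''',
    re.S | re.I)

def extract_links(html):
    """Return href values of <a> tags, testing each tag on its own."""
    links = []
    for token in TOKEN.findall(html):
        # Skip comments and script blocks entirely.
        if token.lower().startswith(('<!--', '<script')):
            continue
        m = A_HREF.match(token)
        if m:
            # Exactly one of the three value groups is set.
            links.append(next(g for g in m.groups() if g is not None))
    return links

sample = ('<!-- <a href="old.html"> --><a name="foo">x</a> href="text" '
          '<script>var s = \'<a href="js.html">\';</script>'
          '<A HREF=\'/two\' class="c">2</A><a class="k" href=three>3</a>')
links = extract_links(sample)
print(links)  # ['/two', 'three']
```

Because every match is confined to a single token, `<a name="foo">` is rejected on its own merits, and the href= in the comment, the plain text and the script string never reaches the attribute test.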
