In the case of <a name="foo"> it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (which is rare, but valid) is more problematic: the link is silently missed. There are other edge cases which don't behave the way you might want either - note that the first subcapture allows ">" to occur within it, so a match can run past the end of the tag.
But in practice, it's probably good enough to work for the majority of people.
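To make those edge cases concrete, here is a minimal sketch. The pattern is a stand-in I made up for illustration, not the module's actual regexp; it only shares the two properties described above (it requires href, and its first capture is allowed to contain ">"). The helper name href_of is hypothetical too.

```perl
use strict;
use warnings;

# Stand-in pattern, NOT the module's actual regexp: it requires href,
# and its first capture group (.*?) is free to run past a ">".
my $link_re = qr/<a\s(.*?)href="(.*?)"/is;

# Return the captured href value, or undef when the pattern does not match.
sub href_of {
    my ($html) = @_;
    return $html =~ $link_re ? $2 : undef;
}

my @cases = (
    '<a href="ok.html">ok</a>',                         # ordinary link
    '<a name="foo">anchor</a>',                         # named anchor, no href
    '<a href = "spaced.html">spaced</a>',               # whitespace around "="
    '<a name="foo">anchor</a><link href="style.css">',  # capture runs past ">"
);

for my $html (@cases) {
    my $href = href_of($html);
    printf "%-50s => %s\n", $html, defined $href ? $href : '(no match)';
}
```

With this stand-in, the named anchor is correctly skipped, the spaced link is missed, and the last case picks up the href of an unrelated tag because the first capture swallows the ">".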
The author may well accept a patch to parse the page properly using HTML::Parser, given that the module already depends on it (indirectly, via LWP::UserAgent).
Or if you can't wait for a new fixed version to be released, just subclass it - it's only really that one method that's in major need of fixing.
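For what it's worth, a rough sketch of what the HTML::Parser approach could look like. The function name extract_links is hypothetical, and I haven't tied it to the module's real method name; in a subclass you would call something like this from whichever method you override.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: collect the href of every <a> start tag.
sub extract_links {
    my ($html) = @_;
    my @links;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tagname, $attr) = @_;
                # Tag and attribute names arrive lowercased by default.
                push @links, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of document so buffered text is flushed
    return @links;
}

my $html = '<a name="foo">anchor</a> '
         . '<a href = "spaced.html">x</a> '
         . '<link href="style.css">';
print "$_\n" for extract_links($html);
```

Because the parser works on real tags rather than a pattern, the named anchor and the <link> element are ignored and the whitespace around the equals sign stops being a problem.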
In reply to Re^3: Any spider framework? by tobyink, in thread Any spider framework? by Anonymous Monk