in reply to Re: extracting web links
in thread extracting web links

I've noticed that my regular expression version finds more links than HTML::Parser does. Since HTML::LinkExtor and HTML::LinkExtractor are based on HTML::Parser, can I assume that they aren't going to pick up more links than Parser does?

Re: Re: Re: extracting web links
by Corion (Patriarch) on Dec 27, 2003 at 22:22 UTC

    All of these modules will only find links in anchor tags or image tags. Your regular expression doesn't seem very valid to me, so I doubt that it will find more links, but it will surely find different links: it will more or less gobble up anything in double quotes that remotely looks like a link, while leaving out links in single quotes.

    I'm not sure about your requirements, but for me, any of these modules has always been enough. If you have special requirements as to the nature of links extracted, please state them more specifically and if possible, with examples.
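For comparison with a hand-written regex, here is a minimal sketch of link extraction with HTML::LinkExtor, assuming the HTML-Parser distribution is installed. The HTML fragment and the `extract_links` helper are made up for illustration; only `new`, `parse`, `eof` and `links` come from the module.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical fragment mixing double-quoted, single-quoted and unquoted values.
my $html = join "\n",
    '<a href="http://example.com/a" class="nav">one</a>',
    q{<a href='http://example.com/b'>two</a>},
    '<a href=http://example.com/c>three</a>',
    '<img src="logo.png">';

# Return one "tag attribute url" string per link the parser reports.
sub extract_links {
    my ($doc) = @_;
    my $parser = HTML::LinkExtor->new;   # no callback: links accumulate internally
    $parser->parse($doc);
    $parser->eof;                        # signal end of document
    my @found;
    for my $link ($parser->links) {
        # Each entry is [ $tag, $attr_name => $url, ... ]
        my ($tag, %attr) = @$link;
        push @found, "$tag $_ $attr{$_}" for sort keys %attr;
    }
    return @found;
}

print "$_\n" for extract_links($html);
```

The parser handles all three quoting styles, so every `href` above is reported along with the `src` of the image.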

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
      *Ahem*

      I believe you meant to say all except HTML::LinkExtractor, which gets them all :).

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.

Re: Re: Re: extracting web links
by dominix (Deacon) on Dec 27, 2003 at 22:49 UTC
    Probably more, but what about links like

    http://user:passwd@site

    That becomes especially interesting when you consider that the user or password could contain a space (or a quote?), that a URL can contain a comment like "a > b ?", and that quotes aren't mandatory. Finally, consider that a regex can be a memory/CPU hog. If you still feel like using regexes, test things like
    perl -Mre=debug -e '" "=~/href\s*=\s*"*([^"\s]+)"*\s*>/gi'
    (put your URL in place of the " " at the start of the expression)
    and you'll have a better idea of what the regex engine does for you. After that, maybe your regex will be the solution ... but I doubt it.
    --
    dominix
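To make those concerns concrete, here is a small sketch that runs the pattern from the one-liner above against a few hand-made tags. The sample tags and the `extract_href` helper are hypothetical; the pattern is copied unchanged, apart from dropping `/g` because each sample is matched once.

```perl
use strict;
use warnings;

# Pattern from the one-liner above.
my $re = qr/href\s*=\s*"*([^"\s]+)"*\s*>/i;

# Hypothetical sample tags, one per case raised in the thread.
my @samples = (
    [ double   => '<a href="http://example.com/a">x</a>' ],
    [ single   => q{<a href='http://example.com/b'>x</a>} ],
    [ unquoted => '<a href=http://example.com/c>x</a>' ],
    [ attr     => '<a href="http://example.com/d" class="nav">x</a>' ],
    [ space    => '<a href="http://user:pass word@example.com/">x</a>' ],
);

# Return the captured text, or undef when the pattern does not match.
sub extract_href {
    my ($html) = @_;
    return $html =~ $re ? $1 : undef;
}

for my $sample (@samples) {
    my ($name, $html) = @$sample;
    my $got = extract_href($html);
    printf "%-9s %s\n", $name, defined $got ? $got : '(no match)';
}
```

With these inputs, only the plain double-quoted tag yields the bare URL. The single-quoted and unquoted tags match but capture past the end of the URL (up to the final `</a`), and the tags with a trailing attribute or an embedded space do not match at all.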
      These are some of the things I was worried about with regex. I also figured that a lot of time and energy has gone into the Parser module and that many people have already reviewed it and given it their blessing. I just wanted to make sure that I was heading in the right direction.
      Thanks for the input!