xtomservox has asked for the wisdom of the Perl Monks concerning the following question:

Salutations, Monks! I am trying to bust out URLs from HTML using regexes, and I have tried several fruitless solutions. I think the regexes I am using are getting hung up on special characters that I am not quoting correctly?

$page =~ /<a HREF="(\w*)" TITLE=""><b>Click Here/;
$url = $1;


Is there a kind monk among ye to show me the error of my ways?

Replies are listed 'Best First'.
Re: Parsing out URLs with regex
by artist (Parson) on May 14, 2003 at 19:28 UTC

      Agreed. When working on code for heavy use, don't reinvent the wheel.

      For learning purposes, though, you don't want (\w*) for the maximum number of consecutive word characters, you want (.*?) for the minimum number of characters followed by the closing quote.

      --
      [ e d @ h a l l e y . c c ]

        Actually, this is a good example of when .*? is not the best choice. [^"]* is a much better idea. You don't want to run into this problem:

        $page = '<a href="foo">...'
              . '<a href="bar" title="baz"><b>Click Here';
        $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;
        where $1 will contain 'foo">...<a href="bar'.
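        To make tye's point concrete, here is a minimal sketch (the sample HTML string is hypothetical): the negated character class `[^"]*` cannot cross a closing quote, so it cannot swallow the first link the way the lazy `.*?` can.

        ```perl
        use strict;
        use warnings;

        # Hypothetical page with two links; only the second has the title we want.
        my $page = '<a href="foo">...'
                 . '<a href="bar" title="baz"><b>Click Here';

        # Lazy quantifier: backtracks right across the first link's closing quote.
        my ($lazy) = $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;

        # Negated character class: stops at the first closing quote, so the
        # match can only succeed starting at the second <a>.
        my ($strict) = $page =~ /<a href="([^"]*)" title="([^"]*)"><b>Click Here/i;

        print "lazy:   $lazy\n";    # foo">...<a href="bar
        print "strict: $strict\n";  # bar
        ```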

                        - tye
(jeffa) Re: Parsing out URLs with regex
by jeffa (Bishop) on May 14, 2003 at 19:41 UTC
    Monks don't let Seekers parse HTML with regexes (but i'll bet if you search, you'll find some regexes to do just that). However, if your href attributes are going to be schemed (http://, ftp://, etc.) then use URI::Find instead:
    use strict;
    use warnings;
    use URI::Find;

    my @found;

    # this is just for this example, your $page will have the HTML
    # this line slurps the DATA filehandle below into a scalar
    my $page = do { local $/; <DATA> };

    my $finder = URI::Find->new( sub { push @found, shift } );
    $finder->find(\$page);

    print $_, $/ for @found;

    __DATA__
    <a href="http://foo.com/bar/qux.html">stuff</a>
    <a href="http://bar.com/baz.cgi?foo=bar&stuff=more%20stuff">click</a>
    <a href="mailto:spam@me.com">don't feed the trolls</a>

    UPDATE: i was originally using URI::Find::Schemeless - switched to URI::Find.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
•Re: Parsing out URLs with regex
by merlyn (Sage) on May 14, 2003 at 20:49 UTC
    Amongst the other issues already pointed out by fellow monks, I saw no mention yet of the gravest error here:
    Don't use $1 unless you are absolutely sure the match succeeded.
    Definitely bad code here.
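    A minimal sketch of merlyn's warning (the sample input is hypothetical): capture variables like $1 retain the value from the last *successful* match, so using $1 unconditionally can silently hand you a stale or undefined value. Use the match in boolean context first.

    ```perl
    use strict;
    use warnings;

    my $page = '<p>no links here</p>';

    # Wrong: if this match fails, $1 is left over from whatever matched
    # last (or is undef), so $url gets garbage without any warning.
    #   $page =~ /<a href="([^"]*)"/i;  my $url = $1;

    # Right: only read $1 when the match is known to have succeeded.
    my $url;
    if ( $page =~ /<a href="([^"]*)"/i ) {
        $url = $1;
    }
    else {
        warn "no link found\n";
    }

    print defined $url ? "url: $url\n" : "url: (none)\n";
    ```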

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Parsing out URLs with regex
by benn (Vicar) on May 14, 2003 at 19:56 UTC
    Update: but of course, be sure to read tye's crit of halley's got-there-first version of this, benn's late entry... :)

    As artist implies, there are more well-thought-out solutions to this particular problem than you can shake a stick at.

    *If*, however, you're doing this purely as a learning exercise {g}, then you're correct - "\w" matches *only* alphanumerics and '_', so "http://" will throw it off, for instance. "." matches any character (except newline, by default)...you could use that, or a character class ('[\w:\/-]*' or something) to match only the characters that you want.

    If you use "." though, be warned that "*" is 'greedy' - if your page contains more than one '" TITLE=""><b>Click Here' , then ".*" will grab a whole lot more than you bargained for...you'll probably want to make it ".*?" - the "?" makes it 'minimal'.
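    A short sketch of both points, using a hypothetical page string (the character class here is just one illustration of the idea):

    ```perl
    use strict;
    use warnings;

    # Hypothetical href containing characters outside \w (':' and '/'):
    my $page = '<a HREF="http://example.com/x" TITLE=""><b>Click Here';

    # (\w*) stops dead at the ':' in "http://", so the whole match fails
    # and nothing at all is captured.
    my ($word) = $page =~ /<a HREF="(\w*)" TITLE=""><b>Click Here/;

    # A character class that also allows the URL punctuation succeeds:
    my ($class) = $page =~ /<a HREF="([\w:\/.-]*)" TITLE=""><b>Click Here/;

    print defined $word ? "word:  $word\n" : "word:  no match\n";
    print "class: $class\n";   # http://example.com/x
    ```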

    Hope this clarifies things,
    Cheers, Ben.

Re: Parsing out URLs with regex
by nite_man (Deacon) on May 15, 2003 at 08:57 UTC
    You can retrieve URLs from an HTML page using HTML::LinkExtor:
    #!/usr/bin/perl -w
    use LWP::Simple;
    use HTML::LinkExtor;
    use Data::Dumper;

    my $content = get("http://www.yandex.ru");      # get web page
    die "Get web page failed!" unless defined $content;

    my $parser = HTML::LinkExtor->new();            # create LinkExtor object
    $parser->parse($content);                       # parse content
    my @links = $parser->links;                     # get list of links
    print Dumper(\@links);                          # print list of links
          
    --------------------------------
    SV* sv_bless(SV* sv, HV* stash);