xtomservox has asked for the wisdom of the Perl Monks concerning the following question:

Salutations, Monks! I am trying to bust out URLs from HTML using regexes, and I have tried several fruitless solutions. I think the regexes I am using are getting hung up on special characters that I am not quoting correctly?

$page =~ /<a HREF="(\w*)" TITLE=""><b>Click Here/;
$url = $1;


Is there a kind monk among ye to show me the error of my ways?

Replies are listed 'Best First'.
Re: Parsing out URLs with regex
by artist (Parson) on May 14, 2003 at 19:28 UTC

      Agreed. When working on code for heavy use, don't reinvent the wheel.

      For learning purposes, though, you don't want (\w*) for the maximum number of consecutive word characters, you want (.*?) for the minimum number of characters followed by the closing quote.

      --
      [ e d @ h a l l e y . c c ]

        Actually, this is a good example of when .*? is not the best choice. [^"]* is a much better idea. You don't want to run into this problem:

        $page = '<a href="foo">...'
              . '<a href="bar" title="baz"><b>Click Here';
        $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;
        where $1 will contain 'foo">...<a href="bar'.
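        To make tye's point concrete, here is a minimal sketch (the sample HTML string is hypothetical): the negated character class `[^"]*` cannot cross a closing quote, so it cannot swallow the first link the way the lazy `.*?` can.

        ```perl
        use strict;
        use warnings;

        # Hypothetical page with two links; only the second has the title we want.
        my $page = '<a href="foo">...'
                 . '<a href="bar" title="baz"><b>Click Here';

        # Lazy quantifier: backtracks right across the first link's closing quote.
        my ($lazy) = $page =~ /<a href="(.*?)" title="(.*?)"><b>Click Here/i;

        # Negated character class: stops at the first closing quote, so the
        # match can only succeed starting at the second <a>.
        my ($strict) = $page =~ /<a href="([^"]*)" title="([^"]*)"><b>Click Here/i;

        print "lazy:   $lazy\n";    # foo">...<a href="bar
        print "strict: $strict\n";  # bar
        ```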

                        - tye
(jeffa) Re: Parsing out URLs with regex
by jeffa (Bishop) on May 14, 2003 at 19:41 UTC
    Monks don't let Seekers parse HTML with regexes (but i'll bet if you search, you'll find some regexes to do just that). However, if your href attributes are going to be schemed (http://, ftp://, etc.) then use URI::Find instead:
    use strict;
    use warnings;
    use URI::Find;

    my @found;

    # this is just for this example, your $page will have the HTML
    # this line slurps the DATA filehandle below into a scalar
    my $page = do { local $/; <DATA> };

    my $finder = URI::Find->new( sub { push @found, shift } );
    $finder->find(\$page);

    print $_, $/ for @found;

    __DATA__
    <a href="http://foo.com/bar/qux.html">stuff</a>
    <a href="http://bar.com/baz.cgi?foo=bar&stuff=more%20stuff">click</a>
    <a href="mailto:spam@me.com">don't feed the trolls</a>

    UPDATE: i was originally using URI::Find::Schemeless - switched to URI::Find.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
•Re: Parsing out URLs with regex
by merlyn (Sage) on May 14, 2003 at 20:49 UTC
    Amongst the other issues already pointed out by fellow monks, I saw no mention yet of the gravest error here:
    Don't use $1 unless you are absolutely sure the match succeeded.
    Definitely bad code here.
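    A minimal sketch of merlyn's warning (the sample input is hypothetical): capture variables like $1 retain the value from the last *successful* match, so using $1 unconditionally can silently hand you a stale or undefined value. Use the match in boolean context first.

    ```perl
    use strict;
    use warnings;

    my $page = '<p>no links here</p>';

    # Wrong: if this match fails, $1 is left over from whatever matched
    # last (or is undef), so $url gets garbage without any warning.
    #   $page =~ /<a href="([^"]*)"/i;  my $url = $1;

    # Right: only read $1 when the match is known to have succeeded.
    my $url;
    if ( $page =~ /<a href="([^"]*)"/i ) {
        $url = $1;
    }
    else {
        warn "no link found\n";
    }

    print defined $url ? "url: $url\n" : "url: (none)\n";
    ```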

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Parsing out URLs with regex
by benn (Vicar) on May 14, 2003 at 19:56 UTC
    Update: but of course, be sure to read tye's crit of halley's got-there-first version of this, benn's late entry... :)

    As artist implies, there are more well-thought-out solutions to this particular problem than you can shake a stick at.

    *If*, however, you're doing this purely as a learning exercise {g}, then you're correct - "\w" matches *only* alphanumerics and '_', so "http://" will throw it off, for instance. "." matches any character (except newline, by default)...you could use that, or a character class ('[\w:\/-]*' or something) to match only the characters that you want.

    If you use "." though, be warned that "*" is 'greedy' - if your page contains more than one '" TITLE=""><b>Click Here' , then ".*" will grab a whole lot more than you bargained for...you'll probably want to make it ".*?" - the "?" makes it 'minimal'.
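    A short sketch of both points, using a hypothetical page string (the character class here is just one illustration of the idea):

    ```perl
    use strict;
    use warnings;

    # Hypothetical href containing characters outside \w (':' and '/'):
    my $page = '<a HREF="http://example.com/x" TITLE=""><b>Click Here';

    # (\w*) stops dead at the ':' in "http://", so the whole match fails
    # and nothing at all is captured.
    my ($word) = $page =~ /<a HREF="(\w*)" TITLE=""><b>Click Here/;

    # A character class that also allows the URL punctuation succeeds:
    my ($class) = $page =~ /<a HREF="([\w:\/.-]*)" TITLE=""><b>Click Here/;

    print defined $word ? "word:  $word\n" : "word:  no match\n";
    print "class: $class\n";   # http://example.com/x
    ```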

    Hope this clarifies things,
    Cheers, Ben.

Re: Parsing out URLs with regex
by nite_man (Deacon) on May 15, 2003 at 08:57 UTC
    You can retrieve URLs from an HTML page using HTML::LinkExtor:
    #!/usr/bin/perl -w
    use LWP::Simple;
    use HTML::LinkExtor;
    use Data::Dumper;

    my $content = get("http://www.yandex.ru");      # get web page
    die "Get web page failed!" unless defined $content;

    my $parser = HTML::LinkExtor->new();            # create LinkExtor object
    $parser->parse($content);                       # parse content
    my @links = $parser->links;                     # get list of links
    print Dumper(\@links);                          # print list of links
          
    --------------------------------
    SV* sv_bless(SV* sv, HV* stash);