Help with regs

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help with regs by Aristotle (Chancellor) on Dec 16, 2002 at 00:27 UTC
Don't use a regex. Use HTML::TokeParser::Simple and URI::Find (or URI::Find::Schemeless) instead. The HTML::TokeParser::Simple documentation has examples of very similar cases you can adapt. The core loop would look something like this:

```perl
use HTML::TokeParser::Simple;
use URI::Find::Schemeless;

my $finder = URI::Find::Schemeless->new(sub {
    my ($uri, $text) = @_;
    return qq{<a href="$uri">$text</a>};
});

my $p = HTML::TokeParser::Simple->new(\*STDIN);
while (my $token = $p->get_token) {
    my $text = $token->as_is;
    $finder->find(\$text) if $token->is_text;   # here's the key
    print $text;
}
```

The key line invokes URI::Find::Schemeless only for tokens that are plain text, not part of a tag or anything else.

Makeshifts last the longest.
Re: Help with regs by graff (Chancellor) on Dec 16, 2002 at 01:41 UTC
If you know that the incoming text is HTML data, then there is probably a good way to use HTML::TokeParser::Simple so that you can locate just the pieces in the data that represent usable URLs that happen to be part of the visible text of the page. This node shows an example of how it's used for a similar sort of editing task.

Apart from that, the first parenthesized portion looks a bit odd, and the basic problem is that it doesn't really guard against hitting on a URL that happens to be inside of (i.e. an attribute of) some other tag. Something like the following might be an improvement (but HTML::TokeParser, or TokeParser::Simple, is still the preferred approach):

```perl
s{(>[^<]*)(http://([.\w/]+))}{$1<a href=$2>$3</a>}gi;
```

Note the use of curly braces to bound the left and right sides of the expression, so we don't have to backslash-escape all the slashes in the pattern content (you forgot to add the backslash for the </a> part in your code, so it should have caused a syntax error). In this version, the first part assumes that once you see a close angle bracket, you're not inside any sort of tag, so look for zero or more characters that are not an open bracket, followed by a URL.

(update: fixed a couple typos in the explanation.)
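As a rough illustration of the substitution graff describes, here is a minimal sketch that applies it to a made-up sample string. The `linkify` helper name and the sample HTML are assumptions for demonstration only, and the pattern uses `[^<]*` to match the "zero or more characters that are not an open bracket" wording of the explanation.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution from the post to a copy
# of the input and returns the result.
sub linkify {
    my ($html) = @_;
    # After a closing '>', skip any non-'<' characters, then wrap the URL.
    $html =~ s{(>[^<]*)(http://([.\w/]+))}{$1<a href=$2>$3</a>}gi;
    return $html;
}

# Made-up sample: one URL in visible text, one inside a tag attribute.
my $sample = '<p>See http://example.com/docs for details.</p>'
           . '<img src="http://example.com/logo.png">';

print linkify($sample), "\n";
```

In this sample the URL in the paragraph text gets wrapped in an anchor, while the one in the `src` attribute is left alone because no `>` precedes it within the tag. Two limitations follow from the pattern itself: a URL that appears before any `>` in the input is never matched, and because `[^<]*` is greedy, only the last URL in a given run of text is linked. Both are further reasons to prefer the parser-based approach.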