Using URI::Find with HTML

skx has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a problem using the URI::Find module to process a mixture of plain text and HTML code.

The perldoc suggests using the following code:

  use CGI qw(escapeHTML);
  use URI::Find;

  $text = "<pre>\n" . escapeHTML($text) . "</pre>\n";
  my $finder = URI::Find->new(
                              sub {
                                  my($uri, $orig_uri) = @_;
                                  return qq|<a href="$uri">$orig_uri</a>|;
                              });
  $finder->find(\$text);

This works beautifully on plain-text input. However, the callback receives no context about where the match occurred, so it cannot avoid mangling a URI that is already part of a link, such as:

<a href="http://foo.com/">foo</a>

This is transformed even though I don't need it to be.

Short of trying to detect heuristically whether I'm processing HTML or plain text, is there another module I could use to insert hyperlinks around URIs that are not already linked?

Steve
--


Re: Using URI::Find with HTML by merlyn (Sage) on Dec 06, 2005 at 15:12 UTC
You could use this technique with HTML::Parser, which by default passes the tags, attributes, and comments through untouched but performs the substitution above on the text portions. Be sure to set "unbroken text" so you don't get two callbacks in a given text run. Adapting one of the examples there, it'd be something like:

  use HTML::Parser;
  HTML::Parser->new(
      unbroken_text => 1,
      default_h     => [sub { print shift }, 'text'],
      text_h        => [sub {
                            my $text = shift;
                            # (URI::Find here)
                            print $text;
                        }, 'text'],
  )->parse_file(shift || die) || die $!;

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
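For readers who want the placeholder filled in, here is a minimal sketch of the same idea, assuming HTML::Parser and URI::Find are both installed. The function name linkify_html and the $in_anchor counter are hypothetical additions, not part of the reply above: the counter skips text that already sits inside an <a> element, so a URI used as link text is not wrapped a second time. The output is collected in a string instead of being printed.

```perl
use strict;
use warnings;
use HTML::Parser;
use URI::Find;

# Hypothetical helper: returns $html with bare URIs in text runs linked.
sub linkify_html {
    my ($html) = @_;
    my $out       = '';
    my $in_anchor = 0;    # assumption: depth counter for open <a> elements

    # Same callback as in the question: wrap each URI found in a link.
    my $finder = URI::Find->new(sub {
        my ($uri, $orig_uri) = @_;
        return qq|<a href="$uri">$orig_uri</a>|;
    });

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # one callback per text run
        # Comments, declarations, etc. pass through unchanged.
        default_h => [sub {
            my ($text) = @_;
            $out .= $text if defined $text;
        }, 'text'],
        # Tags pass through unchanged; we only note entering/leaving <a>.
        start_h => [sub {
            my ($tagname, $text) = @_;
            $in_anchor++ if $tagname eq 'a';
            $out .= $text;
        }, 'tagname, text'],
        end_h => [sub {
            my ($tagname, $text) = @_;
            $in_anchor-- if $tagname eq 'a' && $in_anchor > 0;
            $out .= $text;
        }, 'tagname, text'],
        # Only text outside an existing link is searched for URIs.
        text_h => [sub {
            my ($text) = @_;
            $finder->find(\$text) unless $in_anchor;
            $out .= $text;
        }, 'text'],
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

print linkify_html(
    '<p>See http://example.com/ and <a href="http://foo.com/">foo</a></p>'
), "\n";
```

Because the text handler receives the source text, entities such as &amp; are left as they were, which suits input that is already HTML. Plain-text input would still need the escapeHTML step from the question before being passed in.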
Re^2: Using URI::Find with HTML by skx (Parson) on Dec 06, 2005 at 15:59 UTC
Thanks a lot for your help; that pointed me in the right direction.

Steve
--