PerlStalker has asked for the wisdom of the Perl Monks concerning the following question:

I am seeking the wisdom of the Monks.

I am trying to parse an HTML page for specific words and link them to their definition in a glossary.

That is easy enough to do using this simple bit of code:

foreach my $word (keys %glossary) { print STDERR "$word "; # Works but tries to link inside other links. $file =~ s/(\s+)($word)([\.\?\!]?\s+)/"$1<a href=\"".$glossary{$wo +rd}{'link'}."\">$2<\/a>$3"/eig; }

($file contains the entire text of the page and I forgot a \s or two to deal with whitespace. Oops.)

The problem I'm having is that it is trying to link words that are already part of other links. I don't want that to happen. How can I replace just the words that are not enclosed in <a> tags?

I have an idea that involves replacing the links first with a place holder, then replacing the words in the doc, and finally putting the links back but that means going through the file two more times. I would prefer to do it all at once if I could.

I would be greatful for any enlightenment I could find.
PerlStalker

Replies are listed 'Best First'.
Re: Linking words in html to glossary.
by lshatzer (Friar) on Jul 20, 2001 at 02:13 UTC
    Here is a similar node that touched on this issue.

    Digest version: HTML::Parser is your friend.

      Doh! You are right. I missed that one somehow. He's also trying to solve a slightly different problem. I can skip the stiff in the quotes (href="..."). I want to skip the stuff beteen <a href="..."> and </a>. I also forgot that I will be doing this inside of an XML::Parser. So I will know when I'm inside the tag. (That's what I get for trying to test new ideas in a seperate program :) )

      I am still interested in how this would work using a regex (or two or three). The node you linked didn't mention much in the comments on how to do that. I don't think the solutions mentioned there would work. I suspect some very fun lookahead and/or lookbehind stuff but I have not used those options before.

      PerlStalker

        Ahhh, okay. Sorry about that. If you look at my reply in that node, I mention a method that uses a negative lookahead. Play around with it a little. And as always, read perlre :o)
Re: Linking words in html to glossary.
by mirod (Canon) on Jul 20, 2001 at 14:09 UTC

    Interesting problem (and one I have to solve soon), so here is my take, using (of course!) XML::Twig. You will need the absolute latest version of the module as I found a bug and added an option to make it easier. (at the moment you have to go through 193.251.86.24/xmltwig/ to get it, I still have DNS problems.

    Note that this version loads the entire document in memory. In order to do it in stream mode you would need 2 passes, one that gets the terms to link to, creates the a tag around them and stores the name attribute, and one that goes again through the document and creates the links. It would actually be quite similar except the code would be in twig_handlers and there would be purge and flush calls to free the memory as we go.

    So here is the code:

    #!/usr/bin/perl -w use strict; use XML::Twig; my $t= XML::Twig->new( pretty_print => 'indented'); $t->parse( \*DATA); # get the terms my @dt= $t->descendants( 'dt'); # wrap the terms in <a name="normalized_term"> tags foreach my $dt (@dt) { my $name= lc $dt->text; $name=~ s/\W/_/g; # normalize the term so it c +an be a name $dt->insert( a => { name => $name }); # insert the a element in th +e dt } # create a hash term (litteral text to match) => name to link to my %dt= map { $_->text, $_->first_child('a')->att( 'name') } @dt; # create a (potentially huge!) regexp or'ing all the texts to match # the texts are sorted by reverse length so 'foo bar' comes before 'fo +o' # note that the actual word matched needs to be captured my $regexp_text= join '|', sort { length( $b) <=> length( $a) } keys % +dt; my $regexp= qr/\b($regexp_text)\b/i; # now go through the interesting parts of document and match away! # an other option would be to get all text descendants and to # 'next' if the context is not right (like a h<n> or whatever my @new_links; foreach my $text ($t->descendants( qr/^(p|dd)$/)) { next if( $text->in_context( 'a')); # that's where we skip text +in links # this is the magic method that creates a new link_to_dt element w +rapped # around the term in the text and returns the list of created elem +ents push @new_links, $text->split( { return_matched_elt => 1 }, $regex +p, 'link_to_dt'); } # now we need to replace those link_to_dt elements with real links. foreach my $link ( @new_links) { $link->set_gi( 'a'); # turn it into an html link my $href= "#$dt{lc($link->text)}"; # the target is the normalized +text $link->set_att( href => $href); } # time to output the whole thing $t->print; __DATA__ <html> <body> <h2>Glossary</h2> <dl> <dt>foo</dt> <dd>1. interj. Term of disgust. 2. [very common] Used very genera +lly as a sample name for absolutely anything, esp. programs and file +s (esp. scratch files). 3. First on the standard list of metasyntacti +c variables used in syntax examples. See also bar, baz, qux, qu +ux, corge, grault, garply, waldo, fred, plugh, xyzzy, thud.</dd> <dt>bar</dt> <dd>1. [very common] The second metasyntactic variable, after foo +and before baz. "Suppose we have two functions: FOO and BAR. FOO + calls BAR...." 2. Often appended to foo to produce foobar.</dd> <dt>toto</dt> <dd>French equivalent of foo. See also tata, tutu, titi</dd> <dt>foo bar</dt> <dd>This one is here just to show that this can get a little trick +y!</dd> </dl> <h2>Text</h2> <p>This para could describe anything really, as it references foo an +d bar (it actually references foo several times, including <a href="#my +_foo"> my own foo</a>) but also to foo bar.</p> <h2><a name="my_foo">My own foo</a></h2> <p>My own foo is usually called toto, although wombat is also cool.< +/p> <p>Note that the code is barred from linking... foo or bar as part o +f a word</p> </body> </html>

      That should do it. In my case, the glossary page is a seperate page with the terms and defs comming out of a database. So, I will have the defs already (pulled from the DB) before I start parsing my XML doc.

      Thanks for the help.

      PerlStalker