Interesting problem (and one I have to solve soon), so here is my take, using (of course!) XML::Twig. You will need the absolute latest version of the module as I found a bug and added an option to make it easier. (at the moment you have to go through 193.251.86.24/xmltwig/ to get it, I still have DNS problems.
Note that this version loads the entire document in memory. In order to do it in stream mode you would need 2 passes, one that gets the terms to link to, creates the a tag around them and stores the name attribute, and one that goes again through the document and creates the links. It would actually be quite similar except the code would be in twig_handlers and there would be purge and flush calls to free the memory as we go.
So here is the code:
#!/usr/bin/perl -w
use strict;
use XML::Twig;
my $t= XML::Twig->new( pretty_print => 'indented');
$t->parse( \*DATA);
# get the terms
my @dt= $t->descendants( 'dt');
# wrap the terms in <a name="normalized_term"> tags
foreach my $dt (@dt)
{ my $name= lc $dt->text;
$name=~ s/\W/_/g; # normalize the term so it c
+an be a name
$dt->insert( a => { name => $name }); # insert the a element in th
+e dt
}
# create a hash term (litteral text to match) => name to link to
my %dt= map { $_->text, $_->first_child('a')->att( 'name') } @dt;
# create a (potentially huge!) regexp or'ing all the texts to match
# the texts are sorted by reverse length so 'foo bar' comes before 'fo
+o'
# note that the actual word matched needs to be captured
my $regexp_text= join '|', sort { length( $b) <=> length( $a) } keys %
+dt;
my $regexp= qr/\b($regexp_text)\b/i;
# now go through the interesting parts of document and match away!
# an other option would be to get all text descendants and to
# 'next' if the context is not right (like a h<n> or whatever
my @new_links;
foreach my $text ($t->descendants( qr/^(p|dd)$/))
{ next if( $text->in_context( 'a')); # that's where we skip text
+in links
# this is the magic method that creates a new link_to_dt element w
+rapped
# around the term in the text and returns the list of created elem
+ents
push @new_links, $text->split( { return_matched_elt => 1 }, $regex
+p, 'link_to_dt');
}
# now we need to replace those link_to_dt elements with real links.
foreach my $link ( @new_links)
{ $link->set_gi( 'a'); # turn it into an html link
my $href= "#$dt{lc($link->text)}"; # the target is the normalized
+text
$link->set_att( href => $href);
}
# time to output the whole thing
$t->print;
__DATA__
<html>
<body>
<h2>Glossary</h2>
<dl>
<dt>foo</dt>
<dd>1. interj. Term of disgust. 2. [very common] Used very genera
+lly as
a sample name for absolutely anything, esp. programs and file
+s (esp.
scratch files). 3. First on the standard list of metasyntacti
+c
variables used in syntax examples. See also bar, baz, qux, qu
+ux,
corge, grault, garply, waldo, fred, plugh, xyzzy, thud.</dd>
<dt>bar</dt>
<dd>1. [very common] The second metasyntactic variable, after foo
+and
before baz. "Suppose we have two functions: FOO and BAR. FOO
+ calls
BAR...." 2. Often appended to foo to produce foobar.</dd>
<dt>toto</dt>
<dd>French equivalent of foo. See also tata, tutu, titi</dd>
<dt>foo bar</dt>
<dd>This one is here just to show that this can get a little trick
+y!</dd>
</dl>
<h2>Text</h2>
<p>This para could describe anything really, as it references foo an
+d bar
(it actually references foo several times, including <a href="#my
+_foo">
my own foo</a>) but also to foo bar.</p>
<h2><a name="my_foo">My own foo</a></h2>
<p>My own foo is usually called toto, although wombat is also cool.<
+/p>
<p>Note that the code is barred from linking... foo or bar as part o
+f a
word</p>
</body>
</html>
|