#!perl -wl
# vim: set ts=8:

# how to replace certain text (your $name) with a link (to $url)
# based on which tags it sits inside

# the one problem i see is that if text is already in <a> then we
# embed an <a href></a> into it, the text after the href but
# before the end of the original </a> will lose its properties.
# this can be accounted for, but i'm lazy

use strict;
use HTML::Parser;

# global variables
our ($name, $url, @tagstack, $out) = qw{pizza http://www.parseerror.com/};

# sub defs

# opening tag: remember it, innermost open tag first
sub start {
    my ($tag, $attr, $text) = @_;
    $tag .= " href" if defined $attr->{"href"};
    unshift @tagstack, $tag;
    output($text);
}

# closing tag: drop any unclosed inner tags, then the tag itself
sub end {
    my ($tag, $text) = @_;
    shift @tagstack while (@tagstack && $tagstack[0] !~ /^\Q$tag\E(?: href)?$/);
    shift @tagstack if @tagstack;
    output($text);
}

# text node: link $name unless the enclosing tags forbid it
sub text {
    my ($text) = @_;
    if ($text =~ /\b\Q$name\E\b/ && canreplace() && unlinked()) {
        $text =~ s#\b\Q$name\E\b#<a href="$url">$name</a>#g;
    }
    output($text);
}

sub output {
    my ($txt) = @_;
    $out .= $txt if defined $txt;
}

# are we inside a tag we don't want to link?
sub canreplace {
    return ! grep {m/^(head|title)$/} @tagstack;
}

# are we inside a link right now?
sub unlinked {
    return ! grep {m/^a href$/} @tagstack;
}

# start code
my $p = HTML::Parser->new(
    "start_h" => [ \&start, "tagname, attr, text" ]
    ,"end_h" => [ \&end, "tagname, text" ]
    ,"text_h" => [ \&text, "text" ]
);

$p->parse(q{
<html>
<head>
    <title>pizza is delicious!</title>
</head>
<body>
    <a href="">pizza already linked</a>
    <b>some more pizza</b>
    <a>pizza in a but not linked</a>
</body>
</html>
});
$p->eof;

print $out;
i've put the code up on http://www.parseerror.com/scripts/replace_name.pl
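
To see how the tag stack decides whether a piece of text gets linked, here is a rough sketch that drives the same kind of stack logic with hand-written events instead of HTML::Parser, so it runs with core Perl only. The names `start_tag`, `end_tag`, `@events` and `%dispatch` are made up for this sketch, and it assumes the innermost open tag is kept at the front of the stack. It is not a replacement for the real parser, just an illustration of the bookkeeping.

```perl
#!perl
use strict;
use warnings;

# stand-ins for the globals used in the post (values copied from it)
my ($name, $url) = ('pizza', 'http://www.parseerror.com/');
my @tagstack;    # assumption: innermost open tag is always $tagstack[0]
my $out = '';

# hypothetical helper: record an opening tag, marking links as "a href"
sub start_tag {
    my ($tag, $has_href) = @_;
    unshift @tagstack, $has_href ? "$tag href" : $tag;
}

# hypothetical helper: drop unclosed inner tags, then the tag being closed
sub end_tag {
    my ($tag) = @_;
    shift @tagstack while @tagstack && $tagstack[0] !~ /^\Q$tag\E(?: href)?$/;
    shift @tagstack if @tagstack;
}

# link $name unless some enclosing tag is head, title or a link
sub text {
    my ($text) = @_;
    my $blocked = grep { /^(?:head|title|a href)$/ } @tagstack;
    $text =~ s#\b\Q$name\E\b#<a href="$url">$name</a>#g unless $blocked;
    $out .= $text;
}

# hand-written events standing in for parser callbacks: [ type, args... ]
my @events = (
    [ start => 'a', 1 ], [ start => 'b' ], [ text => 'bold ' ], [ end => 'b' ],
    [ text => 'pizza in a link' ], [ end => 'a' ],
    [ text => ' and plain pizza' ],
);

my %dispatch = (start => \&start_tag, end => \&end_tag, text => \&text);
for my $ev (@events) {
    my ($type, @args) = @$ev;
    $dispatch{$type}->(@args);
}

print "$out\n";
```

The point of the nested `<b>` is that closing it should leave the enclosing link on the stack, so the "pizza" that follows inside the link stays untouched while the one after `</a>` gets linked.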

perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'


In reply to Re: Search and replacing across 500,000 HTML documents by pizza_milkshake
in thread Search and replacing across 500,000 HTML documents by Anonymous Monk
