Regex within html

ropey has asked for the wisdom of the Perl Monks concerning the following question:

Excuse the dodgy title; I really couldn't think how to describe the problem accurately.

Basically, I am trying to modify the source of an HTML document to add some special links (they are used to allow easy translation of a web page in context).

I think this is easier to explain with a few examples, so let's take the easy one first. Let's say I have this text in the HTML source:

<!-- headerStandard.siteFeedbackLink.label --!>Feedback<!-- headerStandard.siteFeedbackLink.label --!>

And let's say that I have a hash such as:

%data = ('headerStandard.siteFeedbackLink.label' => 'Feedback Translated');

So the HTML comments map the key to the content and are available to me in the HTML. I want to replace what's between the comments with the equivalent value from the hash AND add a special link AFTER it, so the above example becomes:

Feedback Translated<a href='foo'>foo</a>

This example is easy; I just match between the markers with a regex such as:

my $heading = 'headerStandard.siteFeedbackLink.label';
$wp =~ s/\<\!-- $heading --\!>(.*)\<\!-- $heading --\!>/$data{$heading}\<a href="foo"\>foo\<\/a>/gm;
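A minimal sketch of the same simple case, written as a loop over the whole hash. It assumes the marker format shown above; the sub name `translate_simple` and the `foo` link are placeholders, not part of the original code. `quotemeta` keeps the dots in the key from acting as regex wildcards, and the `s{...}{...}` delimiters avoid having to escape the slash in `</a>`.

```perl
use strict;
use warnings;

# Lookup table from the example above.
my %data = ('headerStandard.siteFeedbackLink.label' => 'Feedback Translated');

# Hypothetical helper: replace every "<!-- key --!>text<!-- key --!>" pair
# whose key is in %data with the translated text followed by the extra link.
sub translate_simple {
    my ($html) = @_;
    for my $key (keys %data) {
        # Escape regex metacharacters (the dots in the key, for example).
        my $marker = quotemeta "<!-- $key --!>";
        # ${marker} in braces keeps Perl from reading "[" as an array subscript.
        $html =~ s{${marker}[^<]*${marker}}{$data{$key}<a href='foo'>foo</a>}g;
    }
    return $html;
}

my $in = '<!-- headerStandard.siteFeedbackLink.label --!>Feedback'
       . '<!-- headerStandard.siteFeedbackLink.label --!>';
print translate_simple($in), "\n";
```

This only covers the un-nested case; it does nothing about moving the link outside an enclosing tag.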

So no problem. However, my slow-working brain needs some assistance when the content I want to replace is already nested inside tags. For instance, let's say the text is inside an <a href="bar"> such as:

<a href="http://localhost:8585/deals/dealshome"><!-- headerStandard.dealsTab.label --!>Deals<!-- headerStandard.dealsTab.label --!></a>

So I would want the text to be replaced inside the a tag, but the link to be added outside the current a tag, as follows:

<a href="http://localhost:8585/deals/dealshome">Deals Modified</a><a href="foo">foo</a>

or inside the alt attribute of an image:

<a href="bar"><img src="http://localhost" alt="
<!-- adTags.AdvertisementAltText --!>Advertisement<!-- adTags.AdvertisementAltText --!>
  " /></a>

Again, I would want the text between the comments to be replaced, but the link added outside the original a tag.

This scenario also applies to text within text fields, textareas, img tags and so on. So can anyone suggest an easy way to achieve this (replace the text within the comments and add the link outside any enclosing tags)? My regex skills are not particularly great, but I am sure some of you can assist or offer me a better way of doing this.


Replies are listed 'Best First'.
Re: Regex within html by moritz (Cardinal) on Sep 08, 2008 at 12:17 UTC
While this might be achievable with regexes, I don't recommend it. Parsing HTML with regex is a fundamentally bad idea, because regexes aren't good for matching nested data structures.

What I'd recommend instead is to tokenize your text, that is split it up into chunks that are 1) either normal text or 2) your special comments or 3) opening or closing `<a>` tags. Then iterate over all these chunks, and count the difference in the number of opening and closing anchor tags. While iterating over these tokens you construct an output string, and in that string it shouldn't be too hard to get the nesting of `<a>` tags correctly.
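The tokenizing idea described above might be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for arbitrary HTML: the sub name `add_links`, the `%data` contents and the `foo` link are placeholders, markers are assumed to look exactly like `<!-- key --!>`, and an `<a>` tag whose own attributes contain a marker is not handled. The marker comments themselves are not copied to the output.

```perl
use strict;
use warnings;

# Hypothetical lookup table and link markup, mirroring the example in the question.
my %data = ('headerStandard.dealsTab.label' => 'Deals Modified');
my $link = '<a href="foo">foo</a>';

sub add_links {
    my ($html) = @_;
    my ($out, $depth, $inside, $known, @queued) = ('', 0, '', 0);

    # One alternative per token type: marker comment, <a ...>, </a>, plain
    # text, and a lone '<' so that the loop always makes progress.
    while ($html =~ m{\G(?: <!--\s(\S+)\s--!> | (<a\b[^>]*>) | (</a>) | ([^<]+) | (<) )}gcx) {
        my ($key, $open, $close, $text, $lt) = ($1, $2, $3, $4, $5);
        if (defined $key) {
            if ($inside eq $key) {                 # second marker of a pair
                $inside = '';
                next unless $known;
                if ($depth > 0) { push @queued, $link }   # wait for </a>
                else            { $out .= $link }
            }
            else {                                 # first marker of a pair
                $inside = $key;
                $known  = exists $data{$key};
                $out   .= $data{$key} if $known;
            }
        }
        elsif (defined $open) { $depth++; $out .= $open }
        elsif (defined $close) {
            $out .= $close;
            $depth-- if $depth > 0;
            # Back outside all anchors: emit the links that were held back.
            $out .= join '', splice(@queued) if $depth == 0;
        }
        else {
            my $chunk = defined $text ? $text : $lt;
            # Original text between the markers of a known key is dropped.
            $out .= $chunk unless $inside ne '' && $known;
        }
    }
    return $out;
}

print add_links('<a href="http://localhost:8585/deals/dealshome">'
    . '<!-- headerStandard.dealsTab.label --!>Deals'
    . '<!-- headerStandard.dealsTab.label --!></a>'), "\n";
```

Because only `<a>` and `</a>` affect the depth counter, a marker inside an `alt` attribute of an `<img>` within a link is treated the same way: the text is swapped in place and the link is held back until the closing `</a>`.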
Re^2: Regex within html by ropey (Hermit) on Sep 08, 2008 at 12:38 UTC
Thanks for the assist, Moritz. I'm not sure how you would 'tokenize' it in the first place. Would that not need a regex as well? It's also worth noting that this isn't a templating system: the raw HTML is generated by another host (which I have no control over), and that is all I have to work with.
Re^3: Regex within html by moritz (Cardinal) on Sep 08, 2008 at 13:08 UTC
"I'm not sure how you would 'tokenize' it in the first place? Would that not have a regex as well?"

It sure would, but the point is that it would need one regex per possible token type, not one huge regex that solves the whole problem. Usually I use the tokenizer from Math::Expression::Evaluator::Lexer (don't let the name fool you; it's good for more than mathematical expressions), from which you could draw inspiration.

And don't use `.*` in your regexes; that's almost always an error. See Death to Dot Star!.
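To illustrate the warning about `.*`, here is a small sketch with two marker pairs on one line. The key `a.label` and the replacement `X` are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical key and input: two marker pairs on the same line.
my $key  = 'a.label';
my $pair = "<!-- $key --!>";
my $html = "${pair}one${pair} and ${pair}two${pair}";

# Greedy .* runs on to the LAST closing marker, swallowing both pairs.
(my $greedy = $html) =~ s/\Q$pair\E.*\Q$pair\E/X/g;

# A negated character class cannot cross a '<', so each pair matches on its own.
(my $bounded = $html) =~ s/\Q$pair\E[^<]*\Q$pair\E/X/g;

print "$greedy\n";    # prints: X
print "$bounded\n";   # prints: X and X
```

The greedy version also rescans large stretches of text when a closing marker is missing, which is one way such patterns become slow on real pages.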
Re^3: Regex within html by Anonymous Monk on Sep 08, 2008 at 12:43 UTC
Get yourself an HTML parser, and use it :)
Re: Regex within html by shmem (Chancellor) on Sep 08, 2008 at 14:00 UTC
Ah well. Why do you use such a complicated templating in the first place? Reach out for Template Toolkit, HTML::Template or some such. But if you absolutely have to stick with such a beast, don't even try to use regexes to transform your template! See Why this simple regex freeze my computer? for an example of horrors you might run into with that approach.

You could use HTML::Parser to achieve what you want. That module tokenizes your HTML and provides you with callbacks for comments, opening tags, closing tags, plain text and much more. In those callbacks, you can track the state of your opening/closing tags depending on whether there's content found to be substituted.

But first, some sanitizing:

`<!-- headerStandard.siteFeedbackLink.label --!>`

should be

`<!-- headerStandard.siteFeedbackLink.label -->`

to be a well formed comment. But if you stick with that, remove at least the last '!' to make your (invalid) comment pairs into one comment:

`<!-- foo.label --!>text<!-- foo.label -->`

I've cranked out an example for a starter, which does the job for the examples given, but has its rough edges and doesn't treat nested stuff well, e.g.

`<a href="<!--foo.label --!>foo<!-- foo.label -->"> blah blah <img src="bar.jpg" alt="<!-- bar.label --!>bar<!-- bar.label -->" /> </a>`

which can be solved using a stack of replacement links and have `$pending` below as a pointer to it. But you should really, really switch to a seasoned templating system!
use HTML::Parser;
use warnings;
use strict;

# State shared by the handlers; declared before parsing starts.
my ($pending, $link) = (0, '');

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start, 'tagname, attr, attrseq, text'],
    end_h       => [\&end,   'tagname, text'],
    comment_h   => [\&comm,  'text'],
    default_h   => [sub { print shift }, 'text'],
);
$p->unbroken_text(1);
my $file = shift;
$p->parse_file($file);

# Opening tag: translate marker pairs found in attribute values,
# then print the tag again from its attributes.
sub start {
    my ($tag, $attr, $attrseq, $text) = @_;
    for my $k (keys %$attr) {
        if ($attr->{$k} =~ /\!/) {
            ($attr->{$k}, $link) = transform_comments($attr->{$k});
        }
    }
    $pending++;
    my $attrs = join ' ',
        map { $_ eq '/' ? $_ : "$_=\"$attr->{$_}\"" } @$attrseq;
    print "<$tag", $attrs ? " $attrs>" : '>';
}

# Closing tag: print it, then emit any link that was held back.
sub end {
    my ($tag, $text) = @_;
    print $text;
    if ($pending) {
        print $link;
        $pending = $link = '';
    }
}

# Comment: print the translated text; print the link right away
# only if we are not inside a tag.
sub comm {
    my $text;
    ($text, $link) = transform_comments($_[0]);
    print $text;
    print $link unless $pending;
}

sub transform_comments {
    my $str = shift;
    if ($str =~ /(\S+) --!>([^<]+?)<!-- (\1)/) {
        my ($key, $text) = ($1, $2);
        # return value of hash and link
        my $val  = "fake-$text-translated";
        my $link = '<a href=\'foo\'>foo</a>';
        return $val, $link;
    }
    return ($str, '');    # no marker pair: unchanged text, no link
}

Update: seeing your answer to moritz above - well, then at least it isn't your fault ;-)

For the above code to work, you'll need to sanitize the comments with e.g.

`perl -pi -e 's/(<!-- \S+ --!>[^<]+?<!-- \S+ --)!>/$1>/g' $file`

since HTML::Parser won't recognize the comments otherwise.
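Since the HTML is intercepted on its way from another host, the sanitizing step could also be done in memory instead of with an in-place edit of a file. A minimal sketch, reusing the substitution from the one-liner above unchanged; the sub name `sanitize_markers` is made up here:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the one-liner:
# drop the '!' from the closing marker of each pair, so that
# "<!-- key --!>text<!-- key -->" ends in a regular comment terminator.
sub sanitize_markers {
    my ($html) = @_;
    $html =~ s/(<!-- \S+ --!>[^<]+?<!-- \S+ --)!>/$1>/g;
    return $html;
}

print sanitize_markers('<!-- foo.label --!>text<!-- foo.label --!>'), "\n";
# prints: <!-- foo.label --!>text<!-- foo.label -->
```

The result could then be fed to the parser with `$p->parse($html); $p->eof;` rather than `parse_file`. Whether the parser then sees the pair as a single comment is taken from the post above, not checked here.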
Re^2: Regex within html by ropey (Hermit) on Sep 08, 2008 at 14:36 UTC
Thanks Shmem for an excellent post. It isn't a templating system :) Think of it as a filter: the actual HTML is generated by a Java host, and I have to intercept it and transform it slightly to add these extra tags. So parsing the HTML seems a much more reasonable approach, and I appreciate your help.
Re^3: Regex within html by user0869 (Initiate) on Sep 09, 2008 at 07:40 UTC
Hi, you may give "Recursive Search and Replacement" a try: visit http://freshmeat.net/projects/sandr gr33tz f.c.