Re: Regex within html

Ah well. Why do you use such a complicated templating scheme in the first place? Reach for Template Toolkit, HTML::Template or some such.

But if you absolutely have to stick with such a beast, don't even try to use regexes to transform your template! See "Why this simple regex freeze my computer?" for an example of the horrors you might run into with that approach.

You could use HTML::Parser to achieve what you want. That module tokenizes your HTML and provides you with callbacks for comments, opening tags, closing tags, plain text and much more. In those callbacks, you can track the state of your opening/closing tags depending on whether there's content found to be substituted.
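To get a feel for the callback model, here is a minimal sketch that only records which events HTML::Parser fires for a tiny snippet. The @events array and the sample markup are made up for illustration; the handler names and argspecs (tagname, text, dtext) are the ones documented for api_version 3.

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect one line per parser event so the token stream is easy to inspect.
my @events;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h   => [ sub { push @events, "start:$_[0]" },   'tagname' ],
    end_h     => [ sub { push @events, "end:$_[0]" },     'tagname' ],
    comment_h => [ sub { push @events, "comment:$_[0]" }, 'text'    ],
    text_h    => [ sub { push @events, "text:$_[0]" },    'dtext'   ],
);
$p->unbroken_text(1);   # deliver each text run in a single callback

# Illustrative input only; parse() takes a string, eof() flushes the parser.
$p->parse('<p>Hi<!-- note --></p>');
$p->eof;

print "$_\n" for @events;
```

Each callback sees exactly one token, so any state you need across tokens (such as "a link is waiting to be printed") has to live in variables outside the callbacks.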

But first, some sanitizing:

<!-- headerStandard.siteFeedbackLink.label --!>

should be

<!-- headerStandard.siteFeedbackLink.label -->

to be a well-formed comment. But if you are stuck with the '--!>' terminators, at least remove the last '!' so that each (invalid) comment pair becomes a single comment:

<!-- foo.label --!>text<!-- foo.label -->

I've cranked out an example as a starter, which does the job for the examples given, but has its rough edges and doesn't treat nested stuff well, e.g.

<a href="<!--foo.label --!>foo<!-- foo.label -->">
  blah blah <img src="bar.jpg" alt="<!-- bar.label --!>bar<!-- bar.label -->" />
</a>

That can be solved by keeping a stack of replacement links, with $pending below acting as a pointer into it. But you should really, really switch to a seasoned templating system!

use HTML::Parser;

use warnings;
use strict;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h   => [\&start, 'tagname, attr, attrseq, text'],
    end_h     => [\&end,   'tagname, text'],
    comment_h => [\&comm,  'text' ],
    default_h => [ sub {print shift}, 'text'],
);
$p->unbroken_text(1);

# State shared by the callbacks; set up before parsing starts.
my ($pending, $link) = (0, '');

my $file = shift;
$p->parse_file($file);

sub start {
    my($tag, $attr, $attrseq, $text) = @_;
    for my $k (keys %$attr) {
        if ($attr->{$k} =~ /\!/) {
            ($attr->{$k},$link) = transform_comments($attr->{$k});
        }
    }
    $pending++;
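    # Rebuild the attribute list in its original order
    # ($attrs rather than $a, which is reserved for sort).
    my $attrs = join ' ',
          map {
            $_ eq '/' ? $_ :
            "$_=\"$attr->{$_}\""
          } @$attrseq;
    print "<$tag", $attrs ? " $attrs>" : '>';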
}
sub end {
    my ($tag,$text) = @_;
    print $text;
    if ($pending) {
        print $link;
        $pending = $link = '';
    }
}
sub comm {
    my $text;
    ($text,$link) = transform_comments($_[0]);
    print $text;
    print $link unless $pending;
}

sub transform_comments {
    my $str = shift;
    if ($str =~ /(\S+) --!>([^<]+?)<!-- (\1)/) {
        my ($key,$text) = ($1,$2);
        # return value of hash and link
        my $val  = "fake-$text-translated";
        my $link = '<a href=\'foo\'>foo</a>';
        return $val,$link;
    }
    # no match: hand the text back unchanged, with an empty link
    return ($str, '');
}
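
The stack idea mentioned above for nested tags could look roughly like the following. This is only a sketch, not wired into the parser: open_tag, close_tag and @pending_links are hypothetical names, and the link text is a placeholder. Each opening tag pushes the link (or an empty string) that should follow its closing tag, and each closing tag pops its own entry, so an inner element can no longer swallow the outer element's link.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of HTML::Parser: one stack entry per open
# element, holding the link to print right after that element's closing tag.
my @pending_links;

sub open_tag {
    my ($tag, $link) = @_;
    push @pending_links, defined $link ? $link : '';
    return "<$tag>";
}

sub close_tag {
    my ($tag) = @_;
    my $link = pop @pending_links;
    $link = '' unless defined $link;   # unbalanced close: nothing to emit
    return "</$tag>$link";
}

# Simulated event sequence: the outer <a> carries a link, the inner <span> does not.
my $out = join '',
    open_tag('a', '[link for foo]'),
    open_tag('span'),
    'blah',
    close_tag('span'),
    close_tag('a');

print "$out\n";
```

In the real callbacks, start() would do the push and end() the pop. Void elements such as <img ... /> never get an end event, so they would need to emit their link straight away instead of pushing.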

Update: seeing your answer to moritz above - well, then at least it isn't your fault ;-)

For the above code to work, you'll need to sanitize the comments with e.g.

perl -pi -e 's/(<!-- \S+ --!>[^<]+?<!-- \S+ --)!>/$1>/g' $file

since HTML::Parser won't recognize the comments otherwise.
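
To see what that substitution does without touching a file, here is the same regex applied to a sample string (the markup is made up for illustration). Only the second '--!>' of each pair loses its '!', which turns the pair into one well-terminated comment with the text inside it.

```perl
use strict;
use warnings;

# Illustrative input: a label wrapped in a pair of invalid '--!>' comments.
my $html = '<b><!-- foo.label --!>text<!-- foo.label --!></b>';

# Same substitution as the one-liner: keep everything up to the final '--',
# then close the comment with a plain '>'.
(my $clean = $html) =~ s/(<!-- \S+ --!>[^<]+?<!-- \S+ --)!>/$1>/g;

print "$clean\n";
```

Because [^<]+? cannot cross a '<', the pattern only fires when plain text sits between the two markers; pairs wrapping other tags are left alone.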

Re^2: Regex within html by ropey (Hermit) on Sep 08, 2008 at 14:36 UTC
Thanks Shmem for an excellent post. It isn't a templating system :) Think of it as a filter: the actual HTML is generated by a Java host, and I have to intercept it and transform it slightly to add these extra tags. So parsing the HTML seems a much more reasonable approach, and I appreciate your help.
Re^3: Regex within html by user0869 (Initiate) on Sep 09, 2008 at 07:40 UTC
Hi, you may give "Recursive Search and Replacement" a try: visit http://freshmeat.net/projects/sandr. gr33tz f.c.