rastoboy has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to convert an HTML file to a Wiki format. The HTML is...not well formed, and so I have to do some "intelligent" processing.
Basically, there are some multi-line chunks from which I need to remove some tags. These chunks look like this:
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>Figure 1.10</B></P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">rasto@zaphod ~ $ screen -x</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">There are several suitable screens on:</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"> 12207.pts-1.Evil-Fish (Attached)</P>
<P ALIGN=LEFT><SPAN STYLE="font-style: normal">
If I print out $match after my series of substitutions, it looks fine. The problem is folding these changes back into $text.

    #!/usr/bin/perl
    my $file = $ARGV[0];
    open my $fh, $file;
    my $text;
    foreach my $line (readline $fh) {
        $text .= $line;
    }
    foreach my $match ( $text =~ /(<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>.*?<P ALIGN=LEFT><SPAN STYLE="font-style: normal">)/gsm) {
        my $temp = $match;
        $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>/<pre>/;
        $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>//g;
        $match =~ s/<P ALIGN=LEFT><SPAN STYLE="font-style: normal">/<\/pre>/;
        $match =~ s/<\/B>//g;
        $text =~ s/$temp/$match/gsm;
    }
    print $text;
What's happening is that the substitution in the third-to-last line does not match, *except* on the last match of the foreach loop.
Clearly I'm missing something. I did a proof of concept with some small, simple, multiline text and the logic worked fine. It seems to me that since $match was, by definition, found in the $text, it ought to be able to find an exact copy of it reliably!
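A possible cause, offered as a sketch rather than a confirmed diagnosis: a variable interpolated into a pattern is compiled as a regular expression, not compared as a literal string. The sample chunk contains characters such as `(`, `)`, `.` and `$`, which have special meaning in a pattern, so a chunk may fail to match its own text. The strings below are hypothetical sample data modelled on the chunk in the question; `\Q...\E` is Perl's standard way to quote metacharacters in an interpolated variable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: the parentheses, dots and "$" are regex metacharacters.
my $chunk = "rasto\@zaphod ~ \$ screen -x\n12207.pts-1.Evil-Fish (Attached)\n";
my $text  = "before\n" . $chunk . "after\n";

# Interpolated as a pattern: "$" acts as an anchor and "(Attached)" as a
# capture group, so the chunk does not match its own literal text.
my $as_pattern = ($text =~ /$chunk/) ? 1 : 0;

# \Q...\E escapes the metacharacters, so the chunk is matched literally.
my $as_literal = ($text =~ /\Q$chunk\E/) ? 1 : 0;

print "as pattern: $as_pattern\n";
print "as literal: $as_literal\n";
```

If this is indeed the problem, writing the final substitution as `s/\Q$temp\E/$match/` would be the smallest change to the script.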
Any ideas on how exactly I'm being wrong about this would be greatly appreciated.
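One way to avoid searching for each chunk a second time is to rewrite it at the moment it is matched, using the `/e` modifier so that the replacement is evaluated as code. The following is a minimal sketch under stated assumptions: `rewrite_chunk`, `$open` and `$close` are hypothetical names, the sample input is made up, and the four substitutions simply mirror the ones in the script above without judging whether they produce the desired wiki markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Literal delimiters taken from the question.
my $open  = '<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">';
my $close = '<P ALIGN=LEFT><SPAN STYLE="font-style: normal">';

# Hypothetical helper: apply the per-chunk edits and return the new chunk.
sub rewrite_chunk {
    my ($chunk) = @_;
    $chunk =~ s/\Q$open\E<B>/<pre>/;    # first bold opening tag becomes <pre>
    $chunk =~ s/\Q$open\E<B>//g;        # drop any further bold opening tags
    $chunk =~ s/\Q$close\E/<\/pre>/;    # terminator becomes </pre>
    $chunk =~ s/<\/B>//g;               # drop closing </B> tags
    return $chunk;
}

# Hypothetical sample input modelled on the chunk in the question.
my $text = join "\n",
    'intro',
    $open . '<B>Figure 1.10</B></P>',
    $open . 'rasto@zaphod ~ $ screen -x</P>',
    $open . ' 12207.pts-1.Evil-Fish (Attached)</P>',
    $close . 'rest';

# Match each chunk and replace it in place in a single pass over $text.
$text =~ s/(\Q$open\E<B>.*?\Q$close\E)/rewrite_chunk($1)/gse;

print $text, "\n";
```

Because the replacement happens where the match was found, the chunk never has to be turned back into a pattern, so its metacharacters are not an issue.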
Replies are listed 'Best First'.

Re: Regex bafflement
by planetscape (Chancellor) on Oct 09, 2010 at 00:34 UTC

Re: Regex bafflement
by Anonymous Monk on Oct 09, 2010 at 00:28 UTC

Re: Regex bafflement
by ig (Vicar) on Oct 10, 2010 at 00:46 UTC
by rastoboy (Monk) on Oct 13, 2010 at 21:07 UTC