Regex bafflement

rastoboy has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Brothers,

I'm trying to convert an HTML file to a Wiki format. The HTML is...not well formed, and so I have to do some "intelligent" processing.

Basically, there are some multi-line chunks from which I need to remove some tags. These chunks look like this:

<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>Figure
1.10</B></P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">rasto@zaphod
~ $ screen -x</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">There
are several suitable screens on:</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">       
12207.pts-1.Evil-Fish   (Attached)</P>
<P ALIGN=LEFT><SPAN STYLE="font-style: normal">

So I need to detect these largish sections and process them. I'm able to find and manipulate these strings fine, but where I'm getting my head handed to me is merging them back into the original text. Here's what I've got so far:


#!/usr/bin/perl


my $file = $ARGV[0];

open my $fh, $file;
my $text;
foreach my $line (readline $fh) {
        $text .= $line;
}

foreach my $match ( $text =~ /(<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>.*?<P ALIGN=LEFT><SPAN STYLE="font-style: normal">)/gsm) {
   my $temp = $match;

   $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>/<pre>/;
   $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>//g;

   $match =~ s/<P ALIGN=LEFT><SPAN STYLE="font-style: normal">/<\/pre>/;
   $match =~ s/<\/B>//g;

   $text =~ s/$temp/$match/gsm;
}

print $text;

If I print out $match after my series of substitutions, it looks fine. The problem is folding these changes back into $text.

What's happening is that the substitution in the third-to-last line is not matching, *except* on the last match of the foreach loop.

Clearly I'm missing something. I did a proof of concept with some small, simple, multiline text and the logic worked fine. It seems to me that since $match was, by definition, found in the $text, it ought to be able to find an exact copy of it reliably!

Any ideas on how exactly I'm being wrong about this would be greatly appreciated.


Replies are listed 'Best First'.
Re: Regex bafflement by planetscape (Chancellor) on Oct 09, 2010 at 00:34 UTC
I usually process stuff like that out with HTML::Tidy. See also options `--bare` and `--clean`. Once you have sane HTML, further processing gets much easier. Update: Word HTML to TWiki converter may also be of interest. HTH, planetscape
Re: Regex bafflement by Anonymous Monk on Oct 09, 2010 at 00:28 UTC
You're modifying $text, but looping over the initial set of matches. The way you described the problem, this might not matter, but I'd stick it in a while loop anyway to see if that makes a difference (e.g. `while (my ($match) = $text =~ /(foo)/s) { ... }`). I also note, though, that the HTML you've shown us isn't necessarily invalid, if we assume a "transitional" doctype and a suitable container so that the dangling paragraph gets auto-closed. If you haven't yet actually seen what an HTML parser will do with it, it might still be an option.
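A related way to avoid looping over stale matches is to transform each chunk while it is being matched, so it never has to be found in $text a second time. The following is only a sketch of that idea, not the original poster's code: `convert_chunk` and `convert_text` are hypothetical helper names, the sample input is made up, and the four substitutions simply mirror the ones in the question (so tags the question does not touch, such as `</P>`, are left alone).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Opening and closing markers copied from the question.
my $open  = '<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">';
my $close = '<P ALIGN=LEFT><SPAN STYLE="font-style: normal">';

# Hypothetical helper: takes one matched chunk, returns its replacement.
sub convert_chunk {
    my ($chunk) = @_;
    $chunk =~ s/\Q$open\E<B>/<pre>/;    # first opening tag starts the block
    $chunk =~ s/\Q$open\E<B>//g;        # drop any further opening tags
    $chunk =~ s/\Q$close\E/<\/pre>/;    # closing marker ends the block
    $chunk =~ s/<\/B>//g;               # strip leftover </B> tags
    return $chunk;
}

# Hypothetical helper: rewrites every chunk in a single pass with s///e,
# which evaluates the replacement as Perl code for each match.
sub convert_text {
    my ($text) = @_;
    $text =~ s/(\Q$open\E<B>.*?\Q$close\E)/convert_chunk($1)/gse;
    return $text;
}

# Made-up sample input, loosely based on the data in the question.
my $sample = join '',
    "intro\n",
    $open, "<B>Figure 1.10</B></P>\n",
    $open, "<B>rasto\@zaphod ~ \$ screen -x (Attached)</B></P>\n",
    $close, "\noutro\n";

print convert_text($sample);
```

Because the replacement is computed from `$1` on the spot, the chunk text is never reused as a pattern, so metacharacters inside it cannot interfere.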
Re: Regex bafflement by ig (Vicar) on Oct 10, 2010 at 00:46 UTC
Part of your problem may be with `$text =~ s/$temp/$match/gsm;`. The text in `$temp` may include regular expression metacharacters. For example, in your sample data it includes `(Attached)`. When you use this in the RE, the parentheses are not matched as characters; they provide grouping. As a result, the pattern doesn't match the literal text. Whether your text is changed will depend on whether `$temp` contains any regular expression metacharacters or not. I take it from your description that in the case you described all but the last match did include metacharacters. One way to prevent the characters in `$temp` being interpreted as regular expression metacharacters is to quote all the text as follows: `$text =~ s/\Q$temp\E/$match/gsm;`
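The effect of the quoting is easy to see in isolation. Here is a small illustration, assuming a made-up `$text` and `$chunk` (both hypothetical, loosely based on the sample data in the question); it counts how many replacements happen with and without `\Q...\E`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data, loosely based on the question.
my $text  = "before\n12207.pts-1.Evil-Fish   (Attached)\nafter\n";
my $chunk = "12207.pts-1.Evil-Fish   (Attached)";

# Unquoted: the parentheses in $chunk act as a capture group, so the
# pattern expects "Attached" with no literal parentheses around it.
my $unquoted   = $text;
my $n_unquoted = ( $unquoted =~ s/$chunk/<pre>screen list<\/pre>/g ) || 0;

# Quoted: \Q...\E escapes every metacharacter, so $chunk matches literally.
my $quoted   = $text;
my $n_quoted = ( $quoted =~ s/\Q$chunk\E/<pre>screen list<\/pre>/g ) || 0;

print "unquoted replacements: $n_unquoted\n";
print "quoted replacements:   $n_quoted\n";
print $quoted;
```

The built-in `quotemeta` function does the same escaping when the pattern has to be prepared ahead of time, for example `my $pattern = quotemeta $chunk;`.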
Re^2: Regex bafflement by rastoboy (Monk) on Oct 13, 2010 at 21:07 UTC
ig: That was totally it, thanks! I figured it might be something like that, I just didn't know what to do about it. Thanks to all, useful and helpful information all around :-)