Greetings Brothers,

I'm trying to convert an HTML file to a Wiki format. The HTML is...not well formed, and so I have to do some "intelligent" processing.

Basically, there are some multi-line chunks from which I need to remove some tags. These chunks look like this:

<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>Figure
1.10</B></P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">rasto@zaphod
~ $ screen -x</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">There
are several suitable screens on:</P>
<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">       
12207.pts-1.Evil-Fish   (Attached)</P>
<P ALIGN=LEFT><SPAN STYLE="font-style: normal">

So I need to detect these largish sections and process them. I'm able to find and manipulate these strings fine, but where I'm getting my head handed to me is merging them back into the original text. Here's what I've got so far:


#!/usr/bin/perl


my $file = $ARGV[0];

open my $fh, $file;
my $text;
foreach my $line (readline $fh) {
        $text .= $line;
}

foreach my $match ( $text =~ /(<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>.*?<P ALIGN=LEFT><SPAN STYLE="font-style: normal">)/gsm) {
   my $temp = $match;

   $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>/<pre>/;
   $match =~ s/<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal"><B>//g;

   $match =~ s/<P ALIGN=LEFT><SPAN STYLE="font-style: normal">/<\/pre>/;
   $match =~ s/<\/B>//g;

   $text =~ s/$temp/$match/gsm;
}

print $text;

If I print out $match after my series of substitutions, it looks fine. The problem is folding these changes back into $text.

What's happening is that the substitution in the third-to-last line (s/$temp/$match/gsm) is not matching, *except* on the last match of the foreach loop.

Clearly I'm missing something. I did a proof of concept with some small, simple, multiline text and the logic worked fine. It seems to me that since $match was, by definition, found in the $text, it ought to be able to find an exact copy of it reliably!
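One possible explanation, offered as a sketch rather than a confirmed diagnosis: when $temp is interpolated into s/$temp/.../, its contents are compiled as a pattern instead of being compared as a literal string. The sample chunk contains characters such as `$`, `(`, `)` and `.`, which are all regex metacharacters. The snippet below uses a made-up string (the `$text` and `$chunk` here are illustrative, not the real file) to show how an exact copy can fail to match, and how `\Q...\E` changes that.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the real document.
my $text  = "before\n12207.pts-1.Evil-Fish   (Attached)\nafter\n";

# A chunk copied verbatim out of $text.
my $chunk = "12207.pts-1.Evil-Fish   (Attached)";

# Interpolated as a pattern: "(Attached)" becomes a capture group that
# matches "Attached" without the literal parentheses, so the copy
# does not match the text it came from.
my $raw_matches = ($text =~ /$chunk/) ? 1 : 0;

# \Q...\E (quotemeta) escapes the metacharacters, so the chunk is
# matched literally.
my $quoted_matches = ($text =~ /\Q$chunk\E/) ? 1 : 0;

print "unquoted: $raw_matches, quoted: $quoted_matches\n";
```

If this is indeed the cause, writing the fold-back step as s/\Q$temp\E/$match/ would be the smallest change to the script above.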

Any ideas on how exactly I'm being wrong about this would be greatly appreciated.
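An alternative that avoids the second search altogether is to transform each chunk at the moment it is found, using the /e modifier so the replacement is evaluated as code. This is only a sketch under assumptions: `to_pre` is a hypothetical helper, the input string is made up, and the third substitution drops every remaining shaded <P> tag (without requiring <B>), which is a guess at the intent. Closing </P> tags are left alone, as in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Opening tag of each shaded paragraph and the tag that ends a chunk,
# both taken from the sample HTML above.
my $shaded = '<P ALIGN=LEFT STYLE="background: #b3b3b3; font-style: normal">';
my $end    = '<P ALIGN=LEFT><SPAN STYLE="font-style: normal">';

# Hypothetical helper: turn one matched chunk into a <pre> block.
sub to_pre {
    my ($chunk) = @_;
    $chunk =~ s/\Q$end\E\z/<\/pre>/;      # closing marker becomes </pre>
    $chunk =~ s/\Q$shaded\E<B>/<pre>/;    # first shaded tag becomes <pre>
    $chunk =~ s/\Q$shaded\E//g;           # drop the remaining shaded tags
    $chunk =~ s/<\/B>//g;                 # drop closing bold tags
    return $chunk;
}

# Made-up input for illustration only.
my $text = "intro\n"
         . $shaded . "<B>Figure 1.10</B></P>\n"
         . $shaded . "~ \$ screen -x (Attached)</P>\n"
         . $end . "\nrest\n";

# Find and replace in one pass; /e evaluates the replacement as code.
$text =~ s/(\Q$shaded\E<B>.*?\Q$end\E)/to_pre($1)/gse;

print $text;
```

Because the chunk is rewritten in place by the same s///, the original text never has to be reused as a pattern, so metacharacters inside the chunk no longer matter.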

In reply to Regex bafflement by rastoboy
