Curious Regex

HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

I'm certain this is old hat to many of the regs here, but I've been bustin' a synapse here trying to get this done.

So, here's the idea... Translating info coming out of a legacy system to XHTML. There are control codes "\x90" "\x8F" that tell me useful things elsewhere in the text, but need to be excluded within certain tags. There could be any number of them within the tags.

Here is some sample text:

</Mil,p12.10,3,15,d>But he won<cm EQ>t say whether 
he would veto bill, which 
is short of what he wanted 

[download]

Human readable instead of funky boxes...

</Mil,p12.10,3,15,d>But he won<cm EQ>t say whether \x90
he would veto bill, which \x90
is short of what he wanted \x90
\x9D
[download]

My current regex looks like $text =~ s/(<\/Mil[^>]*>.*?)[\x90\x8F](.*?\x9D)/$1$2/ig; And this only removes the very first control character as expected.

How can I create the regex so that between the open tag </Mil.*> and the close char \x9D I can remove any number of \x8F and \x90 characters?

This all takes place in the middle of ~100 different transforms that need to be done to each file, so having this as a one-liner would be nice.

Edit: Current regex updated to use [^>]* per Moritz's suggestion.

Comment on Curious Regex Select or Download Code

Replies are listed 'Best First'.
Re: Curious Regex by ikegami (Patriarch) on Feb 11, 2009 at 18:15 UTC
My current regex looks like `$text =~ s/(<\/Mil.>.?)[\x90\x8F](.?\x9D)/$1$2/ig;` And this only removes the very first control character as expected.* Not even. It doesn't necessarily remove the first \x90 no \x8F because of the greedy /`.`/. It can remove a \x90 or \x8F after the \x9D if there is neither \x90 nor \x8F in one of the blocks. /`.?`/ is rarely the right tool. That would presume the newlines weren't there since /`.`/ only matches newlines when the `s` modifier is in effect. Anyway, here's a two-state parser that should do the trick: `my $out = ''; PROCESS: for ($text) { pos() = 0; for (;;) { # Search for Mil element for (;;) { /\G ( [^<]+ ) /xgc && $out .= $1; /\G ( <\/Mil(?:,[^>]*])?> ) /xgc && do { $out .= $1; last }; /\G ( < ) /xgc && $out .= $1; /\G \z /xgc && last PROCESS; } # Search for end of Mil element, # removing \x8F and \x90 as we go along. for (;;) { /\G ( [^\x8F\x90\x9D]+ ) /xgc && $out .= $1; /\G [\x8F\x90]+ /xgc; /\G \x9D /xgc && last; /\G \z /xgc && last PROCESS; } } }` [download]	[reply] [d/l] [select]
Re: Curious Regex by moritz (Cardinal) on Feb 11, 2009 at 18:12 UTC
The first part of your regex currently matches `</Mil,p12.10,3,15,d>But he won<cm EQ>`, not only `</Mil,p12.10,3,15,d>`. If that's not what you want, you should change it to `(</Mil[^>]>)`. Then you can go on with `m{ (</Mil[^>]>[^\x90\x8F]) [\x90\x8F] # your target characters... [^\x9D] # anything but the terminator, # including \x90 and \x8F \x9D # the terminator }xs` [download] (I hope I understood your question correctly).	[reply] [d/l] [select]
Re^2: Curious Regex by HamNRye (Monk) on Feb 11, 2009 at 18:20 UTC
Moritz, thanks for the reply. The question is how do I substitute out any number of the control characters? Reading your match as written, It would match up to the first control character and then match everything after it. Including more control characters.	[reply]
Re: Curious Regex by almut (Canon) on Feb 11, 2009 at 18:35 UTC
You could use a two-stage approach: in the first stage you extract the entire part from `<\/Mil` to `\x9D`. In the second stage, you remove any control characters from the extracted substring: `sub remove_ctrls { my $s = shift; $s =~ tr/\x90\x8F//d; return $s; } $text =~ s/(<\/Mil.?>.?\x9D)/remove_ctrls($1)/esg;` [download] The `/e` option makes the substitution part be treated as Perl code, i.e. it calls `remove_ctrls()` with the extracted substring. Personally, I find this easier to read than overcomplex regexes which would do it all in one go... YMMV, of course.	[reply] [d/l] [select]
Re^2: Curious Regex by HamNRye (Monk) on Feb 11, 2009 at 19:16 UTC
Thanks... That helps with readability and is easy enough to understand. I don't use the "subroutines in regex" very often and just didn't think of it. I've got the code in place and it's working like a champ. the `</Mil>` tags are actually font declarations (Miller) so this will provide me some reusability of the subroutine if files start popping up with other fonts used. Thanks for the help Monks! It is very much appreciated.	[reply] [d/l]
Re: Curious Regex by ELISHEVA (Prior) on Feb 11, 2009 at 18:34 UTC
Another option, easier to read but slower than ikegami's solution: `sub stripCtrls { my $text = shift; #changed capture of $1 as per moritz $text =~ /(<\/Mil[^>]>)([^\x9D])\x9D/; my $sStartTag = $1; my $sBetweenTags = $2; $sBetweenTags =~ s/[\x90\x8F]//g; return "$sStartTag$sBetweenTags\x9D"; }` [download] which you can call like this `stripCtrls($sTaggedText)`. And if you don't need to preserve \x90 and \x8F in the start tag, maybe you might try the even simpler `$text =~ s/[\x90\x8F]//g`? Best, beth Update: put in sub so can be used as one liner	[reply] [d/l] [select]
Re^2: Curious Regex by HamNRye (Monk) on Feb 11, 2009 at 18:39 UTC
Thanks... Yeah, unfortunately the "simpler" method doesn't work because I need to preserve the control chars outside of the tags. Here's what I have as a "two stage" deal... I was just hoping someone could help me figure out how to make this work as a one liner. `if ($text =~ /<\/Mil[^>]>(.)\x9D/i) { my $string = $1; $string =~ s/[\x90\x8F]//g; $text =~ /(<\/Mil[^>]>).(\x9D)/$1$string$2/i; }` [download]	[reply] [d/l]
Re^3: Curious Regex by ELISHEVA (Prior) on Feb 11, 2009 at 18:48 UTC
Well, basically what you have done is the same idea as what I did except that you have three regex evaluations when only two are needed - you can extract the start tag at the same time as the middle stuff and then put it back together using simple interpolation - which is what I did. Its a bit faster that way, but you should stick with whatever version you will understand three weeks from now. As for the one liner - just put it all in a sub - either your version or mine or ikegami's which is even faster - and presto - a one liner. That is what subs are for. Best, beth	[reply]