Regex Question

HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks.

I am working on converting some text from a legacy system to good old-fashioned XHTML. I have a problem that I can't get my brain around and am looking for some assistance.

The system in question uses tags with hex control characters, so a formatting tag might look like this: `<tag>text\x9D`

Now about 10% of the time, there's another control code that follows later: `\x90`. Usually there are other formatting tags in between the end-tag char and the `\x90`.

Here is an example of the data:

I am the start of the data.
<tag>I am <cm+bd>tagged<cm-bd> text\x9D
<cm+bd> <cm-bd>  <cm+bd> <cm-bd>   <cm+bd> <cm-bd>
\x90
I am the end of the data with another control code for fun.
\x90

Ideally I would like my output to be:

I am the start of the data.
I am the end of the data with another control code for fun.
\x90

I want to remove the tag up to the \x90 char... But it will not always be there. I do not want to match a \x90 later in the file and truncate the data. I want to match from the beginning of the tag to the first word character that is NOT inside of angle brackets.

Here is the regex I've been using: `s/<\/bug[^>]*>[^9D]*\x9D.*?\x90//isg`. I'm matching with the \x90 and then without, but wound up with the first match being too greedy. I have tried more complex regexes using lookahead/lookbehind assertions, but couldn't get those working.

Your help is much appreciated.


Replies are listed 'Best First'.
Re: Regex Question by HamNRye (Monk) on Mar 26, 2009 at 17:06 UTC
Ahhh, never mind. I have it now. `$text =~ s/\x90[^\w]*<\/bugbreak[^>]*>.*?\x9D.*?\x9D(.*?)\x90/translateString($1)/esig; sub translateString { my $text = shift; $text =~ s/<[^>]*>//g; return $text; }` This way, even if the match does slurp up too much, the formatting tags are removed and the "clean" string is returned for the substitution. Schweet. Just typing the problem out made me think about it again and come up with a solution. Thanks.
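For readers following along, here is a minimal sketch of the callback idea described in this reply: capture a span, hand it to a helper through the `/e` modifier, and substitute whatever string the helper returns. The helper name `strip_markup` and the input string are made up for illustration, and the quantifiers in the pattern are assumed to be `*` and `*?`. The one detail worth noting is that the helper has to return the cleaned string explicitly, because `s///` on its own evaluates to a substitution count.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every <...> tag and return the cleaned copy.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;    # return the string itself, not the substitution count
}

# Made-up input in the spirit of the thread, not real legacy data.
my $data = "\x90 </bugbreak x>head\x9Dmid\x9DI am <cm+bd>kept<cm-bd> text\x90 tail";

# /e evaluates the replacement as Perl code, so the captured span is
# passed through strip_markup() before being substituted back in.
$data =~ s/\x90[^\w]*<\/bugbreak[^>]*>.*?\x9D.*?\x9D(.*?)\x90/strip_markup($1)/esig;

print "$data\n";
```

With this input, only the captured text survives, with its formatting tags stripped, followed by whatever came after the closing \x90.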
Re: Regex Question by JavaFan (Canon) on Mar 26, 2009 at 17:05 UTC
I'm not quite sure what you want, but it may be: `s/<tag>[^\x9D]*\x9D(?:\W*\x90)?//`
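A sketch of how the optional-group idea might be tried against the sample data from the question. This is a variation rather than the exact regex above: the optional group is widened to skip whitespace and `<...>` formatting tags between the \x9D and the \x90, which is an assumption about what the question intends. The name `remove_tagged_block` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove each <tag>...\x9D block, plus a following
# \x90 when only whitespace and formatting tags sit in between.
sub remove_tagged_block {
    my ($text) = @_;
    $text =~ s{
        <tag> [^\x9D]* \x9D        # the tag and its text, up to the end-tag byte
        (?:                        # optionally:
            (?: \s | <[^>]*> )*    #   whitespace and formatting tags only
            \x90 \s*               #   then the trailing control code
        )?
    }{}gx;
    return $text;
}

# The sample data from the question, with real control bytes.
my $sample = join "\n",
    "I am the start of the data.",
    "<tag>I am <cm+bd>tagged<cm-bd> text\x9D",
    "<cm+bd> <cm-bd>  <cm+bd> <cm-bd>   <cm+bd> <cm-bd>",
    "\x90",
    "I am the end of the data with another control code for fun.",
    "\x90",
    "";

print remove_tagged_block($sample);
```

Because the group is optional and cannot cross a word character outside angle brackets, a \x90 that appears later in the file is left alone; in that case only the `<tag>...\x9D` part is removed.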