HamNRye has asked for the wisdom of the Perl Monks concerning the following question:
Greetings Monks.
I am working on converting some text from a legacy system to good old-fashioned XHTML. I have a problem that I can't get my brain around and am looking for some assistance.
The system in question uses tags that end with hex control characters, so a formatting tag might look like this:
<tag>text\x9D
About 10% of the time, another control code, \x90, follows later. Usually there are other formatting tags between the end-of-tag character and the \x90.
Here is an example of the data:
I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D <cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90 I am the end of the data with another control code for fun. \x90
Ideally I would like my output to be:
I am the start of the data. I am the end of the data with another control code for fun. \x90
I want to remove everything from the tag up to and including the \x90 character, but the \x90 will not always be there. I do not want to match a \x90 later in the file and truncate the data. In other words, I want to match from the beginning of the tag to the first word character that is NOT inside angle brackets.
Here is the regex I've been using:
s/<\/bug[^>]*>[^9D]*\x9D.*?\x90//isg
I tried matching with the \x90 and then without it, but wound up with the first match being too greedy. I have also tried more complex regexes using lookahead/lookbehind assertions, but couldn't get those working.
Your help is much appreciated.
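One possible approach, offered as a minimal sketch rather than a definitive answer: make the \x90 optional and only allow formatting tags and whitespace between the \x9D and it, so a \x90 further along in the data can never be reached. Two things in the regex above are worth noting. The class [^9D] excludes the literal characters "9" and "D", not the byte \x9D, so the sketch uses [^\x9D] instead. And because .*?\x90 requires a \x90, it runs on to a later one whenever the nearby one is missing. The sketch assumes the control codes are real single bytes (0x9D and 0x90) rather than the literal text "\x9D", and that the opening tag is <tag> as in the sample data. The sub name strip_tagged_span is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): remove each tagged span.
sub strip_tagged_span {
    my ($text) = @_;
    (my $out = $text) =~ s{
        <tag[^>]*>          # opening tag, as in the sample data
        [^\x9D]*            # tag body: anything up to the end-of-tag byte
        \x9D                # the end-of-tag control byte
        (?: \s* <[^>]*> )*  # any run of formatting tags that follows
        \s*                 # whitespace before the next real text
        (?: \x90 \s* )?     # the second control byte, only if it comes next
    }{}gx;
    return $out;
}

# Sample data from the question, with the control codes as real bytes.
my $sample = "I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D"
           . " <cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90"
           . " I am the end of the data with another control code for fun. \x90";

my $out = strip_tagged_span($sample);

# Show high bytes in a readable form before printing.
(my $shown = $out) =~ s/([\x80-\xFF])/sprintf('\\x%02X', ord $1)/ge;
print "$shown\n";
# I am the start of the data. I am the end of the data with another control code for fun. \x90
```

Since nothing but angle-bracket tags and whitespace may sit between the \x9D and the optional \x90, the match stops at the first word character outside angle brackets, which is the behaviour asked for. One consequence to be aware of: when no \x90 follows, any formatting tags directly after the \x9D are still removed, even if they open formatting for the text that comes next.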
Re: Regex Question
by HamNRye (Monk) on Mar 26, 2009 at 17:06 UTC
Re: Regex Question
by JavaFan (Canon) on Mar 26, 2009 at 17:05 UTC