Greetings, Monks.

I am working on converting some text from a legacy system to good old-fashioned XHTML. I have a problem that I can't get my brain around and am looking for some assistance.

The system in question uses tags that are terminated by hex control characters, so a formatting tag might look like this:
<tag>text\x9D
About 10% of the time, another control code, \x90, follows later. Usually there are other formatting tags between the end-of-tag character (\x9D) and the \x90.

Here is an example of the data:

I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D <cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90 I am the end of the data with another control code for fun. \x90

Ideally I would like my output to be:

I am the start of the data. I am the end of the data with another control code for fun. \x90

I want to remove everything from the tag up to the \x90 character, but the \x90 will not always be there. I do not want to match a \x90 later in the file and truncate the data. In other words, I want to match from the beginning of the tag to the first word character that is NOT inside angle brackets.

Here is the regex I've been using:
s/<\/bug[^>]*>[^9D]*\x9D.*?\x90//isg
I tried matching first with the \x90 and then without it, but the first match wound up being too greedy. I have also tried more complex regexes using lookahead/lookbehind assertions, but couldn't get those working.
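
For illustration, here is a minimal sketch of one way to get the output described above. It rests on a few assumptions: the opening tag is literally <tag> as in the sample data (the regex quoted above looks for </bug instead), and the \x90 belongs to the tag only when nothing but whitespace and angle-bracket tags sits between it and the \x9D. Note also that [^9D] is a character class excluding the characters "9" and "D"; excluding the control character itself takes [^\x9D]. The name strip_tagged_span is hypothetical, and this has only been checked against the sample line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): removes each <tag>...\x9D
# span, plus a trailing \x90 when only whitespace and <...> tags precede it.
sub strip_tagged_span {
    my ($text) = @_;
    $text =~ s{
        <tag\b[^>]*>              # opening tag (assumed to be literally "tag")
        [^\x9D]*                  # tagged text, up to the end-of-tag control char
        \x9D                      # end-of-tag control char
        (?:
            (?: \s | <[^>]*> )*   # whitespace and angle-bracket tags only
            \x90                  # the optional follow-up control code
        )?                        # skipped when a word character comes first
        \s*                       # swallow the whitespace left behind
    }{}gix;
    return $text;
}

my $data = "I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D "
         . "<cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90 "
         . "I am the end of the data with another control code for fun. \x90";

# Show the result with the control code spelled out so it is visible.
(my $shown = strip_tagged_span($data)) =~ s/\x90/\\x90/g;
print "$shown\n";
```

Because the \x90 sits inside an optional group that allows only whitespace and tags before it, a \x90 further along in the file is never reached, so the data is not truncated. When the \x90 is absent, the formatting tags that follow the \x9D are left in place; that is a design choice, so that an opening <cm+bd> is not removed without its matching <cm-bd>.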

Your help is much appreciated.


In reply to Regex Question by HamNRye
