PerlMonks  

Regex Question

by HamNRye (Monk)
on Mar 26, 2009 at 16:49 UTC

HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks.

I am working on converting some text from a legacy system to good old-fashioned XHTML. I have a problem that I can't get my brain around and am looking for some assistance.

The system in question uses tags terminated by hex control characters, so a formatting tag might look like this:
<tag>text\x9D

Now, about 10% of the time, another control code, \x90, follows later. Usually there are other formatting tags between the end-tag character (\x9D) and the \x90.

Here is an example of the data:

I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D <cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90 I am the end of the data with another control code for fun. \x90

Ideally I would like my output to be:

I am the start of the data. I am the end of the data with another control code for fun. \x90

I want to remove the tag up to the \x90 char... But it will not always be there. I do not want to match a \x90 later in the file and truncate the data. I want to match from the beginning of the tag to the first word character that is NOT inside of angle brackets.
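The rule described above (remove from the start of the tag up to the first word character that is not inside angle brackets) might be sketched roughly as follows. This is only an illustration against the sample data shown earlier: `drop_tagged_span` is a hypothetical helper name, and the literal `<tag>` stands in for whatever the real tags look like.

```perl
use strict;
use warnings;

# Hypothetical helper: remove "<tag>...\x9D" plus any trailing run of
# angle-bracket tags and non-word characters (which includes \x90).
sub drop_tagged_span {
    my ($text) = @_;
    $text =~ s/<tag>[^\x9D]*\x9D(?:<[^>]*>|[^\w<])*//g;
    return $text;
}

my $in = "I am the start of the data. <tag>I am <cm+bd>tagged<cm-bd> text\x9D"
       . " <cm+bd> <cm-bd> <cm+bd> <cm-bd> <cm+bd> <cm-bd> \x90"
       . " I am the end of the data with another control code for fun. \x90";

my $out = drop_tagged_span($in);

# Make the surviving control code visible before printing.
(my $shown = $out) =~ s/\x90/\\x90/g;
print "$shown\n";
```

Because the trailing group stops at the first word character outside a tag, a \x90 that appears later in ordinary text is left alone. One caveat of this sketch: it also removes formatting tags that sit directly after the \x9D even when no \x90 follows.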

Here is the regex I've been using:
s/<\/bug[^>]*>[^9D]*\x9D.*?\x90//isg
I'm matching first with the \x90 and then without, but wound up with the first match being too greedy. I have tried more complex regexes using lookahead/lookbehind assertions but couldn't get those working.
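One detail in that regex is worth checking: inside a character class, `[^9D]` excludes the two literal characters `9` and `D`, not the control character \x9D. A small sketch of the difference, using a made-up sample string and the `<tag>` form from the example data:

```perl
use strict;
use warnings;

# Made-up sample: the tagged text itself contains a 'D' and some '9's.
my $sample = "<tag>Dated 1999\x9D rest";

# [^9D]* stops at the first literal '9' or 'D', so \x9D cannot follow it here.
my $broken = $sample =~ /<tag>[^9D]*\x9D/ ? 1 : 0;

# [^\x9D]* runs right up to the control character itself.
my $fixed = $sample =~ /<tag>[^\x9D]*\x9D/ ? 1 : 0;

print "broken=$broken fixed=$fixed\n";
```

With the hex escape inside the class, the tag body may contain digits and capital letters without breaking the match.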

Your help is much appreciated.

Replies are listed 'Best First'.
Re: Regex Question
by HamNRye (Monk) on Mar 26, 2009 at 17:06 UTC

    Ahhh, never mind. I have it now.

    $text =~ s/\x90[^\w]*<\/bugbreak[^>]*>.*?\x9D.*?\x9D(.*?)\x90/translateString($1)/esig;

    sub translateString {
        my $text = shift;
        $text =~ s/<[^>]*>//g;   # strip the formatting tags
        return $text;            # return the cleaned string, not the substitution count
    }

    This way, even if the match does slurp up too much, the formatting tags are removed and the "clean" string is returned for the substitution. Schweet.

    Just typing the problem out made me think about it again and come up with a solution. Thanks.
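A self-contained sketch of the same callback idea, for anyone who wants to try it. It assumes the simplified `<tag>` form from the question's sample rather than the real pattern above, and `strip_tags` and `remove_tagged` are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: drop anything in angle brackets and return the rest.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;    # the cleaned string, not the substitution count
}

# Simplified pattern for the sample data; /e runs strip_tags on the capture.
sub remove_tagged {
    my ($text) = @_;
    $text =~ s/<tag>[^\x9D]*\x9D(.*?)\x90/strip_tags($1)/seg;
    return $text;
}

# Here the lazy capture runs past real text before it finds a \x90.
my $over = remove_tagged("a <tag>x\x9D <cm+bd>kept<cm-bd> text \x90");
print "[$over]\n";
```

The capture swallows more than the formatting run, but the word "kept" and the text around it survive because only the angle-bracket tags are deleted from the captured span. Note that the \x90 that ends the match is still consumed.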

Re: Regex Question
by JavaFan (Canon) on Mar 26, 2009 at 17:05 UTC
    I'm not quite sure what you want, but it may be:
    s/<tag>[^\x9D]*\x9D(?:\W*\x90)?//
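To see how that suggestion behaves, here is a small sketch that runs it over three made-up inputs. The strings are assumptions modelled on the sample in the question, and `show` is a hypothetical helper that makes control bytes visible.

```perl
use strict;
use warnings;

# The pattern from the reply, compiled once for reuse.
my $re = qr/<tag>[^\x9D]*\x9D(?:\W*\x90)?/;

# Hypothetical helper: render control bytes such as \x90 as visible text.
sub show {
    my ($s) = @_;
    $s =~ s/([\x80-\x9F])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

# Only non-word characters between \x9D and \x90: both are removed.
(my $adjacent = "start <tag>x\x9D \x90 end") =~ s/$re//;

# No \x90 right after the tag: the optional group matches nothing,
# so the later \x90 survives.
(my $later = "start <tag>x\x9D end \x90") =~ s/$re//;

# Formatting tags in between: \W* stops at the 'c' of <cm+bd>,
# so the optional group is skipped and the tags and \x90 remain.
(my $tagged = "start <tag>x\x9D <cm+bd> <cm-bd> \x90 end") =~ s/$re//;

print show($_), "\n" for $adjacent, $later, $tagged;
```

So the optional group protects a later \x90 from being eaten, but with formatting tags between the two control codes (as in the question's sample) the \x90 and the tags are left in place.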

Node Type: perlquestion [id://753454]
Approved by ikegami