regex and multibyte strings

Kalimeister has asked for the wisdom of the Perl Monks concerning the following question:

Having had such success in the past, I am returning to the Monastery for more help from the Perl Monks! Please bear in mind that I am a relative beginner here ... but here is my problem:
We have to do a fair bit of post-processing of HTML and JavaScript files, and all has worked fine until we started handling Japanese files. Now the text is getting corrupted because Perl is handling the text byte by byte. Is there a (relatively) easy way to tell Perl that the text is multibyte, so that it looks at it character by character and not byte by byte? That way something like `s/\\([^\\])/$1/g;` wouldn't change a mixed Japanese/English sequence like \x94\x5C \x83\x41 to \x94\x20\x83\x41. I need it to know that the \x5C is the second byte of a multibyte character and not a backslash escaping the space.

Thank you for any help you can offer.


Replies are listed 'Best First'.
Re: regex and multibyte strings by Joost (Canon) on Jun 10, 2003 at 15:21 UTC
If you're talking about Unicode files, try using `binmode FILEHANDLE, ":utf8";`. (Please use Perl 5.8.0 if you aren't already; it is reported to have better Unicode support than the older versions that support it.) See the binmode entry in perlfunc, and also perlopentut and perlunicode. I don't know much about Unicode, though, so YMMV. Joost
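A minimal sketch of the `binmode` suggestion, assuming Perl 5.8 or later and UTF-8 input. The temporary file and its single-character contents are made up for illustration; for Shift-JIS files the analogous layer would be `:encoding(shiftjis)` instead of `:utf8`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative sample file: the UTF-8 bytes E8 83 BD encode the single
# character U+80FD.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # write raw bytes, no layer
print {$out} "\xE8\x83\xBD\n";
close $out or die "close: $!";

# Read it back through the :utf8 layer so Perl sees characters, not bytes.
open my $in, '<', $path or die "open $path: $!";
binmode $in, ':utf8';
my $line = <$in>;
close $in or die "close: $!";
chomp $line;

print "characters read: ", length($line), "\n";   # 1, not 3
```

Without the `binmode` call the same line would have a length of 3, because each byte would count as its own character.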
Re: regex and multibyte strings by graff (Chancellor) on Jun 11, 2003 at 07:06 UTC
Joost's advice about using Perl 5.8.0 is on the mark, even if your HTML input data is not Unicode. It's very likely that your data is something other than Unicode, such as Shift-JIS or god-knows-what (I hope you know which encoding you are dealing with). Not only are perl-5.8.0's strings stored as utf8 internally, but the Encode module, which is part of the 5.8.0 distribution, provides the means for converting back and forth between utf8 and a wide assortment of alternate character sets, including all the major (pre-Unicode) Japanese standards, as well as the other forms of Unicode (i.e. utf16, big- or little-endian). And the new tricks that you get to do with regex matches, involving predefined Unicode character classes, are truly awesome. Not only do you avoid nefarious corruptions of multi-byte characters completely, but you get to match characters according to what they really are.
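Here is a rough sketch of that decode, edit, encode round trip, assuming Perl 5.8 or later with the bundled Encode module and Shift-JIS input. It uses the byte sequence and the substitution from the original question; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw Shift-JIS bytes from the question: 0x94 0x5C is one two-byte
# character whose second byte equals an ASCII backslash.
my $bytes = "\x94\x5C \x83\x41";

# Byte-wise substitution: the 0x5C is mistaken for an escaping backslash.
(my $broken = $bytes) =~ s/\\([^\\])/$1/g;

# Character-wise: decode to Perl's internal form, edit, then encode again.
my $text = decode('shiftjis', $bytes);
my @kana = $text =~ /(\p{Katakana})/g;   # a Unicode character class at work
$text =~ s/\\([^\\])/$1/g;               # finds no backslash character
my $fixed = encode('shiftjis', $text);

print "byte-wise:      ", unpack('H*', $broken), "\n";   # 94208341
print "character-wise: ", unpack('H*', $fixed),  "\n";   # 945c208341
print "katakana found: ", scalar(@kana), "\n";           # 1
```

In a real script the decode and encode steps can also be pushed into the file handles with an `:encoding(shiftjis)` layer on `open`, so the regex code itself stays unchanged.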
Re: Re: regex and multibyte strings by Kalimeister (Acolyte) on Jun 12, 2003 at 14:49 UTC
Thank you all for your thoughts - I will definitely look into 5.8; we are currently using 5.6.1. As for the encoding of the files, they are Shift-JIS.
Re: regex and multibyte strings by perlguy (Deacon) on Jun 10, 2003 at 18:14 UTC
Jeffrey Friedl's book Mastering Regular Expressions goes into detail on handling Unicode and multi-byte characters in regular expressions. You may wish to start there.