If you're talking about Unicode files, try using binmode FILEHANDLE, ":utf8"; (please use Perl 5.8.0 if you aren't already; it is reported to have better Unicode support than the older versions that support it).
See the binmode entry in perlfunc, perlopentut and perlunicode.
Don't know much about unicode tho' - YMMV.
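For what it's worth, here is a minimal sketch of the binmode approach, assuming Perl 5.8 or later. The temp file and the sample string are made up for illustration; substitute your own filehandle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only so the example is self-contained.
my ($fh, $path) = tempfile(UNLINK => 1);

# With the :utf8 layer, characters are written out as UTF-8 bytes.
binmode $fh, ':utf8';
my $text = "caf\x{e9} \x{65e5}\x{672c}\x{8a9e}";   # e-acute plus three kanji
print {$fh} $text;
close $fh or die "close failed: $!";

# Read back through the same layer so the bytes become characters again.
open my $in, '<', $path or die "cannot open $path: $!";
binmode $in, ':utf8';
my $line = <$in>;
close $in;

my $chars = length $line;   # counts characters, not bytes
my $bytes = -s $path;       # size of the file on disk, in bytes
print "read $chars characters from $bytes bytes\n";
```

The point is that length counts characters once the layer is in place, while the file on disk holds more bytes than that.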
Joost
Joost's advice about using Perl 5.8.0 is on the mark,
even if your HTML input data is not Unicode -- and it's very
likely that your data is something other than Unicode, such as Shift-JIS
or god-knows-what (I hope you know which encoding you are
dealing with).
Not only can Perl 5.8.0 store strings as UTF-8 internally,
but the Encode module, which is part of the 5.8.0 distribution,
provides the means for converting back and forth between
UTF-8 and a wide assortment of alternate character sets,
including all the major (pre-Unicode) Japanese standards,
as well as the other
forms of Unicode (e.g. UTF-16, big- or little-endian).
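
A rough sketch of that kind of conversion with Encode, assuming the input really is Shift-JIS. The four sample bytes are an illustration, not taken from anyone's actual data.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Shift-JIS bytes for the two kanji U+65E5 U+672C (sample input only).
my $sjis_bytes = "\x93\xfa\x96\x7b";

# decode: raw bytes in a named encoding -> Perl character string.
my $string = decode('shiftjis', $sjis_bytes);

# encode: Perl character string -> raw bytes in the target encoding.
my $utf8_bytes  = encode('UTF-8', $string);
my $utf16_bytes = encode('UTF-16LE', $string);

printf "%d characters, %d UTF-8 bytes, %d UTF-16LE bytes\n",
    length($string), length($utf8_bytes), length($utf16_bytes);
```

Decoding once on input and encoding once on output keeps everything in between working on characters rather than bytes.
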
And the new tricks that you get to do with regex matches,
involving predefined unicode character classes, are truly awesome.
Not only do you avoid nefarious corruptions of multi-byte
characters completely, but you get to match characters
according to what they really are.
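
To give a flavour of those character classes, here is a small sketch, assuming the text has already been decoded into a Perl character string. The sample string is invented for the example.

```perl
use strict;
use warnings;

# Sample text: "Perl", one hiragana, three kanji, four katakana.
my $text = "Perl \x{3067} \x{65e5}\x{672c}\x{8a9e} "
         . "\x{30ab}\x{30bf}\x{30ab}\x{30ca}";

# Unicode properties match whole characters by script,
# so a multi-byte character is never split in the middle.
my @kanji    = $text =~ /(\p{Han}+)/g;
my @katakana = $text =~ /(\p{Katakana}+)/g;
my @hiragana = $text =~ /(\p{Hiragana}+)/g;

printf "kanji runs: %d, katakana runs: %d, hiragana runs: %d\n",
    scalar @kanji, scalar @katakana, scalar @hiragana;
```

Run against raw Shift-JIS bytes instead, the same patterns would not see characters at all, which is why the decoding step matters.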
Thank you all for your thoughts - I will definitely look into 5.8 - we are currently using 5.6.1. As for the encoding of the files, they are Shift-JIS.
Jeffrey Friedl's book Mastering Regular Expressions goes into detail on handling unicode/multi-byte characters in regular expressions. You may wish to start there.