in reply to Re: Search and replace in Unicode files
in thread Search and replace in Unicode files
Still, you haven't said which version of Perl you're using. This is important, because 5.6 has only partial Unicode support, compared to 5.8 and its astonishing Encode module, which could make quite a difference here.
This comment from your original post really stumped me:
This file contains both ASCII and UTF-16 LE...
I would normally interpret this to mean that some "characters" in the file are single-byte, while others are UTF-16, and frankly, that would be a really bad idea, unless you have some really clear, reliable signal in the file to mark the change-over from one encoding to the other. But then, based on one of your replies, I gather that the entire file is really UTF-16, and that some of the 16-bit characters happen to represent ASCII values in the Unicode table (i.e. their high bytes are null). Fine.
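To make that last point concrete, here is a small sketch, assuming Perl 5.8 with the Encode module; the sample string is made up for illustration. It dumps the UTF-16LE bytes of an ASCII-only string, so you can see the null high byte after each character:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text; any ASCII-only string behaves the same way.
my $text  = "a=b";
my $bytes = encode("UTF-16LE", $text);

# Each ASCII character becomes its own byte followed by a 0x00 high byte.
my $hex = join " ", map { sprintf "%02x", ord($_) } split //, $bytes;
print "$hex\n";    # 61 00 3d 00 62 00
```

A byte-oriented s/=/key1/ on such data would insert four single bytes between 16-bit units, which is why decoding first matters.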
As for the best way to do what you want, see Perl 5.8 and its Encode module. In this version, utf8 can be used as the "native" internal encoding for strings, and you can read and write files using this encoding, or have the characters converted to any other chosen (appropriate, supported) non-Unicode encoding, either in memory or via an "IO layer" attached to an input or output file handle (refer to the "PerlIO" modules).
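For the "in memory" route, a minimal sketch might look like the following. It assumes Perl 5.8 with Encode; the input bytes are a made-up stand-in for data read from a file in binary mode:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-16LE bytes for "name=value\n".
my $raw = encode("UTF-16LE", "name=value\n");

# Decode the raw bytes into Perl's internal character string.
my $chars = decode("UTF-16LE", $raw);

# Ordinary regex operations now work on characters, not bytes.
$chars =~ s/=/key1/;

# Encode back to bytes in whichever output encoding is wanted.
my $out = encode("UTF-8", $chars);
```

The IO-layer route does the same decode and encode steps for you on every read and write, so which one to use is mostly a matter of where the data comes from.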
I'm sorry not to be more helpful. I skipped directly from 5.5 to 5.8 and never had the opportunity or need to use modules like "Unicode::String"; I think they are mostly superseded in 5.8. If you have a UTF-16LE file, 5.8 would handle it like this:

    # decode UTF-16LE on input, write utf8 on output
    open( IN,  "<:encoding(UTF-16LE)", "file.in" )  or die "file.in: $!";
    open( OUT, ">:utf8",               "file.out" ) or die "file.out: $!";
    while (<IN>) {
        s/=/key1/;    # using utf8 internally makes this easy
        print OUT $_;
    }
    close(OUT) or die "file.out: $!";

Of course, there are other ways of doing the same thing, including ways that check that the input really conforms to the specified input encoding and that the characters really can be converted into the specified output encoding. See the perluniintro, perlunicode, Encode and PerlIO docs.
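As a rough sketch of those two checks, assuming Perl 5.8 with Encode: passing FB_CROAK as the CHECK argument makes decode() and encode() die on bad data instead of silently substituting a replacement character. The sample data and the choice of ISO-8859-1 as the output encoding are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Die on malformed data, but leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical bad input: a lone low surrogate (0xDC00) is not valid UTF-16.
my $bad_bytes = "\x00\xdc\x3d\x00";
my $decoded   = eval { decode("UTF-16LE", $bad_bytes, $check) };
my $input_ok  = defined $decoded ? 1 : 0;

# Hypothetical unmappable output: U+263A has no ISO-8859-1 equivalent.
my $smiley    = "\x{263A}";
my $encoded   = eval { encode("ISO-8859-1", $smiley, $check) };
my $output_ok = defined $encoded ? 1 : 0;

print "input ok: $input_ok, output ok: $output_ok\n";
```

Wrapping the calls in eval lets you report the offending line and carry on, rather than aborting the whole run.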
One other point: in your reply where you showed the Unicode data in hex, a couple of strings actually have a "NULL" (U+0000), right after the "=" that gets replaced by "key1", I think. I wonder if this had any bad impact on your process, or on your ability to check the results of the process...
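If you want to find out whether such NULLs are really in the data, something like this sketch could help. It assumes Perl 5.8 with Encode, and the byte string is invented to mimic a "=" followed by U+0000:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical data: "k=" followed by a NUL, then "v", as UTF-16LE bytes.
my $raw  = "k\x00=\x00\x00\x00v\x00";
my $line = decode("UTF-16LE", $raw);

# Count NUL characters and record where the first one sits (0-based).
my $nul_count = ( $line =~ tr/\x00// );
my $first_nul = index( $line, "\x00" );

print "NULs: $nul_count, first at offset $first_nul\n";
```

Checking after decoding matters here, because in the raw UTF-16LE bytes every ASCII character is followed by a zero byte anyway.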