Re: Search and replace in Unicode files

I have tried to make my question a little more specific.

The following code does not do what I expect.

It is intended to:

read all hex values in a multibyte encoded file into memory,

copy the hex values to a new list,

create a new, empty Unicode::String object

set the hex values in the unicode object to the parsed hex values

open an output file for writing

print the Unicode object to the output file

so basically it is similar in nature to a plain old copy.

use Unicode::String;
$o_in   = 'in.txt';
$o_out  = 'out.txt';

# get the input data into utf16 objects
open ( F, $o_in ) || die;
while (<F>) {
       push ( @raw, utf16($_));
}
close (F) || die;
# get a list of hex values for the data
map { push (@hexout, $_->hex) } @raw; 
# now create an empty utf16 object 
Unicode::String->stringify_as( 'utf16' );
my $outo = Unicode::String->new();
# and define its hex values
$outo->hex ( join '', @hexout);
# dump the hex values
open ( F, ">$o_out" ) || die;
print F $outo;    
close (F) || die;

But my input file differs from my output file.

Does anyone have any idea why?

Thanks in advance


Replies are listed 'Best First'.
Re: Re: Search and replace in Unicode files by graff (Chancellor) on Jun 17, 2003 at 02:38 UTC
This is a tough post to respond to, because so much important information is scattered over a bunch of replies you've made to your own top node -- but thanks for providing all that info. Still, you haven't said which version of Perl you're using. This is important, because 5.6 has only partial unicode support, compared to 5.8 and its astonishing Encode module, which could make quite a difference here.

This comment from your original post really stumped me: "This file contains both ASCII and UTF-16 LE..." I would normally interpret this to mean that some "characters" in the file are single-byte, while others are UTF-16, and frankly, that would be a really bad idea, unless you have some really clear, reliable signal in the file to mark the change-over from one encoding to the other. But then, based on one of your replies, I gather that the entire file is really utf-16, and that some of the 16-bit characters happen to represent ascii values in the unicode table (i.e. their high bytes are null). Fine.

As for the best way to do what you want, see Perl 5.8 and its Encode module. In this version, utf8 can be used as the "native" internal encoding for strings, and you can read and write to files using this encoding, or have the characters converted to any other chosen (appropriate, supported) non-unicode encoding, either in memory, or via an "IO layer" attached to an input or output file handle (refer to the "PerlIO" modules).

I'm sorry not to be more helpful -- I skipped directly from 5.5 to 5.8, and never had the opportunity/need to use the various modules like "Unicode::String" -- I think they are mostly superseded in 5.8.
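As a rough illustration of the "in memory" route mentioned above, here is a minimal sketch using Encode's decode and encode functions. The sample bytes and variable names are made up for the example, and it assumes the data really is UTF-16LE without a byte-order mark:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-16LE bytes for the text "a=b": each ASCII-range character
# is followed by a null high byte. (Sample data, assumed for illustration.)
my $raw = "a\x00=\x00b\x00";

# Turn the bytes into a Perl character string.
my $text = decode('UTF-16LE', $raw);

# Ordinary string operations now work on characters, not bytes.
$text =~ s/=/key1/;

# Turn the characters back into UTF-16LE bytes, ready to be written
# to a filehandle that is in binary mode.
my $out = encode('UTF-16LE', $text);
```

The point of the sketch is only that the substitution happens on decoded characters, so the null high bytes never get in the way of the match.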
If you have a utf-16LE file, 5.8 would handle it like this:

open( IN, "<:encoding(UTF-16LE)", "file.in" ) or die "file.in: $!";
open( OUT, ">:utf8", "file.out" ) or die "file.out: $!";
while (<IN>) {
    s/=/key1/;      # using utf8 internally makes this easy
    print OUT $_;
}
close( OUT ) or die "file.out: $!";

and of course, there would be other ways of doing the same thing, and allowing for things like checking that the input really conforms to the specified input encoding, and checking that the characters really can be converted into the specified output encoding. See the perluniintro, perlunicode, Encode and PerlIO docs.

One other point: in your reply where you showed the unicode data in hex, a couple of strings actually have a "NULL" (U+0000, right after the "=" that gets replaced by "key1", I think). I wonder if this had any bad impact on your process, or on your ability to check the results of the process...
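For the two checks mentioned above (does the input conform to the encoding, and are there embedded U+0000 characters), something along these lines might do. This is only a sketch: decode_utf16le_strict is a hypothetical helper name, the sample byte strings are invented, and the exact error text from Encode is not assumed:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-16LE bytes, asking Encode to die on
# malformed input instead of silently substituting U+FFFD.
sub decode_utf16le_strict {
    my ($bytes) = @_;
    # decode() may modify its input when a CHECK value is given,
    # so work on a copy.
    my $copy = $bytes;
    return decode('UTF-16LE', $copy, Encode::FB_CROAK());
}

# Sample data (assumed): "=", then U+0000, then "x".
my $ok = decode_utf16le_strict("=\x00\x00\x00x\x00");

# Count embedded U+0000 characters in the decoded string.
my $nulls = ($ok =~ tr/\x00//);
print "embedded U+0000 count: $nulls\n";

# A high surrogate (0xD800) followed by an ordinary character is not
# valid UTF-16; eval traps the failure if decode() rejects it.
my $bad = eval { decode_utf16le_strict("\x00\xD8a\x00") };
print defined $bad ? "input accepted\n" : "input rejected: $@";
```

Counting the nulls after decoding, rather than in the raw bytes, avoids confusing a genuine U+0000 with the null high bytes of ascii-range characters.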
Re: Re: Re: Search and replace in Unicode files by bm (Hermit) on Jun 18, 2003 at 09:13 UTC
Encode and 5.8 were exactly what I was after, thanks.
Re: Re: Re: Search and replace in Unicode files by bm (Hermit) on Jun 17, 2003 at 17:04 UTC
Hi Graff,

"But then, based on one of your replies, I gather that the entire file is really utf-16, and that some of the 16-bit characters happen to represent ascii values in the unicode table (ie. their high bytes are null). Fine."

Yes, you are correct (I think) - the whole file is utf16, but some of the chars have null high bytes (the 'ascii' like values). I also have access to 5.8 (now), and am going to try to not make my life so hard by ignoring the fact that my data is utf-16LE encoded (other than the IO layers) - as demonstrated by your code snippet. Will reply to this thread if everything works out.

"I'm sorry not to be more helpful"

Wrong! You have clarified several things, and given me a direction - including a snippet - to go on. And for that: more and better karma to you.

Thank you kindly,
bm