bm has asked for the wisdom of the Perl Monks concerning the following question:

Before I begin a confession: I am very much a newbie to the Unicode world.

I am attempting to do a search and replace on the following Unicode encoded file:

[KEY] һ ح د ؼ ^ KeyStr=һحدؼ^ KeyMap=һحدؼ^ [DICT]

This file contains both ASCII and UTF-16 LE (CJK Unified Ideograph) encoded characters. I have a hash containing the characters to search for, and their replacement string, for example:

my %conv = ( 'U\+003d' => 'key1', );

where the key is the character to search for, written as a Unicode hex value, and the value is its replacement, written as a normal utf8 (ASCII) string. So the hash above says that the equals sign '=' (hex U+003d) should be replaced by the string 'key1'. I use Unicode::String to get the hex values for each line of my input file. My approach is to slurp the data from the input file into a list of utf16 objects:

while (<F>) { push ( @raw, utf16($_)); }
Then, for each element in @raw, I get their hex values, and if a key in %conv is found in the hex string, then that hex code is replaced with the replacement string's utf16 hex value:

    my @hexout;
    foreach $line ( @raw ) {
        # get the hex values for the input line
        my $hvs = $line->hex;
        # now loop over the conversions
        foreach my $hexkey ( keys %conv ) {
            # if this char to be replaced is in
            # this line, do a conversion
            if ( $hvs =~ /$hexkey/ ) {
                # replacement chars are utf8...
                my $newStr8 = utf8($conv{$hexkey});
                # ...so convert it to utf16
                my $newStr16 = $newStr8->utf16;
                my $Str16Obj = utf16($newStr16);
                # and get its hex value in utf16 format
                my $newhex = $Str16Obj->hex;
                print "Replaced $hexkey with ",
                      "$conv{$hexkey} (hex: $newhex) in line\n",
                      "(hex: $hvs)\n" if $debug;
                # do the conversion
                $hvs =~ s/$hexkey/$newhex/g;
            }
        }
        # maintain a list of output hex values.
        push @hexout, $hvs;
    }

So, by the end of that code snippet, @hexout contains a list of Unicode hex values to be dumped to a new file. This seems to work OK when I compare @hexout to the hex values in @raw: only the hex values I intend to change are actually changed (003d, using the example above).

My (current :) problem comes when I output the values in @hexout. I have tried this:

    Unicode::String->stringify_as( 'utf16' );
    my $out16 = Unicode::String->new();
    $out16->hex ( join '', @hexout);
    print OUTFILE $out16;
    close F || die;
and a similar approach using utf8 instead of utf16.

Also tried

print OUTFILE chr hex foreach ( split /\s*/, $outhex );

but this seems to lose all formatting completely, and wide chars and print make me nervous. But my output file is mangled even though the output hex values are identical (other than the conversions) to my input hex values!
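One reason the `chr hex` attempt loses everything is that `split /\s*/` can match the empty string, so it splits the dump between every single character instead of between the `U+xxxx` tokens. The following is a minimal sketch, not the poster's method, of turning such a dump back into bytes with the core Encode module (available in Perl 5.8). The helper name `hexdump_to_string` and the output filename are hypothetical, and the sketch assumes every token is a plain `U+` value in the Basic Multilingual Plane.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a "U+xxxx U+yyyy ..." dump into a Perl
# character string.
sub hexdump_to_string {
    my ($dump) = @_;
    # Capture the hex digits after each "U+"; whatever separates the
    # tokens (spaces, line breaks) is simply ignored.
    my @codepoints = map { hex } $dump =~ /U\+([0-9A-Fa-f]+)/g;
    return join '', map { chr } @codepoints;
}

# The first few values from the hex dump in this thread.
my $outhex = 'U+feff U+005b U+004b U+0045 U+0059 U+005d';
my $text   = hexdump_to_string($outhex);

# Encode explicitly, so print only ever sees bytes, never wide characters.
my $bytes = encode( 'UTF-16LE', $text );

# Writing it out would then look like this (filename is an assumption):
# open my $fh, '>:raw', 'out.txt' or die "Cannot write out.txt: $!";
# print {$fh} $bytes;
# close $fh or die "close: $!";
```

Opening the output with `:raw` matters on win32, where an ordinary text-mode handle rewrites line-ending bytes behind your back.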

So, if you have got this far, does anyone know how I can output these Unicode hex values into a new file correctly? My current output files are almost correct, but some chars are not coming out correctly, even though they are not touched during the conversion (the actual conversions look just fine!).

Also, as a secondary question, is there a better way to do a Unicode search and replace other than getting down to hex values? I would like to keep this encoding-independent, which sort of steered me away from regexes (plus I don't have MRE handy).

I am using Perl 5.8 on win32. I am viewing the Unicode files in a Unicode editor called SC UniPad or even MS Word 2000.

Your help is greatly appreciated. I feel very lost.

bm

update (broquaint): fixed title typo and added <readmore> tag

Re: Search and replace in Unicode files
by bm (Hermit) on Jun 16, 2003 at 12:13 UTC
    I forgot the actual hex values I am using. These values are from the example input files shown above and using the 003d conversion in %conv also shown above.

    Here are my input file hex values (pre-conversion) from the input file:

    U+feff U+005b U+004b U+0045 U+0059 U+005d U+000d U+000a U+0000 U+4e0d U+000a U+0028 U+4e0d U+000a U+003f U+4e0d U+000a U+0036 U+4e0d U+000a U+005b U+4e0d U+000a U+004b U+0065 U+0079 U+0053 U+0074 U+0072 U+003d U+0000 U+4e28 U+4e3f U+4e36 U+4e5b U+4e0d U+000a U+004b U+0065 U+0079 U+004d U+0061 U+0070 U+003d U+0000 U+4e28 U+4e3f U+4e36 U+4e5b U+4e0d U+000a U+005b U+0044 U+0049 U+0043 U+0054 U+005d U+000d U+000a U+0000

    and here are my hex values for outputting (post-conversion) to the new file (where the 003d conversion shown above is the only one):

    U+feff U+005b U+004b U+0045 U+0059 U+005d U+000d U+000a U+0000 U+4e0d U+000a U+0028 U+4e0d U+000a U+003f U+4e0d U+000a U+0036 U+4e0d U+000a U+005b U+4e0d U+000a U+004b U+0065 U+0079 U+0053 U+0074 U+0072 U+006b U+0065 U+0079 U+0031 U+0000 U+4e28 U+4e3f U+4e36 U+4e5b U+4e0d U+000a U+004b U+0065 U+0079 U+004d U+0061 U+0070 U+006b U+0065 U+0079 U+0031 U+0000 U+4e28 U+4e3f U+4e36 U+4e5b U+4e0d U+000a U+005b U+0044 U+0049 U+0043 U+0054 U+005d U+000d U+000a U+0000

    Again, many thanks.

Re: Search and replace in Unicode files
by Skeeve (Parson) on Jun 16, 2003 at 12:42 UTC
    I have never seen a UTF-16 file. IMHO a better way to do the replacement might be to keep all your conversions in a hash, as you already do, but keyed by byte sequences rather than by hex dump strings. If I'm not wrong, each UTF-16 character is a 2-byte sequence.

    Example:

    my %conv = ( "\x00\x3d" => 'key1', );
    The replacement might then be done by:
    $search = join '|', map quotemeta, keys %conv;
    while (<>) {
        s/($search)/$conv{$1}/geo;
    }
    please correct me anyone who sees I'm wrong.
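A minimal sketch of the byte-sequence idea, under assumptions that are not in the original reply: the data is UTF-16LE, so the low byte comes first and '=' is the byte pair 0x3d 0x00, and the replacement value is also stored as UTF-16LE bytes so the result stays valid. The helper name `replace_utf16le_bytes` is hypothetical. The regex only accepts matches that start on a 2-byte boundary, because an unanchored byte search can match across two adjacent characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical table keyed by the characters themselves.
my %conv = ( '=' => 'key1' );

# Derive byte-level keys and values in UTF-16LE (low byte first).
my %conv_bytes;
for my $char ( keys %conv ) {
    $conv_bytes{ encode( 'UTF-16LE', $char ) }
        = encode( 'UTF-16LE', $conv{$char} );
}
my $search = join '|', map { quotemeta } keys %conv_bytes;

sub replace_utf16le_bytes {
    my ($bytes) = @_;
    # \G plus a lazy run of 2-byte units keeps every match aligned on a
    # code-unit boundary; /s lets "." match the 0x0A byte as well.
    $bytes =~ s/\G((?:..)*?)($search)/$1$conv_bytes{$2}/gs;
    return $bytes;
}

my $in  = encode( 'UTF-16LE', "KeyStr=\x{4e28}" );
my $out = replace_utf16le_bytes($in);

# Bytes 41 3d 00 01 are the two characters U+3D41 and U+0100; the pair
# 3d 00 in the middle straddles them and must be left alone.
my $tricky    = "\x41\x3d\x00\x01";
my $untouched = replace_utf16le_bytes($tricky);
```

This works only while every key is a whole number of 2-byte units; decoding to characters first avoids the alignment problem altogether.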
      I'm pretty sure the problem is with my outputting my results, not the conversion method (see my follow up post below).

      Having said that, your method looks to be better (certainly quicker) than pattern matching on a list of hex values, which is what my code above does.

      Appreciate your response

Re: Search and replace in Unicode files
by bm (Hermit) on Jun 16, 2003 at 16:45 UTC
    I have tried to make my question a little more specific.

    The following code does not do what I expect.

    It is intended to:

  • read all hex values in a multibyte encoded file into memory,
  • copy the hex values to a new list,
  • create a new, empty Unicode::String object
  • set the hex values in the unicode object to the parsed hex values
  • open an output file for writing
  • print the Unicode object to the output file

    so basically it is similar in nature to a plain old copy.

    use Unicode::String;
    $o_in = 'in.txt';
    $o_out = 'out.txt';
    # get the input data into utf16 objects
    open ( F, $o_in ) || die;
    while (<F>) { push ( @raw, utf16($_)); }
    close (F) || die;
    # get a list of hex values for the data
    map { push (@hexout, $_->hex) } @raw;
    # now create an empty utf16 object
    Unicode::String->stringify_as( 'utf16' );
    my $outo = Unicode::String->new();
    # and define its hex values
    $outo->hex ( join '', @hexout);
    # dump the hex values
    open ( F, ">$o_out" ) || die;
    print F $outo;
    close (F) || die;

    But my input file differs from my output file.

    Does anyone have any idea why?
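One plausible contributor, offered as a guess rather than a confirmed diagnosis: `<F>` on a handle with no encoding layer reads bytes and ends each "line" at the first 0x0A byte, which in UTF-16LE is only the first half of the two-byte newline. Every later chunk then starts one byte off. The sketch below imitates that byte-level splitting on a small sample built with the core Encode module; the sample text is an assumption modelled on the file in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two short "lines" of UTF-16LE data with CRLF line endings.
my $bytes = encode( 'UTF-16LE', "[KEY]\r\n\x{4e28}\r\n" );

# Imitate a byte-oriented readline: cut after every 0x0A byte.
my @chunks = split /(?<=\x0a)/, $bytes;

for my $chunk (@chunks) {
    # Show each chunk's length and its bytes in hex.
    printf "%2d bytes: %s\n", length $chunk, unpack( 'H*', $chunk );
}
```

The second chunk starts with the 0x00 left over from the previous newline, so reading it as 16-bit units pairs the bytes up wrongly. That is consistent with the U+0028 U+4e0d U+000a runs in the hex dump posted earlier in this thread. On win32 a handle without binmode also translates line-ending bytes when writing, which would damage the output further.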

    Thanks in advance

      This is a tough post to respond to, because so much important information is scattered over a bunch of replies you've made to your own top node -- but thanks for providing all that info.

      Still, you haven't said which version of Perl you're using. This is important, because 5.6 has only partial unicode support, compared to 5.8 and its astonishing Encode module, which could make quite a difference here.

      This comment from your original post really stumped me:

      This file contains both ASCII and UTF-16 LE...

      I would normally interpret this to mean that some "characters" in the file are single-byte, while others are UTF-16, and frankly, that would be a really bad idea, unless you have some really clear, reliable signal in the file to mark the change-over from one encoding to the other. But then, based on one of your replies, I gather that the entire file is really utf-16, and that some of the 16-bit characters happen to represent ascii values in the unicode table (i.e. their high bytes are null). Fine.

      As for the best way to do what you want, see Perl 5.8 and its Encode module. In this version, utf8 can be used as the "native" internal encoding for strings, and you can read and write to files using this encoding, or have the characters converted to any other chosen (appropriate, supported) non-unicode encoding, either in memory, or via an "IO layer" attached to an input or output file handle (refer to the "PerlIO" modules).

      I'm sorry not to be more helpful -- I skipped directly from 5.5 to 5.8, and never had the opportunity/need to use the various modules like "Unicode::String" -- I think they are mostly superseded in 5.8. If you have a utf-16LE file, 5.8 would handle it like this:

      open( IN, "<:encoding(UTF-16LE)", "file.in" ) or die "file.in: $!";
      open( OUT, ">:utf8", "file.out" ) or die "file.out: $!";
      while (<IN>) {
          s/=/key1/;   # using utf8 internally makes this easy
          print OUT $_;
      }
      and of course, there would be other ways of doing the same thing, and allowing for things like checking that the input really conforms to the specified input encoding, and checking that the characters really can be converted into the specified output encoding. See the perluniintro, perlunicode, Encode and PerlIO docs.
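As one of those other ways, here is a hedged sketch that keeps the output in UTF-16LE instead of utf8 and drives the replacement from a hash of characters. The function name `convert_file`, the temporary file names and the sample text are assumptions made for illustration. The leading `:raw` is there so that no line-ending translation sits underneath the encoding layer on win32.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);
use File::Spec;

# Conversion table keyed by real characters, not hex strings.
my %conv   = ( '=' => 'key1' );
my $search = join '|', map { quotemeta } keys %conv;

# Hypothetical helper: copy a UTF-16LE file, applying %conv on the way.
sub convert_file {
    my ( $in_path, $out_path ) = @_;
    open my $in, '<:raw:encoding(UTF-16LE)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:raw:encoding(UTF-16LE)', $out_path
        or die "Cannot write $out_path: $!";
    while ( my $line = <$in> ) {
        # $line holds decoded characters here, so a plain regex is enough.
        $line =~ s/($search)/$conv{$1}/g;
        print {$out} $line;
    }
    close $in  or die "close $in_path: $!";
    close $out or die "close $out_path: $!";
    return;
}

# Small demonstration on a throwaway file.
my $dir      = tempdir( CLEANUP => 1 );
my $in_path  = File::Spec->catfile( $dir, 'in.txt' );
my $out_path = File::Spec->catfile( $dir, 'out.txt' );

open my $fh, '>:raw', $in_path or die "open: $!";
print {$fh} encode( 'UTF-16LE', "\x{feff}KeyStr=\x{4e28}\r\n" );
close $fh or die "close: $!";

convert_file( $in_path, $out_path );

# Read the result back as raw bytes for comparison.
open $fh, '<:raw', $out_path or die "open: $!";
my $got = do { local $/; <$fh> };
close $fh or die "close: $!";

my $expected = encode( 'UTF-16LE', "\x{feff}KeyStrkey1\x{4e28}\r\n" );
print $got eq $expected ? "round trip OK\n" : "round trip differs\n";
```

The byte order mark is treated as an ordinary U+FEFF character here, so it is read and written back unchanged at the start of the file.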

      One other point: in your reply where you showed the unicode data in hex, a couple of strings actually have a "NULL" (U+0000, right after the "=" that gets replaced by "key1", I think). I wonder if this had any bad impact on your process, or on your ability to check the results of the process...

        Encode and 5.8 was exactly what I was after, thanks.
        Hi Graff,

        But then, based on one of your replies, I gather that the entire file is really utf-16, and that some of the 16-bit characters happen to represent ascii values in the unicode table (ie. their high bytes are null). Fine.

        Yes, you are correct (I think) - the whole file is utf16, but some of the chars have null high bytes (the 'ascii' like values).

        I also have access to 5.8 (now), and am going to try to not make my life so hard by ignoring the fact that my data is utf-16LE encoded (other than the IO layers) - as demonstrated by your code snippet. Will reply to this thread if everything works out.

        I'm sorry not to be more helpful

        Wrong! You have clarified several things, and given me a direction - including a snippet - to go on.

        And for that: more and better karma to you. Thank you kindly,

        bm