Before I begin, a confession: I am very much a newbie to the Unicode world.

I am attempting to do a search and replace on the following Unicode-encoded file:

[KEY]
一
丨
丿
丶
乛
KeyStr=一丨丿丶乛
KeyMap=一丨丿丶乛
[DICT]

This file contains both ASCII-range characters and CJK Unified Ideographs, encoded as UTF-16 LE. I have a hash containing the characters to search for and their replacement strings, for example:

my %conv =  (
    'U\+003d'    =>    'key1',        
);

where the key is the character to search for, written as a Unicode hex value, and the value is its replacement, written as a plain ASCII (UTF-8) string. So the hash above says that the equals sign '=' (hex U+003d) should be replaced by the string 'key1'. I use Unicode::String to get the hex values for each line of my input file. My approach is to slurp the data from the input file into a list of utf16 objects:

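use Unicode::String qw(utf8 utf16);

# F is the input filehandle; each line becomes a Unicode::String object
while (<F>) {
    push @raw, utf16($_);
}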

Then, for each element in @raw, I get its hex values, and if a key in %conv is found in the hex string, that hex code is replaced with the UTF-16 hex value of the replacement string:

my @hexout;
foreach my $line ( @raw ) {
    # get the hex values for the input line
    my $hvs = $line->hex;
    # now loop over the conversions
    foreach my $hexkey ( keys %conv ) {
        # if this char to be replaced is in
        # this line, do a conversion
        if ( $hvs =~ /$hexkey/ ) {
            # replacement chars are utf8...
            my $newStr8  = utf8($conv{$hexkey});
            # ...so convert it to utf16
            my $newStr16 = $newStr8->utf16;
            my $Str16Obj = utf16($newStr16);
            # and get its hex value in utf16 format
            my $newhex = $Str16Obj->hex;
            print "Replaced $hexkey with ",
                  "$conv{$hexkey} (hex: $newhex) in line\n",
                  "(hex: $hvs)\n" if $debug;
            # do the conversion
            $hvs =~ s/$hexkey/$newhex/g;
        }
    }
    # maintain a list of output hex values.
    push @hexout, $hvs;
}

So, by the end of that code snippet, @hexout contains a list of Unicode hex values to be dumped to a new file. This seems to work OK when I compare @hexout to the hex values in @raw: only the hex values I intend to change are actually changed (003d in the example above).
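For reference, here is a minimal sketch of how a hex string like the ones held in @hexout might be turned back into a Perl character string. It assumes the hex values are whitespace-separated tokens of the form U+XXXX and that every code point lies in the Basic Multilingual Plane (no surrogate pairs). The name hex_to_chars is a hypothetical helper, not part of Unicode::String.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a hex dump such as "U+4e00 U+003d" into a
# Perl character string. Assumes "U+XXXX" tokens and BMP code points
# only; surrogate pairs are not recombined.
sub hex_to_chars {
    my ($hex_string) = @_;
    # Capture only the hex digits that follow each "U+" prefix.
    my @code_points = $hex_string =~ /U\+([0-9A-Fa-f]+)/g;
    return join '', map { chr(hex($_)) } @code_points;
}
```

Note that hex() is only given the digits here. Passed a whole token such as "U+003d", it would not parse the "U+" prefix.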

My (current :) problem comes when I output the values in @hexout. I have tried this:

Unicode::String->stringify_as( 'utf16' );
my $out16 = Unicode::String->new();
$out16->hex ( join '', @hexout);
print OUTFILE $out16;
close F or die "Cannot close input file: $!";

and a similar approach using utf8 instead of utf16.

I also tried:

print OUTFILE chr hex foreach ( split /\s*/, $outhex );

but this seems to lose all formatting completely, and wide characters combined with print make me nervous. Either way, my output file is mangled even though the output hex values are identical (other than the conversions) to my input hex values!

So, if you have got this far, does anyone know how I can output these Unicode hex values into a new file correctly? My current output files are almost correct, but some characters are not coming out correctly, even though they are not touched during the conversion (the actual conversions look just fine!).
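One thing that may be worth ruling out on Win32 is newline translation. A filehandle left in text mode rewrites each 0x0A byte as 0x0D 0x0A on output, and in UTF-16 LE that byte also occurs inside unrelated characters (U+4E0A, for example, is encoded as 0A 4E). That could damage characters the conversion never touched. The sketch below writes the output as raw bytes; write_utf16le is a hypothetical helper name, and it assumes the text is already a decoded Perl character string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: write a character string as UTF-16LE with a BOM.
sub write_utf16le {
    my ($path, $text) = @_;
    open my $out, '>', $path or die "Cannot open $path: $!";
    # binmode stops the Win32 text layer from turning every 0x0A byte
    # into 0x0D 0x0A, which would split two-byte code units.
    binmode $out;
    print {$out} "\xFF\xFE";                 # UTF-16LE byte order mark
    print {$out} encode('UTF-16LE', $text);  # encoded without a BOM
    close $out or die "Cannot close $path: $!";
    return;
}
```

Because the handle is in binary mode, line endings are written exactly as they appear in the string, so a file meant for Windows editors would need "\r\n" in the text itself.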

Also, as a secondary question, is there a better way to do a Unicode search and replace than getting down to hex values? I would like to keep this encoding-independent, which sort of steered me away from regexes (plus I don't have MRE handy).

I am using Perl 5.8 on Win32. I view the Unicode files in a Unicode editor called SC UniPad, or in MS Word 2000.
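One possibility, sketched here under assumptions rather than offered as a tested solution: decode the bytes into Perl characters with the core Encode module, run an ordinary substitution on the characters, and encode the result again. The encoding is then just a parameter. The name replace_chars is hypothetical, and the map is assumed to be keyed by single characters rather than by hex notation.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode bytes, replace characters, and re-encode.
# %$map is keyed by single characters, e.g. '=' => 'key1'.
sub replace_chars {
    my ($bytes, $encoding, $map) = @_;
    my $text = decode($encoding, $bytes);
    # Build one alternation of all keys; quotemeta guards characters
    # such as '=' or '+' that could be special inside a regex.
    my $pattern = join '|', map { quotemeta($_) } sort keys %$map;
    if ($pattern ne '') {
        $text =~ s/($pattern)/$map->{$1}/g;
    }
    return encode($encoding, $text);
}
```

With the hash keyed by characters, the U+003d entry would be written as '=' => 'key1', or as chr(0x3d) => 'key1' when only the code point is known. The same call should then work for 'UTF-16LE' and 'UTF-8' input alike.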

Your help is greatly appreciated. I feel very lost.

update (broquaint): fixed title typo and added <readmore> tag

In reply to Search and replace in Unicode files by bm
