I am attempting to do a search and replace on the following Unicode encoded file:
[KEY] һ ح د ؼ ^ KeyStr=һحدؼ^ KeyMap=һحدؼ^ [DICT]
This file contains both ASCII and UTF-16 LE (CJK Unified Ideograph) encoded characters. I have a hash containing the characters to search for, and their replacement string, for example:
my %conv = ( 'U\+003d' => 'key1', );
where the key is the character to search for, written as a Unicode hex value, and the value is the replacement, written as a normal utf8/ASCII string (so the above hash says that the equals sign '=', hex U+003d, should be replaced by the string 'key1'). I use Unicode::String to get the hex values for each line of my input file. My approach is to slurp the data from the input file into a list of utf16 objects:
    while (<F>) { push ( @raw, utf16($_)); }
Then, for each element in @raw, I get its hex values, and if a key in %conv is found in the hex string, that hex code is replaced with the replacement string's utf16 hex value:
    my @hexout;
    foreach my $line ( @raw ) {
        # get the hex values for the input line
        my $hvs = $line->hex;
        # now loop over the conversions
        foreach my $hexkey ( keys %keyconv ) {
            # if this char to be replaced is in
            # this line, do a conversion
            if ( $hvs =~ /$hexkey/ ) {
                # replacement chars are utf8...
                my $newStr8 = utf8($keyconv{$hexkey});
                # ...so convert it to utf16
                my $newStr16 = $newStr8->utf16;
                my $Str16Obj = utf16($newStr16);
                # and get its hex value in utf16 format
                my $newhex = $Str16Obj->hex;
                print "Replaced $hexkey with ",
                      "$newStr8 (hex: $newhex) in line\n",
                      "(hex: $hvs)\n" if $debug;
                # do the conversion
                $hvs =~ s/$hexkey/$newhex/g;
            }
        }
        # maintain a list of output hex values.
        push @hexout, $hvs;
    }
So, by the end of that code snippet, @hexout contains a list of Unicode hex values to be dumped to a new file. This seems to work OK when I compare @hexout to the hex values in @raw: only the hex values I intend to change are actually changed (003d in the example above).
My (current :) problem comes when I output the values in @hexout. I have tried this:
    Unicode::String->stringify_as( 'utf16' );
    my $out16 = Unicode::String->new();
    $out16->hex ( join '', @hexout);
    print OUTFILE $out16;
    close(F) || die;
and a similar approach using utf8 instead of utf16.
Also tried
print OUTFILE chr hex foreach ( split /\s*/, $outhex );
but this seems to lose all formatting completely, and wide chars and print make me nervous. But my output file is mangled even though the output hex values are identical (other than the conversions) to my input hex values!
So, if you have got this far, does anyone know how I can output these Unicode hex values into a new file correctly? My current output files are almost correct, but some chars are not coming out correctly, even though they are not touched during the conversion (the actual conversions look just fine!).
Also, as a secondary question, is there a better way to do a Unicode search and replace other than getting down to hex values? I would like to keep this encoding independent, which sort of steered me away from regex's (plus I don't have MRE handy).
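For illustration, here is a minimal sketch of what a character-level replace (rather than a hex-string one) might look like. It assumes the core Encode module that ships with Perl 5.8 and assumes the data really is UTF-16 LE. The sub name convert_utf16le and the table keyed by bare hex code points are hypothetical, not part of the code above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical conversion table: hex code point => replacement text.
my %conv = ( '003d' => 'key1' );

# Decode UTF-16LE bytes to characters, replace the listed characters
# in a single pass, and encode the result back to UTF-16LE bytes.
sub convert_utf16le {
    my ($bytes, $conv) = @_;
    return $bytes unless %$conv;    # nothing to replace

    # Map each character (not its hex name) to its replacement.
    my %by_char = map { chr(hex $_) => $conv->{$_} } keys %$conv;
    my $re      = join '|', map { quotemeta } keys %by_char;

    my $text = decode('UTF-16LE', $bytes);
    $text =~ s/($re)/$by_char{$1}/g;
    return encode('UTF-16LE', $text);
}
```

To use this on files, both handles would need binmode (for example, binmode(F) and binmode(OUTFILE)) so the bytes pass through untouched. On win32, a handle left in text mode translates line endings on the way in and out, which can insert or drop single bytes in the middle of 16-bit data; that may be one reason untouched characters come out wrong.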
I am using Perl 5.8 on win32. I am viewing the Unicode files in a Unicode editor called SC UniPad or even MS Word 2000.
Your help is greatly appreciated. I feel very lost.
bm
update (broquaint): fixed title typo and added <readmore> tag