PerlMonks  

Re: Latin Extended Additional and Unicode::String

by bpphillips (Friar)
on Jun 16, 2006 at 14:14 UTC ( [id://555788]=note )


in reply to Latin Extended Additional and Unicode::String

It might help for you to post the actual Unicode::String and Unicode::Map code you've attempted. Otherwise, here are some general principles.

- Encode is the preferred way of dealing with various character encodings; check out its documentation.
- You need to know the character encoding of the incoming data. From the brief description you gave in your post, it appears you might be dealing with UTF-16 data. If that's the case, the conversion is a simple one. Something like this untested code:
  use Encode qw(decode);
  my $utf8_data = decode("UTF-16", $input);
-- Brian

Re^2: Latin Extended Additional and Unicode::String
by Anonymous Monk on Jun 16, 2006 at 17:07 UTC
    Thank you guys for your insights. Here is the code I am using. I tried using Encode but I am getting this error message: UTF-16:Unrecognised BOM 5468 at c:/Perl/lib/Encode.pm line 164. I had no idea that Unicode::String is obsolete! Is it really?
    use strict;
    open (OUT, ">c:\\temp\\test.doc");
    select OUT;
    my $string = "Thu'a quư v\1ECB ngu'̣'i Viêt, quư v\1ECB là thành phân";
    #use Unicode::String qw(utf8);
    #my $record_ref = utf8("$_");
    #my $record = $record_ref->utf8;
    #print $record;
    use Encode qw(decode);
    my $utf8_data = decode("UTF-16", $string);
    print $utf8_data;
    close OUT;
    select STDOUT;
    print "Finished";
      As your code reveals, you're not actually dealing with UTF-16 data, and it appears you would benefit from reading perluniintro. Specifically, take a look at how to include UTF-8 data in your Perl program (use utf8; see utf8), how to include Unicode characters for code points above 0xFF ($string = "\x{1ECB}" is the 'i' with a dot below; see Creating Unicode), and how to change the encoding of a filehandle (binmode(OUT, ':utf8'); see Unicode I/O).
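A minimal sketch of those three points together, assuming the script itself is saved as UTF-8. The variable names are made up for illustration, and an in-memory filehandle stands in for the real output file so that nothing is written to disk:

```perl
use strict;
use warnings;
use utf8;    # tells perl that this source file is UTF-8 encoded

# Two ways to get the same two characters into a string:
my $literal = "vị";           # typed directly, which relies on `use utf8`
my $escaped = "v\x{1ECB}";    # by code point, works with or without `use utf8`

print "same string\n" if $literal eq $escaped;

# Write through a :utf8 layer. The in-memory handle is only for illustration.
my $utf8_bytes = '';
open(my $out, '>', \$utf8_bytes) or die "Cannot open in-memory handle: $!";
binmode($out, ':utf8');
print {$out} $escaped;
close($out);

# 'v' takes one byte in UTF-8 and U+1ECB takes three.
printf "%d characters became %d bytes\n", length($escaped), length($utf8_bytes);
```

With a real file you would open it by name and call binmode on that handle in the same way.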

      Unicode is a difficult animal to tackle. I've been dealing with it for the last couple of years and I'm just now feeling like I have a handle on it. I've found the chapter on Unicode in the second edition of Advanced Perl Programming very helpful.

      -- Brian
      The "UTF-16:Unrecognised BOM 5468" error is a subtle thing here, but understanding it might help in understanding something about Unicode.

      UTF-16 is one way to represent Unicode code points (the numbers that stand for the characters), using 16-bit (2-byte) storage units; code points that don't fit in 16 bits take two such units. The problem is that, to do this, you have to decide whether the storage is "big-endian" or "little-endian" (that is, you have to choose one byte order or the other, presumably based on what sort of CPU you have).

      That's why Unicode defines two specific versions of UTF-16: LE and BE. If you just say that some data is in "UTF-16" without a byte-order specifier (i.e. UTF-16LE or UTF-16BE), then there needs to be an indicator in the data itself to say what the byte order is supposed to be, and this indicator has to be the first two bytes of data.

      The standard indicator is called the "Byte Order Mark" (BOM), and its Unicode code point is "\x{FEFF}". The useful thing about this value is that if you read its bytes in the wrong order (as "\x{FFFE}"), the result is a value that Unicode guarantees is never a valid character (U+FFFE is a noncharacter). This way, you know right away what the byte order for the data is supposed to be.

      In the case of the code you posted, the first two bytes of your string were "Th", and when you told perl that this was supposed to be a UTF-16 string (which, by the way, it most certainly is not), perl looked at those first two bytes (\x54, \x68, hence the "5468" in the error message) and said "this is not a BOM, and you haven't put 'LE' or 'BE' next to the 'UTF-16' label, so I'm giving up."

      (In order for that string to have been UTF-16, each ASCII character in it would need a null byte next to it, to pad it out to 16 bits.)
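The BOM behaviour described above can be reproduced with a short sketch. The byte strings here are invented for illustration; only decode() from the Encode module is used:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Plain ASCII bytes, like the start of the posted string: no BOM in front.
my $no_bom = "Thua";
my $result = eval { decode("UTF-16", $no_bom) };
my $error  = $@;
print "decode failed: $error" unless defined $result;

# The same label works once the data starts with a BOM. Here the BOM bytes
# FE FF announce big-endian, and each ASCII letter is padded with a null byte.
my $with_bom = "\xFE\xFF\x00T\x00h";
my $from_bom = decode("UTF-16", $with_bom);

# Alternatively, name the byte order explicitly, and no BOM is needed.
my $from_le = decode("UTF-16LE", "T\x00h\x00");

print "both decode to 'Th'\n" if $from_bom eq "Th" && $from_le eq "Th";
```

The first call is wrapped in eval because decode() dies on the missing BOM, which is the failure reported in the original post.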

      Anyway, since you are using Perl 5.8, you can specify "Latin extended" characters using their unicode codepoints like this:

      $string = "there is a diacritic on this letter: \x{1ECB}\n";
      Perl will automatically set an internal flag on that string to indicate that it contains one or more utf8 wide characters; to print it to a file as utf8, you'll want to do  binmode( $fh, ":utf8" ) -- that is, set the output "discipline" for the file handle to be utf8.

      If you need to output the wide-character data in some other encoding (e.g. UTF-16LE), use an :encoding(...) layer naming that encoding in the binmode call instead of :utf8, for example binmode( $fh, ":encoding(UTF-16LE)" ), and perl will convert the data for you on output. The perlunicode man page is another good source of information on all this.
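As a rough sketch of output in another encoding, the following writes one wide character through a UTF-16LE layer. The in-memory filehandle and the variable names are assumptions made for the example, not part of the original code:

```perl
use strict;
use warnings;

my $wide = "v\x{1ECB}";    # 'v' followed by U+1ECB

# An in-memory handle keeps the sketch free of side effects.
my $utf16_bytes = '';
open(my $fh, '>', \$utf16_bytes) or die "Cannot open in-memory handle: $!";
binmode($fh, ':encoding(UTF-16LE)');
print {$fh} $wide;
close($fh);

# Each character became two bytes, low byte first: 76 00 CB 1E.
print join(' ', map { sprintf '%02X', ord } split //, $utf16_bytes), "\n";
```

Because the byte order is named explicitly, no BOM is written; a reader of the file has to be told that the data is UTF-16LE.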

      It appears that the Unicode::String code you mentioned is commented out in your case :)

      So you're not using the obsolete way, and you'll achieve your goal correctly (even though you didn't know this :)

      Instead of writing \1ECB you need \x{1ECB} and you'll be fine...
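A small sketch of why the two spellings differ: inside a double-quoted string, \1 is read as an octal escape, so "\1ECB" is the control character chr(1) followed by the letters E, C and B. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $wrong = "\1ECB";       # octal escape \1, then the literal text "ECB"
my $right = "\x{1ECB}";    # the single code point U+1ECB

printf "wrong: %d characters, first code point U+%04X\n",
    length($wrong), ord($wrong);
printf "right: %d character, code point U+%04X\n",
    length($right), ord($right);
```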
