
Converting Unicode

by BernieC (Pilgrim)
on Dec 01, 2023 at 23:20 UTC ( [id://11156024]=perlquestion )

BernieC has asked for the wisdom of the Perl Monks concerning the following question:

I can't get Unicode to work. I have a file in which, mixed in with regular text, there are the Unicode characters for open quote, close quote, and apostrophe. I've been trying to write a little program to replace them with non-Unicode characters {" " and ,}. I've tried this test program:
    # these are the constants from cryptfix
    # use constant APOSTROPHE => "" ;
    # use constant OPENQUOTE => "" ;
    # use constant CLOSEQUOTE => "" ;
    # use constant COMMA => "" ;

    # This is the version from a hex dump of the crypt text file
    use constant APOSTROPHE => "\x{e2}\x{80}\x{22}" ;
    use constant OPENQUOTE => "\x{e2}\x{80}\x{90}" ;
    use constant CLOSEQUOTE => "\x{e2}\x{80}\x{9d}" ;
    use constant COMMA => "" ;

    unless (@ARGV) { die "usage: <crypts-text-file>\n" ; }
    open(CRYPTS, "<", $ARGV[0]) or die "Can't open $ARGV[0]: $!" ;
    while (my $line = <CRYPTS>) {
        say "apostrophe in line $." if $line =~ /@{[APOSTROPHE]}/;
        say "open quote in line $." if $line =~ /@{[OPENQUOTE]}/;
        say "close quote in line $." if $line =~ /@{[CLOSEQUOTE]}/;
    }
and I feed it the text file with the unicode characters in it and it never finds any. I'm not sure what I'm getting wrong.

Replies are listed 'Best First'.
Re: Converting Unicode
by choroba (Cardinal) on Dec 01, 2023 at 23:36 UTC
    I see two problems:
    1. Some of the hex dumps are wrong. Are you sure the file is UTF-8 encoded?
    2. To be able to use the characters literally, you need to "use utf8" and specify the encoding when opening the file.
    The code below shows how to use both types of constants and how they interact with encodings. It reads its own source code, so you don't need any other additional file for testing.
        #! /usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use utf8;

        use constant APOSTROPHE => "’" ;
        use constant OPENQUOTE => "“" ;
        use constant CLOSEQUOTE => "”" ;
        use constant COMMA => "¸" ;

        # Fixed values.
        use constant APOSTROPHE2 => "\x{e2}\x{80}\x{99}" ;
        use constant OPENQUOTE2 => "\x{e2}\x{80}\x{9c}" ;
        use constant CLOSEQUOTE2 => "\x{e2}\x{80}\x{9d}" ;
        use constant COMMA2 => "\x{c2}\x{b8}" ;

        for my $encoding ("", ':encoding(UTF-8)') {
            open my $self, "<$encoding", __FILE__ or die $!;
            say "Encoding: $encoding";
            while (my $line = <$self>) {
                say "apostrophe in line $." if $line =~ /@{[APOSTROPHE]}/;
                say "open quote in line $." if $line =~ /@{[OPENQUOTE]}/;
                say "close quote in line $." if $line =~ /@{[CLOSEQUOTE]}/;
                say "comma in line $." if $line =~ /@{[COMMA]}/;

                say "apostrophe2 in line $." if $line =~ /@{[APOSTROPHE2]}/;
                say "open quote2 in line $." if $line =~ /@{[OPENQUOTE2]}/;
                say "close quote2 in line $." if $line =~ /@{[CLOSEQUOTE2]}/;
                say "comma2 in line $." if $line =~ /@{[COMMA2]}/;
            }
        }
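As a quick cross-check of the "fixed values" in the reply above, the following minimal sketch decodes each three-byte sequence with the core Encode module and prints the resulting code point. The hash names are illustrative only, and the sketch assumes the file really is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 byte sequences, as they would appear in a hex dump of the file.
my %bytes = (
    apostrophe  => "\xE2\x80\x99",
    open_quote  => "\xE2\x80\x9C",
    close_quote => "\xE2\x80\x9D",
);

# Decode each byte string into one character and record its code point.
my %code_point;
for my $name (sort keys %bytes) {
    my $char = decode('UTF-8', $bytes{$name});
    $code_point{$name} = sprintf 'U+%04X', ord $char;
    print "$name: $code_point{$name}\n";
}
```

A byte such as \x22 or \x90 in the third position would not decode to any of these characters, which is one way to spot a mistyped hex dump.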

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Converting Unicode
by hippo (Bishop) on Dec 01, 2023 at 23:37 UTC

    SSCCE please. You know how this works by now.


Re: Converting Unicode
by BillKSmith (Monsignor) on Dec 07, 2023 at 20:56 UTC
    Returning to your original problem, I would suggest a different approach. This sounds like a problem that has already been solved. If you tell us what program created the offending file, some monk probably could point you to that solution.

    If that does not work, specify your problem clearly by posting the following information.

    • Code points of the three offending characters
    • The hex encoding of each (from the file dump below).
    • Code points of the replacements.
    • Hex dump of a segment of the offending file. It should contain at least one sample of each of the three characters.
    • Hex dump of the same segment with the required corrections.
    • (optional) Your best guess of the file encoding.

    With this much information, we can suggest Perl solutions that will work.
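One possible way to collect the code points and hex encodings requested above is sketched here. It assumes the data is UTF-8; describe_non_ascii is a hypothetical helper name, and the sample string stands in for a line read from the real file in raw (undecoded) mode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: for every non-ASCII character in a UTF-8 byte string,
# report its code point and its encoded bytes in hex.
sub describe_non_ascii {
    my ($octets) = @_;
    # Strict decoding: dies if the bytes are not valid UTF-8.
    my $text = decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    my @report;
    while ($text =~ /([^\x00-\x7F])/g) {
        my $char = $1;
        my $hex  = join ' ', map { sprintf '%02x', ord($_) } split //, encode('UTF-8', $char);
        push @report, sprintf('U+%04X = %s', ord($char), $hex);
    }
    return @report;
}

# Sample line built from raw bytes, standing in for a line from the file.
my $sample = "It\xE2\x80\x99s \xE2\x80\x9Cquoted\xE2\x80\x9D";
print "$_\n" for describe_non_ascii($sample);
```

If the decode step dies, that is itself useful information: the file is probably not UTF-8, which answers the optional last item in the list.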

Re: Converting Unicode
by Polyglot (Chaplain) on Dec 02, 2023 at 16:32 UTC
    Perl is not yet fully Unicode compatible, even though we will soon ring in the year 2024. Perl's official documentation still sees security risks with Unicode, saying, for example: "Also, the use of Unicode may present security issues that aren't obvious, see 'Security Implications of Unicode' below." There are, however, some ways to get around this. One of those is to include pleas in your own code to use Unicode, such as these:
        use utf8;                        #FOR THE "wide characters" IN YOUR OWN CODE
        binmode STDIN, ":utf8";          #FOR INCOMING UTF8
        binmode STDOUT, ":utf8";         #FOR OUTGOING UTF8
        binmode STDERR, ":utf8";         #AND FOR ERRORS SEPARATELY
        use open qw/:std :utf8/;         #THIS ONE CAN BE PROBLEMATIC WITH DATABASE INTERACTIONS
        use open ':encoding(utf8)';      #ANOTHER WAY OF SAYING IT
        use feature 'unicode_strings';   #ANOTHER PART OF 'TMTOWTDI' FOR PERL UNICODE
    When it's someone else's code, however, the situation becomes more problematic. Be careful which modules you choose to incorporate.

    Of course, if these options fail, and the UTF8 characters are not quintessential to your application, you can also remove them all and stick with a pure-ASCII solution. This may cause the least headache if UTF8 is not important to you. You could then use virtually any modules, and have no issue with any I/O operations. But it will not be very future-proof.
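For the pure-ASCII route, a minimal sketch of the replacement step might look like the following. It assumes the text has already been decoded from UTF-8; asciify_quotes is a hypothetical helper name, and the choice of ASCII replacements is an assumption rather than something the original poster specified.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map curly single and double quotes to ASCII ' and ".
sub asciify_quotes {
    my ($text) = @_;
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

# Sample input built from raw UTF-8 bytes, then decoded into characters.
my $octets = "\xE2\x80\x9CIt\xE2\x80\x99s fine,\xE2\x80\x9D she said.";
my $text   = decode('UTF-8', $octets);

# The result is plain ASCII, so it can be printed without an encoding layer.
print asciify_quotes($text), "\n";
```

When reading from a file instead, opening it with "<:encoding(UTF-8)" delivers already-decoded lines, so the explicit decode call is not needed.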

    I look forward to the day when Perl has advanced to using unicode natively--by default. It's too bad that day is not already here.


      Pretty odd that you mention Perl doesn't support Unicode while showing it does, and pretty odd that you mention security risks then proceed to use :utf8 (whose non-validating nature can produce corrupt scalars) instead of :encoding(UTF-8).
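To illustrate what validation buys, here is a small sketch of strict decoding through the Encode module. It only shows the validating side (a byte string that is not valid UTF-8 is rejected instead of being turned into a string); it does not reproduce the behaviour of either I/O layer, and the sample bytes are made up for the demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# 0xFF can never appear in well-formed UTF-8.
my $malformed = "abc\xFFdef";

# With FB_CROAK, decode dies on malformed input; eval traps the exception.
my $accepted = eval {
    decode('UTF-8', $malformed, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

print $accepted ? "accepted\n" : "rejected: malformed UTF-8\n";
```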

        I didn't say Perl doesn't support Unicode. You're putting words in my mouth. Look closely at what I said.

        Is it true or is it not true that Perl is not using utf8 encodings natively? The very fact that one must tease Perl into using utf8, and the very fact that its developers still see security risks in its use are rather indicative of the fact that Perl is still a ways from fully adopting it.

        Regarding "use utf8;", is there any other incantation which Perl uses for embedding utf8 characters in one's code? As far as I know, this is still the standard way to tell Perl one's code includes utf8 characters. If that means, then, that there are legitimate security risks with putting UTF8 in my code, perhaps I should start learning Python, as the "Polyglot" username is indicative of the type of programming I do.



      I've been reading the documentation for Perl 6, aka "Raku"...and I think I'm falling in love again. Unlike Perl 5, Raku is UTF8-based, both in its code, and its I/O.

      In their words...from the "Lexical Conventions" entry HERE:

      Raku code is Unicode text. Current implementations support UTF-8 as the input encoding. See also Unicode versus ASCII symbols.
      And from the "Normalization" entry HERE:
      Raku applies normalization by default to all input and output except for file names, which are read and written as UTF8-C8; graphemes, which are user-visible forms of the characters, will use a normalized representation.
      Everything I've been reading, fits what I've been needing. Perhaps it's time for a new language. I'm on the verge of taking that plunge. The UTF8 issue has been troublesome for me with Perl5 for a long time, and is the proverbial straw that broke the camel's back--perhaps the pun is fitting.



        FYI Raku is formerly known as Perl 6, not alternatively. Raku is not Perl, despite being described in some places as a "sister language."

        The way forward always starts with a minimal test.

Approved by choroba
Front-paged by Corion