Substitute and converting to UTF8

tomred has asked for the wisdom of the Perl Monks concerning the following question:

Dear gurus,

Not for the first time I found myself duelling with character encoding.

I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF8 database table.

I often get caught by fatal errors, from Cpanel::JSON::XS:

"Error creating json: malformed or illegal unicode character in string"

utf8 "\x92" does not map to Unicode.

I'd like to create a hash of the most often offending characters and their substitutions but I can't seem to find a means of capturing the hex value. In the code below, I've cat'ed a sample into the bottom of the file.

What am I missing? Thanks in advance,

Dermot

use v5.22;
use warnings;

# Causes "Wide Character.." warning for :std
#use utf8;

my %swaps = (
    '91' => '‘',
    '92' => '’',
    '94' => '”',
    '96' => '–',
    'A9' => '©',
);


while (my $line = <DATA>) {
    say "Before=".$line;

    # Works.
    #my ($key) = $line =~ m/(\x92)/;
    # All fail
    #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/;
    #my ($key) = $line =~ m/(\x\d{2})/;
    #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i;
    #my ($key) = $line =~ m/(0x\d{1,3})/;
    #my ($key) = $line =~ m/(\0x\d{1,3})/;
    my ($key) = $line =~ m/0x(?-i:[\da-f]+)/;


    say "Key=$key";
    $line =~ s/$key/$swaps{$key}/g if $key;
    say "After=".$line;

}


# This sample has \x92 between "heart" and "s".
# In xxd, the line looks like this
# 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273  d pansy, heart.s

__DATA__
"Wild pansy (Viola tricolor), 19th century illustration","19th-century
+ hand painted illustration of wild pansy, heart<92>s ease, or love in
+ idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840
+). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pan
+sy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, 
+illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute,
+ bloom, blossom, botanical, botanist, bud, flora, floral, history, hi
+storic, horticulture, leaves, petal, petals, plant, vintage, watercol
+or, flower head, painting, stem, victorian style, botanic, flowers, p
+lants, Botany",,C,Fl,N/A,,,,^M
[download]

Comment on Substitute and converting to UTF8 Download Code

Replies are listed 'Best First'.
Re: Substitute and converting to UTF8 by choroba (Cardinal) on Jan 08, 2021 at 14:01 UTC
Rather than fixing the consequences of wrong input, fix the decoding of the input. When reading from the CSV, follow the documentation and use `my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # ~~~~~~~~~~~ open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!"; # ~~~~~~~~~~~~~~~` [download] When encoding to JSON, be sure to decode the string, as Cpanel::JSON::XS works with UTF-8 encoded strings. #!/usr/bin/perl use strict; use warnings; use feature qw{ say }; use charnames ':full'; use Cpanel::JSON::XS; use Encode qw{ encode }; my $decoded = qq({"yellow horse": "\N{LATIN SMALL LETTER Z WITH CARON} +lu\N{LATIN SMALL LETTER T WITH CARON}ou\N{LATIN SMALL LETTER C WITH C +ARON}k\N{LATIN SMALL LETTER Y WITH ACUTE} k\N{LATIN SMALL LETTER U WI +TH RING ABOVE}\N{LATIN SMALL LETTER N WITH CARON}"}); my $encoded = encode('UTF-8', $decoded); my $structure = decode_json($encoded); binmode STDOUT, ':encoding(UTF-8)'; say $structure->{'yellow horse'}; [download] Note that I also set the encoding of the output handle. I can't show you how to set the encoding/decoding properly for the database as you haven't told us what driver you use. NOTE:* In real life, I'd use `use utf8;` and type "žluťoučký kůň" directly in the script, but PerlMonks can't display code containing non-latin1 characters. Some people even recommend to write your code this way, but I find utf-8 more readable. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re^2: Substitute and converting to UTF8 by tomred (Acolyte) on Jan 08, 2021 at 15:15 UTC
The exiting (production) code uses the open pragma. use open qw/:std :encoding(utf8)/; From what I tell, that forces all $fh to UTF-8. After that, I found all hex-type substitution fail. As I've been labouring with the intention of substituting, I've been testing with '<:raw' instead. But it looks like the encode step is what's needed. I can at least encode_json now. use v5.22; use warnings; use Devel::Dwarn; use Cpanel::JSON::XS; use Encode qw/ decode encode /; use Text::CSV_XS; my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 }); while (my $line = <DATA>) { say "Before=".$line; my $string = encode('UTF8', $line); my $x = $csv->parse($string); warn $x if !$x; my @data = $csv->fields; Dwarn \@data; my $structure = encode_json(\@data); } __DATA__ IR_Choix_0092.tif,C,psi,,"Wild pansy (Viola tricolor), 19th century il +lustration","19th-century hand painted illustration of wild pansy, he +art<92>s ease, or love in idleness (Viola tricolor) flower by Pierre- +Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs +, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th cent +ury, painted, Engraving, illustration, nobody, no-one, flower, artwor +k, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, f +lora, floral, history, historic, horticulture, leaves, petal, petals, + plant, vintage, watercolor, flower head, painting, stem, victorian s +tyle, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M [download] A big thank you. It's been a very frustrating day. Dermot	[reply] [d/l]
Re: Substitute and converting to UTF8 by hippo (Archbishop) on Jan 08, 2021 at 13:59 UTC
I'd like to create a hash of the most often offending characters and their substitutions You could do it that way but it's a losing battle. Is there some reason not to decode your CSV data from whatever encoding it really uses (as it is clearly not utf-8) and then process that decoded data? That ought to be the simpler approach. 🦛	[reply]