tomred has asked for the wisdom of the Perl Monks concerning the following question:

Dear gurus,

Not for the first time I found myself duelling with character encoding.

I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF8 database table.

I often get caught by fatal errors, from Cpanel::JSON::XS:

"Error creating json: malformed or illegal unicode character in string"
or
utf8 "\x92" does not map to Unicode.

I'd like to create a hash of the most often offending characters and their substitutions but I can't seem to find a means of capturing the hex value. In the code below, I've cat'ed a sample into the bottom of the file.

What am I missing? Thanks in advance,

Dermot
use v5.22; use warnings; # Causes "Wide Character.." warning for :std #use utf8; my %swaps = ( '91' => '‘', '92' => '’', '94' => '”', '96' => '–', 'A9' => '©', ); while (my $line = <DATA>) { say "Before=".$line; # Works. #my ($key) = $line =~ m/(\x92)/; # All fail #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/; #my ($key) = $line =~ m/(\x\d{2})/; #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i; #my ($key) = $line =~ m/(0x\d{1,3})/; #my ($key) = $line =~ m/(\0x\d{1,3})/; my ($key) = $line =~ m/0x(?-i:[\da-f]+)/; say "Key=$key"; $line =~ s/$key/$swaps{$key}/g if $key; say "After=".$line; } # This sample has \x92 between "heart" and "s". # In xxd, the line looks like this # 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273 d pansy, heart.s __DATA__ "Wild pansy (Viola tricolor), 19th century illustration","19th-century + hand painted illustration of wild pansy, heart<92>s ease, or love in + idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840 +). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pan +sy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, +illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute, + bloom, blossom, botanical, botanist, bud, flora, floral, history, hi +storic, horticulture, leaves, petal, petals, plant, vintage, watercol +or, flower head, painting, stem, victorian style, botanic, flowers, p +lants, Botany",,C,Fl,N/A,,,,^M

Replies are listed 'Best First'.
Re: Substitute and converting to UTF8
by choroba (Cardinal) on Jan 08, 2021 at 14:01 UTC
    Rather than fixing the consequences of wrong input, fix the decoding of the input.

    When reading from the CSV, follow the documentation and use

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # ~~~~~~~~~~~ open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!"; # ~~~~~~~~~~~~~~~

    When encoding to JSON, be sure to decode the string, as Cpanel::JSON::XS works with UTF-8 encoded strings.

    #!/usr/bin/perl use strict; use warnings; use feature qw{ say }; use charnames ':full'; use Cpanel::JSON::XS; use Encode qw{ encode }; my $decoded = qq({"yellow horse": "\N{LATIN SMALL LETTER Z WITH CARON} +lu\N{LATIN SMALL LETTER T WITH CARON}ou\N{LATIN SMALL LETTER C WITH C +ARON}k\N{LATIN SMALL LETTER Y WITH ACUTE} k\N{LATIN SMALL LETTER U WI +TH RING ABOVE}\N{LATIN SMALL LETTER N WITH CARON}"}); my $encoded = encode('UTF-8', $decoded); my $structure = decode_json($encoded); binmode *STDOUT, ':encoding(UTF-8)'; say $structure->{'yellow horse'};

    Note that I also set the encoding of the output handle.

    I can't show you how to set the encoding/decoding properly for the database as you haven't told us what driver you use.

    NOTE: In real life, I'd use use utf8; and type "žluťoučký kůň" directly in the script, but PerlMonks can't display code containing non-latin1 characters. Some people even recommend to write your code this way, but I find utf-8 more readable.

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      The exiting (production) code uses the open pragma.

      use open qw/:std :encoding(utf8)/;

      From what I tell, that forces all $fh to UTF-8. After that, I found all hex-type substitution fail. As I've been labouring with the intention of substituting, I've been testing with '<:raw' instead.

      But it looks like the encode step is what's needed. I can at least encode_json now.

      use v5.22; use warnings; use Devel::Dwarn; use Cpanel::JSON::XS; use Encode qw/ decode encode /; use Text::CSV_XS; my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 }); while (my $line = <DATA>) { say "Before=".$line; my $string = encode('UTF8', $line); my $x = $csv->parse($string); warn $x if !$x; my @data = $csv->fields; Dwarn \@data; my $structure = encode_json(\@data); } __DATA__ IR_Choix_0092.tif,C,psi,,"Wild pansy (Viola tricolor), 19th century il +lustration","19th-century hand painted illustration of wild pansy, he +art<92>s ease, or love in idleness (Viola tricolor) flower by Pierre- +Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs +, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th cent +ury, painted, Engraving, illustration, nobody, no-one, flower, artwor +k, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, f +lora, floral, history, historic, horticulture, leaves, petal, petals, + plant, vintage, watercolor, flower head, painting, stem, victorian s +tyle, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M

      A big thank you. It's been a very frustrating day.

      Dermot
Re: Substitute and converting to UTF8
by hippo (Archbishop) on Jan 08, 2021 at 13:59 UTC
    I'd like to create a hash of the most often offending characters and their substitutions

    You could do it that way but it's a losing battle. Is there some reason not to decode your CSV data from whatever encoding it really uses (as it is clearly not utf-8) and then process that decoded data? That ought to be the simpler approach.


    🦛