in reply to Substitute and converting to UTF8

Rather than fixing the consequences of wrong input, fix the decoding of the input.

When reading from the CSV, follow the documentation and use:

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
    #                              ~~~~~~~~~~~
    open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
    #              ~~~~~~~~~~~~~~~
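To see what the decoding layer buys you, here is a minimal sketch of the same two calls reading from an in-memory file. The sample data and the variable names ($bytes, @rows) are made up for illustration; only the constructor options and the open mode come from the snippet above.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample data: UTF-8 encoded bytes standing in for test.csv.
my $bytes = "name,note\n"
          . "\"k\xc5\xaf\xc5\x88\",\"\xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd\"\n";

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# The :encoding layer decodes bytes into characters while the handle is read.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "in-memory file: $!";

my @rows;
while (my $row = $csv->getline($fh)) {
    push @rows, $row;
}
close $fh;

# The first field of the data row is five bytes on disk,
# but after decoding it is counted in characters.
print length($rows[1][0]), "\n";
```

With the layer in place, every field that reaches the rest of the program is already a character string, so no substitution of individual bytes should be needed later.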

When decoding JSON, be sure to encode the string first, as decode_json in Cpanel::JSON::XS works with UTF-8 encoded strings.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature qw{ say };
    use charnames ':full';

    use Cpanel::JSON::XS;
    use Encode qw{ encode };

    my $decoded = qq({"yellow horse": "\N{LATIN SMALL LETTER Z WITH CARON}lu\N{LATIN SMALL LETTER T WITH CARON}ou\N{LATIN SMALL LETTER C WITH CARON}k\N{LATIN SMALL LETTER Y WITH ACUTE} k\N{LATIN SMALL LETTER U WITH RING ABOVE}\N{LATIN SMALL LETTER N WITH CARON}"});
    my $encoded = encode('UTF-8', $decoded);
    my $structure = decode_json($encoded);

    binmode *STDOUT, ':encoding(UTF-8)';
    say $structure->{'yellow horse'};

Note that I also set the encoding of the output handle.
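The split between bytes and characters in the script above can be sketched a little further. This is only an illustration under my own assumptions: the key "horse" and the variable names are invented, and the object-oriented call is shown merely as the character-string counterpart of decode_json.

```perl
use strict;
use warnings;
use Cpanel::JSON::XS;
use Encode qw{ encode decode };

# A character string: one code point per letter.
my $chars = qq({"horse": "k\x{16f}\x{148}"});

# decode_json wants UTF-8 encoded bytes, so encode first.
my $bytes      = encode('UTF-8', $chars);
my $from_bytes = decode_json($bytes);

# An object created without ->utf8 takes the character string directly.
my $from_chars = Cpanel::JSON::XS->new->decode($chars);

# encode_json goes the other way and returns UTF-8 encoded bytes;
# decode them if you need a character string again.
my $json_bytes = encode_json($from_bytes);
my $round_trip = decode('UTF-8', $json_bytes);

# Both decoded values are counted in characters.
print length($from_bytes->{horse}), " ", length($from_chars->{horse}), "\n";
```

Whichever form you pick, the point is to decode once on the way in and encode once on the way out, rather than patching bytes in the middle.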

I can't show you how to set the encoding/decoding properly for the database as you haven't told us what driver you use.

NOTE: In real life, I'd add use utf8; and type "žluťoučký kůň" directly in the script, but PerlMonks can't display code containing non-Latin-1 characters. Some people even recommend writing your code with \N{...} names as above, but I find UTF-8 more readable.

Re^2: Substitute and converting to UTF8
by tomred (Acolyte) on Jan 08, 2021 at 15:15 UTC

    The existing (production) code uses the open pragma.

    use open qw/:std :encoding(utf8)/;

    From what I can tell, that forces every $fh to UTF-8. After that, I found that all my hex-type substitutions failed. As I've been labouring with the intention of substituting, I've been testing with '<:raw' instead.

    But it looks like the encode step is what's needed. I can at least encode_json now.

        use v5.22;
        use warnings;
        use Devel::Dwarn;
        use Cpanel::JSON::XS;
        use Encode qw/ decode encode /;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

        while (my $line = <DATA>) {
            say "Before=".$line;
            my $string = encode('UTF8', $line);
            my $x = $csv->parse($string);
            warn $x if !$x;
            my @data = $csv->fields;
            Dwarn \@data;
            my $structure = encode_json(\@data);
        }
        __DATA__
        IR_Choix_0092.tif,C,psi,,"Wild pansy (Viola tricolor), 19th century illustration","19th-century hand painted illustration of wild pansy, heart<92>s ease, or love in idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, flora, floral, history, historic, horticulture, leaves, petal, petals, plant, vintage, watercolor, flower head, painting, stem, victorian style, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M

    A big thank you. It's been a very frustrating day.

    Dermot