tomred has asked for the wisdom of the Perl Monks concerning the following question:
Dear gurus,
Not for the first time I found myself duelling with character encoding.
I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF8 database table.
I often get caught by fatal errors from Cpanel::JSON::XS:
"Error creating json: malformed or illegal unicode character in string" or
utf8 "\x92" does not map to Unicode.
I'd like to create a hash of the most often offending characters and their substitutions but I can't seem to find a means of capturing the hex value. In the code below, I've cat'ed a sample into the bottom of the file.
What am I missing? Thanks in advance,
Dermot

    use v5.22;
    use warnings;
    # Causes "Wide Character.." warning for :std
    #use utf8;

    my %swaps = (
        '91' => '‘',
        '92' => '’',
        '94' => '”',
        '96' => '–',
        'A9' => '©',
    );

    while (my $line = <DATA>) {
        say "Before=".$line;
        # Works.
        #my ($key) = $line =~ m/(\x92)/;
        # All fail
        #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/;
        #my ($key) = $line =~ m/(\x\d{2})/;
        #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i;
        #my ($key) = $line =~ m/(0x\d{1,3})/;
        #my ($key) = $line =~ m/(\0x\d{1,3})/;
        my ($key) = $line =~ m/0x(?-i:[\da-f]+)/;
        say "Key=$key";
        $line =~ s/$key/$swaps{$key}/g if $key;
        say "After=".$line;
    }
    # This sample has \x92 between "heart" and "s".
    # In xxd, the line looks like this
    # 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273  d pansy, heart.s
    __DATA__
    "Wild pansy (Viola tricolor), 19th century illustration","19th-century hand painted illustration of wild pansy, heart<92>s ease, or love in idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, flora, floral, history, historic, horticulture, leaves, petal, petals, plant, vintage, watercolor, flower head, painting, stem, victorian style, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M
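For readers following along, here is a minimal sketch of one way the capture described in the question might be approached. It rests on two assumptions that are not confirmed anywhere in the thread: that the line is read as raw bytes (no encoding layer on the filehandle), and that the stray bytes are Windows-1252. The sample string and the names %seen, $text and $utf8 are made up for illustration. The idea is that the data holds the single byte 0x92 rather than the four characters "0x92", so the pattern matches the byte itself and ord() recovers its value.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: raw bytes as read from a file without an encoding layer.
my $line = "heart\x92s ease \xA9 1827";

# Tally every byte outside the ASCII range, keyed by its two-digit hex value.
my %seen;
while ($line =~ /([\x80-\xFF])/g) {
    $seen{ sprintf('%02X', ord($1)) }++;
}

# Assumption: the source encoding is Windows-1252 (cp1252).
# decode() turns the bytes into a Perl character string;
# encode() produces UTF-8 bytes when bytes are what the next step wants.
my $text = decode('cp1252', $line);
my $utf8 = encode('UTF-8', $text);

# Report the offending bytes that were found.
printf "%s seen %d time(s)\n", $_, $seen{$_} for sort keys %seen;
```

The keys of %seen have the same shape as the keys of %swaps in the question ('92', 'A9'), so the tally could be used to decide which substitutions are worth adding. If the cp1252 assumption holds, decoding the whole line may make a hand-built substitution table unnecessary.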
Replies are listed 'Best First'.
Re: Substitute and converting to UTF8
by choroba (Cardinal) on Jan 08, 2021 at 14:01 UTC
by tomred (Acolyte) on Jan 08, 2021 at 15:15 UTC

Re: Substitute and converting to UTF8
by hippo (Archbishop) on Jan 08, 2021 at 13:59 UTC