Dear gurus,
Not for the first time, I find myself duelling with character encoding.
I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF-8 database table.
I often get caught by fatal errors from Cpanel::JSON::XS:
"Error creating json: malformed or illegal unicode character in string"
or
utf8 "\x92" does not map to Unicode.
I'd like to create a hash of the most often offending characters and their substitutions but I can't seem to find a means of capturing the hex value. In the code below, I've cat'ed a sample into the bottom of the file.
What am I missing? Thanks in advance,
Dermot

    use v5.22;
    use warnings;
    # Causes "Wide Character.." warning for :std
    #use utf8;
    my %swaps = (
        '91' => '‘',
        '92' => '’',
        '94' => '”',
        '96' => '–',
        'A9' => '©',
    );
    while (my $line = <DATA>) {
        say "Before=".$line;
        # Works.
        #my ($key) = $line =~ m/(\x92)/;
        # All fail
        #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/;
        #my ($key) = $line =~ m/(\x\d{2})/;
        #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i;
        #my ($key) = $line =~ m/(0x\d{1,3})/;
        #my ($key) = $line =~ m/(\0x\d{1,3})/;
        my ($key) = $line =~ m/0x(?-i:[\da-f]+)/;
        say "Key=$key";
        $line =~ s/$key/$swaps{$key}/g if $key;
        say "After=".$line;
    }
    # This sample has \x92 between "heart" and "s".
    # In xxd, the line looks like this
    # 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273  d pansy, heart.s
    __DATA__
    "Wild pansy (Viola tricolor), 19th century illustration","19th-century
    + hand painted illustration of wild pansy, heart<92>s ease, or love in
    + idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840
    +). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pan
    +sy, pansies, wild, Viola tricolor, 19th century, painted, Engraving,
    +illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute,
    + bloom, blossom, botanical, botanist, bud, flora, floral, history, hi
    +storic, horticulture, leaves, petal, petals, plant, vintage, watercol
    +or, flower head, painting, stem, victorian style, botanic, flowers, p
    +lants, Botany",,C,Fl,N/A,,,,^M
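For what it's worth, the commented-out patterns look for the literal text "\x92" or "0x92", whereas the line holds a single byte whose value is 0x92, so there is no hex text to match. A minimal sketch of one way to capture those bytes follows: match any byte above 0x7F and turn it into a hex key with ord and sprintf. The helper names count_high_bytes and fix_line are made up for illustration, and the sketch assumes the input is really Windows-1252 (which the 91/92/94/96 entries in %swaps suggest, but which may not hold for every source).

```perl
use v5.22;
use warnings;
use Encode qw(decode);

# Hypothetical helper: tally every non-ASCII byte in a raw (undecoded) line,
# keyed by its two-digit uppercase hex value, e.g. "92".
sub count_high_bytes {
    my ($line, $seen) = @_;
    $seen->{ sprintf '%02X', ord $1 }++ while $line =~ /([\x80-\xFF])/g;
    return $seen;
}

# Hypothetical helper: ASSUMES the bytes are Windows-1252 and decodes them
# into Perl characters, so \x92 becomes U+2019 without a hand-made swap table.
sub fix_line {
    my ($line) = @_;
    return decode('cp1252', $line);
}

# Stand-in for a line read from the CSV file as raw bytes.
my $raw = "heart\x92s ease \xA9 1827";

my %seen;
count_high_bytes($raw, \%seen);
say "$_ seen $seen{$_} time(s)" for sort keys %seen;

# Encode on output so the decoded characters print without "Wide character".
binmode STDOUT, ':encoding(UTF-8)';
say fix_line($raw);
```

The tally gives the list of offending bytes per file, which is handy for spotting sources that are not Windows-1252 at all. If the assumption does hold, decoding the whole line once is likely simpler than maintaining the substitution hash, since the decoded string can then be handed on as ordinary Perl characters.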
In reply to Substitute and converting to UTF8 by tomred