Dear gurus,

Not for the first time, I find myself duelling with character encodings.

I am trying to process CSV files from a variety of sources. The data, once parsed by Text::CSV_XS, is encoded by Cpanel::JSON::XS and stored in a UTF-8 database table.

I often get caught by fatal errors from Cpanel::JSON::XS:

"Error creating json: malformed or illegal unicode character in string"
or
utf8 "\x92" does not map to Unicode.
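
A lone 0x92 byte is not valid UTF-8, which would be consistent with the second message. The following is a minimal sketch, not a definitive fix, of decoding each raw line before it reaches the JSON encoder. It assumes that any line which is not valid UTF-8 is Windows-1252 (where 0x92 is the right single quotation mark); that assumption is not confirmed by the post. The helper name `decode_line` is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try strict UTF-8 first, then fall back to
# Windows-1252 (an assumption about where the stray bytes come from).
sub decode_line {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $bytes while it checks.
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;

    # Not valid UTF-8: treat every byte as a cp1252 character.
    return Encode::decode('cp1252', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_line("heart\x92s ease"), "\n";
```

The fallback works per line, so a file mixing both encodings within a single line would still need a byte-level approach.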

I'd like to build a hash of the most frequent offending characters and their substitutions, but I can't find a way to capture the hex value. In the code below, I've cat'ed a sample into the bottom of the file.

What am I missing? Thanks in advance,

Dermot
use v5.22;
use warnings;
# Causes "Wide Character.." warning for :std
#use utf8;

my %swaps = (
    '91' => '‘',
    '92' => '’',
    '94' => '”',
    '96' => '–',
    'A9' => '©',
);

while (my $line = <DATA>) {
    say "Before=".$line;
    # Works.
    #my ($key) = $line =~ m/(\x92)/;
    # All fail
    #my ($key) = $line =~ /(\\x[[:xdigit:]]{1,3})/;
    #my ($key) = $line =~ m/(\x\d{2})/;
    #my ($key) = $line =~ /(0x[0-9A-F]{1,3})/i;
    #my ($key) = $line =~ m/(0x\d{1,3})/;
    #my ($key) = $line =~ m/(\0x\d{1,3})/;
    my ($key) = $line =~ m/0x(?-i:[\da-f]+)/;
    say "Key=$key";
    $line =~ s/$key/$swaps{$key}/g if $key;
    say "After=".$line;
}
# This sample has \x92 between "heart" and "s".
# In xxd, the line looks like this
# 0000080: 6420 7061 6e73 792c 2068 6561 7274 9273  d pansy, heart.s
__DATA__
"Wild pansy (Viola tricolor), 19th century illustration","19th-century hand painted illustration of wild pansy, heart<92>s ease, or love in idleness (Viola tricolor) flower by Pierre-Joseph Redoute (1759-1840). Published in Choix Des Plus Belles Fleurs, Paris (1827).",N/A,"Pansy, pansies, wild, Viola tricolor, 19th century, painted, Engraving, illustration, nobody, no-one, flower, artwork, Pierre Joseph redoute, bloom, blossom, botanical, botanist, bud, flora, floral, history, historic, horticulture, leaves, petal, petals, plant, vintage, watercolor, flower head, painting, stem, victorian style, botanic, flowers, plants, Botany",,C,Fl,N/A,,,,^M
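
The patterns marked "All fail" look for the literal text `0x..` or `\x..`, whereas the sample holds a single raw byte whose value is 0x92. A minimal sketch of one way to capture the hex value follows: match any byte in the range 0x80-0xFF and format its ordinal with `sprintf`. It assumes the line was read as raw bytes in a single-byte encoding such as Windows-1252; run over valid UTF-8 bytes it would damage multi-byte sequences. The helper name `swap_high_bytes` and the `%seen` counter are hypothetical.

```perl
use strict;
use warnings;

# Hex value of a byte => replacement character (same keys as %swaps above).
my %swaps = (
    '91' => "\x{2018}",   # left single quotation mark
    '92' => "\x{2019}",   # right single quotation mark
    '94' => "\x{201D}",   # right double quotation mark
    '96' => "\x{2013}",   # en dash
    'A9' => "\x{00A9}",   # copyright sign
);

# Hypothetical helper: count every high byte in $line by its hex value
# and replace the ones that have an entry in %swaps. Bytes without an
# entry are left untouched.
sub swap_high_bytes {
    my ($line, $seen) = @_;
    $line =~ s{([\x80-\xFF])}{
        my $hex = sprintf '%02X', ord $1;
        $seen->{$hex}++;
        $swaps{$hex} // $1;
    }ge;
    return $line;
}

my %seen;
my $fixed = swap_high_bytes("heart\x92s ease", \%seen);

binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";
print "$_ seen $seen{$_} time(s)\n" for sort keys %seen;
```

After a full run, the counts in `%seen` give the list of most frequent offenders, including bytes that do not yet have a substitution.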

In reply to Substitute and converting to UTF8 by tomred
