I've had no dealings with encoding previously, and am running into a bit of complexity. This may be a bit long, so let me first summarize.

SUMMARY: There are multiple steps data passes through where encoding may come into play. The input step is different between production and testing, but all further steps are the same. Production will import data from a MSSQL Database, whereas testing will input data from Excel spreadsheets. For now, I am focused on the testing process. (Yes, it would be better to also import test data from MSSQL, but we are constrained.)

OUTLINE of the testing source-to-destination steps:

  1. Input data using Excel spreadsheets (.xls)
  2. Store data in SQLite Database
  3. Query SQLite and process the data
  4. Write output to csv files

The input spreadsheet I have contains user data where names and addresses are from some European countries (I do not know which, nor what languages are in play.) I am running into garbled characters from the very first step. I am not sure how to inspect the data at each step to compare data along the way.

I believe Excel stores data using cp1252, so have tried to convert the data to utf8. The code below gives the error "Cannot decode string with wide characters" which makes me think there may be mixed-encoded data in my input; that is, maybe some columns are of different encoding than other columns.

package DB::SQLite::Tables; use warnings; use strict; use Carp; use Encode; use Spreadsheet::ParseExcel::Simple; sub load_test_people_tbl { my ( $the, $path, $readfile, $sqlite_schema, $do_load ) = @_; my $data_obj = Spreadsheet::ParseExcel::Simple->read( "$path/$readfile" ); foreach my $sheet ($data_obj->sheets) { ROW: while ( $sheet->has_data ) { @row = $sheet->next_row; $row_count_ppl++; # get header from input file if ( $row_count_ppl == 1 ) { @fields = @row; map { $_ =~ s{[\s+|-]}{_}xms } @fields; } # Gives error "Cannot decode string with wide characters" map { $_ = decode("utf8", $_) } @row; # Uses DBIx::Class if ( $do_load == 1 ) { $sqlite_schema->populate('Person', [ \@fields, \@row, ]); } } # end ROW } } # end load_test_people_tbl

Example input/output character mangling:

Some name in input xls spreadsheet: Toxv‘rd

later....

Same name in output csv spreadsheet: Toxv‘rd

QUESTIONS:

/dennis


In reply to Mixed character encoding issues by ddaupert

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.