As the sample code below shows, I can assign two (non-ASCII/Latin-1) character string values that are supposed to be the same, and the two versions are stored differently in my database.

Based on what you have posted, I would expect that the two strings you are using are not the same: $text1 is being assigned a utf8 string value, which contains a wide character (this is stored internally as a two-byte utf8 character); but $text2 is being assigned an iso-8859 string containing a single-byte accented character.
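To make the difference concrete, here is a minimal sketch. It is not the posted script; the variable names and the choice of ä (U+00E4) as the test character are my own assumptions. It only shows that one and the same character comes out as different byte sequences depending on the encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E4 (a with diaeresis), as a Perl character string.
my $char = "\x{E4}";

# Two different byte encodings of that single character.
my $utf8_bytes   = encode('UTF-8',      $char);   # two bytes: c3 a4
my $latin1_bytes = encode('ISO-8859-1', $char);   # one byte:  e4

# Hex dump of a byte string, e.g. "c3 a4".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

printf "UTF-8:      %s (%d bytes)\n", hex_bytes($utf8_bytes),   length $utf8_bytes;
printf "ISO-8859-1: %s (%d bytes)\n", hex_bytes($latin1_bytes), length $latin1_bytes;
print  "same bytes? ", ($utf8_bytes eq $latin1_bytes ? "yes" : "no"), "\n";
```

If the two values reach the database as these two byte strings, they will be stored as two different things, even though they "look" like the same letter.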

At least, when I look closely at the posted code, the value assigned to $text2 contains an accented character that is definitely a single byte and cannot be utf8. If you want to put literal utf8 characters in your perl script, you have to use a utf8-capable editor. Otherwise, you have to stick to using the unicode name references (like you did for $text1), or hex code points (e.g. "\xE4" for ä or "\x{0103}" for ă; the braces are required for anything longer than two hex digits, because "\x0103" would be read as "\x01" followed by the literal text "03"). update: Or you could use a non-utf8 editor, then run the script through an encoding conversion to change the iso-8859 (or cp-1252?) accented characters to utf8 wide characters.
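A short sketch of both options, under the assumption that the legacy text is cp1252; the variable names are mine, not from the posted code. The first part writes the characters with ASCII-only escapes, so the editor's encoding does not matter. The second part converts a single-byte accented character to utf8 with the core Encode module:

```perl
use strict;
use warnings;
use charnames ':full';          # needed for \N{NAME} on older perls
use Encode qw(decode encode);

# Two ASCII-only spellings of the same two characters (a-diaeresis, a-breve).
my $by_name = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
            . "\N{LATIN SMALL LETTER A WITH BREVE}";
my $by_hex  = "\xE4\x{0103}";   # braces required beyond two hex digits

print "escapes agree: ", ($by_name eq $by_hex ? "yes" : "no"), "\n";

# Conversion: the byte a non-utf8 editor would save for a-diaeresis
# (assumed cp1252 here), decoded to characters and re-encoded as UTF-8.
my $legacy_bytes = "\xE4";
my $chars        = decode('cp1252', $legacy_bytes);
my $utf8_bytes   = encode('UTF-8',  $chars);

printf "legacy: %d byte(s), utf8: %d byte(s)\n",
    length $legacy_bytes, length $utf8_bytes;
```

For converting a whole script file rather than one string, a command-line tool such as iconv does the same job, provided you know which encoding the file is really in.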

So you need to check and make sure that the stuff you are loading into the table is in fact encoded in a consistent manner -- if you put different encodings in, then you will obviously get different encodings back, and strings that are supposed to have the same letters will be different.
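One way to enforce that consistency is to funnel every incoming string through a single normalizing routine before it goes anywhere near the table. The helper below is hypothetical (the name to_chars is mine), and it rests on a loud assumption: every input is either valid UTF-8 or cp1252. It is a heuristic, since some cp1252 byte sequences also happen to be valid UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn bytes of uncertain encoding into a character string.
# Assumption: the input is either valid UTF-8 or cp1252, nothing else.
sub to_chars {
    my ($bytes) = @_;

    # Work on a copy, because decode() may modify its argument when CHECK is set.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars if defined $chars;

    # Not valid UTF-8, so fall back to the assumed legacy encoding.
    return decode('cp1252', $bytes);
}

my $from_utf8   = to_chars("\xC3\xA4");   # a-diaeresis as UTF-8 bytes
my $from_legacy = to_chars("\xE4");       # a-diaeresis as a cp1252 byte

print "normalized values match: ",
    ($from_utf8 eq $from_legacy ? "yes" : "no"), "\n";
```

Once everything is a character string, encode it one way (UTF-8, say) on the way into the database, and decode it the same way on the way out.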

The database ought to be agnostic as to character encoding -- you give it a string of bytes, it stores them, and you get them back when you ask for them.

As for making sure that you have consistent encoding for all the stuff you feed to the database, I don't think you've told us enough about the problem to give an idea of how hard or easy this might be. Where is the character data coming from? (How many different sources? String literals in your script? Data files from "outsiders"? ...)


In reply to Re: database stores UTF8 strings inconsistently by graff
in thread database stores UTF8 strings inconsistently by robv
