And the output should also be UTF-8 encoded Unicode, which it already is, so I modified the step to skip the wrong encode step (new step 3). Am I doing it right now?

It would be easier to answer that if you showed us a relevant code snippet. And if you try the snippet yourself, that will probably answer the question. Check out this little Unicode tool (shameless plug for a program I posted recently), in case it helps to validate your data.

For the interested reader: I actually use Storable to serialize my resulting data structure as a whole, then I gzip the frozen data and write it to disk through a simple binmode (and thus not :utf8) filehandle. Any problems here? The utf8 data and the utf8 flag should stay intact over the pipeline.

The utf8 flag is strictly a perl-internal attribute of scalar values. Once data is written to any sort of file (including any pipe), it's just data, and what happens to it after that point depends on what sort of process is reading it, and how that process chooses to interpret what is being read.
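To make that concrete, here is a minimal sketch of the round trip, assuming the core Encode module and an in-memory filehandle standing in for a real file. The sample string and all variable names are made up for illustration; they are not from your code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: 6 characters, two of them non-ASCII.
my $text = "caf\x{e9} \x{263A}";

# Encoding produces plain bytes; the utf8 flag is off on the result.
my $bytes = encode('UTF-8', $text);

# Write the bytes through a raw handle (in-memory here, a file in real life).
my $buffer = '';
open my $out, '>:raw', \$buffer or die "open for write failed: $!";
print {$out} $bytes;
close $out or die "close failed: $!";

# Whatever reads it back sees only bytes, with no flag attached.
open my $in, '<:raw', \$buffer or die "open for read failed: $!";
my $read = do { local $/; <$in> };
close $in;

# The reader has to decide the bytes are UTF-8 and decode them itself.
my $back = decode('UTF-8', $read);

printf "chars=%d bytes=%d flag_after_read=%s same_after_decode=%s\n",
    length($text), length($read),
    (utf8::is_utf8($read) ? 'on' : 'off'),
    ($back eq $text ? 'yes' : 'no');
```

The point is that the flag never travels with the data; only the reader's explicit decode turns the bytes back into characters.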

There is a section of the Storable man page about utf8 (under the heading "FORWARD COMPATIBILITY"), which you should consult. It looks like it will "do the right thing" for you by default (retain the utf8 flag as part of the "freeze"d data structure so that a downstream "thaw" gets it), but it'll be worth testing to be sure. (I haven't used it, so I don't know.)
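A quick test along these lines might look like the sketch below. It assumes the core modules Storable, IO::Compress::Gzip, IO::Uncompress::Gunzip and File::Temp; the data structure and variable names are hypothetical stand-ins for yours, and a temp file plays the role of your output file.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use File::Temp qw(tempfile);

# Hypothetical data with character strings in it.
my $data = { title => "caf\x{e9} \x{263A}", tags => ["\x{3b1}\x{3b2}"] };

# Serialize, then compress the frozen bytes.
my $frozen = freeze($data);
my $packed;
gzip(\$frozen => \$packed) or die "gzip failed: $GzipError";

# Write to disk through a binmode handle (no :utf8 layer).
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $packed;
close $fh or die "close failed: $!";

# Read it back as raw bytes, decompress, and thaw.
open my $in, '<:raw', $path or die "open failed: $!";
my $raw = do { local $/; <$in> };
close $in;

my $unpacked;
gunzip(\$raw => \$unpacked) or die "gunzip failed: $GunzipError";
my $copy = thaw($unpacked);

printf "title intact: %s, flag: %s\n",
    ($copy->{title} eq $data->{title} ? 'yes' : 'no'),
    (utf8::is_utf8($copy->{title}) ? 'on' : 'off');
```

If the thawed strings compare equal to the originals, the binmode handle is doing what you want: the frozen image is binary data and must not pass through a :utf8 layer. If the files will ever be read on a different machine, nfreeze may be worth a look in place of freeze.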


In reply to Re^2: The unicode / utf8 struggle, part 2: regexes by graff
in thread The unicode / utf8 struggle, part 2: regexes by isync
