kgoess has asked for the wisdom of the Perl Monks concerning the following question:

I have two files, each with a serialized hash produced by Storable::nfreeze. With one file, this works as expected:
use Storable qw/thaw/;
use Data::Dump qw/dump/;
use Encode qw/decode decode_utf8/;
my $file = shift;
open (my $fh, "<:raw", $file) || die $!;
my $contents = join '', <$fh>;
dump thaw $contents;
But the second file fails with "Corrupted storable string (binary v2.7)" *unless* I add this, in which case it works fine:
$contents = decode_utf8($contents);

Though with that code the *first* file fails with "Frozen string corrupt - contains characters outside 0-255", and actually doing "decode('utf-8-strict', $contents, 1)" on the first file throws 'utf8 "\xDD" does not map to Unicode', though it works OK on the second file.

I'm unable to even imagine why Storable::thaw would expect to be working with logical Perl characters instead of bytes, and came here looking for some clue as to what might be going on.

AFAICT there aren't any non-ascii characters in either data structure.

Versions:

$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 61 registered patches, see perl -V for more detail)
...
$ perl -MStorable -le 'print Storable->VERSION'
2.20

No, I can't post the original data here, it's quite large.

Re: storable and utf8
by graff (Chancellor) on Mar 06, 2014 at 03:28 UTC
    Just for grins, you might want to try out the script I posted here: unichist -- count/summarize characters in data. Run it on each file, to see whether you have any characters outside the (7-bit) ASCII range, and if so, whether they are properly encoded as utf8 or not, and if they are, what range of code points you have.

    Given that you have an error message that explicitly mentions "\xDD", it seems that the file in question clearly has non-ASCII content (i.e. bytes with the 8th bit set, whether or not they also happen to be utf8).
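
    A minimal sketch of that kind of check, which is not the unichist script itself: the hypothetical helper classify_bytes() counts the bytes above 0x7F in a raw string and reports whether the string as a whole is well-formed UTF-8. The helper name and the output format are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of unichist): inspect a raw byte string.
# Returns (number of bytes above 0x7F, 1 if well-formed UTF-8 else 0).
sub classify_bytes {
    my ($bytes) = @_;
    my $high  = () = $bytes =~ /[\x80-\xFF]/g;   # count bytes with the 8th bit set
    my $copy  = $bytes;                          # utf8::decode modifies its argument
    my $valid = utf8::decode($copy) ? 1 : 0;     # false if not well-formed UTF-8
    return ($high, $valid);
}

# Usage: pass a file name to slurp it as raw bytes and classify it.
if (@ARGV) {
    my $file = shift;
    open(my $fh, '<:raw', $file) or die "$file: $!";
    my $contents = do { local $/; <$fh> };
    my ($high, $valid) = classify_bytes($contents);
    printf "%d high bytes; %s UTF-8\n", $high, $valid ? 'well-formed' : 'not well-formed';
}
```

    A file with high bytes that is not well-formed UTF-8 is most likely plain binary data; one with high bytes that is well-formed UTF-8 may have been encoded somewhere along the way.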

    Also, it seems odd that when your code says use Storable, you get v2.20, but one of your error messages is saying something about v2.7 - what's up with that? How much do you know about the origins or provenance of these files?

Re: storable and utf8
by ww (Archbishop) on Mar 05, 2014 at 19:31 UTC

    So post a sample... of a small portion, including some lines that are fine and some that are borked... and do it INSIDE code tags, so we can help.

    Come, let us reason together: Spirit of the Monastery
Re: storable and utf8
by ikegami (Patriarch) on Mar 06, 2014 at 15:23 UTC

    $contents = decode_utf8($contents); is only correct if you erroneously did $contents = encode_utf8($contents); or equivalent to the frozen contents. You should remove the encoding rather than adding the decoding.
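
    To illustrate the point, here is a small sketch, assuming made-up sample data chosen so that the frozen string contains a byte above 0x7F. It simulates an accidental encode_utf8 on the frozen bytes and shows that decode_utf8 merely undoes that one step.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use Encode qw(encode_utf8 decode_utf8);

# Sample data (an assumption for illustration): the value holds the byte 0xE9,
# so the frozen string is guaranteed to contain a byte above 0x7F.
my $data   = { name => "caf\xE9" };
my $frozen = nfreeze($data);                 # a string of bytes

# Simulate the accidental encoding step: every byte above 0x7F becomes
# two bytes, so this is no longer the string that nfreeze produced.
my $encoded = encode_utf8($frozen);
printf "frozen: %d bytes, after encode_utf8: %d bytes\n",
    length($frozen), length($encoded);

# decode_utf8 reverses exactly that step, after which thaw succeeds.
my $restored = decode_utf8($encoded);
my $copy     = thaw($restored);
print "round trip ok\n" if $copy->{name} eq $data->{name};
```

    A frozen string that was never encoded has nothing to undo, which is consistent with decode_utf8 breaking the first file while fixing the second.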

Re: storable and utf8
by sn1987a (Curate) on Mar 05, 2014 at 19:23 UTC
    No, I can't post the original data here, it's quite large.

    Can you find and post a subset that produces the same problem?

Re: storable and utf8
by kgoess (Beadle) on Mar 07, 2014 at 22:53 UTC

    I found the source of my problem. I was adding the serialized data to a tar file via Archive::Tar->add_data. But I had missed the documentation in Archive::Tar that says:

    Unicode strings need to be converted to UTF-8-encoded bytestrings before they are handed off to "add_data()":

    So this change on the serializer

    - $tar->add_data($filename, $serialized_context, { type => Archive::Tar::FILE });
    + $tar->add_data($filename, encode_utf8($serialized_context), { type => Archive::Tar::FILE });

    and this change on the de-serializer

    - $self->serialized_blob($serialized_context);
    + $self->serialized_blob(decode_utf8($serialized_context));

    make it all work as expected, and my original problem makes more sense now, though it's not much more than GIGO.