History: I'm working on an application that involves a huge amount of text-processing, scraping useful information from thousands and thousands of files in about 3 dozen different formats. One of the problems I'm trying to overcome is that although these text files were all generated by different versions of the same tool, their contents vary pretty drastically depending on what version of the tool and what version of the system the tool was collecting data for.

So, with that in mind, I wanted to be damn sure that changes made to the extraction tools to correct problems with one version of the data didn't break other versions of the data, so I wrote a script to take the known-working version, and store both the original contents and the known-working output so I could use it later to make sure the same input would produce the same output. Sounds simple enough, right?

Apparently it's only simple if you don't care if that file is human readable or not. I tried using YAML (even though I seem to have a history of discovering new bugs every time I try to use YAML) because the output is so nicely readable. Unfortunately I fairly quickly found a simple case that YAML couldn't roundtrip. I've actually tried several different YAML implementations, and all of them had problems.

This is a sample of the text I was working with:

  F S UID PID PPID C PRI NI CMD
040 S root 14 0 0 75 0 [kupdated]

One of the things I liked about using YAML to store this stuff is that it has a literal block style (the '|' indicator), where this text can (theoretically) be stored like this:

content: |
    F S UID PID PPID C PRI NI CMD
  040 S root 14 0 0 75 0 [kupdated]

Nice and readable, and easy to work with, right? Not really. YAML 0.66 serialized it to this:

--- |2
  F S UID PID PPID C PRI NI CMD
040 S root 14 0 0 75 0 [kupdated]
040 S root 13 0 0 85 0 [bdflush]

And then puked when attempting to deserialize it...

YAML Error: Inconsistent indentation level
   Code: YAML_PARSE_ERR_INCONSISTENT_INDENTATION
   Line: 3
   Document: 1
 at /usr/local/lib/perl5/site_perl/5.8.8/YAML.pm line 33

Even a very simple string that starts with spaces can't be roundtripped...

use strict;
use warnings;
use YAML qw( LoadFile Dump Load );
use Data::Dump qw( dump );

my $in = " FOO\nBAR BAZ\n";
my $out = Load( Dump( $in ) );
print dump( $in, $out );

------ output -----
(
  " FOO\nBAR BAZ\n",
  " FOO\nBAR BAZ\n",
)
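The same kind of check can be run over several awkward strings at once, which makes it easier to compare candidate modules. What follows is only a sketch: `roundtrip_failures` is a hypothetical helper name, and the serializer pair is built on core Data::Dumper purely as a stand-in so the script runs without any YAML module installed. To test a real candidate, pass its own dump and load functions instead, for example `\&YAML::Dump` and `\&YAML::Load`.

```perl
use strict;
use warnings;
use Data::Dumper ();

# Return the inputs that do NOT survive dump-then-load unchanged.
# A loader that dies (as YAML.pm does on a parse error) counts as a failure.
sub roundtrip_failures {
    my ( $dump, $load, @strings ) = @_;
    my @failed;
    for my $in (@strings) {
        my $out = eval { $load->( $dump->($in) ) };
        push @failed, $in
            unless defined $out && $out eq $in;
    }
    return @failed;
}

# Stand-in serializer pair built on core Data::Dumper. This is an
# assumption for the sketch, not one of the YAML modules discussed here.
my $dd_dump = sub {
    local $Data::Dumper::Terse = 1;    # emit just the value, no "$VAR1 ="
    local $Data::Dumper::Useqq = 1;    # escape newlines and control chars
    return Data::Dumper::Dumper( $_[0] );
};
my $dd_load = sub { return eval $_[0] };    # only safe for trusted input

my @cases = (
    " FOO\nBAR BAZ\n",             # leading space on the first line
    "  F S UID\n040 S root\n",     # header line indented more than the data
    "no trailing newline",
);

my @failed = roundtrip_failures( $dd_dump, $dd_load, @cases );
print @failed
    ? "failed: " . scalar(@failed) . "\n"
    : "all round-tripped\n";
```

Note that this only tests fidelity. A serializer can pass every case here and still produce the kind of single-line, escaped output that is hard to read.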

So, since there are several different YAML modules on CPAN, I figured one of them must meet my needs. No such luck.

YAML::Syck successfully roundtripped all the examples above, but only by not attempting the block style at all. Instead it wrapped the whole thing in double quotes and converted all the newlines to "\n", turning the whole document into one long, unreadable string.

YAML::Tiny couldn't even dump the original data, as the output that I'm storing is an object, and it doesn't seem to be able to serialize a blessed object.

So, for the time being I'm doing it the ugly way, storing the source data in a separate file and using YAML::Syck for the comparison data. This is not an ideal solution, as it requires that those two files be carefully kept together. So I'm wondering if anyone has suggestions for human-readable serialization formats that actually work on real-world data?
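For reference, the two-file stopgap looks roughly like the sketch below. Everything named here is hypothetical: `extract` is a toy stand-in for the real extraction tool, `My::Extracted` is a made-up class, the `.src` and `.expected` suffixes are arbitrary, and core Data::Dumper stands in for YAML::Syck so the script runs on a stock Perl.

```perl
use strict;
use warnings;
use Data::Dumper ();
use File::Spec;
use File::Temp qw( tempdir );

# Hypothetical stand-in for the real extraction tool: returns a blessed
# object holding the first whitespace-separated field of each line.
sub extract {
    my ($text) = @_;
    my @first = map { ( split ' ', $_ )[0] } grep { /\S/ } split /\n/, $text;
    return bless { first_fields => \@first }, 'My::Extracted';
}

# Deterministic text form of the result (sorted keys, fixed indent).
# Data::Dumper is an assumption here, standing in for YAML::Syck.
sub freeze {
    my ($obj) = @_;
    local $Data::Dumper::Indent   = 1;
    local $Data::Dumper::Sortkeys = 1;
    return Data::Dumper::Dumper($obj);
}

# Write and read files byte-for-byte so leading whitespace survives.
sub write_raw {
    my ( $path, $data ) = @_;
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $data;
    close $fh or die "Cannot close $path: $!";
}

sub read_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Record a known-working case: untouched source plus serialized result.
sub record_case {
    my ( $dir, $name, $input ) = @_;
    write_raw( File::Spec->catfile( $dir, "$name.src" ), $input );
    write_raw( File::Spec->catfile( $dir, "$name.expected" ),
        freeze( extract($input) ) );
}

# Re-run the extractor on the stored source and compare with the stored result.
sub check_case {
    my ( $dir, $name ) = @_;
    my $input    = read_raw( File::Spec->catfile( $dir, "$name.src" ) );
    my $expected = read_raw( File::Spec->catfile( $dir, "$name.expected" ) );
    return freeze( extract($input) ) eq $expected;
}

my $dir = tempdir( CLEANUP => 1 );
record_case( $dir, 'ps_sample', "  F S UID\n040 S root\n" );
print check_case( $dir, 'ps_sample' ) ? "ok\n" : "MISMATCH\n";
```

The source file stays perfectly readable because it is never serialized, but the pairing between the two files is held together only by the shared base name, which is exactly the fragility I'd like to get rid of.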


www.jasonkohles.com
We're not surrounded, we're in a target-rich environment!

In reply to Human-readable serialization formats other than YAML? by jasonk
