History: I'm working on an application that involves a huge amount of text-processing, scraping useful information from thousands and thousands of files in about 3 dozen different formats. One of the problems I'm trying to overcome is that although these text files were all generated by different versions of the same tool, their contents vary pretty drastically depending on what version of the tool and what version of the system the tool was collecting data for.

So, with that in mind, I wanted to be damn sure that changes made to the extraction tools to correct problems with one version of the data didn't break other versions of the data, so I wrote a script to take the known-working version, and store both the original contents and the known-working output so I could use it later to make sure the same input would produce the same output. Sounds simple enough, right?

Apparently it's only simple if you don't care if that file is human readable or not. I tried using YAML (even though I seem to have a history of discovering new bugs every time I try to use YAML) because the output is so nicely readable. Unfortunately I fairly quickly found a simple case that YAML couldn't roundtrip. I've actually tried several different YAML implementations, and all of them had problems.

This is a sample of the text I was working with:

  F S UID        PID  PPID  C PRI  NI    CMD
040 S root        14     0  0  75   0    [kupdated]

One of the things I liked about using YAML to store this stuff is that it has a block operator ('|', the literal block style), where this text can (theoretically) be stored like this:

content: |
  F S UID        PID  PPID  C PRI  NI    CMD
040 S root        14     0  0  75   0    [kupdated]

Nice and readable, and easy to work with, right? Not really. YAML 0.66 serialized it to this:

--- |2
  F S UID        PID  PPID  C PRI  NI CMD
040 S root        14     0  0  75   0 [kupdated]
040 S root        13     0  0  85   0 [bdflush]

And then puked when attempting to deserialize it...

YAML Error: Inconsistent indentation level
   Code: YAML_PARSE_ERR_INCONSISTENT_INDENTATION
   Line: 3
   Document: 1
 at /usr/local/lib/perl5/site_perl/5.8.8/YAML.pm line 33

Even a very simple string that starts with spaces can't be roundtripped...

use strict; use warnings;
use YAML qw( LoadFile Dump Load );
use Data::Dump qw( dump );

my $in = "  FOO\nBAR  BAZ\n";
my $out = Load( Dump( $in ) );
print dump( $in, $out );

------ output -----
(
    "  FOO\nBAR  BAZ\n",
    " FOO\nBAR BAZ\n",
)

So, since there are several different YAML modules on CPAN, I figured one of them must meet my needs. No such luck.

YAML::Syck successfully roundtripped all the examples above, but only by not attempting the block style at all. Instead it wrapped the whole thing in double quotes and converted all the newlines to "\n", turning the whole document into one long, unreadable string.

YAML::Tiny couldn't even dump the original data, as the output that I'm storing is an object, and it doesn't seem to be able to serialize a blessed object.

So, for the time being I'm doing it the ugly way: storing the source data in a separate file and using YAML::Syck for the comparison data. This is not an ideal solution, as it requires that those two files be carefully kept together, so I'm wondering if anyone has suggestions for human-readable serialization formats that actually work on real-world data?
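
For anyone who wants to try a candidate before suggesting it, this is roughly the check it has to pass. It is only a sketch: the roundtrips() helper and the candidate list are made up for illustration, and it assumes each module defines plain Dump and Load functions in its own package, which is true of YAML and YAML::Syck but not necessarily of other serializers. Modules that aren't installed are skipped.

```perl
use strict;
use warnings;

# Strings like the ones that caused trouble above:
# leading spaces on the first line and runs of spaces inside lines.
my @samples = (
    "  FOO\nBAR  BAZ\n",
    "  F S UID        PID  PPID  C PRI  NI    CMD\n"
      . "040 S root        14     0  0  75   0    [kupdated]\n",
);

# Hypothetical helper: true only if load(dump(string)) gives back
# exactly the same string. A dump or load that dies counts as a failure.
sub roundtrips {
    my ( $dump, $load, $string ) = @_;
    my $copy = eval { $load->( $dump->($string) ) };
    return defined $copy && $copy eq $string;
}

# Illustrative candidate list; add whatever you want to test.
for my $module ( 'YAML', 'YAML::Syck' ) {
    unless ( eval "require $module; 1" ) {
        print "$module: not installed, skipped\n";
        next;
    }

    # Assumes the module defines Dump() and Load() in its own package.
    my $dump = $module->can('Dump');
    my $load = $module->can('Load');

    for my $i ( 0 .. $#samples ) {
        printf "%s sample %d: %s\n", $module, $i,
            roundtrips( $dump, $load, $samples[$i] ) ? 'ok' : 'FAILED';
    }
}
```

Note that this only checks correctness. A module can pass it and still produce the one-long-quoted-string output described above, so the dumped text still needs to be looked at by eye.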

www.jasonkohles.com
We're not surrounded, we're in a target-rich environment!

In reply to Human-readable serialization formats other than YAML? by jasonk
