jasonk has asked for the wisdom of the Perl Monks concerning the following question:
History: I'm working on an application that involves a huge amount of text-processing, scraping useful information from thousands and thousands of files in about 3 dozen different formats. One of the problems I'm trying to overcome is that although these text files were all generated by different versions of the same tool, their contents vary pretty drastically depending on what version of the tool and what version of the system the tool was collecting data for.
So, with that in mind, I wanted to be damn sure that changes made to the extraction tools to correct problems with one version of the data didn't break other versions of the data, so I wrote a script to take the known-working version, and store both the original contents and the known-working output so I could use it later to make sure the same input would produce the same output. Sounds simple enough, right?
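A minimal sketch of the workflow described above, not the actual script: it records an input and the known-working output in one file, then later checks that the extractor still agrees with the recording. The `extract` routine, the fixture layout, and the sample text are all hypothetical stand-ins. Data::Dumper is used only because it ships with Perl, and with `Useqq` it writes newlines as `\n` escapes, so it is exact but not especially readable.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw( tempfile );
use Test::More;

# Hypothetical stand-in for a real extraction tool: one array of
# whitespace-separated fields per non-empty input line.
sub extract {
    my ($text) = @_;
    return [ map { [ split ' ', $_ ] } grep { /\S/ } split /\n/, $text ];
}

# Store the raw input next to the output of the known-working extractor.
sub save_fixture {
    my ( $file, $input ) = @_;
    local $Data::Dumper::Indent   = 1;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Useqq    = 1;    # escape newlines and control characters
    my %fixture = ( input => $input, expected => extract($input) );
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} Data::Dumper->Dump( [ \%fixture ], ['fixture'] );
    close $fh or die "Cannot close $file: $!";
    return;
}

# Reload a stored fixture as a hash reference.
sub load_fixture {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    my $code = do { local $/; <$fh> };
    close $fh;
    my $fixture;
    eval $code;    # the dumped text assigns to the lexical $fixture above
    die "Cannot parse $file: $@" if $@;
    return $fixture;
}

my ( undef, $file ) = tempfile( UNLINK => 1 );
my $sample = "  F S UID PID\n040 S root 14\n";    # made-up sample with leading spaces
save_fixture( $file, $sample );

my $fixture = load_fixture($file);
is( $fixture->{input}, $sample, 'input survives the round trip' );
is_deeply( extract( $fixture->{input} ), $fixture->{expected},
    'extractor still produces the recorded output' );
done_testing();
```

The comparison uses `is_deeply` rather than comparing two dumps as strings, because a reloaded value such as 14 may come back as a number while a freshly extracted one is a string, and the two can be dumped differently.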
Apparently it's only simple if you don't care if that file is human readable or not. I tried using YAML (even though I seem to have a history of discovering new bugs every time I try to use YAML) because the output is so nicely readable. Unfortunately I fairly quickly found a simple case that YAML couldn't roundtrip. I've actually tried several different YAML implementations, and all of them had problems.
This is a sample of the text I was working with:
F S UID PID PPID C PRI NI CMD
040 S root 14 0 0 75 0 [kupdated]
One of the things I liked about using YAML to store this stuff is that it has a 'block-folding' operator, where this text can (theoretically) be stored like this:
content: |
  F S UID PID PPID C PRI NI CMD
  040 S root 14 0 0 75 0 [kupdated]
Nice and readable, and easy to work with, right? Not really. YAML 0.66 serialized it to this:
--- |2
F S UID PID PPID C PRI NI CMD
040 S root 14 0 0 75 0 [kupdated]
040 S root 13 0 0 85 0 [bdflush]
And then puked when attempting to deserialize it...
YAML Error: Inconsistent indentation level
   Code: YAML_PARSE_ERR_INCONSISTENT_INDENTATION
   Line: 3
   Document: 1
 at /usr/local/lib/perl5/site_perl/5.8.8/YAML.pm line 33
Even a very simple string that starts with spaces can't be roundtripped...
use strict;
use warnings;
use YAML qw( LoadFile Dump Load );
use Data::Dump qw( dump );

my $in = " FOO\nBAR BAZ\n";
my $out = Load( Dump( $in ) );
print dump( $in, $out );

------ output -----
(
  " FOO\nBAR BAZ\n",
  " FOO\nBAR BAZ\n",
)
So, since there are several different YAML modules on CPAN, I figured one of them must meet my needs. No such luck.
YAML::Syck successfully roundtripped all the examples above, but it did so by not even attempting the folding. Instead it just wrapped the whole thing in double quotes and converted all the newlines to "\n", turning the whole document into one long, unreadable string.
YAML::Tiny couldn't even dump the original data, as the output that I'm storing is an object, and it doesn't seem to be able to serialize a blessed object.
So, for the time being I'm doing it the ugly way, storing the source data in a separate file and using YAML::Syck for the comparison data. This is not an ideal solution, as it requires that those two files be carefully kept together, so I'm wondering if anyone has suggestions for human-readable serialization formats that actually work on real-world data?