in reply to Load large(ish) files with YAML?

I'd want to move a 4MB JPEG

You do realise that jpg is a binary data format?

And that AFAIK, YAML doesn't handle binary data.

The regex in question is probably looking for newlines (anticipating text) so that it can break your 4MB into "lines", but your binary file doesn't contain any within 32k of the start of the file.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Load large(ish) files with YAML?
by rgcosma (Beadle) on Apr 21, 2011 at 22:16 UTC
    Reading through yaml.org, there seems to be just one clearly stated option for sending binary data (base64), although there is wording in that same document suggesting that anything can be embedded as long as it starts with a !!customtag and can be represented in Unicode. Puppet doesn't do base64; it sends a long string of \x123\x321 etc. (this is the actual string it sends: those are not hex chars but their readable form). A thread on their mailing lists suggests they are aware of the issue and consider it a bug: Google groups
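For comparison, the base64 route could look roughly like this on the Perl side. This is a minimal sketch using the core MIME::Base64 module; the sample bytes and the `picture` key are made up for illustration, and the decoding is done by hand rather than assuming that any particular YAML module understands `!!binary`.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical stand-in for file contents: every byte value 0..255 once.
my $bytes = pack 'C*', 0 .. 255;

# encode_base64 wraps its output at 76 columns, so there is never one giant line.
my $b64 = encode_base64($bytes);

# Emit the data as an indented YAML block scalar tagged !!binary.
( my $indented = $b64 ) =~ s/^/  /mg;
my $yaml = "picture: !!binary |\n$indented";

# Receiving side, by hand: drop the header line and decode the rest.
# decode_base64 ignores whitespace, so the indentation does no harm.
( my $payload = $yaml ) =~ s/\A[^\n]*\n//;
my $decoded = decode_base64($payload);

print length($decoded), " bytes recovered\n";
```

Base64 costs about a third more space than the raw bytes, which is far less than one escape sequence per byte.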

      Essentially what you are saying is that Puppet is sending you non-standard (therefore non-YAML) data, and Perl's YAML modules don't handle it. Understandable :)

      Decoding that long string manually shouldn't be a problem. Though the format you describe means your 4MB jpg will come across as 20MB of text.

      If it were encoded as asciified hex bytes (technical terms:), then decoding it would be simple, if horribly slow. Something like:

      my $jpg = pack 'C*', map hex( '0' . $1 ), $content =~ m[\\(x[0-9a-fA-F]+)]g;

      But, your example \x123\x321 shows values greater than \xff, which suggests that they are encoding unicode characters rather than bytes. So you'd need something like:

      my $jpg = pack 'U*', map hex( '0' . $1 ), $content =~ m[\\(x[0-9a-fA-F]+)]g;

      But whether you could then print that to a binary file without getting a bunch of Wide character in print ... warnings or the content messed with by IO layers I have no idea.
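One way to find out is to try it on a tiny sample. The sketch below packs the two code points from the example with 'U*', prints them to a temporary file opened in binmode, and records any warnings. The temporary file and the warning handler are scaffolding for the experiment, not part of any real transfer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The two values from the example, packed as Unicode code points.
my $chars = pack 'U*', 0x123, 0x321;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# Scratch file, removed automatically when the program exits.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} $chars;
close $fh;

printf "%d warning(s), %d bytes on disk\n", scalar(@warnings), -s $path;
```

With no encoding layer on the handle, Perl warns and writes the UTF-8 encoding of each character, so what lands on disk is not the original byte stream.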

      Also, be aware that not only will you have the original 20MB string, and the 4MB result in memory, but also 2 very large lists of scalars. One to the map and one to the pack. How large will depend upon how the unicode decides to split up the binary into 'characters', but each list will be at least 1 million scalars and up to 4 million long.
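The two big lists can be avoided by walking the string with a //g match in scalar context and appending one byte at a time. This is only a sketch: it assumes every escape is a byte value (\x00 to \xff), which the \x123 sample contradicts, and decode_hex_escapes is a made-up helper name.

```perl
use strict;
use warnings;

# Decode "\xNN" escapes one match at a time, so no multi-million-element
# list is ever built. Assumes each escape holds a byte value (<= 0xff).
sub decode_hex_escapes {
    my ($content_ref) = @_;
    my $out = '';
    while ( $$content_ref =~ m[\\x([0-9a-fA-F]+)]g ) {
        my $value = hex $1;
        die "escape \\x$1 is not a byte value\n" if $value > 0xff;
        $out .= chr $value;
    }
    return $out;
}

# Hypothetical sample: the bytes FF D8 FF E0 written as escaped text.
my $content = '\xff\xd8\xff\xe0';
my $jpg     = decode_hex_escapes( \$content );

printf "%d bytes: %s\n", length($jpg), unpack( 'H*', $jpg );
```

Passing a reference avoids copying the 20MB string into the sub, and the die makes it obvious if the data turns out to hold code points rather than bytes.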

      Seems like a really silly (slow, clumsy & laborious) way to transfer a file given that LWP::Simple will transfer a 4MB binary file locally in less than a second.

      Puppet doesn't do base64, it sends a long string of \x123\x321 etc (this is the actual string it sends, those are not hex chars but their readable form).
      Wow, what a terrible piece of software -- though I suppose it's pretty typical for Ruby, with its mix of historical ignorance and inefficiency. We've known how to encode binary data as ASCII text with reasonable line lengths for a long time: it's called "uuencode," and pack and unpack handle it just fine.
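A minimal sketch of that round trip, using nothing but the 'u' template of pack and unpack. The sample bytes are a made-up stand-in for real file contents.

```perl
use strict;
use warnings;

# Hypothetical stand-in for binary file contents: every byte value, four times over.
my $bytes = pack 'C*', ( 0 .. 255 ) x 4;

# The 'u' template uuencodes, taking at most 45 input bytes per output line.
my $encoded = pack 'u', $bytes;

# unpack reverses it, decoding every line in the string.
my $decoded = unpack 'u', $encoded;

my ($longest) = sort { $b <=> $a } map { length } split /\n/, $encoded;
print "round trip ok\n" if $decoded eq $bytes;
print "longest encoded line: $longest characters\n";
```

Note that pack 'u' produces only the encoded body lines, without the begin and end lines that the uuencode command-line tool adds.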

        While you are right that they don't encode properly, it seems there was an existing solution: the puppet app recognizes an 'Accept: s' header that means it'll return the file as-is.

        The working example would be:
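        use LWP::UserAgent;
        use HTTP::Request;
        use HTTP::Headers;
        use YAML;

        # $server (the puppet master's host name) must be set before this point.
        my $ua = LWP::UserAgent->new();
        my $ay = HTTP::Headers->new; $ay->header('Accept' => 'YAML');
        my $as = HTTP::Headers->new; $as->header('Accept' => 's');
        my @osini;
        my @partini;

        # First request: fetch the file metadata as YAML to learn the checksum.
        my $req = HTTP::Request->new('GET', "https://$server/production/file_metadata/test/hm.jpg", $ay);
        my $res = $ua->request($req);
        die "Something went wrong: " . $res->status_line unless $res->is_success;
        @osini = YAML::Load($res->content."\n");
        my $md5 = $osini[0]->{checksum};
        $md5 =~ s/^\{md5\}//;

        # Second request: fetch the raw file from the file bucket by its md5.
        $req = HTTP::Request->new('GET', "https://$server/production/file_bucket_file/md5/$md5", $as);
        $res = $ua->request($req);
        die "Something went wrong: " . $res->status_line unless $res->is_success;

        # Write the bytes out untouched.
        open my $out, '>', 'aaa.jpg' or die "Cannot open aaa.jpg: $!";
        binmode $out;
        print {$out} $res->content;
        close $out;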