rgcosma has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I'm toying with Puppet (a Ruby-based system administration app), and interfacing with it is seemingly done via a REST API. It supports file transfers in YAML format, and these do work, but Perl's YAML module chokes on anything larger than 32K, while I want to move a 4MB JPEG. I do realise that even if it worked, the file would be loaded entirely into RAM, but let's suppose that wastefulness is not an issue. The error I get is "Complex regular subexpression recursion limit (32766) exceeded at YAML/Loader.pm line 519" - why would it even try to apply a regex to the data? The spec simply says everything up to \n should just be swallowed into a string. Smaller files, binary or not, do work - is this an expected limitation of the format?

Sample:
    use LWP::UserAgent;
    use YAML;

    my $ua = LWP::UserAgent->new;
    my $ah = HTTP::Headers->new;
    $ah->header('Accept' => 'yaml');    # required by puppet
    my $req = HTTP::Request->new('GET', 'https://localhost:8140/production/file_content/test/afile.jpg', $ah);
    my $res = $ua->request($req);
    if (!$res->is_success) {
        # a method call does not interpolate inside a string, so concatenate it
        die "Something went wrong: " . $res->status_line;
    }
    else {
        # puppet seems to forget adding that trailing newline
        my @a = YAML::Load($res->content . "\n");
        open(HM, '>hm.jpg') or die "Cannot open hm.jpg: $!";
        binmode HM;
        print HM $a[0]->{content};
        close HM;
    }
Update: the "complex regex" part seems to be quite an old bug: https://bugzilla.redhat.com/show_bug.cgi?id=192400
still doesn't explain why the string data should be parsed as such

Re: Load large(ish) files with YAML?
by BrowserUk (Patriarch) on Apr 21, 2011 at 17:00 UTC
    I'd want to move a 4MB JPEG

    You do realise that jpg is a binary data format?

    And that AFAIK, YAML doesn't handle binary data.

    The regex in question is probably looking for newlines (anticipating text) so that it can break your 4MB into "lines", but your binary file doesn't contain any within 32k of the start of the file.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Reading through yaml.org, there seems to be just one clearly stated option for sending binary data (base64), although there is wording in that same document that suggests anything can be embedded as long as it starts with a !!customtag and can be represented in Unicode. Puppet doesn't do base64; it sends a long string of \x123\x321 etc. (this is the actual string it sends: those are not hex chars but their readable form). A thread on their mailing list (Google Groups) suggests they are aware of the issue and consider it a bug.
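
For comparison, here is a minimal sketch of what the base64 option could look like from Perl. It uses the core MIME::Base64 module and builds and parses the YAML text by hand, because whether a given Perl YAML module decodes a `!!binary` scalar for you is something to check rather than assume. The `content` key mirrors the sample above; the byte string is made-up test data, not a real JPEG.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical stand-in for the raw bytes of a JPEG.
my $bytes = join '', map { chr } 0 .. 255;

# encode_base64 wraps its output at 76 characters per line by default.
my $b64 = encode_base64($bytes);

# Build the YAML by hand: the payload becomes an indented literal block
# scalar tagged !!binary.
(my $indented = $b64) =~ s/^/  /mg;
my $yaml = "---\ncontent: !!binary |\n$indented";

# Receiving side: pull out the indented block and decode it.
# decode_base64 ignores characters outside the base64 alphabet,
# so the indentation and newlines do no harm.
my ($payload) = $yaml =~ /^content: !!binary \|\n((?:  .*\n)+)/m;
my $decoded = decode_base64($payload);

print $decoded eq $bytes ? "round trip ok\n" : "round trip failed\n";
```

Base64 grows the data by roughly a third, compared with the several-fold growth of an escaped-hex form.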

        Essentially what you are saying is that Puppet is sending you non-standard (therefore non-YAML) data, and Perl's YAML modules don't handle it. Understandable :)

        Decoding that long string manually shouldn't be a problem. Though the format you describe means your 4MB jpg will come across as 20MB of text.

        If it were encoded as asciified hex bytes (technical terms:), then decoding it would be simple, if horribly slow. Something like:

        my $jpg = pack 'C*', map hex( '0' . $_ ), $content =~ m[\\(x[0-9a-fA-F]+)]g;

        But, your example \x123\x321 shows values greater than \xff, which suggests that they are encoding unicode characters rather than bytes. So you'd need something like:

        my $jpg = pack 'U*', map hex( '0' . $_ ), $content =~ m[\\(x[0-9a-fA-F]+)]g;

        But whether you could then print that to a binary file without getting a bunch of Wide character in print ... warnings or the content messed with by IO layers I have no idea.

        Also, be aware that not only will you have the original 20MB string, and the 4MB result in memory, but also 2 very large lists of scalars. One to the map and one to the pack. How large will depend upon how the unicode decides to split up the binary into 'characters', but each list will be at least 1 million scalars and up to 4 million long.
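
One way to sidestep the two large lists might be a substitution that decodes the escapes directly, followed by a write through a raw filehandle. This is only a sketch under a loud assumption: that each escape is exactly two hex digits naming one byte, which is not what the \x123 example shows. The sample string is invented, and hm.jpg is the output name from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample: the literal text \xff\xd8\xff\xe0 followed by JFIF.
my $content = '\xff\xd8\xff\xe0' . 'JFIF';

# Replace each escape with the byte it names. The work is done on a copy
# of the string, so no list of per-byte scalars is built.
# ASSUMPTION: exactly two hex digits per escape, so every value fits a byte.
(my $jpg = $content) =~ s/\\x([0-9a-fA-F]{2})/chr hex $1/ge;

# Write through a :raw handle so no IO layer touches the bytes.
open my $out, '>:raw', 'hm.jpg' or die "Cannot open hm.jpg: $!";
print {$out} $jpg;
close $out or die "Cannot close hm.jpg: $!";

printf "%d bytes written\n", length $jpg;
```

If the escapes really do carry values above 0xFF, the decoded string would hold wide characters, and printing it to a handle without an encoding layer triggers the "Wide character in print" warning mentioned above.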

        Seems like a really silly (slow, clumsy & laborious) way to transfer a file, given that LWP::Simple will transfer a 4MB binary file locally in less than a second.


        Puppet doesn't do base64, it sends a long string of \x123\x321 etc (this is the actual string it sends, those are not hex chars but their readable form).
        Wow, what a terrible piece of software -- though I suppose it's pretty typical for Ruby, with its mix of historical ignorance and inefficiency. We've known how to encode binary data as ASCII text with reasonable line lengths for a long time: it's called "uuencode," and pack and unpack handle it just fine.
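
As a small illustration of the pack/unpack point, here is a round trip through the `u` template. The byte string is invented test data. Note that `pack 'u'` produces only the encoded body lines, not the `begin`/`end` wrapper that the uuencode command-line tool adds.

```perl
use strict;
use warnings;

# Hypothetical binary sample covering every byte value.
my $bytes = join '', map { chr } 0 .. 255;

# pack 'u' uuencodes the data into text lines (45 input bytes per line by default).
my $encoded = pack 'u', $bytes;

# unpack 'u' reverses the encoding.
my $decoded = unpack 'u', $encoded;

print $decoded eq $bytes ? "round trip ok\n" : "round trip failed\n";
```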
Re: Load large(ish) files with YAML?
by duelafn (Parson) on Apr 21, 2011 at 16:25 UTC