Optimize Large, Complex JSON decoding

Endless has asked for the wisdom of the Perl Monks concerning the following question:

My program parses JSON files of varying sizes, some over 10 MB; the files contain JSON arrays that can hold hundreds or even thousands of elements. I unwind these into a CSV file containing just the few columns I want (only a fraction of each JSON element). It seems to take a long time to simply unthread the JSON files, so I'm wondering if there are any tips on optimizing this process. In particular, am I likely to be losing time by using 'slurp' instead of some other read method? I'm not sure, because each file comes as one ridiculously long line anyway. Suggestions, or is this as good as it gets?

sub _process_json {
    my $jfile = shift;
    my $json;
    {
    local $/; #Enable 'slurp' mode         # xxx Might have trouble with larger jsons (10+ mb)
    open my $fh, "<", "$jfile";
    $json = <$fh>;
    close $fh;
    }
    my $json_data = decode_json($json);
        
    # Go through each interaction (twitter message)
    my @interactions = $json_data -> {'interactions'}; # A scalar of an array of hashes
    while ( (my $key, my $value) = each $interactions[0] ) {
    my $tweetid = $value -> {'twitter'} -> {'id'};
    if (exists $duplicates{$tweetid}){
        $duplicate_count++;        
        next;        # Skip duplicates
    }else{
        $duplicates{$tweetid} = ();
        $tweets_file_count++;
    }

    # Dates of form 'Fri, 01 Mar 2013 01:21:14 +0000'
    my $created_at = epoch_sec($value -> {'twitter'} -> {'created_at'});
    my $klout = ($value -> {'klout'} -> {'score'}) // ""; # Optional in DS jsons
    my $screen_name = $value -> {'twitter'} -> {'user'} -> {'screen_name'};
    my $text = decode_entities($value -> {'twitter'} -> {'text'});

    # Formatting for the final output
    $text =~ s/\R/\t/g; # Remove linebreaks
    $text =~ s/"/""/g; # Swap quotations

    print $out_file
        "$tweetid,", 
        "$created_at,",
        "$klout,",
        "$screen_name,",
        "\"$text\"", 
        "\n";
    } #END while (each tweet)
} #END _process_json

use Inline C => q@
int epoch_sec(char * date) {
    char *tz_str = date + 26;
    struct tm tm;
    int tz;

    if (  strlen(date) != 31                           ||
        strptime(date, "%a, %d %b %Y %T", &tm) == NULL ||
          sscanf(tz_str, "%d", &tz) != 1)
    {
        printf("Invalid date %s\n", date);
        return 0;
    }

    return timegm(&tm) - 
        (tz < 0 ? -1 : 1)*(abs(tz)/100*3600 + abs(tz)%100*60);
}
@;

Re: Optimize Large, Complex JSON decoding by Anonymous Monk on Sep 19, 2013 at 00:20 UTC
"It seems to take a long time to simply unthread the JSON files, so I'm wondering if there are any tips on optimizing this process."

What is "a long time"? How did you determine the bottleneck? On my really old (9-year-old) laptop, it takes 0.96875 seconds to slurp + decode + foreach over 189279 "records" from a 21 MB JSON file. If I add in some Time::Piece strftime/strptime, it goes to 9.984375 seconds. I don't see room for improvement, although it looks like you could reduce the memory requirement with JSON::Streaming::Reader.
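One way to answer the "how did you determine the bottleneck" question is to time each phase separately. The following is only a minimal sketch: the sample data is invented to resemble the structure in the question, and it uses the core JSON::PP module so that it runs without extra installs (JSON::XS exports the same `decode_json` and is generally much faster).

```perl
use strict;
use warnings;
use JSON::PP   qw(encode_json decode_json);  # core module, used here for portability
use Time::HiRes qw(time);                    # sub-second wall-clock timestamps

# Hypothetical sample data standing in for a real input file.
my $sample = encode_json({
    interactions => [
        map { +{ twitter => { id => "$_", text => "tweet $_" } } } 1 .. 1000
    ],
});

# Phase 1: decoding the JSON text into Perl data structures.
my $t0   = time;
my $data = decode_json($sample);
my $t1   = time;

# Phase 2: walking the decoded records.
my $count = 0;
for my $item ( @{ $data->{interactions} } ) {
    $count++ if defined $item->{twitter}{id};
}
my $t2 = time;

printf "decode: %.4fs  loop: %.4fs  records: %d\n",
    $t1 - $t0, $t2 - $t1, $count;
```

Adding further timestamps around the date conversion and the output step would show which part of the real program dominates.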
Re: Optimize Large, Complex JSON decoding by Anonymous Monk on Sep 19, 2013 at 00:54 UTC
Do not, for example, copy the "interactions" into a separate array (`@interactions`); iterate directly over the elements in the array within the decoded JSON content.
Re^2: Optimize Large, Complex JSON decoding by Endless (Beadle) on Sep 19, 2013 at 01:30 UTC
Thanks for your suggestions. As I'm still learning Perl, what do you recommend instead of `my @interactions = $json_data -> {'interactions'}; # A scalar of an array of hashes`? I am parsing only 11846 records per second.
Re^3: Optimize Large, Complex JSON decoding by Anonymous Monk on Sep 19, 2013 at 03:35 UTC
Read the references quick reference and apply the rule: `for my $one ( @{ $json->.... } ) { ... }`
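Applied to the code in the question, that rule might look like the sketch below. The tiny JSON string is made up for illustration, and `%seen` and `@unique_ids` are hypothetical names; the loop dereferences the array reference stored under `interactions` instead of copying it into `@interactions` and calling `each` on it.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # core module; JSON::XS exports the same function

# Hypothetical miniature input with the same shape as the question's data.
my $json = '{"interactions":['
         . '{"twitter":{"id":"1"}},'
         . '{"twitter":{"id":"2"}},'
         . '{"twitter":{"id":"1"}}]}';
my $json_data = decode_json($json);

my ( %seen, @unique_ids );
my $duplicate_count = 0;

# $json_data->{interactions} is an array reference; @{ ... } dereferences it,
# so each element is visited in place without an intermediate array.
for my $value ( @{ $json_data->{interactions} } ) {
    my $tweetid = $value->{twitter}{id};
    if ( $seen{$tweetid}++ ) {      # already seen: count it and skip
        $duplicate_count++;
        next;
    }
    push @unique_ids, $tweetid;
}

print "unique: @unique_ids, duplicates: $duplicate_count\n";
```

Inside the loop, `$value` plays the same role as in the original `while` loop, so the rest of the body can stay as it is.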