PerlMonks
Merge hashes in specific format

by ovedpo15 (Pilgrim)
on Jan 12, 2019 at 00:11 UTC

ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I'm asking for your wisdom in solving the following problem:
I have a list of JSON files in a special format, which look like the following:
$VAR1 = {
    'data' => {
        'dir1' => {
            'fileA' => { 'pid' => { '61781' => 1 }, 'total' => 13 },
            'fileB' => { 'pid' => { '61799' => 1 }, 'total' => 12 }
        },
        'dir2' => {
            'fileC' => { 'pid' => { '12345' => 1 }, 'total' => 10 },
            'fileA' => { 'pid' => { '61439' => 1 }, 'total' => 5 }
        }
    },
    'total' => {
        'fileA' => 18,
        'fileB' => 12,
        'fileC' => 10
    }
};
These 'JSONs' are not really JSON files; they are just the hashes I wrote to a file with the following code:
sub encode {
    my ($path,$href) = @_;
    my @json_arr = JSON::PP->new->encode($href);
    return convert_file_to_arr($path,@json_arr);
}
I use the following sub in order to decode each json:
sub decode {
    my ($path,$href) = @_;
    unless (-e $path) { return 0; }
    my ($json_data,@jarr);
    convert_file_to_arr($path,\@jarr);   # inserts lines as elements of array
    $json_data = (join "",@jarr);
    %{$href} = %{JSON::PP->new->decode($json_data)};
    return 1;
}
Also, the sub which I iterate through the JSON files:
sub merge_files_and_exec {
    my ($path,$list_of_dirs,$data_href) = @_;
    foreach my $dir (@{$list_of_dirs}) {
        decode($path,$data_href);
    }
}
But of course it doesn't work: the hash is overwritten on every iteration, so nothing gets merged.
I would like to merge those JSON files into one big hash which contains all the data.
It's easier to explain with an example. Say I want to merge the data shown at the start with the following data:
$VAR1 = {
    'data' => {
        'dir3' => {
            'fileA' => { 'pid' => { '616161' => 1 }, 'total' => 6 },
            'fileD' => { 'pid' => { '54321' => 1 }, 'total' => 12 }
        },
        'dir4' => {
            'fileE' => { 'pid' => { '15151' => 1 }, 'total' => 3 },
            'fileA' => { 'pid' => { '1718' => 1 }, 'total' => 2 }
        }
    },
    'total' => {
        'fileA' => 8,
        'fileD' => 12,
        'fileE' => 3
    }
};
The merged hash should be:
$VAR1 = {
    'data' => {
        'dir1' => {
            'fileA' => { 'pid' => { '61781' => 1 }, 'total' => 13 },
            'fileB' => { 'pid' => { '61799' => 1 }, 'total' => 12 }
        },
        'dir2' => {
            'fileC' => { 'pid' => { '12345' => 1 }, 'total' => 10 },
            'fileA' => { 'pid' => { '61439' => 1 }, 'total' => 5 }
        },
        'dir3' => {
            'fileA' => { 'pid' => { '616161' => 1 }, 'total' => 6 },
            'fileD' => { 'pid' => { '54321' => 1 }, 'total' => 12 }
        },
        'dir4' => {
            'fileE' => { 'pid' => { '15151' => 1 }, 'total' => 3 },
            'fileA' => { 'pid' => { '1718' => 1 }, 'total' => 2 }
        }
    },
    'total' => {
        'fileA' => 26,
        'fileB' => 12,
        'fileC' => 10,
        'fileD' => 12,
        'fileE' => 3
    }
};
Under the 'data' key we add all the directories and their data, and under the 'total' key we keep every file with its summed count.
For now I assume the directories are unique: no two hashes will contain the same 'dir' key, so for 'data' we only need to concatenate. The 'total' section is harder, because we need to sum the values for identical keys.
I was wondering if JSON::PP has a merge utility. I searched through its docs, but without success. A built-in merge for these hashes would be great. This issue came up in a project I'm working on.
The problem is that I can't really use any additional modules - only the standard ones (meaning I can't install any additional modules).
It's a shame, because there is probably a good module that can do this. But if it's a basic/standard module, I might already have it.
Anyway, what would be a good and efficient solution to this? Maybe changing the input structure, or adding something to it so the merge can be done efficiently? Each of these hashes will contain 10K+ lines, so it's quite a lot of data (it will run on a dedicated machine, so don't worry about memory).
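[Editor's note: the merge described above can be sketched with core Perl alone, matching the stated assumptions (unique 'dir' keys, summed 'total' values). The sub name merge_into is invented here for illustration and does not appear in the original code:]

```perl
use strict;
use warnings;

# Merge one decoded hash (%$src) into the accumulator (%$dst).
# 'data' keys are assumed unique per file; 'total' values are summed.
sub merge_into {
    my ($dst, $src) = @_;
    for my $dir ( keys %{ $src->{data} } ) {
        $dst->{data}{$dir} = $src->{data}{$dir};
    }
    for my $file ( keys %{ $src->{total} } ) {
        $dst->{total}{$file} += $src->{total}{$file};
    }
}
```

Decoding each file into a fresh temporary hash and then calling merge_into on a single accumulator avoids the overwrite problem in the loop above.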

Replies are listed 'Best First'.
Re: Merge hashes in specific format
by Discipulus (Canon) on Jan 12, 2019 at 10:44 UTC
    Hello ovedpo15,

    Just a quick review: you are overwriting because of: %{$href} = %{JSON::PP->new->decode($json_data)};

I.e., as you noted, at each file iteration you reassign the contents of $href with new data.

    You must instead do something like:

# untested
foreach my $key ( keys %{JSON::PP->new->decode($json_data)} ){
    # some collision check?
    if ( exists $$href{ $key } ){
        warn "$key already exists! was [$$href{ $key }] and will be [${JSON::PP->new->decode($json_data)}{$key}]\n";
    }
    # assign a new key value pair
    $$href{$key} = ${JSON::PP->new->decode($json_data)}{$key};
}

Before dealing with your JSONs, try with a bunch of test hashes until you get it working.
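[Editor's note: a self-contained test-hash version of that loop, with the decode call replaced by a literal hash so it runs standalone; the variable names here are illustrative:]

```perl
use strict;
use warnings;

# stand-ins for the real data: $href holds what was merged so far,
# $new plays the role of JSON::PP->new->decode($json_data)
my $href = { dir1 => { fileA => 1 } };
my $new  = { dir1 => { fileA => 2 }, dir2 => { fileB => 3 } };

foreach my $key ( keys %{$new} ) {
    warn "$key already exists!\n" if exists $href->{$key};  # collision check
    $href->{$key} = $new->{$key};                           # add or overwrite
}
# $href now holds dir1 (overwritten by $new's version) and dir2
```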

    Also I'd avoid subs named encode and decode because of the risk of collision and for readability: if I see encode when using a JSON module I think about the module method ;)

    HtH

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Merge hashes in specific format
by tybalt89 (Monsignor) on Jan 12, 2019 at 16:54 UTC
    #!/usr/bin/perl # https://perlmonks.org/?node_id=1228412 use strict; use warnings; my $one = { data => { dir1 => { fileA => { pid => { 61781 => 1 }, total => 13 } +, fileB => { pid => { 61799 => 1 }, total => 12 } +, }, dir2 => { fileA => { pid => { 61439 => 1 }, total => 5 }, fileC => { pid => { 12345 => 1 }, total => 10 } +, }, }, total => { fileA => 18, fileB => 12, fileC => 10 }, }; my $two = { data => { dir3 => { fileA => { pid => { 616161 => 1 }, total => 6 } +, fileD => { pid => { 54321 => 1 }, total => 12 } +, }, dir4 => { fileA => { pid => { 1718 => 1 }, total => 2 }, fileE => { pid => { 15151 => 1 }, total => 3 }, }, }, total => { fileA => 8, fileD => 12, fileE => 3 }, }; my $newhash = { data => { %{ $one->{data} }, %{ $two->{data} } }, total => do { my %total; for my $href ( $one, $two ) { $total{$_} += $href->{total}{$_} for keys %{ $href->{total} }; } \%total; } }; use Data::Dump 'dd'; dd $newhash;

    Outputs:

    { data => { dir1 => { fileA => { pid => { 61781 => 1 }, total => 13 } +, fileB => { pid => { 61799 => 1 }, total => 12 } +, }, dir2 => { fileA => { pid => { 61439 => 1 }, total => 5 }, fileC => { pid => { 12345 => 1 }, total => 10 } +, }, dir3 => { fileA => { pid => { 616161 => 1 }, total => 6 } +, fileD => { pid => { 54321 => 1 }, total => 12 } +, }, dir4 => { fileA => { pid => { 1718 => 1 }, total => 2 }, fileE => { pid => { 15151 => 1 }, total => 3 }, }, }, total => { fileA => 26, fileB => 12, fileC => 10, fileD => 12, fileE + => 3 }, }

      Hi tybalt89,

Sorry to post this so late, but I couldn't find time to do it earlier. I have a question regarding your merge technique and the one that I posted.

With the method that I posted, I specifically chose not to use Hash::Merge, and also specifically not { ( %{ $ ... }, %{ $ ... } ) }, but instead made it LEFT_PRECEDENT on dir1 .. dirN (even though the design criteria said these elements would be unique). The method I created ignores any new 'dir' element that already exists, so it will not corrupt already-collected data.

      If I would have used Hash::Merge I would have probably done something like:

use strict ;
use warnings ;
use Data::Dumper ;
use List::Util qw { sum0 } ;
use Hash::Merge ;

my $VAR1 = { # Same as before
};
my $VAR2 = { # same as before
};

my $merger = Hash::Merge->new('RETAINMENT_PRECEDENT');
my $VAR3 = $merger->merge( $VAR1, $VAR2 ) ;

foreach(keys %{$VAR3->{ total }}) {
    if ( ref $VAR3->{ total }->{ $_ } eq 'ARRAY' ) {
        $VAR3->{ total }->{ $_ } = sum0 @{$VAR3->{ total }->{ $_ }} ;
    }
}
print Dumper( $VAR3 ) ;

When the same 'dir' elements are received again, the { ( %{ $ ... }, %{ $ ... } ) } technique would corrupt the data:

      • different pid: data loss
      • calculates wrong totals

My question, basically, is: which of the three techniques is, in your opinion, the most robust? Even though I directed this question to tybalt89, anyone else should of course feel free to answer as well.

        "the design criteria said that these elements would be unique"

        If the specification changes, then I can charge extra to adapt it :)

Great, thank you. How do I make $newhash a real hash? Isn't it a scalar right now?

        It is a scalar anonymous hash reference. If by "real hash" you mean "not an anonymous hash reference", try (untested):
my %newhash = (
    data  => { ... },
    total => do { ... },
);


        Give a man a fish:  <%-{-{-{-<

Re: Merge hashes in specific format
by Veltro (Hermit) on Jan 12, 2019 at 00:53 UTC

I tried to find a quick solution for you, but I really have to go to bed. This only works when the items dir1 .. dirN are unique in each hash, and only if the totals are correct:

use strict ;
use warnings ;
use Data::Dumper ;

my $VAR1 = {
    'data' => {
        'dir1' => {
            'fileA' => { 'pid' => { '61781' => 1 }, 'total' => 13 },
            'fileB' => { 'pid' => { '61799' => 1 }, 'total' => 12 }
        },
        'dir2' => {
            'fileC' => { 'pid' => { '12345' => 1 }, 'total' => 10 },
            'fileA' => { 'pid' => { '61439' => 1 }, 'total' => 5 }
        }
    },
    'total' => {
        'fileA' => 18,
        'fileB' => 12,
        'fileC' => 10,
    }
};
my $VAR2 = {
    'data' => {
        'dir3' => {
            'fileA' => { 'pid' => { '616161' => 1 }, 'total' => 6 },
            'fileD' => { 'pid' => { '54321' => 1 }, 'total' => 12 }
        },
        'dir4' => {
            'fileE' => { 'pid' => { '15151' => 1 }, 'total' => 3 },
            'fileA' => { 'pid' => { '1718' => 1 }, 'total' => 2 }
        }
    },
    'total' => {
        'fileA' => 8,
        'fileD' => 12,
        'fileE' => 3,
    }
};

foreach ( grep !exists $VAR1->{ data }->{ $_ }, keys %{$VAR2->{ data }} ) {
    $VAR1->{ data }->{ $_ } = $VAR2->{ data }->{ $_ } ;
    foreach my $f ( keys %{$VAR2->{ data }->{ $_ }} ) {
        $VAR1->{ total }->{ $f } += $VAR2->{ data }->{ $_ }->{ $f }->{ total } ;
    }
}
print Dumper( $VAR1 ) ;

    Veltro

Re: Merge hashes in specific format
by 1nickt (Canon) on Jan 12, 2019 at 14:53 UTC

    Hi,

    You said:

"The problem is that I can't really use any additional modules - only the standard ones (meaning I can't install any additional modules)."
    That is almost certainly not true. Why do you think so? What research have you done to find out about it? Would you try a construction project without first visiting the tool store?

    Then you said:

    "there is probably a good module that can do it"
Did you learn about Hash::Merge, as you were already advised when you asked about this before?

We can't help you if you won't help yourself. Generally, the first step in that process is to let go of your preconceived ideas, which facilitates the second step, i.e. to listen to the advice you get from those who know. If you had spent the 24 hours since your first post on this topic learning how to install the recommended modules on your system and how to use them, instead of continuing down the path on which you had already become stuck, you could have solved your problem by now and be ready to tackle the next one with new skills.


    The way forward always starts with a minimal test.
Re: Merge hashes in specific format
by poj (Abbot) on Jan 12, 2019 at 15:31 UTC

    encode returns a scalar, not an array

sub encode {
    my ($path,$href) = @_;
    # my @json_arr = JSON::PP->new->encode($href);
    # return convert_file_to_arr($path,@json_arr);
    my $json_text = JSON::PP->new->encode($href);
    return convert_file_to_arr($path,$json_text);
}
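[Editor's note: a quick way to see the scalar-vs-list point; JSON::PP has shipped with core Perl since 5.14, so it fits the "standard modules only" constraint:]

```perl
use strict;
use warnings;
use JSON::PP;

# encode() returns one JSON string, not a list of lines;
# canonical() just makes the key order predictable for this demo
my $json_text = JSON::PP->new->canonical->encode({ a => 1, b => 2 });
print $json_text, "\n";    # {"a":1,"b":2}
```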

    Also, what does convert_file_to_arr() actually do ?

    If as Discipulus advised you want to change your sub names then see here

    poj
Re: Merge hashes in specific format
by kschwab (Vicar) on Jan 12, 2019 at 17:59 UTC
    Another approach would be to merge the JSON together then calculate the totals.
my $json1 = decode_json($raw_json1);
my $json2 = decode_json($raw_json2);
push @{ $json1->{whateverkey} }, @{ $json2->{whateverkey} };
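[Editor's note: applied to the hash-shaped data in this thread, the merge-then-total idea might look like the sketch below. The inline JSON strings are trimmed stand-ins for the real files, not the poster's data, and 'dir' keys are assumed unique:]

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

my $json1 = decode_json('{"data":{"dir1":{"fileA":{"total":13}}}}');
my $json2 = decode_json('{"data":{"dir3":{"fileA":{"total":6}}}}');

# merge the JSON together ('dir' keys assumed unique)...
my %merged = ( data => { %{ $json1->{data} }, %{ $json2->{data} } } );

# ...then calculate the totals from the merged 'data'
for my $dir ( keys %{ $merged{data} } ) {
    for my $file ( keys %{ $merged{data}{$dir} } ) {
        $merged{total}{$file} += $merged{data}{$dir}{$file}{total};
    }
}
# $merged{total}{fileA} is now 19
```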
Re: Merge hashes in specific format
by LanX (Saint) on Jan 12, 2019 at 19:20 UTC
    I was going to show some generic code to merge nested hashes until I realized that you posted broken data which lacks multiple commas.

    Sorry this is not the first time and I'm not willing to encourage your laziness.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Football: Perl is like chess, only without the dice

Node Type: perlquestion [id://1228412]
Approved by Paladin