ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

I'm tyring to create the following format of hash:
{ "test": [ { "group": "ABC1", "values": [ "abc", "xyz" ] }, { "group": "ABC2", "values": [ "abc2", ] }, { "group": "ABC3", "values": [ "xyz3" ] } ] }
I iterate over my data the perform the following code on each line:
open(my $fh, '<', "$file_path") or return 0; while (my $line = <$fh>) { # ........... # ........... # Additional operations of the lines of the file # ........... # ........... my %a; # temp hash $a{"group"} = $group; if (defined($href->{"values"})) { $a{"values"} = [sort(uniq($value,@{$href->{"values"}}))]; } else { push(@{$a{"values"}},$value); } push(@{$href->{"test"}},\%a); }
I don't like how it looks, it is not clear and a bit mess.
Is there a way to make the code shorter and maybe without using the temp hash and if-else block?
EDIT: I found out that it actually won't do what I need - it will not push the value in the same array of the group. Not sure how to fix it yet.
EDIT-2: I'm sorry if my question was not understandable. I though about it and from the comments I leraned that I need to clerify my question (sorry again). I have talked about the inner structure but actually the final structure which I need is:
{ + "root": [ { "test": [ { "group": "XYZ", "values": [ "1234" ] }, { "group": "ABC", "values": [ "6.13.00" ] } ] }, { "test": [ { "group": "XYZ", "values": [ "tcsh" ] }, { "tool": "WEA", "values": [ "6.13.00" ] } ] }, { "test": [ { "group": "BAB", "values": [ "ASDAS", "12312321" ] }, { "group": "SADA", "values": [ "6.13.00", "1231231" ] } ] } ] }
I want to iterate though a file of data, parse it and extract group and values. The currect way I use:
sub parse_file { my ($file_path,$href) = @_; open(my $fh, '<', "$file_path") or return 0; while (my $line = <$fh>) { chomp($line); unless ($line =~ /\A[^,]+(?:,[^,]+){5}\z/) { next; } my ($key,$group,$value,$version,$file,$count) = split(/,/,$lin +e); push @{ $href->{test} }, { group => $group, values => [ sort uniq $value, @{ $href->{values} // [] } ] }; } close ($fh); return 1; } foreach my $dir (sort(@list_of_dirs)) { my ($all,%data); $all = $dir."/"."galish"; prase_results_raw($all,\%data); push(@all_data,\%data); }
It does what I need when there are no multiple values for the same group.
For multiple values for the same group I'll get:
{ + "root": [ { "test": [ { "group": "XYZ", "values": [ "1234" ] }, { "group": "ABC", "values": [ "6.13.00" ] } ] }, { "test": [ { "group": "XYZ", "values": [ "tcsh" ] }, { "tool": "WEA", "values": [ "6.13.00" ] } ] }, { "test": [ { "group": "BAB", "values": [ "ASDAS", ] }, { "group": "BAB", "values": [ "12312321" ] }, { "group": "SADA", "values": [ "6.13.00", ] } { "group": "SADA", "values": [ "1231231" ] } ] } ] }
If you take a look closely, you'll see duplicates of the same group name in the same test.
How to solve it?

Replies are listed 'Best First'.
Re: Trying to make the code more clear and clean
by davido (Cardinal) on Jul 29, 2019 at 15:49 UTC

    It's possible I missed some nuance but I think you can totally eliminate the temporary hash, and the if conditional if you just guard a single array-dereference point against definedness:

    #!usr/bin/env perl use strict; use warnings; use List::Util 'uniq'; sub do_thing { my ($file_path, $href) = @_; open my $fh, '<', $file_path or return 0; while (my $line = <$fh>) { # stuff my $group = 'some setup for this line.'; my $value = 'some setup for this line.'; push @{$href->{test}}, { group => $group, values => [ $value, sort uniq(@{$href->{values} // []}), # This will be em +pty if the definedness test would have been false. ], }; } }

    For me I like to ask myself if the temporary variable is adding clarity or not. In this case I didn't think it was, and in fact after removing it, it made it more clear to me that the if conditional was overkill. Instead of the conditional, we simply guard against $href->{values} not being defined. If it is undefined, we provide an empty array reference in @{$href->{values} // []}. The functions sort and uniq are not going to complain about being handed an empty list. And your conditional was really only dealing with the possibility that $href->{values} was undefined, or that the list was empty. We've handled both of those cases here.

    If your code can be made to avoid conditional branches, it's usually going to be easier to follow. And often (but not always), temporary variables increase the amount of state that one has to keep in mind.

    I am wondering, though, why $values isn't being passed through the uniq filter. Is it desired behavior to possibly add a duplicate but then filter it out later?

    Update:

    EDIT: I found out that it actually won't do what I need - it will not push the value in the same array of the group. Not sure how to fix it yet.

    EDIT-2: I'm sorry if my question was not understandable. I though about it and from the comments I leraned that I need to clerify my question (sorry again). I have talked about the inner structure but actually the final structure which I need is:

    Now the question really is hard to follow, and hard to answer.


    Dave

Re: Trying to make the code more clear and clean
by AnomalousMonk (Archbishop) on Jul 29, 2019 at 16:15 UTC

    LanX has pointed out that what is shown in the OP is almost JSON. It would be useful to know exactly what the format of the input data file is. Is it in fact almost-JSON? Is it something that could easily be converted to strict JSON? Is it something that could easily be converted to a Perl expression string and fed to eval (Perl is more forgiving of some things than JSON)? A more simple and/or elegant solution might emerge from more detail.

    Update: Some wording changes/enhancements.


    Give a man a fish:  <%-{-{-{-<

Re: Trying to make the code more clear and clean
by 1nickt (Canon) on Jul 29, 2019 at 14:25 UTC

    Hi, often you can skip a `defined` check to decide whether or not you can dereference something, without autovivifying it if it doesn't exist, by supplying a default. Note use of the defined-or operator //.

    $ perl -MData::Dumper -Mstrict -wE 'my $foo = {}; say for @{ $foo->{ba +r} }; say Dumper $foo' $VAR1 = { 'bar' => [] };
    $ perl -MData::Dumper -Mstrict -wE 'my $foo = {}; say for @{ $foo->{ba +r} // [] }; say Dumper $foo' $VAR1 = {};

    This should work if I understand your partial code sample:

    open(my $fh, '<', "$file_path") or return 0; my $href = {}; while (my $line = <$fh>) { # ........... # ........... # Additional operations of the lines of the file # ........... # ........... push @{ $href->{test} }, { group => $group, values => [ sort uniq $value, @{ $href->{values} // [] } ] +, }; }
    (...although it seems a bit odd to have a top-level key with values that are then added to the sub-hashes. Could you build the combined list when you use the contents of $href?)

    Hope this helps!


    The way forward always starts with a minimal test.
      Thanks all for the replies! My question was not understandable enough so I have edited it. please take a look if you can, thanks, and sorry again!
Re: Trying to make the code more clear and clean
by LanX (Saint) on Jul 29, 2019 at 15:09 UTC
    are you aware that the data dump you are showing is not Perl, but rather kind of JSON?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Maybe the OP set $Data::Dumper::Pair. Maybe it's a sample of the JSON s/he wants to get.


      The way forward always starts with a minimal test.
Re: Trying to make the code more clear and clean
by BillKSmith (Monsignor) on Jul 29, 2019 at 16:20 UTC
    This duplicates my understanding of your code without the temporary hash.
    use strict; use warnings; use List::Util qw(uniq); use Data::Dumper; my $group = q(ABC1); my $value = q(xyz); my $href = { values=> [qw(abc xyz abc)], }; push @{$href->{test}}, {group=>$group}; $href->{test}[-1]{values} = ( defined $href->{"values"} ) ? [sort(uniq($value,@{$href->{"values"}}))] : $value ; print Dumper $href;
    Bill
Re: Trying to make the code more clear and clean
by hippo (Archbishop) on Jul 29, 2019 at 14:06 UTC
    Is there a way to make the code shorter

    Sure, just remove all the double-quotes. None of those in the excerpt you've shown are required.

      I'll make them constans, left them as strings so it will be more readable for the readers of the thread.

        Personally I find that

        open(my $fh, '<', "$file_path") or return 0;

        is much less readable than

        open(my $fh, '<', $file_path) or return 0;

        With the former one is forced to ponder why the author has enclosed nothing more than a single scalar in double-quotes. Perhaps $file_path isn't really a scalar but is instead some path object requiring stringification. So then we have to go looking for the declaration of $file_path and assignments to it, etc. ... only to find that it's a just a plain string after all and the double-quotes are superfluous.

        Editing is an art for sure but cutting for clarity is one of the easiest tasks.

        Nothing to do with constants. They are always strings, just quoted or not. You don't need to quote them when in use as hash key names.


        The way forward always starts with a minimal test.
Re: Trying to make the code more clear and clean
by bliako (Abbot) on Jul 29, 2019 at 22:53 UTC
    If you take a look closely, you'll see duplicates of the same group name in the same test.
    How to solve it? 

    Because you keep pushing them in without checking if that group exists already:

    push @{ $href->{test} }, { group => $group, values => [ sort uniq $value, @{ $href->{values} // [] } ] };

    Also, this keeps sorting every time you insert a new value:

    $a{"values"} = [sort(uniq($value,@{$href->{"values"}}))];

    uniq() can perhaps be eliminated if you used a hash instead of the array. And you can do the sorting *once*, after you have read/inserted your data, as a second stage processing.

      Thank you for the fast reply. I wanted to use hash at first so the key is group and the value is an array of values. But the problem is that this format should be of JSON that will be sent to Mongo and Mongo does not allow dots and the dollar sign to be in the key name. But we want to allow those special chars to be in groups so we had to rechange the structure and they can't be keys. I can't see any other structure than the one I suggested in the post.

      Anyway, I know that I keep pushing it each time, my question here is how I can make sure it gets to the right place? I'll have to iterate over the array and that does not feel efficient. Maybe you can suggest how the structure should look like? Thanks again!
Re: Trying to make the code more clear and clean
by 1nickt (Canon) on Jul 29, 2019 at 23:11 UTC

    Hi, try changing your output structure, yes your output structure, to a much more manageable hash of hashes (of hashes of arrays). If you want to be able easily to pull out data later by group or by test, you should only use an array for the innermost values. The, um, values, haha. Anyway, you don't need all those layers of key names. If the object is of test results, you don't need to name it "test". Just get in there and index the tests. If a test hash can only have values by group, key the results by group name. It's much more natural and way easier to build. Something like this:

    { + "root": { 1 => { "XYZ": [ "1234" ], "ABC": [ "6.13.00" ] }, 2 => { "BAB": [ "ASDAS", "12312321" ], "SADA": [ "6.13.00", "1231231" ] } ] }
    Then you can change
    push @{ $href->{test} }, { group => $group, values => [ sort uniq $value, @{ $href->{values} // [] } ] };
    to
    my $test_num = 0; for my $file ( @test_files ) { $test_num++; ... for my $line (<FILE>) { push @{ $href->{ $test_num }{ $group } }, $value; } }
    Later when you reuse the data to build a report or whatever, you can give it column headers without having to store them in the data object itself.

    Hope this helps!


    The way forward always starts with a minimal test.
      Thanks for the detailed answer! We can't use groups/values as a key of the hashes because we send this JSON format to MongoDB. But MongoDB can't handle with dots or dollar signs in the key section so it will fail (we want to allow those chars). Also we want users to user those reports so we need them to be readable and indexing the keys feels less readable (at least to me).

        I see. OK, that answers one question, but it is not clear from what you have posted so far what constitutes a "test". Is a 'test" the same as a "file"? You still haven't posted any sample data with the expected output from that data. That would be helpful in trying to understand your situation.

        Also, are you aware of how fragile it is to try to handle comma-separated values manually by just splitting? Use Text::CSV! (In fact you could encapsulate this whole program into a Text::CSV after_parse callback, but that's a different story...)

        Assuming that one file is one test, testing one href, and that you do *not* want multiple hashes each keyed with 'test', but just one, with one listing for each group, maybe you could use something like:

        use strict; use warnings; use feature 'say'; use JSON; my $results = {}; for my $href ('root') { my $by_group = {}; while (my $line = <DATA>) { my ( $key, $group, $value, $version, $file, $count ) = split(/ +,/, $line); push @{ $by_group->{ $group } }, $value; } my $test = []; for my $group ( keys %{ $by_group } ) { push @{ $test }, { group => $group, values => $by_group->{ $gr +oup } } } $results->{ $href } = $test; } say JSON->new->pretty->canonical->encode($results); __DATA__ bla,ABC,6.13.00,bla,bla,bla bla,XYZ,1234,bla,bla,bla bla,XYZ,tcsh,bla,bla,bla bla,WEA,6.13.00,bla,bla,bla bla,BAB,ASDAS,bla,bla,bla bla,BAB,12312321,bla,bla,bla bla,SADA,6.13.00,bla,bla,bla bla,SADA,12312321,bla,bla,bla
        (^^ that's an SSCCE ...)

        $ perl oved.pl { "root" : [ { "group" : "BAB", "values" : [ "ASDAS", "12312321" ] }, { "group" : "ABC", "values" : [ "6.13.00" ] }, { "group" : "XYZ", "values" : [ "1234", "tcsh" ] }, { "group" : "WEA", "values" : [ "6.13.00" ] }, { "group" : "SADA", "values" : [ "6.13.00", "12312321" ] } ] }

        Hope this helps!


        The way forward always starts with a minimal test.