jaypal has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks! I am practicing perl by picking up text parsing questions from Stack Overflow. I came across a question where I had to transpose column such that following dataset

id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4

would become

id1 name1 cat1:catname1, cat2:catname2 id2 name2 cat3:catname3 id3 name3 cat1:catname1, cat4:catname4

I wrote a perl script, and even though it works as expected, I am in need of some wisdom. I wanted to know if there is a more idiomatic way of doing:

my @val = split(" ",$val); my $v = join (":", @val);

and

if (exists $hash{$key}) { $hash{$key} .= $sep.$v; } else { $hash{$key} = $v; }

Here is my complete perl script. Any advice and or suggestion to improve this will be greatly appreciated. This is my first post here at perlmonks, so if I missed on any crucial details, I apologize for that. Looking forward to be hearing from perlmonks.

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my %hash; my $sep = ", "; while (<DATA>) { chomp; (my $key, my $val) = /(\S*\s*\S*\s*)(\S*\s*\S*\s*)/; my @val = split(" ",$val); my $v = join (":", @val); if (exists $hash{$key}) { $hash{$key} .= $sep.$v; } else { $hash{$key} = $v; } } foreach my $k (sort keys %hash) { print $k.$hash{$k}."\n"; } #print Dumper \%h; __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4

I could have used better variable names but this was just for practice so didn't really focus on that part.

Replies are listed 'Best First'.
Re: Appending values to existing hash key
by GrandFather (Saint) on Mar 21, 2014 at 03:08 UTC

    A number of minor things, most of which are personal preference:

    1. I tend to use statement modifiers where there would only otherwise be one small statement in a controlled block.
    2. remove the trailing white space from the captured strings and require at least one of each character type in the regex to avoid some nasty edge cases
    3. Avoid intermediate variables where the calculation is trivial (@val for example)
    4. Use double quote interpolation instead of concatenation
    5. #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my %hash; while (<DATA>) { (my $key, my $value) = /(\S+\s+\S+)\s+(\S+\s+\S+)/; $value = join ":", split " ", $value; $hash{$key} .= ", " if exists $hash{$key}; $hash{$key} .= $value; } print "$_ $hash{$_}\n" for sort keys %hash; __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4

      Prints:

      id1 name1 cat1:catname1, cat2:catname2 id2 name2 cat3:catname3 id3 name3 cat1:catname1, cat4:catname4
      If the code changes take longer than the time saved, it's fast enough already.

      Thank you so much for replying so fast and hitting all my pain points. This is exactly what I was looking for

      $value = join ":", split " ", $value; # This is awesome $hash{$key} .= ", " if exists $hash{$key}; # I was thinking more i +n lines of a ternary op but this is short and concise too $hash{$key} .= $value; } print "$_ $hash{$_}\n" for sort keys %hash; # I definitely like this a +pproach

      Thank you once again!

Re: Appending values to existing hash key
by NetWallah (Canon) on Mar 21, 2014 at 03:18 UTC
    Arguably more idomatic for the split/join:
    $val=~s/\s+/:/g;
    But this modifies the $val.

    For the second question of appending to hash values, I'd suggest a better structure would be a hash of arrays"

    push @{ $hash{$key} }, $v;
    That way, you do not need to check for existance - it is auto-vivified.
    The difference is - for output, you will have to "join" with $sep:
    ... output for each $key .... join $sep,@{ $hash{$key} }

            What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
                  -Larry Wall, 1992

      Thank you for replying so fast. Yes, doing a hash of arrays did occur to me. Something along the lines -

      chomp; (my $k, my @v) = /(\S*\s*\S*\s*)(\S*\s*\S*\s*)/; push @{$h{$k}} , shift @v;

      But dropped the idea because I had to iterate twice:

      foreach my $key (sort keys %h) { print "$key"; foreach my $a (@{$h{$key}}) { print "$a "; } print "\n"; }

      But I see how in a bigger data set this should be the way to go. Checking for existence in a larger hash could be counter productive.

        Hi jaypal,
        i totally agreed with ++NetWallah and really you can use a map for the nested for loop like so:

        use warnings; use strict; my %hash; while (<DATA>) { my @data = split; push @{ $hash{ $data[0] }{ $data[1] } } => "$data[2]: $data[3]"; } for my $key ( sort keys %hash ) { print $key, ' ', join( ', ' => map { $_, @{ $hash{$key}{$_} } } keys %{ $hash{$ke +y} } ), $/; } __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4


        UPDATE: Code updated as indicated by NetWallah.

        If you tell me, I'll forget.
        If you show me, I'll remember.
        if you involve me, I'll understand.
        --- Author unknown to me

        But I see how in a bigger data set this should be the way to go. Checking for existence in a larger hash could be counter productive.

        You probably don't have to worry about that, searches in hashes are fast and usually do not depend on the size of the hashes.
Re: Appending values to existing hash key
by wind (Priest) on Mar 21, 2014 at 02:10 UTC
    One potential alternative, ensuring keys stay in the same order:
    #!/usr/bin/perl use strict; use warnings; my @keys; my %vals; while (<DATA>) { chomp; my @data = split; my $key = join ' ', splice @data, 0, 2; push @keys, $key if !$vals{$key}; push @{$vals{$key}}, join ':', @data; } for (@keys) { print "$_ " . join(', ', @{$vals{$_}}) . "\n"; } __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4
    Outputs:
    id1 name1 cat1:catname1, cat2:catname2 id2 name2 cat3:catname3 id3 name3 cat1:catname1, cat4:catname4
    - Miller

      Thanks for the prompt reply. I have to agree, your approach is very clever. Thank you for taking out time to write this alternative solution (a very good one indeed!).

Re: Appending values to existing hash key
by graff (Chancellor) on Mar 21, 2014 at 03:52 UTC
    my @val = split(" ",$val); my $v = join (":", @val);
    Assuming that you've already made sure that $val contains only the fields that need to be delimited by colon instead of whitespace, you could just do this:
    $val =~ s/\s+/:/g
    As for the other snippet, you could use a hash of arrays instead of a hash of concatenated strings, then join the array elements for each hash key after the whole set has been loaded. Here are the relevant loops (with a few other changes):
    my %hash; while (<DATA>) { s/\s+$//; # remove all trailing whitespace (not just newline) s/(\S+\s+\S+\s+)//; # remove (and capture) first two fields my $key = $1; s/\s+/:/g; # join remaining fields with ":" push @{$hash{$key}}, $_; } foreach my $k (sort keys %hash) { print $k.join( ', ', @{$hash{$k}} )."\n"; }
    There's nothing wrong with the way you originally did it - it works, it's clear, it's not wasteful, and that's all good.

      Thank you so much for all the explanation on each code snippet. Appreciate your help!

Re: Appending values to existing hash key
by kcott (Archbishop) on Mar 21, 2014 at 05:50 UTC

    G'day jaypal,

    Welcome to the monastery.

    Probably the only part of this code that I would seriously question is the regex:

    • Instead of matching zero or more times throughout (i.e. with '*'), matching one or more times throughout (i.e. with '+') may be more appropriate.
    • Would '\w+' be a better choice than '\S+': both exhibit the same behaviour with the sample data shown but the former might be more useful in finding bad data (in a real application).
    • The final '\s*' is questionable. Is that something you really want to capture? Also, note that leaving that out of the regex means that you don't need the chomp on the preceding line.

    Your sample data could also have a little more description. The code suggests that spaces are significant in some places and not in others:

    • Spaces between idN & nameN and nameN & catN are significant for the hash key. Is this correct? Should records starting with 'id1    name1' and 'id1 name1' use different hash keys?
    • Spaces between catN & catnameN and after catnameN are not significant for the output. Do you need to capture these?

    [This is slightly off-topic for your specific questions; however, if you're parsing fixed-width data, consider using unpack (there's a tutorial for this: perlpacktut).]

    Your two lines

    my @val = split(" ",$val); my $v = join (":", @val);

    could be written as a single line not requiring the intermediary @val:

    my $v = join ':', split ' ', $val;

    If you're not already familiar with this, take a look at the split documentation to see the difference between the "split /PATTERN/"s ' ', / / and /\s+/.

    In the script below, I've modified the regex so that an explicit split isn't required and the join is performed as part of the assignment to $val. This gets rid of both of those two lines as well as both of the @val and $v variables.

    Your code

    if (exists $hash{$key}) { $hash{$key} .= $sep.$v; } else { $hash{$key} = $v; }

    looks fine as it is. It's easy to both read and maintain. You can often replace this construct by using a ternary operator (aka "Conditional Operator"); however, I think the following looks clunky and is inferior to what you've already written (with respect to readability and maintainability):

    $hash{$key} = exists $hash{$key} ? $hash{$key} . $sep . $v : $v;

    or even

    $hash{$key} = exists $hash{$key} ? "$hash{$key}$sep$v" : $v;

    In the script below, I haven't needed that construct because I've allowed $hash{$key} to autovivify (with push @{$data{$key}}, $val;) and performed the concatenation as part of the final print.

    [Side Note: print takes a list. There's no need to explicitly concatenate its arguments. In the specific case of your code here, you could have written print $k.$hash{$k}."\n"; as just print "$k$hash{$k}\n";; although, that's probably less readable than print $k, $hash{$k}, "\n";.]

    Here's the script I've referred to a couple of times above.

    #!/usr/bin/env perl -l use strict; use warnings; my %data; my $re = qr{^(\w+\s+\w+)\s+(\w+)\s+(\w+)}; while (<DATA>) { my ($key, $val) = map { $_->[0], join ':', @$_[1,2] } [/$re/]; push @{$data{$key}}, $val; } print "$_ ", join ', ', @{$data{$_}} for sort keys %data; __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4

    Output:

    id1 name1 cat1:catname1, cat2:catname2 id2 name2 cat3:catname3 id3 name3 cat1:catname1, cat4:catname4

    -- Ken

      Thank you so much Ken. Your explanation was great and helped a great deal on learning the good and bad pracitices. I am starting to follow how perl takes a natural language construct by saying

      print "$_ ", join ', ', @{$data{$_}} for sort keys %data;

      to do what I was very verbose with. map function is something I am still struggling with. Hopefully with little more practice, I can wrap the concept around my head. Thanks again, your answer was very enlightening.

Re: Appending values to existing hash key
by Lennotoecom (Pilgrim) on Mar 21, 2014 at 06:56 UTC
    maybe similar to something posted above, anyway
    while (@t = split /\s+/, <DATA>){ push (@{$h{$t[0]}{$t[1]}}, "$t[2]:$t[3]"); } foreach $k (sort keys %h){ print "$k ", join(' ', map {$_, @{$h{$k}{$_}}} sort keys %{$h{$k}} +),"\n"; } __DATA__ id1 name1 cat1 catname1 id1 name1 cat2 catname2 id2 name2 cat3 catname3 id3 name3 cat1 catname1 id3 name3 cat4 catname4
Re: Appending values to existing hash key
by locked_user sundialsvc4 (Abbot) on Mar 21, 2014 at 14:05 UTC

    As others have said, I approach these problems by looking for a representative data-structure, first; a way to get the data into that structure, second; a way to produce the desired output from that structure, third.   Perl’s “auto-vivification” features are already designed to make the construction of such data-structures very easy.

    If you try to go too-directly from input to output, soon enough someone will want you to add another output, or to do something else with the data between the input and output stages.