hkates has asked for the wisdom of the Perl Monks concerning the following question:

Here is my input:

foo_1-a
foo_2-b
foo_3-b
foo_4-b
bar_1-a
bar_2-a
bar_3-b
bar_4-a
bar_5-b

And my desired output:

foo 4 foo_1-a foo_2-b foo_3-b foo_4-b
bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b

I wish that I had code to show you, but I don't know where to start and am thinking perl may not be the best thing for the job. I want to build a hash for each foo or bar, where foo/bar are the values and the hash keys are the full word. e.g.:

%hash_bar =()
bar_1-a => bar
bar_2-a => bar
bar_3-b => bar
bar_4-a => bar
bar_5-b => bar

%hash_foo =()
foo_1-a => foo
foo_2-b => foo
foo_3-b => foo
foo_4-b => foo

Then I want to print any value in the hash (since they are all the same) followed by the number of keys in the hash, followed by the keys for each hash

It's a stretch to call this pseudo code, but just to clarify my question:

open FH, "<file.txt";
while (<FH>) {
    if (/((\S+)_\S+-\S+)/) {
        #for each unique $1;
        %hash_$1 =();
        # populate hash with keys $2 and values $1
        $hash_$1($2)=$1;
    }
}

I'm not expecting anyone to do this for me, but any direction to the function needed for this would be much appreciated.

I know that I wouldn't be able to create the hashes in that if statement. I would need to create a hash of all the unique $1 first (so that I could use the exists function) and then for each key in that hash, read through the file again, creating a new hash for each key in the original hash.

But that seems very inelegant, and I didn't know how to even write the pseudo code. Am I just totally on the wrong track?

Thanks!

Replies are listed 'Best First'.
Re: Create a hash for each unique captured regex variable
by Athanasius (Archbishop) on Jan 29, 2015 at 03:11 UTC

    Hello hkates,

    You’ve already been given code that solves your problem, but I want to give you some pointers on how to develop a Perlish solution.

    I don't know where to start and am thinking perl may not be the best thing for the job.
    ...
    I know that I wouldn't be able to create the hashes in that if statement. I would need to create a hash of all the unique $1 first (so that I could use the exists function) and then for each key in that hash, read through the file again, creating a new hash for each key in the original hash.

    And there’s your problem! You don’t need to create the hash keys first; you can create them on the fly as needed, and autovivification makes Perl the perfect tool for this job!

    But the key to solving this problem is getting the data structure right. It helps to work backwards: what structure will make it easiest to print off the desired output? Some thought, perhaps some trial-and-error, and the answer emerges: a hash of arrays (HoA):

    (
        foo => [ foo_1-a, foo_2-b, foo_3-b, foo_4-b ],
        bar => [ bar_1-a, bar_2-a, bar_3-b, bar_4-a, bar_5-b ],
    )

    On HoAs, see perldsc. Now that you have the right data structure, the code to write and read it almost writes itself (well, kinda...):

    #! perl
    use strict;
    use warnings;
    use Data::Dump;

    my %hash;

    while (<DATA>)
    {
        push @{ $hash{$2} }, $1
            if / ( ([^\s_]+) _ [^\s-]+ - \S+ ) /x;
    }

    print "\nData structure (HoA):\n";
    dd \%hash;

    print "\nOutput:\n";

    for (sort keys %hash)
    {
        my $array_ref = $hash{$_};
        print $_, ' ', scalar @$array_ref, ' ', join(' ', @$array_ref), "\n";
    }

    __DATA__
    foo_1-a
    foo_2-b
    foo_3-b
    foo_4-b
    bar_1-a
    bar_2-a
    bar_3-b
    bar_4-a
    bar_5-b

    I’ve put in code to dump the hash, so you can clearly see the intermediate point at which the data structure has been populated. Here is the output:

    12:43 >perl 1139_SoPW.pl

    Data structure (HoA):
    {
      bar => ["bar_1-a", "bar_2-a", "bar_3-b", "bar_4-a", "bar_5-b"],
      foo => ["foo_1-a", "foo_2-b", "foo_3-b", "foo_4-b"],
    }

    Output:
    bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
    foo 4 foo_1-a foo_2-b foo_3-b foo_4-b

    12:43 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Create a hash for each unique captured regex variable
by CountZero (Bishop) on Jan 29, 2015 at 06:57 UTC
    If the design of your code involves dynamically making a variable (like your %hash_bar and %hash_foo), then you should most probably rethink it. It is almost always a clear sign that there is something wrong with your data-structure.

    Many times your problem can be solved by adding one extra level of keys to your hash: $hash{'foo'}{...} and $hash{'bar'}{...}.
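
A minimal sketch of that extra level of keys, assuming each word looks like prefix_rest. The names %hash and @words are illustrative and not taken from the original post. A single hash keyed by prefix stands in for the separate %hash_foo and %hash_bar:

```perl
use strict;
use warnings;

# Illustrative input; in the original question the words come from a file.
my @words = qw(foo_1-a foo_2-b foo_3-b foo_4-b
               bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b);

my %hash;
for my $word (@words) {
    # Everything before the first underscore is the group name.
    my ($prefix) = $word =~ /^([^_]+)_/ or next;
    # The inner hash is autovivified on first use; no need to declare it.
    $hash{$prefix}{$word} = $prefix;
}

for my $prefix (sort keys %hash) {
    my @members = sort keys %{ $hash{$prefix} };
    print join(' ', $prefix, scalar @members, @members), "\n";
}
# bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
# foo 4 foo_1-a foo_2-b foo_3-b foo_4-b
```

Because the inner values are all identical to the outer key, a hash of arrays would probably serve just as well here; the nested hash is shown only because it mirrors the structure sketched in the question.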

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Create a hash for each unique captured regex variable
by NetWallah (Canon) on Jan 29, 2015 at 01:13 UTC
    Anonymonk's solution would give better keys ("foo" instead of "o") if the "+" were moved inside the closing paren: /([^_]+)/.
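
A small sketch of the difference, using one word from the sample input. The variable names are illustrative only:

```perl
use strict;
use warnings;

my $word = 'foo_1-a';

# Quantifier outside the group: the group is re-captured for each
# character, so only the last character matched is kept.
my ($outside) = $word =~ /([^_])+/;

# Quantifier inside the group: the whole run before "_" is captured.
my ($inside) = $word =~ /([^_]+)/;

print "outside: $outside\n";   # outside: o
print "inside:  $inside\n";    # inside:  foo
```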

    Anyway - here is a similar one-liner alternative:

    > perl -anF_ -e '$F[1] and chomp,push @{$h{$F[0]}},$_}{print "$_ ",scalar @{$h{$_}}," ",join " ", @{$h{$_}},"\n" for sort keys %h' test.txt
    bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
    foo 4 foo_1-a foo_2-b foo_3-b foo_4-b

            "You're only given one little spark of madness. You mustn't lose it."         - Robin Williams

Re: Create a hash for each unique captured regex variable
by Anonymous Monk on Jan 29, 2015 at 00:24 UTC
    perl -MData::Dump=dd -le " @g= qw/ foo_1-a foo_2-b foo_3-b foo_4-b bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b /; for(@g){ /([^_])+/ and push @{ $f{$1} }, $_; } dd( \%f ); "
    {
      o => ["foo_1-a", "foo_2-b", "foo_3-b", "foo_4-b"],
      r => ["bar_1-a", "bar_2-a", "bar_3-b", "bar_4-a", "bar_5-b"],
    }
Re: Create a hash for each unique captured regex variable
by MidLifeXis (Monsignor) on Jan 29, 2015 at 13:39 UTC

    It appears to me that the data you have should be bucketized based on the text before the '_' character. If this is correct, perhaps these pointers will help:

    • You can use my ( $bucket, $rest ) = split('_', $line, 2) to chop up your data
    • You can also store the data into the bucket as a Hash of Arrays: push @{ $buckets{$bucket} ||=[] }, $line
    • You can then find every bucket you have: keys %buckets
    • You can also get the items in each bucket: @items = @{ $buckets{$bucket} }
    • You can also count the number of items in a bucket: $item_count = scalar( @items ) or $item_count = scalar( @{ $buckets{$bucket} } )
    • You can join all of your items together: $string = join(" ", @items)

    Given the above and your skeleton code, you should be able to piece them together to accomplish your goals.
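
One way the pointers above might fit together, as a sketch rather than a definitive solution. It assumes one word per line; the @lines array is a hypothetical stand-in for reading from a filehandle:

```perl
use strict;
use warnings;

# Hypothetical stand-in for lines read from a file, newlines included.
my @lines = map { "$_\n" } qw(foo_1-a foo_2-b foo_3-b foo_4-b
                              bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b);

my %buckets;
for my $line (@lines) {
    chomp $line;
    # Split into at most two fields at the first underscore.
    my ($bucket, $rest) = split /_/, $line, 2;
    next unless defined $rest;            # skip lines without an underscore
    # Autovivification creates the array, so "||= []" is optional here.
    push @{ $buckets{$bucket} }, $line;
}

for my $bucket (sort keys %buckets) {
    my @items      = @{ $buckets{$bucket} };
    my $item_count = scalar @items;
    print "$bucket $item_count ", join(' ', @items), "\n";
}
# bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
# foo 4 foo_1-a foo_2-b foo_3-b foo_4-b
```

Sorting the keys puts bar before foo. To keep the order of first appearance instead, the bucket names could be recorded in a separate array as they are first seen.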

    --MidLifeXis