Create a hash for each unique captured regex variable

hkates has asked for the wisdom of the Perl Monks concerning the following question:

Here is my input:

foo_1-a
foo_2-b
foo_3-b
foo_4-b
bar_1-a
bar_2-a
bar_3-b
bar_4-a
bar_5-b

And my desired output:

foo 4 foo_1-a foo_2-b foo_3-b foo_4-b
bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b

I wish that I had code to show you, but I don't know where to start and am thinking perl may not be the best thing for the job. I want to build a hash for each foo or bar, where foo/bar are the values and the hash keys are the full word. e.g.:

%hash_bar = (
    'bar_1-a' => 'bar',
    'bar_2-a' => 'bar',
    'bar_3-b' => 'bar',
    'bar_4-a' => 'bar',
    'bar_5-b' => 'bar',
);
%hash_foo = (
    'foo_1-a' => 'foo',
    'foo_2-b' => 'foo',
    'foo_3-b' => 'foo',
    'foo_4-b' => 'foo',
);

Then I want to print any value in the hash (since they are all the same), followed by the number of keys in the hash, followed by the keys themselves, for each hash.

It's a stretch to call this pseudo code, but just to clarify my question:

open FH, "<file.txt";
while (<FH>)
{
    chomp;
    if (/^(\S+)_\S+-\S+$/)
    {
        # for each unique $1 (pseudocode, not valid Perl):
        %hash_$1 = ();
        # populate the hash with the full word as key and $1 as value
        $hash_$1{$_} = $1;
    }
}

I'm not expecting anyone to do this for me, but any direction to the function needed for this would be much appreciated.

I know that I wouldn't be able to create the hashes in that if statement. I would need to create a hash of all the unique $1 first (so that I could use the exists function) and then for each key in that hash, read through the file again, creating a new hash for each key in the original hash.

But that seems very inelegant, and I didn't know how to even write the pseudo code. Am I just totally on the wrong track?

Thanks!

Re: Create a hash for each unique captured regex variable
by Athanasius (Archbishop) on Jan 29, 2015 at 03:11 UTC

Hello hkates,

You’ve already been given code that solves your problem, but I want to give you some pointers on how to develop a Perlish solution.

I don't know where to start and am thinking perl may not be the best thing for the job.
...
I know that I wouldn't be able to create the hashes in that if statement. I would need to create a hash of all the unique $1 first (so that I could use the exists function) and then for each key in that hash, read through the file again, creating a new hash for each key in the original hash.

And there’s your problem! You don’t need to create the hash keys first; you can create them on the fly as needed, and autovivification makes Perl the perfect tool for this job!
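
To make autovivification concrete, here is a minimal sketch (the keys and values are just sample data from the question). Nothing is declared per prefix in advance; the first push through a missing key makes Perl create the array reference for it:

```perl
use strict;
use warnings;

my %hash;    # starts empty: no keys, no arrays

# Pushing through a key that does not exist yet autovivifies it:
# Perl first creates $hash{foo} as a reference to a new, empty array.
push @{ $hash{foo} }, 'foo_1-a';
push @{ $hash{foo} }, 'foo_2-b';
push @{ $hash{bar} }, 'bar_1-a';

for my $key (sort keys %hash) {
    my $count = @{ $hash{$key} };    # array in scalar context = element count
    print "$key $count @{ $hash{$key} }\n";
}
```

The same thing happens with nested hashes, which is why a single pass over the file is enough.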

But the key to solving this problem is getting the data structure right. It helps to work backwards: what structure will make it easiest to print off the desired output? Some thought, perhaps some trial-and-error, and the answer emerges: a hash of arrays (HoA):

(
    foo => [ 'foo_1-a', 'foo_2-b', 'foo_3-b', 'foo_4-b' ],
    bar => [ 'bar_1-a', 'bar_2-a', 'bar_3-b', 'bar_4-a', 'bar_5-b' ],
)

On HoAs, see perldsc. Now that you have the right data structure, the code to write and read it almost writes itself (well, kinda...):

#! perl
use strict;
use warnings;
use Data::Dump;

my %hash;

while (<DATA>)
{
    push @{ $hash{$2} }, $1 if / ( ([^\s_]+) _ [^\s-]+ - \S+ ) /x;
}

print "\nData structure (HoA):\n";
dd \%hash;

print "\nOutput:\n";

for (sort keys %hash)
{
    my $array_ref = $hash{$_};
    print $_, ' ', scalar @$array_ref, ' ', join(' ', @$array_ref), "\n";
}

__DATA__
foo_1-a
foo_2-b
foo_3-b
foo_4-b
bar_1-a
bar_2-a
bar_3-b
bar_4-a
bar_5-b

I’ve put in code to dump the hash, so you can clearly see the intermediate point at which the data structure has been populated. Here is the output:

12:43 >perl 1139_SoPW.pl

Data structure (HoA):
{
  bar => ["bar_1-a", "bar_2-a", "bar_3-b", "bar_4-a", "bar_5-b"],
  foo => ["foo_1-a", "foo_2-b", "foo_3-b", "foo_4-b"],
}

Output:
bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
foo 4 foo_1-a foo_2-b foo_3-b foo_4-b

12:43 >
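
Note that the sorted output above lists bar first, whereas the desired output in the question lists foo first. If input order matters (an assumption; the question does not say so outright), one possible variation is to remember each prefix the first time it is seen. The @order array and the inline @lines list are illustrative names, not part of the script above:

```perl
use strict;
use warnings;

# Illustrative input; in the script above this would come from <DATA>.
my @lines = qw(foo_1-a foo_2-b foo_3-b foo_4-b
               bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b);

my (%hash, @order);

for (@lines) {
    next unless / ( ([^\s_]+) _ [^\s-]+ - \S+ ) /x;
    push @order, $2 unless exists $hash{$2};    # record first appearance
    push @{ $hash{$2} }, $1;
}

# Walk the prefixes in first-seen order instead of sorted order
for my $key (@order) {
    my $items = $hash{$key};
    print join(' ', $key, scalar @$items, @$items), "\n";
}
```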

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Create a hash for each unique captured regex variable
by CountZero (Bishop) on Jan 29, 2015 at 06:57 UTC

%hash_bar

%hash_foo

Many times your problem can be solved by adding one extra level of keys to your hash: $hash{'foo'}{...} and $hash{'bar'}{...}.
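
As a rough sketch of that idea, the separate %hash_foo and %hash_bar from the question become two entries in a single hash of hashes. The names %hash and @words and the shortened sample data are made up for illustration:

```perl
use strict;
use warnings;

my @words = qw(foo_1-a foo_2-b bar_1-a bar_2-a bar_3-b);    # sample data

my %hash;
for my $word (@words) {
    my ($prefix) = $word =~ /^([^_]+)_/ or next;    # skip words without "_"
    $hash{$prefix}{$word} = $prefix;    # same key => value layout as %hash_foo
}

for my $prefix (sort keys %hash) {
    my @keys = sort keys %{ $hash{$prefix} };
    print "$prefix ", scalar @keys, " @keys\n";
}
```

Since every value under a given prefix is just the prefix itself, the inner hash mostly earns its keep if duplicates in the input should be collapsed; otherwise an array per prefix is enough.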

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Re: Create a hash for each unique captured regex variable
by NetWallah (Canon) on Jan 29, 2015 at 01:13 UTC

Anyway - here is a similar one-liner alternative:

> perl -anF_ -e '$F[1] and chomp,push @{$h{$F[0]}},$_}{print "$_ ",scalar @{$h{$_}}," ",join " ", @{$h{$_}},"\n" for sort keys %h' test.txt
bar 5 bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b
foo 4 foo_1-a foo_2-b foo_3-b foo_4-b
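
For readers less familiar with the command-line switches, here is a rough long-hand version of the one-liner above. -n wraps the code in a read loop, -a with -F_ splits each line on underscores into @F, and the }{ in the middle closes that loop so the print runs once at the end. This sketch reads a short in-memory sample in place of test.txt and only approximates what perl generates internally (edge cases such as a line ending in "_" are handled slightly differently):

```perl
use strict;
use warnings;

# Sample data standing in for test.txt
my $data = "foo_1-a\nfoo_2-b\nbar_1-a\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my %h;
while (my $line = <$fh>) {           # -n: loop over the input lines
    chomp $line;
    my @F = split /_/, $line;        # -a -F_: split each line on "_"
    next unless defined $F[1];       # nothing after the "_": skip the line
    push @{ $h{ $F[0] } }, $line;    # group whole lines under the prefix
}
close $fh;

# The part after "}{" in the one-liner: runs once, after the loop
for my $key (sort keys %h) {
    print "$key ", scalar @{ $h{$key} }, " ", join(" ", @{ $h{$key} }), "\n";
}
```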

"You're only given one little spark of madness. You mustn't lose it." - Robin Williams

Re: Create a hash for each unique captured regex variable
by Anonymous Monk on Jan 29, 2015 at 00:24 UTC

perl -MData::Dump=dd -le " @g= qw/ foo_1-a foo_2-b foo_3-b foo_4-b bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b /; for(@g){ /([^_]+)/ and push @{ $f{$1} }, $_; } dd( \%f ); "
{
  bar => ["bar_1-a", "bar_2-a", "bar_3-b", "bar_4-a", "bar_5-b"],
  foo => ["foo_1-a", "foo_2-b", "foo_3-b", "foo_4-b"],
}
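
When adapting a pattern like this, it is worth checking where the quantifier sits relative to the capturing parentheses, since that decides what ends up in $1 and therefore which hash key is used. A small sketch of the difference, using one of the sample words:

```perl
use strict;
use warnings;

my $word = 'foo_1-a';

# Quantifier outside the group: the group is repeated once per character,
# and the capture keeps only the last repetition.
my ($last_char) = $word =~ /([^_])+/;

# Quantifier inside the group: the group matches the whole run before "_".
my ($prefix) = $word =~ /([^_]+)/;

print "outside: $last_char\n";    # prints "outside: o"
print "inside:  $prefix\n";       # prints "inside:  foo"
```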

Re: Create a hash for each unique captured regex variable
by MidLifeXis (Monsignor) on Jan 29, 2015 at 13:39 UTC

It appears to me that the data you have should be bucketized based on the text before the '_' character. If this is correct, perhaps these pointers will help:

You can use my ( $bucket, $rest ) = split('_', $line, 2) to chop up your data
You can also store the data into the bucket as a Hash of Arrays: push @{ $buckets{$bucket} ||= [] }, $line
You can then find every bucket you have: keys %buckets
You can also get the items in each bucket: @items = @{ $buckets{$bucket} }
You can also count the number of items in a bucket: $item_count = scalar( @items ) or $item_count = scalar( @{ $buckets{$bucket} } )
You can join all of your items together: $string = join(" ", @items)

Given the above and your skeleton code, you should be able to piece them together to accomplish your goals.
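
One way the pointers above might fit together, as a minimal sketch rather than a finished program. The inline @lines list stands in for reading the file, and skipping lines without a '_' is an assumption about how odd input should be treated:

```perl
use strict;
use warnings;

# Sample lines standing in for the input file
my @lines = qw(foo_1-a foo_2-b foo_3-b foo_4-b
               bar_1-a bar_2-a bar_3-b bar_4-a bar_5-b);

my %buckets;
for my $line (@lines) {
    my ( $bucket, $rest ) = split('_', $line, 2);
    next unless defined $rest;                 # no "_" in this line
    push @{ $buckets{$bucket} ||= [] }, $line;
}

for my $bucket (sort keys %buckets) {
    my @items      = @{ $buckets{$bucket} };
    my $item_count = scalar( @items );
    print join(" ", $bucket, $item_count, @items), "\n";
}
```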

--MidLifeXis
