G'day jaypal,
Welcome to the monastery.
Probably the only part of this code that I would seriously question is the regex:
-
Instead of matching zero or more times throughout (i.e. with '*'),
matching one or more times throughout (i.e. with '+') may be more appropriate.
-
Would '\w+' be a better choice than '\S+':
both exhibit the same behaviour with the sample data shown but the former might be more useful in finding bad data (in a real application).
-
The final '\s*' is questionable.
Is that something you really want to capture?
Also, note that leaving that out of the regex means that you don't need the chomp on the preceding line.
Your sample data could also have a little more description.
The code suggests that spaces are significant in some places and not in others:
-
Spaces between idN & nameN and nameN & catN are significant for the hash key.
Is this correct?
Should records starting with 'id1 name1' and 'id1 name1' use different hash keys?
-
Spaces between catN & catnameN and after catnameN are not significant for the output.
Do you need to capture these?
[This is slightly off-topic for your specific questions; however, if you're parsing fixed-width data, consider using unpack (there's a tutorial for this: perlpacktut).]
Your two lines
my @val = split(" ",$val);
my $v = join (":", @val);
could be written as a single line not requiring the intermediary @val:
my $v = join ':', split ' ', $val;
If you're not already familiar with this, take a look at the split documentation to see the difference between the "split /PATTERN/"s ' ', / / and /\s+/.
In the script below, I've modified the regex so that an explicit split isn't required and the join is performed as part of the assignment to $val.
This gets rid of both of those two lines as well as both of the @val and $v variables.
Your code
if (exists $hash{$key}) {
$hash{$key} .= $sep.$v;
}
else {
$hash{$key} = $v;
}
looks fine as it is.
It's easy to both read and maintain.
You can often replace this construct by using a ternary operator (aka "Conditional Operator"); however, I think the following looks clunky and is inferior to what you've already written (with respect to readability and maintainability):
$hash{$key} = exists $hash{$key} ? $hash{$key} . $sep . $v : $v;
or even
$hash{$key} = exists $hash{$key} ? "$hash{$key}$sep$v" : $v;
In the script below, I haven't needed that construct because I've allowed $hash{$key} to autovivify (with push @{$data{$key}}, $val;) and performed the concatenation as part of the final print.
[Side Note: print takes a list.
There's no need to explicitly concatenate its arguments.
In the specific case of your code here, you could have written print $k.$hash{$k}."\n"; as just print "$k$hash{$k}\n";; although, that's probably less readable than print $k, $hash{$k}, "\n";.]
Here's the script I've referred to a couple of times above.
#!/usr/bin/env perl -l
use strict;
use warnings;
my %data;
my $re = qr{^(\w+\s+\w+)\s+(\w+)\s+(\w+)};
while (<DATA>) {
my ($key, $val) = map { $_->[0], join ':', @$_[1,2] } [/$re/];
push @{$data{$key}}, $val;
}
print "$_ ", join ', ', @{$data{$_}} for sort keys %data;
__DATA__
id1 name1 cat1 catname1
id1 name1 cat2 catname2
id2 name2 cat3 catname3
id3 name3 cat1 catname1
id3 name3 cat4 catname4
Output:
id1 name1 cat1:catname1, cat2:catname2
id2 name2 cat3:catname3
id3 name3 cat1:catname1, cat4:catname4
|