Re: Print something when key does not exist
by kcott (Archbishop) on Apr 05, 2014 at 21:05 UTC
G'day jaypal,
I'll first show how I might have coded this.
Then I'll walk through it, line by line, and explain the differences between our versions (answering your questions along the way).
Here's my script. The __DATA__ and output are the same as yours.
#!/usr/bin/env perl -l
use strict;
use warnings;
my (%data, %codes_found);
my $sep = '^';
while (<DATA>) {
    next if $. == 1;
    chomp;
    my ($name, $code, $count) = split /\Q$sep/;
    ++$codes_found{$code};
    $data{$name}{$code} = $count;
}
my @codes = sort keys %codes_found;
print join $sep => 'Name', @codes;
for my $name (sort keys %data) {
    print join $sep => $name, map { $data{$name}{$_} || '' } @codes;
}
Here's the walkthrough.
- #!/usr/bin/env perl -l
-
In your shebang line, use whatever identifies the perl you want.
The part I wanted to highlight was the -l switch: this saves you having to add "\n" to all your print statements.
It's not always what you want but is very often useful.
See "perlrun: Command Switches".
- use strict;
- use warnings;
-
Same as your script.
(Except in extremely rare cases, e.g. to demonstrate the effects of their absence, I use these in all my scripts.)
- my (%data, %codes_found);
- my $sep = '^';
-
Not too dissimilar to what you have.
I didn't need an @flds (see below).
You can declare multiple my variables in a list.
My %data is used for the same function as your $map.
Dereferencing has some minimal overhead and requires a small amount of extra code (i.e. '->'): unless $hashref is really what you want, I find %hash is generally a better choice.
- while (<DATA>) {
-
Unless you really need a $line variable, I usually find the default $_ suffices.
- next if $. == 1;
- chomp;
-
If you intend skipping entire lines without performing any processing on them, do the next before any processing, including chomp.
- my ($name, $code, $count) = split /\Q$sep/;
-
If you're going to split your lines in a while loop such as this, a fresh my @fields variable (on each iteration) is a better choice than an array declared once outside the loop.
While it makes little difference in a small script like this, getting into the habit of making your variables available in the smallest scope possible means you'll avoid the problems associated with global variables: if the script was changed, additional logic complexity could result in a hard-to-track-down bug where you were perhaps operating on values from a previous iteration.
As we're returning just three, well-defined values, my ($name, $code, $count) = ... ticks the limited scope box and makes the following lines more readable and maintainable.
Also note, I'm using the $sep variable already defined rather than hard-coding a value here: if the separator changed or you wanted to abstract this for multiple data sources with different separators, you only need to change one value. In a more complex situation, you may need a $in_sep and a $out_sep; however, that still equates to changing values in one place rather than having to search your entire script for hard-coded values and make multiple changes.
See quotemeta if you're unfamiliar with \Q.
- ++$codes_found{$code};
-
You asked about this. It's a standard and well-known idiom: its use is quite appropriate here.
In a simple statement such as this, the use of the prefix or postfix forms doesn't make any difference.
In a more complex statement, they could easily make a difference to the logic, so make sure you understand both forms of autoincrement and autodecrement. See "perlop: Auto-increment and Auto-decrement".
Also consider the readability (of the keys) of $codes_found{$code} vs. $codes{$flds[1]}.
Furthermore, if the input data format changed, the former (with $code) would probably still work as written, while the latter (with $flds[1]) may well need modification.
- $data{$name}{$code} = $count;
-
While this does the same as your $map->{$flds[0]}->{$flds[1]} = $flds[2], consider the same readability and maintainability points that I raised above.
I don't know the purpose of the or next on that line of your script.
In this particular instance, it's a no-op (i.e. $flds[2] is always TRUE, so next is never called) so you were lucky; in another instance, that no-op could well become a bug!
By the way, after the first $hashref->{$key} or $arrayref->[$index], you don't need to keep repeating the '->' to drill down into a complex data structure.
For instance, $hashref->{$key_outer}{$key_inner} would be fine; this works for any complexity, e.g. $arrayref->[$i]{$key}[$j] is also fine.
- my @codes = sort keys %codes_found;
-
You need the codes in that order twice; just generate the list once.
- print join $sep => 'Name', @codes;
-
A join is a simpler solution than a for loop to print that one line.
- for my $name (sort keys %data) {
-
No difference in logic to your script.
$name is more meaningful and exactly mirrors its use earlier.
Compare that single name to your $flds[0] and $k1 and consider the same readability and maintenance issues already raised.
By the way, for and foreach are synonymous.
I go with the laziness virtue on this one and save myself four keystrokes each time I want a foreach loop.
- print join $sep => $name, map { $data{$name}{$_} || '' } @codes;
-
You asked a couple of questions about this part.
As we already have @codes, an additional for is not needed and only a single print statement is required.
While there are other ways to do this, that probably answers your "How can I write this idiomatically." question.
You're quite correct about autovivification.
The exists function is often a good way to avoid this.
In this case, $data{$name}{$_} || '' causes no autovivification.
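One caveat about the || '' idiom in that last line: || tests truth, so a genuine count of 0 would also be replaced by an empty string. The data in this thread has no zero counts, so it makes no difference here. Below is a minimal sketch of the alternatives, using a made-up %data (the name and counts are hypothetical); the // operator needs Perl 5.10 or later.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical data: 'amy' has a genuine count of 0 for code 'B'
# and no entry at all for code 'C'.
my %data  = (amy => { A => 3, B => 0 });
my @codes = qw(A B C);
my $name  = 'amy';

# Truth test: both the 0 and the missing entry become ''.
my @by_or = map { $data{$name}{$_} || '' } @codes;

# Defined-or (Perl 5.10+): only the missing entry becomes ''.
my @by_dor = map { $data{$name}{$_} // '' } @codes;

# Explicit existence test: same result as // for this data.
my @by_exists = map { exists $data{$name}{$_} ? $data{$name}{$_} : '' } @codes;

print join('^', $name, @by_or), "\n";      # amy^3^^
print join('^', $name, @by_dor), "\n";     # amy^3^0^
print join('^', $name, @by_exists), "\n";  # amy^3^0^
```

None of the three forms creates new keys in %data here, because $data{$name} already exists and the inner lookups are plain reads.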
Hello Ken,
I can't thank you enough for the time and effort spent on such a detailed and fantastic analysis. This is your second answer to my two posts here on perl monks and I can say for sure that I have learnt so much more from your answers than I could ever have by reading a book.
My use of a hash reference instead of a hash was basically to get the hang of hash reference iteration. I have been reading about references lately and thought it would be a good idea to practice here.
Your following points were extremely valuable and helpful:
- #!/usr/bin/env perl -l for new lines
- Using my ($name, $code, $count) = split /\Q$sep/;
instead of array and hard coding separator
- my @codes = sort keys %codes_found;
Instead of looping through the hash
- Using for instead of foreach
- print join $sep => $name, map { $data{$name}{$_} || '' } @codes;
This is exactly what I was looking for. Idiomatic and very readable.
One follow-up question: I have never seen the fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?
Also, in the line
print join $sep => $name, map { $data{$name}{$_} || '' } @codes;
Why does auto-vivification not occur here? What makes perl decide to go the OR (||) route to map null string to join function?
Thank you again. I know I have a long way to go in learning perl and will look forward to your guidance here on perl monks.
Regards,
Jaypal
... I have never seen the fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?
In this case, use of the fat comma is a personal notational convention. $sep is already 'quoted'; that is to say, it is already a string, and there's nothing you could do to make it stringier. If it had been a bareword instead (and assuming strictures and warnings enabled), use of a fat comma would have caused the bareword to be treated as a string.
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = join foo => 1, 2, 3, 4;
print qq{'$s'};
"
'1foo2foo3foo4'
In fact, any legal expression will be 'stringized' for use by join:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = join 8, qw(foo bar baz);
print qq{'$s'};
"
'foo8bar8baz'
"Hello Ken, I can't thank you enough for the time and effort spent on such a detailed and fantastic analysis. This is your second answer to my two posts here on perl monks and I can say for sure that I have learnt so much more from your answers than I could ever have by reading a book."
You've shown a genuine desire to learn and have put thought into your questions: I'm more than happy to help.
As I often say, "A better question gets better answers."
"One follow-up question: I have never seen the fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?"
As ++AnomalousMonk correctly identified, and went on to answer in more detail, this is a personal preference.
There are quite a few functions whose first argument describes how it should operate, with the remaining arguments indicating what it should operate on.
(Some examples, just off the top of my head and in no particular order, would include: join, sprintf and pack.)
In these cases, I often use a fat comma to separate the "how-to-operate" and the "what-to-operate-on" arguments.
I find it can improve readability in many cases but, where it doesn't, I don't use it.
It's just a personal style preference: use whatever you feel most comfortable with.
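To illustrate that convention, here is a small sketch with the three functions mentioned above. The format strings and values are made up for the example; the only point is that => behaves as a comma here, separating the "how" from the "what".

```perl
use strict;
use warnings;

# The first argument says how to operate; the rest is what to operate on.
my $line   = join '^' => 'Name', 'A', 'B';
my $padded = sprintf '%-5s|%03d' => 'abc', 7;
my $bytes  = pack 'A3 n' => 'xyz', 258;

# A plain comma gives exactly the same result.
my $same = join '^', 'Name', 'A', 'B';

print "$line\n";                                   # Name^A^B
print "$padded\n";                                 # abc  |007
print length($bytes), "\n";                        # 5
print $line eq $same ? "same\n" : "different\n";   # same
```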
"Why does auto-vivification not occur here?"
Autovivification has changed as Perl has matured: it used to happen a lot more than it does in more recent versions.
(The changes mostly seem to be for optimisation purposes but that's purely a guess on my part.)
As I've been using Perl since Perl3, and don't always recall every change that has been made, I do find that I sometimes need to check whether autovivification is occurring in a particular context: the script I provided was a case in point and a simple "use Data::Dump; dd \%data;" at the end of the code confirmed what I thought was happening.
$name is known to be a key of %data because it was returned by keys %data;
therefore, $data{$name} doesn't need to be autovivified to check for $data{$name}{$_}.
If there's no $data{$name}{$_} key, it doesn't need to be autovivified because, in the boolean context, it's known to be FALSE and that's all that needs to be known; it's not, for instance, being used as an lvalue or passed as an argument.
If $name had come from another source and $data{$name} didn't exist, it would need to be autovivified; however, $data{$name}{$_} still wouldn't need to be autovivified for the reasons just given.
Here's a quick test to clarify those points:
$ perl -Mstrict -Mwarnings -MData::Dump -le '
    my %x = (a => { B => 2 });
    for my $key (qw{a b}) {
        print join "^" => map { $x{$key}{$_} || "undef" } qw{B C};
        dd \%x;
    }
'
2^undef
{ a => { B => 2 } }
undef^undef
{ a => { B => 2 }, b => {} }
$x{b} was autovivified where it was needed.
None of $x{a}{C}, $x{b}{B} or $x{b}{C} were autovivified because they weren't needed beyond the boolean context where just knowing that they didn't exist was sufficient.
If I'm ever not 100% sure, I find a quick check is faster than trying to analyse all the code involved and can avoid potential late night debugging sessions.
You'll find some more information on this in perldata and perlref.
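If creating $x{b} as a side effect is unwanted, one common approach is to test the outer key with exists before looking inside it. A minimal sketch reusing the %x from the test above; the keys b and c are only illustrative:

```perl
use strict;
use warnings;

my %x = (a => { B => 2 });

# Unguarded read: looking up $x{b}{B} creates $x{b} as an empty hash.
my $unguarded = $x{b}{B} || 'undef';
my $created   = exists $x{b} ? 1 : 0;

# Guarded read: check the outer key first, so nothing is created.
my $guarded   = (exists $x{c} && $x{c}{B}) || 'undef';
my $untouched = exists $x{c} ? 0 : 1;

print "unguarded: $unguarded, \$x{b} created: $created\n";
print "guarded: $guarded, \$x{c} left alone: $untouched\n";
```

The guard costs one extra lookup per read, which is usually a fair price when the hash is later dumped, counted or iterated and stray empty entries would matter.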
"What makes perl decide to go the OR (||) route to map null string to join function?"
There are several things that are deemed to be FALSE in a boolean context.
An undefined value is one of these.
See "perlsyn: Truth and Falsehood" for more details.
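As a quick illustration of that section of perlsyn, the sketch below tests a handful of values in boolean context. The labels are just for display; the values were chosen to show that the strings '00' and '0.0' are true even though they are numerically zero.

```perl
use strict;
use warnings;

# Each pair is a display label and the value to test in boolean context.
my @cases = (
    [ 'undef',        undef ],
    [ 'number 0',     0     ],
    [ q{string '0'},  '0'   ],
    [ 'empty string', ''    ],
    [ q{string '00'}, '00'  ],
    [ q{string '0.0'}, '0.0' ],
    [ 'a space',      ' '   ],
);

my %is_true;
for my $case (@cases) {
    my ($label, $value) = @$case;
    $is_true{$label} = $value ? 1 : 0;
    printf "%-14s is %s\n", $label, ($is_true{$label} ? 'TRUE' : 'FALSE');
}
```

The first four values come out FALSE and the last three TRUE, which is why || '' replaces a missing (undefined) count and would also replace a count of 0.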
Re: Print something when key does not exist
by choroba (Cardinal) on Apr 05, 2014 at 17:34 UTC
Here is how I would change your code:
You can profit from the // operator to avoid undefined warnings, but only if your Perl is 5.10+. Also, I extracted the codes from the %map hash (no need to create a hash reference) with a "slice". But all these are just minor things. Moreover, you correctly identified the places where the code didn't flow smoothly.
Thank you so much Choroba. Your module looks interesting. I have been following your answers on StackOverflow and it's not just the Perl solutions that have been helpful. I have learnt a lot from your awk, sed and bash answers as well.
Appreciate it.
Regards,
Jaypal
http://stackoverflow.com/users/970195/js
Re: Print something when key does not exist
by NetWallah (Canon) on Apr 05, 2014 at 20:04 UTC
(Longish) one-liner for this: (Line broken for readability ...)
perl -F/\\^/ -lanE 'next if $.==1 or !@F;
$h{$F[0]}{$F[1]}=$F[2];
$v{$F[1]}++ }
{say join ("^", "Name",@w=sort keys %v);
say join ("^",$_,@{$h{$_}}{@w}) for sort keys %h' data1.txt
Update: FWIW, kcott's(++) detailed explanation below follows exactly the logic in my one-liner, that I was too lazy to document.
What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
-Larry Wall, 1992
Thanks so much for this insane one-liner. I have learnt so much about awk by writing and reading awk one liners. This will help a great deal as well.
Two questions:
What does the following mean? Is that for handling blank lines?
or !@F;
How does the index get incremented for the array in the following, so that every element is covered?
say join ( "^", $_, @{ $h{$_} } {@w} ) for sort keys %h'
++ for asking good questions.
Yes the !@F handles blank lines. When split, these will leave the @F array unpopulated, so scalar(@F) will be 0, or false, and the line gets skipped.
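A small sketch of why a blank line leaves @F empty. It assumes the line has already been chomped, which is what -l does when combined with -n; the fields() helper is hypothetical and only mimics the split that -a performs with the -F pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly what -a does with -F/\^/ for one line.
sub fields {
    my ($line) = @_;
    return split /\^/, $line;
}

my @blank_chomped = fields('');         # blank line after chomp
my @blank_raw     = fields("\n");       # blank line that was not chomped
my @normal        = fields('amy^A^3');  # made-up data line

print scalar(@blank_chomped), "\n";     # 0, so !@F is true and the line is skipped
print scalar(@blank_raw), "\n";         # 1, the lone "\n" counts as a field
print scalar(@normal), "\n";            # 3
```

So the !@F test relies on the -l switch: without the automatic chomp, a blank line would still produce one field.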
The %h is a Hash of Hashes. The outer "for sort keys %h" handles the first (outer) hash.
Inside that, I use a "hash slice" to get an array of values for keys specified by @w (which is a sorted array).
This is pretty much the same thing kcott does in his line:
map { $data{$name}{$_} || '' } @codes;
except - my code does not worry about non-existing keys - since I run without 'warnings', my code does the right thing without complaint.
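For readers new to the slice syntax, here is a spelled-out sketch of the same idea with warnings enabled. The contents of %h are hypothetical; the names %h and @w follow the one-liner. There is no explicit index: the slice returns one value per key in @w, in that order.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as %h in the one-liner.
my %h = (
    amy => { A => 3, C => 5 },
    bob => { B => 1 },
);
my @w = qw(A B C);

my @rows;
for my $name (sort keys %h) {
    # Hash slice on the inner hash: one value per key in @w,
    # undef where the key is missing.
    my @values = @{ $h{$name} }{@w};

    # Replace undef with '' so join does not warn under 'use warnings'.
    push @rows, join '^', $name, map { defined $_ ? $_ : '' } @values;
}

print "$_\n" for @rows;   # amy^3^^5 then bob^^1^
```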