in reply to Print something when key does not exist

G'day jaypal,

I'll first show how I might have coded this. Then I'll walk through it, line by line, and explain the differences between our versions (answering your questions along the way).

Here's my script. The __DATA__ and output are the same as yours.

#!/usr/bin/env perl -l

use strict;
use warnings;

my (%data, %codes_found);
my $sep = '^';

while (<DATA>) {
    next if $. == 1;
    chomp;
    my ($name, $code, $count) = split /\Q$sep/;
    ++$codes_found{$code};
    $data{$name}{$code} = $count;
}

my @codes = sort keys %codes_found;

print join $sep => 'Name', @codes;

for my $name (sort keys %data) {
    print join $sep => $name, map { $data{$name}{$_} || '' } @codes;
}

Here's the walkthrough.

#!/usr/bin/env perl -l

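In your shebang line, use whatever identifies the perl you want. (One caveat: on Linux, everything after the interpreter path is passed as a single argument, so env would be asked to run a program called "perl -l"; naming perl directly, e.g. #!/usr/bin/perl -l, avoids that.) The part I wanted to highlight was the -l switch: this saves you having to add "\n" to all your print statements. It's not always what you want but is very often useful. See "perlrun: Command Switches".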
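For print, the effect of -l is roughly that of setting the output record separator $\ to a newline. Here's a minimal sketch of that; the sub name print_like_dash_l is made up for illustration, and it writes to an in-memory filehandle only so the result can be inspected.

```perl
use strict;
use warnings;

# Hypothetical helper: print each item the way a script run with -l would.
sub print_like_dash_l {
    my @lines = @_;
    local $\ = "\n";                       # appended to every print, as -l arranges
    my $out = '';
    open my $fh, '>', \$out or die "Can't open in-memory handle: $!";
    print {$fh} $_ for @lines;             # no explicit "\n" needed
    close $fh;
    return $out;
}

print print_like_dash_l('Name^A^B', 'Jim^1^2');
```

Note that -l also chomps input when combined with -n or -p, which this sketch doesn't attempt to show.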

use strict;
use warnings;

Same as your script. (Except in extremely rare cases, e.g. to demonstrate the effects of their absence, I use these in all my scripts.)

my (%data, %codes_found);
my $sep = '^';

Not too dissimilar to what you have. I didn't need an @flds (see below). You can declare multiple my variables in a list. My %data is used for the same function as your $map. Dereferencing has some minimal overhead and requires a small amount of extra code (i.e. '->'): unless $hashref is really what you want, I find %hash is generally a better choice.

while (<DATA>) {

Unless you really need a $line variable, I usually find the default $_ suffices.

next if $. == 1;
chomp;

If you intend skipping entire lines without performing any processing on them, do the next before any processing, including chomp.

my ($name, $code, $count) = split /\Q$sep/;

If you're going to split your lines in a while loop such as this, a fresh my @fields variable (on each iteration) is a better choice. While it makes little difference in a small script like this, getting into the habit of making your variables available in the smallest scope possible means you'll avoid the problems associated with global variables: if the script was changed, additional logic complexity could result in a hard-to-track-down bug where you were perhaps operating on values from a previous iteration.

As we're returning just three, well-defined values, my ($name, $code, $count) = ... ticks the limited scope box and makes the following lines more readable and maintainable.

Also note, I'm using the $sep variable already defined rather than hard-coding a value here: if the separator changed or you wanted to abstract this for multiple data sources with different separators, you only need to change one value. In a more complex situation, you may need a $in_sep and a $out_sep; however, that still equates to changing values in one place rather than having to search your entire script for hard-coded values and make multiple changes. See quotemeta if you're unfamiliar with \Q.
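With this particular separator the \Q isn't optional: an unquoted '^' in a regex is the start-of-line anchor. A minimal sketch of the difference follows; the helper name split_fields and the sample line are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on a literal separator string.
sub split_fields {
    my ($line, $sep) = @_;
    # \Q quotes regex metacharacters, so '^', '|', '.' etc. match literally.
    return split /\Q$sep/, $line;
}

my $sep  = '^';
my $line = 'Jim^A^1';                      # invented sample record

my @quoted   = split_fields($line, $sep);
my @unquoted = split /$sep/, $line;        # '^' acts as an anchor here

print scalar(@quoted),   " fields with \\Q\n";      # 3 fields with \Q
print scalar(@unquoted), " field without \\Q\n";    # 1 field without \Q
```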

++$codes_found{$code};

You asked about this. It's a standard and well-known idiom: its use is quite appropriate here.

In a simple statement such as this, the use of the prefix or postfix forms doesn't make any difference. In a more complex statement, they could easily make a difference to the logic, so make sure you understand both forms of autoincrement and autodecrement. See "perlop: Auto-increment and Auto-decrement".
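A minimal sketch of where the two forms differ and where they don't. The sub names and values are invented for illustration.

```perl
use strict;
use warnings;

# Inside a larger expression, the two forms yield different values.
sub increment_forms {
    my $i = 5;
    my $post = $i++;    # $post receives the old value, 5; $i is now 6
    my $pre  = ++$i;    # $i becomes 7 first; $pre receives 7
    return ($post, $pre, $i);
}

# As a statement on its own, either form counts occurrences equally well.
sub count_codes {
    my %codes_found;
    ++$codes_found{$_} for @_;
    return \%codes_found;
}

print join(' ', increment_forms()), "\n";    # 5 7 7
print count_codes(qw{A B A})->{A}, "\n";     # 2
```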

Also consider the readability (of the keys) of $codes_found{$code} vs. $codes{$flds[1]}. Furthermore, if the input data format changed, the former (with $code) would probably still work as written, while the latter (with $flds[1]) may well need modification.

$data{$name}{$code} = $count;

While this does the same as your $map->{$flds[0]}->{$flds[1]} = $flds[2], consider the same readability and maintainability points that I raised above.

I don't know the purpose of the or next on that line of your script. In this particular instance, it's a no-op (i.e. $flds[2] is always TRUE, so next is never called) so you were lucky; in another instance, that no-op could well become a bug!
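To illustrate how that could bite: an assignment yields the value assigned, so "or next" fires whenever that value is false. This is a sketch with invented records and a hypothetical sub name, not your original data.

```perl
use strict;
use warnings;

# Hypothetical illustration of the "assignment or next" pattern.
sub store_counts {
    my @records = @_;
    my (%map, @processed);
    for my $rec (@records) {
        my ($name, $count) = @$rec;
        # The value is stored either way, but a false $count (0, '', undef)
        # skips the rest of the loop body.
        $map{$name} = $count or next;
        push @processed, $name;
    }
    return (\%map, \@processed);
}

my ($map, $processed) = store_counts([x => 3], [y => 0], [z => 7]);
print "stored: @{[ sort keys %$map ]}\n";    # stored: x y z
print "processed: @$processed\n";            # processed: x z
```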

By the way, after the first $hashref->{$key} or $arrayref->[$index], you don't need to keep repeating the '->' to drill down into a complex data structure. For instance, $hashref->{$key_outer}{$key_inner} would be fine; this works for any complexity, e.g. $arrayref->[$i]{$key}[$j] is also fine.
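A quick sketch showing both spellings reach the same element. The structures and the sub name are invented for illustration.

```perl
use strict;
use warnings;

# Invented structures, purely for illustration.
sub nested_lookups {
    my $hashref  = { outer => { inner => 42 } };
    my $arrayref = [ { key => [ 'zero', 'one' ] } ];

    return (
        $hashref->{outer}->{inner},      # every arrow written out
        $hashref->{outer}{inner},        # arrow between subscripts omitted
        $arrayref->[0]->{key}->[1],
        $arrayref->[0]{key}[1],
    );
}

print join(' ', nested_lookups()), "\n";    # 42 42 one one
```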

my @codes = sort keys %codes_found;

You need the codes in that order twice; just generate the list once.

print join $sep => 'Name', @codes;

A join is a simpler solution than a for loop to print that one line.

for my $name (sort keys %data) {

No difference in logic to your script. $name is more meaningful and exactly mirrors its use earlier. Compare that single name to your $flds[0] and $k1 and consider the same readability and maintenance issues already raised.

By the way, for and foreach are synonymous. I go with the laziness virtue on this one and save myself four keystrokes each time I want a foreach loop.

print join $sep => $name, map { $data{$name}{$_} || '' } @codes;

You asked a couple of questions about this part.

As we already have @codes, an additional for is not needed and only a single print statement is required. While there are other ways to do this, that probably answers your "How can I write this idiomatically?" question.
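Pulled out on its own, the map-inside-join step looks like this. It's a sketch only: build_row is a hypothetical name and the sample record is invented.

```perl
use strict;
use warnings;

# Hypothetical helper: one output row, with '' for any code the name lacks.
sub build_row {
    my ($sep, $name, $counts, @codes) = @_;
    return join $sep => $name, map { $counts->{$_} || '' } @codes;
}

# 'Jim' has no count for code B, so that column comes out empty.
print build_row('^', 'Jim', { A => 1, C => 3 }, qw{A B C}), "\n";    # Jim^1^^3
```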

You're quite correct about autovivification. The exists function is often a good way to avoid this. In this case, $data{$name}{$_} || '' causes no autovivification.
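Where the outer key isn't guaranteed to exist, an exists check on that level keeps the structure untouched. A minimal sketch, with invented data and hypothetical sub names:

```perl
use strict;
use warnings;

sub lookup_unguarded {
    my ($data, $name, $code) = @_;
    # Reading the nested key creates $data->{$name} if it was missing.
    return $data->{$name}{$code} || '';
}

sub lookup_guarded {
    my ($data, $name, $code) = @_;
    # Check the outer level first; nothing is created for unknown names.
    return '' unless exists $data->{$name};
    return $data->{$name}{$code} || '';
}

my %data = (Jim => { A => 1 });    # invented data: 'Bob' is not a key

lookup_guarded(\%data, 'Bob', 'A');
print exists $data{Bob} ? "Bob created\n" : "Bob absent\n";    # Bob absent

lookup_unguarded(\%data, 'Bob', 'A');
print exists $data{Bob} ? "Bob created\n" : "Bob absent\n";    # Bob created
```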

-- Ken

Re^2: Print something when key does not exist
by jaypal (Beadle) on Apr 06, 2014 at 00:40 UTC

    Hello Ken, I can't thank you enough for the time and effort spent on such a detailed and fantastic analysis. This is your second answer to my two posts here on perl monks and I can say for sure that I have learnt more from your answers than I could ever have by reading a book.

    My use of a hash reference instead of a hash was basically to get the hang of hash reference iteration. I have been reading about references lately and thought it would be a good idea to practice here.

    Your following points were extremely valuable and helpful:

    - #!/usr/bin/env perl -l for new lines
    - Using my ($name, $code, $count) = split /\Q$sep/; instead of array and hard coding separator
    - my @codes = sort keys %codes_found; Instead of looping through the hash
    - Using for instead of foreach
    - print join $sep => $name, map { $data{$name}{$_} || '' } @codes; This is exactly what I was looking for. Idiomatic and very readable.

    One follow-up question: I have never seen a fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?

    Also, in the line

    print join $sep => $name, map { $data{$name}{$_} || '' } @codes;

    Why does auto-vivification not occur here? What makes perl decide to go the OR (||) route to map a null string to the join function?

    Thank you again. I know I have a long way to go in learning perl and will look forward to your guidance here on perl monks.

    Regards, Jaypal

      ... I have never seen a fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?

      In this case, use of the fat comma is a personal notational convention.  $sep is already 'quoted'; that is to say, it is already a string, and there's nothing you could do to make it stringier. If it had been a bareword instead (and assuming strictures and warnings enabled), use of a fat comma would have caused the bareword to be treated as a string.

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $s = join foo => 1, 2, 3, 4;
       print qq{'$s'};
      "
      '1foo2foo3foo4'

      In fact, any legal expression will be 'stringized' for use by join:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $s = join 8, qw(foo bar baz);
       print qq{'$s'};
      "
      'foo8bar8baz'

      "Hello Ken, I can't thank you enough for the time and effort spent on such a detailed and fantastic analysis. This is your second answer to my two posts here on perl monks and I can say for sure that I have learnt more from your answers than I could ever have by reading a book."

      You've shown a genuine desire to learn and have put thought into your questions: I'm more than happy to help. As I often say, "A better question gets better answers."

      "One follow-up question: I have never seen a fat comma (=>) used in join. Is this because we wanted the $sep to be quoted and the fat comma will do that for us?"

      As ++AnomalousMonk correctly identified, and went on to answer in more detail, this is a personal preference.

      There are quite a few functions whose first argument describes how they should operate, with the remaining arguments indicating what they should operate on. (Some examples, just off the top of my head and in no particular order, would include: join, sprintf and pack.) In these cases, I often use a fat comma to separate the "how-to-operate" and the "what-to-operate-on" arguments. I find it can improve readability in many cases but, where it doesn't, I don't use it. It's just a personal style preference: use whatever you feel most comfortable with.

      "Why does auto-vivification not occur here?"

      Autovivification has changed as Perl has matured: it used to happen a lot more than it does in more recent versions. (The changes mostly seem to be for optimisation purposes but that's purely a guess on my part.) As I've been using Perl since Perl3, and don't always recall every change that has been made, I do find that I sometimes need to check whether autovivification is occurring in a particular context: the script I provided was a case in point and a simple "use Data::Dump; dd \%data;" at the end of the code confirmed what I thought was happening.

      $name is known to be a key of %data because it was returned by keys %data; therefore, $data{$name} doesn't need to be autovivified to check for $data{$name}{$_}. If there's no $data{$name}{$_} key, it doesn't need to be autovivified because, in the boolean context, it's known to be FALSE and that's all that needs to be known; it's not, for instance, being used as an lvalue or passed as an argument.

      If $name had come from another source and $data{$name} didn't exist, it would need to be autovivified; however, $data{$name}{$_} still wouldn't need to be autovivified for the reasons just given.

      Here's a quick test to clarify those points:

      $ perl -Mstrict -Mwarnings -MData::Dump -le '
          my %x = (a => { B => 2 });
          for my $key (qw{a b}) {
              print join "^" => map { $x{$key}{$_} || "undef" } qw{B C};
              dd \%x;
          }
      '
      2^undef
      { a => { B => 2 } }
      undef^undef
      { a => { B => 2 }, b => {} }

      $x{b} was autovivified where it was needed. None of $x{a}{C}, $x{b}{B} or $x{b}{C} were autovivified because they weren't needed beyond the boolean context where just knowing that they didn't exist was sufficient.

      If I'm ever not 100% sure, I find a quick check is faster than trying to analyse all the code involved and can avoid potential late night debugging sessions.

      You'll find some more information on this in perldata and perlref.

      "What makes perl decide to go the OR (||) route to map null string to join function?"

      There are several things that are deemed to be FALSE in a boolean context. An undefined value is one of these. See "perlsyn: Truth and Falsehood" for more details.
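      One consequence worth knowing: 0, '0' and the empty string are false as well, so || '' would also blank out a genuine count of 0. That's not a problem if the counts are never zero; where they might be, the defined-or operator // (Perl 5.10 and later) falls back only on undef. A minimal sketch with hypothetical sub names:

```perl
use strict;
use warnings;

sub or_default      { my ($v) = @_; return $v || ''; }    # falls back on any false value
sub defined_default { my ($v) = @_; return $v // ''; }    # falls back only on undef

for my $v (undef, 0, '0', '', 'x') {
    my $shown = defined $v ? "'$v'" : 'undef';
    printf "%-6s  || gives '%s'   // gives '%s'\n",
        $shown, or_default($v), defined_default($v);
}
```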

      -- Ken

        ... I often use a fat comma to separate the "how-to-operate" and the "what-to-operate-on" arguments.

        My favorite personal "off-label" usage of  => is in an OO-code statement like
            return bless $objectref => $class;
        for what I imagine to be its self-documenting qualities: "bless object reference into class."

        Thanks Ken for the kind words. That example you provided for auto-vivification was really helpful. Have a great rest of the weekend. :)