ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

Hi again, all... and apologies straight-up for the weirdo title(!)

I'm now working on a project where I want to do some two-way checking for 'missing values' and I think the start of the process involves building some hashes.

I think the code below is a start... and is easier to deal with when compared to trying to see if an element I want to add to an anonymous hash already exists in that anonymous hash, even though there's a couple of lots of processing involved.

The code works Ok... but there are a couple of things still to be worked out:-

So... to the code:-

use Data::Dumper;

%data_hash = ();
%output_hash = ();

while( <DATA> ) {    # Build list of unique (site:dsk) items
    ($site, $buf) = split(/,/, $_);
    @input_item = split(/:/, $buf);

    foreach $input_field (@input_item) {    # EX: "VAR8=36!206!207!"
        @dsk_list = ($input_field =~ /([0-9]+)!([0-9]+)!$/);    # Get last 2 of 3 items

        foreach $dsk (@dsk_list) {        # Each dsk item in the input...
            next if ($dsk == 0);          # Skip '0' dsk items
            $key = $site . ":" . $dsk;    # Build composite key
            $data_hash{$key}++;           # ...and save it
        }
    }
}

foreach $key ( sort keys %data_hash ) {    # Build list of dsk -> (multi sites)
    ($site, $dsk) = split(/:/, $key);
    push( @{$output_hash{$dsk} }, $site );
}

foreach $dsk (sort {$a <=> $b} keys %output_hash) {    # Show list of sites for each dsk
    printf("Dsk: %d:\n", $dsk);
    foreach $site (sort {$a <=> $b} @{$output_hash{$dsk}}) {
        printf(" %d\n", $site);
    }
    printf("\n");
}

__DATA__
1108,VAR6=36!204!205!:VAR8=36!206!207!:VAR13=36!70!0!:VAR14=36!70!71!:VAR15=36!71!0!
377,VAR12=36!97!96!
512,VAR6=36!90!91!:VAR8=36!92!93!:VAR11=36!0!70!:VAR12=36!189!190!
587,VAR2=36!550!0!:VAR4=36!554!0!:VAR6=36!551!0!

...and some example output:-

Dsk: 70:
 512
 1108

Dsk: 71:
 1108

Dsk: 90:
 512

Dsk: 91:
 512

Dsk: 92:
 512

Dsk: 93:
 512

Dsk: 96:
 377

Dsk: 97:
 377

Dsk: 189:
 512

Dsk: 190:
 512

Dsk: 204:
 1108

Dsk: 205:
 1108

Dsk: 206:
 1108

Dsk: 207:
 1108

Dsk: 550:
 587

Dsk: 551:
 587

Dsk: 554:
 587

Ultimately, I expect to use defined() to see if an element exists or not, which will let me display the 'missing items' I mentioned at the start... or I could use some sort of 'union/intersection' construct on the arrays of keys...

Would appreciate any clues on how to approach this...

Thanks...

Replies are listed 'Best First'.
Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by Athanasius (Archbishop) on Jul 26, 2017 at 08:38 UTC

    Hello ozboomer,

    1. The code allows me to see the sites used within each "dsk" item... but I also want to see the "dsk" items used at each site. Can I do that with a single hash... or (as I expect) I'll need to maintain at least a couple of hashes?

    Yes, unless you change to a different approach (e.g. a database), you’ll need another hash for this. But building it is easy: just add another line to your second foreach loop:

    ...
    my %site_2_dsk;

    foreach my $key ( sort keys %data_hash ) {
        my ($site, $dsk) = split /:/, $key;
        push @{ $output_hash{$dsk} }, $site;
        push @{ $site_2_dsk{$site} }, $dsk;
    }
    ...

    BTW, note the use of my above. Why aren’t you useing strict (and warnings)??

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by haukex (Archbishop) on Jul 26, 2017 at 08:35 UTC

    Personally I would probably build a hash of hashes, plus its inverse (no problem if the input data isn't too big):

    use warnings;
    use strict;

    my (%sites,%dsks);
    while (<DATA>) {
        my ($site, $buf) = split /,/;
        for (split /:/, $buf) {
            for my $dsk (grep {$_!=0} /([0-9]+)!([0-9]+)!$/) {
                $sites{$site}{$dsk}++;
                $dsks{$dsk}{$site}++;
            }
        }
    }
    for my $dsk (sort {$a<=>$b} keys %dsks) {
        print "Dsk $dsk: ",
            join(", ", sort {$a<=>$b} keys %{ $dsks{$dsk} } ), "\n";
    }
    for my $site (sort {$a<=>$b} keys %sites) {
        print "Site $site: ",
            join(", ", sort {$a<=>$b} keys %{ $sites{$site} } ), "\n";
    }

    As for your question about missing values, that's definitely a case of TIMTOWTDI. See for example the FAQ How do I compute the difference of two arrays? How do I compute the intersection of two arrays? You can also just iterate over the list of expected keys and check their existence in the target hash via exists, but that's the brute force method. There's also a trick I sometimes like to use that involves deleteing a hash slice (again, only if the input data isn't too big, because it's not the most efficient method), here I'll demonstrate by listing the "dsk"s (disks?) that are missing from each site:

    my %alldsks = map {$_=>1} keys %dsks;
    for my $site (sort {$a<=>$b} keys %sites) {
        my @sitedsks = keys %{ $sites{$site} };
        my %missingdsks = %alldsks;
        delete @missingdsks{@sitedsks};
        print "Site $site MISSING: ",
            join(", ", sort {$a<=>$b} keys %missingdsks), "\n";
    }
    __END__
    Site 377 MISSING: 70, 71, 90, 91, 92, 93, 189, 190, 204, 205, 206, 207, 550, 551, 554
    Site 512 MISSING: 71, 96, 97, 204, 205, 206, 207, 550, 551, 554
    Site 587 MISSING: 70, 71, 90, 91, 92, 93, 96, 97, 189, 190, 204, 205, 206, 207
    Site 1108 MISSING: 90, 91, 92, 93, 96, 97, 189, 190, 550, 551, 554

    I'd also recommend you play it safer and Use strict and warnings.
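
For comparison, the brute-force check via exists mentioned above might look something like the sketch below. The helper name missing_from and the small master list are made up for illustration; only the site 377 values (96 and 97) come from the thread's sample data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the items of @$expected
# that do not appear as keys in %$seen, preserving the order of @$expected.
sub missing_from {
    my ($expected, $seen) = @_;
    return grep { !exists $seen->{$_} } @$expected;
}

# Dsks seen at site 377 in the sample data, plus an invented master list.
my %site_377 = ( 96 => 1, 97 => 1 );
my @master   = ( 70, 96, 97, 98 );

my @missing = missing_from( \@master, \%site_377 );
print "Site 377 MISSING: @missing\n";    # prints "Site 377 MISSING: 70 98"
```

This does one hash lookup per expected item and never copies the hash, so it tends to be the simpler choice when the master list is an array to begin with.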

Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by sn1987a (Curate) on Jul 26, 2017 at 11:21 UTC
    Ultimately, I expect to use defined() to see if an element exists or not

    In addition to the other excellent comments:
    To determine whether a key exists in a hash, use exists. The function defined tests for definedness (i.e. not undef).
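
A small sketch of the difference, assuming a throwaway hash and a hypothetical helper called describe (neither is from the original code):

```perl
use strict;
use warnings;

# A key can exist with an undef value, so exists and defined can disagree.
my %h = ( present => 1, zero => 0, empty => undef );

# Hypothetical helper: report how far a key gets through the
# exists -> defined -> true chain.
sub describe {
    my ($hash, $key) = @_;
    return 'missing'       unless exists  $hash->{$key};
    return 'exists, undef' unless defined $hash->{$key};
    return $hash->{$key} ? 'exists, defined, true' : 'exists, defined, false';
}

print "$_: ", describe( \%h, $_ ), "\n" for qw(present zero empty absent);
# present: exists, defined, true
# zero: exists, defined, false
# empty: exists, undef
# absent: missing
```

For a "missing items" check, the 'empty' case is the one that matters: defined would report that key as absent even though it is in the hash.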

Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by ozboomer (Friar) on Jul 26, 2017 at 12:07 UTC

    Many thanks, everyone, for the useful responses.

    I've had a bit of a go with some of the suggestions... and I have something that does what I need (I think - more testing required, as usual). The updated sample code follows:-

    # Ref: http://www.perlmonks.com/?node_id=1196078

    use Data::Dumper;

    %data_hash = ();
    %output_hash = ();

    @master_dsks = ( 70, 71, 75, 90, 91, 92, 93, 96, 97, 98, 99,
                     190, 204, 205, 550, 551 );
    @master_sites = ( 350, 377, 510, 512, 580, 587, 590,
                      1100, 1105, 1107, 1108 );

    # ----

    printf("All Known Dsks:\n");     # Show ALL the known dsks
    foreach (@master_dsks) {
        printf("%s ", $_);
    }
    printf("\n\n");

    # ----

    printf("All Known Sites:\n");    # Show ALL the known sites
    foreach (@master_sites) {
        printf("%s ", $_);
    }
    printf("\n\n");

    # ----

    while( <DATA> ) {    # Build list of unique (site:dsk) items
        ($site, $buf) = split(/,/, $_);
        @input_item = split(/:/, $buf);

        foreach $input_field (@input_item) {    # EX: "VAR8=36!206!207!"
            @dsk_list = ($input_field =~ /([0-9]+)!([0-9]+)!$/);    # Get last 2 of 3 items

            foreach $dsk (@dsk_list) {        # Each dsk item in the input...
                next if ($dsk == 0);          # Skip '0' dsk items
                $key = $site . ":" . $dsk;    # Build composite key
                $data_hash{$key}++;           # ...and save it
            }
        }
    }

    foreach $key ( sort keys %data_hash ) {    # Build list of dsk -> (multi sites)
        ($site, $dsk) = split(/:/, $key);
        push( @{ $output_hash{$dsk} }, $site );    # ... dsk -> (multi sites)
        push( @{ $site_2_dsk{$site} }, $dsk );     # !!! ADDITION !!! ... site -> (multi dsks)
    }

    # ----

    printf("List of sites for each used dsk:\n");
    foreach $dsk (sort {$a <=> $b} keys %output_hash) {    # Show list of sites for each dsk
        printf("Dsk: %d: ... ", $dsk);
        foreach $site (sort {$a <=> $b} @{$output_hash{$dsk}}) {
            printf(" %d ", $site);
        }
        printf("\n");
    }
    printf("\n");

    printf("List of dsks for each used site:\n");
    foreach $site (sort {$a <=> $b} keys %site_2_dsk) {    # Show list of dsks for each site
        printf("Site: %d: ... ", $site);
        foreach $dsk (sort {$a <=> $b} @{$site_2_dsk{$site}}) {
            printf(" %d ", $dsk);
        }
        printf("\n");
    }
    printf("\n");

    # ----

    my %master_dsks_hash = map { $_ , "" } @master_dsks;    # Hash of ALL dsks
    delete @master_dsks_hash{keys %output_hash};            # Delete the USED dsks
    @unused_dsks = (keys %master_dsks_hash);                # ...leaving the UNUSED dsks

    printf("Dsks that are known but unused:\n");
    foreach (sort {$a<=>$b} @unused_dsks) {
        printf("%s ", $_);
    }
    printf("\n\n");

    # ----

    my %master_sites_hash = map { $_ , "" } @master_sites;    # Hash of ALL sites
    delete @master_sites_hash{keys %site_2_dsk};              # Delete the USED sites
    @unused_sites = (keys %master_sites_hash);                # ...leaving the UNUSED sites

    printf("Sites that are known but unused:\n");
    foreach (sort {$a<=>$b} @unused_sites) {
        printf("%s ", $_);
    }
    printf("\n\n");

    __DATA__
    1108,VAR6=36!204!205!:VAR8=36!206!207!:VAR13=36!70!0!:VAR14=36!70!71!:VAR15=36!71!0!
    377,VAR12=36!97!96!
    512,VAR6=36!90!91!:VAR8=36!92!93!:VAR11=36!0!70!:VAR12=36!189!190!
    587,VAR2=36!550!0!:VAR4=36!554!0!:VAR6=36!551!0!

    ....and the output:-

    All Known Dsks:
    70 71 75 90 91 92 93 96 97 98 99 190 204 205 550 551

    All Known Sites:
    350 377 510 512 580 587 590 1100 1105 1107 1108

    List of sites for each used dsk:
    Dsk: 70: ... 512 1108
    Dsk: 71: ... 1108
    Dsk: 90: ... 512
    Dsk: 91: ... 512
    Dsk: 92: ... 512
    Dsk: 93: ... 512
    Dsk: 96: ... 377
    Dsk: 97: ... 377
    Dsk: 189: ... 512
    Dsk: 190: ... 512
    Dsk: 204: ... 1108
    Dsk: 205: ... 1108
    Dsk: 206: ... 1108
    Dsk: 207: ... 1108
    Dsk: 550: ... 587
    Dsk: 551: ... 587
    Dsk: 554: ... 587

    List of dsks for each used site:
    Site: 377: ... 96 97
    Site: 512: ... 70 90 91 92 93 189 190
    Site: 587: ... 550 551 554
    Site: 1108: ... 70 71 204 205 206 207

    Dsks that are known but unused:
    75 98 99

    Sites that are known but unused:
    350 510 580 590 1100 1105 1107

    BTW.. Not using the 'warnings' and 'strict' pragmas is a fair enough comment.. but this is isolated, sample code... so I'm not too fussed about using them in this context.

    Similarly, as I've been cutting code since the 1970s or something, I tend to pre-declare constants, variables, etc at the top of a block or module and then I know where to find all the initializations and comments about the identifiers I use in the code... instead of trying to find the 'first instance' (the 'my' declaration) of an identifier's use in some part of a mass of code when debugging/trying to understand some code - 'tis just easier for me.

    ..and Re: the issue of 'exists' ... trying to understand the perldoc description gives me too much of a headache:-

      A hash or array element can be true only if it's defined and defined only if it exists, but the reverse doesn't necessarily hold true.

    ...but I take the point.

    Thanks again for the most useful posts.

      > I know where to find all the initializations and comments about the identifiers

      With good variable names, no comments are needed. And there shouldn't be a block larger than one screen, so you don't have to scroll to find the initialization. See Skimmable Code by schwern.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      I agree that your habit of pre-declaring variables largely defeats the advantage of using strict (although it is still useful in detecting misspelled variables). Your suggestion of defining them at "first instance" is not much better. In fact, it can introduce errors (which strict can detect). A better strategy is to declare all variables in the "smallest possible scope". This does require some effort in writing new code and it does little to help your reader find the declarations. The advantage to you is that it greatly reduces the possibility of misusing a variable. The advantage to your reader is that, when he comes to the end of the scope, he knows for certain that he has seen all the references to the variable.
      Bill
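
As a rough illustration of the "smallest possible scope" advice, here is a sketch that parses records in the thread's format. The sub name dsks_by_site is invented for this example and is not part of anyone's posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "site,VARn=a!b!c!:..." records into a
# site -> [dsks] hash, declaring each variable where it is first needed.
sub dsks_by_site {
    my (@records) = @_;
    my %by_site;                                  # needed for the whole sub
    for my $record (@records) {                   # $record: this loop only
        my ($site, $buf) = split /,/, $record;    # fresh copies per iteration
        for my $field (split /:/, $buf) {
            # Last two of the three '!'-separated numbers, skipping zeros
            my @dsks = grep { $_ != 0 } ($field =~ /([0-9]+)!([0-9]+)!$/);
            push @{ $by_site{$site} }, @dsks;
        }
    }
    return \%by_site;
}

my $result = dsks_by_site( '377,VAR12=36!97!96!', '587,VAR2=36!550!0!' );
print "$_: @{ $result->{$_} }\n" for sort { $a <=> $b } keys %$result;
# 377: 97 96
# 587: 550
```

Because $site, $buf and @dsks go out of scope at the end of each iteration, a stale value from a previous record cannot leak into the next one, and strict flags any use outside the block at compile time.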

      trying to understand the perldoc description gives me too much of a headache:-

      A hash or array element can be true only if it's defined and defined only if it exists, but the reverse doesn't necessarily hold true.

      here's a Venn(ish) diagram in beautiful ASCII art that may or may not help, with examples on the side

      universe of possible hash elements in Perl
      +--------------------------------------+
      | elements that exist                  |   $hash{element_exists}; # this example: exists, undefined, false
      |  +--------------------------------+  |
      |  | elements that are defined      |  |   $hash{element_defined} = function_def();# this example: exists, defined, unknown false/true
      |  |  +--------------------------+  |  |
      |  |  | elements that are true   |  |  |   $hash{element_true} = 1; # this example: exists, defined, true
      |  |  +--------------------------+  |  |
      |  |                                |  |
      |  |  +--------------------------+  |  |
      |  |  : elements that are false  :  |  |   $hash{element_false} = function_false();# this example: exists, defined, false
      |  |  : +-------------------+    :  |  |
      |  |  : | false but defined |    :  |  |   $hash{element_false_defined} = 0; # this example: exists, defined, false
      |  |  : +-------------------+    :  |  |
      |  +--:--------------------------:--+  |
      |     :                          :     |
      |  +--:--------------------------:--+  |
      |  |  : +-------------------+    :  |  |
      |  |  : | false but undef   |    :  |  |   $hash{element_false_undefined} = undef; # this example: exists, undefined, false
      |  |  : +-------------------+    :  |  |
      |  |  +--------------------------+  |  |
      |  | elements that are undefined    |  |   $hash{element_undefined}; # this example: exists, undefined, false
      |  +--------------------------------+  |
      +--------------------------------------+

      Note that the ASCII art combined with wanting space for labels sometimes implies there is room in the Perl universe for combinations that aren't actually possible: for example, there are no elements that are undefined but not false, because perl coerces undefined to false.

Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by thanos1983 (Parson) on Jul 26, 2017 at 13:52 UTC

    Hello ozboomer,

    This is not a big improvement, but in case you are interested, you can replace the foreach loops with while loops. See the sample code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    # use Benchmark qw(:all) ; # WindowsOS
    use Benchmark::Forking qw( timethese cmpthese ); # UnixOS

    my @preserved = @ARGV;

    sub while_test {
        my %data_hash = ();
        my %output_hash = ();
        @ARGV = @preserved; # restore original @ARGV
        while (<>) { # Build list of unique (site:dsk) items
            my ($site, $buf) = split(/,/);
            my @input_item = split(/:/, $buf);
            while ( defined ( my $input_field = shift @input_item ) ) { # EX: "VAR8=36!206!207!"
                my @dsk_list = ($input_field =~ /([0-9]+)!([0-9]+)!$/); # Get last 2 of 3 items
                while ( defined ( my $dsk = shift @dsk_list ) ) { # Each dsk item in the input...
                    next if ($dsk == 0); # Skip '0' dsk items
                    my $key = $site . ":" . $dsk; # Build composite key
                    $data_hash{$key}++; # ...and save it
                }
            }
        }

        my @sort_data_keys = sort keys %data_hash;
        while ( defined ( my $key = shift (@sort_data_keys) ) ) { # Build list of dsk -> (multi sites)
            my ($site, $dsk) = split(/:/, $key);
            push( @{$output_hash{$dsk} }, $site );
        }

        my @sort_output_hash = sort {$a <=> $b} keys %output_hash;
        while ( defined ( my $dsk = shift (@sort_output_hash) ) ) { # Show list of sites for each dsk
            # printf("Dsk: %d:\n", $dsk);
            foreach my $site (sort {$a <=> $b} @{$output_hash{$dsk}}) {
                # printf(" %d\n", $site);
            }
            # printf("\n");
        }
    }

    sub foreach_test {
        my %data_hash = ();
        my %output_hash = ();
        @ARGV = @preserved; # restore original @ARGV
        while(<>) { # Build list of unique (site:dsk) items
            my ($site, $buf) = split(/,/);
            my @input_item = split(/:/, $buf);
            foreach my $input_field (@input_item) { # EX: "VAR8=36!206!207!"
                my @dsk_list = ($input_field =~ /([0-9]+)!([0-9]+)!$/); # Get last 2 of 3 items
                foreach my $dsk (@dsk_list) { # Each dsk item in the input...
                    next if ($dsk == 0); # Skip '0' dsk items
                    my $key = $site . ":" . $dsk; # Build composite key
                    $data_hash{$key}++; # ...and save it
                }
            }
        }

        foreach my $key ( sort keys %data_hash ) { # Build list of dsk -> (multi sites)
            my ($site, $dsk) = split(/:/, $key);
            push( @{$output_hash{$dsk} }, $site );
        }

        foreach my $dsk (sort {$a <=> $b} keys %output_hash) { # Show list of sites for each dsk
            # printf("Dsk: %d:\n", $dsk);
            foreach my $site (sort {$a <=> $b} @{$output_hash{$dsk}}) {
                # printf(" %d\n", $site);
            }
            # printf("\n");
        }
    }

    my $results = timethese(1000000, {
        While   => \&while_test,
        ForEach => \&foreach_test,
    }, 'none');

    cmpthese( $results );

    __END__
    $ perl test.pl in.txt
               Rate   While ForEach
    While   14286/s      --    -10%
    ForEach 15898/s     11%      --

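    Keep in mind that shift empties every array used in the while loops, so this approach only suits cases where you do not need those arrays again. Also note that in the benchmark run above the ForEach version came out about 11% faster than the While version, so any boost from the while loops is not guaranteed.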

    Hope this helps, BR.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: How to Check Hashes for Missing Items when Keys can be Values and vice versa
by ozboomer (Friar) on Jul 31, 2017 at 00:52 UTC

    For what it's worth.. and I don't know if it's "too clever for my own good", here's something I've built to assist in the creation of the '2-way' structures.

    ...and all I'll need to do as time goes on is add to the conditional where there's a '@B_list = ' in the code...

    Just for the curious :) ...

    # ----------------------------------------------------------------------------------
    # Build_Joined_Hashes - Build hashes to assist '2-way' queries
    #
    # Description:
    #   Create hashes where B -> (A1, A2, ...) and A -> (B1, B2, ...)
    #   That is:
    #     1. given B, get a list of A's that refer to B
    #     2. given A, get a list of B's that refer to A
    #
    # Uses Globals:
    #
    # Notes:
    #   - VP DSK hash: 2324(LX INT) -> VAR2=36!550!0!:VAR4=36!554!0!:VAR6=36!551!0!
    #   - TC DSK hash: 9705(LX TC) -> 11,13-JAN-2014:13-JAN-2014
    # ----------------------------------------------------------------------------------

    sub Build_Joined_Hashes {
        my ($input_hash_ref, $type, $A_B_hash_ref, $B_A_hash_ref) = @_;

        my ($buf, $A_item, $input_field, $B_item, $key);
        my (@input_item, @B_list);
        my (%tmp_hash);

        %tmp_hash = ();         # hash: A:B -> (count)
        %$A_B_hash_ref = ();    # List of Bs that are used in As
        %$B_A_hash_ref = ();    # List of As that are used in Bs

        foreach $A_item (keys %$input_hash_ref) {      # For each 'A' item...
            $buf = $$input_hash_ref{$A_item};          # ..get 'B usage list' record
            @input_item = split(/:/, $buf);            # Get each 'B usage' item

            foreach $input_field (@input_item) {       # For each 'B usage' item...
                if ($type eq "VAR") {                  # Get list of 'B' instances...
                    @B_list = ($input_field =~         # ... when VAR...
                               /([0-9]+)!([0-9]+)!$/);
                } elsif ($type eq "TC") {
                    @B_list = ($input_field =~         # ... when TC...
                               /^([0-9]+),/);
                }

                foreach $B_item (@B_list) {            # Each 'B' instance...
                    next if ($B_item == 0);            # Skip '0' items
                    $key = $A_item . ":" . $B_item;    # Make composite key: A:B -> (count)
                    $tmp_hash{$key}++;
                }
            }
        }

        foreach $key ( sort keys %tmp_hash ) {         # For every 'B usage in A' instance...
            ($A_item, $B_item) = split(/:/, $key);
            push( @{ $$A_B_hash_ref{$B_item} }, $A_item );    # Build list of B -> (A1, A2, ...)
            push( @{ $$B_A_hash_ref{$A_item} }, $B_item );    # ... and A -> (B1, B2, ...)
        }

        return;
    }   # end Build_Joined_Hashes

    Perhaps it would be a better approach(?) to simply do things using DBD::CSV and treat everything as a database(!) and use SQL or sumfin'...