dirtdog has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm struggling with getting a unique list printed out within an Array of Hashes. I have a small example that shows baseball players that played for a team. I'm trying to get each player to show only once per team as follows:

$VAR1 = {
    'MARINERS' => [ 'GRIFFEY', 'PEREZ' ],
    'REDS'     => [ 'GRIFFEY', 'PEREZ', 'ROSE', 'BENCH' ],
    'PHILLIES' => [ 'ROSE' ]
};
However, what I end up getting is:
$VAR1 = {
    'MARINERS' => [ 'GRIFFEY', 'PEREZ' ],
    'REDS'     => [ 'GRIFFEY', 'GRIFFEY', 'PEREZ', 'ROSE', 'BENCH' ],
    'PHILLIES' => [ 'ROSE', 'ROSE' ]
};

The sample code I have is:

use strict;
use Data::Dumper;    # needed for Dumper() below
use constant DEBUG => 2;

my $team;
my $player;
my %teamAccts;
my %teamAcctsUniq;
my $teamAccts;
my $teamAcctsUniq;

while (<DATA>) {
    chomp;
    if ( /^T:/ ) {
        $team = (split /:/)[1];
    }
    if ( /^P:/ ) {
        $player = (split /:/)[1];
        push( @{ $teamAccts{$team} }, $player );
    }
}

my %seen;
@$teamAcctsUniq = grep { ! $seen{$_->{player}}++ } @$teamAccts;

#print Dumper(\%teamAccts);
print Dumper(\%teamAcctsUniq);

__DATA__
T:REDS
P:GRIFFEY
P:GRIFFEY
P:PEREZ
P:ROSE
P:BENCH
T:PHILLIES
P:ROSE
P:ROSE
T:MARINERS
P:GRIFFEY
P:PEREZ

any help you could provide would be greatly appreciated. thank you.

2019-10-31 Athanasius added code tags around the data.

Replies are listed 'Best First'.
Re: Unique Values within AOH
by hippo (Archbishop) on Oct 30, 2019 at 14:37 UTC

    Instead of de-duping after the fact you could make sure your lists are unique as you build them. Simpler and shorter.

    use strict;
    use Data::Dumper;

    my $team;
    my $player;
    my %teamAccts;
    my %seen;

    while (<DATA>) {
        if (/^T:(\S+)/) {
            $team = $1;
            next;
        }
        if (/^P:(\S+)/) {
            $player = $1;
            push @{ $teamAccts{$team} }, $player unless $seen{"$team$player"}++;
        }
    }

    print Dumper(\%teamAccts);

    __DATA__
    T:REDS
    P:GRIFFEY
    P:GRIFFEY
    P:PEREZ
    P:ROSE
    P:BENCH
    T:PHILLIES
    P:ROSE
    P:ROSE
    T:MARINERS
    P:GRIFFEY
    P:PEREZ
      unless $seen{"$team$player"}++;

      One small nit. Because the %seen uniquification hash is shared across all teams, it's possible to confuse certain team/player records, e.g.:
          Team    Player
          ABC     DEFGH
          ABCD    EFGH
      which both become the key 'ABCDEFGH'.

      This is easily avoided by joining the two strings with some character or character sequence that (you hope!) cannot possibly occur in team or player names:
          unless $seen{"$team\x00$player"}++;
      (Update: Actually, for the given order of concatenation, it's only necessary that the separator character or character sequence cannot appear in the team name.)

      This nit is very unlikely to bite, but may be very difficult to debug (or even see in large data sets) if it does.

      Update: Another, possibly more significant nit. All team and player name data in the OPed example is uppercase. If there may be any mixing of case, then, e.g., 'Rose' will be distinct from 'ROSE' and de-duplication may fail. In this case, or even as a general precaution, team/player names can be common-cased:
          unless $seen{"\U$team\x00$player"}++;
      See Quote and Quote-like Operators in perlop for  \U \L et al.


      Give a man a fish:  <%-{-{-{-<

        This is easily avoided by joining the two strings with some character or character sequence that (you hope!) cannot possibly occur in team or player names

        Or, take the guesswork out of it and add an extra layer of depth to the hash:

        unless $seen{$team}{$player}++;
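        A minimal, self-contained sketch of that two-level %seen approach, using made-up inline data in the OP's T:/P: format (no separator string needed, so no key collisions are possible):

```perl
use strict;
use warnings;

# Hypothetical sample data in the OP's T:/P: line format
my @lines = qw( T:REDS P:GRIFFEY P:GRIFFEY P:ROSE T:PHILLIES P:ROSE );

my ( $team, %teamAccts, %seen );
for (@lines) {
    if (/^T:(\S+)/) { $team = $1; next; }
    if (/^P:(\S+)/) {
        # Two-level %seen hash: one slot per ($team, $player) pair,
        # so 'ROSE' on the REDS does not block 'ROSE' on the PHILLIES
        push @{ $teamAccts{$team} }, $1 unless $seen{$team}{$1}++;
    }
}

print "@{ $teamAccts{REDS} }\n";      # GRIFFEY ROSE
print "@{ $teamAccts{PHILLIES} }\n";  # ROSE
```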
        Another, possibly more significant nit. All team and player name data in the OPed example is uppercase. If there may be any mixing of case ...

        True, but to be fair to the OP it was stated to be just an example. I'm not convinced that dirtdog is actually applying this to teams and players. If he were then he would have bigger problems with real-world data such as the current All Blacks XV which has featured all three Barretts in recent weeks. You can't go de-duplicating three different players who share the same surname.

      thanks hippo!...works perfectly.

Re: Unique Values within AOH
by NetWallah (Canon) on Oct 30, 2019 at 16:46 UTC
    Obligatory one-liner:
    perl -F: -lane "$F[0] eq q|T| and $team=$F[1]; $F[0] eq q|P| and $h{$team}{$F[1]}++}{for my $t(sort keys %h){print $t, qq|\t\[|, join( qq|, |, sort keys %{$h{$t}}),qq|\]|}" Your-Data.txt
    MARINERS        [GRIFFEY, PEREZ]
    PHILLIES        [ROSE]
    REDS    [BENCH, GRIFFEY, PEREZ, ROSE]
    Substitute single quotes for double, if running on Linux.

                    "From there to here, from here to there, funny things are everywhere." -- Dr. Seuss

Re: Unique Values within AOH
by johngg (Canon) on Oct 30, 2019 at 15:50 UTC

    Rather than reading the data line by line you could slurp all of it into a single scalar string, then split into per-team chunks at each point in the string that is followed by a "T" at the start of a line. Then for each team get rid of the [PT]: characters globally and split again on newlines into the team line and player lines. This gets rid of the need to test each line for a "T" looking for a new team. Finally, as hippo suggests, de-dup before building your HoA.

    use 5.026;
    use warnings;
    use Data::Dumper;

    open my $dataFH, q{<}, \ <<__EOD__ or die $!;
    T:REDS
    P:GRIFFEY
    P:GRIFFEY
    P:PEREZ
    P:GRIFFEY
    P:PEREZ
    P:ROSE
    P:BENCH
    T:PHILLIES
    P:ROSE
    P:ROSE
    T:MARINERS
    P:GRIFFEY
    P:PEREZ
    __EOD__

    my $data = do { local $/; <$dataFH>; };
    close $dataFH or die $!;

    my @teams = split m{(?=^T)}m, $data;

    my %teamAccts;
    foreach my $teamData ( @teams )
    {
        $teamData =~ s{[PT]:}{}g;
        my( $teamLine, @playerLines ) = split m{\n}, $teamData;
        $teamAccts{ $teamLine } = [
            do {
                my %seen;
                grep { ! $seen{ $_ } ++ } @playerLines;
            }
        ];
    }

    print Data::Dumper
        ->new( [ \ %teamAccts ], [ qw{ *teamAccts } ] )
        ->Sortkeys( 1 )
        ->Dumpxs();

    The output.

    %teamAccts = (
        'MARINERS' => [ 'GRIFFEY', 'PEREZ' ],
        'PHILLIES' => [ 'ROSE' ],
        'REDS'     => [ 'GRIFFEY', 'PEREZ', 'ROSE', 'BENCH' ]
    );

    I hope this is of interest.

    Update: Clarified wording slightly.

    Cheers,

    JohnGG

      Rather than reading the data line by line you could slurp all of it into a single scalar string then split into per-team chunks ...

      OK for small files, begging for trouble as soon as files grow beyond the amount of free RAM.

      On a 32-bit perl, you simply cannot have a scalar larger than 4 GBytes, because you have no more address lines. The real limit may be much less, depending on operating system and other factors. So you are limited to files smaller than that. Reading line by line allows processing petabytes of data without running out of memory.

      Even a 64-bit perl will be limited to the amount of free RAM and free swap space. Once all RAM and swap is used up and the machine has come to a grinding halt, you are lost. Again, reading line by line allows processing much more data.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        OK for small files, begging for trouble as soon as files grow beyond the amount of free RAM.

        Granted, but for this particular problem I doubt that there are enough teams out there for the data to run up against the physical RAM limit, even on quite elderly systems.

        Cheers,

        JohnGG

Re: Unique Values within AOH
by choroba (Cardinal) on Oct 30, 2019 at 18:22 UTC
    Use uniq from List::Util:
    use List::Util qw{ uniq };
    %teamAcctsUniq = map { $_ => [ uniq( @{ $teamAccts{$_} } ) ] } keys %teamAccts;

    Or you can assign the whole hash at once:

    @teamAcctsUniq{ keys %teamAccts } = map [ uniq(@$_) ], values %teamAccts;
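    For reference, a self-contained sketch of the uniq approach with made-up data (note that uniq first shipped in List::Util 1.45, bundled with perl 5.26; it keeps the first occurrence of each element, preserving order):

```perl
use strict;
use warnings;
use List::Util 1.45 qw( uniq );    # uniq() first appeared in List::Util 1.45

# Hypothetical rosters with duplicates, as in the OP's example
my %teamAccts = (
    REDS     => [ qw( GRIFFEY GRIFFEY PEREZ ROSE BENCH ) ],
    PHILLIES => [ qw( ROSE ROSE ) ],
);

# Build a de-duplicated copy, one team at a time
my %teamAcctsUniq = map { $_ => [ uniq @{ $teamAccts{$_} } ] } keys %teamAccts;

print "@{ $teamAcctsUniq{REDS} }\n";      # GRIFFEY PEREZ ROSE BENCH
print "@{ $teamAcctsUniq{PHILLIES} }\n";  # ROSE
```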
    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Unique Values within AOH
by GrandFather (Saint) on Oct 31, 2019 at 02:18 UTC

    Isn't this just a case of the wrong data structure? As shown it is a hash of arrays. If it were a Hash of Hashes the problem goes away.
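    A quick sketch of that HoH shape with hypothetical inline data. One trade-off worth noting: hash keys carry no insertion order, hence the sort at output time:

```perl
use strict;
use warnings;

# With a hash of hashes, duplicate players collapse automatically:
# each player is a key, and keys are unique by construction.
my %roster;
$roster{REDS}{$_}++     for qw( GRIFFEY GRIFFEY PEREZ ROSE );
$roster{PHILLIES}{$_}++ for qw( ROSE ROSE );

# The hash value even records how many times each player appeared
print join( ' ', sort keys %{ $roster{REDS} } ), "\n";   # GRIFFEY PEREZ ROSE
print $roster{PHILLIES}{ROSE}, "\n";                     # 2
```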

    A better question might be, why a HoH or HoA at all. Shouldn't data like this be in a database?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond