Which data structure should I use?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks!
I have a file with the following format:

ID  FIRST_NUMBER   SECOND_NUMBER
1    1               1
2    1               2
3    2               3
4    2               4
5    3               5
6    3               6
7    4               7
8    4               6
9    5               8
10    5               9
11    6              10
12    6               9

and I want to store this data in a data structure that I can easily browse. I cannot go with hashes, because neither ID nor NUMBER are unique, as you see.
By reading various tutorials, I thought that Hash of Hashes would be the thing for me, but can't seem to understand how to use it efficiently.
What I need to do is be able to read the data based on the SECOND_NUMBER. My guess is that ID, which is unique, could be of assistance, but I can't figure out how.
What I have tried is:

if($_=~/^(\d+)\t(.*)/)
{    
    $id=$1;
    $rest=$2;

    @split_rest = split(/\t/, $rest);
    $first_num=$split_rest[0];
    $second_num=$split_rest[1];

    $HoH{$id}{$first_num} = $second_num;
}

The HoH works OK, but how can I search it when the only thing I am given is the FIRST_NUMBER and NOT the ID? For instance, taking line 12 (the last line), my goal is to retrieve 9 (SECOND_NUMBER) when I am given 6 (FIRST_NUMBER), without mistaking it for the 10 on the line above.


Replies are listed 'Best First'.
Re: Which data structure should I use? by ikegami (Patriarch) on Sep 06, 2009 at 23:38 UTC
"What I need to do is be able to read the data based on the SECOND_NUMBER."

Sounds like you want an array of records for each value SECOND_NUMBER can take.

my $header = <>;
chomp($header);
my @field_names = split /\t/, $header;

my %grouped_by_num2;
while (<>) {
    chomp;
    my %rec;
    @rec{ @field_names } = split /\t/;
    push @{ $grouped_by_num2{$rec{SECOND_NUMBER}} }, \%rec;
}

use Data::Dumper;
print(Dumper(\%grouped_by_num2));
Re^2: Which data structure should I use? by Anonymous Monk on Sep 06, 2009 at 23:43 UTC
So, in the code you suggest, ID won't be used at all? I don't need to use ID after all, I just want to store the pairs `SECOND_NUMBER<->FIRST_NUMBER`, so it suits me better... However, how would you use the code you provide in order to get the pair `9<->6` in line 12? If I gave you `$wanted_second_number=9`, how would you print `$first_number=6`?
Re^3: Which data structure should I use? by ikegami (Patriarch) on Sep 07, 2009 at 00:47 UTC
"So, in the code you suggest, ID won't be used at all"

It's there, just not used as the index.

"However, how would you use the code you provide"

The output of Data::Dumper illustrates that quite well.

"If I gave you $wanted_second_number=9, how would you print $first_number=6?"

No really, look at the structure yourself before continuing.

# Each hash value is a reference to an array of records, so dereference it
# (falling back to an empty list when no record has that second number).
my @matching_recs = @{ $grouped_by_num2{$wanted_second_number} || [] };
for my $rec (@matching_recs) {
    my $first_number = $rec->{FIRST_NUMBER};
    print("$first_number\n");   # 5, 6
}
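For readers who want to try the lookup without a data file, here is a minimal self-contained sketch of the same idea: a hash of arrays keyed by SECOND_NUMBER. The helper names `group_by_second` and `first_numbers_for` are made up for this illustration and are not from the thread, and the rows are hard-coded from the question's sample rather than read from a tab-separated file.

```perl
use strict;
use warnings;

# Sample rows copied from the question: ID, FIRST_NUMBER, SECOND_NUMBER.
my @rows = (
    [ 1, 1, 1 ],  [ 2, 1, 2 ],  [ 3, 2, 3 ],   [ 4, 2, 4 ],
    [ 5, 3, 5 ],  [ 6, 3, 6 ],  [ 7, 4, 7 ],   [ 8, 4, 6 ],
    [ 9, 5, 8 ],  [ 10, 5, 9 ], [ 11, 6, 10 ], [ 12, 6, 9 ],
);

# Hypothetical helper: build a hash of arrays keyed by SECOND_NUMBER.
# Each array holds one hash-ref record per matching input row.
sub group_by_second {
    my (@input_rows) = @_;
    my %grouped;
    for my $row (@input_rows) {
        my ($id, $first, $second) = @$row;
        push @{ $grouped{$second} },
            { ID => $id, FIRST_NUMBER => $first, SECOND_NUMBER => $second };
    }
    return \%grouped;
}

# Hypothetical helper: FIRST_NUMBER of every record with the given
# SECOND_NUMBER, in input order; empty list if there is no such record.
sub first_numbers_for {
    my ($grouped, $wanted_second_number) = @_;
    my $recs = $grouped->{$wanted_second_number} || [];
    return map { $_->{FIRST_NUMBER} } @$recs;
}

my $grouped = group_by_second(@rows);
print join(', ', first_numbers_for($grouped, 9)), "\n";   # prints "5, 6"
```

Because 9 appears as SECOND_NUMBER on two lines (IDs 10 and 12), the lookup returns two FIRST_NUMBERs; if only the pair from one specific line is wanted, the ID has to be part of the query.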
Re: Which data structure should I use? by ysth (Canon) on Sep 07, 2009 at 01:53 UTC
What all types of lookups do you want to do? (E.g. find all ids with a given first number; find all first numbers for a given second number, find first and second number for a given id.) That's what needs to drive your choice of data structure, and you haven't said what you want.
Re: Which data structure should I use? by ig (Vicar) on Sep 07, 2009 at 06:16 UTC
Your comments suggest you want to be able to look up by either first or second number. For this, if you want to use hashes, you can build two hashes to support lookup by either number. Alternatively, if you don't have too many records, you could simply populate an array with all the records and select from this array as necessary. Both approaches are demonstrated in the following example:

use strict;
use warnings;
use Data::Dumper;

my %hash;
my @records;

while (<DATA>) {
    chomp;
    my ($id, $first, $second) = split /\s+/;
    my $record = [ $id, $first, $second ];

    # Build hashes indexed by first and second numbers
    push(@{$hash{first}{$first}}, $record);
    push(@{$hash{second}{$second}}, $record);

    # Build array of records
    push(@records, $record);
}

print "\%hash:\n";
print Dumper(\%hash);
print "\n\n";

print "\@records:\n";
print Dumper(\@records);
print "\n\n";

print "records with first number 6, from hash:\n";
foreach my $record (@{$hash{first}{6}}) {
    print "\t", join(',',@$record), "\n";
}

print "records with first number 6, from array:\n";
foreach my $record (grep { $_->[1] == 6 } @records) {
    print "\t", join(',',@$record), "\n";
}

__DATA__
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
7 4 7
8 4 6
9 5 8
10 5 9
11 6 10
12 6 9

The hash will provide faster lookup if you have many records, but the difference will be small if you have only a few records, as in your example.
Re: Which data structure should I use? by ikegami (Patriarch) on Sep 06, 2009 at 23:31 UTC
You said ID isn't unique, and you said ID is unique. Which one is it? It's a crucial detail in determining the answer. (Answered over here as I was typing)
Re^2: Which data structure should I use? by Anonymous Monk on Sep 06, 2009 at 23:39 UTC
Yes, my mistake... The format is as you see it, that is, ID is UNIQUE, while FIRST_NUMBER and SECOND_NUMBER are not unique...
Re: Which data structure should I use? by bluescreen (Friar) on Sep 07, 2009 at 03:18 UTC
If you need to filter by first number, second number, or ID, I'd suggest using

$hoh->{$id} = { first_number => $first_num, second_number => $second_num };

and then filtering with grep:

# Retrieve all IDs where first number is 2
my @ids = grep { $hoh->{$_}->{first_number} == 2 } keys %$hoh;
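A runnable sketch of this ID-keyed layout, using the sample data from the question, might look like the following. The `%pairs` table and the helper name `ids_with_first_number` are assumptions made for the illustration. Note that grep scans every record on each lookup, which is fine for a dozen rows but slower than a prebuilt index on large files.

```perl
use strict;
use warnings;

# ID => [ FIRST_NUMBER, SECOND_NUMBER ], copied from the question's sample.
my %pairs = (
    1 => [ 1, 1 ], 2  => [ 1, 2 ], 3  => [ 2, 3 ],  4  => [ 2, 4 ],
    5 => [ 3, 5 ], 6  => [ 3, 6 ], 7  => [ 4, 7 ],  8  => [ 4, 6 ],
    9 => [ 5, 8 ], 10 => [ 5, 9 ], 11 => [ 6, 10 ], 12 => [ 6, 9 ],
);

# Build the hash of hashes keyed by the unique ID.
my $hoh = {};
for my $id (keys %pairs) {
    my ($first_num, $second_num) = @{ $pairs{$id} };
    $hoh->{$id} = { first_number => $first_num, second_number => $second_num };
}

# Hypothetical helper: IDs whose first_number equals $wanted, in numeric order.
# Meant to be called in list context.
sub ids_with_first_number {
    my ($table, $wanted) = @_;
    my @ids = grep { $table->{$_}{first_number} == $wanted } keys %$table;
    return sort { $a <=> $b } @ids;
}

print join(', ', ids_with_first_number($hoh, 2)), "\n";   # prints "3, 4"
```

The explicit sort is there because `keys` returns hash keys in no particular order.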
Re: Which data structure should I use? by Anonymous Monk on Sep 06, 2009 at 23:28 UTC
Typo: "I cannot go with hashes, because neither ID nor NUMBER are unique, as you see." should be "I cannot go with hashes, because neither FIRST_NUMBER nor SECOND_NUMBER are unique, as you see."