The Art of Hashing

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I have this script and .txt file that I've been messing with. I'm brand new to hashing. I'm trying to make the script search for a partial email address: say I type smith, and it returns both email addresses for smith. I have made it so it can read the phone, email and name if I type Smith,John, but I'm not sure how to make the hash (?) so that I can just type smith and get both emails.

#!/usr/bin/perl -w

open( PH, "customers.txt" ) or die "Cannot open customers.txt: $!\n";
while (<PH>) {
    chomp;
    ( $customer, $number, $email ) = ( split( /\s/, $_ ) ) [0,1,2];
    $Customer{$customer} = $_;
    $Phone{$number} = $_;
    $Email{$email} = $_;
}
close(PH);

print "Type 'q' to exit\n";
while (1) {
    print "\nCustomer? ";
    $customer = <STDIN>;
    chomp ($customer);
    $address = "";
    $number = "";
    if ( !$customer) {
        print "E-Mail? ";
        $address = <STDIN>; 
        chomp $address;
        if (! $address) {
            print "Number? ";
            $number = <STDIN>;
            chomp $number;
        }
    }
    
    next if ( !$customer and !$address and !$number );
    last if ( $customer eq 'q' or $address eq 'q' or $number eq 'q' );
    
    if ( $customer and exists $Customer{$customer} ) {
        print "Customer: $Customer{$customer}\n";
        next;
    }
    if ($address and exists $Email{$address} ) {
        print "Customer: $Email{$address}\n";
        next;
    }
    if ($number and exists $Phone{$number} ) {
        print "Phone: $Phone{$number}\n";
        next;
    }
    print "Customer record not found. \n";
    next;
}
print "\nAll done.\n";

Smith,John (248)-555-9430 jsmith@aol.com
Hunter,Apryl (810)-555-3029 april@showers.org
Stewart,Pat (405)-555-8710 pats@starfleet.co.uk
Ching,Iris (305)-555-0919 iching@zen.org
Doe,John (212)-555-0912 jdoe@morgue.com
Jones,Tom (312)-555-3321 tj2342@aol.com
Smith,John (607)-555-0023 smith@pocahontas.com
Crosby,Dave (405)-555-1516 cros@csny.org
Johns,Pam (313)-555-6790 pj@sleepy.com
Jeter,Linda (810)-555-8761 netless@earthlink.net
Garland,Judy (305)-555-1231 ozgal@rainbow.com


Re: The Art of Hashing by BrowserUk (Patriarch) on Jun 15, 2014 at 11:51 UTC
"brand new to hashing ... I'm trying to make the script search for a partial email address"

Hashing (as in Perl's hashes) provides fast lookup using exact matching only, and is entirely the wrong mechanism for partially matching anything.

If your keys are, like your example "smith", actually always whole words, then you could index your data by whole surnames:

    [0] Perl> @x = split( '[, ]', $_ ), push @{ $bySurname{ $x[0] }{ $x[1] } }, [ @x[ 2, 3 ] ] for split /\n\s/, <<'END'
    Smith,John (248)-555-9430 jsmith@aol.com
    Hunter,Apryl (810)-555-3029 april@showers.org
    Stewart,Pat (405)-555-8710 pats@starfleet.co.uk
    Ching,Iris (305)-555-0919 iching@zen.org
    Doe,John (212)-555-0912 jdoe@morgue.com
    Jones,Tom (312)-555-3321 tj2342@aol.com
    Smith,John (607)-555-0023 smith@pocahontas.com
    Crosby,Dave (405)-555-1516 cros@csny.org
    Johns,Pam (313)-555-6790 pj@sleepy.com
    Jeter,Linda (810)-555-8761 netless@earthlink.net
    Garland,Judy (305)-555-1231 ozgal@rainbow.com
    END
    ;;

    [0] Perl> pp %bySurname;;
    (
      "Jeter",
      { Linda => [["(810)-555-8761", "netless\@earthlink.net"]] },
      "Ching",
      { Iris => [["(305)-555-0919", "iching\@zen.org"]] },
      "Smith",
      {
        John => [
          ["(248)-555-9430", "jsmith\@aol.com"],
          ["(607)-555-0023", "smith\@pocahontas.com"],
        ],
      },
      "Crosby",
      { Dave => [["(405)-555-1516", "cros\@csny.org"]] },
      "Jones",
      { Tom => [["(312)-555-3321", "tj2342\@aol.com"]] },
      "Doe",
      { John => [["(212)-555-0912", "jdoe\@morgue.com"]] },
      "Johns",
      { Pam => [["(313)-555-6790", "pj\@sleepy.com"]] },
      "Hunter",
      { Apryl => [["(810)-555-3029", "april\@showers.org"]] },
      "Garland",
      { Judy => [["(305)-555-1231", "ozgal\@rainbow.com\n"]] },
      "Stewart",
      { Pat => [["(405)-555-8710", "pats\@starfleet.co.uk"]] },
    )

Which would allow you to find all those with "smith" (provided you lc the keys, which I didn't above), but won't let you find those with "jo*" in the name.

For small numbers of lines, a few thousand or so, I'd keep them in a single string and use a simple text search.

For a fully wild-card search of many more than that, I'd probably build an index of 2 or 3 consecutive characters.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
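The index of 3 consecutive characters mentioned above could look roughly like the following. This is only a sketch under assumptions: the names `%index` and `search` are made up for illustration, a few lines of the sample data are hard-coded instead of being read from customers.txt, and the gram length is fixed at 3.

```perl
use strict;
use warnings;

# A few of the sample records; normally these would be read from customers.txt.
my @records = (
    'Smith,John (248)-555-9430 jsmith@aol.com',
    'Doe,John (212)-555-0912 jdoe@morgue.com',
    'Jones,Tom (312)-555-3321 tj2342@aol.com',
    'Smith,John (607)-555-0023 smith@pocahontas.com',
);

# Map every lower-cased 3-character substring to the set of
# record numbers that contain it.
my %index;
for my $n (0 .. $#records) {
    my $text = lc $records[$n];
    $index{ substr($text, $_, 3) }{$n} = 1 for 0 .. length($text) - 3;
}

# Return all records containing $term anywhere (case-insensitive).
sub search {
    my ($term) = @_;
    $term = lc $term;
    my @grams = map { substr($term, $_, 3) } 0 .. length($term) - 3;

    my @candidates;
    if (@grams) {
        # A matching record must contain every 3-gram of the term.
        my %hits;
        for my $gram (@grams) {
            $hits{$_}++ for keys %{ $index{$gram} || {} };
        }
        @candidates = grep { $hits{$_} == @grams } keys %hits;
    }
    else {
        # Term shorter than 3 characters: the index cannot help, scan everything.
        @candidates = 0 .. $#records;
    }

    # Sharing all 3-grams is necessary but not sufficient, so confirm with index().
    return map  { $records[$_] }
           grep { index(lc $records[$_], $term) >= 0 }
           sort { $a <=> $b } @candidates;
}

print "$_\n" for search('smith');
```

The index only narrows the candidate set; the final `index()` check is what decides a match. For a file of a dozen lines this is overkill, and it pays off only once the data is large enough that scanning every record per query is too slow.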
Re: The Art of Hashing by poj (Abbot) on Jun 15, 2014 at 21:06 UTC
For partial matches, an alternative approach would be to treat the text file as a database table using DBD::CSV and query it with the SQL 'LIKE' operator.

    #!perl
    use strict;
    use DBI;

    # set up database
    my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
        csv_tables => {
          customer => {
            f_file => "customers.txt" ,
            col_names => ['NAME','TELNO','EMAIL'],
          }
        },
        csv_sep_char => " ",
    }) or die $DBI::errstr;

    print "Type 'q' to exit\n";
    my @field = ('Name','TelNo','Email');
    my $i=0;
    my $input;

    while (1) {

      do { # get input
        $i = 0 if $field[$i] eq '';
        print $field[$i]."? ";
        chomp($input = <STDIN>);
        ++$i;
      } until ($input) ;
      --$i;

      last if (lc $input eq 'q');

      # query database
      my $fld = $field[$i];
      print "Search for $fld contains '$input'\n";

      my $sth = $dbh->prepare("SELECT * FROM customer
      WHERE $fld CLIKE ?"); # ignore case
      $sth->execute('%'.$input.'%');

      my $count;
      while (my @f = $sth->fetchrow_array){
        ++$count;
        print "$count $fld : $f[$i] -> $f[0] $f[1] $f[2]\n";
      }
      print "Customer record not found. \n" unless ($count);
      $input='';
    }

poj
Re: The Art of Hashing by shmem (Chancellor) on Jun 16, 2014 at 09:11 UTC
"... but not sure how to make (hash?) to just type smith and get both emails."

Two points here: 1) partial match and 2) multiple values for one key.

1. For Perl hashes, the only way to get the keys of a hash that match a given string is to perform a pattern match on the entire set of keys.

    my @matching_keys = grep /$customer/, keys %Customer;

There are other hash implementations close to the Perl core which live in extensions. For instance, there is DB_File, which interfaces to Berkeley DB and provides binary trees. This implementation has partial match built in.

2. A Perl hash consists of key/value pairs. If you want to store more than one item in the value slot, you have to store a reference to a container (a reference to an anonymous array or hash), which you later dereference.

    # $Customer{$customer} = $_;
    push @{$Customer{$customer}}, $_; # use value slot as anonymous array

    # later
    # print "Customer: $Customer{$customer}\n";
    print join("\n", "Customer:", @{$Customer{$customer}}), "\n";

Again, the DB_File module is an alternative here. Its BTREE file type optionally allows a single key to be associated with an arbitrary number of values. The "file" isn't necessarily an external file, since Berkeley DB allows the creation of in-memory databases.

See grep, push, join and DB_File.

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
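The DB_File alternative described above might be sketched as follows. Treat it as an illustration under assumptions rather than a drop-in solution: it requires a Perl whose DB_File was built against Berkeley DB, the names `%by_name` and `prefix_search` are invented here, and the sample lines are hard-coded. Note that the built-in partial match is a prefix match on the sorted keys, so it finds "smith" at the start of a key but not in the middle of one.

```perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # $DB_BTREE, R_DUP, R_CURSOR, R_NEXT

# Allow several values under one key; this must be set before the tie.
$DB_BTREE->{flags} = R_DUP;

# An undef filename asks Berkeley DB for an in-memory database.
my %by_name;
my $db = tie %by_name, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory BTREE: $!\n";

my @lines = (
    'Smith,John (248)-555-9430 jsmith@aol.com',
    'Doe,John (212)-555-0912 jdoe@morgue.com',
    'Jones,Tom (312)-555-3321 tj2342@aol.com',
    'Smith,John (607)-555-0023 smith@pocahontas.com',
    'Johns,Pam (313)-555-6790 pj@sleepy.com',
);

for my $line (@lines) {
    my ($name) = split ' ', $line;
    $by_name{ lc $name } = $line;    # with R_DUP this adds a value, it does not overwrite
}

# Return every stored line whose (lower-cased) key starts with $prefix.
sub prefix_search {
    my ($prefix) = @_;
    $prefix = lc $prefix;
    my ($key, $value) = ($prefix, '');
    my @found;
    # R_CURSOR positions on the first key >= $prefix; R_NEXT walks on in key order.
    for (my $status = $db->seq($key, $value, R_CURSOR);
         $status == 0 && index($key, $prefix) == 0;
         $status = $db->seq($key, $value, R_NEXT)) {
        push @found, $value;
    }
    return @found;
}

print "$_\n" for prefix_search('smith');
```

Because the keys are kept sorted, the loop can stop at the first key that no longer starts with the prefix instead of visiting every key.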
Re^2: The Art of Hashing by BrowserUk (Patriarch) on Jun 16, 2014 at 09:24 UTC
"1. For perl hashes, the only way to get matching keys of a hash for a given string, is performing a pattern match on the entire set of keys."

    my @matching_keys = grep /$customer/, keys %Customer;

As some guy who used to be famous around here once said: "that's like buying an Uzi and using it to club your enemy to death".

It's also very misleading, because the hash is serving no useful purpose in that construct. It would be considerably cheaper to put the list of keys into an array and grep that than to force Perl to re-walk the hash structure to generate the list of strings for grep each time around.
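A minimal sketch of that cheaper approach, applied to the original question: keep the records in a plain array and grep it. The field names and the `find_by` helper are assumptions made up for this example, and the sample lines are hard-coded rather than read from customers.txt.

```perl
use strict;
use warnings;

my @FIELDS = qw(name phone email);

# A few of the sample lines; normally these would be read from customers.txt.
my @lines = (
    'Smith,John (248)-555-9430 jsmith@aol.com',
    'Doe,John (212)-555-0912 jdoe@morgue.com',
    'Jones,Tom (312)-555-3321 tj2342@aol.com',
    'Smith,John (607)-555-0023 smith@pocahontas.com',
);

# One hash reference per line, stored in an array (no lookup hash needed).
my @customers = map {
    my %rec;
    @rec{@FIELDS} = split ' ', $_, 3;
    \%rec;
} @lines;

# Return every record whose $field contains $term, ignoring case.
sub find_by {
    my ($field, $term) = @_;
    my $re = qr/\Q$term\E/i;    # \Q...\E: treat the typed text literally
    return grep { $_->{$field} =~ $re } @customers;
}

for my $c (find_by(email => 'smith')) {
    print "$c->{name} $c->{phone} $c->{email}\n";
}
```

Typing "smith" for the email field then returns both the jsmith@aol.com and the smith@pocahontas.com records, and the same helper works for the name and phone fields.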
Re^3: The Art of Hashing by shmem (Chancellor) on Jun 16, 2014 at 13:01 UTC
Of course I can't but fully agree with you, and thank you for the critique. Yes, storing the keys in an array is much cheaper, and partial matching isn't what Perl hashes are made for; that's better done with BTREE structures.
Re^3: The Art of Hashing by CountZero (Bishop) on Jun 16, 2014 at 16:49 UTC
"that's like buying an Uzi and using it to club your enemy to death"

Indeed, an UZI is a very bad choice for a club. I'd suggest an AK47 or an M16. A Lee Enfield or a Mauser 98K is an even better choice.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics