hoyt has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I recently started a project to improve some of my code and thought this would be an easy part, but I'm not finding much information on what I'm trying to do. I have a MySQL database that I populate with data points from text files. Today most of that information is gleaned by a hodge-podge of PHP code. I am trying to move the searching and matching to Perl instead, figuring I could work in a lot of improvements at the same time. For instance, I have a table in the database holding artists. In my Perl script, I connect to MySQL, fetch the artists, and put them into a hash. Next I open the relevant text file and read every line into an array. I then want to check each item in the array to see whether it matches a key in the hash. This works wonderfully like this:
for (my $i = 0; $i <= $#lines; $i++) {
    if (exists $artists{$lines[$i]}) {
        # ... handle the match ...
    }
}
But I was hoping to add some ranking of the match to present the user with the most likely artist for that text file. I would eventually write the matches back into a MySQL table to present later. I'm thinking an exact full line match = 256, a case insensitive full line match = 128, a partial word match = 64, etc. Then I could add up all of the matches for one artist and rank them for the user. I'm surprised to not find a method to do this without re-inventing the wheel. Am I going about this wrong? Any wisdom to point me in a different direction? Thanks!

Re: Hash Search Ranking
by Your Mother (Archbishop) on Jan 09, 2016 at 19:45 UTC

    I would start here, Building a Vector Space Search Engine in Perl. Then maybe look into Lucy, Search::Elasticsearch, and Search::Tools and the many tangents you will encounter at each junction. This is a fun but deceptively deep problem space. Stemming, tokenizing, substrings, case, encoding, the actual definition of what a word/token is, that nothing any more is plain text to start with but some kind of markup or document format… and making it work with speed and reasonable scoring is incredibly difficult despite the fact that a vanilla inverted index or vector search is not that hard.

    Have fun. :P
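As a rough illustration of the "vanilla inverted index" mentioned above, here is a minimal sketch. The sample artist names and the `candidates` helper are assumptions made for the example; nothing here comes from Lucy, Search::Elasticsearch, or Search::Tools, and it does none of the stemming or smarter tokenizing those tools offer.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the original question these come from MySQL.
my @artists = ('William Blake', 'David Hockney', 'Francis Blake', 'David Lynch');

# Inverted index: lower-cased word => { artist => times the word occurs in the name }
my %index;
for my $artist (@artists) {
    $index{ lc $_ }{$artist}++ for split ' ', $artist;
}

# Hypothetical helper: look up every word of a line and count,
# per artist, how many of the line's words appear in that artist's name.
sub candidates {
    my ($line) = @_;
    my %hits;
    for my $word (split ' ', lc $line) {
        my $postings = $index{$word} or next;   # word not in any artist name
        $hits{$_}++ for keys %$postings;
    }
    return \%hits;
}

my $hits = candidates('william blake and blake morrison');
print "$_: $hits->{$_}\n" for sort keys %$hits;
```

The counts are only a crude word-overlap measure; turning them into a useful score is where the difficulty described above begins.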

Re: Hash Search Ranking
by GrandFather (Saint) on Jan 09, 2016 at 19:51 UTC

    On a completely unrelated note, in Perl your for loop is better written:

    for my $i (0 .. $#lines)

    or if you don't need the index:

    for my $line (@lines)
    Premature optimization is the root of all job security
Re: Hash Search Ranking
by tangent (Parson) on Jan 09, 2016 at 23:46 UTC
    For your first two cases the solution is straightforward, but you need to do a bit more to cover the third. Here is one way, maybe not so efficient but something to build on. I am assuming your focus is on the artists, i.e. you want to rank each artist as opposed to ranking each line.
    use Data::Dumper;
    $Data::Dumper::Sortkeys = 1;

    my %artists = (
        'William Blake' => 1,
        'David Hockney' => 1,
        'Francis Blake' => 1,
        'David Lynch'   => 1,
    );
    my @lines = (
        'William Blake',
        'Blake Morrison',
        'david lynch',
        'francis bacon',
        'William Blake and Blake Morrison',
    );
    my %rank;

    # make artists lower-case for case-insensitive match
    my %artists_lc = map { lc($_) => $_ } keys %artists;

    # map all the artist words
    my %artist_words;
    for my $artist ( keys %artists ) {
        my @words = split( /\s+/, $artist );
        for my $word ( @words ) {
            $artist_words{lc($word)}{$artist}++;
        }
    }
    # have a look at the map
    print Dumper(\%artist_words);

    for my $i ( 0 .. $#lines ) {
        my $line = $lines[$i];
        my $artist;
        my @words = split( /\s+/, lc($line) );
        for my $word ( @words ) {
            if ( my $hash = $artist_words{$word} ) {
                $rank{$_}{$i}++ for keys %$hash;
            }
        }
        # deal with exact and case-insensitive
        if ( $artists{$line} ) {
            $rank{$line}{$i} = 256;
        }
        elsif ( $artist = $artists_lc{lc($line)} ) {
            $rank{$artist}{$i} = 128;
        }
    }
    print Dumper(\%rank);
    Output:
    # Structure: { Artist => { line_index => score } }
    {
      'David Hockney' => { '2' => 1 },
      'David Lynch'   => { '2' => 128 },
      'Francis Blake' => { '0' => 1, '1' => 1, '3' => 1, '4' => 2 },
      'William Blake' => { '0' => 256, '1' => 1, '4' => 3 }
    };
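To get from that per-line structure to the single ranking per artist that the question asks for, one option is to sum each artist's scores and sort on the total. The sketch below is not part of the reply's code: it hard-codes the output shown above as its input, and `%total` and `@ranked` are names chosen for the example.

```perl
use strict;
use warnings;

# Same shape as the %rank dump above: artist => { line_index => score }
my %rank = (
    'David Hockney' => { 2 => 1 },
    'David Lynch'   => { 2 => 128 },
    'Francis Blake' => { 0 => 1, 1 => 1, 3 => 1, 4 => 2 },
    'William Blake' => { 0 => 256, 1 => 1, 4 => 3 },
);

# Sum the per-line scores for each artist.
my %total;
for my $artist (keys %rank) {
    $total{$artist} += $_ for values %{ $rank{$artist} };
}

# Highest total first; ties broken alphabetically so the order is stable.
my @ranked = sort { $total{$b} <=> $total{$a} or $a cmp $b } keys %total;

printf "%-15s %d\n", $_, $total{$_} for @ranked;
```

With this input William Blake totals 260, David Lynch 128, Francis Blake 5, and David Hockney 1. Whether a plain sum is the right way to combine scores is a design choice; taking the best single-line score per artist would be another.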
Re: Hash Search Ranking
by hoyt (Acolyte) on Jan 09, 2016 at 19:20 UTC
    Right after I posted this I did a search for "perl fuzzy match" and am finding much more than I did with "perl match ranking." It seems there are many modules I can learn from...
Re: Hash Search Ranking
by anonymized user 468275 (Curate) on Jan 14, 2016 at 14:51 UTC
    Some wheels, in this case the array of hash two-pass algorithm, are meant to be re-invented day in, day out. The variation for this case is that it's a hash of hash of array, e.g.:
    { artist => { 256 => [matches...], 128 => [matches...], ... }, ... }
    That needs two passes: one to populate it, and one to read it back sorted by calculated rank using a custom sort routine.
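A minimal sketch of those two passes, under some assumptions: the sample data is made up, only the exact (256) and case-insensitive (128) weights from the question are handled, and the "calculated rank" is taken to be each weight multiplied by the number of lines that earned it. The `score` helper is a name invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the database table and the text file.
my @artists = ('William Blake', 'David Lynch');
my @lines   = ('William Blake', 'david lynch', 'WILLIAM BLAKE');

# Pass 1: populate { artist => { weight => [matching lines] } }.
my %matches;
my %by_lc = map { lc($_) => $_ } @artists;
for my $line (@lines) {
    my $artist = $by_lc{ lc $line } or next;     # no full-line match at all
    my $weight = $line eq $artist ? 256 : 128;   # exact vs. case-insensitive
    push @{ $matches{$artist}{$weight} }, $line;
}

# Assumed rank formula: each weight times the number of lines that earned it.
sub score {
    my ($by_weight) = @_;
    my $sum = 0;
    $sum += $_ * @{ $by_weight->{$_} } for keys %$by_weight;
    return $sum;
}

# Pass 2: read the structure back, sorted by calculated rank (custom sort).
my @ranked = sort { score($matches{$b}) <=> score($matches{$a}) or $a cmp $b }
             keys %matches;

print "$_: ", score($matches{$_}), "\n" for @ranked;
```

Keeping the matched lines in the arrays, rather than just a running total, means the evidence behind each score is still available when presenting results to the user.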