Help with words scrambled

deluxaran has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Help with words scrambled by Perlbotics (Archbishop) on Nov 01, 2011 at 13:56 UTC
Perhaps the following modules, together with a dictionary and a few lines of Perl, will do?

String::Approx - Perl extension for approximate matching (fuzzy matching)
Text::Levenshtein - An implementation of the Levenshtein edit distance
Text::Soundex - Implementation of the soundex algorithm (for names)

HTH
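To make the edit-distance suggestion concrete, here is a minimal sketch. It is hand-rolled dynamic programming rather than a call into Text::Levenshtein, so nothing here reflects that module's API. The word list and the names `edit_distance` and `closest_words` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming edit distance: each insertion, deletion,
# or substitution costs 1.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Return every word that shares the smallest distance to $bad.
sub closest_words {
    my ($bad, @words) = @_;
    return unless @words;
    my %dist = map { $_ => edit_distance($bad, $_) } @words;
    my $best = min(values %dist);
    return sort grep { $dist{$_} == $best } keys %dist;
}

# Hypothetical stand-in for a real dictionary file.
my @dictionary = qw(should shoved skid solid avoid);
print join(' ', closest_words('sk0oid', @dictionary)), "\n";
```

With this toy list the shorter word "skid" wins at distance 2 while "should" sits at distance 4, which suggests that raw edit distance on its own may not be enough for OCR repair.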
Re^2: Help with words scrambled by GrandFather (Saint) on Nov 01, 2011 at 19:56 UTC
Except that you probably shouldn't use Text::Soundex for almost anything. It is a hashing technique for matching names that follow common patterns for English given names and surnames. It works by dropping various letters, combining various letter sequences, and truncating the result. It is an interesting but fairly blunt instrument, good for grouping names or adding "see also" entries in a telephone book in England and a few other English-speaking countries, but much less useful for any other purpose.

True laziness is hard work
Re: Help with words scrambled by ww (Archbishop) on Nov 01, 2011 at 14:09 UTC
"I assume that the length of the word remains unchanged in the scrambled form." Bad assumption. It's merely anecdotal evidence, but EVERY consumer grade OCR I've tested in the past two decades has chaÔçged [some unrecognized] wo rd l en†t s as well as borking at least some major part of very common, 12 point fonts.	[reply]
Re: Help with words scrambled by BrowserUk (Patriarch) on Nov 01, 2011 at 14:23 UTC
Even if your "same length" criterion holds, then based on the number of matching characters alone (sk0oid vs. should), there are 543 words in my dictionary that could be considered matches for sk0oid:

```
c:\test>junk33 sk0oid | find "33.33%" | wc -l
543
```

These range from:

```
c:\test>junk33 sk0oid | find "33.33%"
sk0oid has a 33.33% chance of being: abroad
sk0oid has a 33.33% chance of being: acarid
sk0oid has a 33.33% chance of being: accord
...
sk0oid has a 33.33% chance of being: shoots
sk0oid has a 33.33% chance of being: shored
sk0oid has a 33.33% chance of being: should
sk0oid has a 33.33% chance of being: shoved
sk0oid has a 33.33% chance of being: showed
sk0oid has a 33.33% chance of being: shrewd
sk0oid has a 33.33% chance of being: shroff
sk0oid has a 33.33% chance of being: shrove
sk0oid has a 33.33% chance of being: shuted
...
sk0oid has a 33.33% chance of being: uphold
sk0oid has a 33.33% chance of being: upload
sk0oid has a 33.33% chance of being: verbid
sk0oid has a 33.33% chance of being: vespid
sk0oid has a 33.33% chance of being: vetoed
sk0oid has a 33.33% chance of being: viscid
sk0oid has a 33.33% chance of being: zeroed
```

You would need to come up with a heuristic that reflects the types of mistakes that your OCR program has a habit of making in order to get anything like good results.

The program used above:

```perl
#! perl -slw
use strict;

# Load the dictionary, one word per line.
open my $fh, '<', 'words.txt' or die $!;
chomp( my @words = <$fh> );
close $fh;

chomp( my $bad = lc shift() );

for my $good ( @words ) {
    next if length $good != length $bad;

    # XOR the strings: positions where the characters agree become "\0".
    my $mask  = $good ^ $bad;
    my $match = $mask =~ tr[\0][];
    next unless $match;

    printf "%s has a %.2f%% chance of being: %s\n",
        $bad, $match / length( $good ) * 100, $good;
}
```

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
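One possible shape for such a heuristic is to give partial credit when two characters are known look-alikes, instead of counting only exact matches. The sketch below assumes a confusion table whose pairs and weights are entirely invented; real values would have to be measured from your own OCR output. The names `char_score`, `word_score`, and `rank_candidates` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical confusion table: look-alike pairs with made-up similarity
# weights between 0 and 1. Keys hold the two characters in sorted order.
my %similar = (
    '0o' => 0.9, '1l' => 0.9, '1i' => 0.8, 'il' => 0.8,
    'hk' => 0.7, '5s' => 0.8, 'bh' => 0.6,
);

# Similarity of two single characters, independent of argument order.
sub char_score {
    my ($x, $y) = @_;
    return 1 if $x eq $y;
    my $key = $x lt $y ? "$x$y" : "$y$x";
    return $similar{$key} // 0;
}

# Average per-position similarity; 0 when the lengths differ.
sub word_score {
    my ($bad, $good) = @_;
    return 0 if length($bad) != length($good);
    return 0 unless length($bad);
    my $total = 0;
    $total += char_score(substr($bad, $_, 1), substr($good, $_, 1))
        for 0 .. length($bad) - 1;
    return $total / length($bad);
}

# Candidates sorted from best to worst score, ties broken alphabetically.
sub rank_candidates {
    my ($bad, @words) = @_;
    my %score = map { $_ => word_score($bad, $_) } @words;
    return sort { $score{$b} <=> $score{$a} or $a cmp $b } keys %score;
}

# Hypothetical stand-in for a real dictionary file.
my @dictionary = qw(abroad shoved should shrewd upload);
printf "%-8s %.2f\n", $_, word_score('sk0oid', $_)
    for rank_candidates('sk0oid', @dictionary);
```

With these invented weights, "should" moves ahead of the other same-length candidates because k/h, 0/o, and i/l all earn partial credit. How well that carries over depends entirely on how closely the table matches the OCR engine's real mistakes.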
Re: Help with words scrambled by zentara (Cardinal) on Nov 01, 2011 at 18:02 UTC
It's probably cheaper and more accurate in the long run to just hire an entry-level worker to visually scan them :-)

I'm not really a human, but I play one on earth.
Old Perl Programmer Haiku ................... flash japh
Re: Help with words scrambled by pvaldes (Chaplain) on Nov 02, 2011 at 00:43 UTC
"Is there a way to find the closest word from a dictionary that is close in form to 'sk0oid'?"

I think this is a questionable technique: you could completely change the meaning of the text without noticing, unless you add some people to the equation to check it line by line (and even then, the checking would be much easier beforehand, against the original OCR output!). If that is not possible, maybe you could improve the method by repeating the OCR several times and matching across the different copies of the same page, and/or by extracting a dictionary of all the words used in your files, counting them, sorting similar words by their probability of appearing in the file, and doing the replacement with that in mind. Forget the idea of a complete dictionary with a lot of very rare words; it is much more error-prone.
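The corpus-dictionary idea could be sketched roughly as follows: count the words that actually occur in the OCRed text, then prefer the most frequent candidate among those that match equally well. The sample text and the names `word_counts`, `matching_chars`, and `best_guess` are made up for illustration, and ranking by position-wise matches is only one assumed way to define "similar".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each token occurs in the text, case-insensitively.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /([A-Za-z0-9]+)/g;
    return \%count;
}

# Number of positions at which two equal-length words agree.
sub matching_chars {
    my ($x, $y) = @_;
    return 0 if length($x) != length($y);
    return scalar grep { substr($x, $_, 1) eq substr($y, $_, 1) }
        0 .. length($x) - 1;
}

# Best same-length corpus word: most matching characters first, then the
# most frequent, then alphabetical order. Returns undef if none qualifies.
sub best_guess {
    my ($bad, $count) = @_;
    my @ranked = sort {
        matching_chars($bad, $b) <=> matching_chars($bad, $a)
            or $count->{$b} <=> $count->{$a}
            or $a cmp $b
    } grep { length($_) == length($bad) and $_ ne $bad } keys %$count;
    return $ranked[0];
}

# Made-up sample text standing in for the OCR output of a whole file.
my $text   = 'We should go. They should know we shoved it. We sk0oid wait.';
my $counts = word_counts($text);
my $guess  = best_guess('sk0oid', $counts);
print "sk0oid: ", (defined $guess ? $guess : '~'), "\n";
```

Here "should" and "shoved" both agree with "sk0oid" in two positions, so the higher count of "should" in the sample text decides the tie. A word that never occurs in the corpus can never be suggested, which is the trade-off this approach accepts.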
Re: Help with words scrambled by Anonymous Monk on Nov 01, 2011 at 23:09 UTC
Depending on your font and how drunk you are, the letters "b", "h", and "k" might look alike. So remap them all to the letter "b". Do this for any other look-alikes. This is sort of like Text::Soundex, except with the look instead of the sound of a word.

```perl
#!/usr/bin/perl
use warnings;
use strict;

my %dict;

# Index every dictionary word under its look-alike key.
@ARGV = 'words.txt';
while (<>) {
    chomp;
    my $word = $_;
    tr/gq9xz2mwbhk68ilj1acenosu05/gggzzzmmbbbbbiiiiaaaaaaaaa/;
    $dict{$_} .= "$word ";
}

# Look up each scrambled word by the same key.
while (<DATA>) {
    chomp;
    my $word = $_;
    tr/gq9xz2mwbhk68ilj1acenosu05/gggzzzmmbbbbbiiiiaaaaaaaaa/;
    my $matches = $dict{$_} || '~';
    print "$word: $matches\n";
}
__END__
sk0oid
```