deluxaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I have a large batch of text files that went through the OCR process, and not all of them came out in good shape.

For example instead of "should" there may be 'sk0oid'. Is there a way to find the closest word from a dictionary that is close in form to 'sk0oid'?

I assume that the length of the word remains unchanged in the scrambled form.

Re: Help with words scrambled
by Perlbotics (Archbishop) on Nov 01, 2011 at 13:56 UTC

    Perhaps the following modules, together with a dictionary and a few lines of Perl will do?

    HTH

      Except that you probably shouldn't use Text::Soundex for almost anything. It is a hashing technique for matching names that follow common patterns for English given names and surnames. It works by dropping various letters, combining various letter sequences, and truncating the result. It is an interesting but fairly blunt instrument, good for grouping names or adding "see also" entries in a telephone book in England and a few other English-speaking countries, but much less useful for any other purpose.

      True laziness is hard work
Re: Help with words scrambled
by ww (Archbishop) on Nov 01, 2011 at 14:09 UTC
    "I assume that the length of the word remains unchanged in the scrambled form."
    Bad assumption. It's merely anecdotal evidence, but EVERY consumer grade OCR I've tested in the past two decades has chaÔçged [some unrecognized] wo   rd l en†t s as well as borking at least some major part of very common, 12 point fonts.
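
    If word lengths can change, a position-by-position comparison no longer applies, and an edit distance that also counts inserted and dropped characters is one way to compare candidates. The following is only a minimal sketch of the standard Levenshtein distance in plain Perl; the word list and the helper names edit_distance and closest_words are made up for illustration, and a real run would load a proper dictionary.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic dynamic-programming edit distance: the minimum number of
# single-character insertions, deletions and substitutions needed
# to turn $s into $t.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,          # delete a character from $s
                $cur[$j - 1] + 1,       # insert a character into $s
                $prev[$j - 1] + $cost,  # substitute (or keep) a character
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical stand-in for a real word list.
my @dictionary = qw(should shroud length lengths word words);

# Return the dictionary words at the smallest distance from $bad.
sub closest_words {
    my ($bad, @words) = @_;
    my %dist = map { $_ => edit_distance($bad, $_) } @words;
    my $best = min(values %dist);
    my @closest = grep { $dist{$_} == $best } @words;
    return sort @closest;
}

for my $bad (qw(len-t-s wrd)) {
    my @closest = closest_words($bad, @dictionary);
    print "$bad: @closest\n";
}
```

    This treats every edit as equally likely, so on its own it is as blunt as counting matching characters; it only removes the dependence on equal lengths.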
Re: Help with words scrambled
by BrowserUk (Patriarch) on Nov 01, 2011 at 14:23 UTC

    Even if your "same length" criterion holds, then based on the number of matching characters -- sk0oid vs. should -- there are 543 words in my dictionary that match sk0oid just as well as should does:

    c:\test>junk33 sk0oid | find "33.33%" | wc -l
    543

    These range from:

    c:\test>junk33 sk0oid | find "33.33%"
    sk0oid has a 33.33% chance of being: abroad
    sk0oid has a 33.33% chance of being: acarid
    sk0oid has a 33.33% chance of being: accord
    ...
    sk0oid has a 33.33% chance of being: shoots
    sk0oid has a 33.33% chance of being: shored
    sk0oid has a 33.33% chance of being: should
    sk0oid has a 33.33% chance of being: shoved
    sk0oid has a 33.33% chance of being: showed
    sk0oid has a 33.33% chance of being: shrewd
    sk0oid has a 33.33% chance of being: shroff
    sk0oid has a 33.33% chance of being: shrove
    sk0oid has a 33.33% chance of being: shuted
    ...
    sk0oid has a 33.33% chance of being: uphold
    sk0oid has a 33.33% chance of being: upload
    sk0oid has a 33.33% chance of being: verbid
    sk0oid has a 33.33% chance of being: vespid
    sk0oid has a 33.33% chance of being: vetoed
    sk0oid has a 33.33% chance of being: viscid
    sk0oid has a 33.33% chance of being: zeroed

    You would need to come up with a heuristic that reflects the types of mistakes that your OCR program has a habit of making in order to get anything like good results.

    The program used above:

    #! perl -slw
    use strict;

    open W, '<', 'words.txt' or die $!;
    my @words = <W>;
    close W;
    chomp @words;

    chomp( my $bad = lc shift() );

    for my $good ( @words ) {
        next if length $good != length $bad;
        my $mask = $good ^ $bad;
        my $match = $mask =~ tr[\0][];
        next unless $match;
        printf "$bad has a %.2f%% chance of being: $good \n",
            $match / length( $good ) * 100;
    }
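
    One possible shape for such a heuristic is a table of character confusions, where a likely misread (say 0 for o) costs far less than an unrelated substitution. The sketch below is an illustration only: the pairs and weights in %confusion are invented guesses rather than measurements from any OCR program, the name ocr_cost is hypothetical, and it keeps the same-length restriction of the program above.

```perl
use strict;
use warnings;

# Hypothetical confusion costs: 0 = identical, 1 = unrelated.
# The pairs and weights are illustrative guesses, not measurements.
my %confusion;
for my $pair (['0', 'o', 0.1], ['k', 'h', 0.3], ['i', 'l', 0.2],
              ['1', 'l', 0.1], ['o', 'u', 0.4], ['5', 's', 0.2]) {
    my ($seen, $meant, $cost) = @$pair;
    $confusion{$seen}{$meant} = $cost;
}

# Sum the per-position cost of reading $bad as $good.
# Returns undef when the lengths differ.
sub ocr_cost {
    my ($bad, $good) = @_;
    return if length($bad) != length($good);
    my $total = 0;
    for my $i (0 .. length($bad) - 1) {
        my $seen  = substr($bad,  $i, 1);
        my $meant = substr($good, $i, 1);
        next if $seen eq $meant;
        $total += $confusion{$seen}{$meant} // 1;   # unknown pair: full cost
    }
    return $total;
}

# A few of the tied candidates from the listing, as a tiny test list.
my @words = qw(should shored shoved showed shrewd abroad);
my $bad   = 'sk0oid';

my %cost;
for my $word (@words) {
    my $c = ocr_cost($bad, $word);
    $cost{$word} = $c if defined $c;
}

# Lowest cost first; ties broken alphabetically.
for my $word (sort { $cost{$a} <=> $cost{$b} or $a cmp $b } keys %cost) {
    printf "%-8s %.1f\n", $word, $cost{$word};
}
```

    With these made-up weights, should comes out well ahead of the other candidates, but the result is only as good as the table, which would have to be built from errors actually seen in the OCR output.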

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Help with words scrambled
by zentara (Cardinal) on Nov 01, 2011 at 18:02 UTC
    Probably cheaper and more accurate in the long run, to just hire an entry level worker to visually scan them :-)

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: Help with words scrambled
by pvaldes (Chaplain) on Nov 02, 2011 at 00:43 UTC
    Is there a way to find the closest word from a dictionary that is close in form to 'sk0oid'?

    I think this is a questionable technique: you could completely change the meaning of the text without noticing, unless you add some people to the equation and check line by line (and even then, checking would be much easier beforehand, against the original OCR output!).

    If that is not possible, maybe you could improve the method by repeating the OCR several times and doing the match against different copies of the same page

    and/or

    extract a dictionary of all the words used in your files, count them, rank similar words by how often they appear in the files, and do the replacement with that ranking in mind. Forget the idea of a complete dictionary with lots of very rare words; that is much more error-prone.
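
    A rough sketch of the frequency idea follows. It assumes the text of the files is already in a string; the sample text, the threshold argument $min_count, and the helper names same_positions and rank_candidates are all hypothetical. Candidates are words from the files themselves with the same length as the bad word, ranked first by the number of matching positions and then by how often they occur.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the text of the OCRed files.
my $corpus = <<'TEXT';
You should check the scan. We should not guess. The shield was
shown, and sk0oid appears once. They should read it again.
TEXT

# Count every word-like token in the corpus.
my %count;
$count{lc $_}++ for $corpus =~ /[[:alnum:]]+/g;

# Number of positions where two equal-length strings agree.
sub same_positions {
    my ($x, $y) = @_;
    my $n = grep { substr($x, $_, 1) eq substr($y, $_, 1) }
        0 .. length($x) - 1;
    return $n;
}

# Candidates: same length, at least one matching position, and seen at
# least $min_count times (an assumed threshold for "probably a real word").
sub rank_candidates {
    my ($bad, $min_count) = @_;
    my @cands = grep {
        $_ ne $bad
            && length($_) == length($bad)
            && $count{$_} >= $min_count
            && same_positions($bad, $_) > 0
    } keys %count;
    return sort {
        same_positions($bad, $b) <=> same_positions($bad, $a)
            or $count{$b} <=> $count{$a}
            or $a cmp $b
    } @cands;
}

print join(', ', rank_candidates('sk0oid', 1)), "\n";
```

    In this sample, should and shield match sk0oid in the same number of positions, and the frequency count is what puts should first. Raising $min_count keeps one-off tokens, which may themselves be OCR errors, out of the candidate list.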

Re: Help with words scrambled
by Anonymous Monk on Nov 01, 2011 at 23:09 UTC

    Depending on your font and how drunk you are, the letters "b", "h", and "k" might look alike. So remap them all to the letter "b". Do this for any other look-alikes.

    This is sort of like Text::Soundex, except with the look instead of sound of a word.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %dict;
    @ARGV = 'words.txt';
    while (<>) {
        chomp;
        my $word = $_;
        tr/gq9xz2mwbhk68ilj1acenosu05/gggzzzmmbbbbbiiiiaaaaaaaaa/;
        $dict{$_} .= "$word ";
    }

    while (<DATA>) {
        chomp;
        my $word = $_;
        tr/gq9xz2mwbhk68ilj1acenosu05/gggzzzmmbbbbbiiiiaaaaaaaaa/;
        my $matches = $dict{$_} || '~';
        print "$word: $matches\n";
    }

    __END__
    sk0oid