tmaly has asked for the wisdom of the Perl Monks concerning the following question:

Monks, has anyone come across an OCR solution that Perl can drive on Windows via OLE and that works on all MS Office formats and PDFs? I see something about OmniPage, but there is not enough detail there to determine whether it would work.
Best regards, Ty
Solution:
I came up with an OCR solution for image PDFs that worked on Windows XP with a copy of MS Office 2003, Strawberry Perl 5.8.x, ImageMagick, and Ghostscript.
Using ImageMagick and Ghostscript, I converted the PDF that contained the image text to a high-resolution TIFF:
    convert.exe -density 400x400 test.pdf test.tiff
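If the conversion step is to be run from the same script, something along these lines might work. This is only a sketch: the helper names `convert_args` and `pdf_to_tiff` are made up for illustration, and it assumes ImageMagick's convert.exe (not the Windows disk utility of the same name) is the one found first on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for ImageMagick's convert.
# The density defaults to the 400 dpi used in the post.
sub convert_args {
    my ($pdf, $tiff, $dpi) = @_;
    $dpi ||= 400;
    return ('convert.exe', '-density', "${dpi}x${dpi}", $pdf, $tiff);
}

# Hypothetical helper: run the conversion and die if convert reports failure.
# The list form of system() avoids shell quoting problems with odd file names.
sub pdf_to_tiff {
    my ($pdf, $tiff, $dpi) = @_;
    my @cmd = convert_args($pdf, $tiff, $dpi);
    system(@cmd) == 0
        or die "convert failed for $pdf (exit status " . ($? >> 8) . ")\n";
    return $tiff;
}

# Example call (commented out so the sketch loads without ImageMagick):
# pdf_to_tiff('test.pdf', 'test.tiff');
```

Giving the full path to convert.exe instead of relying on the PATH would sidestep the name clash entirely.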
Next I used Win32::OLE with the Microsoft Office Document Imaging library, which is capable of OCRing TIFFs:

    use strict;
    use Win32::OLE;
    use Win32::OLE::Const;
    use Win32::OLE::Enum;

    # The type library name is matched as a regex, hence the escaped dot
    my $wd = Win32::OLE::Const->Load('Microsoft Office Document Imaging 11\.0 Type Library');

    my $o = Win32::OLE->new('MODI.Document', sub {$_[0]->Quit;})
        or die("Cannot create modi document\n");
    if (!defined($o)) {
        die("no object!\n");
    }

    $o->Create('test.tiff');

    # ocr the text then save it back into the tiff
    # you only have to do the next two steps once and the
    # ocr text will be saved for later use if you wish to
    # open the tiff at a later time
    $o->OCR();
    $o->Save();

    my $im  = $o->{Images};
    my $en  = Win32::OLE::Enum->new($im);
    my @ims = $en->All;

    # now print out the text from each page
    foreach my $i (@ims) {
        print $i->{Layout}{Text} . "\n";
    }

Replies are listed 'Best First'.
Re: Win32 and OCR via OLE
by marto (Cardinal) on Apr 10, 2008 at 11:03 UTC
    What do you mean by OCRing MS Office formats? As far as I am aware, Office saves files in each application's native format: .doc for Word, .xls for Excel, etc. These are not images, so using the term OCR in this context makes no sense. If you want to get the text from them so you can text index them or whatever, that's a different question really.

    Regarding actually OCRing and text stripping PDFs (and images), see "Re: Extracting content text from PDFs" in response to "Extracting content text from PDFs", and remember that Super Search is your friend.

    Hope this helps

    Martin
Re: Win32 and OCR via OLE
by ww (Archbishop) on Apr 10, 2008 at 11:00 UTC

    Have used OmniPage thru next to latest version.

    Wish I hadn't.

    Getting decent results is generally about as much work as re-entering the data would have been, and no less error-prone.

    And why would you want to OCR "MS Office formats" anyway? They're already character data. Worst case, for Word & Excel, would be to export to text (and unless you catch OmniPage on sale, there's a subset of MSO available for about the same price).

    To extract from the .pdf's, see the various pdf modules.

      I was looking to parse MS Outlook attachments that could be .doc, .xls, or .pdf and determine via a Bayesian classifier what to do with the documents. Since I do not know which of the formats will be used in the attachments, I was hoping for a general OCR solution that I could use to pull all the text from the documents, regardless of whether they contained text, binary formats, or image text.

        OCR will NOT help with .doc or .xls. Neither is an image for input to Optical Character Recognition.

        Please read the replies from Corion and Marto as already posted... and see Marto's for .pdf

        And if you're thinking about "pull(ing)...from (unknown) binary formats or image text" you better start thinking about how to deal with malware.

        Do you understand what the letters OCR actually mean? I do not think that you do. When I don't know something, I research it. Various people have replied to you explaining essentially what OCR means.

        Obviously there is no 'general OCR solution' to do what you want to do, given that you can't OCR things that are not images. You mean you want to text strip the attachments. You have been given links which discuss OCRing; using Super Search will return plenty of resources on text stripping various types of files.

        Martin
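The point made in the sub-thread above (text extraction for .doc and .xls, OCR only for images, PDFs possibly either) could be captured as a small dispatch table keyed on file extension. This is a rough sketch, not anyone's posted code: the strategy names and the `strategy_for` helper are invented for illustration, and the actual extraction and OCR routines are left out.

```perl
use strict;
use warnings;

# Hypothetical mapping from extension to handling strategy.
# .doc/.xls already hold character data; image formats need OCR;
# a PDF may hold real text or only page images.
my %strategy = (
    doc  => 'extract_text',
    xls  => 'extract_text',
    pdf  => 'extract_text_then_ocr_if_empty',
    tif  => 'ocr',
    tiff => 'ocr',
);

# Return the strategy for an attachment name, or 'skip' for anything
# unknown, so unexpected binaries are never opened.
sub strategy_for {
    my ($filename) = @_;
    my ($ext) = $filename =~ /\.([^.\\\/]+)$/;
    return 'skip' unless defined $ext;
    return $strategy{lc $ext} || 'skip';
}

print "$_ => ", strategy_for($_), "\n"
    for qw(Report.DOC budget.xls scan.pdf setup.exe);
```

Defaulting to 'skip' is deliberate: it keeps unknown attachment types away from the classifier, which is one modest answer to the malware concern raised above.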
Re: Win32 and OCR via OLE
by Corion (Patriarch) on Apr 10, 2008 at 10:57 UTC

    Last I looked, the MS Office products all had text as text and not as graphics, so you wouldn't need any OCR there. But maybe I'm misunderstanding your question completely. Maybe you can explain what you want to accomplish?

Re: Win32 and OCR via OLE
by talexb (Chancellor) on Apr 10, 2008 at 14:04 UTC

    OCR means Optical Character Recognition -- a method whereby a specific character is inferred from its graphical representation. If you start with a document in MS Office, then OCR isn't necessary -- the document is stored as text, not a graphical representation that needs conversion.

    If instead you are talking about discovering the document structure, I know that my former employer is in that business, taking PDFs, analyzing them and generating XML. They had a workflow that took DOC files, created PDFs and then used those to generate XML. This XML contains all of the textual content of the document, wrapped in tags that dictate the document structure.

    Let me know if you need more information.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: Win32 and OCR via OLE
by igelkott (Priest) on Apr 10, 2008 at 22:22 UTC
    "Comment on"

    Quite an array of replies but many seem to believe that you don't know what OCR is. My interpretation is that you were making a Request for Comment on the Code Snippet or Cool Use for Perl that you wrote, right?

    Anyway, to answer the question I think you asked, PDF::OCR::Thorough reportedly does this. It calls OCR (through Tesseract) when needed but will otherwise just extract the text.

    Haven't used this module but had planned to. Last thing I needed to 'OCR', I used something very similar to your method. I'm hoping for better results than I got.
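The "extract first, OCR only when needed" idea described above can be sketched in a few lines. To be clear, this is not how PDF::OCR::Thorough is implemented and does not use its API; `text_or_ocr`, its callback arguments, and the 20-character threshold are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: try plain text extraction first and fall back to
# OCR only when the extractor returns (almost) nothing, as happens with
# PDFs that contain only page images.
#   $extract, $ocr : code refs taking a file name and returning text
#   $min_chars     : assumed threshold of non-whitespace characters
sub text_or_ocr {
    my ($file, $extract, $ocr, $min_chars) = @_;
    $min_chars = 20 unless defined $min_chars;

    my $text = $extract->($file);
    $text = '' unless defined $text;

    # Count only visible characters so blank pages do not pass the check.
    (my $visible = $text) =~ s/\s+//g;
    return $text if length($visible) >= $min_chars;

    return $ocr->($file);
}
```

The OCR callback could wrap the MODI code from the original post, and the extractor could wrap whichever PDF module is at hand; passing them in as code refs keeps the fallback logic independent of both.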