in reply to Re: Win32 and OCR via OLE
in thread Win32 and OCR via OLE

I was looking to parse MS Outlook attachments that could be .doc, .xls, or .pdf, and to determine via a Bayesian classifier what to do with the documents. Since I do not know which of the formats will be used in the attachments, I was hoping for a general OCR solution that I could use to pull all the text from the documents regardless of whether they contained text, binary formats, or image text.

Re^3: Win32 and OCR via OLE
by ww (Archbishop) on Apr 10, 2008 at 13:41 UTC

    OCR will NOT help with .doc or .xls. Neither is an image, and Optical Character Recognition takes images as input.

    Please read the replies from Corion and Marto as already posted... and see Marto's for .pdf

    And if you're thinking about "pull(ing)...from (unknown) binary formats or image text", you'd better start thinking about how to deal with malware.

Re^3: Win32 and OCR via OLE
by marto (Cardinal) on Apr 10, 2008 at 13:47 UTC
    Do you understand what the letters OCR actually mean? I do not think that you do. When I don't know something, I research it. Various people have replied to you explaining essentially what OCR means.

    Obviously there is no 'general OCR solution' to do what you want to do, given that you can't OCR things that are not images. What you mean is that you want to strip the text from the attachments. You have been given links which discuss OCR, and Super Search will return plenty of resources on stripping text from various types of files.

    Martin
      Martin, I do understand what OCR means. I was using it to refer to pulling text out of non-text PDFs. I was not saying that I wanted to apply it to Word documents. However, I was looking for a general application that could OCR the PDFs, but also extract text from the Word documents via the Office libraries. -Ty
        That's not what you said the first time, nor the second.

        Clarification can be a "good thing" but Re^4: Win32 and OCR via OLE is not clarification; it's an attempt to disavow your previous two posts.

        Hence, --.

        Ok, once more with feeling.

        You said:

        'I was hoping for a general OCR solution that I could use to pull all the text from the documents regardless of whether they contained text, binary formats, or image text.'

        OCR has no meaning for non-image documents. I have previously told you that you should not be using this term in conjunction with non-image documents. It does not apply here.

        If you are looking to code a 'general application' to achieve this goal, then you have been given sufficient information to get you started. At least now you should know what you should be learning, where to look, and the pseudocode for your application. If you have any problems turning this pseudocode into functioning Perl, post the code and we will try to help.

        Martin
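
The split argued for in this subthread (text extraction for .doc and .xls, OCR only for image-only PDFs) can be sketched as a small dispatch step. This is a minimal illustration in Python rather than Perl. The names `pick_extraction_strategy` and `needs_ocr` and the strategy labels are hypothetical and do not refer to any real library. It also assumes the file extension can be trusted, which is often not true for email attachments.

```python
import os

# Hypothetical strategy labels; none of these names a real library or API.
STRATEGY_BY_EXTENSION = {
    ".doc": "office_text",        # pull text via the Office libraries
    ".xls": "office_text",
    ".pdf": "pdf_text_then_ocr",  # try the embedded text layer first
}


def pick_extraction_strategy(filename):
    """Return the strategy label for an attachment, or 'skip' if unrecognized."""
    _, ext = os.path.splitext(filename)
    return STRATEGY_BY_EXTENSION.get(ext.lower(), "skip")


def needs_ocr(strategy, extracted_text):
    """OCR applies only to PDFs whose text layer came back empty (image-only)."""
    return strategy == "pdf_text_then_ocr" and not extracted_text.strip()
```

Unrecognized attachments fall through to 'skip' instead of being parsed blindly, which is one cautious way to respect the malware warning earlier in the thread.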
Re^3: Win32 and OCR via OLE (MS Office Document Imaging)
by BrowserUk (Patriarch) on Apr 11, 2008 at 00:11 UTC