Re: Win32 and OCR via OLE

Have used OmniPage through the next-to-latest version.

Wish I hadn't.

Getting decent results is generally about as much work as re-entering the data would have been, and no less error-prone.

And why would you want to OCR "MS Office formats" anyway? They're already character data. Worst case, for Word & Excel, would be to export to text (and unless you catch OmniPage on sale, there's a subset of MSO available for about the same price).

To extract from the .pdf's, see the various pdf modules.

Re^2: Win32 and OCR via OLE by tmaly (Monk) on Apr 10, 2008 at 13:32 UTC
I was looking to parse MS Outlook attachments that could be .doc, .xls, or .pdf, and to determine via a Bayesian classifier what to do with the documents. Since I do not know which of the formats will be used in the attachments, I was hoping for a general OCR solution that I could use to pull all the text from the documents, regardless of whether they contained text, binary formats, or image text.
Re^3: Win32 and OCR via OLE by ww (Archbishop) on Apr 10, 2008 at 13:41 UTC
OCR will NOT help with .doc or .xls. Neither is an image for input to Optical Character Recognition. Please read the replies from Corion and Marto as already posted... and see Marto's for .pdf. And if you're thinking about "pull(ing)...from (unknown) binary formats or image text", you'd better start thinking about how to deal with malware.
Re^3: Win32 and OCR via OLE by marto (Cardinal) on Apr 10, 2008 at 13:47 UTC
Do you understand what the letters OCR actually mean? I do not think that you do. When I don't know something, I research it. Various people have replied to you explaining essentially what OCR means. Obviously there is no 'general OCR solution' to do what you want to do, given that you can't OCR things that are not images. You mean you want to text strip the attachments. You have been given links which discuss the topic of OCR; using Super Search will return plenty of resources regarding text stripping various types of files. Martin
Re^4: Win32 and OCR via OLE by tmaly (Monk) on Apr 10, 2008 at 14:04 UTC
Martin, I do understand what OCR means. I was using it to refer to pulling text out of non-text PDFs. I was not saying that I wanted to apply it to Word documents. However, I was looking for a general application that could OCR the PDFs, but also extract text from the Word documents by using the Office libraries. -Ty
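A minimal sketch of the routing this sub-thread describes: send legacy Office files to a text stripper, send PDFs to text extraction with OCR as the fallback for image-only pages, and refuse anything unrecognized. Python is used only for illustration (the thread itself is about Perl and Win32::OLE). The function names and strategy labels are hypothetical. Only the two file signatures are real: "%PDF-" for PDF, and D0 CF 11 E0 A1 B1 1A E1 for the OLE2 compound files used by .doc and .xls.

```python
# Sketch only: sniff_format, choose_strategy and the strategy labels are
# hypothetical names made up for this example.

PDF_MAGIC = b"%PDF-"
# OLE2 compound file signature, shared by legacy .doc and .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(data: bytes) -> str:
    """Classify an attachment by its leading bytes, not its file name."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    return "unknown"


def choose_strategy(data: bytes) -> str:
    """Pick a handling strategy for one attachment."""
    kind = sniff_format(data)
    if kind == "ole2":
        # Already character data: strip the text, never OCR.
        return "text-strip"
    if kind == "pdf":
        # Try the PDF text layer first; OCR only pages that are images.
        return "pdf-text-then-ocr"
    # Unknown binary: do not feed it to any parser.
    return "reject"
```

Checking the leading bytes rather than the attachment's extension means a renamed executable is rejected instead of being handed to a parser, which is the malware concern raised earlier in the thread.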
Re^5: Win32 and OCR via OLE by ww (Archbishop) on Apr 10, 2008 at 14:39 UTC
Re^5: Win32 and OCR via OLE by marto (Cardinal) on Apr 10, 2008 at 14:21 UTC
Re^3: Win32 and OCR via OLE (MS Office Document Imaging) by BrowserUk (Patriarch) on Apr 11, 2008 at 00:11 UTC
This is possible if you are an MS Office user, and then you will already have all the tools you need. See this blog page for how to do it manually. But you're on your own working out how to drive those 5 steps via OLE :) Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."