in reply to Win32 and OCR via OLE

OCR means Optical Character Recognition -- a method whereby a specific character is inferred from its graphical representation. If you start with a document in MS Office, then OCR isn't necessary -- the document already stores its content as text, not as a graphical representation that needs conversion.

If, instead, you are talking about discovering the document structure, I know that my former employer is in that business: taking PDFs, analyzing them and generating XML. They had a workflow that took DOC files, created PDFs from them and then used those PDFs to generate XML. This XML contains all of the textual content of the document, wrapped in tags that describe the document structure.
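
To give a rough idea of what "text wrapped in structural tags" could look like, here is a minimal Python sketch. The tag names (document, title, heading, para) and the to_structured_xml helper are my own inventions for illustration; I don't know the actual schema or tooling my former employer used, and this skips the hard part (working out the structure from the PDF in the first place).

```python
import xml.etree.ElementTree as ET

def to_structured_xml(blocks):
    """Wrap (kind, text) pairs in tags named after their structural role.

    'blocks' is a hypothetical intermediate form: the output of whatever
    analysis step decided which piece of text is a title, heading, etc.
    """
    root = ET.Element("document")
    for kind, text in blocks:
        child = ET.SubElement(root, kind)  # tag name records the structure
        child.text = text                  # element body holds the text
    return ET.tostring(root, encoding="unicode")

# Made-up sample content, just to show the shape of the result.
blocks = [
    ("title", "Quarterly Report"),
    ("heading", "Summary"),
    ("para", "Sales rose in the third quarter."),
]
xml_text = to_structured_xml(blocks)
print(xml_text)
```

The point is only that the words themselves come through unchanged; what the XML adds is the labelling of each piece.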

Let me know if you need more information.

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds