You might want to read Re: CAM::PDF did't extract all pdf's content for some background on why it is so difficult to extract text from PDF files (in addition to the coding approach bart is assuming).
In your case, I would suggest trying another program (e.g. pdf2txt, or some OCR software) in parallel and comparing the output. If your program finds mismatches, you could add plausibility checks and/or dictionary lookups ... depending on how much effort you want to spend.
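Roughly, the comparison could look like this (only a sketch; the file names are placeholders and the "mismatch" test is just a word-level diff of the two outputs):

    use strict;
    use warnings;
    use CAM::PDF;

    # Text as CAM::PDF sees it, all pages concatenated.
    my $pdf = CAM::PDF->new('input.pdf') or die $CAM::PDF::errstr;
    my $cam_text = join '', map { $pdf->getPageText($_) } 1 .. $pdf->numPages();

    # Text produced by the second program, whichever one you choose.
    open my $fh, '<', 'input.other.txt' or die $!;
    my $other_text = do { local $/; <$fh> };
    close $fh;

    # Plausibility check: which words does only one of the two tools see?
    my %cam;
    $cam{ lc $_ }++ for $cam_text =~ /(\w+)/g;
    my %other;
    $other{ lc $_ }++ for $other_text =~ /(\w+)/g;

    print "only CAM::PDF:       ", join(' ', grep { !$other{$_} } sort keys %cam),   "\n";
    print "only the other tool: ", join(' ', grep { !$cam{$_} }   sort keys %other), "\n";

Anything that turns up on only one side is a candidate for a dictionary lookup or a manual check.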
HTH, Rata
I assume you mean pdftotext, not pdf2txt, which appears to be commercial software.
Nothing wrong with that, but pdftotext is free & open source and probably not any worse.
Pdftotext is part of XPDF.
Thanks for your response.
Actually, what I am trying to do is catch soft-hyphenated words in the PDF; that is why I tried CAM::PDF. Using the function I mentioned earlier, I get the text of each page, then read it line by line, split each line into single words, and look for the hyphenated ones. The problem is that in some lines the words come out merged together, so in those places I cannot split them into single words.
Please share your thoughts on this.
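Roughly, the idea is like this (only a simplified sketch, not my actual script; the file name is just an example):

    use strict;
    use warnings;
    use CAM::PDF;

    my $pdf = CAM::PDF->new('input.pdf') or die $CAM::PDF::errstr;

    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        for my $line (split /\n/, $text) {
            # split on whitespace -- this is the step that fails
            # when several words come out merged into one token
            for my $word (split /\s+/, $line) {
                # flag words that contain a hyphen or a soft hyphen (U+00AD)
                print "hyphenated: $word (page $page)\n"
                    if $word =~ /\w[-\x{AD}]\w/;
            }
        }
    }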
Regards
Suresh
I'm not exactly sure what you are doing; rather than have us guess, could you please post exactly the code you are using? What version of PDF is the file in question? See the compatibility notes in the module documentation.
My guess is that the individual words are placed in the PDF file independently, without space characters between them.
You could call it a bug... alright, it is. But I expect the same problem to occur if you try to get the text out of the PDF file in another way, for example using copy/paste.
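If you want to check that guess, CAM::PDF can show you the raw page description; the text appears as (...) or <...> strings passed to the Tj/TJ operators, often one fragment per word with no space characters in between. A quick sketch (the file name is a placeholder):

    use strict;
    use warnings;
    use CAM::PDF;

    my $pdf = CAM::PDF->new('input.pdf') or die $CAM::PDF::errstr;
    # Dump the first page's content stream and look at how the
    # strings fed to Tj/TJ are split up.
    print $pdf->getPageContent(1);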