in reply to words are merging while extracting the text from pdf

You might want to read Re: CAM::PDF did't extract all pdf's content for some background on why it is so difficult to extract text from PDF files (in addition to the encoding issue bart is assuming).

In your case, I would suggest running another program (e.g. pdf2txt, or some OCR software) in parallel and comparing the output. Where your program finds mismatches, you could apply plausibility checks and/or dictionary lookups ... depending on how much effort you want to spend.
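To make that a bit more concrete, here is a rough sketch of both ideas: diffing the word lists from two extractors, and using a dictionary lookup to split a run-together token. It is written in Python only to keep it short; the logic ports directly to Perl. The function names find_mismatches and split_merged and the tiny word set are made up for illustration, and a real run would need a proper word list for the document's language.

```python
import difflib


def find_mismatches(tokens_a, tokens_b):
    """Return (a_slice, b_slice) pairs where two extractors disagree."""
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def split_merged(token, dictionary):
    """Split a run-together token into dictionary words.

    Returns the split with the fewest words, or None if no split exists.
    """
    n = len(token)
    # best[i] is a list of words covering token[:i], or None if impossible
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = token[j:i]
            if best[j] is not None and piece.lower() in dictionary:
                candidate = best[j] + [piece]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


# Hypothetical sample data: output of two different extractors
words = {"soft", "hyphen", "words"}
first = ["softhyphen", "words"]
second = ["soft", "hyphen", "words"]

for ours, theirs in find_mismatches(first, second):
    for token in ours:
        print(token, "->", split_merged(token, words))
```

Dictionary splitting can be ambiguous (a merged token may split in more than one valid way), so treat the result as a candidate to check against the second extractor's output rather than as the final answer.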

HTH, Rata

Re^2: words are merging while extracting the text from pdf
by elef (Friar) on Jan 04, 2011 at 11:56 UTC
    I assume you mean pdftotext, not pdf2txt, which appears to be commercial software. Nothing wrong with that, but pdftotext is free and open source, and probably no worse. pdftotext is part of Xpdf.
Re^2: words are merging while extracting the text from pdf
by sureshrps (Novice) on Jan 04, 2011 at 14:41 UTC
    Thanks for your response. My goal is to find soft-hyphenated words in the PDF, which is why I tried CAM::PDF. Using the function I mentioned before, I get the text of every page, then read it line by line, split each line into words, and look for the hyphenated ones. The problem is that some words come out fully merged within a line, so I cannot split them into single words. Please share your thoughts on this. Regards, Suresh