sureshrps has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I need to extract the text from a PDF, so I am using the CAM::PDF module and calling "$pdfinfo->getPageText(1)" to get the text. The extraction works, but I have one problem. I need to find and split out every word, so I read the text line by line, and in some lines all the words are merged together, for example: "lower-boundaryconditionstoprovokemodicationsoftheat". How can I get the text without merged words, or how can I split a merged line into separate words? Does anyone have an idea? Waiting for your valuable response. Thanks, Suresh

Replies are listed 'Best First'.
Re: words are merging while extracting the text from pdf
by Ratazong (Monsignor) on Jan 04, 2011 at 11:46 UTC

    You might want to read "Re: CAM::PDF did't extract all pdf's content" for some background on why it is so difficult to extract text from PDF files (in addition to the way of encoding that bart is assuming).

    In your case, I would suggest trying another program (e.g. pdf2txt, or some OCR software) in parallel and comparing the output. Where your program identifies mismatches, you could try plausibility checks and/or dictionary lookups ... depending on how much effort you want to spend.

    HTH, Rata
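The dictionary-lookup idea can be illustrated with a minimal sketch. It is not something the posters provided: the `segment` function and the tiny `%dict` word list are hypothetical, and a real run would need a full word list loaded from a file. It walks the merged string from left to right and remembers, for each position, one way to cover the text up to that point using only dictionary words.

```perl
use strict;
use warnings;

# Hypothetical toy word list (an assumption for this sketch); a real
# run would load a complete dictionary file instead.
my %dict = map { $_ => 1 }
    qw(lower boundary conditions to provoke modifications of the);

# Split a run of merged letters into dictionary words.
# Returns the list of words, or an empty list if no complete split exists.
sub segment {
    my ($text, $dict) = @_;
    my $n = length $text;

    # $best[$i] holds an arrayref of words covering the first $i characters,
    # or is undefined if no such cover has been found.
    my @best = ([]);

    for my $end (1 .. $n) {
        # Trying the earliest start first prefers the longest final word.
        for my $start (0 .. $end - 1) {
            next unless defined $best[$start];
            my $word = substr $text, $start, $end - $start;
            next unless $dict->{ lc $word };
            $best[$end] = [ @{ $best[$start] }, $word ];
            last;
        }
    }
    return defined $best[$n] ? @{ $best[$n] } : ();
}

print join(' ', segment('boundaryconditionstoprovoke', \%dict)), "\n";
# prints "boundary conditions to provoke"
```

A sketch like this only works on runs of plain letters, so a line such as the one in the question would first have to be split at the hyphen. It also returns nothing when a fragment is not in the word list, which is what happens with "modications" from the question (the "fi" appears to have been lost during extraction), so such lines would still need manual checking.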
      I assume you mean pdftotext, not pdf2txt, which appears to be commercial software. Nothing wrong with that, but pdftotext is free and open source and probably not any worse. pdftotext is part of XPDF.
      Thanks for your response. My actual task is to catch soft-hyphenated words in the PDF, which is why I tried CAM::PDF. Using the function I mentioned before, I get the text of every page, then read it line by line, split out every single word, and look for the hyphenated words. In some lines, though, the words are completely merged, and there I am unable to split out the individual words. Please advise. Regards, Suresh
Re: words are merging while extracting the text from pdf
by marto (Cardinal) on Jan 04, 2011 at 11:36 UTC

    I'm not exactly sure what you are doing. Rather than have us guess, could you please post exactly the code you are using? What version of PDF is the file in question? See the compatibility notes in the module documentation.

Re: words are merging while extracting the text from pdf
by bart (Canon) on Jan 04, 2011 at 11:39 UTC
    My guess is that the individual words are placed in the PDF file independently of one another, and that the spaces between them are not included.

    You could call it a bug... alright, it is. But I'm expecting the same problem to occur if you try to get the text out of the PDF file in another way, for example using copy/paste.