in reply to extract text from pdf

Have you tried File::Extract::PDF?
It uses CAM::PDF internally, but maybe you'll have better luck with it.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*women.pm

Re^2: extract text from pdf
by jeteve (Pilgrim) on Nov 08, 2006 at 13:32 UTC
    I did try both of those, without success.

    I have a PDF that I created with OpenOffice. pdftotext is able to extract the text from it, whereas CAM::PDF (or File::Extract::PDF) gives me garbage characters:

    [jerome@saab pdf]$ getpdftext.pl -v ~/faxTaxHabitation2005.pdf
    ! " #  $  % # & ' ( "  ) * + + + ...

    And pdftotext:

    [jerome@saab pdf]$ pdftotext ~/faxTaxHabitation2005.pdf txt
    [jerome@saab pdf]$ tail txt
    Merci de bien vouloir me confirmer ces informations par retour de fax afin que je puisse proceder au paiment le plus rapidement possible au numero suivant : *************
    Cordiales salutations.
    ...

    The ideal would be a Perl module linked against the xpdf C code. :)
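
    Until such a module exists, one workaround is to shell out to the pdftotext binary from Perl. The following is only a rough sketch: the sub names pdftotext_command and pdf_to_text are made up for illustration, and it assumes pdftotext is on the PATH and accepts "-" as the output file to write to stdout.

```perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for pdftotext.
# Passing "-" as the output file asks pdftotext to write to stdout.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', $pdf_path, '-');
}

# Hypothetical helper: run pdftotext and return the extracted text.
# Assumes the pdftotext binary is installed and on the PATH.
sub pdf_to_text {
    my ($pdf_path) = @_;

    # List-form pipe open bypasses the shell, so file names with
    # spaces or metacharacters need no quoting.
    open(my $fh, '-|', pdftotext_command($pdf_path))
        or die "Cannot run pdftotext: $!";

    local $/;                  # slurp the whole output at once
    my $text = <$fh>;
    $text = '' unless defined $text;

    close($fh)
        or die "pdftotext failed on $pdf_path (exit status $?)";
    return $text;
}

# Usage (not run here):
#   print pdf_to_text("$ENV{HOME}/faxTaxHabitation2005.pdf");
```

    This is not the XS binding wished for above, just a thin wrapper around the command-line tool, so it pays the cost of a process per file.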

    -- Nice photos of naked perl sources here !