gnum has asked for the wisdom of the Perl Monks concerning the following question:

Hi, could someone please help me? I have a PDF file with some text in it. Is it possible to somehow search and replace a piece of text and regenerate the PDF? Or is it possible to convert the PDF document to some XML format and convert it back to PDF after replacing the text? Hope I'm not asking too much here. Thank you.

Replies are listed 'Best First'.
Re: Editing text in PDF file
by kvale (Monsignor) on Apr 14, 2004 at 03:09 UTC
    Check out PDF::Reuse. It can use existing PDFs as templates and alter them in various ways to produce new PDFs.

    -Mark

Re: Editing text in PDF file
by toma (Vicar) on Apr 14, 2004 at 03:07 UTC
    Since you need to do a round trip, it is probably easiest to use Ghostscript to convert the PDF to PostScript, then do the text filtering on the PostScript with Perl, and then convert the PostScript back into PDF with Ghostscript again.

    If you have ImageMagick set up properly, you can convert it with:

    convert myfile.pdf myfile.ps
    and back with:
    convert myfile.ps myfile.pdf
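
    The round trip described above could be sketched in Perl roughly as follows. ImageMagick usually rasterizes PDF pages when it converts them, which would leave no text to filter, so this sketch assumes Ghostscript's own pdf2ps and ps2pdf wrappers are on the PATH instead. The names replace_in_ps_strings and round_trip are made up for illustration. The substitution only touches PostScript string literals, and it will find nothing if the converter has split the text into single glyphs, re-encoded it, or compressed the page content.

```perl
use strict;
use warnings;

# Replace $old with $new, but only inside PostScript string literals "(...)".
# Nested unescaped parentheses inside a literal are not handled.
sub replace_in_ps_strings {
    my ($ps, $old, $new) = @_;
    $new =~ s/([\\()])/\\$1/g;    # escape \ ( ) for a PostScript literal
    $ps =~ s{\(((?:\\.|[^\\()])*)\)}{
        my $s = $1;
        $s =~ s/\Q$old\E/$new/g;
        "($s)";
    }ges;
    return $ps;
}

# PDF -> PostScript -> filtered PostScript -> PDF.
# Assumes the pdf2ps and ps2pdf commands shipped with Ghostscript.
sub round_trip {
    my ($in_pdf, $out_pdf, $old, $new) = @_;
    my $ps_file = "$in_pdf.ps";

    system('pdf2ps', $in_pdf, $ps_file) == 0
        or die "pdf2ps failed: $?";

    open my $in, '<:raw', $ps_file or die "Cannot read $ps_file: $!";
    my $ps = do { local $/; <$in> };
    close $in;

    open my $out, '>:raw', $ps_file or die "Cannot write $ps_file: $!";
    print {$out} replace_in_ps_strings($ps, $old, $new);
    close $out or die "Cannot close $ps_file: $!";

    system('ps2pdf', $ps_file, $out_pdf) == 0
        or die "ps2pdf failed: $?";
    return;
}
```

    A call would look like round_trip('myfile.pdf', 'myfile-new.pdf', 'old text', 'new text'). It is worth checking the intermediate .ps file by hand first to see whether the text you want is actually present as plain strings.
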
    You can convert the PDF to XML with pdftohtml using the -xml option, but I don't know how to turn the resulting XML back into PDF. Perhaps one of the Perl PDF modules would be able to do most of the work.

    You can also work with the pdf data directly. The format is nicely documented by Adobe. I have read and written pdf files directly with low-level perl code, but now there are modules that make this much easier.

    If you want to learn more about pdf I recommend pdfzone, which includes information about both commercial and open-source tools for working with pdf.

    It should work perfectly the first time! - toma
      Thanks a lot for all the info. toma, you said you have read and written PDF files directly with low-level Perl code. Would you like to share this in a bit more detail?
        It would be much better for you to learn the CPAN modules rather than use my old code. My code is from 1998, and is probably for perl5.005 and pdf for Acrobat 3. If you try the CPAN module and it is unsuitable, please let me know what the problem is, and if my old code handles it you will be welcome to have it!

        I started writing pdf files by using print statements to generate the examples in the pdf documentation from Adobe. Then I started substituting my own commands for the ones in the example. There were some checksums that I needed to compute for the PDF, if I recall correctly.

        It is straightforward to position text and draw vectors in PDF, although not quite as easy as it is in postscript. Text positioning is easiest to compute with a fixed-width font. Otherwise, you will need some other routine to determine how wide your text will come out.
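
        As a rough illustration of the print-statement approach, here is a minimal sketch (not toma's original code) that builds a one-page PDF with a single line of Courier text. The function names minimal_pdf and courier_width are made up for this example. The values that have to be computed by hand are most likely the byte offsets in the cross-reference table and the stream /Length, rather than checksums. The width calculation relies on every Courier glyph being 600/1000 of the font size wide, and the code assumes plain ASCII text.

```perl
use strict;
use warnings;

# Width in points of a string set in Courier: each glyph is 0.6 em wide.
sub courier_width {
    my ($text, $size) = @_;
    return length($text) * 0.6 * $size;
}

# Build a one-page PDF (US Letter) showing $text at ($x, $y) in Courier.
# Returns the whole file as a string; assumes ASCII input.
sub minimal_pdf {
    my ($text, $x, $y, $size) = @_;

    # Escape the characters that are special inside a PDF literal string.
    (my $escaped = $text) =~ s/([\\()])/\\$1/g;

    my $content = "BT /F1 $size Tf $x $y Td ($escaped) Tj ET\n";
    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
          . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($content) . " >>\nstream\n${content}endstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
    );

    my $pdf = "%PDF-1.4\n";
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length $pdf;    # byte offset where this object starts
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    my $xref_pos = length $pdf;
    my $count    = @objects + 1;
    $pdf .= "xref\n0 $count\n";
    $pdf .= "0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size $count /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_pos\n%%EOF\n";
    return $pdf;
}
```

        To write a file, open it in binary mode and print the result, for example: open my $fh, '>:raw', 'hello.pdf' or die $!; print {$fh} minimal_pdf('Hello', 72, 720, 12); close $fh. To right-align text against x = 540, start it at 540 - courier_width($text, $size).
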

        For reading PDF, the process is just reversed. Many documents only use a few PDF commands. The documents that I was parsing were typically designed to be compatible with Acrobat 2.0, so they tended to be simpler. As the new readers became more widely used, the documents tended to make more use of compression, which I don't believe I ever successfully decoded. The compression formats are not unusual or undocumented, but my work no longer required it. Instead, we switched to using a C-language API that we bought from Adobe, and we built our application on top of an example that came with this API. Adobe discontinued the API product a few months after we bought it, and would not answer questions about their example. Not good after you spend $50,000 on software and don't get the source code!
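
        For the simple, uncompressed documents described above, reading can be as little as picking out the strings passed to the Tj (show text) operator. The following is only a sketch under that assumption, and shown_strings is a made-up name. It ignores TJ arrays, hex strings, octal escapes, nested parentheses and any compressed streams.

```perl
use strict;
use warnings;

# Return the literal strings shown by Tj operators in an
# uncompressed PDF content stream, in order of appearance.
sub shown_strings {
    my ($stream) = @_;
    my @found;
    # A literal string is (...) where \x escapes any single character.
    while ($stream =~ /\(((?:\\.|[^\\()])*)\)\s*Tj/gs) {
        my $s = $1;
        $s =~ s/\\([\\()])/$1/g;    # undo \\ \( and \)
        push @found, $s;
    }
    return @found;
}
```

        Given the content stream "BT /F1 12 Tf 72 720 Td (Hello) Tj ET", this returns the one-element list ('Hello').
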

        Since then, the open-source solutions have become proficient enough at PDF that they handle all my needs.

        It should work perfectly the first time! - toma
Re: Editing text in PDF file
by saintmike (Vicar) on Apr 14, 2004 at 01:28 UTC
    Cameron Laird has published an article on this recently. Bottom line: there is some commercial miracle software that is able to extract text from PDFs, but it's quite obscure. Here are some more notes on this. Unrelated to Perl.
Re: Editing text in PDF file
by Popcorn Dave (Abbot) on Apr 14, 2004 at 16:25 UTC
    Adobe offers an online conversion tool that allows you to convert a web-based PDF document. I used this in conjunction with LWP::Simple to convert tax charts in an app that I wrote. You can find the web page here. The only drawback is that you're going to have to put your PDF up on a website, as the interface requires a URL to tell it where the file resides.

    It should then be a fairly easy exercise to make your conversions and then rewrite the file as a PDF using PDF::Create or some such PDF module.

    The one thing to be aware of, however, is that the Adobe utility is not the smartest when it comes to making a text file from a PDF. I ran into a situation where the town of San Luis Obispo came through as

    San Luis

    Obispo

    so you want to be aware of that. You may have to manually check your text file for that kind of thing.

    Hope that helps!

    There is no emoticon for what I'm feeling now.

Re: Editing text in PDF file
by chanio (Priest) on Apr 15, 2004 at 03:34 UTC
    I have found an article on PDFZone that explains the close relationship between LaTeX, TeX and PDF files (Using LaTeX to create PDF documents over the Web).

    I think that they are both from the XML family. You should also search for this sort of conversion.