in reply to Editing text in PDF file

Since you need to do a round trip, it is probably easiest to use ghostscript to convert the pdf to postscript, then do the text filtering on the postscript with perl, and then convert the postscript back into pdf with ghostscript again.
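
For the ghostscript route, here is a minimal sketch of the round trip in perl. It assumes the pdf2ps and ps2pdf wrapper scripts that ship with ghostscript are on your PATH. The sub names filter_ps and round_trip, the script name roundtrip.pl, and the temporary .ps file names are made up for illustration. A plain substitution only finds text that appears literally in the postscript, which is often not the case when fonts have been re-encoded or a line is split across several show operators, so treat this as a starting point rather than a finished tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every literal occurrence of $from with $to in PostScript source.
# $to is inserted as-is, so escape any parentheses or backslashes yourself.
sub filter_ps {
    my ($ps, $from, $to) = @_;
    $ps =~ s/\Q$from\E/$to/g;
    return $ps;
}

# PDF -> PostScript -> filter -> PDF, using Ghostscript's wrapper scripts.
sub round_trip {
    my ($in_pdf, $out_pdf, $from, $to) = @_;
    my ($ps_in, $ps_out) = ("$out_pdf.in.ps", "$out_pdf.out.ps");

    system('pdf2ps', $in_pdf, $ps_in) == 0
        or die "pdf2ps failed (exit status $?)\n";

    # Slurp the whole PostScript file as raw bytes.
    open my $in, '<:raw', $ps_in or die "Cannot read $ps_in: $!\n";
    my $ps = do { local $/; <$in> };
    close $in;

    open my $out, '>:raw', $ps_out or die "Cannot write $ps_out: $!\n";
    print {$out} filter_ps($ps, $from, $to);
    close $out or die "Cannot close $ps_out: $!\n";

    system('ps2pdf', $ps_out, $out_pdf) == 0
        or die "ps2pdf failed (exit status $?)\n";
    unlink $ps_in, $ps_out;
}

# Usage: perl roundtrip.pl in.pdf out.pdf 'old text' 'new text'
round_trip(@ARGV) if @ARGV == 4;
```

Look at the intermediate postscript in an editor first to see whether the text you want to change is actually there in readable form.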

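If you have ImageMagick set up properly, you can convert it with:

    convert myfile.pdf myfile.ps

and back with:

    convert myfile.ps myfile.pdf

Be aware that ImageMagick normally rasterizes the pages on the way through, so this is only useful if you do not need the text to stay editable.
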
You can convert the pdf to xml with pdftohtml using the -xml option, but I don't know how to turn the resulting xml back into pdf. Perhaps one of the perl pdf modules would be able to do most of the work.
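
If you go the xml route, getting the text and its position out again is not hard. The sketch below assumes each run of text is a <text> element carrying top and left attributes; check your own output, since the layout may differ between pdftohtml versions. The sub name text_items is made up for illustration, and a regex is used only to keep the example free of non-core modules. Entities such as &amp;amp; are left undecoded.

```perl
use strict;
use warnings;

# Return one hash per <text> element: its position and its plain content.
sub text_items {
    my ($xml) = @_;
    my @items;
    while ($xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;   # attribute name/value pairs
        $body =~ s/<[^>]+>//g;                     # drop inline markup like <b>
        push @items, { top => $attr{top}, left => $attr{left}, text => $body };
    }
    return @items;
}

# Usage: pdftohtml -xml myfile.pdf myfile, then: perl items.pl myfile.xml
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot read $ARGV[0]: $!\n";
    my $xml = do { local $/; <$in> };
    printf "%s,%s: %s\n", @{$_}{qw(left top text)} for text_items($xml);
}
```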

You can also work with the pdf data directly. The format is nicely documented by Adobe. I have read and written pdf files directly with low-level perl code, but now there are modules that make this much easier.

If you want to learn more about pdf I recommend pdfzone, which includes information about both commercial and open-source tools for working with pdf.

It should work perfectly the first time! - toma

Re: Re: Editing text in PDF file
by gnum (Novice) on Apr 15, 2004 at 00:52 UTC
    Thanks a lot for all the info. toma, you said you have read and written pdf files directly with low-level perl code. Would you like to share this in a bit more detail?
      It would be much better for you to learn the CPAN modules rather than use my old code. My code is from 1998, and is probably for perl5.005 and pdf for Acrobat 3. If you try the CPAN module and it is unsuitable, please let me know what the problem is, and if my old code handles it you will be welcome to have it!

      I started writing pdf files by using print statements to generate the examples in the pdf documentation from Adobe. Then I started substituting my own commands for the ones in the example. There were some checksums that I needed to compute for the PDF, if I recall correctly.
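
      A minimal sketch of that print-statement approach, written fresh for illustration rather than taken from the old code. The bookkeeping it shows is the byte offset of every object, which goes into the cross-reference table, plus the /Length of the content stream; this is probably what the "checksums" were. The sub name build_pdf, the page size, and the text position are arbitrary choices, and the text is assumed to be plain ASCII so that length() counts bytes.

```perl
use strict;
use warnings;

# Build a one-page PDF that shows $text in 24 point Courier.
sub build_pdf {
    my ($text) = @_;
    (my $escaped = $text) =~ s/([\\()])/\\$1/g;   # escape \ ( ) in the string
    my $content = "BT /F1 24 Tf 72 720 Td ($escaped) Tj ET";

    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
            . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($content) . " >>\nstream\n$content\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
    );

    my $pdf = "%PDF-1.4\n";
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length $pdf;               # byte offset of this object
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    # Cross-reference table: every entry must be exactly 20 bytes long.
    my $xref_pos = length $pdf;
    my $count    = @objects + 1;
    $pdf .= "xref\n0 $count\n0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size $count /Root 1 0 R >>\n"
          . "startxref\n$xref_pos\n%%EOF\n";
    return $pdf;
}

# Usage: perl hello.pl hello.pdf
if (@ARGV) {
    open my $out, '>:raw', $ARGV[0] or die "Cannot write $ARGV[0]: $!\n";
    print {$out} build_pdf('Hello from perl');
    close $out or die "Cannot close $ARGV[0]: $!\n";
}
```

      If a reader complains that the file is damaged, a wrong offset or a wrong /Length is the first thing to check.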

      It is straightforward to position text and draw vectors in PDF, although not quite as easy as it is in postscript. Text positioning is easiest to compute with a fixed-width font. Otherwise, you will need some other routine to determine how wide your text will come out.
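
      To show why a fixed-width font makes this easy: every Courier glyph is 600 units wide on a 1000-unit em square, so the width of a string is just its length times 0.6 times the font size. The sketch below uses that to center a line on a 612 point wide page. The sub names courier_width and centered_x are made up for illustration.

```perl
use strict;
use warnings;

# Every Courier glyph is 600 units wide on a 1000-unit em square.
use constant COURIER_UNITS => 600;

# Width of $text in points when set in Courier at $size points.
sub courier_width {
    my ($text, $size) = @_;
    return length($text) * COURIER_UNITS * $size / 1000;
}

# Left edge that centers $text on a page $page_width points wide.
sub centered_x {
    my ($text, $size, $page_width) = @_;
    return ($page_width - courier_width($text, $size)) / 2;
}

printf "%.1f\n", centered_x('Hello', 10, 612);   # prints 291.0
```

      For a proportional font you would have to add up the width of each character from the font's metrics instead.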

      For reading PDF, the process is just reversed. Many documents only use a few PDF commands. The documents that I was parsing were typically designed to be compatible with Acrobat 2.0, so they tended to be simpler. As the new readers became more widely used, the documents tended to make more use of compression, which I don't believe I ever successfully decoded. The compression formats are not unusual or undocumented, but my work no longer required it. Instead, we switched to using a C-language API that we bought from Adobe, and we built our application on top of an example that came with this API. Adobe discontinued the API product a few months after we bought it, and would not answer questions about their example. Not good after you spend $50,000 on software, and don't get the source code!
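
      On the compression point: the most common filter in newer files, FlateDecode, is ordinary zlib data, so Compress::Zlib can unpack it. The following is only a sketch under several assumptions: the /Length is written directly in the stream dictionary rather than as an indirect reference, no predictor is set in /DecodeParms, and the file is not encrypted. The sub name inflate_streams is made up for illustration.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Return the uncompressed contents of every FlateDecode stream in $pdf.
sub inflate_streams {
    my ($pdf) = @_;
    my @plain;
    # Match from "obj" to the "stream" keyword without crossing an "endobj".
    while ($pdf =~ m{\bobj\b((?:(?!\bendobj\b).)*?)\bstream\r?\n}gs) {
        my $dict  = $1;
        my $start = pos $pdf;                      # first byte of stream data
        next unless $dict =~ m{/Filter\s*/FlateDecode\b};
        # Skip streams whose length is an indirect reference like "12 0 R".
        my ($len) = $dict =~ m{/Length\s+(\d+)\b(?!\s+\d+\s+R)};
        next unless defined $len;
        my $text = uncompress(substr $pdf, $start, $len);
        push @plain, $text if defined $text;
        pos($pdf) = $start + $len;                 # do not scan binary data
    }
    return @plain;
}

# Usage: perl streams.pl myfile.pdf
if (@ARGV) {
    open my $in, '<:raw', $ARGV[0] or die "Cannot read $ARGV[0]: $!\n";
    my $pdf = do { local $/; <$in> };
    print "$_\n----\n" for inflate_streams($pdf);
}
```

      Page content streams unpacked this way contain the same text and drawing operators you would otherwise see uncompressed in an older file.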

      Since then, the open-source solutions have become proficient enough at PDF that they handle all my needs.

      It should work perfectly the first time! - toma