Check out PDF::Reuse. It can use existing PDFs as templates and alter them in various ways to produce new PDFs.
| [reply] |
Since you need to do a round trip,
it is probably easiest to use
Ghostscript to convert the PDF to PostScript,
then do the text filtering on the PostScript with Perl,
and then convert the PostScript back into PDF
with Ghostscript again.
If you have ImageMagick set up properly, you can
convert it with:
convert myfile.pdf myfile.ps
and back with:
convert myfile.ps myfile.pdf
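The round trip described above could be scripted roughly like this. It is only a sketch under a few assumptions: Ghostscript's pdf2ps and ps2pdf wrapper scripts are on your PATH, filter_ps_text and round_trip are hypothetical helper names, and the text you want to change appears as plain literal strings in the PostScript. That last assumption often fails once fonts have been subsetted or re-encoded, so look at the .ps file before relying on it.

```perl
use strict;
use warnings;

# Replace $from with $to inside PostScript literal strings "(...)" only,
# leaving operators and names untouched. Simplification: strings that
# contain nested unescaped parentheses are skipped, and $from/$to are
# assumed to contain no parentheses or backslashes.
sub filter_ps_text {
    my ($ps, $from, $to) = @_;
    $ps =~ s{\(((?:\\.|[^\\()])*)\)}{
        my $s = $1;
        $s =~ s/\Q$from\E/$to/g;
        "($s)";
    }gse;
    return $ps;
}

# PDF -> PostScript -> filter -> PDF, using the Ghostscript wrappers.
sub round_trip {
    my ($in_pdf, $out_pdf, $from, $to) = @_;
    my ($ps_in, $ps_out) = ("$in_pdf.ps", "$in_pdf.filtered.ps");

    system('pdf2ps', $in_pdf, $ps_in) == 0 or die "pdf2ps failed: $?";

    open my $in, '<:raw', $ps_in or die "$ps_in: $!";
    my $ps = do { local $/; <$in> };
    close $in;

    open my $out, '>:raw', $ps_out or die "$ps_out: $!";
    print {$out} filter_ps_text($ps, $from, $to);
    close $out or die "$ps_out: $!";

    system('ps2pdf', $ps_out, $out_pdf) == 0 or die "ps2pdf failed: $?";
    return $out_pdf;
}
```

A call such as round_trip('myfile.pdf', 'myfile-new.pdf', 'old text', 'new text') would then do the whole conversion in one go.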
You can convert the PDF to XML with
pdftohtml
using the -xml option, but I don't know how to turn
the resulting XML back into PDF. Perhaps one of the
Perl PDF modules would be able to do most of the work.
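If you go the XML route, reading the output back into Perl might look something like the sketch below. It assumes the XML contains elements of the form <text top="..." left="...">content</text>, which is what pdftohtml -xml typically produces but should be checked against your own output. parse_pdf2xml_text is a hypothetical helper name, and a real XML parser from CPAN would be more robust than these regular expressions.

```perl
use strict;
use warnings;

# Pull the positioned text runs out of pdftohtml -xml output.
# Assumption: runs look like <text top="97" left="72" ...>content</text>.
sub parse_pdf2xml_text {
    my ($xml) = @_;
    my @runs;
    while ($xml =~ m{<text\b([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;

        $body =~ s/<[^>]+>//g;     # drop inline tags such as <b> or <i>
        $body =~ s/&lt;/</g;
        $body =~ s/&gt;/>/g;
        $body =~ s/&quot;/"/g;
        $body =~ s/&amp;/&/g;      # last, so "&amp;lt;" decodes to "&lt;"

        push @runs, { top => $attr{top}, left => $attr{left}, text => $body };
    }
    return @runs;
}
```

Each run keeps its page position, so after filtering the text you could hand the runs to one of the PDF-writing modules to place them again.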
You can also work with the PDF data directly. The format
is nicely documented by Adobe. I have read and written
PDF files directly with low-level Perl code, but
now there are modules
that make this much easier.
If you want to learn more about PDF I recommend
PDFZone, which includes
information about both commercial and open-source
tools for working with PDF.
It should work perfectly the first time! - toma
| [reply] [d/l] [select] |
Thanks a lot for all the info.
toma, you said you have read and written PDF files directly with low-level Perl code. Would you like to share this in a bit more detail?
| [reply] |
It would be much better for you to learn the CPAN
modules rather than use my old code. My
code is from 1998, and is probably for Perl 5.005
and PDF for Acrobat 3. If you try the CPAN module
and it is unsuitable, please let me know what the
problem is, and if my old code handles it you
will be welcome to have it!
I started writing PDF files by using print statements
to generate the examples in the PDF documentation
from Adobe. Then I started substituting my own
commands for the ones in the example.
There were some checksums that I needed
to compute for the PDF, if I recall correctly.
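To give an idea of what the print-statement approach involves, here is a minimal sketch that builds a one-page PDF as a string. The bookkeeping a hand-written PDF needs is the byte offset of every object, recorded in the cross-reference table at the end of the file, which may be what is remembered above as checksums. build_minimal_pdf is a hypothetical name, the page size and font are arbitrary choices, and the text is assumed to be plain ASCII so that character counts equal byte counts.

```perl
use strict;
use warnings;

# Build a one-page PDF showing a single line of text; returns the file
# contents as a string. Assumes ASCII text (characters == bytes).
sub build_minimal_pdf {
    my ($text) = @_;

    # Backslash and parentheses are special inside a PDF literal string.
    (my $escaped = $text) =~ s/([\\()])/\\$1/g;

    my $content = "BT /F1 24 Tf 72 720 Td ($escaped) Tj ET";
    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
          . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($content) . " >>\nstream\n$content\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
    );

    my $pdf = "%PDF-1.4\n";
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length($pdf);    # byte offset where this object starts
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    # Cross-reference table: one 20-byte entry per object, plus entry 0.
    my $xref_pos = length($pdf);
    $pdf .= "xref\n0 " . (@objects + 1) . "\n";
    $pdf .= "0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size " . (@objects + 1) . " /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_pos\n%%EOF\n";
    return $pdf;
}
```

To try it, open a file handle in raw mode (open my $fh, '>:raw', 'hello.pdf') and print the returned string to it, so that no newline translation shifts the offsets.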
It is straightforward to position text and draw vectors
in PDF, although not quite as easy as it is in
PostScript. Text positioning is easiest to compute
with a fixed-width font. Otherwise, you will need
some other routine to determine how wide your text
will come out.
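With a fixed-width font the arithmetic is short enough to show. The sketch below assumes Courier, where every glyph is 600 units wide in a 1000-unit em, so the width of a string is just its length times 0.6 times the font size. The function names are hypothetical, and the page width is whatever your MediaBox says (612 points for US Letter).

```perl
use strict;
use warnings;

# Every Courier glyph is 600 units wide in a 1000-unit em.
use constant COURIER_GLYPH_WIDTH => 600;

# Width of $text in points when set in Courier at $font_size points.
sub courier_text_width {
    my ($text, $font_size) = @_;
    return COURIER_GLYPH_WIDTH * length($text) * $font_size / 1000;
}

# X coordinate that centers $text on a page $page_width points wide.
sub centered_x {
    my ($text, $font_size, $page_width) = @_;
    return ($page_width - courier_text_width($text, $font_size)) / 2;
}
```

For a proportional font you would replace the constant with a per-character lookup in the font's width table, which is the "other routine" mentioned above.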
For reading PDF, the process is just reversed. Many
documents only use a few PDF commands. The documents
that I was parsing were typically designed
to be compatible with Acrobat 2.0, so they tended to be
simpler. As the new readers became more widely used the
documents tended to make more use of compression,
which I don't believe I ever successfully decoded.
The compression formats are not unusual or undocumented,
but my work no longer required it. Instead, we
switched to using a C-language API that we bought
from Adobe, and we built our application on top of an
example that came with this API. Adobe discontinued
the API product a few months after we bought it,
and would not answer questions about their example.
Not good after you spend $50,000 on software,
and don't get the source code!
Since then, the open-source solutions have become
proficient enough at PDF that they handle all my
needs.
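For the simple, uncompressed documents described above, reading text back out can be sketched as a scan for the Tj (show text) operator in a page's content stream. This is an illustration only: extract_shown_text is a hypothetical name, it handles only literal strings followed by Tj, and it ignores TJ arrays, hex strings, font encodings, and compressed streams (for example those marked /Filter /FlateDecode), which is where the CPAN modules earn their keep.

```perl
use strict;
use warnings;

# Return the literal strings shown with the Tj operator in an
# uncompressed PDF content stream, in the order they appear.
sub extract_shown_text {
    my ($content) = @_;
    my @strings;
    while ($content =~ /\(((?:\\.|[^\\()])*)\)\s*Tj/gs) {
        my $s = $1;
        $s =~ s/\\([\\()])/$1/g;    # undo \( \) and \\ only; other escapes are left as-is
        push @strings, $s;
    }
    return @strings;
}
```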
It should work perfectly the first time! - toma
| [reply] |
Cameron Laird has published an article on this recently. Bottom line: There's some commercial miracle software that is able to extract text from PDFs, but it's quite obscure.
Here are some more notes on this.
Unrelated to Perl. | [reply] |
Adobe offers an online conversion tool that allows you to convert a web-based PDF document. I used this in conjunction with LWP::Simple to convert tax charts in an app that I wrote. You can find the web page here. The only drawback is that you're going to have to put your PDF up on a website, as the interface requires a URL to tell it where the file resides.
It should then be a fairly easy exercise to make your conversions and then rewrite the file as a PDF using PDF::Create or some such PDF module.
The one thing to be aware of, however, is that the Adobe utility is not the smartest when it comes to making a text file from a PDF. I ran into a situation where the town of San Luis Obispo came through as
San Luis
Obispo
so you want to be aware of that. You may have to manually check your text file for that kind of thing.
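If you know in advance which multi-word names to expect, part of that manual check could be automated. The sketch below is one possible approach, not something the Adobe tool provides: rejoin_phrases is a hypothetical helper that takes the converted text and a list of expected phrases, and puts each phrase back on one line wherever its words were separated by any run of whitespace, including line breaks.

```perl
use strict;
use warnings;

# Rejoin known phrases whose words were split across lines.
sub rejoin_phrases {
    my ($text, @phrases) = @_;
    for my $phrase (@phrases) {
        # Match the phrase's words separated by any whitespace.
        my $pattern = join '\s+', map { quotemeta($_) } split ' ', $phrase;
        $text =~ s/$pattern/$phrase/g;
    }
    return $text;
}
```

Note that this removes the line break at that point, so it is only suitable when the text is not laid out in fixed columns.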
Hope that helps!
There is no emoticon for what I'm feeling now.
| [reply] |
I have found an article on PDFZone that explains the close relationship between LaTeX, TeX, and PDF files (Using LaTeX to create PDF documents over the Web).
I think that they are both from the XML family. You should also search for this sort of conversion. | [reply] |