Since you need to do a round trip,
it is probably easiest to use
Ghostscript to convert the PDF to PostScript,
then do the text filtering on the PostScript with Perl,
and then convert the PostScript back into PDF
with Ghostscript again.
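If you have ImageMagick set up properly (it calls Ghostscript
to read PDF and PostScript), you can
convert it with:
convert myfile.pdf myfile.ps
and back with:
convert myfile.ps myfile.pdf
Be aware that ImageMagick renders each page to a bitmap, so the
text does not survive as text. For text filtering, Ghostscript's
own pdf2ps and ps2pdf scripts are the better choice.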
You can convert the PDF to XML with
pdftohtml
using the -xml option, but I don't know how to turn
the resulting XML back into PDF. Perhaps one of the
Perl PDF modules would be able to do most of the work.
You can also work with the PDF data directly. The format
is nicely documented by Adobe. I have read and written
PDF files directly with low-level Perl code, but
now there are modules
that make this much easier.
If you want to learn more about PDF I recommend
pdfzone, which includes
information about both commercial and open-source
tools for working with PDF.
It should work perfectly the first time! - toma