Basilides has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Is there a quick, simple way of doing a global substitution on a few words in a PDF file? I'm having a look at Text::PDF, but it's really complicated & I'm having trouble getting my head round it. My task is really simple--don't need to change formatting or anything, just basically s/this/that/g;, so I was wondering if there's an easier way?

Cheers
Dennis

Replies are listed 'Best First'.
Re: Replacing text in a PDF file
by nite_man (Deacon) on Jun 18, 2003 at 12:18 UTC

    Take a look at the PDF::Reuse module, a Perl interface for manipulating the elements of PDF files.

    Also, look at PDF::Reuse::Tutorial, which contains examples of using this module.

          
    --------------------------------
    SV* sv_bless(SV* sv, HV* stash);
    
Re: Replacing text in a PDF file
by Reverend Phil (Pilgrim) on Jun 18, 2003 at 14:09 UTC
    I've been using PDF::API2 for some time, but for PDF creation, not modification. The author is constantly working on it, and there is an active message board where some quite knowledgeable people are eager to help out. I found a couple of messages like this one that talk about modifying the streams directly, some of them working with Text::PDF instead of PDF::API2.

    I don't know much about PDF streams myself, and haven't had to modify existing PDFs, but this module and the people working on/with it can probably give you some useful guidance.

    Good luck,
    -=rev=-
Re: Replacing text in a PDF file
by gellyfish (Monsignor) on Jun 18, 2003 at 10:01 UTC

    I doubt you will find an easier way to do this: PDF is a page description language, so you will not generally find the plain text of your document in the file to do a substitution on.

    /J\
      In that case, could anyone give me a code example of how to search for a string in a PDF file? Sorry for being so lame: I'm trying to RTFM but it's too FC.
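      A minimal sketch of such a search, assuming the text sits in literal "(...) Tj" strings inside streams that are either uncompressed or Flate-compressed, and that each stream dictionary carries a direct /Length. It uses only the core Compress::Zlib module. The helper find_in_pdf() and the two-object sample fragment are made up for illustration; a real file would be read in binary mode instead. It will miss text that is hex-encoded, split across TJ array elements, or drawn in a font with a custom encoding.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical helper: return the decoded content of every stream in $pdf
# that matches $re. Handles only plain and /FlateDecode streams.
sub find_in_pdf {
    my ($pdf, $re) = @_;
    my @hits;
    # A stream is a dictionary followed by the keyword "stream" and a newline.
    while ($pdf =~ /<<(.*?)>>\s*stream\r?\n/sg) {
        my $dict = $1;
        # Assumption: /Length is a direct number, not an indirect reference.
        my ($len) = $dict =~ m{/Length\s+(\d+)} or next;
        my $data = substr($pdf, pos($pdf), $len);
        if ($dict =~ m{/FlateDecode}) {
            $data = uncompress($data);      # zlib inflate
            next unless defined $data;      # skip streams that fail to inflate
        }
        push @hits, $data if $data =~ $re;
    }
    return @hits;
}

# Build a tiny sample: the same content once uncompressed, once compressed.
my $content = "BT /F1 12 Tf 72 720 Td (Hello this world) Tj ET";
my $packed  = compress($content);
my $pdf = "1 0 obj\n<< /Length " . length($content) . " >>\n"
        . "stream\n$content\nendstream\nendobj\n"
        . "2 0 obj\n<< /Length " . length($packed) . " /Filter /FlateDecode >>\n"
        . "stream\n$packed\nendstream\nendobj\n";

# Look for the word inside a literal string shown with Tj.
my @hits = find_in_pdf($pdf, qr/\([^)]*this[^)]*\)\s*Tj/);
print scalar(@hits), " stream(s) contain the word\n";
```

      With the sample above, both streams should be reported. For anything beyond a quick look, a proper parser such as Text::PDF or PDF::API2 is the safer route.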
Re: Replacing text in a PDF file - NON Perl solution
by Popcorn Dave (Abbot) on Jun 18, 2003 at 22:49 UTC
    I've actually run up against this in the past. I had a PDF file for which I no longer had the original document. The way I modified it was with Adobe's Illustrator. It is a workable solution if you have access to Illustrator, but it's a giant pain in the seat.

    My document did have graphics that I had to work around - they were box labels with a logo.

    Granted it's like hitting a fly with a sledgehammer, but if all else fails, it will do the job for you.

    Good luck!

    There is no emoticon for what I'm feeling now.

Re: Replacing text in a PDF file
by CountZero (Bishop) on Jun 18, 2003 at 10:25 UTC

    Isn't the data in a PDF-file zipped? If so, this would make it almost impossible to do a straight replacement.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Not at all, where did you get that idea from? (Try looking at one in a normal editor; it's just a description language similar to PostScript.)

      Update: Apparently not quite true, though I've never seen one with binary data in it yet.

      C.

        ...some of it is, but the default for most apps is to compress the text/graphic streams (page description etc. seems to be left alone). It would seem though, that you can switch this off in some PDF-creation tools (eg Illustrator, apparently), leaving plain-text strings in the file suitable for simple regex-ing.
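
        A small sketch of why the compression matters, assuming a stream compressed with the usual Flate (zlib) filter. The content string is an invented sample, and only the core Compress::Zlib module is used. A regex run over the compressed bytes finds nothing, while the same regex works once the stream is inflated.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Invented sample content; $packed is what a /FlateDecode stream would hold.
my $text   = "BT /F1 12 Tf 72 720 Td (replace this word) Tj ET";
my $packed = compress($text);

# Count matches in the compressed bytes: the word is not there to find.
my $raw_hits = () = $packed =~ /this/g;

# Inflate, substitute, then deflate again.
(my $edited = uncompress($packed)) =~ s/this/that/g;
my $repacked = compress($edited);

printf "matches in compressed bytes: %d\n", $raw_hits;
printf "after edit: %s\n", uncompress($repacked);
printf "new /Length: %d\n", length $repacked;
```

        Note that the recompressed stream will usually have a different length, so in a real file the stream's /Length entry and the byte offsets in the cross-reference table would have to be updated as well.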

        Cheers,Ben.

        Well, that's exactly what I did: the ones that I looked at did not have any recognizable text.

        CountZero

Re: Replacing text in a PDF file
by mkenney (Beadle) on Jun 19, 2003 at 03:23 UTC
    I've been wondering why there isn't a way to template the forms like HTML pages can be. There is probably something I'm not getting about PDFs; I use them but do not know their coding. I would love to be able to lay out the document in any app, put in, say, %firstname% %lastname%, and then later insert this data. Is there anything out there that does this? Might be ultimately what you are looking for? This would be the solution of a lifetime for me...
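    As a rough sketch of the placeholder idea, assuming the template's text ends up as literal strings in an uncompressed content stream (often not the case), the fill-in step could look like this. The stream, the field names, and pdf_escape() are invented for illustration. The inserted text keeps the template's position and font, so a long value will not be re-wrapped.

```perl
use strict;
use warnings;

# Escape the characters that are special inside a PDF literal string:
# backslash and both parentheses.
sub pdf_escape {
    my ($s) = @_;
    $s =~ s/([\\()])/\\$1/g;
    return $s;
}

# Hypothetical form data and a one-line sample content stream.
my %data   = (firstname => 'Ada', lastname => 'Smith (Jr)');
my $stream = "BT /F1 12 Tf 72 700 Td (Dear %firstname% %lastname%,) Tj ET";

# Fill each %name% placeholder; unknown placeholders are left untouched.
$stream =~ s/%(\w+)%/exists $data{$1} ? pdf_escape($data{$1}) : "%$1%"/ge;
print "$stream\n";
```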
      I think PDF::Reuse will be just what you need. I had a look at it yesterday after nite_man's recommendation. It looks like an excellent module, and has the very useful PDF::Reuse::Tutorial to accompany it. Example 15 of this tutorial appears to be very much along the lines of what you're describing. The only thing is that it uses absolute x/y positions for insertion, rather than searching and replacing a specified string, so it's no good for me.
        This PDF::Reuse looks pretty good from what I can see after 5 minutes. Looks like I would just need to specify where I want to put my info on the form, the text, the font, etc. My ultimate goal is to let the graphics department (almost) freely manipulate the layout, fonts, etc. without me having to make constant changes. If they suddenly think the name or address needs to be in a different font, just change the template and the problem is solved. I know it might be a pipe dream, but it seems like this is a much-needed function. There are LOTS of forms used every day, so why not be able to lay them out in anything and then fill them out as you go...
      The Acrobat FDF Toolkit might be what you're after. And they claim there's a Perl API there, too.

      I dunno if Adobe are still as actively pushing PDF+FDF as a consumer web form solution as they used to be. Some basic information is in Thomas Merz's "Web Publishing with Acrobat/PDF" (Springer-Verlag, 1998, ISBN 3540637621).

      --
      bowling trophy thieves, die!

Re: Replacing text in a PDF file
by aquarium (Curate) on Jun 19, 2003 at 04:25 UTC
    Depends on what you want to end up with:
    (a) a long-term solution = a Perl program using PDF::API2 (there is a PDF::API2 module forum for help)
    OR
    (b) you just need to change this one-off document and re-print it as PDF = use the pdf2html utility and edit the HTML directly with regexes etc.
Re: Replacing text in a PDF file
by Popcorn Dave (Abbot) on Jun 19, 2003 at 16:02 UTC
    After giving this some thought last night, this solution *may* do what you're looking for but it seems a bit convoluted to me.

    Adobe offers a PDF to HTML conversion utility here. I've used this to convert a PDF to HTML so that I could parse out the info I needed for an app I was working on.

    If you've just got straight text in the PDF, theoretically you could do the conversion, do an HTML to text conversion, make your changes, then write the whole mess back to a PDF using the PDF modules.

    Like I said, that seems like a long way around to the solution, but it may work for you.

    Update: I seem to remember, that depending on your version of Acrobat that you can edit PDF files if you've got the originals? I believe it was Version 5. Version 6 may do that too, I don't know.


Re: Replacing text in a PDF file
by Anonymous Monk on Jan 15, 2004 at 22:06 UTC
    Here's something you may want to check out..

    http://www.cs.berkeley.edu/~phelps/Multivalent/index.html

    http://www.cs.berkeley.edu/~phelps/Multivalent/Tools/pdf/Uncompress.html

    These tools are written in Java and require JRE 1.4.1.
    I haven't tried any of them yet, but they should be easy to drive with system calls or possibly Inline::Java.

    From the documentation:

    Uncompress for Hand Editing or Examination

    The written file leaves content streams uncompressed and available for inspection or hand editing. With reference to Adobe's PDF Reference, available online, you can arbitrarily change the PDF, anything from correcting bad OCR, to fixing typos on pages without having the generating application (text is not reflowed), to adding title and keywords, to authoring annotations to diagnosing problems.

    Uncompressed content streams are pretty printed to better show structure and objects are labelled with the page numbers they're used on. For Western languages, it is straightforward to identify the character strings and edit them. One must be careful to edit with a text editor, such as Emacs, that can handle binary data and does not translate unfamiliar characters or line endings.

    The edited file written by the text editor should be passed through the Compress tool to recompress the streams and rebuild the cross-reference table.

    Hopefully someone will find these useful

Re: Replacing text in a PDF file
by zengargoyle (Deacon) on Jun 19, 2003 at 23:02 UTC

    in general i don't think it's possible to do in the easy way we all want to do it. from what i've seen of PDF, the layout is done by the creating application. by this i mean that whereas in Postscript somebody might have:

    (This is the Title) centered-bold

    a PDF file would have more something like:

    x y moveto (This is the Title) show

    my code is way off, but the idea is that the application determines the absolute position of the text to show, so even if you can read the PDF and unpack the text strings from whatever format they're in, if you change the text then you change the layout and your PDF is going to be all messed up.
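
    for what it's worth, a PDF content stream usually places a line of text with operators such as Tf (font and size), Td (position) and Tj (show string). the sketch below uses an invented one-line stream to show the point being made: after a substitution the coordinates are exactly what they were, so nothing re-centres or re-wraps the longer text.

```perl
use strict;
use warnings;

# Invented sample: font /F1 at 12pt, start position (72, 720), then the string.
my $stream = "BT /F1 12 Tf 72 720 Td (This is the Title) Tj ET";

# Swap in a much longer title.
(my $edited = $stream)
    =~ s/\(This is the Title\)/(This is the Much Longer Replacement Title)/;

# Pull out the x coordinate given to Td before and after the edit.
my ($x_before) = $stream =~ /(\S+)\s+\S+\s+Td/;
my ($x_after)  = $edited =~ /(\S+)\s+\S+\s+Td/;
print "x before: $x_before, x after: $x_after\n";
```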

    Postscript tends more to having code in the file to handle the layout: you pass a chunk of text to the code and it gets laid out as the program is run on the printer. so if you change a word in the middle of the paragraph it won't cause the line to run off the edge of the page. the absolute position isn't determined until run-time.

    this is based on looking at a bunch of PDF modules and always seeing things like put_text(x,y,text) which returns how many x's are left in the line. it seems that the PDF creator is responsible for things like wrapping/centering text.

    does this make any sense?