sureshrps has asked for the wisdom of the Perl Monks concerning the following question:

Hi, is it possible to replace all ligatures, such as "fi" and "fl", with the actual letters in a PDF file? When I copy/paste text from some PDF files, the ligatures come out as unwanted characters, and it is impossible to tell which letters they should be. So I am thinking that if I could convert the ligatures before I copy/paste, my problem would be solved. I look forward to your response. Thanks, Suresh Kumar. P

Re: how to remove Ligatures in pdf
by roboticus (Chancellor) on Jun 01, 2011 at 10:58 UTC

    sureshrps:

    What exactly are you having trouble with? You've been here long enough to know that you should ask a question with enough supporting information. Please review the following for future posts: "Ask questions the smart way", "I know what I mean. Why don't you?" and "How (Not) To Ask A Question".

    So what code are you having problems with? If it's actually a cut & paste question, then you should direct it to the help group for the OS you're using, or perhaps a PDF support group.

    If you have a text file and need to modify it, then I'd suggest making a table-based program to search for problem data and replace it with the data you want. Then, as you find the ligatures, enter the appropriate data in your table and stuff your file through the program.
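    A minimal sketch of such a table-based filter, in Python for illustration. It assumes the text has already been exported from the PDF and that the ligatures arrive as the standard Unicode presentation forms (U+FB00 to U+FB06). The names LIGATURES and expand_ligatures are made up for this example; entries for document-specific bogus characters would be added to the table as you find them.

```python
# Hypothetical lookup table: Unicode ligature code points mapped to the
# plain letters they stand for. Extend it as new problem characters turn up.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",  # long s + t ligature
    "\uFB06": "st",
}

# str.maketrans accepts single-character keys and string replacements.
TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text, table=TABLE):
    """Replace every character found in the table with its plain letters."""
    return text.translate(table)


print(expand_ligatures("\uFB01nd the \uFB02oor"))  # prints "find the floor"
```

    Keeping the mapping in one table means a new ligature costs one extra line, and a separate table can be kept per document if the codes differ between files.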

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: how to remove Ligatures in pdf
by LanX (Saint) on Jun 01, 2011 at 11:00 UTC
    In PDF, letters are placed at absolute positions, so replacing glyphs with ones that have different metrics will most probably break the document's rendering.

    You should consider exporting to a text file and doing the replacement there (see "Parsing PDFs by text position?").

    Or run a little script on the pasted text.

    E.g. with emacs you could run a hook on every insert, which could do the transformation.

    Or maybe ... did you check whether your PDF reader allows running hooks on copy ...?
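    As a rough illustration of the "little script" idea, here is a Python sketch that leans on Unicode compatibility normalization (NFKC) instead of a hand-written table. It only helps if the pasted text contains real Unicode ligature code points such as U+FB01, and it also rewrites other compatibility characters (superscripts, non-breaking spaces and the like), which may or may not be wanted. The function name normalize_pasted is an assumption made for this example.

```python
import unicodedata


def normalize_pasted(text):
    """Apply NFKC, which decomposes compatibility characters such as the
    Unicode ligatures into their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)


print(normalize_pasted("\uFB01le"))  # prints "file"
```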

    > and impossible to know what letters it should be.

    I doubt this. Could you give us some examples?

    Cheers Rolf

Re: how to remove Ligatures in pdf
by Anonymous Monk on Jun 01, 2011 at 10:58 UTC
      A ligature is a symbol that represents a combination of letters. You can still recognize the original letters, but they're drawn slightly differently.

      In PDF, these ligatures are commonly represented by a single replacement character. Thus, if you copy/paste text from a PDF file, you usually get a bogus character code that isn't used for anything else in that document. It doesn't even have to be the same character code in different PDF files...
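      Since the bogus codes can differ from file to file, one way to find them is to list every unusual character in the extracted text together with a bit of context, then decide by eye which letters each one stands for. The Python sketch below does that under the assumption that the document is otherwise plain ASCII, so anything non-ASCII (or an unexpected control character) is treated as a suspect. The name suspicious_chars and the sample text are invented for this example.

```python
import unicodedata
from collections import Counter


def suspicious_chars(text, context=10):
    """Map each suspect character to (count, Unicode name, sample context)."""

    def is_suspect(ch):
        # Assumption: ordinary text is ASCII plus newline, carriage return, tab.
        return ord(ch) > 126 or (ord(ch) < 32 and ch not in "\n\r\t")

    counts = Counter(ch for ch in text if is_suspect(ch))
    report = {}
    for ch, n in counts.items():
        i = text.index(ch)  # first occurrence, used for the sample
        sample = text[max(0, i - context): i + context + 1]
        name = unicodedata.name(ch, "<unnamed>")
        report[ch] = (n, name, sample)
    return report


# Made-up sample in which private-use code points stand in for "fi" and "fl".
sample_text = "The \uE001rst \uE001le on the \uE002oor"
for ch, (n, name, sample) in suspicious_chars(sample_text).items():
    print(f"U+{ord(ch):04X} seen {n} time(s), {name}: {sample!r}")
```

      The surrounding context usually makes the intended letters obvious, and the resulting pairs can then be used as a per-document replacement table.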