Re: Reading PDF files

by RollyGuy (Chaplain)
on Jul 21, 2003 at 14:02 UTC (#276283)

in reply to Reading PDF files

Disclaimer: I don't know the answer.
However, I would like to add a word of caution. PDFs can also store content as images, so if you are trying to parse a PDF made up of images of text, it will be very difficult, and quite a different problem from parsing.

Re: Re: Reading PDF files
by Helter (Chaplain) on Jul 21, 2003 at 14:18 UTC
    When I open the PDF in Acrobat, I can use the text selection tool to grab some text, so unless they have some FAST OCR software running in there, I don't believe it's an image. A very good warning, though.
      Depending on the app that created the PDF and the settings and fonts used, you may end up with a PDF that is just a bunch of font character bitmaps, laid out in sequence or in blocks, with no underlying text information. Most Adobe applications will embed textual versions in the PDF so that the text selection tool can grab plain text from segments of the document. I think the point is that unless you can be certain how the PDFs are generated, and you are comfortable with that, parsing data from them is going to be a large pain in the butt.
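
One rough way to tell which kind of PDF you have before trying to parse it is to look for font and image objects in the raw file. The Python sketch below is only a heuristic and nobody in this thread proposed it: the function name `guess_pdf_content` is made up for illustration, and it assumes the font and image dictionaries are stored uncompressed, which is not true of PDFs that pack their objects into compressed object streams. It also cannot tell whether a font carries usable text information, so a "text" result does not guarantee that extraction will work.

```python
import re


def guess_pdf_content(raw: bytes) -> str:
    """Crude guess at whether a PDF carries text, images, or both.

    Assumption: font and image dictionaries appear uncompressed in the file.
    """
    # Font dictionaries usually mean the pages draw real text.
    fonts = len(re.findall(rb"/Type\s*/Font\b", raw))
    # Image XObjects usually mean scanned pages or embedded pictures.
    images = len(re.findall(rb"/Subtype\s*/Image\b", raw))

    if fonts and images:
        return "mixed"
    if fonts:
        return "text"
    if images:
        return "image-only"
    return "unknown"
```

To try it, read the file in binary mode and pass the bytes in, for example `guess_pdf_content(open(path, "rb").read())`. An "image-only" result suggests you are facing an OCR problem rather than a parsing one.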