RE: Reading PDF Files?

RE: RE: Reading PDF Files? by athomason (Curate) on Jun 07, 2000 at 21:37 UTC
I disagree for a number of reasons. First, I think most people release files as PDF documents because they need precise formatting that HTML doesn't give, including much better print control. Second, PDF doesn't protect text at all; you can select text for copying even in Adobe's own PDF viewers. The only viewers out there really designed for such protection are e-book readers. Lastly, I don't think it's the domain of Monks to judge someone's intentions with a project. I'd say that if you don't feel comfortable giving advice to someone, just don't give it. I especially think it's inappropriate to come down condemning someone without any knowledge of how the project will be used. I'd be inclined to think the OP intends to write an engine for searching through PDFs on an intranet, given the insanity of indexing anything more (in Perl, no less). All the above just MHO.
RE: RE: RE: Reading PDF Files? by neshura (Chaplain) on Jun 07, 2000 at 21:54 UTC
You are probably close to the mark. I don't know if either of us can generalize as to why people release their work in PDF (I should have said, "In my experience/opinion"). I have found that if you copy and paste PDF text, it drops a given letter from every word, so reconstructing the text is awfully time-consuming. I do not know whether this behavior is universal or an option set at the time of creation. I -did- mean to imply that the seeker was trying to crack PDFs; I've done the same thing, for (what I believed to be) legitimate reasons -- namely laziness :) not wanting to type in all that damn text. But I suppose you are correct that, without knowing the legality or motive behind the poster's code, it's wrong to wimp out on answering questions. (You know the old saw about ass-u-me)
(jcwren) RE: RE: Reading PDF Files? by jcwren (Prior) on Jun 07, 2000 at 21:52 UTC
PDF stands for (P)ortable (D)ocument (F)ormat. I'm not going to comment one way or the other on the merit or advisability of using PDFs to secure material. The primary point of PDF was to be able to create portable (multi-platform) documents with scalable graphics and fonts (i.e., no bit-map images, unless you're a complete hack). There are a number of third-party Windows-based tools for manipulating PDF files, so I'll assume that securing the content isn't an issue by default. I believe that 4.0 supports some security features. You can turn off the ability to cut and paste (at least out of the Windows viewers; dunno about other platforms). There are, in fact, some tools that create searchable indexes for PDFs. All the ones I've seen are Windows-based, so I can't say whether any are available with source. I think some third-party vendor out there offers a linkable library for PDF management. I realize these don't directly answer your question, but they may give you some places to look for additional information. --Chris
RE: RE: Reading PDF Files? by mdillon (Priest) on Jun 07, 2000 at 22:14 UTC
i'm not so sure about this. i have a package called GhostScript on my computer. it includes two programs of relevance: `pdf2ps` and `ps2ascii`. the first will translate a document in PDF format into PostScript, while the second will produce a (decent) text rendering of the PostScript. PDF is not inherently about protection, as many have pointed out. if the PDF has been encrypted, i have installed the RC4 encryption extension to GhostScript, so provided that i have the key, i can even gain legitimate access to encrypted PDF files.
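For anyone who wants to try the two-step route above, here is a minimal sketch in Python (not Perl, purely for illustration). It assumes `pdf2ps` and `ps2ascii` are on the PATH, which depends on the Ghostscript version installed, and that both accept an input file followed by an output file. The function names `build_commands` and `pdf_to_text` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path):
    """Build the pdf2ps and ps2ascii command lines for one PDF.

    Returns the list of commands and the path of the final text file.
    The intermediate .ps and final .txt sit next to the input file.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    txt = pdf.with_suffix(".txt")
    commands = [
        ["pdf2ps", str(pdf), str(ps)],    # PDF -> PostScript
        ["ps2ascii", str(ps), str(txt)],  # PostScript -> plain text
    ]
    return commands, txt


def pdf_to_text(pdf_path):
    """Run both steps and return the extracted text (hypothetical helper)."""
    commands, txt = build_commands(pdf_path)
    # Assumption: both tools are installed; fail early if either is missing.
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return txt.read_text(errors="replace")
```

Calling `pdf_to_text("report.pdf")` would leave `report.ps` and `report.txt` beside the original. The text quality depends on how the PDF was produced, so treat the result as raw material for indexing rather than a faithful copy.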
Who said anything about cracking? by Melvin (Beadle) on Jun 08, 2000 at 21:42 UTC
First of all, I was under the impression that the PDF document format was fully documented in the white papers on Adobe's website. If I am correct (I'll have to check), then it isn't "cracking" at all; it is simply using the data in the documents to provide a useful service. Second, this is for a client, who WANTS to have their PDF documents searched. I'll be very disappointed if the PDF format is a closed standard, but I believe you are wrong. Third, there are already a number of utilities out there that convert from PDF to another format, such as PS and txt, which I doubt are the result of reverse engineering. So, I believe my question is valid, though unfortunately I'm not finding a lot of answers. Please do feel free to correct me on any of the above points though. Mel
RE:(2) Reading PDF Files? by swiftone (Curate) on Jun 07, 2000 at 21:45 UTC
I think that's unnecessarily harsh. Someone can still cut-and-paste the text (and "Select All", "Copy", "Paste" isn't terribly time-consuming), so it isn't like any "real" protection is afforded. I would consider it a stretch to say that protection is the primary goal an agency has when placing a PDF online. Portability and print formatting (i.e. making sure two pages STAY two pages) are of more importance. It would be more useful (to the questioner) to say that Adobe protects PDF text reading-methods, and that no modules exist to do so (that can be distributed).