Re: Detecting PDF content

You want to detect some of the low-level structure in a PDF file, right? Perhaps the pagedump.pl script, which comes in the "examples" subdirectory of the PDF package on CPAN, would be of some help.

Here is the start of what it dumps about the first page of The Perl Journal:

% perl pagedump.pl 0301tpj.pdf 1
Page 1
    Dictionary
        <<
        Name: /CropBox => Array
            [
            Number: 0
            Number: 0
            Number: 558
            Number: 756
            ]
        Name: /MediaBox => Array
            [
            Number: 0
            Number: 0
            Number: 558
            Number: 756
            ]
        Name: /Rotate => Number: 0
        Other: Page_Object => Object: 402 0 R
        Other: Resource_Object => Object: 434 0 R
        >>
...

You can probably find a distinct set of components for your image-only cases.
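As a rough illustration of what such a "distinct set of components" might look like, here is a minimal sketch that skips the PDF module entirely and just scans the raw bytes of the file for two name tokens: "/Subtype /Image" (carried by image XObjects) and "/Type /Font" (carried by font dictionaries, which text needs). The helper name classify_pdf_bytes is made up for this example. It is only a heuristic: it assumes the object dictionaries are stored uncompressed, and it will miss inline images and anything hidden inside compressed streams.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any CPAN module): classify raw PDF
# bytes by the name tokens that appear in them.
sub classify_pdf_bytes {
    my ($data) = @_;

    # Image XObjects are dictionaries that carry "/Subtype /Image".
    my $has_image = $data =~ m{/Subtype\s*/Image\b};

    # Showing text requires a font; font dictionaries carry "/Type /Font".
    my $has_font = $data =~ m{/Type\s*/Font\b};

    return $has_image && !$has_font ? 'image-only'
         : $has_image               ? 'mixed'
         : $has_font                ? 'text-only'
         :                            'unknown';
}

# Usage: perl classify.pl file.pdf
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $data = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print "$file: ", classify_pdf_bytes($data), "\n";
}
```

This looks at the whole file rather than one page at a time, so a single font anywhere in the document makes it report "mixed". For a per-page answer you would still want to walk the page's resource dictionary, which is where the pagedump.pl output comes in.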

Update: Mr. Muskrat and I seem to have different interpretations of your question. I read "detect that" to mean "detect that a file (which is already known to be a PDF file) contains only images rather than images plus text or text alone."

Re: Re: Detecting PDF content by Rich36 (Chaplain) on Jan 23, 2003 at 22:20 UTC
I think that will definitely be helpful. I'll just need to figure out what the returned parameters are and what constitutes an image. Thanks, «Rich36»