FoxtrotUniform has asked for the wisdom of the Perl Monks concerning the following question:

Gentlemonks,

My lab has a fair number of conference papers and proceedings stashed away as PDFs. I'm trying to write a program to index them and create a web-page interface (which I hope will be more friendly than a ten-level directory tree). The papers' filenames almost invariably contain no more useful information than the primary author's last name, so I'm trying to use PDF::Parse to extract title and author information.

Unfortunately, my script fails with

PDF::Core::PDFGetPrimitive() called too early to check prototype at PDF/Core.pm line 288.
PDF::Core::PDFGetPrimitive() called too early to check prototype at PDF/Core.pm line 294.
Argument "" isn't numeric in seek at PDF/Core.pm line 556.
Can't read cross-reference section, according to trailer

and I can't glean enough context from the module's documentation to know when PDF::Core::PDFGetPrimitive() should be called, or why a trailer might not be able to read a cross-reference section.

In fact, I'm finding it very difficult to debug based on the docs for PDF::Parse -- they seem to describe what the module's various functions will do if they succeed, but not what's necessary to ensure success.

My code, mildly edited for length, follows:

#! /usr/bin/perl -w
use strict;
use File::Find;
use Data::Dumper;
use PDF;
use PDF::Parse;

my %conferences = ();
my $start_dir = "/gruvi/Data/Proceedings";

sub find_pdfs {
    # simple, yet should be effective
    if(/\.pdf$/) {
        # NOTE: this seems to ignore "top-level" PDFs
        # That's not a bug, it's a feature!
        my ($conference) = ($File::Find::dir =~ m!$start_dir/(\w+?)/!);
        $conference ||= "Top-level";
        push @{$conferences{$conference}}, "$File::Find::dir/$_";
        return 1;
    }
    return 0;
}

&File::Find::find(\&find_pdfs, $start_dir);

for (sort keys %conferences) {
    &write_conference_page($_, "$_.html", $conferences{$_});
}

sub write_conference_page {
    my ($conference, $outfile, $pdfs) = @_;

    open PROCS, '>', "./$outfile"
        or die "Cannot open $outfile: $!\n";

    # we're going to callously ignore proper HTML for the present
    print PROCS "<html>\n<body>\n";

    # we're also going to callously ignore proper Perl HTML-gen style
    # for the moment
    print PROCS "<h3>$conference</h3>\n";
    print PROCS "<ul>\n";
    for (@$pdfs) {
        my $pdf = PDF->new($_);
        $pdf->TargetFile($_);
        my $title = $pdf->GetInfo("Title");
        my $authors = $pdf->GetInfo("Author");
        print PROCS "\t<li>$authors:<br>";
        print PROCS "<a href=\"file://$_\">$title</a></li>\n";
    }
    print PROCS "</ul>\n<hr>\n\n";
    print PROCS "</body>\n</html>\n";

    close PROCS
        or die "Cannot close $outfile: $!\n";
}

I'll freely admit that the code around $pdf is cargo-cult stuff -- I couldn't tell from the docs what I was expected to do to make a PDF object, so I just copied from the synopses.

This code fails to generate any of the conference index pages. Any ideas?

--
F o x t r o t U n i f o r m
Found a typo in this node? /msg me
% man 3 strfry

Re: PDF::Parse fails with obscure error messages
by Jaap (Curate) on Jul 21, 2004 at 07:28 UTC
    Try to isolate the problem and mail the author about it. Usually when I isolate the problem, I find what I was doing wrong or I find an easy workaround.
Re: PDF::Parse fails with obscure error messages
by Albannach (Monsignor) on Jul 22, 2004 at 13:12 UTC
    I did a simple test with your code, and apart from the redundant $pdf->TargetFile($_); it seems to correctly extract the requested PDF info from my small sample of files. I noticed that PDF::Core (which I've rarely used) has some uninitialized variables that could be cleaned up, but I also suspect that, as it is a bit old now, it simply doesn't grok the particular PDF you're feeding it in your test. Perhaps you should see which PDF is being scanned when that error appears, and test some other PDFs (i.e. ones not generated at your shop) to see if it is a version problem.
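
    One way to see which PDF is being scanned when the error appears is to wrap the per-file work in eval and record any warnings alongside the filename. The following is only a rough sketch: try_file and $extract_title are hypothetical names, and the stand-in callback fakes the two run-time messages so the example runs without PDF::Parse installed. It also assumes the "Can't read cross-reference section" message is fatal (a die or croak), which the original post does not confirm. The two "called too early to check prototype" messages are compile-time warnings emitted when the module is loaded, so they would not be attributed to any one file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Parse): run $code->($file) and
# collect every warning or fatal error raised while it runs, so each
# message can be reported next to the file that triggered it.
sub try_file {
    my ($file, $code) = @_;
    my @messages;

    # Capture warnings for the duration of this call only.
    local $SIG{__WARN__} = sub { push @messages, "warning: $_[0]" };

    # eval traps die/croak, so one bad file does not stop the whole run.
    my $result;
    my $ok = eval { $result = $code->($file); 1 };
    push @messages, "fatal: $@" unless $ok;

    chomp @messages;
    return ($result, \@messages);
}

# Stand-in for the real extraction step, so the sketch runs on its own.
# In the indexing script this would be something like:
#   sub { my $pdf = PDF->new($_[0]); $pdf->GetInfo("Title") }
my $extract_title = sub {
    my ($file) = @_;
    warn "Argument \"\" isn't numeric in seek\n" if $file =~ /odd/;
    die  "Can't read cross-reference section\n"  if $file =~ /bad/;
    return "Title of $file";
};

for my $file (qw(good.pdf odd.pdf bad.pdf)) {
    my ($title, $messages) = try_file($file, $extract_title);
    print "$file: ", (defined $title ? $title : "(no title)"), "\n";
    print "    $_\n" for @$messages;
}
```

    Running the real extraction through a wrapper like this should leave a per-file list of complaints, which makes it easier to pull out the offending PDF and try it on its own.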

    You might also look into PDF::API2 which I have found to be pretty good (and newer), though slower than PDF::Parse for header information.

    Good luck!

    --
    I'd like to be able to assign to an luser