PerlMonks  

Re: Convert PDF to HTML (or JPEG)

by almut (Canon)
on Sep 12, 2009 at 12:31 UTC ( [id://794918]=note )


in reply to Convert PDF to HTML (or JPEG)

For PDF to JPG (or any other raster image format like PNG or TIFF), you could use Ghostscript to do the conversion:

$ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r150 -sOutputFile=img%d.jpg input.pdf

This would create as many images (img1.jpg to imgN.jpg) as there are pages in the PDF file.  -r is the resolution in dpi (150dpi would create an image size of 1240x1754 for A4 paper size), and -dJPEGQ is the quality factor (up to 100).
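The pixel dimensions quoted above follow directly from the paper size and the resolution (pixels = millimetres / 25.4 * dpi). A minimal sketch of that arithmetic in shell; page_pixels is a hypothetical helper name used only for illustration, not part of Ghostscript:

```shell
#!/bin/sh
# page_pixels WIDTH_MM HEIGHT_MM DPI
# Hypothetical helper (not part of Ghostscript): prints WIDTHxHEIGHT in
# pixels, rounded to the nearest pixel, using 1 inch = 25.4 mm.
page_pixels() {
    w_mm=$1 h_mm=$2 dpi=$3
    # Multiply by 10 so that 25.4 becomes the integer 254;
    # adding 127 (half of 254) rounds to the nearest pixel.
    w_px=$(( (w_mm * dpi * 10 + 127) / 254 ))
    h_px=$(( (h_mm * dpi * 10 + 127) / 254 ))
    echo "${w_px}x${h_px}"
}

page_pixels 210 297 150   # A4 at 150 dpi -> 1240x1754
page_pixels 210 297 600   # A4 at 600 dpi -> 4961x7016
```

The same calculation gives the size of the oversampled intermediate images if you render at a higher -r value.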

Unfortunately, this doesn't do any anti-aliasing, so the fonts typically look rather ragged...  You can work around that problem by doing the anti-aliasing yourself, which means you'd have to oversample while rendering from PDF to raster (e.g. by a factor of 4, i.e. 600dpi) and then downsample with an appropriate filter.

ImageMagick's convert can be used for the latter. The complete sequence of steps would be:

$ gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 -r600 -sOutputFile=img%d.jpg input.pdf
$ for img in img*.jpg ; do convert "$img" -filter Lanczos -resize 25% -quality 90 "out_$img" ; done

The resulting anti-aliased images out_img*.jpg would then have 150dpi resolution.
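The two steps can be wrapped so that the render resolution and the resize percentage are derived from a target resolution and an oversampling factor. This is only a sketch: the function names and defaults below are hypothetical, and it assumes gs and convert are on the PATH and that the current directory holds no unrelated img*.jpg files.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are illustrative only.

# render_dpi TARGET_DPI FACTOR -> resolution to pass to gs via -r
render_dpi() { echo $(( $1 * $2 )); }

# resize_percent FACTOR -> percentage for convert -resize (integer division)
resize_percent() { echo $(( 100 / $1 )); }

# pdf_to_aa_jpg INPUT_PDF [TARGET_DPI] [FACTOR]
# Renders oversampled pages with gs, then downsamples each with convert.
pdf_to_aa_jpg() {
    pdf=$1 target=${2:-150} factor=${3:-4}
    gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=88 \
       -r"$(render_dpi "$target" "$factor")" \
       -sOutputFile=img%d.jpg "$pdf" || return 1
    for img in img*.jpg; do
        convert "$img" -filter Lanczos \
            -resize "$(resize_percent "$factor")%" -quality 90 "out_$img"
    done
}
```

Calling pdf_to_aa_jpg input.pdf 150 4 reproduces the commands above. Because the percentage uses integer division, factors that divide 100 evenly (2, 4, 5) hit the target resolution exactly; a factor of 3 would resize to 33% and land slightly below it.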

In case you have the non-/usr/bin-namespace-polluting sister GraphicsMagick installed (instead of ImageMagick), the command would be gm convert ...

(Those who hold a degree in Signal Processing - or have come in contact with filter design in some other context - might want to take a look at the list of filters to choose from — in case of doubt, stick with Lanczos or Kaiser for somewhat sharper, or Gaussian or Cubic for somewhat softer results.)

Also, there's documentation - well hidden from daylight - under /usr/share/doc/ghostscript/Devices.htm, which explains what options are available with the individual Ghostscript output devices (you usually need to have another package installed (e.g. ghostscript-doc on Debian/Ubuntu) to have that file).

Replies are listed 'Best First'.
Re^2: Convert PDF to HTML (or JPEG)
by LanX (Saint) on Sep 12, 2009 at 14:13 UTC
    Almut, IIRC convert has a switch for antialiasing, I never had problems converting PDF to bitmaps (well ... years ago)

    So no need for oversampling.

    Cheers Rolf

      Yes, convert has an -antialias switch, but Ghostscript does not, at least not the jpeg driver (there's an x11alpha screen driver, but I think that's the only one which does anti-aliasing by itself).  And ImageMagick (i.e. convert) cannot render PDF/PS itself; it uses Ghostscript for that under the hood anyway...

      Personally, I prefer to use both tools separately, because then I have fine control over the parameters used during conversion, and so far, I've always achieved better results (in less time) than when trying to convince convert alone to do what I want.

      For example, the naive approach (which I figure should be comparable to the conversions I posted above) when using convert directly would be something like this:

      $ convert input.pdf -density 150 -geometry 1240x1754 -antialias -quality 90 img%d.jpg

      But the results are much worse than when doing the steps separately... (example: test1.jpg, test2.jpg — where test1.jpg has been produced by using gs and convert separately, and test2.jpg when calling gs indirectly via convert (the command right above)).

      As I read the docs, -density is supposed to set the resolution ("set resolution of an image for rendering to devices"), however, for some reason this doesn't seem to be passed on to Ghostscript (as can be revealed using strace)...  In case you have the patience to figure out the correct incantation of options for convert that achieves the quality of test1.jpg, please let me know (input PDF here) — IMHO, there's too much Magick going on :)
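      One possible explanation, offered as an untested assumption rather than something confirmed in this thread: ImageMagick applies settings such as -density to the images that are read after them on the command line, so a -density placed after input.pdf would come too late to influence the Ghostscript rendering. A sketch of the reordered command follows; direct_convert_cmd is a hypothetical helper that only prints the command line so it can be inspected before running it.

```shell
#!/bin/sh
# direct_convert_cmd INPUT_PDF [DPI]
# Hypothetical helper: prints (does not run) a convert command line with
# -density placed BEFORE the input file, so that it can take effect while
# the PDF is being read. Assumes a file name without spaces.
direct_convert_cmd() {
    echo "convert -density ${2:-150} $1 -quality 90 img%d.jpg"
}

direct_convert_cmd input.pdf
```

      Whether this matches the quality of the two-step approach is not something this thread establishes; comparing the output against test1.jpg would be the way to check.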

        Hi Almut,

        Now I had the time to check my old computer for these 8 year old bash scripts I used :)

        And ... well ... it's really strange, but I'm not experiencing your problems!

        test.pdf.00.jpg

        test.pdf.00.png

        that's the script I used:

        cd /home/lanx/tmp;
        SOURCE="test.pdf";
        ILTYPE="plane" ;
        GEOMETRY="1240x1754";
        QUALITY=90;
        DENSITY="150x150";
        for OUT in "jpg" "png"; do
            echo $OUT;
            DEST="$SOURCE.%02d.$OUT";
            convert +adjoin -interlace $ILTYPE -geometry $GEOMETRY \
                -density $DENSITY -quality $QUALITY $SOURCE $DEST
        done

        Maybe some other installations like latex2html or GraphicsMagick are altering the behavior of convert on my box?

        Cheers Rolf

        PS: Große Vallüla ??? xD (SCNR)

        UPDATE: just noticed I didn't even use the -antialias switch ...
