hacker has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a complex content-management system for delivering selected content to Palm handheld devices, and I've run into a problem that may not be solvable with the given tools, so I'm soliciting some ideas...

What I've done so far is capture about 885 different URLs of content that users have been using, and store the URL, name, etc. in a database. Nothing magical there. When a user wants to select some content to convert for their Palm, they pick from a list of sites in the database and are shown a "preview" of what it might look like when it gets onto their handheld in our application.

The problem with the approach I'm using, currently an iframe as shown in the first code snippet below, is that our list of "previews" is subject to whatever HTML the remote site we're pointing to with the iframe decides to use, including any broken HTML, malicious HTML or JavaScript, meta-refresh tags, popups, and other noise.

# The URL itself is queried from the database above here,
# but that isn't important for this particular explanation
#
# The URL itself is stored in $url_uri from bind_col()
while ($sth->fetch) {
    my $substr_url_uri = length($url_uri) > 60
        ? substr($url_uri, 0, 60) . "..."
        : $url_uri;
    print br(), "\n" x 3,
        start_div({-class => 'urlbox'}),
        span({-class => 'urlinfo'},
            b($url_name), br(), br(),
            "Current category:",
            start_span({-class => 'catname'}),
            $category_name,
            end_span(), br() x 2,
            start_span({-class => 'catlist'}),
            "Move to new category:", br(),
            popup_menu("category", \@categories),
            end_span(),
        ),
        end_div(),
        start_div({-class => 'urlprev'}),
        start_iframe({-src          => $url_uri,
                      -width        => '170',
                      -height       => '170',
                      -scrolling    => 'no',
                      -marginwidth  => '1',
                      -marginheight => '1',
                      -frameborder  => '0'}),
        end_iframe,
        end_div();
}
$sth->finish;

Obviously this presents a problem when we have a really nice-looking interface for the previews, similar to those in this screenshot.

What I'm wondering is whether it's possible to take a screenshot of the content on the remote site, in the same 160x160 window, and present that screenshot to the user in the preview, instead of an iframe pointing directly at the live, remote (raw) content. I know I can fetch the content with LWP::UserAgent, using the following (working) snippet:

use LWP::UserAgent;
use HTTP::Request;

sub fetch_url {
    my $url_uri = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/3.0');

    my $request  = HTTP::Request->new(GET => $url_uri);
    my $response = $ua->request($request);

    # Content itself is in $response->content here.
    return $response->content;
}

Using an image, however, presents us with a few good advantages:

I can also use LWP::Simple's getstore() function, or fetch the content with the snippet above and store it in a local file directly, such as:

open(CONTENT, ">content.$$") or die $!;
print CONTENT $response->content;
close CONTENT;
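(For completeness, the LWP::Simple version is a one-liner; an untested sketch, reusing the same "content.$$" filename:)

use LWP::Simple qw(getstore);
# Fetch $url_uri and save the body straight to a local file;
# getstore() returns the HTTP status code.
getstore($url_uri, "content.$$");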
I'd rather stay away from that for a few reasons:

So the question is: does there exist a way to do this programmatically?

Is it possible to print the content stored in $response->content; above and create an image out of the contents?

Can I jam the content into some sort of postscript filter, and then print that to an image?
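(For what it's worth, HTML::FormatPS from CPAN is one such PostScript filter. A rough, untested sketch follows; the paper dimensions are my guess at approximating the 160x160 viewport, and as noted in the replies below, it only renders text:)

use HTML::TreeBuilder;
use HTML::FormatPS;

# Untested sketch: parse the fetched markup in memory (no local
# file needed) and format the tree as PostScript.  PaperWidth and
# PaperHeight are in points; 160 is a guess at the Palm viewport.
my $tree = HTML::TreeBuilder->new_from_content($response->content);
my $formatter = HTML::FormatPS->new(
    PaperWidth  => 160,
    PaperHeight => 160,
);
my $postscript = $formatter->format($tree);
$tree->delete;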

Is there some merlyn column on exactly this topic somewhere on stonehenge?

Has anyone done anything remotely similar to this?

Re: Automated Screenshot Grabbing via LWP?
by PodMaster (Abbot) on Apr 07, 2003 at 10:26 UTC
      These links offer interesting solutions, but fail for me in a few ways:

      • HTML::FormatPS assumes I have a local file, which I won't. It suffers from the same problems as getstore(): I won't have a reference to the images linked in the file (nor will they show up in the output).

      • html2ps was remarkably close, but for normal-sized text on a very long scrolling page, such as The UBC Psychology Department website for handhelds, the .ps looks good, but the png version created by convert is completely illegible; the text is turned into a "city skyline".

      • I need to pass a very specific UserAgent string to the sites I'm "snapshotting" in some cases, as they restrict who can view the content by UserAgent (see the sketch after this list).

      • I need to specify the viewport size to be exactly 160x160 pixels, or 320x240 pixels, depending on the user's preferences for preview size (determined by their chosen Palm type, standard or high-resolution). Most of these tools seem to grab the site in its "natural" capacity, which is in A4 or "page" size.
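      For the UserAgent restriction above, something like this minimal sketch should do; the agent string below is only a placeholder for whatever the remote site expects:

      my $ua = LWP::UserAgent->new;
      # Placeholder string -- substitute whatever agent the site allows.
      $ua->agent('AvantGo 3.2 (compatible; Mozilla/3.0)');
      my $response = $ua->request(HTTP::Request->new(GET => $url_uri));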

      I'll keep looking; I'm sure there's something I can use here. Thanks for the great ideas thus far.

        So don't use getstore. Be a little more creative: HTML::LinkExtor/HTML::LinkExtractor, w3mir.
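        A rough, untested sketch of the HTML::LinkExtor idea, collecting image URLs so they can be mirrored alongside the page (variable names borrowed from your snippets):

        use HTML::LinkExtor;
        use URI;

        my @images;
        my $parser = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            # Resolve each img src against the page URL so the images
            # can be fetched and stored next to the page itself.
            push @images, URI->new_abs($attr{src}, $url_uri)
                if $tag eq 'img' and defined $attr{src};
        });
        $parser->parse($response->content);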

        As for convert not doing a good job, you give up too easily (plus I'm sure there are other ways of manipulating PostScript files).


        MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
        I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
        ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Automated Screenshot Grabbing via LWP?
by zby (Vicar) on Apr 07, 2003 at 08:24 UTC
    If I understand you correctly, what you want to do is not grabbing an image but rendering the grabbed page to an image. As far as I know, an HTML page can only be rendered by a web browser. It should be possible to automate Mozilla and a screen grabber for the task, but of course this would use a great deal of computing power.
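    Something along these lines might work on an X11 box with ImageMagick installed (entirely untested; the window title and render delay are guesses):

    # Entirely untested sketch: launch Mozilla, give the page time to
    # render, then grab its window with ImageMagick's import(1).
    system("mozilla '$url' &");
    sleep 20;   # crude wait for rendering
    system(qq{import -window "Mozilla" -resize 160x160 preview.png});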
Re: Automated Screenshot Grabbing via LWP?
by simon.proctor (Vicar) on Apr 07, 2003 at 09:20 UTC
    The only way I have done this is to use IE and print the page to a file (i.e. a .prn file). I then ran this through ghostscript to get a viewable page.
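    The ghostscript step scripts easily enough; an untested sketch, assuming the .prn came from a PostScript printer driver, with the device, resolution and filenames as assumptions:

    # Untested sketch: rasterise the printed PostScript with gs.
    system('gs', '-dBATCH', '-dNOPAUSE', '-dSAFER',
           '-sDEVICE=png16m', '-r72',
           '-sOutputFile=preview.png', 'page.prn') == 0
        or warn "gs failed: $?";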

    I haven't spent any time on this but plan to automate it at some point. In fact, I think this was discussed recently, so Super Searching may help. An alternative is to automate loading URLs in a browser and doing a printscreen, but that sounds worse :)

    As far as I know, using a browser to create the screenshot is the only way you can do it.
Re: Automated Screenshot Grabbing via LWP? (rendering, not just fetching)
by Aristotle (Chancellor) on Apr 07, 2003 at 08:45 UTC
    Yep, what zby said - you'll have to have the page rendered. Mozilla would be a hog though. Dillo or ELinks might be more viable alternatives, although of course they won't render complex layouts nearly as faithfully.

    Makeshifts last the longest.

Re: Automated Screenshot Grabbing via LWP?
by benn (Vicar) on Apr 07, 2003 at 09:08 UTC
    Not being a windowsy perlie, I couldn't tell you the exact route, but maybe this would be possible using *gulp* OLE automation of the various MS HTML-rendering objects (MSIE, Word, etc.) and then printing to a file.
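    Perhaps something along these lines, completely untested, with the ExecWB constants as best I recall them:

    use Win32::OLE;

    # Completely untested sketch: drive MSIE via OLE, wait for the
    # page, then print it without prompting (the default printer
    # could be a print-to-file/PostScript driver).
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Can't start MSIE: ", Win32::OLE->LastError;
    $ie->{Visible} = 0;
    $ie->Navigate($url);
    sleep 1 while $ie->{Busy};   # crude wait for the page to load
    $ie->ExecWB(6, 2);           # OLECMDID_PRINT, ...DONTPROMPTUSER
    $ie->Quit;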
    Cheers,
    ben