cleverett has asked for the wisdom of the Perl Monks concerning the following question:

I have a need to spider websites to look for certain HTML elements and determine their X/Y coordinates when rendered in a browser.

I can think of two approaches:

1. Use a client workstation and a custom web application to download the page, use JavaScript to find the HTML elements through the DOM, then report the coordinates back and fetch the next page.

Pros: I can do this without a steep learning curve.
Cons: Ties up a workstation doing a batch job. Kludgey. Javascript.

2. I think Mozilla XPCOM can do this, and there even exists a Perl interface to XPCOM.

Pros: single program, single computer solution.
Cons: I don't know for sure that XPCOM can actually do this; I have posted a question over there to make sure. A wicked learning curve if it can.

My question is, have I missed an alternative approach?

Re: Unique spidering need
by Roger (Parson) on Jan 23, 2004 at 01:50 UTC
    You could use WWW::Mechanize and read a couple of merlyn's articles on his Stonehenge website.

    If you give us more info, e.g. a link to a sample website to parse, the pattern you are looking for on the webpage, and the output format, I can do a little demo script for you later today. ;-)

    And yes, I am quite sure you can do this in a single Perl script, and it's free (but of course :-).

      Well, the WWW::Mechanize part I can handle. It would be an IFRAME or A tag with a given URL as the SRC or HREF attribute. Basically, I need to know whether it appears in the top-left 800x600 pixels of the browser window, but having the X/Y coords would be better.
        I would probably use libgtkhtml to do the HTML rendering part. I haven't got a Linux box handy to verify this at the moment, but I am sure that you would be able to get the level of control you seek with the library.

Re: Unique spidering need
by ViceRaid (Chaplain) on Jan 23, 2004 at 02:07 UTC

    WWW::Mechanize, although good, won't do what you want, I think: it doesn't deal with the visual presentation of a page. I don't know whether XPCOM would; I guess what you want is some way to interface to the layout engine.

    As a halfway house, and if you're on Windows, you might try something like Win32::OLE to automate an instance of Internet Explorer.

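    use Win32::OLE;

    # Win32::OLE->new returns undef on failure; the reason is in LastError, not $@
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die Win32::OLE->LastError();
    $ie->{'visible'} = 1;
    $ie->navigate("http://search.cpan.org/");

    # I think you need some kind of waiting loop here....
    # (polling IE's Busy property is one way to do it)
    sleep 1 while $ie->{'busy'};

    # You can access the DOM
    print $ie->{'document'}->{'links'}->{'length'}, "\n";   # JS: document.links.length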

    I picked this up from seeing it done in a similar way in Ruby; hopefully you could get hold of any of the document.height properties you wanted from IE. OK, you'll still have to do a bit of JavaScript, but one way or another you're going to have to talk to a rendering engine, and it'll probably be more merciful going if it's in Perl and JavaScript than Gecko bindings. But horses for courses...

    cheers
    ViceRaid
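
A minimal sketch of how the X/Y coordinates might be pulled out of IE along these lines. It assumes the DOM properties offsetLeft, offsetTop and offsetParent are reachable through OLE in the same hash style as document.links above; the helper name absolute_position is made up for illustration, and the Windows-only part has not been verified against a live browser.

```perl
use strict;
use warnings;

# Sum offsetLeft/offsetTop up the offsetParent chain to get page coordinates.
# Works on anything that can be read like a hash reference with those keys,
# which should include Win32::OLE objects as well as plain hashes.
sub absolute_position {
    my ($el) = @_;
    my ($x, $y) = (0, 0);
    while ($el) {
        $x += $el->{offsetLeft} || 0;
        $y += $el->{offsetTop}  || 0;
        $el = $el->{offsetParent};    # undef once we run out of ancestors
    }
    return ($x, $y);
}

# Windows-only part: only runs when a URL is given on the command line.
if ($^O eq 'MSWin32' && @ARGV) {
    require Win32::OLE;
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Cannot start Internet Explorer\n";
    $ie->{Visible} = 1;
    $ie->Navigate($ARGV[0]);
    sleep 1 while $ie->{Busy};        # crude wait for the page to load

    my $links = $ie->{Document}{links};
    for my $i (0 .. $links->{length} - 1) {
        my $link = $links->item($i);
        my ($x, $y) = absolute_position($link);
        printf "%s at (%d, %d)\n", $link->{href}, $x, $y;
    }
    $ie->Quit;
}
```

Because the helper only reads hash keys, it can be tried out on nested plain hashes standing in for DOM nodes before going anywhere near a Windows box. The numbers it reports are only valid for that one browser at that one window size.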

Re: Unique spidering need
by CountZero (Bishop) on Jan 23, 2004 at 07:12 UTC
    I do not know if you can do this, with Perl or otherwise, but it looks to me as if it goes fully against the grain of what HTML ought to be.

    I always thought that HTML has nothing to do with the actual rendering of the page, and that on widely different types of output device the page does indeed render very differently. Perhaps someone is looking at it with a text-only browser and someone else has a high-definition graphical workstation with a super-large screen. If you look through a windowed application, you can resize your window and the rendering engine should recalculate how it shows the page in your window.

    Try it with this page: resize the window and see at which column the nodelets start. I can resize my window over a rather large range before I get horizontal scroll-bars.

    What the effects can be of the application of individual CSS-files or font-sizes or ..., makes your task more of a guessing game.

    So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Thanks for pointing out my unspoken assumptions. But I think this is a bit picky. I just want to know what the user sees.

      So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size.

      Mmmph. I haven't noticed any gross differences 'twixt IE and Moz in terms of where things go.

      What the effects can be of the application of individual CSS-files or font-sizes or ..., makes your task more of a guessing game.

      So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size

      That's why I wanted a real layout engine to work with ...

      Still looking at Gtk::HTML.

        I just want to know what the user sees.
        What browser? What platform? What version?

        The user might not "see" anything at all on a talking browser for the blind.

        The user might be viewing the web page on their advanced cell phone, which collapses things it recognizes as navbars into simple menus.

        Your question makes no sense in the context of the world wide web.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

        Howdy!

        I just want to know what the user sees.

        What assumptions are you able to reliably make about the browser settings of the user in question? If you are looking at a corporate intranet with every last detail locked down hard, you can probably make fairly detailed assumptions (although the size of the browser window is still hard to control).

        If all you need to know is "does this element render in the top/left 800x600 pixels?" you should be able to do a dummy rendering in a larger virtual window and see where stuff falls. Beyond that, you are trying to nail jello to a wall.

        yours,
        Michael
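
Once some rendering engine has reported an element's position and size, the "top/left 800x600" question reduces to a rectangle test. A rough sketch, assuming the coordinates are already in hand as a hash with x, y, w and h keys; the function names and that hash layout are invented for illustration and do not come from any library.

```perl
use strict;
use warnings;

# True if any part of the box falls inside the top-left view_w x view_h region.
sub in_viewport {
    my ($box, $view_w, $view_h) = @_;
    $view_w = 800 unless defined $view_w;
    $view_h = 600 unless defined $view_h;

    # Starts to the right of or below the region
    return 0 if $box->{x} >= $view_w || $box->{y} >= $view_h;
    # Ends to the left of or above the region
    return 0 if $box->{x} + $box->{w} <= 0 || $box->{y} + $box->{h} <= 0;
    return 1;
}

# True only if the whole box fits inside the region.
sub fully_in_viewport {
    my ($box, $view_w, $view_h) = @_;
    $view_w = 800 unless defined $view_w;
    $view_h = 600 unless defined $view_h;

    return ( $box->{x} >= 0
          && $box->{y} >= 0
          && $box->{x} + $box->{w} <= $view_w
          && $box->{y} + $box->{h} <= $view_h ) ? 1 : 0;
}
```

Whether "appears" should mean partly or wholly visible is a decision for the original poster, which is why both tests are shown. The answer still holds only for the window size and browser that produced the coordinates.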