Unique spidering need

cleverett has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Unique spidering need by Roger (Parson) on Jan 23, 2004 at 01:50 UTC
You could use WWW::Mechanize and read a couple of merlyn's articles on his Stonehenge website. If you give us more info, e.g. a link to a sample website to parse, the pattern you are looking for on the page, and the output format, I can put together a little demo script for you later today. ;-) And yes, I am quite sure you can do this in a single Perl script, and it's free (but of course :-).
Re: Re: Unique spidering need by cleverett (Friar) on Jan 23, 2004 at 05:28 UTC
Well, the WWW::Mechanize part I can handle. It would be an IFRAME or A tag with a given URL as the SRC or HREF attribute. Basically, I need to know whether it appears in the top-left 800x600 pixels of the browser window, but having the X/Y coordinates would be better.
Re: Re: Re: Unique spidering need by Roger (Parson) on Jan 23, 2004 at 05:45 UTC
I would probably use the *libgtkhtml* library to do the HTML rendering part. I haven't got a Linux box handy to verify this at the moment, but I am sure that you would be able to get the level of control you seek with the library.
Re**4: Unique spidering need by cleverett (Friar) on Jan 23, 2004 at 06:10 UTC
Re: Unique spidering need by ViceRaid (Chaplain) on Jan 23, 2004 at 02:07 UTC
WWW::Mechanize, although good, won't do what you want, I think: it doesn't deal with the visual presentation of a page. I don't know whether XPCOM would; I guess what you want is some way to interface with the layout engine. As a halfway house, and if you're on Windows, you might try something like Win32::OLE to automate an instance of Internet Explorer.

```perl
use Win32::OLE;

# Win32::OLE->new reports failures through LastError(), not $@
my $ie = Win32::OLE->new('InternetExplorer.Application')
    or die Win32::OLE->LastError();
$ie->{'visible'} = 1;
$ie->navigate("http://search.cpan.org/");

# I think you need some kind of waiting loop here....
# (assumption: polling IE's Busy property is one way to do it)
sleep 1 while $ie->{'Busy'};

# You can access the DOM
print $ie->{'document'}->{'links'}->{'length'}, "\n";   # JS : document.links.length
```

I picked this up from seeing it done in a similar way in Ruby; hopefully you could get hold of any of the document.height properties you wanted from IE. OK, you'll still have to do a bit of JavaScript, but one way or another you're going to have to talk to a rendering engine, and it'll probably be more merciful going if it's in Perl and JavaScript than in Gecko bindings. But horses for courses..

cheers
ViceRaid
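To turn the DOM access suggested above into X/Y coordinates, one common approach is to sum `offsetLeft` and `offsetTop` along the `offsetParent` chain of the element. The following is only a sketch under stated assumptions: `absolute_position` is a hypothetical helper (not part of Win32::OLE), it assumes the element exposes those three DOM properties as hash keys the way Win32::OLE objects do, and it uses plain hashes as a stand-in for a real page so that it runs anywhere. It ignores scrolling and borders.

```perl
use strict;
use warnings;

# Hypothetical helper: sum offsetLeft/offsetTop along the offsetParent
# chain to estimate an element's position relative to the document.
# Works on anything that exposes those three properties as hash keys.
sub absolute_position {
    my ($element) = @_;
    my ($x, $y) = (0, 0);
    while ($element) {
        $x += $element->{offsetLeft} || 0;
        $y += $element->{offsetTop}  || 0;
        $element = $element->{offsetParent};
    }
    return ($x, $y);
}

# Stand-in for a DOM fragment: a link inside a table cell inside the body.
my $body = { offsetLeft => 0,   offsetTop => 0,  offsetParent => undef };
my $cell = { offsetLeft => 120, offsetTop => 80, offsetParent => $body };
my $link = { offsetLeft => 15,  offsetTop => 4,  offsetParent => $cell };

my ($x, $y) = absolute_position($link);
print "link at ($x, $y)\n";   # prints "link at (135, 84)"
```

With a live IE instance, the element passed in would presumably come from something like the `links` collection shown above; whether every property behaves identically through OLE is untested here.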
Re: Unique spidering need by CountZero (Bishop) on Jan 23, 2004 at 07:12 UTC
I do not know if you can do this, with Perl or otherwise, but it looks to me as though it goes fully against the grain of what HTML ought to be. I always thought that HTML has nothing to do with the actual rendering of the page, and that on widely different types of output devices the page does indeed render very differently. Perhaps someone is looking at it with a text-only browser and someone else has a high-definition graphical workstation with a super-large screen. If you look through a windowed application, you can resize your window and the rendering engine should recalculate how it shows the page in your window. Try it with this page: resize the window and see at which column the nodelets start. I can resize my window over a rather large range before I get horizontal scroll bars. What the effects can be of the application of individual CSS-files or font-sizes or ..., makes your task more of a guessing game. So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size.

CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Re: Unique spidering need by cleverett (Friar) on Jan 23, 2004 at 07:33 UTC
Thanks for pointing out my unspoken assumptions. But I think this is a bit picky. I just want to know what the user sees.

"So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size."

Mmmph. Pretty much, I haven't noticed gross differences 'twixt IE and Moz in terms of where things go.

"What the effects can be of the application of individual CSS-files or font-sizes or ..., makes your task more of a guessing game. So the best you can hope for is to know what the X/Y coordinates are on a particular system with a particular window size"

That's why I wanted a real layout engine to work with ... Still looking at Gtk::HTML.
•Re: Re: Re: Unique spidering need by merlyn (Sage) on Jan 23, 2004 at 13:28 UTC
"I just want to know what the user sees."

What browser? What platform? What version? The user might not "see" anything at all on a talking browser for the blind. The user might be viewing the web page on their advanced cell phone, which collapses things it recognizes as navbars into simple menus. Your question makes no sense in the context of the World Wide Web.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
Re: •Re: Re: Re: Unique spidering need by cleverett (Friar) on Jan 24, 2004 at 03:05 UTC
Re: Re: Re: Unique spidering need by herveus (Prior) on Jan 23, 2004 at 13:19 UTC
Howdy!

"I just want to know what the user sees."

What assumptions are you able to reliably make about the browser settings of the user in question? If you are looking at a corporate intranet with every last detail locked down hard, you can probably make fairly detailed assumptions (although the size of the browser window is still hard to control). If all you need to know is "does this element render in the top/left 800x600 pixels?" you should be able to do a dummy rendering in a larger virtual window and see where stuff falls. Beyond that, you are trying to nail jello to a wall.

yours,
Michael
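Once some rendering engine has reported where an element falls, the "top/left 800x600" question above reduces to a rectangle test. A minimal sketch, assuming the bounding box is already available as a hash with `x`, `y`, `width` and `height` in pixels; `viewport_status` and the sample boxes are hypothetical and only illustrate the comparison, not any particular engine's output.

```perl
use strict;
use warnings;

# Hypothetical helper: classify an element's bounding box against a
# viewport anchored at the top-left corner of the page.
# Returns 'inside', 'partial', or 'outside'.
sub viewport_status {
    my ($box, $view_w, $view_h) = @_;
    my ($left, $top) = ($box->{x}, $box->{y});
    my $right  = $left + $box->{width};
    my $bottom = $top  + $box->{height};

    return 'outside' if $left >= $view_w || $top >= $view_h
                     || $right <= 0      || $bottom <= 0;
    return 'inside'  if $left >= 0 && $top >= 0
                     && $right <= $view_w && $bottom <= $view_h;
    return 'partial';
}

# Made-up boxes for a 468x60 element at three different positions.
print viewport_status({ x => 135, y => 84,  width => 468, height => 60 }, 800, 600), "\n";  # inside
print viewport_status({ x => 700, y => 550, width => 468, height => 60 }, 800, 600), "\n";  # partial
print viewport_status({ x => 0,   y => 900, width => 468, height => 60 }, 800, 600), "\n";  # outside
```

Distinguishing "partial" from "inside" is a design choice: an element that straddles the 800x600 boundary may or may not count as visible, depending on what the check is for.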