Is there a way to get data from a queried web site without having to parse the resulting HTML?

devgoddess has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

This is my first post. I hope it's not TOO awful. :)

I'm currently trying to finish a rather nasty script I've been working on for a while. I'm down to the last section of it, which entails sending search data to a web site and then writing the data retrieved by the web site to a flat file.

I'm pretty familiar with the LWP library at this point. I know how to use UserAgent and Request objects to grab raw HTML using a URL, and I can parse the HTML as well. Thing is, I don't want to have to do that.

The site I'm querying uses forms for input, so I could easily send the data I want to query on (e.g. widget 2345A) using a POST method in my Request. However, I've looked at the document source for this site's results page, and it's pretty hideous. I'm assuming that by using a Request object, I'm only going to get the automatically generated HTML result page back in the Response.

Is there any way I can submit the request more directly, without having to bother with simulating a form submission via a Request by a UserAgent? Is there any way I can just have the data returned to me sans HTML document (e.g. $description, $price, $size)? If LWP is NOT the correct library, what should I use?

I really have no idea what's running on the site's back end... Sorry! :(

Thanks to anybody who can help.

-------------------------------------------------
Dev Goddess
Developer / Analyst / Criminal Mastermind

"Size doesn't matter. It's all about speed and performance."

Re: Is there a way to get data from a queried web site without having to parse the resulting HTML?
by PodMaster (Abbot) on Apr 28, 2003 at 01:04 UTC

WWW::Mechanize

Screen-scraping with WWW::Mechanize
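
For what it's worth, here is a minimal sketch of what the WWW::Mechanize route might look like. The sub name, the URL you pass in, the assumption that the search form is the first form on the page, and the field name `query` are all placeholders made up for illustration; check the real form's source for the actual names. Note that you still get HTML back, so the values still have to be pulled out of it afterwards.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch the search page, fill in the form, and return
# the HTML of the results page. Nothing here is specific to any real site.
sub fetch_results_html {
    my ($search_url, $term) = @_;

    # autocheck makes Mechanize die on HTTP errors instead of failing silently.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($search_url);

    # Assumption: the search form is the first form on the page and its
    # text input is named "query".
    $mech->submit_form(
        form_number => 1,
        fields      => { query => $term },
    );

    return $mech->content;
}
```

You would call it as something like `my $html = fetch_results_html($url, 'widget 2345A');` and then hand `$html` to whatever does the extraction.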

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Is there a way to get data from a queried web site without having to parse the resulting HTML?
by The Mad Hatter (Priest) on Apr 28, 2003 at 01:00 UTC

Is there any way I can just have the data returned back to me sans HTML document?

Re: Is there a way to get data from a queried web site without having to parse the resulting HTML?
by Aristotle (Chancellor) on Apr 28, 2003 at 01:49 UTC

As an alternative to screen-scraping, you could write a quick logging HTTP proxy in Perl, set it as the proxy in your browser, and then use the page as you normally would so you can look at the requests it generates. Duplicating them programmatically by following the log output should then be trivial.

Ugh, completely misread your question.

No, that's unfortunately not possible unless the site's backend provides such a facility. (The code running PerlMonks, for example, can be told to return XML for many things.) If you're dealing with tables, you might want to take a gander at HTML::TableExtract. It has served me well in dealing with pages too ugly to dissect manually.
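
A rough sketch of the HTML::TableExtract approach, assuming the results come back in a table whose column headings are Description, Price and Size. Those headings and the sample HTML are invented for illustration; substitute whatever the real page uses.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up stand-in for the results page; the real one will be far messier.
my $html = <<'HTML';
<html><body>
<table border="1">
  <tr><th>Description</th><th>Price</th><th>Size</th></tr>
  <tr><td>Widget 2345A</td><td>19.95</td><td>Large</td></tr>
  <tr><td>Widget 2345B</td><td>24.50</td><td>Small</td></tr>
</table>
</body></html>
HTML

# Pick the table by its column headings rather than by its position in the
# page, so layout tables wrapped around it are ignored.
my $te = HTML::TableExtract->new(
    headers => [ 'Description', 'Price', 'Size' ],
);
$te->parse($html);

# Collect one comma-separated record per data row, ready for a flat file.
my @records;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # A cell can be undef when a row is shorter than the header row.
        push @records, join ',', map { defined $_ ? $_ : '' } @$row;
    }
}

print "$_\n" for @records;
```

Matching on headers tends to survive cosmetic changes to the page better than counting tables by depth and position, though it will obviously break if the site renames its columns.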

Makeshifts last the longest.
