Better way?

jen has asked for the wisdom of the Perl Monks concerning the following question:

I've got a project right now that involves parsing web pages and looking for specific pieces of data: for example, getting a page back from FedEx and looking for the ship date, delivery date, and weight of a package. The carriers can change the format at any time, which means the regular expressions I'm now using are likely to break, and often. But I can't think of a better way (which perhaps reveals the extent of my Perl knowledge). Is there a better way? It's not even possible to parse the pages by HTML tags, say, by looking for table tag groups, because there's all kinds of crazy HTML formatting "junk" in between. Any ideas welcome, thanks!

Re: Better way? by chromatic (Archbishop) on Jun 16, 2000 at 22:47 UTC
Sounds like you want an HTML parser. Try HTML::Parser or something similar on CPAN.
RE: Re: Better way? by jen (Novice) on Jun 17, 2000 at 01:17 UTC
I did, and, as far as I can tell, it's not helpful, because the HTML tags themselves are almost never meaningful in the pages we get back. For example, it's all well and good to be able to pick out the data between table tags, but then I still have to sort through the table data. (I think the problem is that, in my case, it's the data and not the HTML tags that are significant - HTML::Parser is good for cases where the tags are the significant piece. If someone has used HTML::Parser in a similar way, please let me know.)
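When the data matters and the tags do not, one possible use of a parser is to throw the markup away entirely and anchor on the visible labels instead. The sketch below is only an illustration, written in Python with the standard html.parser module rather than HTML::Parser. The sample page fragment, the label names, and the helper names (TextChunks, extract_fields) are all assumptions made up for this example; real carrier pages will differ.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect the visible text fragments of a page, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def extract_fields(html, labels):
    """Map each wanted label to the text fragment that follows it."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    found = {}
    for i, chunk in enumerate(parser.chunks[:-1]):
        key = chunk.rstrip(":").lower()
        if key in labels and key not in found:
            found[key] = parser.chunks[i + 1]
    return found


# Hypothetical page fragment with formatting "junk" around the data.
SAMPLE = """
<table><tr><td><font size="2"><b>Ship date:</b></font></td>
<td><font size="2">Jun 15, 2000</font></td></tr>
<tr><td><b>Weight:</b></td><td><i>4.5</i> lbs</td></tr></table>
"""

fields = extract_fields(SAMPLE, {"ship date", "weight"})
```

Because the lookup depends only on the label text and on the value coming right after it, changes to fonts, nesting, or table layout should not break it. A renamed label still will.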
RE: RE: Re: Better way? by merlyn (Sage) on Jun 17, 2000 at 02:13 UTC
Welcome to the reason that XML will eventually replace HTML (which is happening already). -- Randal L. Schwartz, Perl hacker
RE: RE: RE: Re: Better way? by Anonymous Monk on Jun 18, 2000 at 00:36 UTC
Re: Better way? by visnu (Sexton) on Jun 17, 2000 at 02:32 UTC
If you have the money (although none may be required), I'm sure FedEx has a supported (and documented) method of doing that sort of thing, without anyone needing to go and pilfer the same info off of their web page. Heck, they may even have a server set up somewhere with a specified protocol you can use to query about orders...
RE: Better way? by Q*bert (Sexton) on Jun 17, 2000 at 10:42 UTC
Not much more to say. Try to generalize the parser as much as possible by matching as little of the page as possible: the less surrounding markup a pattern depends on, the fewer format changes can break it. I think chromatic's suggestion of using an HTML parser, rather than dealing with the raw HTML directly, might make your code easier to change later. Also, set up some kind of monitoring so the code tells you when parsing breaks. Good luck! Sorry we couldn't offer you more help.
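The monitoring idea could look something like the following: after each page is parsed, check that every expected field is present and looks roughly sane, and complain loudly if not. This is a minimal sketch in Python, not a definitive implementation. The field names, the loose patterns in EXPECTED, and the names ParseBroken and check_fields are hypothetical and would need tuning to the real data.

```python
import re

# Hypothetical field names and deliberately loose sanity patterns.
EXPECTED = {
    "ship date": re.compile(r"\d{4}|\d{1,2}/\d{1,2}"),
    "delivery date": re.compile(r"\d{4}|\d{1,2}/\d{1,2}"),
    "weight": re.compile(r"\d+(\.\d+)?"),
}


class ParseBroken(Exception):
    """Raised when a page no longer yields the fields we rely on."""


def check_fields(fields):
    """Return fields unchanged, or raise ParseBroken listing what looks wrong."""
    problems = []
    for name, pattern in EXPECTED.items():
        value = fields.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not pattern.search(value):
            problems.append(f"{name}: unexpected value {value!r}")
    if problems:
        raise ParseBroken("; ".join(problems))
    return fields


# Example: a page whose format changed, so one field vanished and one is garbage.
try:
    check_fields({"ship date": "Jun 15, 2000", "weight": "n/a"})
except ParseBroken as err:
    alert = str(err)  # in practice, log or mail this instead of storing it
```

Keeping the checks loose is deliberate: they are meant to catch a changed page format, not to validate the carrier's data.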