jen has asked for the wisdom of the Perl Monks concerning the following question:

I've got a project right now that involves parsing web pages and looking for specific pieces of data - for example, getting a page back from FedEx and looking for the ship date, delivery date, and weight of a package. The carriers can change the format at any time, which means the regular expressions I'm now using are likely to break, and often. But I can't think of a better way (which perhaps reveals the extent of my Perl knowledge). Is there one? It's not even possible to parse the pages on HTML tags, say, by looking for table tag groups, because there are all kinds of crazy HTML formatting "junk" data in between. Any ideas welcome, thanks!

Re: Better way?
by chromatic (Archbishop) on Jun 16, 2000 at 22:47 UTC
    Sounds like you want an HTML Parser. Try HTML::Parser or something similar on CPAN.
      I did, and, as far as I can tell, it's not helpful, because the HTML tags themselves are almost never meaningful in the pages we get back. For example, it's all well and good to be able to pick out the data between table tags, but then I still have to sort through the table data.

      (I think the problem is that, in my case, it's the data and not the HTML tags that are significant - HTML::Parser is good for cases where the tags are the significant piece. If someone has used HTML::Parser in a similar way, please let me know.)
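
      One way HTML::Parser can still help when only the data matters is to use it to throw the markup away and keep the visible text, then match against that text. The following is a minimal sketch, not a tested solution for any real carrier page: the sample HTML, the labels "Ship date:" and "Weight:", and the helper name page_text are all made up for illustration, and it assumes HTML::Parser (version 3 API) is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Collapse a page to its visible text, discarding all markup.
# (page_text is a hypothetical helper name, not part of HTML::Parser.)
sub page_text {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities already decoded
        text_h => [ sub { push @chunks, $_[0] }, 'dtext' ],
    );
    $parser->unbroken_text(1);                    # do not split text runs
    $parser->ignore_elements(qw(script style));   # skip non-visible content
    $parser->parse($html);
    $parser->eof;

    my $text = join ' ', @chunks;
    $text =~ s/\s+/ /g;      # squeeze runs of whitespace
    $text =~ s/^ | $//g;     # trim the ends
    return $text;
}

# Hypothetical page fragment; real carrier markup will differ.
my $html = <<'HTML';
<table><tr><td><font size="2"><b>Ship date:</b></font></td>
<td><font color="#333333">Jun 14, 2000</font></td></tr>
<tr><td><b>Weight:</b></td><td>2.5 lbs</td></tr></table>
HTML

my $text = page_text($html);

# The patterns now depend on the labels a person sees, not on the tags.
my ($ship_date) = $text =~ /Ship date:\s*(\w+ \d{1,2}, \d{4})/;
my ($weight)    = $text =~ /Weight:\s*([\d.]+)\s*lbs/;

print "ship date: $ship_date, weight: $weight\n";
```

      With the formatting tags gone, a change from font tags to some other styling would not touch the patterns; only a change to the visible labels or value formats would.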
Re: Better way?
by visnu (Sexton) on Jun 17, 2000 at 02:32 UTC
    If you have the money (although none may be required), I'm sure FedEx has a supported (and documented) way of doing that sort of thing, without anyone needing to pilfer the same info off their web page. Heck, they may even have a server set up somewhere with a specified protocol you can use to query about orders... (???)
RE: Better way?
by Q*bert (Sexton) on Jun 17, 2000 at 10:42 UTC
    Not much more to say. Try to generalize the parser as much as possible (by matching as little as possible). I think chromatic's suggestion of using an HTML parser, rather than dealing with the raw HTML directly, might make your code easier to change later. Also, set up some kind of monitoring so the code tells you when parsing breaks.
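
    As a rough sketch of "match as little as possible, and make the code tell you when parsing breaks": keep one loose pattern per field, anchored on the visible label, and report any field that fails to match. Everything specific here is an assumption for illustration: the field names, the labels, the date and weight formats, the helper name extract_fields, and the sample text (which stands for page text after the markup has been removed). It uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical field table: one deliberately loose pattern per value.
# Each pattern anchors on the label and captures only the value itself.
my %pattern_for = (
    ship_date     => qr/Ship\s+date\W+(\w+\s+\d{1,2},\s+\d{4})/i,
    delivery_date => qr/Deliver(?:y|ed)\s+date\W+(\w+\s+\d{1,2},\s+\d{4})/i,
    weight        => qr/Weight\W+([\d.]+)/i,
);

# Returns (\%found, \@missing) so the caller can tell a partial parse
# from a complete one instead of silently using undefined values.
sub extract_fields {
    my ($text) = @_;
    my (%found, @missing);
    for my $field (sort keys %pattern_for) {
        if ($text =~ $pattern_for{$field}) {
            $found{$field} = $1;
        }
        else {
            push @missing, $field;
        }
    }
    return (\%found, \@missing);
}

# Hypothetical page text in which the delivery date label has changed.
my $text = 'Ship date: Jun 14, 2000 Arrived on: Jun 16, 2000 Weight: 2.5 lbs';

my ($found, $missing) = extract_fields($text);

if (@$missing) {
    # A real script might send mail or write to a log; warn is a stand-in.
    warn "carrier page format may have changed; no match for: @$missing\n";
}
```

    Keeping the patterns in one table means a format change is a one-line fix, and the list of missing fields says which line to fix.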

    Good luck! Sorry we couldn't offer you more help.