in reply to Regex/Pattern Matching

This reminds me of a little personal project of mine: parsing online TV schedules and turning them all into a single, simple format. I'll describe the basic skeleton here, leaving the details for you to fill in.

Basically, for each style of website with programme listings, I have a different class (package). I placed them under the "Channel" hierarchy, so the methods to parse the BBC schedule listings, for example, are in the package Channel::BBC. Each "style" of HTML pages has its own package.

Each of these Channel::* packages has the same few class methods: "parse" for extracting the schedule, "date" for checking the date on the page, and so on. You get the idea. The important part is that the API is identical across all these packages.

Now it's possible to make a generic call to parse a page from the BBC, like this:

my $class = 'BBC';
my @programme = "Channel::$class"->parse($html);
As you can see, one statement does the parsing, and it can be the exact same statement irrespective of which style of page it is, the class being a variable (parameter). Actually, the parameters (including the $class) are passed in a big array of hashes, and one loop just fetches, checks, and parses all the different HTML pages, and eventually builds a new HTML file for each from the result.
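To make the loop concrete, here is a minimal sketch of the idea. The two packages, their page formats, and the inline sample pages are all hypothetical stand-ins (the real project fetched the pages and wrote out HTML files, which is left out here); only the Channel::* naming and the "parse" and "date" method names come from the description above.

```perl
use strict;
use warnings;

# Hypothetical parser for one style of page: date on its own line,
# then one "HH:MM Title" entry per line.
package Channel::BBC;
sub date {
    my ($class, $page) = @_;
    return $page =~ /^(\d{4}-\d\d-\d\d)$/m ? $1 : undef;
}
sub parse {
    my ($class, $page) = @_;
    return grep { /^\d\d:\d\d / } split /\n/, $page;
}

# Hypothetical parser for a second style: fields separated by "|",
# date written as DD/MM/YYYY.
package Channel::ITV;
sub date {
    my ($class, $page) = @_;
    return $page =~ m{(\d\d)/(\d\d)/(\d{4})} ? "$3-$2-$1" : undef;
}
sub parse {
    my ($class, $page) = @_;
    return grep { /^\d\d:\d\d / } split /\|/, $page;
}

package main;

# One hash per page; "page" holds inline sample text in place of a fetch.
my @sources = (
    { class => 'BBC', page => "2004-01-09\n18:00 News\n19:00 EastEnders\n" },
    { class => 'ITV', page => "09/01/2004|18:30 News|19:30 Coronation Street" },
);

my %schedule;
for my $source (@sources) {
    my $package = "Channel::$source->{class}";
    # Check the date first, then parse: the same two calls for every style.
    next unless defined $package->date($source->{page});
    $schedule{ $source->{class} } = [ $package->parse($source->{page}) ];
}

print "$_: @{ $schedule{$_} }\n" for sort keys %schedule;
```

Adding support for another site then means writing one more package with the same two methods and adding one more hash to the list; the loop itself does not change.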

Do note that the above snippet works under strict, even though it is, in effect, a symbolic reference: the class name is looked up from a string at runtime.
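A small sketch of the difference, as I understand it: a method call on a class name held in a string is accepted under strict, while calling the sub through its name as a string trips "strict refs". The Channel::BBC package and its return value here are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical minimal parser class, just enough to have something to call.
package Channel::BBC;
sub parse { my ($class, $html) = @_; return "parsed by $class" }

package main;

my $class = 'BBC';

# Method call on a class name held in a string: allowed under strict.
my $via_method = "Channel::$class"->parse('<html></html>');

# Calling the sub through its name as a string is a plain symbolic
# reference, and strict refs refuses it at runtime.
my $via_symref = eval {
    my $name = "Channel::${class}::parse";
    &$name("Channel::$class", '<html></html>');
};
my $error = $@;

print "method call:  $via_method\n";
print "symbolic ref: ", (defined $via_symref ? "$via_symref\n" : "died: $error");
```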

Now, how can you parse an HTML file? The way I'd do it now is with HTML::TokeParser::Simple. Just look out for a specific tag, e.g. "form" or "table", then maybe one more, etc., and then finally grab the data you need. Don't worry about other styles of pages; you just have to be able to process this style of page.
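As a rough illustration of that "skip to a landmark tag, then grab" approach, here is a sketch that jumps to the first table and collects the text of each cell. The HTML fragment is invented for the example, and a real page will need its own choice of landmark tags.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical page fragment, standing in for a downloaded listing.
my $html = <<'HTML';
<html><body>
<p>Intro text we do not care about.</p>
<table>
<tr><td>18:00</td><td>News</td></tr>
<tr><td>19:00</td><td>EastEnders</td></tr>
</table>
</body></html>
HTML

# A scalar reference tells the parser to read from the string itself.
my $parser = HTML::TokeParser::Simple->new(\$html);

# Skip ahead to the landmark tag; everything before it is ignored.
$parser->get_tag('table') or die "no table found";

my (@rows, @cells);
while (my $token = $parser->get_token) {
    last if $token->is_end_tag('table');
    if ($token->is_start_tag('td')) {
        # Text from here up to the closing </td>, whitespace trimmed.
        push @cells, $parser->get_trimmed_text('/td');
    }
    elsif ($token->is_end_tag('tr')) {
        # End of a row: store its cells and start a fresh list.
        push @rows, [@cells];
        @cells = ();
    }
}

print join(' ', @$_), "\n" for @rows;
```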

Do remember that the first parameter to class methods like "parse" will be the package name, so don't forget to shift it off @_ before you use the remaining arguments.
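For instance, a bare-bones method (hypothetical, it only hands back its arguments) shows what arrives in @_:

```perl
use strict;
use warnings;

package Channel::BBC;

sub parse {
    my $class = shift;    # 'Channel::BBC', supplied by the -> call
    my ($html) = @_;      # what the caller actually passed
    return ($class, $html);
}

package main;

my ($class, $html) = Channel::BBC->parse('<html>listing</html>');
print "class: $class\nhtml:  $html\n";
```

If the shift is forgotten, the package name ends up being treated as the HTML.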

Re: Re: Regex/Pattern Matching
by shu (Initiate) on Jan 09, 2004 at 02:42 UTC
    Hi Bart, thanks for your suggestions. I'm very new to Perl, so I'm not even sure how to use the modules properly, let alone create my own classes. Anyway, I'll see what I can do, though I didn't quite understand the concept of the BBC class. As for grabbing the data, I did use HTML::TokeParser::Simple, LWP::UserAgent and HTML::Parser, but once I get the HTML into an external file on my hard disk I don't know how to search for the headings and grab only the data under a particular heading. The data is like HEADING 1 <data> <data> PUBLICATION <data> <data> <data> RESEARCH <data> ... and so on, and I need to grab the data under PUBLICATION. The heading may differ from page to page. I was trying to match 'pub', but there are other headings that might also contain those letters! It would be great if you could help me with a code snippet. Thanks a lot, Shuchi