in reply to Regex/Pattern Matching
Basically, for each style of website with programme listings, I have a different class (package). I placed them under the "Channel" hierarchy, so the methods to parse the BBC schedule listings, for example, are in the package Channel::BBC. Each "style" of HTML pages has its own package.
Each of these Channel::* packages has a few class methods: one is "parse", for extracting the schedule; another is "date", for checking the date on the page... You get the idea. The important part is that the API is identical across all these packages.
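To make the shared API concrete, here is a minimal sketch with two such packages. Channel::ITV, the sample markup, and the regexes are all made up for illustration; they only stand in for whatever parsing a real page needs.

```perl
use strict;
use warnings;

# Hypothetical package for one style of page.
package Channel::BBC;

# Class method: the first argument is the package name, the rest is the page.
sub parse {
    my ($class, $html) = @_;
    # Assumed markup: one <li class="prog">...</li> per programme.
    return $html =~ m{<li class="prog">(.*?)</li>}g;
}

sub date {
    my ($class, $html) = @_;
    my ($date) = $html =~ m{<h1>(\d{4}-\d\d-\d\d)</h1>};
    return $date;
}

# Hypothetical second package: different page style, same API.
package Channel::ITV;

sub parse {
    my ($class, $html) = @_;
    return $html =~ m{<td>(.*?)</td>}g;
}

sub date {
    my ($class, $html) = @_;
    my ($date) = $html =~ m{<caption>(\d{4}-\d\d-\d\d)</caption>};
    return $date;
}

package main;

my $bbc_html = '<h1>2004-01-09</h1><li class="prog">News</li><li class="prog">Weather</li>';
my @bbc      = Channel::BBC->parse($bbc_html);
my $bbc_date = Channel::BBC->date($bbc_html);
print "$bbc_date: @bbc\n";
```

Because both packages answer to the same method names, the calling code never needs to know which style of page it is dealing with.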
Now, it's possible to do a generic call to parse a page from the BBC, like this:
    my $class = 'BBC';
    my @programme = "Channel::$class"->parse($html);
As you can see, there's one statement to do the parsing, and it can be the exact same statement, with the class being a variable (parameter), irrespective of which style of page it is. Actually, the parameters (including the $class) are passed in a big Array Of Hashes, and one loop just fetches, checks, and parses all the different HTML pages, and eventually builds a new HTML file for each from the result.
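The loop over the Array Of Hashes might look roughly like this. It is only a sketch: the hash keys (class, name, html), the stub packages, and the inline page text are assumptions, and the fetching step is replaced by text stored directly in each hash.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real Channel::* packages.
package Channel::BBC;
sub date  { my ($class, $html) = @_; return ($html =~ /date=(\S+)/)[0] }
sub parse { my ($class, $html) = @_; return $html =~ /prog=(\w+)/g }

package Channel::ITV;
sub date  { my ($class, $html) = @_; return ($html =~ /on:(\S+)/)[0] }
sub parse { my ($class, $html) = @_; return $html =~ /show:(\w+)/g }

package main;

# One hash per page. The "html" key holds inline text here;
# a real script would fetch the page instead.
my @pages = (
    { class => 'BBC', name => 'bbc1', html => 'date=2004-01-09 prog=News prog=Weather' },
    { class => 'ITV', name => 'itv1', html => 'on:2004-01-09 show:Film' },
);

my $today = '2004-01-09';
my %output;
for my $page (@pages) {
    my $package = "Channel::$page->{class}";

    # Check: skip pages that are not for the expected date.
    my $date = $package->date($page->{html});
    next unless defined $date && $date eq $today;

    # Parse: the same statement for every style of page.
    my @programme = $package->parse($page->{html});

    # Build: a trivial HTML list stands in for the real output file.
    $output{ $page->{name} } = join '', map { "<li>$_</li>" } @programme;
}

print "$_: $output{$_}\n" for sort keys %output;
```

Adding a new style of page then only means writing one more package and adding one more hash to the list.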
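Do note that the above snippet works under strict, even though it looks the package up by name, much like a symbolic reference: strict refs does not restrict method calls on a class name held in a string.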
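A small experiment shows where strict draws the line. The Channel::BBC stub here is hypothetical; the point is only to contrast a method call on a string with calling a sub through its name.

```perl
use strict;
use warnings;

package Channel::BBC;    # hypothetical stand-in
sub parse { my ($class, $html) = @_; return "parsed by $class" }

package main;

my $class = 'BBC';

# Method call on a class name held in a string: allowed under strict.
my $via_method = "Channel::$class"->parse('<html></html>');

# Calling the sub through its name as a string is a plain symbolic
# reference, which "strict refs" rejects at run time.
my $sub_name   = "Channel::${class}::parse";
my $via_symref = eval { $sub_name->("Channel::$class", '<html></html>') };
my $error      = $@;

print "method call: $via_method\n";
print "symbolic call failed: $error" if $error;
```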
Now, how can you parse an HTML file? The way I'd do it now is with HTML::TokeParser::Simple. Just look out for a specific tag, e.g. "form" or "table", then maybe one more, etc., and then finally grab the data you need. Don't worry about other styles of pages; you just have to be able to process this one style of page.
Do remember that the first parameter to the methods, like "parse", will be the package name, so don't forget to shift it off @_ before using the rest of the arguments.
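Putting the last two points together, a "parse" method built on HTML::TokeParser::Simple could be sketched like this. The markup (programmes in the cells of the first table) is an assumption for illustration, not the layout of any real listings page, and the module has to be installed from CPAN.

```perl
use strict;
use warnings;

package Channel::BBC;    # hypothetical package and markup
use HTML::TokeParser::Simple;

sub parse {
    # The package name arrives first; only $html is used below.
    my ($class, $html) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);

    # Skip ahead to the first <table> tag.
    while (my $token = $parser->get_token) {
        last if $token->is_start_tag('table');
    }

    # Collect the text of each <td> until that table closes.
    my @programme;
    while (my $token = $parser->get_token) {
        last if $token->is_end_tag('table');
        next unless $token->is_start_tag('td');
        push @programme, $parser->get_trimmed_text('/td');
    }
    return @programme;
}

package main;

my $html = <<'HTML';
<html><body><p>Intro text</p>
<table>
<tr><td>18:00 News</td></tr>
<tr><td>18:30 Weather</td></tr>
</table>
<table><tr><td>ignored</td></tr></table>
</body></html>
HTML

my @programme = Channel::BBC->parse($html);
print "$_\n" for @programme;
```

Stopping at the closing table tag means anything after the first table, such as the second one in the sample, never reaches the result.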
Re: Re: Regex/Pattern Matching
by shu (Initiate) on Jan 09, 2004 at 02:42 UTC