It reminds me of a little project I did for personal reasons: parsing online TV schedules and turning them all into a single, simple format. I'll describe the basic skeleton here, leaving the details for you to fill in.

Basically, for each style of website with programme listings, I have a different class (package). I placed them under the "Channel" hierarchy, so the methods to parse the BBC schedule listings, for example, are in the package Channel::BBC. Each "style" of HTML pages has its own package.

Each of these Channel::* packages has a few class methods: one is "parse", for extracting the schedule; another is "date", to be able to check the date on the page... You get the idea. The important part is that the API is identical across all these packages.
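As a minimal sketch of what one such package could look like: the markup, the regexes, and the hash keys "time" and "title" below are my own assumptions for illustration, not the real BBC page layout. Only the package name Channel::BBC and the method names "parse" and "date" come from the description above.

```perl
use strict;
use warnings;

package Channel::BBC;

# Class method: extract the programmes from one page.
# The first argument is the package name, so take it off first.
sub parse {
    my ($class, $html) = @_;
    my @programme;
    # Hypothetical markup: <li>HH:MM Title</li>
    while ( $html =~ m{<li>(\d\d:\d\d)\s+(.*?)</li>}g ) {
        push @programme, { time => $1, title => $2 };
    }
    return @programme;
}

# Class method: return the date shown on the page, or undef.
# Hypothetical markup: <h1>YYYY-MM-DD</h1>
sub date {
    my ($class, $html) = @_;
    return $html =~ m{<h1>(\d{4}-\d\d-\d\d)</h1>} ? $1 : undef;
}

package main;

# Made-up sample page, only to show the calls.
my $html = '<h1>2004-05-01</h1><ul><li>18:00 News</li><li>18:30 Weather</li></ul>';

my @programme = Channel::BBC->parse($html);
my $date      = Channel::BBC->date($html);
print "$date: $_->{time} $_->{title}\n" for @programme;
```

Every other Channel::* package would offer the same two methods with the same arguments, whatever it does internally.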

Now, it's possible to make a generic call to parse a page from the BBC, like this:

    my $class = 'BBC';
    my @programme = "Channel::$class"->parse($html);

As you can see, one statement does the parsing, and it is the exact same statement whatever the style of page, because the class is a variable (a parameter). In practice, the parameters (including the $class) are passed in one big array of hashes, and a single loop fetches, checks, and parses all the different HTML pages, and eventually builds a new HTML file for each from the result.
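A rough sketch of such a loop, under my own assumptions: the second package Channel::Other, the hash keys (name, class, html, date), and the inline sample pages are all invented for illustration. The html field stands in for the fetch step so that the example runs without a network.

```perl
use strict;
use warnings;

package Channel::BBC;    # hypothetical markup: <h1>date</h1>, <li>time title</li>
sub date  { my ($class, $html) = @_; return $html =~ m{<h1>([^<]+)</h1>} ? $1 : undef }
sub parse {
    my ($class, $html) = @_;
    my @p;
    push @p, +{ time => $1, title => $2 } while $html =~ m{<li>(\S+) ([^<]+)</li>}g;
    return @p;
}

package Channel::Other;  # hypothetical second style of page: a table
sub date  { my ($class, $html) = @_; return $html =~ m{<title>([^<]+)</title>} ? $1 : undef }
sub parse {
    my ($class, $html) = @_;
    my @p;
    push @p, +{ time => $1, title => $2 }
        while $html =~ m{<tr><td>([^<]+)</td><td>([^<]+)</td></tr>}g;
    return @p;
}

package main;

# The big array of hashes: one entry per page to process.
my @sources = (
    { name => 'bbc1',  class => 'BBC',   date => '2004-05-01',
      html => '<h1>2004-05-01</h1><ul><li>18:00 News</li></ul>' },
    { name => 'other', class => 'Other', date => '2004-05-01',
      html => '<title>2004-05-01</title><table><tr><td>20:00</td><td>Film</td></tr></table>' },
    { name => 'stale', class => 'BBC',   date => '2004-05-02',   # page shows the wrong day
      html => '<h1>2004-05-01</h1><ul><li>18:00 News</li></ul>' },
);

my %output;
for my $src (@sources) {
    my $package = "Channel::$src->{class}";
    my $html    = $src->{html};              # stand-in for fetching the page

    # Check: skip pages that do not show the date we want.
    my $found = $package->date($html);
    next unless defined $found && $found eq $src->{date};

    # Parse: the same statement for every style of page.
    my @programme = $package->parse($html);

    # Build the new, uniform HTML for this source.
    $output{ $src->{name} } = join '', '<ul>',
        ( map { "<li>$_->{time} $_->{title}</li>" } @programme ), '</ul>';
}

print "$_: $output{$_}\n" for sort keys %output;
```

Adding a new site then means writing one more Channel::* package and one more hash in the list; the loop itself stays untouched.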

Do note that the above snippet works under strict, even though it is in effect a symbolic reference: strict refs does not object to a method call on a class name held in a string.

Now, how can you parse an HTML file? The way I'd do it now is with HTML::TokeParser::Simple. Just look out for a specific tag, e.g. "form" or "table", then maybe one more, etc., and then finally grab the data you need. Don't worry about other styles of pages; each package only has to be able to process this one style of page.
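To give an idea of that approach, here is a small sketch with HTML::TokeParser::Simple (a CPAN module, so it has to be installed). The package name Channel::Example, the sample page, and the assumption that each row holds a time cell followed by a title cell are all made up for illustration; a real page will need its own sequence of tags.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

package Channel::Example;    # hypothetical package for one style of page

sub parse {
    my ($class, $html) = @_;
    my $p = HTML::TokeParser::Simple->new(\$html);

    # Look out for a specific tag: skip everything up to the first <table>.
    my $found;
    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('table') ) { $found = 1; last }
    }
    return unless $found;

    # Then grab the data: assumes two cells per row, time then title.
    # Note this keeps reading rows until the end of the document.
    my @programme;
    while ( $p->get_tag('tr') ) {
        $p->get_tag('td') or last;
        my $time = $p->get_trimmed_text('/td');
        $p->get_tag('td') or last;
        my $title = $p->get_trimmed_text('/td');
        push @programme, { time => $time, title => $title };
    }
    return @programme;
}

package main;

# Made-up sample page.
my $html = '<html><body><form><input name="q"></form>'
         . '<table><tr><td>18:00</td><td> News </td></tr>'
         . '<tr><td>18:30</td><td>Weather</td></tr></table></body></html>';

my @programme = Channel::Example->parse($html);
print "$_->{time} $_->{title}\n" for @programme;
```

The get_tag and get_trimmed_text methods are inherited from HTML::TokeParser; the is_start_tag test is what the ::Simple layer adds on top.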

Do remember that the first parameter to class methods like "parse" will be the package name, so don't forget to shift it off @_ before you read the real arguments.


In reply to Re: Regex/Pattern Matching by bart
in thread Regex/Pattern Matching by shu
