in reply to Framework for News Articles
With WWW::Search, sites are updated or added by writing a subclass module, which is then, ideally, distributed through CPAN. This seems to me like too much friction to keep up effectively with changes in the particulars of individual sites.
I am not sure I have a better idea. But one approach I have thought of is to have a base class/module that reads in site-specific data from simple XML files. The XML would contain metadata about the site including, at heart, one or more Perl5 regexes. Another key piece of information in the XML would be one or more URLs for updating the file when it seems to be out of date.
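For example, a descriptor might look something like this (a sketch only; the element names and layout here are purely hypothetical):

    <?xml version="1.0"?>
    <!-- Hypothetical site descriptor; all names here are illustrative. -->
    <site>
      <name>Example News</name>
      <url>http://news.example.com/</url>
      <!-- Capture group 1 is the story link, group 2 is the headline. -->
      <pattern><![CDATA[<h3 class="story"><a href="(.*?)">(.*?)</a></h3>]]></pattern>
      <!-- Where to fetch a fresh copy when this file stops matching. -->
      <update-url>http://descriptors.example.com/example-news.xml</update-url>
    </site>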
The advantage of this scheme is that, since Perl5-compatible regexes are supported outside of Perl5 itself, the XML files could be used in other applications, for example a Windows-based aggregator. Also, the update URL(s) allow for more rapid correction when a site changes and its descriptor goes stale.
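To make the Perl side concrete, here is a minimal sketch of a base module consuming such a descriptor, assuming the hypothetical format above and using XML::Simple and LWP::Simple from CPAN:

    package SiteScraper;
    use strict;
    use warnings;
    use XML::Simple qw(XMLin);
    use LWP::Simple qw(get);

    sub new {
        my ($class, $descriptor_file) = @_;
        # Pull the site metadata and regex(es) from the XML descriptor.
        my $site = XMLin($descriptor_file, ForceArray => [qw(pattern update-url)]);
        return bless { site => $site }, $class;
    }

    sub headlines {
        my ($self) = @_;
        my $html = get($self->{site}{url})
            or die "Could not fetch $self->{site}{url}\n";
        my @items;
        # Each descriptor pattern captures (link, headline) pairs.
        for my $pat (@{ $self->{site}{pattern} }) {
            while ($html =~ /$pat/g) {
                push @items, { link => $1, title => $2 };
            }
        }
        return @items;
    }

    package main;
    my $scraper = SiteScraper->new('example-news.xml');
    printf "%s\n    %s\n", $_->{title}, $_->{link} for $scraper->headlines;

A real version would want error handling and a fallback to the update URL(s) when no patterns match; this only shows the shape of the idea.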
Finally, because site descriptor files could be created with just Perl5 regular expressions and a few pieces of information about the site, there is potentially a wider pool of authors than CPAN draws on. (Especially if someone created a Web service that made creating or updating a site descriptor file as easy as filling out a Web form.)
The disadvantage of such a scheme, of course, is that it relies heavily on regexes to extract data, which can be less efficient, less reliable, and less powerful than proper parsing.