
Re: RFC: HTML::ListScraper

by rinceWind (Monsignor)
on Apr 25, 2007 at 08:22 UTC ( #611937=note )

in reply to RFC: HTML::ListScraper

Release early and often is a good approach with CPAN modules. I notice that CPAN testers is showing one failing test (though cpantesters is not yet displaying the results). If I were you, I'd make fixing the failing test a priority for 0.02.

My first thoughts on the module documentation are that it's not clear when you would want to use it, and what the advantages are over HTML::TokeParser or HTML::TreeBuilder. If this is spelled out loud and clear in the description section, more people will be inclined to install and use your module.

What would be really good is a worked example. Use some real website that's out there, maybe one you are hosting yourself. Together with a tutorial pod file, this would go a long way to promoting use of your module.

wetware hacker
(Qualified NLP Practitioner and Hypnotherapist)

Replies are listed 'Best First'.
Re^2: RFC: HTML::ListScraper
by vbar (Novice) on May 27, 2007 at 20:22 UTC
    Making glacial progress... I've fixed the failing tests, and as for when to use HTML::ListScraper, the principal use case is parsing search engine results. But documenting a worked-out example would IMHO be misleading - the module just doesn't work well enough for lots of people to start using it right now...

    HTML::ListScraper is different from HTML::TokeParser and HTML::TreeBuilder in that it doesn't return the same information (for the same input document); it drops the "irregular" parts, leaving something smaller and hopefully easier to interpret - except that as it stands, it drops rather too much...

    Recently I've been reminded that biologists have an interest in sequence matching, and have some interesting algorithms I could try. They don't seem to be implemented as CPAN modules, though, so the next step looks like implementing them before trying to incorporate some form of sequence alignment into HTML::ListScraper (a bit like Algorithm::AhoCorasick, which turned out to be completely unnecessary :-) ). And obviously the algorithms will have variations and alternatives I've no idea about - any bioinformatics specialists around here?
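
For readers unfamiliar with sequence alignment, here is a minimal sketch of one classic bioinformatics algorithm, Smith-Waterman local alignment, applied to lists of tag names. It is written in Python for brevity and is not taken from HTML::ListScraper; the function name, the scoring values and the sample tag sequences are all assumptions made for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between sequences a and b.

    The scoring values are illustrative defaults, not tuned for HTML.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for x in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Clamping at 0 lets an alignment restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Two hypothetical list items; the second has an extra <b>...</b> inside.
plain_item = ["li", "a", "/a", "/li"]
bold_item = ["li", "a", "b", "/b", "/a", "/li"]
print(local_alignment_score(plain_item, bold_item))
```

With these scores, two items that differ only by a couple of inserted tags still score well, which is the kind of tolerance an exact repetition search does not have.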

      Thanks for the module. I was looking for something similar for a while. The name did not clearly tell me what the module was doing. I installed HTML::ListScraper. The documentation mentions the example script scrape, but it does not get installed with cpan install, so I had to go back to the distribution to get it. This is just a small inconvenience. When I tried it on my example HTML file, I found that the approximation splits the input into finer blocks, and I could not figure out a way to tune this. Also, I would have liked approximation to be tried only if exact repetition (something like a suffix tree + largest repeating string combination) fails. Thanks once again. -Sreenivasa
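
The exact-repetition step mentioned above can be sketched roughly as follows. This is an assumption-laden illustration in Python, not the module's code: instead of a real suffix tree it sorts the suffixes naively (a simple suffix array), and the function name and sample tag list are hypothetical.

```python
def longest_repeated_run(tokens):
    """Return the longest contiguous run of tokens that occurs at least twice.

    Occurrences are allowed to overlap. Sorting whole suffixes is quadratic
    in the worst case; a suffix tree or a proper suffix array would be the
    scalable choice.
    """
    n = len(tokens)
    # Naive suffix array: suffix start positions ordered by suffix contents.
    order = sorted(range(n), key=lambda i: tokens[i:])
    best = []
    # The longest repeat is a common prefix of two neighbouring suffixes.
    for p, q in zip(order, order[1:]):
        k = 0
        while p + k < n and q + k < n and tokens[p + k] == tokens[q + k]:
            k += 1
        if k > len(best):
            best = tokens[p:p + k]
    return best


# Hypothetical tag sequence for a two-item list.
tags = ["ul", "li", "a", "/a", "/li", "li", "a", "/a", "/li", "/ul"]
print(longest_repeated_run(tags))
```

If this returns an empty or very short run, that would be the point at which a fallback to approximate matching could kick in.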
