I don't know Newick, but you could have the user app spit out a Perl source file with a subroutine in it. That subroutine, given the story page content, would return the title (or undef if it can't find one). Then your scraper would pull in all of the files (with require), call each subroutine, and use the first (or best) result.
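
Here's a rough sketch of the loading side, assuming each generated file evaluates to a single anonymous sub (so require hands back the code ref on first load). The file names, the two sample rules, and the helpers load_rules and extract_title are all made up for illustration; the temp directory just stands in for wherever the user app writes its output.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;
use File::Glob qw(bsd_glob);

# Stand-ins for the files the user app would generate (hypothetical contents).
my $dir = tempdir(CLEANUP => 1);
my %rule_source = (
    'site_a.pl' => 'sub { $_[0] =~ m{<h1 class="story">(.*?)</h1>}s ? $1 : undef };',
    'site_b.pl' => 'sub { $_[0] =~ m{<b>Headline:</b>\s*(.*?)<br>}s ? $1 : undef };',
);
for my $name (keys %rule_source) {
    my $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $rule_source{$name}, "\n";
    close $fh or die "Cannot close $path: $!";
}

# Pull in every rule file. The last statement of each file is the sub itself,
# so require returns a code ref (which also counts as the required true value).
sub load_rules {
    my ($rule_dir) = @_;
    my @rules;
    for my $file (sort(bsd_glob(File::Spec->catfile($rule_dir, '*.pl')))) {
        my $rule = require $file;
        push @rules, $rule if ref $rule eq 'CODE';
    }
    return @rules;
}

# Return the first title any rule finds, or nothing if none match.
sub extract_title {
    my ($page, @rules) = @_;
    for my $rule (@rules) {
        my $title = $rule->($page);
        return $title if defined $title;
    }
    return;
}

my @rules = load_rules($dir);
my $page  = '<html><body><b>Headline:</b> Llamas Escape Zoo<br>...</body></html>';
my $title = extract_title($page, @rules);
print defined $title ? "$title\n" : "no title found\n";
```

Note that require only returns the sub the first time a given file is loaded in a process; a long-running scraper that reloads rules would probably want do FILE instead.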

For actually finding the title, I suggest going a step further on the assumption that "sites are typically generated from a database." Instead of looking at the HTML structure for a pattern, use the raw text. Once the user selects the node that contains the title, capture some number of characters of context before and after the title. As with the HTML pattern you described, you might refine how much context you keep based upon its uniqueness.

Once you have the context, you can spit out the new subroutine as little more than a regexp match. Or go one step further and use String::Approx to do approximate matching.
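
To make that concrete, here is a minimal sketch of generating such a subroutine from one sample page. The function name make_rule_source and the fixed context width of 12 characters are my assumptions, not anything from your app; a real version would widen or narrow the context until it is unique on the page.

```perl
use strict;
use warnings;

# Build Perl source for a rule from one sample page and the title the user
# picked. $width is how many characters of context to keep on each side.
sub make_rule_source {
    my ($page, $title, $width) = @_;
    my $pos = index($page, $title);
    return if $pos < 0;

    my $start  = $pos > $width ? $pos - $width : 0;
    my $before = substr($page, $start, $pos - $start);
    my $after  = substr($page, $pos + length($title), $width);

    # quotemeta escapes every non-word character, so the context is matched
    # literally and cannot close the m{...} delimiters or smuggle in code.
    return sprintf 'sub { $_[0] =~ m{%s(.+?)%s}s ? $1 : undef };',
        quotemeta($before), quotemeta($after);
}

my $sample = '<td class="hd"><b>Llamas Escape Zoo</b></td><td>By A. Reporter</td>';
my $source = make_rule_source($sample, 'Llamas Escape Zoo', 12);
my $rule   = eval $source or die "Generated rule did not compile: $@";

# A different story from the same (imaginary) site has the same surroundings.
my $other = '<td class="hd"><b>Goats Return Home</b></td><td>By B. Writer</td>';
print $rule->($other), "\n";
```

The returned string is what you would write to a rule file. Swapping the literal match for String::Approx would loosen it further, but I have left that out to keep the sketch to core Perl.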

To reduce false positives, you could look for a signature on a web site that identifies it as coming from a particular source. For instance, a copyright notice won't often change. When you create the rule, you also include a check for the copyright, and return undef immediately if it's not there.
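
The signature check can be bolted onto any rule with a small wrapper. This is only a sketch: with_signature is a name I invented, and the copyright strings and pages are made-up sample data.

```perl
use strict;
use warnings;

# Wrap an existing rule so it only runs on pages carrying the site's signature.
sub with_signature {
    my ($signature, $rule) = @_;
    return sub {
        my ($page) = @_;
        return if index($page, $signature) < 0;   # wrong site: bail out early
        return $rule->($page);
    };
}

my $rule = with_signature(
    'Copyright Example News Co.',
    sub { $_[0] =~ m{<h1>(.*?)</h1>}s ? $1 : undef },
);

my $ours   = '<h1>Llamas Escape Zoo</h1><p>Copyright Example News Co.</p>';
my $theirs = '<h1>Unrelated Page</h1><p>Copyright Someone Else</p>';

for my $page ($ours, $theirs) {
    my $title = $rule->($page);
    print defined $title ? "$title\n" : "(no match)\n";
}
```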

Another way to improve accuracy is a feedback loop with the users. Give each subroutine a weight. If more than one subroutine gives you back an answer, use the one with the highest weight. However, also include links on the jump page (I assume you have a web version of the RDF feed) like "Should this have been titled 'Such-and-such'?" When clicked, the link increases the weight of the subroutine that gave the right answer and decreases the weight of the one that gave the wrong answer. (But beware malicious users.)
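
A bare-bones version of the weighting might look like the following. The rule names, the starting weight of 1.0, the step of 0.1, and the floor at zero are all arbitrary choices of mine, and nothing here guards against one user clicking a thousand times; that part is left to you.

```perl
use strict;
use warnings;

# Each rule carries a weight that user feedback nudges up or down.
my @rules = (
    { name => 'rule_a', weight => 1.0,
      code => sub { $_[0] =~ m{<h1>(.*?)</h1>}s ? $1 : undef } },
    { name => 'rule_b', weight => 1.0,
      code => sub { $_[0] =~ m{<title>(.*?)</title>}s ? $1 : undef } },
);

# Collect every answer, highest weight first.
sub candidates {
    my ($page) = @_;
    my @found;
    for my $rule (@rules) {
        my $title = $rule->{code}->($page);
        push @found, { rule => $rule, title => $title } if defined $title;
    }
    return sort { $b->{rule}{weight} <=> $a->{rule}{weight} } @found;
}

# Called when a user confirms which candidate title was right.
sub record_feedback {
    my ($right, @wrong) = @_;
    $right->{rule}{weight} += 0.1;
    for my $c (@wrong) {
        my $w = $c->{rule}{weight} - 0.1;
        $c->{rule}{weight} = $w > 0 ? $w : 0;   # never drop below zero
    }
    return;
}

my $page = '<title>Example News: Llamas Escape Zoo</title><h1>Llamas Escape Zoo</h1>';
my @first_pass = candidates($page);

# Simulate a click on "Should this have been titled 'Llamas Escape Zoo'?"
my ($right) = grep { $_->{title} eq 'Llamas Escape Zoo' } @first_pass;
my @wrong   = grep { $_->{rule} != $right->{rule} } @first_pass;
record_feedback($right, @wrong);

my ($best) = candidates($page);
print "Best guess: $best->{title} (from $best->{rule}{name})\n";
```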

One more: Have each subroutine return a confidence value (perhaps the Levenshtein edit distance of the context, inverted). Then use the one with the highest confidence.
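
As a sketch of the confidence idea: each rule below returns a (title, confidence) pair, where the confidence compares the stored leading context with whatever actually precedes the match. The hand-rolled levenshtein, the 0-to-1 scaling in confidence, and the make_rule helper are my own illustration (Text::Levenshtein or String::Approx from CPAN would do the distance part for you); the regexes and pages are invented.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Classic dynamic-programming Levenshtein distance, kept to two rows.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# "Inverted" distance on a 0..1 scale: 1 means the context matched exactly.
sub confidence {
    my ($expected, $seen) = @_;
    my $longest = max(length $expected, length $seen);
    return 1 if $longest == 0;
    return 1 - levenshtein($expected, $seen) / $longest;
}

# Build a rule that returns (title, confidence), or an empty list on no match.
# $before is the leading context recorded when the rule was created.
sub make_rule {
    my ($regex, $before) = @_;
    return sub {
        my ($page) = @_;
        return unless $page =~ $regex;
        my ($title, $start) = ($1, $-[1]);   # $-[1]: offset where the title begins
        my $from = $start > length($before) ? $start - length($before) : 0;
        my $seen = substr($page, $from, $start - $from);
        return ($title, confidence($before, $seen));
    };
}

my @rules = (
    make_rule(qr{<h1[^>]*>(.*?)</h1>}s, '<div id="story"><h1>'),
    make_rule(qr{<b>(.*?)</b>}s,        '<td class="hd"><b>'),
);
my $page = '<div id="story2"><h1>Llamas Escape Zoo</h1><p><b>Related</b></p></div>';

my ($best_title, $best_conf) = (undef, -1);
for my $rule (@rules) {
    my ($title, $conf) = $rule->($page);
    next unless defined $title;
    ($best_title, $best_conf) = ($title, $conf) if $conf > $best_conf;
}
printf "%s (confidence %.2f)\n", $best_title, $best_conf;
```

Both rules match this page, but the second one's surroundings look nothing like its stored context, so the first rule wins.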


In reply to Re: Extracting arbitrary data from HTML by TilRMan
in thread Extracting arbitrary data from HTML by vbfg
