downer has asked for the wisdom of the Perl Monks concerning the following question:

Today's task is seemingly simple, but has proven challenging for my woeful software skills. I will be processing the textual content, if any, of arbitrary web pages obtained by our crawler. In no particular order, I need to remove any HTML tags as well as any garbage, separate the text into paragraphs, and split each paragraph into individual sentences, which are then processed one at a time.

By paragraph, I mean the natural sense of the word. Since web pages are often poorly formed, it may be useful to employ some heuristics, such as capping a paragraph at 20 sentences or so if no natural break is found nearby.

Sentences are of course the groups of words separated by the usual punctuation, and maybe a similar heuristic can be employed, limiting the number of words to, say, 100. I have tried writing some code to handle this, but it seems to perform poorly. I can use HTML::Strip to remove the tags, and Text::Sentence to find the sentences reasonably well, but have no clue how to grab paragraphs. I realize that I could probably use the HTML tags to aid this task, but I have been unsuccessful in doing so. I really appreciate any help!
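
The two fallback caps described above could be sketched roughly as follows. This is only an illustration in core Perl: it assumes the sentences have already been found by some other means, the names cap_sentence and cap_paragraphs are made up for this example, and the limits are simply the 20 and 100 figures floated in the question.

```perl
use strict;
use warnings;

# Hypothetical limits, taken from the figures mentioned in the question.
my $MAX_SENTENCES = 20;
my $MAX_WORDS     = 100;

# Split one over-long sentence into chunks of at most $MAX_WORDS words.
sub cap_sentence {
    my ($sentence) = @_;
    my @words = split ' ', $sentence;
    my @chunks;
    while (@words) {
        push @chunks, join ' ', splice(@words, 0, $MAX_WORDS);
    }
    return @chunks;
}

# Group a flat list of sentences into paragraphs (array refs)
# holding at most $MAX_SENTENCES sentences each.
sub cap_paragraphs {
    my (@sentences) = @_;
    my @paragraphs;
    while (@sentences) {
        push @paragraphs, [ splice(@sentences, 0, $MAX_SENTENCES) ];
    }
    return @paragraphs;
}
```

These caps are meant as a last resort, applied only where the markup and punctuation give no natural break.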

Re: Seperating HTML by paragraph, sentence
by blue_cowdawg (Monsignor) on Oct 03, 2007 at 14:17 UTC
        I realize that I could probably use the html tags to aid this task, but I have been unsuccessful in doing so. I really appreciate any help!

    With that realization in mind, take a look at HTML::TokeParser instead of HTML::Strip.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: Seperating HTML by paragraph, sentence
by mr_mischief (Monsignor) on Oct 03, 2007 at 14:55 UTC
    Your idea to establish a heuristic for what makes a paragraph in badly formatted HTML is interesting. Off the top of my head, I can think of several ways a paragraph could be denoted.
    • text after an opening paragraph tag, of course, up to a possible closing tag
    • text after an ending paragraph tag that's not marked as anything special itself
    • two or more line break tags successively
    • a div
    • a list, either ordered or unordered
    • a table row, or sometimes an individual cell
    Of course, what makes this a special kind of torture for you is that it's not easy to tell if a table row should be kept together and that most of these elements can be contained within one another. You almost need a parse tree of the document plus a heuristic in order to preserve the author's intended paragraphs.

    Your life gets easier if you can treat a table as an object as a whole.

      it seems that maybe this is more of an algorithms problem than a perl problem.
        Well, if you make it a module and put it on CPAN, we can make it Perl wisdom to point said module out to others who ask. Otherwise, just realizing that your approach has as much to do with the plan as with the tools is wisdom in itself.

        BTW, it's probably possible to write an ad hoc text extractor using heuristic rules and regular expressions to get close to what you want without building a proper tree. I'm not sure without trying if HTML::TreeBuilder or such would be really necessary, but my gut feeling is that it could help quite a bit.
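
For what it's worth, an ad hoc extractor along those lines might look something like the sketch below. It uses only core Perl and regular expressions, treats the markers listed earlier (p, div, lists, table rows, doubled line breaks) as paragraph boundaries, and makes no attempt to handle nesting, malformed tags, or entities such as &amp;amp;. The function name rough_paragraphs is invented for this example.

```perl
use strict;
use warnings;

sub rough_paragraphs {
    my ($html) = @_;

    # Drop comments and script/style bodies, which are never visible text.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;

    # Collapse source whitespace so that only our own markers contain newlines.
    $html =~ s/\s+/ /g;

    # Two or more consecutive <br> tags count as a paragraph break.
    $html =~ s/(?:<br\b[^>]*>\s*){2,}/\n\n/gi;

    # Opening or closing p, div, list, list-item and table-row tags also break.
    $html =~ s/<\/?(?:p|div|ul|ol|li|tr)\b[^>]*>/\n\n/gi;

    # Every remaining tag (including a lone <br>) becomes a space.
    $html =~ s/<[^>]*>/ /g;

    # Split on the markers, tidy whitespace, and drop empty pieces.
    my @paragraphs;
    for my $chunk (split /\s*\n\s*/, $html) {
        $chunk =~ s/\s+/ /g;
        $chunk =~ s/^\s+|\s+$//g;
        push @paragraphs, $chunk if length $chunk;
    }
    return @paragraphs;
}
```

Whether a table row really belongs together as one paragraph is exactly the kind of judgment this approach cannot make, which is where a parse tree would start to pay off.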

Re: Seperating HTML by paragraph, sentence
by Cody Pendant (Prior) on Oct 04, 2007 at 05:42 UTC
    One obvious approach to parsing the HTML for what you're calling "paragraphs" is based on the two main kinds of HTML tag.

    There are block tags, which create a linebreak above and below, and inline tags which flow with the text.

    The DTD for HTML will tell you which is which, if you feel like working through it: http://www.w3.org/TR/html401/sgml/dtd.html

    Complications of course arise because there can be blocks inside blocks, and because CSS is allowed to re-define the block/inline setting to suit the author.
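
A minimal sketch of the block/inline idea, using HTML::TokeParser (suggested earlier in the thread) and HTML::Entities from the same distribution. The %BLOCK set is a hand-picked, partial list for illustration (block-level elements plus list items and table parts that also start a new line), not a transcription of the DTD, and block_paragraphs is a made-up name. Nested blocks simply flush the text collected so far, and CSS overrides are ignored entirely.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Assumed, partial set of tags that force a line break; extend as needed.
my %BLOCK = map { $_ => 1 } qw(
    p div h1 h2 h3 h4 h5 h6 ul ol li dl dt dd pre
    blockquote table tr td th form hr address
);

sub block_paragraphs {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @paragraphs;
    my $buffer = '';
    my $skip   = 0;    # non-zero while inside <script> or <style>

    # Tidy the collected text and store it as a paragraph if non-empty.
    my $flush = sub {
        $buffer =~ s/\s+/ /g;
        $buffer =~ s/^\s+|\s+$//g;
        push @paragraphs, $buffer if length $buffer;
        $buffer = '';
    };

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' || $type eq 'E') {
            my $tag = $token->[1];    # tag name, lowercased by the parser
            if ($tag eq 'script' || $tag eq 'style') {
                $skip = $type eq 'S' ? 1 : 0;
            }
            $buffer .= ' ' if $tag eq 'br';    # a lone <br> separates words
            $flush->() if $BLOCK{$tag};        # any block tag ends a paragraph
        }
        elsif ($type eq 'T' && !$skip) {
            $buffer .= decode_entities($token->[1]);
        }
    }
    $flush->();
    return @paragraphs;
}
```

Inline tags such as b or a never trigger a flush, so their text flows into the surrounding paragraph, which is the behaviour the block/inline distinction is meant to capture.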



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...