ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks,

I would like to extract the first paragraph from a series of web pages. Normally, I would fetch the page with LWP, find the pattern of tags around the paragraph with a regex, and extract it. The problem is that the pages I want to extract from now use different tags every time.

Is there a way to say: extract the first group of more than 7 plain "words" in a row, and stop the match at a newline? Or is there a better way to do the extraction that doesn't rely on a regular expression?

I appreciate your help.

Many thanks,
ghettofinger

Re: Extracting paragraphs from html
by merlyn (Sage) on Sep 11, 2005 at 16:50 UTC
Re: Extracting paragraphs from html
by sk (Curate) on Sep 11, 2005 at 16:49 UTC
    As you noticed, parsing HTML with a regex gets messy when the tags change all the time.

    You might want to look at HTML::TokeParser::Simple.
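
A minimal sketch of how that module could be applied to the original question, assuming HTML::TokeParser::Simple 3.x (for the `string =>` constructor) and HTML::Entities are installed. The sub name `first_wordy_text` and the sample HTML are made up for illustration. It looks at each text token on its own, so inline markup such as <b> splits a sentence into separate runs; treat it as a starting point rather than a finished extractor.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the first text run with more than
# $min_words words, or nothing if no run qualifies.
sub first_wordy_text {
    my ($html, $min_words) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $skip   = 0;    # true while inside <script> or <style>

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag('script') || $token->is_start_tag('style')) {
            $skip = 1;
            next;
        }
        if ($token->is_end_tag('script') || $token->is_end_tag('style')) {
            $skip = 0;
            next;
        }
        next if $skip || !$token->is_text;

        # Source newlines are not paragraph breaks in HTML, so collapse them.
        my $text = decode_entities($token->as_is);
        $text =~ s/\s+/ /g;
        $text =~ s/^ //;
        $text =~ s/ $//;

        my @words = split ' ', $text;
        return $text if @words > $min_words;
    }
    return;
}

# Made-up sample page: no <p> around the first real paragraph.
my $html = <<'HTML';
<html><head><title>Demo</title><style>p { color: red }</style></head>
<body><div id="nav"><a href="/">Home</a> | <a href="/faq">FAQ</a></div>
<td>This is the first real paragraph
of the page, wrapped over two lines.</td>
<p>Second paragraph.</p></body></html>
HTML

print first_wordy_text($html, 7), "\n";
```

The word threshold of 7 comes straight from the question; the short navigation links are skipped because no single text run there has enough words.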

    -SK

Re: Extracting paragraphs from html
by fraktalisman (Hermit) on Sep 11, 2005 at 16:55 UTC

    If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph?

    Where does it stop? Certainly not at a newline, for we are dealing with HTML, and there might be many newlines in the source code where they are not visible in the page that is actually displayed.
    So what would possibly terminate a paragraph?

    • A closing tag of a block element, like </div> </p> etc.
    • More than one break, i.e. <br> <br> without words or images between them
    • The start of another paragraph or block element, like <div> <p> <iframe> <hr> etc.
    • An image <img>
    • The end of the page or document

    As a pragmatic safeguard, you might also want to specify a maximum length at which the extracted text is truncated. Some people don't use paragraphs at all; they just type or paste hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
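
The terminators listed above, plus the length cap, could be sketched roughly like this. It assumes HTML::TokeParser and HTML::Entities (both from the HTML-Parser distribution) are available. The sub name `first_paragraph`, the `min_words` and `max_chars` options, their defaults, the tag list in `%BLOCK`, and the sample HTML are all assumptions made for illustration, not anything the thread prescribes.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Tags treated as paragraph boundaries (an assumed, incomplete list).
my %BLOCK = map { $_ => 1 } qw(
    p div td th tr table li ul ol blockquote pre hr iframe img
    h1 h2 h3 h4 h5 h6 title body
);
my %SKIP = map { $_ => 1 } qw(script style);

sub first_paragraph {
    my ($html, %opt) = @_;
    my $min_words = $opt{min_words} || 8;      # "more than 7 words"
    my $max_chars = $opt{max_chars} || 300;    # pragmatic length cap

    my $parser = HTML::TokeParser->new(\$html) or die "Cannot set up parser";
    my ($buffer, $breaks, $skipping) = ('', 0, 0);

    # Normalise the buffered text, reset the buffer, and return the text
    # only if it has enough words to count as a paragraph.
    my $candidate = sub {
        my $text = decode_entities($buffer);
        $text =~ s/\s+/ /g;
        $text =~ s/^ //;
        $text =~ s/ $//;
        ($buffer, $breaks) = ('', 0);
        my @words = split ' ', $text;
        return if @words < $min_words;
        if (length($text) > $max_chars) {
            $text = substr($text, 0, $max_chars);
            $text =~ s/\s+\S*$//;    # drop the trailing (possibly partial) word
            $text .= '...';
        }
        return $text;
    };

    while (my $token = $parser->get_token) {
        my ($type, $name) = @$token;
        if ($type eq 'T') {          # for text tokens, slot 1 holds the text
            $buffer .= $name unless $skipping;
            $breaks = 0 if $name =~ /\S/;
            next;
        }
        next unless $type eq 'S' || $type eq 'E';
        $name =~ s{/\z}{};           # tolerate <br/> style tags
        if ($SKIP{$name}) { $skipping = ($type eq 'S'); next }

        my $boundary = $BLOCK{$name};
        if ($name eq 'br' && $type eq 'S') {
            $boundary = 1 if ++$breaks > 1;    # two breaks in a row
            $buffer .= ' ';
        }
        next unless $boundary;
        my $text = $candidate->();
        return $text if defined $text;
    }
    return $candidate->();           # end of the document
}

# Made-up sample: paragraph marked only by <font> and a double <br>.
my $sample = <<'HTML';
<html><head><title>Demo</title>
<script>var x = "one two three four five six seven eight nine";</script></head>
<body>
<div class="nav"><a href="/">Home</a> | <a href="/about">About us</a></div>
<font size="2">Perl is a <b>highly capable</b> language &amp; it has
been in use for more than thirty years.<br>
<br>
This text follows a double break, so it is a separate paragraph.</font>
</body></html>
HTML

print first_paragraph($sample), "\n";
print first_paragraph($sample, max_chars => 40), "\n";
```

Inline tags such as <b> and <a> are deliberately not boundaries, so text keeps accumulating across them; a single <br> only adds a space, while a second one without text in between ends the paragraph.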