Extracting paragraphs from html

ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks,

I would like to extract the first paragraph from a series of web pages. Normally, I would fetch each page with LWP, find a pattern of tags around the paragraph with a regex, and extract it. The problem is that the tags around the paragraph differ from page to page, so no single pattern works.

Is there a way that I can say, extract the first grouping of words that has more than 7 plain "words" next to each other and stop the match at a newline? Is there a better way to go about extraction without relying on a regular expression?

I appreciate your help.

Many thanks,
ghettofinger

Re: Extracting paragraphs from html by merlyn (Sage) on Sep 11, 2005 at 16:50 UTC
Use XML::LibXML in HTML-parsing mode, then use an XPath that looks for text() nodes that have a length greater than N.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Update: See Locate large HTML paragraphs with XML::LibXML.
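
A minimal sketch of that suggestion, not merlyn's own code. The sample page, the threshold of 40 characters (the "N" above), and the helper name first_long_text are assumptions made for illustration; in practice the HTML would come from LWP.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample page with no <p> tags around the text we want.
my $html = <<'HTML';
<html><head><title>Sample</title></head><body>
<div class="nav"><a href="/">Home</a> | <a href="/about">About</a></div>
<script type="text/javascript">var message = "this script text is long enough to fool a naive length test";</script>
<span class="story">This is the first real paragraph of the page, and it is long enough.</span>
<span class="story">A second block of text follows, but it should not be the one returned.</span>
</body></html>
HTML

# Return the first text node under <body> whose whitespace-normalized
# length exceeds $min characters, or nothing if no node qualifies.
sub first_long_text {
    my ($html, $min) = @_;

    # recover => 2 asks libxml2 to parse tag soup without complaining.
    my $doc = XML::LibXML->load_html(string => $html, recover => 2);

    # Skip script and style content, then apply the length test.
    my $xpath = sprintf
        '//body//text()[not(ancestor::script) and not(ancestor::style)]'
      . '[string-length(normalize-space(.)) > %d]', $min;

    my ($node) = $doc->findnodes($xpath);   # nodes arrive in document order
    return unless $node;

    my $text = $node->textContent;
    $text =~ s/^\s+|\s+$//g;   # trim
    $text =~ s/\s+/ /g;        # collapse internal whitespace
    return $text;
}

print first_long_text($html, 40), "\n";
```

One caveat with this approach: inline markup such as <b> or <a> splits a paragraph into several text nodes, so a paragraph with many short fragments may be missed even though it is long as a whole.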
Re: Extracting paragraphs from html by sk (Curate) on Sep 11, 2005 at 16:49 UTC
As you noticed, parsing HTML with a regex gets messy and tricky when the tags change all the time. You might want to look at HTML::TokeParser::Simple.

-SK
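
A rough sketch of how HTML::TokeParser::Simple could be applied to the original question (the first run of text with more than 7 words). The sample page and the helper name first_wordy_text are made up for illustration, and the word test is a deliberately simple assumption.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample page; the wanted text sits in <span>, not <p>.
my $html = <<'HTML';
<html><head><title>A title that happens to contain quite a few words in it</title></head>
<body>
<table><tr><td><a href="/">Home</a></td><td><a href="/faq">Frequently asked questions</a></td></tr></table>
<span class="story">Yesterday the city council voted to rebuild the old harbour bridge.</span>
<span class="story">A second story follows here with plenty of words as well.</span>
</body></html>
HTML

# Return the first run of text between two tags that contains more than
# $min_words words, or nothing if there is none.
sub first_wordy_text {
    my ($html, $min_words) = @_;
    my $parser   = HTML::TokeParser::Simple->new(\$html);
    my $skipping = 0;

    while (my $token = $parser->get_token) {
        # Text inside these elements is never shown as page content.
        for my $tag (qw(script style title)) {
            $skipping = 1 if $token->is_start_tag($tag);
            $skipping = 0 if $token->is_end_tag($tag);
        }
        next if $skipping || !$token->is_text;

        (my $text = $token->as_is) =~ s/\s+/ /g;   # collapse whitespace
        $text =~ s/^ | $//g;                        # trim

        # Count runs of letters (and apostrophes) as "plain words".
        my $count = () = $text =~ /[[:alpha:]']+/g;
        return $text if $count > $min_words;
    }
    return;
}

print first_wordy_text($html, 7), "\n";
```

Because the tokenizer reports every tag as its own token, the text is cut at inline tags too, and as_is returns entities such as &amp; undecoded. Both may need extra handling on real pages.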
Re: Extracting paragraphs from html by fraktalisman (Hermit) on Sep 11, 2005 at 16:55 UTC
If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph? Where does it stop? Certainly not at a newline, for we are dealing with HTML, and there might be many newlines in the source code where they are not visible in the page that is actually displayed.

So what would possibly terminate a paragraph?

- A closing tag of a block element, like </div> </p> etc.
- More than one break, i.e. <br> <br> without words or images between them
- The start of another paragraph or block element, like <div> <p> <iframe> <hr> etc.
- An image <img>
- The end of the page or document

And for a pragmatic approach, you might want to specify a maximum length at which the given text is truncated. There are people who don't use paragraphs at all, they just type or copy hundreds and thousands of words on a page, like they were writing a novel or like they haven't understood the necessity of formatting at all.

_{fraktalisman keeps rolling}
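
The terminators listed above could be sketched roughly as follows, here with HTML::TokeParser::Simple. The set of boundary tags, the defaults (at least 8 words, at most 500 characters), and the name first_paragraph are all assumptions chosen for illustration, not a definitive implementation.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use HTML::Entities qw(decode_entities);

# Tags whose start or end closes the current paragraph (assumed set).
my %BOUNDARY = map { $_ => 1 } qw(
    body p div h1 h2 h3 h4 h5 h6 ul ol li table tr td
    blockquote pre form hr iframe img
);
# Elements whose text is never part of the visible page body.
my %IGNORED = map { $_ => 1 } qw(script style title);

sub first_paragraph {
    my ($html, %opt) = @_;
    my $min_words  = $opt{min_words}  // 8;     # "more than 7 words"
    my $max_length = $opt{max_length} // 500;   # pragmatic truncation

    my $parser = HTML::TokeParser::Simple->new(\$html);
    my ($buffer, $breaks, $ignoring) = ('', 0, 0);

    # Clean up the collected text; return it only if it has enough words.
    my $finish = sub {
        my $text = decode_entities($buffer);
        ($buffer, $breaks) = ('', 0);
        $text =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        my $words = () = $text =~ /[[:alpha:]]+/g;
        return if $words < $min_words;
        return substr($text, 0, $max_length);
    };

    while (my $token = $parser->get_token) {
        if ($token->is_start_tag || $token->is_end_tag) {
            (my $tag = lc $token->get_tag) =~ s{^/}{};
            if ($IGNORED{$tag}) {
                $ignoring = $token->is_start_tag ? 1 : 0;
                next;
            }
            my $boundary = $BOUNDARY{$tag};
            if ($tag eq 'br' && $token->is_start_tag) {
                $buffer .= ' ';
                $boundary = 1 if ++$breaks >= 2;   # two breaks in a row
            }
            next unless $boundary;
            my $paragraph = $finish->();
            return $paragraph if defined $paragraph;
        }
        elsif ($token->is_text && !$ignoring) {
            my $text = $token->as_is;
            $buffer .= $text;
            $breaks = 0 if $text =~ /\S/;   # words between breaks reset the count
        }
    }
    return $finish->();   # end of document also ends a paragraph
}

# Hypothetical page that uses <br><br> instead of paragraph tags.
my $html = <<'HTML';
<html><head><title>Example page</title></head><body>
<div id="menu"><a href="/">Home</a> <a href="/news">News</a></div>
<font face="Arial">The <b>first</b> paragraph uses no paragraph tags at all,
it just runs on until two line breaks.<br><br>
This text belongs to the second paragraph and must not be returned.</font>
</body></html>
HTML

print first_paragraph($html), "\n";
```

Unlike a per-token test, this version keeps collecting text across inline tags such as <b> and <font>, so a paragraph is only cut at the boundaries named in the list.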