ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:
Wise monks,
I would like to extract the first paragraph from a series of web pages. Normally I would fetch each page with LWP, find the pattern of tags around the paragraph, and extract it with a regex. The problem is that on the pages I am working with now, the surrounding tags differ from page to page.
Is there a way to say: extract the first run of more than 7 plain "words" in a row, and stop the match at a newline? Or is there a better way to do the extraction that does not rely on a regular expression?
I appreciate your help. Many thanks,
ghettofinger
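
The heuristic described in the question (the first run of more than 7 plain words, ending at a newline) could be sketched roughly as below. This is only an illustration, not an answer taken from the replies: `first_paragraph` and its `$min_words` parameter are hypothetical names, the list of block-level tags is an assumption, and the tag stripping is a crude regex pass that does not decode entities. A real HTML parser from CPAN would likely be more robust on messy pages.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the first line of
# tag-stripped text holding more than $min_words plain words, or undef.
sub first_paragraph {
    my ($html, $min_words) = @_;
    $min_words = 7 unless defined $min_words;

    # Drop script/style blocks and comments so their contents are not counted.
    $html =~ s{<(script|style)\b.*?</\1\s*>}{}gis;
    $html =~ s{<!--.*?-->}{}gs;

    # Assumption: these block-level tags mark paragraph boundaries,
    # so turn them into newlines before stripping the rest.
    $html =~ s{<\s*/?\s*(?:p|br|div|h[1-6]|li|tr|td|table|ul|ol|blockquote)\b[^>]*>}{\n}gi;

    # Strip every remaining tag (inline markup such as <b> or <a>).
    $html =~ s{<[^>]*>}{}g;

    for my $line (split /\n/, $html) {
        $line =~ s/\s+/ /g;        # collapse runs of whitespace
        $line =~ s/^ | $//g;       # trim the ends
        # Count only tokens that contain a letter as "plain words".
        my @words = grep { /[[:alpha:]]/ } split ' ', $line;
        return $line if @words > $min_words;
    }
    return;    # nothing long enough was found
}

# Usage with an inline sample; in practice the HTML would come from LWP.
my $sample = qq{<div id="nav">Home | About</div>\n}
           . qq{<p>This is the <b>first</b> real paragraph with more than seven words.</p>};
my $found = first_paragraph($sample);
print defined $found ? "$found\n" : "no paragraph found\n";
```

Because the match stops at a newline, a paragraph that is wrapped across several source lines would be cut at the first line break; collapsing source newlines before the block-tag step is one way around that.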
Replies are listed 'Best First'.
Re: Extracting paragraphs from html
by merlyn (Sage) on Sep 11, 2005 at 16:50 UTC
Re: Extracting paragraphs from html
by sk (Curate) on Sep 11, 2005 at 16:49 UTC
Re: Extracting paragraphs from html
by fraktalisman (Hermit) on Sep 11, 2005 at 16:55 UTC