Wise monks,
I would like to extract the first paragraph from a series of web pages. Normally I would fetch each page with LWP, find the pattern of tags around the paragraph, and pull it out with a regex. The problem is that on the pages I am working with now, the tags around the paragraph differ from page to page.
Is there a way to say: extract the first run of text that has more than 7 plain "words" next to each other, and stop the match at a newline? Or is there a better way to go about the extraction that does not rely on a regular expression?
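To make the heuristic concrete, here is a rough sketch of what I have in mind. It is only an illustration under my own assumptions: the sub name `first_paragraph`, the `$min_words` parameter, the list of tags treated as line breaks, and the definition of a "plain word" are all made up for this post, and the tag stripping is a crude regex rather than a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of text containing more than
# $min_words plain words in a row, or undef if no line qualifies.
sub first_paragraph {
    my ($html, $min_words) = @_;
    $min_words = 7 unless defined $min_words;

    # Crude tag handling (an assumption, not a real parser):
    # drop script/style blocks, turn block-level tags into newlines,
    # then delete every remaining tag.
    (my $text = $html) =~ s{<(?:script|style)\b.*?</(?:script|style)>}{}gis;
    $text =~ s{</?(?:p|div|br|h[1-6]|li|tr|td|table)\b[^>]*>}{\n}gi;
    $text =~ s{<[^>]*>}{}g;

    # A "plain word" here: letters, optional apostrophes or hyphens,
    # and at most one trailing punctuation mark.
    my $word = qr/[A-Za-z][A-Za-z'-]*[.,;:!?]?/;

    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        # $min_words words followed by one more word = more than $min_words
        return $line if $line =~ /\b(?:$word\s+){$min_words}$word/;
    }
    return;
}

my $html = join "\n",
    q{<div><a href="/">Home</a> | <a href="/about">About</a></div>},
    q{<h1>Title</h1>},
    q{<span class="x">The quick brown fox jumps over the lazy dog today.</span>},
    q{<p>Second paragraph with enough words to also match the rule.</p>};

my $para = first_paragraph($html);
print defined $para ? "$para\n" : "no paragraph found\n";
```

In real use `$html` would come from LWP instead of a literal string. Short navigation text such as "Home | About" is skipped because it never has eight plain words in a row, but I suspect long link lists or captions could still fool it, which is part of why I am asking.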
I appreciate your help. Many thanks,
ghettofinger
In reply to Extracting paragraphs from html by ghettofinger