PerlMonks  

Extracting paragraphs from html

by ghettofinger (Monk)
on Sep 11, 2005 at 16:38 UTC ( [id://491069]=perlquestion )

ghettofinger has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks,

I would like to extract the first paragraph from a series of web pages. Normally, I would use LWP to fetch the page, find the pattern of tags around the paragraph, and extract it with a regex. The problem is that on the pages I want to extract from, the tags differ from page to page.

Is there a way to say "extract the first group of more than 7 plain words in a row, and stop the match at a newline"? Or is there a better way to do the extraction that doesn't rely on a regular expression?

I appreciate your help.

Many thanks,
ghettofinger

Replies are listed 'Best First'.
Re: Extracting paragraphs from html
by merlyn (Sage) on Sep 11, 2005 at 16:50 UTC
Re: Extracting paragraphs from html
by sk (Curate) on Sep 11, 2005 at 16:49 UTC
    As you noticed, parsing HTML with a regex gets messy when the tags change all the time.

    You might want to look at HTML::TokeParser::Simple.

    -SK

Re: Extracting paragraphs from html
by fraktalisman (Hermit) on Sep 11, 2005 at 16:55 UTC

    If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph?

    Where does it stop? Certainly not at a newline, for we are dealing with HTML, and the source may contain many newlines that are not visible in the page as it is actually displayed.
    So what could terminate a paragraph?

    • A closing tag of a block element, like </div> </p> etc.
    • More than one break, i.e. <br> <br> without words or images between them
    • The start of another paragraph or block element, like <div> <p> <iframe> <hr> etc.
    • An image <img>
    • The end of the page or document

    And for a pragmatic approach, you might want to specify a maximum length at which the extracted text is truncated. Some people don't use paragraphs at all: they just type or paste hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
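The terminators listed above, plus the "more than 7 words" heuristic from the question and the maximum-length cutoff, can be sketched with a small event-driven parser. This is only an illustration under stated assumptions: it is written in Python with the standard-library html.parser rather than in Perl, the names FirstParagraph and first_paragraph are made up for this example, and the BLOCK_TAGS and SKIP_TAGS sets are a guess at which tags matter, not a complete list. The same token-by-token idea should carry over to a Perl token parser such as HTML::TokeParser::Simple.

```python
from html.parser import HTMLParser

# Assumed tag sets, for illustration only; adjust to the pages at hand.
BLOCK_TAGS = {"p", "div", "iframe", "hr", "table", "ul", "ol", "li",
              "blockquote", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style", "title"}  # text here is never a paragraph


class FirstParagraph(HTMLParser):
    """Collects text until something that plausibly ends a paragraph."""

    def __init__(self, min_words):
        super().__init__(convert_charrefs=True)
        self.min_words = min_words
        self.chunks = []         # text pieces of the current candidate
        self.result = None       # first candidate that was long enough
        self.skip_depth = 0      # > 0 while inside a SKIP_TAGS element
        self.pending_br = False  # a <br> was seen with no words after it

    def _flush(self):
        # End the current candidate; keep it only if it has enough words.
        text = " ".join("".join(self.chunks).split())
        self.chunks = []
        self.pending_br = False
        if self.result is None and len(text.split()) >= self.min_words:
            self.result = text

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS or tag == "img":
            self._flush()        # start of a block element, or an image
        elif tag == "br":
            if self.pending_br:
                self._flush()    # two breaks with no words between them
            else:
                self.pending_br = True
                self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()        # closing tag of a block element

    def handle_data(self, data):
        if self.skip_depth:
            return
        if data.strip():
            self.pending_br = False
        self.chunks.append(data)


def first_paragraph(html, min_words=8, max_chars=500):
    """Return the first run of at least min_words words, truncated."""
    parser = FirstParagraph(min_words)
    parser.feed(html)
    parser.close()
    parser._flush()              # the end of the document ends it too
    text = parser.result or ""
    if len(text) > max_chars:
        # Cut at the last space inside the limit and mark the truncation.
        text = text[:max_chars].rsplit(" ", 1)[0] + "..."
    return text


if __name__ == "__main__":
    sample = ("<div><a href='/'>Home</a> | <a href='/about'>About</a></div>"
              "<div>The quick brown fox jumps over the lazy dog "
              "near the river bank.<br>\n<br>More text follows.</div>")
    print(first_paragraph(sample))
```

The min_words threshold is what lets the sketch step over short navigation bars and captions before the real text starts. For simplicity it keeps parsing after the first paragraph has been found; on large pages it would be reasonable to stop early instead.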
