Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
This, in some ways, follows on from the discussions on extracting the raw text from an HTML source (see the How to get HTML::Parser to return a line of parsed text thread).
What I would like to do is to extract the first n words of printable data from a string of HTML text.
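The text can be extracted with the following code:

    use HTML::Parser;

    # Create a new, empty, scalar
    my $text = "";

    # Define what the parser does
    my $p = HTML::Parser->new(
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # .. and parse!
    $p->parse($full_text);
    $p->eof;    # flush any text the parser is still buffering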
Based on this, it is quite easy to then get the first n words:
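    # now hack off the first lump of words; the limit is $n + 1 because
    # the last field returned by split holds the unsplit remainder
    my @list_of_words = split /[ \t\r\n\f]+/, $text, $n + 1;
    pop @list_of_words if @list_of_words > $n;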
However, how do I then correlate the words from @list_of_words to the start of the HTML text in $full_text?
(The plan being that I can do the "blah blah blah (more...)" thing.)
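One possible way to tie the words back to positions in $full_text is to ask HTML::Parser for the source offset of each text chunk. This is a minimal sketch rather than a definitive answer. The helper name first_n_words_html is hypothetical, and it assumes $n is at least 1, that the HTML is a byte string, and that counting whitespace-separated runs of raw (undecoded) text is close enough to counting words. It uses the 'text' and 'offset' argspecs instead of 'dtext' so that positions line up with the original source.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the leading part of $html that ends just
# after the $n-th whitespace-separated word of text content.
sub first_n_words_html {
    my ($html, $n) = @_;
    my $seen = 0;    # words counted so far
    my $cut;         # offset in $html where the excerpt should end

    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [
            sub {
                my ($text, $offset) = @_;
                return if defined $cut;    # already found the cut point
                while ($text =~ /\S+/g) {
                    if (++$seen == $n) {
                        # pos() is the end of the word inside this chunk
                        $cut = $offset + pos($text);
                        return;
                    }
                }
            },
            'text, offset',                # raw text plus its source offset
        ],
    );
    $p->parse($html);
    $p->eof;

    # Fewer than $n words: return the whole document unchanged
    return defined $cut ? substr($html, 0, $cut) : $html;
}

# Example usage with a small made-up document
my $full_text = '<p>Hello <b>big wide</b> world</p>';
print first_n_words_html($full_text, 2), " (more...)\n";
```

The excerpt can end inside open elements (here the <p> and <b> are left unclosed), so a real version would probably need to track start and end tags and append the missing closing tags. Text inside <script> or <style> is also counted, and an entity such as &nbsp; is treated as part of a word.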
Edit: chipmunk 2001-05-29
Replies are listed 'Best First'.
Re: getting the first n printable words from a string of HTML
by tachyon (Chancellor) on May 29, 2001 at 20:39 UTC
by Vynce (Friar) on May 30, 2001 at 17:37 UTC

Re: getting the first n printable words from a string of HTML
by kiz (Monk) on May 29, 2001 at 18:58 UTC

Re: getting the first n printable words from a string of HTML
by tachyon (Chancellor) on May 30, 2001 at 07:07 UTC
by kiz (Monk) on May 30, 2001 at 19:04 UTC

Re: getting the first n printable words from a string of HTML
by kiz (Monk) on May 31, 2001 at 17:15 UTC