in reply to string splitting with spaces and HTML

Parsing html is always more problematic than it first seems, and rrwo is quite right to suggest that HTML::Parser is going to save you headaches later, but that's probably overkill for now.

The problem is that there isn't a single regex that will remove html. The most commonly used construction is:

s/<[^>]+>//gs

But that fails because this is valid html, even though the >s could as easily have been entities:

<input type="submit" value="go >>>">
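To see the failure concretely, here is a small sketch that runs the regex above against that tag. The trailing "Press the button" text is made up for the illustration:

```perl
use strict;
use warnings;

# The tag from above, followed by some made-up text
my $html = '<input type="submit" value="go >>>"> Press the button';

# [^>]+ stops at the first > it sees, which is inside the attribute value
(my $stripped = $html) =~ s/<[^>]+>//gs;

print "$stripped\n";    # prints: >>"> Press the button
```

The match ends at the first > inside value="...", so the rest of the tag leaks into the output as text.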

The best answer seems to be HTML::FormatText, which will strip html and optionally wrap the output for you. If there are particular tags that you want to keep, then the quickest thing to do is probably a set of simple regexes that replace each one with something innocuous before the text is parsed and then replace it back again afterwards.
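A rough sketch of that replace-and-restore trick, using only core Perl. The name strip_except, the \x00 and \x01 placeholder bytes, and the sample markup are all my own assumptions for illustration, and the naive regex stands in for whatever stripper you actually use (so it inherits the weakness shown above):

```perl
use strict;
use warnings;

# strip_except() is a hypothetical helper: strip every tag except those named in @keep
sub strip_except {
    my ($html, @keep) = @_;
    for my $tag (@keep) {
        # Hide <tag ...> and </tag> behind placeholder bytes (assumed absent from the input)
        $html =~ s{<(/?\Q$tag\E\b[^>]*)>}{\x00$1\x01}gi;
    }
    $html =~ s/<[^>]+>//g;                   # naive strip of whatever tags are left
    $html =~ s{\x00([^\x01]*)\x01}{<$1>}g;   # put the protected tags back
    return $html;
}

my $result = strip_except(
    '<p>See <a href="faq.html">the FAQ</a> for <i>details</i>.</p>', 'a');

print "$result\n";    # prints: See <a href="faq.html">the FAQ</a> for details.
```

If the input might already contain those placeholder bytes, pick different ones or strip them first.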

Once the html removal is taken care of, truncation should be simple:

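my $text = 'The world is all that is the case';
my $sniplength = 10;
my $cut = index($text, ' ', $sniplength);    # first space at or after $sniplength
my $truncated = $cut < 0 ? $text : substr($text, 0, $cut);    # no space found: keep the whole string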
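Note that the snippet above cuts at the first space at or after $sniplength, so the result can come out longer than the limit ('The world is' in this case). If you would rather never exceed the limit, one possible variant uses rindex to cut at the last space before it. The name snip is hypothetical, and the hard cut for a string with no usable space is just one choice among several:

```perl
use strict;
use warnings;

# snip() is a hypothetical helper: return at most $max characters, cut at a word boundary
sub snip {
    my ($text, $max) = @_;
    return $text if length($text) <= $max;

    # Last space at or before position $max
    my $cut = rindex($text, ' ', $max);

    # No usable space (none at all, or only a leading one): fall back to a hard cut
    $cut = $max if $cut <= 0;

    return substr($text, 0, $cut);
}

print snip('The world is all that is the case', 10), "\n";    # prints: The world
```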

muffled update from within paper bag: s/should be/could as easily have been/

Re: string splitting with spaces and HTML
by Abigail (Deacon) on Jul 11, 2001 at 16:28 UTC
    You are wrong that in HTML the mentioned >s need to be escaped. They don't. Nor do all <s need to be escaped. A < followed by a space is just fine. Just like <#. Don't forget that HTML is an SGML application. Real men deal with SGML. XML is for people who can't parse their way out of a wet paper bag. And in XML, such <s and >s need to be escaped.

    -- Abigail