mark0ls0n has asked for the wisdom of the Perl Monks concerning the following question:

I have been stuck on this problem for a few days, and was hoping for some help.
I want to snip a piece of text to ~40 chars ($sniplength in the script), but I don't want it to cut mid-word or inside an HTML tag. So I thought about taking out all the HTML tags except for <br> and its variants, and then snipping at the first space after the specified snip length.

Re: string splitting with spaces and HTML
by rrwo (Friar) on Jul 11, 2001 at 03:40 UTC

    Have you tried HTML::Parser?

    If you'd rather have a quick and dirty hack, try Text::Wrap: set the columns to 40, then split the result on newlines and keep the first line.
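
    A minimal sketch of the Text::Wrap idea, assuming the HTML has already been removed. The helper name first_wrapped_line and the sample text are made up for illustration. Note that wrap() breaks before the limit rather than at the first space after it, so the snippet comes out at most 40 characters long instead of slightly longer.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: return the first wrapped line of $text.
# The line is at most $max characters long and is not cut mid-word
# unless a single word is itself longer than the line.
sub first_wrapped_line {
    my ($text, $max) = @_;
    # wrap() keeps every line to at most $columns - 1 characters,
    # so ask for one column more than the wanted maximum.
    local $Text::Wrap::columns = $max + 1;
    my ($first) = split /\n/, wrap('', '', $text);
    return $first;
}

my $text = 'The world is all that is the case, and the limits of '
         . 'my language mean the limits of my world';
print first_wrapped_line($text, 40), "\n";
```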

Re: string splitting with spaces and HTML
by thpfft (Chaplain) on Jul 11, 2001 at 15:05 UTC

    Parsing html is always more problematic than it first seems, and rrwo is quite right to suggest that HTML::Parser is going to save you headaches later, but that's probably overkill for now.

    The problem is that there isn't a single regex that will remove html. The most commonly used construction is:

    s/<[^>]+>//gs

    But that fails because this is valid html, even though the >s could as easily have been entities:

    <input type="submit" value="go >>>">
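
    To see the failure concretely, here is a small sketch that runs the regex above over that tag. The helper name naive_strip is made up for illustration. The match stops at the first > inside the attribute value, so the tail of the tag is left behind in the output.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the commonly used tag-stripping regex.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<[^>]+>//g;   # [^>]+ cannot look past the first >
    return $html;
}

my $html = '<input type="submit" value="go >>>">';
print naive_strip($html), "\n";   # prints: >>">
```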

    The best answer seems to be HTML::FormatText, which will strip html and optionally wrap the output for you. If there are particular tags that you want to keep, then the quickest thing to do is probably a set of simple regexes that replace each one with something innocuous before the text is parsed and then replace it back again afterwards.
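
    The replace-and-restore trick for tags you want to keep could be sketched roughly as below. The helper name strip_keep_br and the NUL-delimited marker are assumptions made for illustration, and the naive tag regex only stands in for whatever real stripping step you use (HTML::FormatText or a parser), so its weakness with > inside attribute values still applies here.

```perl
use strict;
use warnings;

# Hypothetical helper: strip all tags except <br> and its variants.
sub strip_keep_br {
    my ($html) = @_;
    my $marker = "\x00BR\x00";   # assumed never to occur in the input

    # 1. Hide <br>, <BR>, <br/> and <br /> behind the marker.
    $html =~ s{<br\s*/?>}{$marker}gi;

    # 2. Strip the remaining tags (stand-in for a real HTML stripper).
    $html =~ s/<[^>]+>//g;

    # 3. Put a normalised <br> back where each marker sits.
    $html =~ s/\Q$marker\E/<br>/g;

    return $html;
}

print strip_keep_br('<p>Hello<BR />big <b>world</b></p>'), "\n";
# prints: Hello<br>big world
```

    Restoring every variant as a plain <br>, which contains no space, means a later cut at a space can never land inside one of the kept tags.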

    Once the html removal is taken care of, truncation should be simple:

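    my $text = 'The world is all that is the case';
    my $sniplength = 10;
    # find the first space at or after the snip length
    my $cut = index($text, ' ', $sniplength);
    # index() returns -1 when there is no later space; keep the whole string then,
    # because substr($text, 0, -1) would silently drop the last character
    my $truncated = $cut < 0 ? $text : substr($text, 0, $cut);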
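
    Wrapped up as a reusable function, the same truncation might look like the sketch below. The name snip_at_space is made up for illustration. It spells out two edge cases: text that is already short enough, and text with no space after the snip length, where index() returns -1.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text at the first space at or after $len.
sub snip_at_space {
    my ($text, $len) = @_;
    return $text if length($text) <= $len;    # already short enough
    my $cut = index($text, ' ', $len);        # -1 if no space follows
    return $cut < 0 ? $text : substr($text, 0, $cut);
}

print snip_at_space('The world is all that is the case', 10), "\n";
# prints: The world is
```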

    muffled update from within paper bag: s/should be/could as easily have been/

      You are wrong that in HTML the mentioned >s need to be escaped. They don't. Nor do all <s need to be escaped. A < followed by a space is just fine. So is <#. Don't forget that HTML is an SGML application. Real men deal with SGML. XML is for people who can't parse their way out of a wet paper bag. And in XML, such <s and >s need to be escaped.

      -- Abigail