Parsing html is always more problematic than it first seems, and rrwo is quite right to suggest that HTML::Parser is going to save you headaches later, but that's probably overkill for now.

The problem is that there isn't a single regex that will remove html. The most commonly used construction is:

s/<[^>]+>//gs

But that fails on markup like this, because the match ends at the first > inside the quoted attribute value. This is valid html, even though the >s could as easily have been entities:

<input type="submit" value="go >>>">
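To see the failure concretely, here is a minimal sketch that runs the common regex over that very tag. The variable names are only for illustration.

```perl
use strict;
use warnings;

# The valid-but-awkward tag from above.
my $html = '<input type="submit" value="go >>>">';

# Apply the commonly used construction to a copy of the string.
(my $stripped = $html) =~ s/<[^>]+>//gs;

# The match stopped at the first > inside the attribute value,
# so the tail of the tag is left behind as "text".
print "[$stripped]\n";    # prints [>>">]
```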

The best answer seems to be HTML::FormatText, which will strip html and optionally wrap the output for you. If there are particular tags that you want to keep, then the quickest thing to do is probably a set of simple regexes that replace each one with something innocuous before the text is parsed and then replace it back again afterwards.
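A rough sketch of that approach, assuming HTML::FormatText and HTML::TreeBuilder are installed from CPAN. The sample input, the placeholder strings and the choice of <b> as the tag to keep are all made up for illustration, and the protecting regex only copes with bare lowercase <b> and </b> (no attributes).

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical input; <b> is the tag we want to survive the stripping.
my $html = '<p>The <b>world</b> is <i>all</i> that is the case</p>';

# 1. Swap the tags to keep for innocuous plain-text placeholders.
(my $protected = $html) =~ s{<(/?)b>}{ $1 ? '__B_CLOSE__' : '__B_OPEN__' }ge;

# 2. Let HTML::FormatText turn everything else into wrapped plain text.
my $tree      = HTML::TreeBuilder->new_from_content($protected);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $text      = $formatter->format($tree);
$tree->delete;    # release the parse tree

# 3. Put the protected tags back.
$text =~ s/__B_OPEN__/<b>/g;
$text =~ s/__B_CLOSE__/<\/b>/g;

print $text;
```

The placeholders need to be strings that cannot occur in the real text and that contain no whitespace, otherwise the wrapping step could split them across lines.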

Once the html removal is taken care of, truncation should be simple:

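my $text = 'The world is all that is the case';
my $sniplength = 10;
# cut at the first space at or after $sniplength
my $cut = index($text, ' ', $sniplength);
# index returns -1 if there is no such space: keep the whole string then
my $truncated = $cut == -1 ? $text : substr($text, 0, $cut);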

muffled update from within paper bag: s/should be/could as easily have been/


In reply to Re: string splitting with spaces and HTML by thpfft
in thread string splitting with spaces and HTML by mark0ls0n
