huskey has asked for the wisdom of the Perl Monks concerning the following question:

I have what seems to me like a simple regex problem, but I can't get it to work. I need a regex that will replace each newline (\n) in the body of a newsletter with a <br> tag. I would also like to remove runs of more than two back-to-back <br> tags, and any <br> tags at the beginning and end of the body.

I would also like to cut the body of the newsletter down to 2000 characters. I have been using substr, but this sometimes cuts in the middle of a word. How would I end at the end of a sentence after the 2000th character?

Thank you for the help and please excuse me if these questions have been answered before.

Replies are listed 'Best First'.
Re: Removing extra <br> tags
by perlguy (Deacon) on Apr 01, 2003 at 21:04 UTC

    If the entire body of the paragraph is in a single variable, then I would do the following:

    $paragraph =~ s/\n+/<BR>/g;
    ($paragraph) = $paragraph =~ /(.{1,2000})\s+/;

    That last line captures up to the first 2000 characters (newlines excluded, since . does not match them). If whitespace does not follow the capture, the match backtracks until it does, so the text is never cut off in the middle of a word.
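
    The substitution above turns every run of newlines into a single <BR>. The question also asks to allow up to two in a row and to strip them from both ends of the body. A minimal sketch of one way to do that, assuming the body does not already contain literal <br> tags; the helper name newlines_to_br is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative, not from the thread).
# Assumes $body holds plain text with no literal <br> tags yet.
sub newlines_to_br {
    my ($body) = @_;
    $body =~ s/\r\n?/\n/g;      # normalize CRLF / CR line endings to \n
    $body =~ s/\A\s+|\s+\z//g;  # trim leading and trailing whitespace, newlines included
    $body =~ s/\n{3,}/\n\n/g;   # cap runs of newlines at two
    $body =~ s/\n/<br>/g;       # one <br> per remaining newline
    return $body;
}

print newlines_to_br("\n\nFirst line\nSecond line\n\n\n\nNew paragraph\n"), "\n";
# prints: First line<br>Second line<br><br>New paragraph
```

    Trimming and capping while the text still contains newlines keeps each regex simple. The tags are only inserted in the last step.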

    Hope that helps.

    Update: I misread the request. The second regex could be something like the following to meet your needs, so that it won't be longer than 2000 characters but will end at sentence end:

    ($paragraph) = $paragraph =~ /(.{1,1999}[.!?])/;
Re: Removing extra <br> tags
by Cody Pendant (Prior) on Apr 02, 2003 at 01:16 UTC
    From a usability point of view, I think just cutting the thing down arbitrarily to 2000 chars is a terrible idea -- what happens if it's in the middle of something I really want to read? I guess you're going to have some kind of system where I can read the rest?

    But anyway, one good idea would be to use a module which can split text up into sentences, which is apparently what Lingua::EN::Sentence does.
    --

    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
Re: Removing extra <br> tags
by dga (Hermit) on Apr 01, 2003 at 21:31 UTC

    Ending the newsletter at the end of a sentence will require that your script knows what that looks like, i.e., will the writer always make sure to put a period, question mark, etc. at the end of a sentence? And what about abbreviations, like the "etc." earlier in this paragraph?

    You will also have to decide whether you want just under or just over 2000 characters so that you can end at the end of a sentence. Getting exactly 2000 seems unlikely. If you cut on words, you may be able to get within 20 or so characters of 2000.
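
    The under/over choice can be sketched roughly as follows. This is only an illustration: the helper name cut_at_sentence and its 'under'/'over' modes are made up here, and a "sentence end" is naively assumed to be . ! or ? followed by whitespace or the end of the text, so abbreviations such as "etc." will still fool it.

```perl
use strict;
use warnings;

# Hypothetical helper (name and modes are illustrative, not from the thread).
# 'under' keeps the result within $limit characters; 'over' ends at the
# first sentence end at or after character number $limit.
sub cut_at_sentence {
    my ($text, $limit, $mode) = @_;
    return $text if length($text) <= $limit;

    my $skip = $limit - 1;
    if ($mode eq 'over') {
        # Skip $limit - 1 characters, then extend to the next sentence end.
        return $1 if $text =~ /\A(.{$skip}.*?[.!?])(?=\s|\z)/s;
        return $text;    # no sentence end past the limit: keep everything
    }

    # 'under': longest prefix of at most $limit characters that ends a sentence.
    return $1 if $text =~ /\A(.{0,$skip}[.!?])(?=\s|\z)/s;
    # No sentence end fits: fall back to a word boundary, then to a hard cut.
    return $1 if $text =~ /\A(.{1,$limit})(?=\s)/s;
    return substr($text, 0, $limit);
}

my $text = "One two. Three four five. Six seven.";
print cut_at_sentence($text, 12, 'under'), "\n";   # One two.
print cut_at_sentence($text, 12, 'over'),  "\n";   # One two. Three four five.
```

    With a real newsletter the limit would be 2000. The small limit here just makes the two behaviours easy to see.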

    It may be handier to end at a paragraph break and have the writer put in a double newline to end paragraphs. This would be easy for a program to find, and the program could break out once the character count is > 2000 and it is at the end of a paragraph. Or you could have the author put 2 spaces after sentences and nowhere else, ever. That could be found and split on.
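
    A rough sketch of the paragraph variant, assuming paragraphs are separated by two or more newlines as suggested above. The helper name cut_at_paragraph is made up, and the running count here ignores the separators themselves:

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative, not from the thread).
# Keeps whole paragraphs and stops at the end of the paragraph in which
# the running character count first exceeds $limit.
sub cut_at_paragraph {
    my ($text, $limit) = @_;
    my @kept;
    my $count = 0;
    # Two or more newlines end a paragraph; skip empty pieces.
    for my $para (grep { length } split /\n{2,}/, $text) {
        push @kept, $para;
        $count += length $para;    # separators are not counted
        last if $count > $limit;
    }
    return join "\n\n", @kept;
}

print cut_at_paragraph("aaaa\n\nbbbb\n\ncccc", 5), "\n";
```

    With a limit of 5 this keeps the first two paragraphs, because the count only passes the limit partway through the second one. The result can therefore run over the limit by up to one paragraph.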

    Basically, you have 2 conditions for ending: after a 'sentence' and after 2000 characters. For the characters, keep a running count; determining a good breaking point in software is discussed in the preceding paragraphs.