jck has asked for the wisdom of the Perl Monks concerning the following question:

just starting to use regex, so this is probably obvious to those of you with so much more experience than i! i have this loop to edit text entries in a form to convert news stories to render HTML.
foreach ('postby','title','teaser','content') { $in{$_} =~ s/([\r\n]){2,}/\n<\/p><p>\n/g; }
the objective is to delimit the paragraphs wherever there are two or more CR entered in the textfield, and it works, but it replaces even one linefeed with another </p><p>, so that every time it's edited and updated in the database, it accumulates more useless <p> tags:

.....end of first paragraph</p><p> </p><p> </p><p> </p><p> beginning of next paragraph....

Re: regex to replace linefeeds with <p> tags
by liverpole (Monsignor) on Dec 25, 2006 at 21:33 UTC
    Hi jck,

    One simple method of fixing it immediately comes to mind.

    Since it's HTML, why not just skip putting the newline before and after the </p> ... <p>:

    $in{$_} =~ s/([\r\n]){2,}/<\/p><p>/g;

    That way, you at least avoid the newline accumulation problem.

    Update:  If you really have your heart set on putting them on their own line, something like:

    $in{$_} =~ s/(?!^<\/p><p>)([\r\n]){2,}/\n<\/p><p>\n/g;

    Might do the trick.  It uses a zero-width negative lookahead assertion, (?!...), which is meant to avoid adding </p> ... <p> to any line which contains that pair (and only that pair) already.
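For readers unsure which assertion is which, here is a minimal sketch of the two negative lookarounds. The helper name count_matches and the sample string are made up for illustration; they are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times a pattern matches in a string.
sub count_matches {
    my ($re, $text) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

# Sample data (an assumption): one blank line right after <p>, one right before </p>.
my $text = "<p>\n\nfoo\n\n</p>";

# (?!X) is a negative lookahead: what FOLLOWS this position must not be X.
print count_matches(qr/\n\n(?!<\/p>)/, $text), "\n";

# (?<!X) is a negative lookbehind: what PRECEDES this position must not be X.
print count_matches(qr/(?<!<p>)\n\n/, $text), "\n";
```

Each pattern matches only one of the two blank lines: the lookahead rejects the one followed by </p>, and the lookbehind rejects the one preceded by <p>.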


Re: regex to replace linefeeds with <p> tags
by ysth (Canon) on Dec 25, 2006 at 21:35 UTC
    Your [\r\n] is suspicious. Does the data have \r or \n or \r\n? If the latter, you don't want a character class, since that matches even a single "\r\n" when presumably you mean it to match only "\r\n\r\n...".

    You might try stripping out all \r's before doing the regex and using \n as the only line ending, with your regex looking for just \n{2,}
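A minimal sketch of that suggestion, assuming the input uses either "\r\n" or "\n" line endings (the sub name mark_paragraphs is hypothetical). Note that deleting every \r would run lines together if the data ever used a bare "\r" as its line ending.

```perl
use strict;
use warnings;

# Hypothetical helper: strip carriage returns, then tag paragraph breaks.
sub mark_paragraphs {
    my ($text) = @_;
    $text =~ tr/\r//d;                    # delete every \r
    $text =~ s/\n{2,}/\n<\/p><p>\n/g;     # two or more newlines = paragraph break
    return $text;
}

my $once = mark_paragraphs("first\r\n\r\nsecond\r\nstill second");

# Simulate the form sending the stored text back with \r\n line endings.
(my $resubmitted = $once) =~ s/\n/\r\n/g;
my $twice = mark_paragraphs($resubmitted);

print $once eq $twice ? "stable\n" : "accumulating\n";
```

Because a single line break can no longer match, running the text through the substitution again on a later edit should leave it unchanged.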

        I'll second Joost's approach, but make a few suggestions for readability / maintainability.
        • Use qw to de-clutter the list of strings.
        • Use an explicit variable in the for; they are cheap and they make your intention clear.
        • Use [ ] brackets as the regex delimiters so you won't have to backslash the slash. This de-emphasizes some of the executable line noise effect.
        • Use the regex x modifier to put some whitespace and comments in here.
        foreach my $field (qw(postby title teaser content)) {
            $in{$field} =~ s[ (\r? \n){2,} ]      # two or more line breaks
                            [\n</p>\n<p>\n]gx;    # close one para, open another
        }
        throop
        I didn't suggest that because then you are (assuming the data was consistent in the first place) leaving most lineends as "\r\n" but those at paragraph breaks as "\n", and that bothered me.
      actually, i DO want to match "\r\n" (or "\n\r" for that matter).

      the bottom line is that i want to identify two+ linebreaks as a paragraph break, whether the linebreaks are \r or \n (or a mixture of both).

        But the behaviour you describe indicates that the user input is coming back as the sequence \r\n for a single line break.
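To make that diagnosis concrete, here is a small sketch that replays the original substitution over several saves. It assumes the browser sends every line break back as "\r\n"; the sub names original_regex and as_submitted are made up for the illustration.

```perl
use strict;
use warnings;

# The substitution from the original question, wrapped in a sub.
sub original_regex {
    my ($text) = @_;
    $text =~ s/([\r\n]){2,}/\n<\/p><p>\n/g;
    return $text;
}

# Assumption: the form submits every line break as \r\n.
sub as_submitted {
    my ($text) = @_;
    $text =~ s/\r?\n/\r\n/g;
    return $text;
}

my $text = "one\r\n\r\ntwo";
for my $save (1 .. 3) {
    $text = original_regex(as_submitted($text));
    my $tags = () = $text =~ /<p>/g;
    print "save $save: $tags <p> tags\n";
}
```

Under that assumption the tag count goes 1, 3, 7: each "\r\n" is two characters from the class, so every single line break already satisfies {2,} and gets its own pair of tags.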
Re: regex to replace linefeeds with <p> tags
by jck (Scribe) on Dec 26, 2006 at 02:29 UTC
    thanks to all for the great suggestions. they're all very helpful.

    liverpole, i agree with you, and i was thinking that i would just leave out the linefeeds, but when the posts are long, i don't like seeing the text all strung together without easily seeing the paragraph breaks - just a preference thing.

    a general question about the \r? that both Joost and throop suggest......i started out with \n{2,} but found that some of my users were cutting and pasting from word processors, and that introduced the occasional \r into the mix. so, will [ (\r? \n){2,} ] match "\r\r" ? that was what i was hoping would work with the [\r\n]{2,} - that it would match \r\r or \r\n or \n\r or \n\n (as well as \r\r\n and \r\n\r and \r\n\n and \n\n\r etc etc etc.....)

    clearly, passing through twice, and changing any \r to \n and then matching the \n{2,} to replace to the para tags would be reliable, but seems inefficient.
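One caveat with the two-pass idea, sketched below with hypothetical sub names: turning every \r into \n converts a single "\r\n" into "\n\n", which then looks like a paragraph break. Collapsing "\r\n" (or a lone "\r") into one "\n" avoids that.

```perl
use strict;
use warnings;

# Naive: every \r becomes \n, so "\r\n" turns into "\n\n".
sub naive_normalize {
    my ($text) = @_;
    $text =~ s/\r/\n/g;
    return $text;
}

# Safer: "\r\n" or a lone "\r" becomes exactly one "\n".
sub safer_normalize {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

my $input = "line one\r\nline two";    # one ordinary line break
for my $normalize (\&naive_normalize, \&safer_normalize) {
    my $text = $normalize->($input);
    my $breaks = () = $text =~ /\n{2,}/g;
    print "paragraph breaks found: $breaks\n";
}
```

The naive version reports one paragraph break for what was a single line break; the safer version reports none.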

      Pasting from Windows environments will introduce \r\n because that's what Microsoft uses for linebreaks. It won't introduce \n\r or \r\r.

      You don't want to introduce a <p> from a single <RETURN>, right? \r?\n is what you want to match.
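A quick way to check which sequences (\r?\n){2,} treats as a paragraph break. This is only a sketch; the sub name is_paragraph_break and the list of test cases are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains two or more
# consecutive line breaks, each written as \n or \r\n.
sub is_paragraph_break {
    my ($text) = @_;
    return $text =~ /(?:\r?\n){2,}/ ? 1 : 0;
}

for my $case ("\n\n", "\r\n\r\n", "\n\r\n", "\r\n\n", "\r\n", "\r\r", "\n\r") {
    # Make the control characters visible for printing.
    (my $shown = $case) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    printf "%-10s %s\n", $shown,
        is_paragraph_break($case) ? "paragraph break" : "no match";
}
```

The first four cases match; a single "\r\n", "\r\r" and "\n\r" do not, since each contains fewer than two \n characters.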

      If it's clearer to you, go ahead and remove all the \r in one pass and then handle the \n. Don't worry about efficiency here: you're doing IO! The number of CPU cycles it takes to get a response from the keyboard to the CPU is enormous in comparison to the cycles to do a string replace.

      throop

Re: regex to replace linefeeds with <p> tags
by j3 (Friar) on Dec 26, 2006 at 17:06 UTC

    Hi jck,

    Might not be too relevant here, but note that, in general, if you need to convert text to html you might have a look at Markdown. There's even a Text::Markdown module for it.

Re: regex to replace linefeeds with <p> tags
by f00li5h (Chaplain) on Dec 27, 2006 at 01:44 UTC

    The Template Toolkit filter html_para may also help you. It wraps <p> tags around paragraphs (delimited by a blank line).

    Merlyn will tell you how to use Template in this article, one of his spiffy Linux Magazine columns.
