HTML String Parsing

Ionizor has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to get my Perl script to pull newline characters out and put HTML <BR> tags in instead. The idea is that my site's users can post comments that will end up in a text file for SSI inclusion. That part works great but I don't want them to be able to put obscene amounts of blank space in my pages so I want to strip out <BR> tags on otherwise blank lines.

Here's the relevant code:

# $comment is input from the user via web form

# Fixes multi-line comments to actually be multi-line
$stripvariable = "\n";
$joinvariable = "&ltBR&gt";
@commentlines = split(/$stripvariable/, $comment);
$comment = join($joinvariable,@commentlines);

# This is supposed to eliminate blank lines but it doesn't work
#  $stripvariable = "&ltBR&gt\n&ltBR&gt";
#  $joinvariable = "&ltBR&gt";
#  @commentlines = split(/$stripvariable/, $comment);
#  $comment = join($joinvariable,@commentlines);

# I have no idea why this eliminates those funky rectangle 
# characters from the file but it does.
$stripvariable = "&ltBR&gt";
$joinvariable = "\n&ltBR&gt";
@commentlines = split(/$stripvariable/, $comment);
$comment = join($joinvariable,@commentlines);

$comment is then formatted and written to a file (this part also works fine).
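
A minimal sketch of why the commented-out step never fires, assuming the join string is a literal `<BR>` and using a made-up sample comment. Once the first split/join has run, no newline is left in `$comment`, so a pattern that contains `\n` has nothing to match. The blank line has already become two adjacent break tags.

```perl
use strict;
use warnings;

# Hypothetical sample input: two lines of text separated by a blank line.
my $comment = "first line\n\nsecond line";

# Step 1 from the question, with a literal <BR> as the join string.
$comment = join('<BR>', split(/\n/, $comment));

# Every newline is gone now; the blank line is two adjacent tags.
my $has_newline = ($comment =~ /\n/) ? 1 : 0;

# The commented-out step: its pattern needs a "\n" between the tags,
# so split finds nothing to split on and the string comes back unchanged.
my $before = $comment;
$comment = join('<BR>', split(/<BR>\n<BR>/, $comment));
my $unchanged = ($comment eq $before) ? 1 : 0;

print "$comment\n";
print "has newline: $has_newline, unchanged: $unchanged\n";
```

Matching on the adjacent tags themselves, or dropping the blank lines before the join, avoids the problem.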


Replies are listed 'Best First'.
(jeffa) Re: HTML String Parsing by jeffa (Bishop) on Dec 10, 2001 at 00:14 UTC
Check out HTML::FromText. It's as easy as:

    use HTML::FromText;
    # after you get content 'into' $text
    print text2html($text, lines => 1);

The 'lines' arg gives the behavior of replacing newlines with `<br>` tags; read the docs to find out about more useful features this module has. Much easier. ;)

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)
Re: HTML String Parsing by Ionizor (Pilgrim) on Dec 10, 2001 at 00:47 UTC
That module rocks! Unfortunately, it doesn't do elimination of blank lines, which is the part that's giving me the trouble.
(jeffa) 2Re: HTML String Parsing by jeffa (Bishop) on Dec 10, 2001 at 02:22 UTC
Sigh. I was hoping you wouldn't say that. ;) Personally, i think you should use said module with the 'paras' arg instead of 'lines', because the browser does an excellent job with text placement. If you are worried about width, just embed the resulting writeup in a `<table>`. Besides, 'paras' _does_ eliminate that unwanted white space.

thraxil's solution is nice, by the way. Note how both \n and \r are accounted for. thraxil++

If you are still hell bent on using `<br>` tags then here is a hack i came up with, borrowing a little from thraxil and accounting for extra whitespace:

    $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>/g;
    $comment =~ s/[\n\r](?!<p>)/<br>\n/g;
    $comment =~ s/<p>/<p>\n/g;

The first regex replaces two or more newlines surrounded by possible other whitespace with a `<p>` on its own line (and if you think that the two \s* thingies are unnecessary, try this without em). I left out the trailing newline in the substitution because i just couldn't get a negative lookahead to work in the next regex. Hence, the third regex. I am sure that there is a way to use a negative lookahead to avoid having to resort to the third regex, but I would just use HTML::FromText anyway! The second regex replaces all newlines that are not followed by a `<p>` tag with a `<br>` tag and newline.

I would have rather liked for this to work:

    $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>\n/g;
    $comment =~ s/(?!<p>)[\n\r](?!<p>)/<br>\n/g;

but as i said, this just didn't work. :( .o0(?)

UPDATE: Looks like you have your solution, but consider how much time it takes (barring educational purposes of course) for you to figure out these little details instead of finding a CPAN module, especially when putting together a site. Granted, this one didn't do exactly what you need. But do you really need 'exactly' what you need? (ask that question to the great film makers)

jeffa
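
A hedged sketch of one way to get the two-regex form working. The leading `(?!<p>)` in the second attempt can never fail, because a lookahead placed in front of `[\n\r]` inspects the newline itself, which is never `<p>`. A negative lookbehind, `(?<!<p>)`, checks the characters before the newline instead. The sub names and the sample string are made up for illustration, and the sample assumes plain `\n` line endings.

```perl
use strict;
use warnings;

# Three-regex version from the post above.
sub three_step {
    my ($text) = @_;
    $text =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>/g;
    $text =~ s/[\n\r](?!<p>)/<br>\n/g;
    $text =~ s/<p>/<p>\n/g;
    return $text;
}

# Two-regex variant (an assumption, not from the thread): the check in
# front of the newline is a negative lookbehind, which tests the three
# characters before the newline rather than the ones after it.
sub two_step {
    my ($text) = @_;
    $text =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>\n/g;
    $text =~ s/(?<!<p>)[\n\r](?!<p>)/<br>\n/g;
    return $text;
}

# Hypothetical sample: one line break, then a run of blank lines,
# one of which contains only spaces.
my $sample = "one\ntwo\n  \n\nthree";

my $three = three_step($sample);
my $two   = two_step($sample);

print $two, "\n";
print "same result as the three-regex version\n" if $two eq $three;
```

On this sample both versions give `one<br>`, `two`, `<p>`, `three` on separate lines.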
Re: HTML String Parsing by Ionizor (Pilgrim) on Dec 13, 2001 at 11:33 UTC
Re: HTML String Parsing by greywolf (Priest) on Dec 10, 2001 at 00:19 UTC
Since I am not a regex master I would do it in 2 steps. The first simply replaces every newline with a break. The second replaces every pair of breaks with a single break. The & may need to be escaped, I can't test this right now to be sure.

    $comment =~ s/\n/&ltbr&gt/g;
    $comment =~ s/&ltbr&gt&ltbr&gt/&ltbr&gt/g;

It's not the most elegant solution and I'm sure somebody else will whip up something really cool. It should do the trick though.

mr greywolf
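
A small sketch of how the pair replacement behaves on longer runs, assuming literal `<br>` tags and a made-up input. Each pair is replaced once per pass, so four breaks shrink to two rather than one. A quantified group collapses a run of any length in a single pass.

```perl
use strict;
use warnings;

# Hypothetical sample: three blank lines between two lines of text.
my $input = "a\n\n\n\nb";

# Two-step approach from the post, written with literal <br> tags.
my $pairs = $input;
$pairs =~ s/\n/<br>/g;
$pairs =~ s/<br><br>/<br>/g;        # each pair becomes one tag, once

# Variant (an assumption): collapse a run of two or more tags in one pass.
my $runs = $input;
$runs =~ s/\n/<br>/g;
$runs =~ s/(?:<br>){2,}/<br>/g;

print "$pairs\n";   # a<br><br>b
print "$runs\n";    # a<br>b
```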
Re: Re: HTML String Parsing by thraxil (Prior) on Dec 10, 2001 at 01:41 UTC
this will break if they put spaces on the blank lines and also probably won't work right if they're on a platform that sends \r\n instead of just \n. i usually use something like:

    $comment = "<p>" . (join '</p><p>', grep {!/^\s+$/} split /[\n\r]+/, $comment) . "</p>";

which could be adapted to use `<br>` instead of `<p>` tags if you prefer.

anders pearson
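
The one-liner above can be read right to left as three stages. Here is a sketch that spells them out on a made-up input with mixed line endings and a space-only line. The intermediate array names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical form input: \r\n and \n line endings, one space-only line.
my $comment = "first\r\n   \r\nsecond\n\nthird";

# 1. Split on runs of \n and \r, so an empty line produces no field at all.
my @lines = split /[\n\r]+/, $comment;

# 2. Drop the fields that contain nothing but whitespace.
my @kept = grep { !/^\s+$/ } @lines;

# 3. Wrap what is left in paragraph tags.
my $html = "<p>" . join('</p><p>', @kept) . "</p>";

print scalar(@lines), " fields, ", scalar(@kept), " kept\n";  # 4 fields, 3 kept
print "$html\n";
```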
Re: HTML String Parsing by Ionizor (Pilgrim) on Dec 10, 2001 at 02:10 UTC
Awesome! Thanks! That worked beautifully. I ended up using:

    $comment = (join "\n<BR>", grep {!/^\s+$/} split /[\n\r]+/, $comment);
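
One edge case that may be worth checking: when the comment starts with a line break, `split` returns an empty leading field, and `!/^\s+$/` lets it through because `\s+` needs at least one character. The sketch below uses a made-up input and a hypothetical variant that keeps only fields containing a non-whitespace character.

```perl
use strict;
use warnings;

# Hypothetical input that starts with a blank line.
my $comment = "\r\nfirst\r\n   \r\nsecond\n\n";

# Version from the post: the empty leading field survives the grep,
# so the result starts with a stray newline and <BR>.
my $original = join "\n<BR>", grep { !/^\s+$/ } split /[\n\r]+/, $comment;

# Variant (an assumption): keep only fields with a non-whitespace character.
my $variant = join "\n<BR>", grep { /\S/ } split /[\n\r]+/, $comment;

print "original: [$original]\n";
print "variant:  [$variant]\n";
```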
Re: HTML String Parsing by Ionizor (Pilgrim) on Dec 10, 2001 at 00:44 UTC
Hrm... Well I got printed `<BR>` tags when I tried putting the first part into the script so I set it to `<BR>` instead of `&ltBR&gt` and it worked the same as my original, looked neater, and used fewer variables so I kept it (thanks!). The second part produces the same results I got with my original script, however...