in reply to HTML String Parsing

Check out HTML::FromText.

it's as easy as:

use HTML::FromText;
# after you get content 'into' $text
print text2html($text, lines => 1);

The 'lines' arg gives the behavior of replacing newlines with <br> tags. Read the docs to find out about the more useful features this module has. Much easier. ;)
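For readers without the module installed, here is a rough core-Perl sketch of the behavior described above (newlines become <br> tags). The sub name naive_lines_to_br is made up for illustration, and this is not HTML::FromText's implementation; the module's real output may well differ in its details.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration only, NOT HTML::FromText itself.
sub naive_lines_to_br {
    my ($text) = @_;

    # Escape characters that are unsafe in HTML; '&' has to go first
    # so the entities added afterwards are not escaped a second time.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    # Turn every line break (\r\n, lone \r, or lone \n) into <br> plus a newline.
    $text =~ s/\r\n|\r|\n/<br>\n/g;

    return $text;
}

print naive_lines_to_br("1 < 2\nfish & chips\n");
```

Note that this does nothing about runs of blank lines: each one simply becomes another <br>.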

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
F--F--F--F--F--F--F--F--
(the triplet paradiddle)

Re: HTML String Parsing
by Ionizor (Pilgrim) on Dec 10, 2001 at 00:47 UTC

    That module rocks!

    Unfortunately, it doesn't do elimination of blank lines, which is the part that's giving me the trouble.

      Sigh. I was hoping you wouldn't say that. ;)

      Personally, i think you should use said module with the 'paras' arg instead of 'lines'. The reason is that the browser does an excellent job with text placement. If you are worried about width, just embed the resulting writeup in a <table>. Besides, 'paras' _does_ eliminate that unwanted white space.

      thraxil's solution is nice, by the way. Note how both \n and \r are accounted for. thraxil++

      If you are still hell bent on using <br> tags then here is a hack i came up with, borrowing a little from thraxil and accounting for extra whitespace:

      $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>/g;
      $comment =~ s/[\n\r](?!<p>)/<br>\n/g;
      $comment =~ s/<p>/<p>\n/g;

      The first regex replaces two or more newlines, surrounded by possible other whitespace, with a <p> on its own line (and if you think that the two \s* thingies are unnecessary, try this without 'em). I left out the trailing newline in the substitution because i just couldn't get a negative lookahead to work in the next regex. Hence, the third regex. I am sure that there is a way to use a negative lookahead to avoid having to resort to the third regex, but I would just use HTML::FromText anyway!

      The second regex replaces all newlines that are not followed by a <p> tag with a <br> tag and newline. I would rather have had this work:

      $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>\n/g;
      $content =~ s/(?!<p>)[\n\r](?!<p>)/<br>\n/g;

      but as i said, this just didn't work. :( .o0(?)
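A likely reason the two-regex attempt misbehaves: the leading (?!<p>) is a lookahead, so it inspects the same characters that [\n\r] is about to match, and a newline never starts with "<p>". The newline right after the <p> therefore gets a <br> as well. (As posted, the second statement also works on $content rather than $comment, which may just be a typo.) The sketch below runs the three-regex hack unchanged and, next to it, a two-regex variant that swaps the leading lookahead for a negative lookbehind, (?<!<p>). The sub names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# The three-regex hack from the post, wrapped in a (hypothetical) sub.
sub three_regex {
    my ($comment) = @_;
    $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>/g;   # blank-line runs -> "\n<p>"
    $comment =~ s/[\n\r](?!<p>)/<br>\n/g;         # other newlines  -> "<br>\n"
    $comment =~ s/<p>/<p>\n/g;                    # put <p> on its own line
    return $comment;
}

# Two-regex variant (an assumption, not from the post): a negative
# lookbehind skips the newline that directly follows the <p>.
sub two_regex {
    my ($comment) = @_;
    $comment =~ s/(?:\s*[\n\r]\s*){2,}/\n<p>\n/g;
    $comment =~ s/(?<!<p>)[\n\r](?!<p>)/<br>\n/g;
    return $comment;
}

my $sample = "line one\nline two\n\n\n  \nline three\nline four";

print three_regex($sample), "\n";
print "----\n";
print two_regex($sample), "\n";
```

Both subs should print the same five lines for this sample: "line one<br>", "line two", "<p>", "line three<br>", "line four". One caveat that applies to both versions: a single "\r\n" pair appears to satisfy the {2,} quantifier on its own, because \r and \n each count as one line break, so CRLF input probably wants normalizing first (for example with s/\r\n?/\n/g).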

      UPDATE:
      Looks like you have your solution, but consider how much time it takes (barring educational purposes of course) for you to figure out these little details instead of finding a CPAN module - especially when putting together a site. Granted, this one didn't do exactly what you need - but, do you really need 'exactly' what you need? (ask that question of the great filmmakers)

      jeffa

        The problem with tables is that they produce a crapload of HTML overhead. My site (http://mbn.dhs.org:81/) is really heavy on the tables already so I wanted to use the <BR> tags because they save a lot of space and bandwidth. The code produced by the script is actually just a list of comments included inside a table as an SSI. It's the comments page: http://mbn.dhs.org:81/comments.shtml

        I used thraxil's solution (you've been given credit in the script btw ;) because it's short, elegant, and easily understandable.

        The site is definitely a learning experience for me Perl-wise, which is why I've decided against using modules. The more programming I have to do the more I'll learn and the happier I'll be. It's not as though the site has a deadline or anything since I'm writing and running it to cater to a very limited group who don't really _need_ the site anyway.

        As for having it do exactly what I need it to do... I must admit to being a bit of a perfectionist. This occasionally gets me in trouble :)