Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am working with the newsboy.pl script available at http://www.dieresis.com - I'm a newbie and still trying to grok regexps. The only problem is, \n characters are replaced with spaces when output as HTML code. I need them to somehow be replaced with line break HTML tags. This happens when it grabs information from a form, but I don't understand:
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
and this is what I notice later on in the code:
$sub_body =~ s/\n/\x20/gi;
is this the cause of my problem? If so can I do anything about it?

Re: form parsing
by btrott (Parson) on May 10, 2000 at 23:26 UTC
    That first regular expression is translating hex encoded characters back into regular characters. For example, if it sees %20, it will translate that into a space.
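    As a small illustration of that decoding step, here is a minimal sketch. The helper name decode_hex_escapes and the sample input are made up for the example; only the substitution itself comes from the script in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper; the s///eg line is the one quoted in the question.
sub decode_hex_escapes {
    my ($value) = @_;
    # %XX becomes the single byte whose code is hex XX.
    # /e runs the replacement as Perl code, /g repeats it for every match.
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return $value;
}

# %20 is a space and %0A is a newline, so this prints two lines.
print decode_hex_escapes("line%20one%0Aline%20two"), "\n";
```

    Text that contains no %XX sequences passes through unchanged.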

    W/r/t the second question--yes, that looks like it's causing the spaces instead of newlines. \x20 is a space character (\x20 = hex 20, so space in ASCII).

    You could change that to:

    $sub_body =~ s/\n/<br>/g;
    Try that and see if it works for you.
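    A minimal sketch of the substitution in isolation, in case it helps to try it outside the script. The sub name newlines_to_br and the sample text are invented for the example. The optional \r is an assumption on my part: browsers commonly submit textarea line breaks as CRLF, and without it a stray carriage return would be left in front of each tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn each line break into an HTML <br> tag.
sub newlines_to_br {
    my ($sub_body) = @_;
    # \r? also swallows the carriage return of a CRLF pair (assumption:
    # the form data may arrive with CRLF line endings).
    $sub_body =~ s/\r?\n/<br>/g;
    return $sub_body;
}

print newlines_to_br("first line\nsecond line\r\nthird line"), "\n";
```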
Re: form parsing
by mikfire (Deacon) on May 10, 2000 at 23:37 UTC
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    translates the %20 and whatnot of URL encoding into the correct characters. This is done with the help of the magic /e modifier ( see the library and search on egimosx for a full discussion ), which causes perl to run the REPLACEMENT part through an eval first. In this case, the pack statement is translating the 20, e.g., into a space.

    The line giving you fits is the second one, which is replacing all newlines with 20 spaces. And yes, there is something you can do about it :).
    Try this instead:

    $sub_body =~ s/\n/<BR>/g;
    which does exactly what you asked - substitutes <BR> tags for every newline.

    I am curious to know why they used the /i modifier on the regex, though. To the best of my knowledge, newlines do not come in upper- and lower- case flavours.

    Mik
    Mik Firestone ( perlus bigotus maximus )

    UPDATE Sorry - I badly misparsed the regex. Don't ask why because I couldn't tell you for certain. It is replacing it with, as btrott so correctly stated, a single space.
    mea culpa,
    Mik

RE: form parsing, hex, HTML formatting
by muppetBoy (Pilgrim) on May 11, 2000 at 14:50 UTC
    The first regular expression is quite neat. I came up with a similar - but quite ungainly - solution to the problem (see the code below). I assumed that my code, being a bit of a mess, would be particularly slow and inefficient, so, being curious, I compared the two solutions using some code borrowed from httptech.
    #!/usr/local/bin/perl
    use Benchmark;

    my $str = "the %0A quick %2b brown %63 fox %63 jumped %24 over %7e the %4f%45 lazy %25 dog";

    timethese (500000, {
        'myway' => sub {
            $_ = $str;
            while (/%[0-9A-Fa-f]{2}/) {
                $_ = $&;
                /[0-9A-Fa-f]{2}/;
                $ob[0] = hex($&);
                $temp = pack("C*", @ob);
                $name =~ s/%[0-9A-Fa-f]{2}/$temp/;
                $_ = $name;
            }
        },
        'regexpway' => sub {
            $_ = $str;
            s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        },
    });

    Benchmark: timing 500000 iterations of myway, regexpway...
    myway: 23 wallclock secs (21.99 usr + 0.00 sys = 21.99 CPU)
    regexpway: 57 wallclock secs (57.58 usr + 0.00 sys = 57.58 CPU)
    I was quite surprised that my messy solution appeared to be significantly faster than the regexp.
    Any ideas why this is?
      Don't hold me to this, but I believe the difference in speed is the eval. A s///e causes perl to eval the REPLACEMENT part. The perldoc for eval warns that this is an expensive thing to do.

      Your regex merely does double-quotish expansion. Even with the other work you are doing, you are not kicking off a new interpreter.

      But this is only a guess.
      Mik

      Interesting Perl version you have. For me, your code is slower. On using $&, man perlvar states:
      	The use of this variable anywhere in a program
      	imposes a considerable performance penalty on all
      	regular expression matches.  See the
      	Devel::SawAmpersand module from CPAN for more
      	information.
      
      Which is consistent with these results:
      Benchmark: timing 500000 iterations of regexpway...
       regexpway: 22 wallclock secs (22.50 usr +  0.05 sys = 22.55 CPU)
      
      and
      
      Benchmark: timing 500000 iterations of myway, regexpway...
           myway: 50 wallclock secs (49.41 usr +  0.18 sys = 49.59 CPU)
       regexpway: 25 wallclock secs (23.24 usr +  0.09 sys = 23.33 CPU)
      
        /users/mfiresto/experiment)perl timeclean.pl
        Benchmark: timing 500000 iterations of myway, regexpway...
        myway: 24 wallclock secs (24.11 usr + 0.00 sys = 24.11 CPU)
        regexpway: 70 wallclock secs (67.60 usr + 0.00 sys = 67.60 CPU)

        /users/mfiresto/experiment)perl -v
        This is perl, version 5.005_03 built for sun4-solaris

        /users/mfiresto/experiment)uname -a
        SunOS XXXX 5.5.1 Generic_103640-28 sun4u sparc SUNW,Ultra-5_10

        /users/mfiresto/experiment)perl timeclean.pl
        Benchmark: timing 500000 iterations of myway, regexpway...
        myway: 19 wallclock secs (18.44 usr + 0.00 sys = 18.44 CPU)
        regexpway: 40 wallclock secs (40.65 usr + 0.00 sys = 40.65 CPU)

        /users/mfiresto/experiment)perl -v
        This is perl, version 5.005_03 built for i686-linux

        /users/mfiresto/experiment)uname -a
        Linux XXXX 2.0.36 #1 Tue Dec 29 13:11:13 EST 1998 i686 unknown

        And, just for something completely different, I thought to try this against perl 5.6.0 and the results are very similar.

        UPDATE
        It has dawned on me while reading this thread that this exposes one of the dangers of relying too heavily on Benchmark to determine the best algorithm. Notice the man pages for $& state it will impose a serious penalty on all other regexes. In normal code for me, this would indeed be disastrous because I use a lot of regexes.

        However, in the Benchmark code, there are only two regex statements used. With a bit of cleverness, I was able to remove the $& reference and received no real speed improvement. That is because Benchmark does not run in real-world conditions. We all tend to extract the part we wish to test and just run that. In this case, it may not work. If we were to test this code in a real world situation, we may see a difference.

        Then again, we may not. According to some quick experiments, Benchmark cannot reliably measure the first method in less than ( approximately ) 9500 iterations. Personally, I have not seen a CGI parameter that contains 9500 'escaped' characters. Do the benchmarks at this point really mean anything?

        Shouldn't we be more concerned with good code? How about which one is easier to maintain? How about which one is more Perlish? Which one fits the coder better?

        Wow. Sorry to rant. I have been thinking about this too much.

        Mik
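        For anyone who wants to repeat the experiment, here is a rough sketch of a loop-style decoder that avoids $& by using a capture group instead. This is not Mik's actual code (he did not post it); the sub names loop_way and regexp_way are invented for the example, and the test string is the one from the benchmark above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "the %0A quick %2b brown %63 fox %63 jumped %24 over %7e the %4f%45 lazy %25 dog";

# Hypothetical $&-free variant of the loop approach: decode one escape
# per pass, using $1 rather than $& to get at the hex digits.
sub loop_way {
    my ($s) = @_;
    while ($s =~ /%([0-9A-Fa-f]{2})/) {
        my $char = pack("C", hex($1));
        $s =~ s/%[0-9A-Fa-f]{2}/$char/;   # replaces the first escape only
    }
    return $s;
}

# The single-pass substitution from the original question.
sub regexp_way {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/eg;
    return $s;
}

print loop_way($str) eq regexp_way($str) ? "same\n" : "different\n";
```

        Both subs take and return a plain string, so they can be wrapped in closures for timethese without touching $_. One behavioural difference worth knowing before comparing timings: the loop rescans from the start of the string on every pass, so an input such as "%2541" ends up as "A", whereas the single s///eg pass leaves it as "%41".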

        That was exactly what I was expecting, I thought the $& would cause a big performance hit. I'm even more puzzled by my results now :-(
        Incidentally, I'm running v. 5.005003 on Solaris.
      Yes, well, the regex actually works.
      #!/usr/bin/perl -w

      my $str = "the %0A quick %2b brown %63 fox %63 jumped %24 over %7e the %4f%45 lazy %25 dog";

      print "MYWAY: ", myway(), "\nREGEXPWAY: ", regexpway(), "\n";

      sub myway {
          $_ = $str;
          while (/%[0-9A-Fa-f]{2}/) {
              $_ = $&;
              /[0-9A-Fa-f]{2}/;
              $ob[0] = hex($&);
              $temp = pack("C*", @ob);
              $name =~ s/%[0-9A-Fa-f]{2}/$temp/;
              $_ = $name;
          }
          return $_;
      }

      sub regexpway {
          $_ = $str;
          s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
          return $_;
      }
      __END__

      chh@scallop test> perl unescape
      Use of uninitialized value in substitution (s///) at unescape line 14.
      Use of uninitialized value in pattern match (m//) at unescape line 9.
      Use of uninitialized value in print at unescape line 5.
      MYWAY:
      REGEXPWAY: the quick + brown c fox c jumped $ over ~ the OE lazy % dog
      chh@scallop test>
        I'm not entirely sure what is going on here.
        Although it's a particularly messy piece of code, it does actually work! I was concerned by the results you got, so I re-ran the code myself and was unable to get the same results.
        Because I am lazy, I left out the -w and use strict in the code I posted (in my actual code everything is a little more strict). It could be something to do with this that causes the effect you have seen, although with -w I do not get any errors and it works OK. Try this:
        #!/usr/local/bin/perl -w
        use strict;

        my $desc = "I %2Blike %3A cheese";
        my ($str, @ob);

        print "TEST: ",test(),"\n";

        sub test {
            $_ = $desc;
            while (/%[0-9A-Za-z]{2}/) {
                $_ = $&;
                /[0-9A-Za-z]{2}/;
                $ob[0] = hex($&);
                $str = pack("C*", @ob);
                $desc =~ s/%[0-9A-Za-z]{2}/$str/;
                $_ = $desc;
            }
            return $desc;
        }

        returns:
        TEST: I +like : cheese
        This definitely works OK for me.
        UPDATE: My apologies, I've just noticed what was wrong with the code that I originally posted. $name is not initially set up, so the loop runs through one iteration and fails. As I didn't want to actually change $str, I changed the code a little but forgot to check it properly. This should work:
        ...
        'myway' => sub {
            $_ = $str;
            $name = $str;
            while (/%[0-9A-Fa-f]{2}/) {
                $_ = $&;
        ...
        I guess the lessons learnt are:
        • Always use -w and use strict
        • Check the code works before you post it
        • Write clear, maintainable code
        BTW the new benchmark timings are:
        Benchmark: timing 500000 iterations of myway, regexpway...
        myway: 162 wallclock secs (162.07 usr + 0.00 sys = 162.07 CPU)
        regexpway: 58 wallclock secs (58.52 usr + 0.01 sys = 58.53 CPU)
        which make a lot more sense.
        I now feel older, wiser and more than a little bit stupid.
Re: form parsing
by Anonymous Monk on May 11, 2000 at 00:22 UTC
    Textareas in Netscape preserve the newlines when sent to the cgi script. With IE though the newlines are removed and the entire message is sent as a single line. You may want to prepare some switch for this too.
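    One possible shape for such a switch is to normalize line endings before doing anything else with the text. This is only a sketch under an assumption the post does not confirm: that the browsers differ in which line-ending bytes they send (CRLF, lone CR, or LF). If a browser really strips the line breaks before submitting, nothing on the server side can restore them. The sub name normalize_newlines is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fold CRLF and lone CR into a plain LF so that later
# substitutions only have to deal with "\n".
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

my $sub_body = normalize_newlines("one\r\ntwo\rthree\nfour");
$sub_body =~ s/\n/<br>/g;   # same substitution as suggested in the replies above
print $sub_body, "\n";
```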