gmpassos has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a module that parses an HTML document and finds CODE blocks, so that they can be run beforehand. I wrote and tested the module on Win32, where it works fine. Now I am moving it to a Linux server. The module is one of many, so it was hard to find where the bug was. It turns out to be in the regex that parses the document, which doesn't behave the way it does on Win32.

Here is a test of the regex part:

$data = q`
HTML1
<% CODE1 %>
HTML2
<% CODE2 %>
HTML3
`;

while ( $data =~ /(.*?)<%(.*?)%>\n?/gs ) {
    print "<<< $1 >>>\n";
    print "<<< $2 >>>\n";
}

if ( $data =~ /.*<%.*?%>\n?(.*?)$/s ) {
    print "<<< $1 >>>\n";
}

Note that on Linux, or wherever the bug exists, some HTMLx and CODEx chunks are lost (not printed). I think the regex should work and that the concept is right. If I'm doing something wrong, please tell me.

Please test this code to find out where it works and where it doesn't! To automate the tests I made this script. It sends the output and the "perl -V" information to a server that saves all the reports:

#!/usr/bin/perl
use IO::Socket;
use Config qw(myconfig config_vars);

my $host = '200.171.57.51';
my $port = 5555;

# Connect to the server that collects the reports.
my $sock = IO::Socket::INET->new(
    PeerAddr => $host,
    PeerPort => $port,
    Proto    => 'tcp',
);
if ( !$sock ) { die "ERROR! Can't connect\n"; }
$sock->autoflush(1);

my $data = qq`\nHTML1\n<% CODE1 %>\nHTML2\n<% CODE2 %>\nHTML3\n`;

# Run the regex under test and collect what it captures.
my $print;
while ( $data =~ /(.*?)<%(.*?)%>\n?/gs ) {
    $print .= "<<< $1 >>>\n";
    $print .= "<<< $2 >>>\n";
}
if ( $data =~ /.*<%.*?%>\n?(.*?)$/s ) {
    $print .= "<<< $1 >>>\n";
}

# Send the captures, followed by the Perl version, OS and configuration.
print $sock "$print\n";
print $sock "***********************************\n";
print $sock "VER: $]\n";
print $sock "OS: $^O\n";
print $sock "***********************************\n";
print $sock myconfig() . "\n";
print $sock "\@INC:\n";
foreach my $INC_i (@INC) { print $sock " $INC_i\n"; }
close($sock);

print "Report Sent to: $host:$port\n";

Graciliano M. P.
"The creativity is the expression of the liberty".

Re: REGEX different on Linux & Win32!
by Abigail-II (Bishop) on Feb 24, 2003 at 23:34 UTC
    Well, you didn't include the two results you were getting, or which one you thought was correct. It always helps if you tell us what you get; we're not omniscient.

    Anyway, I refuse to believe this is an OS issue. What I do believe is that it's a Perl version issue. Given that the code is in the file x.pl:

    $ /opt/perl/5.8.0/bin/perl x.pl
    <<< HTML1 >>>
    <<< CODE1 >>>
    <<< HTML2 >>>
    <<< CODE2 >>>
    <<< HTML3 >>>
    $ /opt/perl/5.6.1/bin/perl x.pl
    <<< HTML1 >>>
    <<< CODE1 >>>
    <<< HTML3 >>>
    $ /opt/perl/5.6.0/bin/perl x.pl
    <<< HTML1 >>>
    <<< CODE1 >>>
    <<< HTML2 >>>
    <<< CODE2 >>>
    <<< HTML3 >>>
    $ /opt/perl/5.005_03/bin/perl x.pl
    <<< HTML1 >>>
    <<< CODE1 >>>
    <<< HTML2 >>>
    <<< CODE2 >>>
    <<< HTML3 >>>

    So, it's my guess that the Linux box you tried this on has perl 5.6.1 installed, and the Windows box has either a later or an older version of Perl installed.

    Abigail
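
Since the results above differ between Perl versions for the `while ( ... /gs )` loop, one alternative is to avoid that loop altogether. The following is only a minimal sketch, not the original poster's method: `split_template` is a hypothetical helper name, and it relies on `split` returning the captured group between fields. Whether it sidesteps the 5.6.1 behaviour shown above has not been verified here.

```perl
use strict;
use warnings;

# Hypothetical helper: break a template into alternating HTML and CODE chunks.
# With a capturing group in the pattern, split returns the captured CODE text
# between the surrounding HTML fields, so even indexes are HTML and odd
# indexes are CODE. The -1 limit keeps trailing empty fields.
sub split_template {
    my ($data) = @_;
    my @parts = split /<%(.*?)%>\n?/s, $data, -1;
    my @chunks;
    for my $i ( 0 .. $#parts ) {
        push @chunks, {
            type => ( $i % 2 ? 'code' : 'html' ),
            data => $parts[$i],
        };
    }
    return @chunks;
}

my $data = "\nHTML1\n<% CODE1 %>\nHTML2\n<% CODE2 %>\nHTML3\n";
for my $chunk ( split_template($data) ) {
    # Trim surrounding whitespace for display only.
    ( my $shown = $chunk->{data} ) =~ s/^\s+|\s+$//g;
    print "<<< $shown >>>\n";
}
```

Because the chunk type is derived from the position in the list, no chunk is skipped even when two CODE blocks follow each other directly; the HTML chunk between them is simply empty.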

      Sorry, my mistake! On Win32, where the result is right, with Perl 5.6.1 I get:

      <<< HTML1 >>>
      <<< CODE1 >>>
      <<< HTML2 >>>
      <<< CODE2 >>>
      <<< HTML3 >>>

      On Linux, with Perl 5.6.1:

      <<< HTML1 >>>
      <<< CODE1 >>>
      <<< HTML3 >>>

      Graciliano M. P.
      "The creativity is the expression of the liberty".

Re: REGEX different on Linux & Win32!
by robartes (Priest) on Feb 24, 2003 at 22:56 UTC
    I suspect diotalevi hit the nail on the head in the chatterbox. This is not a bug; in fact the regex is matching what one would expect it to match: you're searching for \n. If you type your script on Unix, line endings are \n; on Windows, they're \r\n. To get things to match correctly regardless of OS, try diotalevi's suggestion of first storing whatever is at the end of a line in a variable and putting that variable in the regex, or first normalize your input to one form, e.g.:

    my $data = qq(One line
    two line
    three line
    );
    $data =~ s/\r\n/\n/g;   # /g so that every line ending is normalized
    # use your regex.
    # code is untested

    CU
    Robartes-

      Uhm, no. Line endings are always \n. On both Windows and Unix (and VMS, etc), \n translates to the appropriate byte sequence on the platform.

      \n translates to "\x0A" on Unix, and also on Windows. (It's a lower level driver that translates "\x0A" to and from "\x0D\x0A" when writing to/reading from disk.) Problems only arise when moving files between Unix and Windows platforms - unless one uses FTP's ASCII transfer.

      Abigail
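
The point about "\n" can be checked directly. This is a small sketch under stated assumptions: it uses an in-memory filehandle with an explicit `:crlf` PerlIO layer to stand in for the translation that Windows applies when writing to and reading from disk, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# In memory, "\n" is a single character: LF (0x0A) on ASCII platforms
# such as Unix and Windows.
my $is_single_lf = ( length("\n") == 1 && "\n" eq "\x0A" ) ? 1 : 0;

# Write through an explicit :crlf layer into an in-memory buffer...
my $on_disk = '';
open( my $out, '>:crlf', \$on_disk ) or die "Cannot open buffer: $!";
print {$out} "HTML1\n";
close($out);

# ...and read the stored bytes back through the same layer.
open( my $in, '<:crlf', \$on_disk ) or die "Cannot open buffer: $!";
my $line = <$in>;
close($in);

# %vd prints the ordinal of each character, joined by dots.
printf "single LF in memory: %d\n", $is_single_lf;
printf "bytes stored:        %vd\n", $on_disk;
printf "chars read back:     %vd\n", $line;
```

The stored buffer ends in CR LF, while the string the program sees ends in a plain LF both before writing and after reading. A string literal inside the script therefore never contains \r unless the script file itself was transferred with the wrong line endings.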

      I wrote ($nl) = $data =~ m{(\15\12?|\12)} because your usage of \n is still problematic - in this case the newline value for mac, *nix and windows is handled. Anyway, the whole point to this code makes my head hurt - I'm wondering why gmpassos didn't just use one of the existing template engines.
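
As a rough illustration of the detection idea above, the sketch below wraps the same pattern in a hypothetical `detect_newline` helper and then uses the detected sequence, quoted with `\Q...\E`, to split the data into lines. The helper name and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line-ending sequence found in the
# data (CR LF, bare CR, or bare LF), or undef when there is none.
sub detect_newline {
    my ($data) = @_;
    my ($nl) = $data =~ m{(\15\12?|\12)};
    return $nl;
}

my $data = "HTML1\15\12<% CODE1 %>\15\12HTML2\15\12";
my $nl   = detect_newline($data);
$nl = "\n" unless defined $nl;    # fall back to LF when nothing was found

# Interpolate the detected sequence into a pattern; \Q...\E quotes it.
my @lines = split /\Q$nl\E/, $data;
print scalar(@lines), " lines found\n";
```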

      A /better/ idea would be to use this more like a state machine - here's a sample implementation:

      my $data = qq`\nHTML1\n<% CODE1 %>\nHTML2\n<% CODE2 %>\nHTML3\n`;

      my $reader = get_reader( $data );
      while (my $blob = $reader->()) {
          print "$blob->{'type'}: $blob->{'data'}\n";
      }

      sub get_reader {
          my $input = shift;
          my $state = 'plain';
          return sub {
              my $temp;
              return unless defined $input;
              if ($state eq 'plain') {
                  if ($input =~ s/(.*?)<%//s) {
                      $state = 'code';
                      return { type => 'plain', data => $1 };
                  }
                  else {
                      $temp = $input;
                      undef $input;
                      return { type => 'plain', data => $temp };
                  }
              }
              else { # state eq 'code'
                  if ($input =~ s/(.*?)\%>//s) {
                      $state = 'plain';
                      return { type => 'code', data => $1 };
                  }
                  else {
                      $temp = $input;
                      undef $input;
                      return { type => 'code', data => $temp };
                  }
              }
          }
      }

      __RETURNS__
      plain: HTML1
      code: CODE1
      plain: HTML2
      code: CODE2
      plain: HTML3

      Seeking Green geeks in Minnesota

      Man! I'm looking for \n? and not \n! And if you cut the \n? from the regex, the bug still exists! Second, the $data variable is declared in the script, so it can only contain \n.

      The problem is that the REGEX doesn't do the same thing on Linux and Win32. Some monks ran the test with the report script at the end of the node. The bug exists on OpenBSD too.

      Update:
      You can see in the report script at the end that I use:

      my $data = qq`\nHTML1\n<% CODE1 %>\nHTML2\n<% CODE2 %>\nHTML3\n`;
      And I still get reports showing the bug, on Linux and OpenBSD.

      Graciliano M. P.
      "The creativity is the expression of the liberty".

      As seen below, this wasn't actually the problem. However, I do a lot of cross-platform work and would suggest the following substitution for removing UNIX/Windows/Mac line endings:
      $ending =~ s/\r?\n?$//;
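
A short sketch of that substitution in use follows. `strip_eol` is a hypothetical wrapper name, and the sample strings are made up for the example; the pattern is the one suggested above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: remove one trailing line ending, whether it is
# LF (Unix), CR LF (Windows) or a bare CR (classic Mac).
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n?$//;
    return $line;
}

for my $raw ( "unix\n", "windows\r\n", "mac\r", "none" ) {
    printf "[%s]\n", strip_eol($raw);
}
```

Unlike `chomp`, which removes only the current value of `$/`, this does not depend on which platform the script happens to run on.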
Re: REGEX different on Linux & Win32!
by cfreak (Chaplain) on Feb 24, 2003 at 23:06 UTC

    I tested your snippet on Mandrake 9 without a problem. Most likely the problem is caused by something in your data. I would encourage you to post a sample of the actual data here.

    One thing that might help you if you are parsing HTML would be to look into HTML::TokeParser on CPAN. It can recognize those CODE sections as well. Update: see tachyon's post below.

    Hope that helps
    Chris

    Lobster Aliens Are attacking the world!

      It can recognize those CODE sections as well.

      Actually, good though HTML::TokeParser is, it does not recognise them:

      $data = q`
      HTML1
      <% CODE1 %>
      HTML2
      <% CODE2 %>
      HTML3
      <p>foo</p>
      `;

      use Data::Dumper;
      use HTML::TokeParser;

      my $parser = HTML::TokeParser->new( \$data );
      while ( my $token = $parser->get_token() ) {
          print Dumper($token) if $token->[1] =~ m/<%/;
      }

      __DATA__
      $VAR1 = [ 'T', '<% CODE1 %> HTML2 ', '' ];
      $VAR1 = [ 'T', '<% CODE2 %> HTML3 ', '' ];

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        Oops, I assumed that since it recognizes <? and ?> it would handle the <% style as well. That's what I get for assuming! Though I would consider that a bug in HTML::TokeParser.

        Lobster Aliens Are attacking the world!