Perl to C problem

lbrandewie has asked for the wisdom of the Perl Monks concerning the following question:

Hey folks,

I'm having a problem. I'm porting some Perl to C to speed it up, and I can't get the two languages to agree on what's going on. I'm using ActiveState Perl 5.24.1 for Windows. Consider the following code:


$test = "\n\n\n\n\n";

for ($x = 0; $x < length($test); $x++) {
    print ord(substr($test, $x, 1));
    <STDIN>;
}

The code prints a string of 10s, indicating that the 13/10 combination in Windows has been translated to a bare 10. I can understand that, for compatibility reasons. But it seems to mean there is no way in Windows of telling a 13/10 combo from a bare 10. This makes a difference in my code, in that I can't get C and Perl to hash the same string the same way. Am I missing something? Is there no way at all to tell a 13/10 from a 10 in Windows Perl?

I believe the upshot of this, where my code is concerned, is that Perl is doing it "wrong" compared to C and I'll never get the two to agree. I hope I'm wrong about that. I note with interest and mild annoyance that strlen("\n") == 1 even in C for Windows. But C allows me to get at the underlying character buffer and Perl does not.

Thanks,

Lars


Re: Perl to C problem
by haukex (Archbishop) on Feb 16, 2019 at 06:47 UTC

> $test = "\n\n\n\n\n";
> The code prints a string of 10s, indicating that the 13/10 combination in Windows has been translated to a bare 10.

That's because, even on Windows, \n represents only LF, not CRLF (that would be \r\n)*. You can verify this with Dump from Devel::Peek, which shows exactly what Perl is storing internally. The only place where Perl on Windows automatically translates between CRLF and LF is on I/O to or from a file, because the :crlf layer is enabled there by default (unless you specify the :raw layer or use binmode). So if you're reading from a file that has CRLF line endings, you might want to turn off that layer.

By the way, you might want to have a look at RPerl, which can take a subset of Perl and compile it.
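
To make that concrete, here is a minimal sketch, assuming a scratch file created with File::Temp. The helper name bytes_written_via is made up for this illustration and is not part of Perl. It requests the :crlf layer explicitly, so it should behave the same way on Windows and elsewhere:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given PerlIO layer(s),
# then read the file back untranslated and return the byte values.
sub bytes_written_via {
    my ($layers, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open(my $out, ">$layers", $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;                          # slurp the whole file
    my $data = <$in>;
    close $in;
    return join ' ', map { ord } split //, $data;
}

# In memory, "\n" is a single character with code 10.
print length("\n"), ' ', ord("\n"), "\n";          # 1 10

# Only the I/O layer decides whether a CR ends up in the file.
print bytes_written_via(':raw',  "a\n"), "\n";     # 97 10
print bytes_written_via(':crlf', "a\n"), "\n";     # 97 13 10
```

The string itself never changes; only the bytes that reach the file depend on the layer.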

* Update: As Athanasius points out, see Newlines in perlport for even more details.


Re: Perl to C problem
by Athanasius (Cardinal) on Feb 16, 2019 at 07:09 UTC

Hello lbrandewie, and welcome to the Monastery!

Have a look at perlport#Newlines. \n is Perl’s logical newline, so when reading from a file on the Windows platform, the 2-character sequence \x0D\x0A (represented here in hexadecimal) is translated to \n; and when writing to a file, \n is translated to \x0D\x0A.

If you want to distinguish between \x0A (Unix) and \x0D\x0A (Windows) in your input files, then you can read using the :raw layer:

use strict;
use warnings;
use autodie;

my $file = '1977_SoPW.txt';

open(my $fh, '<:raw', $file);
my @test = <$fh>;
close $fh;

for my $line (@test)
{
    print ord(substr($line, -2, 1)), "\t";
    print ord(substr($line, -1, 1)), "\n";
}

print "\n";

Output (where the input file “1977_SoPW.txt” contains 3 sentences, the first and third terminated by \x0D\x0A, the second by just \x0A):

16:51 >perl 1977_SoPW.pl
13      10
46      10
13      10


16:51 >

(46 is the decimal code for ASCII character ., the full stop or period, which happened to be the last text character in the line.)
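
If you would rather tally the two kinds of line ending than inspect each line, something along these lines may help. It is only a sketch: count_line_endings is a made-up name, and it assumes the data was read without newline translation (for example through the :raw layer, as above).

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF pairs and bare LFs in a string that was
# read without newline translation (for example via the :raw layer).
sub count_line_endings {
    my ($data) = @_;
    my $crlf = () = $data =~ /\x0D\x0A/g;         # CR immediately followed by LF
    my $lf   = () = $data =~ /(?<!\x0D)\x0A/g;    # LF not preceded by CR
    return ($crlf, $lf);
}

# Same shape as the sample file: first and third lines end in CRLF,
# the second in a bare LF.
my ($crlf, $lf) = count_line_endings("One.\x0D\x0ATwo.\x0AThree.\x0D\x0A");
print "CRLF: $crlf, bare LF: $lf\n";   # CRLF: 2, bare LF: 1
```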

On the other hand, if you want to distinguish between different newline representations in your code, then you should either use literal rather than logical characters:

my $test = "\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0A\x0D\x0A";

or add \r as haukex says above:

my $test = "\r\n\r\n\r\n\r\n\r\n";
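
Since the original question was about hashing, here is a small sketch of why the two strings hash differently. The thread does not say which hash is being used, so Digest::MD5 (a core module) is only an assumption standing in for it:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);   # stand-in hash; the real one is not named in the thread

my $lf_only = "\n" x 5;        # 5 characters, all 0x0A
my $crlf    = "\r\n" x 5;      # 10 characters, alternating 0x0D 0x0A

# Different lengths and different bytes, so the digests differ too.
print length($lf_only), ' ', md5_hex($lf_only), "\n";
print length($crlf),    ' ', md5_hex($crlf),    "\n";
```

For the Perl and C sides to agree, both presumably need to hash the same bytes: either both read the file untranslated (:raw in Perl, binary mode such as "rb" in C), or both read it in text mode.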

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Perl to C problem

by lbrandewie (Acolyte) on Feb 16, 2019 at 18:10 UTC

That helps a lot, thanks much!

Lars
