in reply to Re^2: line ending troubles
in thread line ending troubles

Regarding matching behavior with alternation, see "Matching this or that" in perlretut. Short answer: yes.

slurp, as implemented in Perl6::Slurp v0.03 (the version I'm using for reference), calls a 3-argument open with mode '<' if no layer information is passed. This means it behaves like a normal file open on your OS, which, as you've observed, includes the :crlf layer by default under Windows. See "Defaults and how to override them" in PerlIO.
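To see the effect of the layer in isolation, here is a minimal sketch that does not use Perl6::Slurp at all. It writes two lines in binary mode and reads them back with an explicit :raw and an explicit :crlf layer. The sample text and the helper name read_with_mode are made up for illustration; the last part just prints whatever layers a plain '<' open gets on the machine it runs on.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample with one CRLF and one LF line ending, byte for byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "dos\r\nunix\n";
close($out) or die "Failed file close: $!";

# Hypothetical helper: read the whole file through the given open mode.
sub read_with_mode {
    my ($mode, $file) = @_;
    open(my $in, $mode, $file) or die "Failed file open: $!";
    local $/;                 # undefined $/ reads the rest of the file
    my $data = <$in>;
    close($in);
    return $data;
}

my $raw  = read_with_mode("<:raw",  $path);   # bytes exactly as stored
my $crlf = read_with_mode("<:crlf", $path);   # \r\n translated to \n

printf "raw:  %d bytes\n", length $raw;
printf "crlf: %d bytes\n", length $crlf;

# Layers that a plain '<' open gets on this platform.
open(my $plain, "<", $path) or die "Failed file open: $!";
my @layers = PerlIO::get_layers($plain);
close($plain);
print "default layers: @layers\n";
```

With an explicit layer in the mode string the result should not depend on the OS default, which is why the original code passes "<:crlf" rather than relying on a plain '<'.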

If you haven't reviewed it yet, you should read about "Newlines" in perlport.

Re^4: line ending troubles
by Dirk80 (Pilgrim) on Dec 25, 2009 at 00:31 UTC

    Thanks for your answers. But now I have trouble with slurp again. I wanted to find out whether slurp in list context reads a big file all at once, i.e. whether I can use slurp to read a big file. The following code does not even create a really big file (about 4 MB), but the slurping takes more than 2 minutes.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Perl6::Slurp;

    # write file with different line endings
    # \r = 0x0D
    # \n = 0x0A
    my $win_line  = "Windows\r\n";
    my $unix_line = "Unix\n";
    my $mac_line  = "Mac\r";

    open(my $fh, ">", "le_big.txt") or die "Failed file open: $!";
    binmode($fh);
    for (1 .. 100000) {
        print $fh $win_line;
        print $fh $unix_line;
        print $fh $mac_line;
        print $fh $win_line;
        print $fh $mac_line;
        print $fh $win_line;
        print $fh $unix_line;
    }
    close($fh);

    # read file with slurp
    # PerlIO-layer 'crlf' is doing the conversion \r\n --> \n
    # i.e. the input record separator only has to handle the line endings \n and \r
    # Win32: crlf-layer is activated as default, so it is not necessary
    #        to explicitly add this layer
    # Unix and other OS: crlf-layer is NOT activated as default
    #        necessary to add this layer
    for my $line (slurp("<:crlf", "le_big.txt", {irs => qr/\n|\r/, chomp => 1}) ) {
        print $line . "\n";
    }

    # NOTE:
    # It would also be possible to write a regexp which is working
    # if the crlf-layer is active or not:
    #   irs => qr/\r\n|\n|\r/
    # crlf-layer active:     possible line endings are \n OR \r
    # crlf-layer NOT active: possible line endings are \r\n OR \n OR \r

    Am I doing something wrong or why does the slurping take so much time?

    This code was run with Perl 5.10 on Ubuntu Linux.

    Greetings,

    Dirk

      I haven't gone into great detail, but it appears the module incurs high overhead. Specifically, I ran the following benchmarks:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Perl6::Slurp;
      use Benchmark qw(cmpthese :hireswallclock);

      # write file with different line endings
      # \r = 0x0D
      # \n = 0x0A
      my $win_line  = "Windows\r\n";
      my $unix_line = "Unix\n";
      my $mac_line  = "Mac\r";

      my @strings = ();
      for (1 .. 1000) {
          push @strings, $win_line;
          push @strings, $unix_line;
          push @strings, $mac_line;
          push @strings, $win_line;
          push @strings, $mac_line;
          push @strings, $win_line;
          push @strings, $unix_line;
      }

      open(my $fh, ">", "le_big.txt") or die "Failed file open: $!";
      binmode($fh);
      print $fh $_ foreach @strings;
      close($fh);

      cmpthese(1000, {
          'naive'       => \&naive,
          'original'    => \&original,
          'local split' => \&local_split,
          'crlf split'  => \&crlf_split,
      });

      # Original code
      sub original {
          my @results = slurp("<:crlf", "le_big.txt", {irs => qr/\n|\r/, chomp => 1});
          return @results;
      }

      # Just use slurp and crlf to read the file
      sub crlf_split {
          my @initial_results = slurp("<:crlf", "le_big.txt");
          my @results = map split(/\r/), @initial_results;
          return @results;
      }

      # Just use slurp to read the file
      sub local_split {
          my @initial_results = slurp("<", "le_big.txt");
          my @results = map split(/\n|\r\n?/), @initial_results;
          return @results;
      }

      # Naive local implementation
      sub naive {
          open(my $fh, "<", "le_big.txt") or die "Failed file open: $!";
          local $/;
          my $slurp = <$fh>;
          close $fh;
          my @results = split /\n|\r\n?/, $slurp;
          return @results;
      }

      With the following results:

      time perl fluff.pl
                     Rate    original local split  crlf split       naive
      original     28.4/s          --        -32%        -43%        -80%
      local split  41.8/s         47%          --        -16%        -70%
      crlf split   49.6/s         75%         19%          --        -65%
      naive         141/s        398%        238%        185%          --

      real    1m26.512s
      user    1m26.450s
      sys     0m0.060s

      Run under perl v5.8.8 built for x86_64-linux-gnu-thread-multi, on an Ubuntu box. Note how much faster the quick-and-dirty slurp-and-split approach I wrote is: 141/s against 28.4/s for the original, roughly five times the rate. The moral, I think, is that you should only use this module if you have good reason. Note as well that I'm pretty sure the half-way solutions (crlf split and local split) will drop empty lines from the result.
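      On the empty-line concern: split discards trailing empty fields unless it is given a negative limit. The following is a minimal sketch of a splitter that keeps empty lines, assuming the whole file is already in a string. The name split_lines and the sample text are made up for illustration, and this variant was not part of the benchmark above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split text with mixed line endings into lines,
# keeping empty lines wherever they occur.
sub split_lines {
    my ($text) = @_;
    # \r\n comes first so a Windows ending is not seen as two separators.
    # The limit -1 stops split from discarding trailing empty fields.
    my @lines = split /\r\n|\n|\r/, $text, -1;
    # A terminator at the very end leaves one extra empty field; drop it.
    pop @lines if @lines && $lines[-1] eq '';
    return @lines;
}

# Six terminators, so six lines; three of them are empty.
my @lines = split_lines("Windows\r\n\r\nUnix\n\nMac\r\r");
printf "%d lines\n", scalar @lines;
print "[$_]\n" for @lines;
```

      Without the -1 limit, the same input would lose the empty line after "Mac", because split strips empty fields from the end of its result.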