The "failing to reset $i" bug is duly noted. Moving "my $i=0" inside the "while (<FH>)" loop simply slows each of those cases down, leaving me with these results:

                       s/iter raw_slurp_substr nonraw_sysread_onechar raw_sysread_onechar getc slurp_substr nonraw_sysread_buffer slurp_regex raw_sysread_buffer
raw_slurp_substr         3.68               --                   -16%                -17% -34%         -53%                  -57%        -60%               -62%
nonraw_sysread_onechar   3.08              19%                     --                 -1% -22%         -45%                  -49%        -53%               -54%
raw_sysread_onechar      3.06              20%                     1%                  -- -21%         -44%                  -48%        -52%               -54%
getc                     2.42              52%                    27%                 26%   --         -29%                  -35%        -40%               -42%
slurp_substr             1.71             115%                    80%                 79%  41%           --                   -8%        -15%               -18%
nonraw_sysread_buffer    1.58             133%                    95%                 93%  53%           8%                    --         -8%               -11%
slurp_regex              1.46             152%                   111%                110%  66%          17%                    8%          --                -4%
raw_sysread_buffer       1.41             161%                   119%                117%  72%          22%                   12%          4%                 --

                      Rate raw_sysread_buffer slurp_length slurp_simpleregex
raw_sysread_buffer 0.706/s                 --         -98%              -98%
slurp_length        31.2/s              4328%           --               -9%
slurp_simpleregex   34.5/s              4786%          10%                --

But I'm not sure how this is apples and oranges. There are two benchmark tests here:

In the first, after the bug fix, the raw/sysread buffer approach works best, but it is only about 10% better than just slurping in the contents with <FH>, and the raw/sysread_onechar approach is actually worse than getc. In general, your final result shows improvement, but it's not as fantastic as I'd hoped. Perhaps this is a function of the OS in use: such dramatic differences between your results and mine suggest that you may be using Windows (I'm on Linux), and that getc may be really terrible on Windows. Is that right? I'd be interested in seeing the results you get when you run the same benchmark (after fixing the $i bug you mention).
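For anyone following along, here is roughly the shape of the raw/sysread buffer approach as I understand it. This is only a sketch, not the benchmarked code: the sub name each_byte_sysread, the 64 KB default chunk size, and the temp file are all made up for illustration. Note that the index is created fresh for every chunk, which is the $i fix mentioned above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the benchmark): read $path in fixed-size
# chunks via sysread on a :raw handle and call $visit once per byte.
sub each_byte_sysread {
    my ($path, $visit, $chunk) = @_;
    $chunk ||= 64 * 1024;    # chunk size is an arbitrary assumption

    open my $fh, '<:raw', $path or die "open $path: $!";
    my $total = 0;
    while (1) {
        my $buf = '';
        my $got = sysread($fh, $buf, $chunk);
        die "sysread $path: $!" unless defined $got;
        last if $got == 0;    # end of file

        # The loop index starts from 0 again for every chunk.
        for my $i (0 .. $got - 1) {
            $visit->(substr($buf, $i, 1));
        }
        $total += $got;
    }
    close $fh;
    return $total;
}

# Small demonstration on a throwaway file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "hello\nworld\n";
close $tmp_fh;

my $seen  = '';
my $total = each_byte_sysread($tmp_path, sub { $seen .= $_[0] });
print "read $total bytes\n";
```

The per-byte callback is there only to keep the sketch short; in a real benchmark the work would presumably be inlined in the loop, since a sub call per character adds its own overhead.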

In the second benchmark, my point is a bit more interesting (to me) than simply saying that Perl is slower than C. I've reposted, comparing my two "fast" approaches to raw_sysread_buffer. The point of &slurp_length and &slurp_simpleregex is to show that the thing that makes &raw_sysread_buffer (and the others) so remarkably slow is not the actual act of reading from disk, but the act of accessing the values one at a time. For example, the regex test simply aims to show that I must have read the entire block from disk, since I got the last character. In these tests, I don't mean to say that Perl is slower than C; I'm saying that Perl (as I'm using it) is unbearably slower than I expected.
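To make the contrast concrete, here is a minimal sketch of the two kinds of work. It is not the benchmarked code: the sub names and the in-memory $data string are stand-ins I've made up, and the real tests read from a file. The first two subs each touch the whole buffer with a single built-in operation, in the spirit of &slurp_length and &slurp_simpleregex; the third performs one Perl-level operation per character.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the file contents (100,000 characters).
my $data = join '', map { chr(97 + $_ % 26) } 0 .. 99_999;

# One built-in call over the whole buffer.
sub whole_length {
    my ($buf) = @_;
    return length $buf;
}

# One regex anchored at the very end of the buffer: a match shows that
# the final character really is available.
sub last_char {
    my ($buf) = @_;
    return $buf =~ /(.)\z/s ? $1 : undef;
}

# One substr call per character: same data, but the interpreter runs
# the loop body once for every character in the buffer.
sub count_per_char {
    my ($buf) = @_;
    my $n = 0;
    for my $i (0 .. length($buf) - 1) {
        my $c = substr($buf, $i, 1);
        $n++;
    }
    return $n;
}

print join(' ', whole_length($data), last_char($data), count_per_char($data)), "\n";
```

All three see the same data; the only difference is how many Perl operations it takes to get through it.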

This amazing slowness in this one application is surprising to me, because I've generally found Perl to be pretty darned fast. This has been especially true in dealing with text (i.e. regex).

A couple small notes:

I have no interest in writing this in C. Perl is my preferred language, and it was my intention to show the C-lovers I work with that Perl is a perfectly good tool for this sort of task. I'm having a (much) harder time proving that than I'd hoped I would. Perhaps I'm wrong :(

I also have no intention of flaming you with my response. It's clear to me that you've taken a good deal of time to think about my problem, and I'm most appreciative of that time. The intention of my response is simply to show that my benefits don't match your expectations, and to see if you can suggest another approach.

In reply to Re: Re: Re: character-by-character in a huge file by mushnik
in thread character-by-character in a huge file by mushnik
