Re: searching for unique numbers into a string

Re^2: searching for unique numbers into a string by almut (Canon) on Apr 06, 2009 at 15:44 UTC
Interestingly, when you benchmark it, the OP's method turns out to be slightly faster (~20% with Perl 5.8.8, ~40% with Perl 5.10.0) than List::MoreUtils' implementation, which is `sub uniq (@) { my %h; map { $h{$_}++ == 0 ? $_ : () } @_; }`. So, if the original order of values doesn't need to be maintained, it isn't such a bad choice after all, though BrowserUk's form would be somewhat more natural, IMHO (but not faster).
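To make the ordering point concrete, here is a minimal sketch contrasting an order-preserving uniq with a hash-keys variant. The sub names `uniq_ordered` and `uniq_unordered` are mine, and the hash-slice assignment is just one way to write the hash-keys idea, not necessarily the exact code anyone in this thread benchmarked.

```perl
use strict;
use warnings;

# Order-preserving: keep the first occurrence of each value.
sub uniq_ordered {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hash-keys variant: every input value becomes a key, so duplicates
# collapse, but keys() returns them in hash order, not input order.
sub uniq_unordered {
    my %h;
    @h{@_} = ();
    return keys %h;
}

my @input   = (3, 1, 3, 2, 1);
my @ordered = uniq_ordered(@input);

# Sorted here only so the printed result is stable between runs.
my @unordered = sort { $a <=> $b } uniq_unordered(@input);

print "ordered:   @ordered\n";
print "unordered: @unordered\n";
```

Both return the same set of values; only the first guarantees that they come back in order of first appearance.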
Re^3: searching for unique numbers into a string by GrandFather (Saint) on Apr 06, 2009 at 23:06 UTC
I'd like to see that benchmark. My version gives somewhat different results. It prints (neglecting the sanity check output):

              Rate   Grep    For UtilsH    Map UtilsM
    Grep    5068/s     --    -0%    -6%   -22%   -63%
    For     5068/s     0%     --    -6%   -22%   -63%
    UtilsH  5410/s     7%     7%     --   -17%   -61%
    Map     6502/s    28%    28%    20%     --   -53%
    UtilsM 13863/s   174%   174%   156%   113%     --

Update: UtilsM uses List::MoreUtils. UtilsH is the uniq code implemented in the same context as the other benchmark tests.

True laziness is hard work
Re^4: searching for unique numbers into a string by almut (Canon) on Apr 07, 2009 at 02:02 UTC
OK, a couple of errors on my part... (mea culpa) However, as it looks after more judicious investigation, the results are highly data dependent. So what did I do? First, the code (cleaned up, and with GrandFather's `@values` added):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use List::MoreUtils;

    my @data;   # AB
    for ( 1 .. 1e4 ) {
        push @data, int( rand 1e6 );
    }

    my @lines;  # BU
    for ( 1 .. 1e3 ) {
        my $line = int( rand 1e6 );
        $line .= chr(9) . int( rand 1e6 ) while length( $line ) < 4096;
        push @lines, $line;
    }

    my @values = map {int rand 10} 1 .. 1000;  # GF

    my $data;
    $data = \@data;
    #$data = \@lines;
    #$data = \@values;

    sub uniq1 {  # copied from List::MoreUtils
        my %h;
        map { $h{$_}++ == 0 ? $_ : () } @_;
    }

    sub uniq2 {
        my %h;
        grep { $h{$_}++ == 0 } @_;
    }

    sub uniq3 {  # OP
        my %h;
        grep {$h{$_} = undef} @_;
        keys %h;
    }

    sub uniq4 {  # BrowserUk
        my %h;
        undef @h{ @_ };
        keys %h;
    }

    cmpthese(-1, {
        'uniqM' => sub { my @uniq = List::MoreUtils::uniq(@$data) },
        'uniq1' => sub { my @uniq = uniq1(@$data) },
        'uniq2' => sub { my @uniq = uniq2(@$data) },
        'uniq3' => sub { my @uniq = uniq3(@$data) },
        'uniq4' => sub { my @uniq = uniq4(@$data) },
    });

I first started with my input data ("AB", an adapted/simplified version of BrowserUk's random input generator), and got the following results:

            Rate uniq1 uniqM uniq2 uniq3 uniq4
    uniq1 35.2/s    --   -1%   -5%  -15%  -21%
    uniqM 35.5/s    1%    --   -4%  -14%  -20%
    uniq2 36.9/s    5%    4%    --  -11%  -17%
    uniq3 41.5/s   18%   17%   13%    --   -7%
    uniq4 44.7/s   27%   26%   21%    8%    --

From this I had concluded (prematurely) that there is virtually no difference between "uniq1" and "uniqM" (the XS implementation), so I commented out the latter benchmark (my error 1).
Then, after having played around a bit, I had settled on the following results (which is where the reported ~40% for Perl 5.10.0 came from):

            Rate uniq2 uniq1 uniq3 uniq4
    uniq2 34.2/s    --   -4%  -30%  -30%
    uniq1 35.5/s    4%    --  -28%  -28%
    uniq3 49.1/s   43%   38%    --    0%
    uniq4 49.1/s   43%   38%    0%    --

The thing I had overlooked (error 2) is that my `$data` reference was still pointing at BrowserUk's data ("BU"), which I had been playing around with in between. So those results are in fact for rather unusual input, i.e. 1000 values of around 4K each... The full set with the BU data is, btw:

            Rate uniqM uniq2 uniq1 uniq3 uniq4
    uniqM 24.8/s    --  -28%  -30%  -50%  -50%
    uniq2 34.2/s   38%    --   -4%  -31%  -31%
    uniq1 35.5/s   43%    4%    --  -28%  -28%
    uniq3 49.5/s  100%   45%   39%    --    0%
    uniq4 49.5/s  100%   45%   39%    0%    --

which shows that, for large strings (probably all of them being unique), the XS variant is clearly the slowest (!)

With GrandFather's input data, OTOH, I do get results similar to his:

             Rate uniq1 uniq2 uniq3 uniq4 uniqM
    uniq1  3445/s    --  -18%  -27%  -77%  -80%
    uniq2  4213/s   22%    --  -10%  -72%  -75%
    uniq3  4696/s   36%   11%    --  -69%  -72%
    uniq4 15175/s  340%  260%  223%    --  -10%
    uniqM 16905/s  391%  301%  260%   11%    --

Overall, BrowserUk's `uniq()` seems to be the winner. In other words, the findings essentially remain the same with my original data (which isn't all that unrealistic). But there is huge variation depending on the type of input.

Moral of the story: thou shalt not be lazy and withhold your benchmark code (telling myself) ;(
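One plausible factor behind the variation (my assumption, not something measured above) is how many duplicates each input contains. A rough sketch that reuses the "AB" and "GF" generators from the benchmark and reports the fraction of distinct values; the helper name `distinct_ratio` is hypothetical:

```perl
use strict;
use warnings;

# Same generators as in the benchmark code; the variable names are mine.
my @ab = map { int rand 1e6 } 1 .. 1e4;    # "AB": values drawn from 1e6 possibilities
my @gf = map { int rand 10 }  1 .. 1000;   # "GF": values drawn from only 10 possibilities

# Fraction of the input that survives de-duplication (1.0 = all unique).
sub distinct_ratio {
    my %h;
    @h{@_} = ();
    my $distinct = keys %h;        # keys in scalar context: number of keys
    return $distinct / scalar @_;
}

printf "AB: %.3f\n", distinct_ratio(@ab);
printf "GF: %.3f\n", distinct_ratio(@gf);
```

With "GF" almost every element is a duplicate, while "AB" is close to all-unique, so the two inputs exercise the hash insert and the result-list building in very different proportions.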
Re^2: searching for unique numbers into a string by jwkrahn (Abbot) on Apr 06, 2009 at 15:52 UTC
"'grep' builds an array"

Wrong. grep builds a list. Perhaps you should read "What is the difference between a list and an array?" in perlfaq4.
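A small sketch of one place where the distinction is visible: scalar context. The sub names `as_array` and `as_list` are made up for illustration.

```perl
use strict;
use warnings;

# Returns an array: in scalar context an array yields its element count.
sub as_array { my @a = (4, 5, 6); return @a }

# Returns a comma-separated list: in scalar context the comma operator
# evaluates each item and yields the last one.
sub as_list { return (4, 5, 6) }

my $from_array = as_array();
my $from_list  = as_list();

# grep itself, called in scalar context, returns the number of matches.
my $matches = grep { $_ > 4 } (4, 5, 6);

print "array in scalar context: $from_array\n";
print "list in scalar context:  $from_list\n";
print "grep in scalar context:  $matches\n";
```

An array is a variable that can be modified (push, pop, splice); a list is a transient sequence of values, which is what grep returns in list context.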
Re^3: searching for unique numbers into a string by Marshall (Canon) on Apr 07, 2009 at 17:14 UTC
Well, this whole array vs. list thing is filled with controversy. Tom Christiansen says things like "@xyz is an array variable that defines a list", your FAQ reference notwithstanding. I mean, look at Tom's writings on the subject, like this one: http://www.perl.com/doc/manual/html/pod/perllol.html, or some of his books.

I personally think this gets into what I would call "language lawyering": fine parsing of the terminology, to no real benefit. I like the way Tom does it: he introduces the term array and then quickly moves to calling all of these Perl equivalents of "arrays in other languages" lists. That a list is described by an array variable type is not that important a distinction to me. When we get into more complex Perl structures like LoL (List of Lists), LoH (List of Hashes), or LoLoL (List of Lists of Lists), my opinion is that these names are MUCH more descriptive than the alternatives.

I guess part of this has to do with one's programming background. In the C world, a "traditional" 2-D or higher-order C array is a pretty worthless data structure for most jobs. There are lots of problems with it; just one is that you have to pass around both dimensions, which makes it very hard to write general-purpose matrix routines. Also, I don't know of any traditional 2-D arrays used in the Unix O/S. Maybe there are some, I just don't know where they are.

Starting with intermediate C, "traditional" 2-D C arrays go the way of the dodo bird. The way to build a practical 2-D structure in C, say of ints, is an `int **` (an array of pointers to arrays of ints). This is very close to what a Perl LoL is! In C, this is also a 2-D array, but a special kind of 2-D array. In Perl, calling this a LoL, List of Lists (or more specifically, a List of references to Lists), is much more descriptive of what is really going on! A main point with a LoL is that everything is a pointer until you get to the final dimension.

A "traditional" array has a fixed memory layout and fixed dimensions. That is not what a Perl list is! Any Perl list that has a name can be "grown", even one that is initialized with X elements at the beginning of the program.

I'm sure this post will generate some controversy. Maybe sometimes we get too caught up in yelling about terminology? I like the terms LoL, etc. If somebody wants to call this AoA, I'm not that bent out of shape about it. I think LoL is better, but this is not the "end of the world".
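For anyone who hasn't built one, here is a minimal sketch of the structure described above: an outer array holding references to inner arrays, with both levels growable. The variable name `@lol` and the sample values are only for illustration.

```perl
use strict;
use warnings;

# The outer array holds references to inner arrays
# (the Perl counterpart of the C "int **" layout).
my @lol = ( [ 1, 2, 3 ], [ 4, 5 ] );

push @{ $lol[1] }, 6;    # grow an inner row after initialization
push @lol, [ 7 ];        # grow the outer level; rows may differ in length

printf "%d rows\n", scalar @lol;
for my $i ( 0 .. $#lol ) {
    printf "row %d: %s\n", $i, join( ' ', @{ $lol[$i] } );
}
printf "element [1][2] = %d\n", $lol[1][2];
```

No dimensions have to be passed around: each level knows its own length, so `scalar @lol` and `scalar @{ $lol[$i] }` are always available.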