marcelpreda has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have about 3 000 000 of text lines.
Each line contains one or more numbers separated by TAB.
I want to get only the unique numbers from each line.
Something like:
12123_TAB_456_TAB_455667
'Till noweverything was fine with a code like:
LOOP STARTS HERE my %unique =(); grep {$unique{$_} = undef} split(/\t/, $myLine); $myLine = join("\t", keys %unique); LOOP ENDS HERE
Now it seems to be to slow, the lines length have increased.
There are lines of ~4kB .
Is there any other (fast) way to do it?

Best Regards,
Marcel

Replies are listed 'Best First'.
Re: searching for unique numbers into a string
by BrowserUk (Patriarch) on Apr 06, 2009 at 14:51 UTC

    How slow? The following shows that it takes an average of 0.7/1000th of a second to dedup a 4k line:

    #! perl -slw use 5.010; use strict; use Time::HiRes qw[ time ];; sub uniq{ my %uniq; undef @uniq{ @_ }; keys %uniq } my @lines; for ( 1 .. 1e3 ) { my $line = int( rand 1e6 ); $line .= chr(9) . int( rand 1e6 ) while length( $line ) < 4096; push @lines, $line; } my $start = time; $_ = join chr(9), uniq( split chr(9), $_ ) for @lines; my $stop = time; printf "On random 4k lines, uniquing averaged %.6f seconds/line\n", ($stop - $start) / 1e3; __END__ c:\test>junk On random 4k lines, uniquing averaged 0.000695 seconds/line c:\test>junk On random 4k lines, uniquing averaged 0.000696 seconds/line c:\test>junk On random 4k lines, uniquing averaged 0.000712 seconds/line c:\test>junk On random 4k lines, uniquing averaged 0.000700 seconds/line

    Which means processing your 3e6 lines should take around 35 minutes (plus a bit for the file processing).

    So how fast do you want it to run? Or more likely, what are you doing in your full code that is slowing things down?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: searching for unique numbers into a string
by roubi (Hermit) on Apr 06, 2009 at 14:41 UTC
    'grep' builds an array that you then discard. Maybe this would be faster:
    use List::MoreUtils qw(uniq); my @unique = uniq split(/\t/, $myLine); $myLine = join("\t",@unique);

      Interestingly, when you benchmark it, the OP's method turns out to be slightly faster (~20% with Perl 5.8.8, ~40% with Perl 5.10.0) than List::MoreUtils' implementation, which is

      sub uniq (@) { my %h; map { $h{$_}++ == 0 ? $_ : () } @_; }

      So, if the original order of values doesn't need to be maintained, it isn't such a bad choice, after all — though BrowserUk's form would be somewhat more natural, IMHO (but not faster).

        I'd like to see that benchmark. My version gives somewhat different results:

        Prints (neglecting the sanity check output):

        Rate Grep For UtilsH Map UtilsM Grep 5068/s -- -0% -6% -22% -63% For 5068/s 0% -- -6% -22% -63% UtilsH 5410/s 7% 7% -- -17% -61% Map 6502/s 28% 28% 20% -- -53% UtilsM 13863/s 174% 174% 156% 113% --

        Update: UtilsM uses List::MoreUtils. UtilsH is the uniq code implemented in the same context as the other benchmark tests.


        True laziness is hard work
        Well this whole array vs list thing is filled with controversy. Tom Christiansen says things like @xyz is an array variable that defines a list, your faq reference not withstanding. I mean look at Tom's writings on the subject like this one: http://www.perl.com/doc/manual/html/pod/perllol.html. Or some of his books.

        I personally think this gets into what I would call "language lawyering" and fine parsing of the terminology and to no real benefit. I personally like the way Tom does it by introducing the term array and then quickly moving to calling all of these Perl equivalent things to "arrays in other langugages", lists. That a list is described by an array variable type, is not that an important distinction to me.

        When we get into more complex Perl structures like LoL (List of Lists), LoH (Lists of Hash), LoLoL (List of Lists of Lists), my opinion is that these are MUCH more descriptive than other types of terms. I guess part of this has to do with what somebody's programming background is.

        In the C world, a "traditional 2- D" or higher order C array is a pretty worthless data structure for most jobs. There are lots of problems with this, just one thing is that you have to pass around both dimensions which makes it very hard to write general purpose matrix routines. Also for example, I don't know of any traditional 2-D arrays used in the Unix O/S. Maybe there are some, I just don't know where they are.

        Starting with intermediate C, "traditional" 2-D C arrays go the way of the dodo bird. The way in C to build a practical 2D structure, say of ints is an **int (array of pointers to arrays of ints). This is very close to exactly what a Perl LoL is! In C, this is also a 2-D array, but it is a special kind of 2D array. In Perl, calling this a LoL, List of List (or more specifically List of references to Lists) is much more descriptive of what is really going on! A main point with a LoL is that everything is a pointer until you get to the final dimension. A "traditional" array has fixed memory layout and dimensions. That is not what a Perl list is!

        Any Perl list that has a name can be "grown". Even ones that are initialized with X number of elemements at the beginning of the program.

        I'm sure this post will generate some controversy. Maybe sometimes we get too caught up in yelling about terminology? I like the terms LoL, etc. If somebody wants to call this AoA, I'm not that bent out of shape about it. I think LoL is better, but this is not the "end of the world".

Re: searching for unique numbers into a string
by kennethk (Abbot) on Apr 06, 2009 at 14:48 UTC
    If you look at List::MoreUtils, there is a uniq method that implements what you want. The only fundamental difference between what you are doing and what it does is by using your grep in void context and then calling keys, you are performing the same operation twice.
Re: searching for unique numbers into a string
by targetsmart (Curate) on Apr 06, 2009 at 14:48 UTC
    just a guess
    what happens to $myLine after building it?, 3000000 X 4KB is huge, if you are using that as a plain hash key, your memory may be eaten completely, check another way of solving it.
    by building that unique number what you want to achieve?

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
      I think these folks are on the right track. The hash is only built for the line being analyzed. Evidently the number of items on a line is small compared with the 3 million line limit.

      As one hint for HUGE hashes (this situation doesn't qualify from what I understand), but the default hash starts with 8 key "slots". As the hash gets larger, 8,16,32,64,128,etc, all hash keys in the hash have to be re-calculated. If you start getting into hash sizes of like 100,000 keys, this doubling process can cost. It is possible to start a hash with a larger number of buckets than the default of 8, by assigning a scalar value to keys, like keys %hash=2**16 or whatever. If the hash is less than 1024 total key space, this usually doesn't make much difference. Good values for number of available keys is 1/2 expected total number of items in hash (a place to experiment from).