Re: need for speed - how fast is a hash
by GrandFather (Saint) on Jan 07, 2011 at 01:18 UTC
Try it and see! You'll spend far more time agonising over optimisation than you are likely to save in execution time.
Once you have some code working, and if it proves too slow for your application, then come back and ask for a better way to achieve your purpose. This sort of optimisation problem is almost always highly dependent on the exact problem you are trying to solve.
True laziness is hard work
Re: need for speed - how fast is a hash
by BrowserUk (Patriarch) on Jan 07, 2011 at 01:22 UTC
Hashes are fast. To decide whether they are fast enough for your application, the easiest, if not the only, way is to try it.
If, once you've measured, you really need more speed, come back and ask again. Depending upon the nature of your keys there might be a faster mechanism, but be assured, it will involve considerably more work than using a hash. Once you know how fast the hash solution is, you'll be in a position to decide whether the extra work is worth the effort.
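For what it's worth, the core Benchmark module gives you a number with very little effort. A minimal sketch (the key format and counts below are placeholders, not taken from your data):
#!/usr/bin/perl
# Rough timing sketch using the core Benchmark module; the key format and
# counts are placeholders, substitute your own data.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @keys = map { "key$_" } 1 .. 100_000;
my %seen;
$seen{$_} = 1 for @keys;

cmpthese( -3, {                      # run each piece for ~3 CPU seconds
    insert => sub { my %h; $h{$_} = 1 for @keys },
    lookup => sub { my $hits = 0; exists $seen{$_} and $hits++ for @keys },
} );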
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: need for speed - how fast is a hash
by oko1 (Deacon) on Jan 07, 2011 at 01:55 UTC
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw/gettimeofday tv_interval/;

my (@data, %seen);
$data[$_] = int rand 100 for 0 .. 1e6 - 1;   # 1,000,000 pieces of data

my $t0 = [gettimeofday];
for (@data) {
    1 unless $seen{$_}++;   # perform a no-op the first time we see a given number
}
print "Elapsed time: ", tv_interval($t0, [gettimeofday]), " seconds\n";
Running this 1-million-item data list, with a hash lookup for each item, on my little laptop (Intel Atom, 1.6 GHz) gives the following:
Elapsed time: 1.466658 seconds
Reasonably fast, I'd say.
--
Education is not the filling of a pail, but the lighting of a fire.
-- W. B. Yeats
Re: need for speed - how fast is a hash
by davido (Cardinal) on Jan 07, 2011 at 08:38 UTC
Hash insertion and lookup operations are performed in O(1) time. By contrast, reading through your file once is O(n), assuming you're not sub-looping within it. In layman's terms, reading the file takes an amount of time proportional to the number of lines it contains, whereas inserting into a hash, or doing a hash lookup, takes an amount of time that stays practically constant no matter how many elements the hash already holds. Put another way: reading the file takes longer the bigger the file is, but hash inserts and lookups do not slow down appreciably as you add more and more elements.
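A crude illustration of that constant-time behaviour (the sizes below are arbitrary, and this is a rough timing, not a rigorous benchmark):
#!/usr/bin/perl
# Crude illustration: building the hash takes longer as the key count grows,
# but a fixed batch of lookups costs roughly the same whether the hash holds
# 1e4 or 1e6 entries. The sizes here are arbitrary assumptions.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

for my $size (1e4, 1e5, 1e6) {
    my %h;

    my $t0 = [gettimeofday];
    $h{$_} = 1 for 1 .. $size;          # n inserts, each O(1)
    my $build = tv_interval($t0);

    my $found = 0;
    $t0 = [gettimeofday];
    for (1 .. 100_000) {
        $found++ if exists $h{ int rand $size };   # fixed number of O(1) lookups
    }
    my $lookup = tv_interval($t0);

    printf "%8d keys: build %.3fs; 100_000 lookups %.3fs (%d hits)\n",
        $size, $build, $lookup, $found;
}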
Re: need for speed - how fast is a hash
by sundialsvc4 (Abbot) on Jan 07, 2011 at 02:02 UTC
Hashes are fast until they fill up.
It really depends on just how many rows there are. If we’re talking about “millions,” read on.
The fastest way (believe it or not...) to detect and to remove duplicates is to do exactly what people were doing with large-to-them datasets before computers existed. Namely, “sort the file,” using an on-disk sort. (Plenty of those in CPAN.)
When you know that a file has been sorted by its key value, then you know that all occurrences of any given key will be immediately adjacent to one another. And, if you encounter a gap in the sorted sequence, you know that key-values which would fall into that gap must not exist anywhere in the file. Furthermore, you can reach all of these conclusions without searching. If you need to merge two files, sort both of them in the same way. Having done this, the merge algorithm (for arbitrarily large files) becomes trivial. It requires almost no RAM at all.
Sorting happens to be one of the most heavily-studied of all computer algorithms. It happens much faster than you expect.
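To make the "adjacent duplicates" point concrete, here is a minimal single-pass sketch over a file that has already been sorted by its key (the file name and the comma-separated layout are assumptions):
#!/usr/bin/perl
# One sequential pass over an already-sorted file. Duplicate keys are
# adjacent, so a single scalar of state is all the memory needed.
# "sorted.txt" and the comma-separated layout are assumptions for this sketch.
use strict;
use warnings;

open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";

my $prev;
while ( my $line = <$fh> ) {
    my ($key) = split /,/, $line;
    next if defined $prev && $key eq $prev;   # duplicate of the previous key
    $prev = $key;
    print $line;                              # first occurrence of this key
}

close $fh;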
# After writing the data to a file and then timing a
# ``system "sort -u nums.txt > /dev/null"''
Elapsed time (hash): 0.707023 seconds
Elapsed time (disk sort): 7.108102 seconds
--
Education is not the filling of a pail, but the lighting of a fire.
-- W. B. Yeats
It all has to do with the data volumes we are talking about, as I referred to in my original response.
A hash-table, or a list, is of course an “in-memory data structure.” And that means virtual memory, which is really not quite “memory” at all, but rather a very souped-up, hardware-accelerated, over-glorified disk file. This distinction is not particularly important until data volumes become large ... at which time the issue of “page faults” begins to rear its ugly head.
Virtual memory subsystems have one rather-strange characteristic: “they’re good, then they’re not-quite so good, then (boom!) they’re awful.” There is absolutely nothing in-between. If you plot a performance curve, you see a gradual mostly-linear progression, then (boom!) it “hits the wall” (the so-called thrash point) at which time the graph becomes exponential...
We are talkin’ Old Testament. I mean, real wrath of God stuff! 40 years of torment, earthquakes, tidal waves! The dead rising from the grave! World destruction, human sacrifices, Dogs and Cats Living Together....Mass Hysteria!!!
(Ahem...) I digress...
Disk-based sorting, on the other hand, with any current algorithm, is basically O(log_x n). Which basically means, “it’s never gonna be exponential, and that’s really all that matters.” No matter how many millions of records you have, it’s going to take a predictable (and reasonable, considering...) amount of time to sort them all. And, once you have finished sorting, it will take one, linear pass through the sorted file(s) to finish the entire job.
If you have a “modest” amount of data to deal with, this technique might make no difference, and it might even be slower. But if you have large amounts of data, it will make all the difference in the universe.
Well, that might be a bit of an exaggeration... but at least we won’t have Dogs and Cats Living Together. And that, surely, is a good thing ...
Memo to File: From now on, do not undertake to make comments on PerlMonks after having watched both installments of Ghostbusters back-to-back in a single evening ...
Hashes are fast until they fill up.
If a hash is filled up beyond a certain level, it is automatically extended to a bigger size. That doesn't make it slow (it just makes one growing operation slow).
If it doesn't fit into memory anymore, it becomes slower (of course), but then it's due to size, not due to filling up. All other data structures suffer from the same problem.
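And if the eventual size is known in advance, the hash can even be pre-sized so that growing operation never happens mid-run:
use strict;
use warnings;

# Pre-allocate buckets for roughly 1e6 keys (the count here is an assumption);
# assigning to keys() in lvalue context grows the bucket array up front,
# so the occasional rehash never happens while data is being loaded.
my %seen;
keys(%seen) = 1_000_000;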
The fastest way (believe it or not...) to detect and to remove duplicates is to do exactly what people were doing with large-to-them datasets before computers existed. Namely, “sort the file,” using an on-disk sort.
I'm curious: why should sorting a file on disc be faster than building a hash map on disc? Better optimization for sequential access comes to mind, but it's still O(N) for the hash vs. O(N log N) for the sort.
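For what it's worth, a hash map on disc is easy enough to try with a tied DBM file. A minimal sketch (DB_File needs Berkeley DB installed; the file name and the comma-separated input are made up):
#!/usr/bin/perl
# Sketch of a disk-backed hash via a tied DBM file. DB_File requires
# Berkeley DB; "seen.db" and the comma-separated input are assumptions.
# Each insert can touch a scattered page on disk, while an external sort
# does mostly sequential I/O -- one practical reason the sort can win.
use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie seen.db: $!";

while ( my $line = <STDIN> ) {
    my ($key) = split /,/, $line;
    print $line unless $seen{$key}++;   # act only on the first occurrence
}

untie %seen;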
Re: need for speed - how fast is a hash
by Anonyrnous Monk (Hermit) on Jan 07, 2011 at 01:21 UTC
Don't worry, hashes are among the fastest structures for such lookup purposes — provided you use them correctly, i.e. don't search/iterate through all the keys to check if an entry exists ;)
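In other words (a trivial sketch, the names are made up):
#!/usr/bin/perl
# The point above, spelled out: exists() is a single O(1) bucket lookup,
# while grep'ing over keys() walks the whole hash and defeats its purpose.
use strict;
use warnings;

my %seen = map { $_ => 1 } qw(foo bar baz);

print "right way\n" if exists $seen{foo};                # one lookup
print "wrong way\n" if grep { $_ eq 'foo' } keys %seen;  # scans every key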
Re: need for speed - how fast is a hash
by CountZero (Bishop) on Jan 07, 2011 at 17:08 UTC
and there may be multiple rows per key (although each row may well have different information for that key)
You cannot have "multiple rows" per key: that means your "key" is not a key, since it does not uniquely identify one record and one record only. It is also a strong indication that your data file is not normalized. That is not bad; it just makes life a lot more difficult. But your problem has, in essence, nothing to do with keys and database design. It is simply: upon the first occurrence of a certain value in a certain field, do something. Easy as pie!
use Modern::Perl;

my %keys;
while (<DATA>) {
    my ( $key, $data1, $data2 ) = split /,/;
    unless ( exists $keys{$key} ) {
        # do something
        say $key;    # just as an example
        $keys{$key} = 1;
    }
}
__DATA__
key1, abd, 123
key2, dhde, 9+6+
key3, dxdc, edazedaz
key1, dea, 564dz
key1, deksizi, 4833
key4, eoz, 852662
key2, dzdadazd, 9566
Output:
key1
key2
key3
key4
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James