Re: need for speed - how fast is a hash
by GrandFather (Saint) on Jan 07, 2011 at 01:18 UTC
Try it and see! You'll spend far more time agonising over optimisation than you are likely to save in execution time.
Once you have some code working, and if it proves too slow for your application, then come back and ask for a better way to achieve your purpose. This sort of optimisation problem is almost always highly dependent on the exact problem you are trying to solve.
True laziness is hard work
Re: need for speed - how fast is a hash
by BrowserUk (Patriarch) on Jan 07, 2011 at 01:22 UTC
Hashes are fast. To decide whether they are fast enough for your application, the easiest, if not the only, way is to try it.
If, once you've measured, you really need more speed, come back and ask again. Depending upon the nature of your keys there might be a faster mechanism, but be assured, it will involve considerably more work than using a hash. Once you know how fast the hash solution is, you'll be in a position to decide whether the extra work is worth the effort.
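For what it's worth, the core Benchmark module gives you a number with very little effort. A minimal sketch (the key format and counts below are placeholders, not taken from your data):
#!/usr/bin/perl
# Rough timing sketch using the core Benchmark module; the key format and
# counts are placeholders, substitute your own data.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @keys = map { "key$_" } 1 .. 100_000;
my %seen;
$seen{$_} = 1 for @keys;

cmpthese( -3, {                      # run each piece for ~3 CPU seconds
    insert => sub { my %h; $h{$_} = 1 for @keys },
    lookup => sub { my $hits = 0; exists $seen{$_} and $hits++ for @keys },
} );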
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: need for speed - how fast is a hash
by oko1 (Deacon) on Jan 07, 2011 at 01:55 UTC
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw/gettimeofday tv_interval/;

my (@data, %seen);
$data[$_] = int rand 100 for 0 .. 1e6 - 1;   # 1,000,000 pieces of data

my $t0 = [gettimeofday];
for (@data) {
    1 unless $seen{$_}++;   # perform a no-op the first time we see a given number
}
print "Elapsed time: ", tv_interval($t0, [gettimeofday]), " seconds\n";
Running this 1-million-item data list, with a hash lookup for each item, on my little laptop (Intel Atom, 1.6 GHz) gives the following:
Elapsed time: 1.466658 seconds
Reasonably fast, I'd say.
--
Education is not the filling of a pail, but the lighting of a fire.
-- W. B. Yeats
Re: need for speed - how fast is a hash
by davido (Cardinal) on Jan 07, 2011 at 08:38 UTC
Hash insertion and lookup operations are performed in O(1) time. By contrast, reading through your file once is O(n), assuming you're not sub-looping within it. In layman's terms, reading the file takes an amount of time proportional to the number of lines it contains, whereas inserting into a hash, or doing a hash lookup, takes an amount of time that stays practically constant no matter how many elements the hash already holds. Put another way: reading the file takes longer the bigger the file is, but hash inserts and lookups do not slow down appreciably as you add more and more elements.
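A crude illustration of that constant-time behaviour (the sizes below are arbitrary, and this is a rough timing, not a rigorous benchmark):
#!/usr/bin/perl
# Crude illustration: building the hash takes longer as the key count grows,
# but a fixed batch of lookups costs roughly the same whether the hash holds
# 1e4 or 1e6 entries. The sizes here are arbitrary assumptions.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

for my $size (1e4, 1e5, 1e6) {
    my %h;

    my $t0 = [gettimeofday];
    $h{$_} = 1 for 1 .. $size;          # n inserts, each O(1)
    my $build = tv_interval($t0);

    my $found = 0;
    $t0 = [gettimeofday];
    for (1 .. 100_000) {
        $found++ if exists $h{ int rand $size };   # fixed number of O(1) lookups
    }
    my $lookup = tv_interval($t0);

    printf "%8d keys: build %.3fs; 100_000 lookups %.3fs (%d hits)\n",
        $size, $build, $lookup, $found;
}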
Re: need for speed - how fast is a hash
by sundialsvc4 (Abbot) on Jan 07, 2011 at 02:02 UTC
Hashes are fast until they fill up.
It really depends on just how many rows there are. If we’re talking about “millions,” read on.
The fastest way (believe it or not...) to detect and to remove duplicates is to do exactly what people were doing with large-to-them datasets before computers existed. Namely, “sort the file,” using an on-disk sort. (Plenty of those in CPAN.)
When you know that a file has been sorted by its key value, then you know that all occurrences of any given key will be immediately adjacent to one another. And, if you encounter a gap in the sorted sequence, you know that key-values which would fall into that gap must not exist anywhere in the file. Furthermore, you can reach all of these conclusions without searching. If you need to merge two files, sort both of them in the same way. Having done this, the merge algorithm (for arbitrarily large files) becomes trivial. It requires almost no RAM at all.
Sorting happens to be one of the most heavily-studied of all computer algorithms. It happens much faster than you expect.
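To make the "adjacent duplicates" point concrete, here is a minimal single-pass sketch over a file that has already been sorted by its key (the file name and the comma-separated layout are assumptions):
#!/usr/bin/perl
# One sequential pass over an already-sorted file. Duplicate keys are
# adjacent, so a single scalar of state is all the memory needed.
# "sorted.txt" and the comma-separated layout are assumptions for this sketch.
use strict;
use warnings;

open my $fh, '<', 'sorted.txt' or die "Cannot open sorted.txt: $!";

my $prev;
while ( my $line = <$fh> ) {
    my ($key) = split /,/, $line;
    next if defined $prev && $key eq $prev;   # duplicate of the previous key
    $prev = $key;
    print $line;                              # first occurrence of this key
}

close $fh;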
# After writing the data to a file and then timing a
# ``system "sort -u nums.txt > /dev/null"''
Elapsed time (hash): 0.707023 seconds
Elapsed time (disk sort): 7.108102 seconds
--
Education is not the filling of a pail, but the lighting of a fire.
-- W. B. Yeats
It all has to do with the data volumes we are talking about, as I referred to in my original response.
A hash-table, or a list, is of course an “in-memory data structure.” And that means virtual memory, which is really not quite “memory” at all, but rather a very souped-up, hardware-accelerated, over-glorified disk file. This distinction is not particularly important until data volumes become large ... at which time the issue of “page faults” begins to rear its ugly head.
Virtual memory subsystems have one rather-strange characteristic: “they’re good, then they’re not-quite so good, then (boom!) they’re awful.” There is absolutely nothing in-between. If you plot a performance curve, you see a gradual mostly-linear progression, then (boom!) it “hits the wall” (the so-called thrash point) at which time the graph becomes exponential...
We are talkin’ Old Testament. I mean, real wrath of God stuff! 40 years of torment, earthquakes, tidal waves! The dead rising from the grave! World destruction, human sacrifices, Dogs and Cats Living Together....Mass Hysteria!!!
(Ahem...) I digress...
Disk-based sorting, on the other hand, with any current algorithm, is basically O(log_x n). Which basically means, “it’s never gonna be exponential, and that’s really all that matters.” No matter how many millions of records you have, it’s going to take a predictable (and reasonable, considering...) amount of time to sort them all. And, once you have finished sorting, it will take one, linear pass through the sorted file(s) to finish the entire job.
If you have a “modest” amount of data to deal with, this technique might make no difference, and it might even be slower. But if you have large amounts of data, it will make all the difference in the universe.
Well, that might be a bit of an exaggeration... but at least we won’t have Dogs and Cats Living Together. And that, surely, is a good thing ...
Memo to File: From now on, do not undertake to make comments on PerlMonks after having watched both installments of Ghostbusters back-to-back in a single evening ...
Hashes are fast until they fill up.
If a hash is filled up beyond a certain level, it is automatically extended to a bigger size. That doesn't make it slow (it just makes one growing operation slow).
If it doesn't fit into memory anymore, it becomes slower (of course), but then it's due to size, not due to filling up. All other data structures suffer from the same problem.
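And if the eventual size is known in advance, the hash can even be pre-sized so that growing operation never happens mid-run:
use strict;
use warnings;

# Pre-allocate buckets for roughly 1e6 keys (the count here is an assumption);
# assigning to keys() in lvalue context grows the bucket array up front,
# so the occasional rehash never happens while data is being loaded.
my %seen;
keys(%seen) = 1_000_000;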
The fastest way (believe it or not...) to detect and to remove duplicates is to do exactly what people were doing with large-to-them datasets before computers existed. Namely, “sort the file,” using an on-disk sort.
I'm curious: why should sorting a file on disc be faster than building a hash map on disc? Better optimization for sequential access comes to mind, but it's still O(N) for the hash vs. O(N log N) for the sort.
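For what it's worth, a hash map on disc is easy enough to try with a tied DBM file. A minimal sketch (DB_File needs Berkeley DB installed; the file name and the comma-separated input are made up):
#!/usr/bin/perl
# Sketch of a disk-backed hash via a tied DBM file. DB_File requires
# Berkeley DB; "seen.db" and the comma-separated input are assumptions.
# Each insert can touch a scattered page on disk, while an external sort
# does mostly sequential I/O -- one practical reason the sort can win.
use strict;
use warnings;
use Fcntl;
use DB_File;

tie my %seen, 'DB_File', 'seen.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie seen.db: $!";

while ( my $line = <STDIN> ) {
    my ($key) = split /,/, $line;
    print $line unless $seen{$key}++;   # act only on the first occurrence
}

untie %seen;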
Re: need for speed - how fast is a hash
by Anonyrnous Monk (Hermit) on Jan 07, 2011 at 01:21 UTC
Don't worry, hashes are among the fastest structures for such lookup purposes — provided you use them correctly, i.e. don't search/iterate through all the keys to check if an entry exists ;)
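In other words (a trivial sketch, the names are made up):
#!/usr/bin/perl
# The point above, spelled out: exists() is a single O(1) bucket lookup,
# while grep'ing over keys() walks the whole hash and defeats its purpose.
use strict;
use warnings;

my %seen = map { $_ => 1 } qw(foo bar baz);

print "right way\n" if exists $seen{foo};                # one lookup
print "wrong way\n" if grep { $_ eq 'foo' } keys %seen;  # scans every key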
Re: need for speed - how fast is a hash
by CountZero (Bishop) on Jan 07, 2011 at 17:08 UTC
and there may be multiple rows per key (although each row may well have different information for that key)
You cannot have "multiple rows" per key: that means your "key" is not a key, since it does not uniquely identify one record and one record only. It is also a strong indication that your data file is not normalized. That is not bad; it just makes life a lot more difficult. But your problem has, in essence, nothing to do with keys and database design. It is simply: upon the first occurrence of a certain value in a certain field, do something. Easy as pie!
use Modern::Perl;

my %keys;
while (<DATA>) {
    my ( $key, $data1, $data2 ) = split /,/;
    unless ( exists $keys{$key} ) {
        # do something
        say $key;    # just as an example
        $keys{$key} = 1;
    }
}
__DATA__
key1, abd, 123
key2, dhde, 9+6+
key3, dxdc, edazedaz
key1, dea, 564dz
key1, deksizi, 4833
key4, eoz, 852662
key2, dzdadazd, 9566
Output:
key1
key2
key3
key4
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James