Assuming, based upon the OPs description:
1 TB file containing approximately 100 billion 11-characters words evenly distributed. (On average, 1000 occurrences of each of 100 million unique words.)
and the following (relatively modest) hardware for the cluster/cloud hardware:
- File on raided drive array with 600MB/s sustained throughput.
- Commodity hardware with 16GB memory per machine.
- 1Gb/s network infrastructure.
Each of 32 readers processes a different 32GB chunk of the file. Assume the disk bandwidth is evenly allocated, each reader takes 32GB/(600/32)MB/s) = just 30 minutes to read its chunk. (Assume the extraction/hashing/accumulation of counts can be done in the time between reads.)
With 16GB of ram, each machine will be able to accumulate all 100 million key/count pairs, before it needs to flush its counts to the accumulators. Each key/count pair packet consists of 11-character word plus a (way oversized) 7 digit count plus a 2 character terminator, for a 20 bytes packet + 8-bytes packet overhead = 28 bytes. Each flush therefore requires 2.6GB of data packets be transmitted. Times 32 readers = 84GB total transmissions.
At 1Gb/s (call that 100MB/s for simplicity), gives 852 seconds or a little under 15 minutes of transmission time.
Job done is somewhat less than 1 hour.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP Neil Armstrong
|