in reply to Re^2: to thread or fork or ?
in thread to thread or fork or ?
An architecture:
Split your bigfile into N equal-sized byte ranges, one per machine.
Have a process on each machine that processes a filesize/N chunk of the bigfile. (Say, 32 machines each reading a different 32GB chunk of your 1TB file.)
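To make the arithmetic concrete, here is a minimal sketch of how each reader might work out its own byte range. The name `chunk_range` and the 0-based reader index are my assumptions for illustration, not something the scheme dictates.

```perl
use strict;
use warnings;

# Hypothetical helper: return the [start, end) byte range that reader $i
# (0-based) out of $n readers should process.
sub chunk_range {
    my ($filesize, $n, $i) = @_;
    my $chunk = int($filesize / $n);
    my $start = $i * $chunk;
    # The last reader also picks up any remainder bytes.
    my $end = $i == $n - 1 ? $filesize : $start + $chunk;
    return ($start, $end);
}

# 1TB across 32 machines: each reader gets a 32GB range.
my ($start, $end) = chunk_range(2**40, 32, 0);
printf "reader 0: bytes %d .. %d\n", $start, $end;
```

A range boundary will usually land in the middle of a word, so in practice each reader would probably skip a partial word at its start offset (unless that offset is 0) and read past its end offset to finish the last word. That detail is left out of the sketch.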
Each reader accumulates word counts in a hash until the hash size approaches its memory limit.
(Assume roughly 1.5GB per 10 million words (keys) on a 64-bit Perl; somewhat less on a 32-bit.)
When that limit is reached, it posts (probably via UDP) the word/count pairs out to the appropriate accumulator machines, empties the hash, and continues reading the file from where it left off.
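The reader loop above might be sketched roughly like this. The names `accumulator_for` and `count_and_flush` are hypothetical, routing words by a byte checksum is just one assumed way of picking "the appropriate accumulator", and the actual network send is abstracted behind a callback so the sketch runs without any machines to talk to.

```perl
use strict;
use warnings;

# Hypothetical routing rule (an assumption): a 32-bit checksum of the word's
# bytes, modulo the number of accumulators, so a given word always goes to
# the same accumulator.
sub accumulator_for {
    my ($word, $n_acc) = @_;
    return unpack('%32C*', $word) % $n_acc;
}

# Count whitespace-separated words from $fh. Whenever the hash holds
# $max_keys keys, hand the pairs to $send->($acc_index, \%pairs), one call
# per accumulator, then empty the hash and carry on reading.
sub count_and_flush {
    my ($fh, $max_keys, $n_acc, $send) = @_;
    my %count;
    my $flush = sub {
        my @batch;    # one hashref of word => count per accumulator
        while (my ($w, $c) = each %count) {
            $batch[ accumulator_for($w, $n_acc) ]{$w} = $c;
        }
        for my $i (0 .. $#batch) {
            $send->($i, $batch[$i]) if $batch[$i];
        }
        %count = ();    # free the hash
    };
    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            $count{$word}++;
            $flush->() if keys(%count) >= $max_keys;
        }
    }
    $flush->() if %count;    # send whatever is left at end of chunk
}
```

In a real run `$send` would serialise the pairs and write them to a socket, for example one opened with `IO::Socket::INET` and `Proto => 'udp'`, and `$max_keys` would be sized from the memory estimate above. UDP does not guarantee delivery, so a lost datagram means lost counts unless something is layered on top to detect that.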
No threading, shared memory or locking required. Simple to set up and efficient to process.
Re^4: to thread or fork or ?
by locked_user sundialsvc4 (Abbot) on Oct 19, 2012 at 14:21 UTC
by BrowserUk (Patriarch) on Oct 19, 2012 at 15:54 UTC