Re^3: Google like scalability using perl?

You say that Disks are slow. But network IO is slower!

Your solution is to keep all the data in memory by distributing it across multiple machines, but for 130,000 bibliographies, a (these days quite modest) 2GB machine could handle 16KB per record. Even with structural overheads that is a fairly expansive bibliography. By moving to even the most basic 64-bit OS on commodity hardware, that can limit can be at least quadrupled.

Whilst you gain through paralellism of the multiple boxes, you loose through networking (IO and coordination). You say "This is where merge to central Redis from nodes comes into play, removing need for merge step all-together at expense of network traffic.", but one way or another, the results from multiple machines has to end up being gathered back together. And that means it has to transition the network and be written to disk.

These days, even commodity boxes come with 2 or 4 cores, with 6 or 8 becoming the norm at the next inventory upgrade cycle. If you applied the same 'remote processes' solution you are using with networked boxes within a fully-RAMed, multi-core box, the 'networking' overhead becomes little more than serialisation through shared memory buffers (plus some protocol overhead that you'd have anyway).

If you use threads instead of processes, each core can process over a segment of shared memory, avoiding all the networking overhead entirely. And by returning references to the shared memory rather than data, to the merging thread, all network IO is avoided completely.

And for scalability; with 64-bit OSs capable of handling at least 128GB of virtual memory, you'd have plenty of room for growth. Whilst random access to virtual memory greater than physical memory will 'thrash the cache', there are still huge gains to be had by avoiding the networking protocal overhead.

Please don't take this as a critisism of your sharing your current solution. That's all good. I just can't help think ing that for problems of this scale, and those an order of magnitude or two larger, it is a solution with a quite limited lifespan.

For a while now I've been looking at a solution of using virtualised memory (memory mapped files) to provide generalised, thread-shared, random read-write access to large datasets (up to ~128GB), and your problem is the first real-world task that has come up here. If you could share your dataset, or at least a reasonably detailed schema sufficient to allow the creation of a realistic testset to be generated, it would be extremely interesting (to me) to see how the two approaches to the problem compare. In terms of both complexity and performance.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

RIP PCW It is as I've been saying!(Audio until 20090817)

Comment on Re^3: Google like scalability using perl?

Replies are listed 'Best First'.

Re^4: Google like scalability using perl?
by dpavlin (Friar) on Oct 14, 2009 at 15:29 UTC

All your observations are totally correct, and I would suggest to everyone to re-read your reply before deciding on architecture for system like this one.

I had different end-goal: make map over perl structure scale across commodity low-end desktop machines. To be honest I was hoping to be able to do it on single machine, in-memory but it didn't work out for me.

Dataset isn't huge: 540 Mb with 121887 records with about 4.5K per record.

dpavlin@klin:~$ ls -alh /data/isi/full.txt
-rw-r--r-- 1 dpavlin dpavlin 546M 2009-10-05 14:15 /data/isi/full.txt
<code>

Structure is quote simple with repetable field names:

<code>
PT J
AU RABIN, DL
   KALIMO, E
   MABRY, JH
TI WORLD-HEALTH-ORGANIZATION INTERNATIONAL COLLABORATIVE STUDY OF
   MEDICAL-CARE UTILIZATION - SUMMARY OF METHODOLOGICAL STUDIES AND
   PRELIMINARY FINDINGS
SO SOCIAL SCIENCE & MEDICINE
LA English
DT Article
C1 GEORGETOWN UNIV,SCH MED,DEPT COMMUNITY MED & INT HLTH,WASHINGTON,DC
+ 20007.
   SOCIAL INSUR INST,RES INST SOCIAL SECUR,HELSINKI,FINLAND.
   UNIV VERMONT,COLL MED,DEPT COMMUNITY MED,BURLINGTON,VT.
CR 1970, WHO INT COLLABORATIV
   1972, MILBANK MEMORIAL F 2, V50
   BICE TW, 1969, CROSS NATIONAL MEASU
   BICE TW, 1971, SOC SCI MED, V5, P283
   BICE TW, 1972, MILBANK MEM FUND Q, V50, P57
   KALIMO E, 1970, BRIT J PREVENTIVE SO, V24, P229
   KALIMO E, 1972, 1 WHO ICS MCU OCC RE
   KALIMO E, 1972, MED CARE, V10, P95
   KOHN R, IN PRESS
   LOGAN RFL, 1972, MILBANK MEMORIAL F 2, V50, P45
   RABIN DL, 1972, MILBANK MEMORIAL F 2, V50, P19
   SCHACH E, 1972, MILBANK MEM FD Q, V50, P65
   VUKMANOVIC C, 1972, MILBANK MEMORIAL F 2, V50, P5
   WHITE KL, 1967, NEW ENGL J MED, V277, P516
   WHITE KL, 1969, INT COMPARISONS MEDI
   WHITE KL, 1972, MILBANK MEMORIAL F 2, V50, P31
NR 16
TC 2
PU PERGAMON-ELSEVIER SCIENCE LTD
PI OXFORD
PA THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD, ENGLAND OX5 1GB
SN 0277-9536
J9 SOC SCI MED
JI Soc. Sci. Med.
PY 1974
VL 8
IS 5
BP 255
EP 262
PG 8
SC Public, Environmental & Occupational Health; Social Sciences, Biome
+dical
GA T5180
UT ISI:A1974T518000003
ER
[download]

# $rec - input record
# $out - generated data

foreach ( @{ $rec->{C1} } ) {
        my $country = $1 if m{,\s?([^,]+)\.$};
        $country =~ s{^.+USA$}{USA};
        $country =~ s{^\w\w\s\d{5}$}{USA};
        $country =~ s{^\w\w$}{USA};
        $out->{'C1_country+'}->{ uc $country }++;
}
[download]

This worked well for most fields, until I discovered that I have 4.5 milion entries for CR running following code:

$out->{'CR+'}->{ $_ }++ foreach @{ $rec->{CR} };
[download]

To work around this problem, I implemented conversion from long fields names in CR (err... longer than 4 bytes for int ignoring perl decorations) into simple integer which enabled me to put 4.5 million CR records onto single machine.

I decided to use MD5 hash for each key, keep mapping of md5 to integer in-memory and replace (memory hungry) key with integer while preserving value. This way, I pushed full key names to disk in form of int -> full_name mapping.

On disk storage had two iterations, in first one I used BerkeleyDB. It saved so much memory that whole database file could fit in /dev/shm :-)

But, storing fields on disk did come with huge performance penalty to my query time. It took longer than 3 seconds to complete query and I wasn't prepared to admit defeat. To speed it up, it was logical to shard it across machines. Even better, I could control worst possible query time by changing size of shard for each node to adjust for different systems.

To implement communication between nodes, I first implemented fancy protocol, but then decided to just ship Storable object directly to socket. Well, not quite directly, since I'm using ssh with compression to speed up network transfer. Since it's mostly bulk (send data to node or receive results) it improves performance on my 100Mb/s network for about 30%.

Whole idea is to have fast throw-away calculations on data which comes from semi-formatted text files (e.g. Apache logs) so your note about limited lifespan is so very right. This was design decision. With this model, I can start on single machine until I fill up memory (or query becomes too slow) and then spread it across other machines until I have whole dataset available or scale out for query speed.

2share!2flame...

[reply]
[d/l]
[select]


Don't ask to ask, just ask
	PerlMonks