All your observations are correct, and I would suggest that everyone re-read your reply before deciding on an architecture for a system like this one.
I had a different end goal: make a map over a Perl structure scale across commodity low-end desktop machines. To be honest, I was hoping to be able to do it on a single machine, in memory, but it didn't work out for me.
The dataset isn't huge: 540 MB, with 121887 records at about 4.5 KB per record.
<code>
dpavlin@klin:~$ ls -alh /data/isi/full.txt
-rw-r--r-- 1 dpavlin dpavlin 546M 2009-10-05 14:15 /data/isi/full.txt
</code>
The structure is quite simple, with repeatable field names:
<code>
PT J
AU RABIN, DL
KALIMO, E
MABRY, JH
TI WORLD-HEALTH-ORGANIZATION INTERNATIONAL COLLABORATIVE STUDY OF
MEDICAL-CARE UTILIZATION - SUMMARY OF METHODOLOGICAL STUDIES AND
PRELIMINARY FINDINGS
SO SOCIAL SCIENCE & MEDICINE
LA English
DT Article
C1 GEORGETOWN UNIV,SCH MED,DEPT COMMUNITY MED & INT HLTH,WASHINGTON,DC
+ 20007.
SOCIAL INSUR INST,RES INST SOCIAL SECUR,HELSINKI,FINLAND.
UNIV VERMONT,COLL MED,DEPT COMMUNITY MED,BURLINGTON,VT.
CR 1970, WHO INT COLLABORATIV
1972, MILBANK MEMORIAL F 2, V50
BICE TW, 1969, CROSS NATIONAL MEASU
BICE TW, 1971, SOC SCI MED, V5, P283
BICE TW, 1972, MILBANK MEM FUND Q, V50, P57
KALIMO E, 1970, BRIT J PREVENTIVE SO, V24, P229
KALIMO E, 1972, 1 WHO ICS MCU OCC RE
KALIMO E, 1972, MED CARE, V10, P95
KOHN R, IN PRESS
LOGAN RFL, 1972, MILBANK MEMORIAL F 2, V50, P45
RABIN DL, 1972, MILBANK MEMORIAL F 2, V50, P19
SCHACH E, 1972, MILBANK MEM FD Q, V50, P65
VUKMANOVIC C, 1972, MILBANK MEMORIAL F 2, V50, P5
WHITE KL, 1967, NEW ENGL J MED, V277, P516
WHITE KL, 1969, INT COMPARISONS MEDI
WHITE KL, 1972, MILBANK MEMORIAL F 2, V50, P31
NR 16
TC 2
PU PERGAMON-ELSEVIER SCIENCE LTD
PI OXFORD
PA THE BOULEVARD, LANGFORD LANE, KIDLINGTON, OXFORD, ENGLAND OX5 1GB
SN 0277-9536
J9 SOC SCI MED
JI Soc. Sci. Med.
PY 1974
VL 8
IS 5
BP 255
EP 262
PG 8
SC Public, Environmental & Occupational Health; Social Sciences, Biome
+dical
GA T5180
UT ISI:A1974T518000003
ER
</code>
My basic goal was to be able to write something like this as a query:
<code>
# $rec - input record
# $out - generated data
foreach ( @{ $rec->{C1} } ) {
    # take the last comma-separated token before the trailing dot (the country)
    next unless m{,\s?([^,]+)\.$};
    my $country = $1;
    # normalize US addresses (trailing USA, "XX 12345" zip codes, bare state codes)
    $country =~ s{^.+USA$}{USA};
    $country =~ s{^\w\w\s\d{5}$}{USA};
    $country =~ s{^\w\w$}{USA};
    $out->{'C1_country+'}->{ uc $country }++;
}
</code>
And I wanted to run this query as fast as possible over all my data (which affected my decision not to use disk for processing), so the simplest solution I could come up with was to load it all into a Perl hash.
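Just to illustrate the idea (this is a rough sketch, not my actual loader), records can be dealt into an array of hashes of arrayrefs; the handling of "+" continuation lines is approximate, and wrapped title lines end up as repeated values:
<code>
use strict;
use warnings;

my ( @records, $rec, $tag );

open my $fh, '<', '/data/isi/full.txt' or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^ER/ ) {                       # end of record
        push @records, $rec;
        ( $rec, $tag ) = ( undef, undef );
    }
    elsif ( $line =~ /^([A-Z0-9]{2}) (.*)/ ) {    # new field tag (PT, AU, CR, ...)
        $tag = $1;
        push @{ $rec->{$tag} }, $2;
    }
    elsif ( $line =~ /^\s*\+(.*)/ ) {             # '+' marks a wrapped value
        $rec->{$tag}[-1] .= $1;
    }
    elsif ( $line =~ /^\s+(\S.*)/ ) {             # repeated value for the current tag
        push @{ $rec->{$tag} }, $1;
    }
}
close $fh;
</code>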
This worked well for most fields, until I discovered that I have 4.5 million entries for CR when running the following code:
<code>
$out->{'CR+'}->{ $_ }++ foreach @{ $rec->{CR} };
</code>
Perl hash structures are nice, and the code is clean and concise, but memory usage for the results pushed me into swap (confirming that my anti-disk bias is correct). Having said that, I have used swap very effectively as a first line of on-disk offload before, but it doesn't mix well with a result set that has a random access pattern (the disks are typically SATA drives: fast for linear reads, but slow for almost anything else).
To work around this problem, I implemented a conversion of the long field names in CR (err... longer than the 4 bytes of an int, ignoring Perl decorations) into simple integers, which enabled me to fit the 4.5 million CR records onto a single machine.
I decided to compute an MD5 hash for each key, keep the md5-to-integer mapping in memory, and replace the (memory-hungry) key with the integer while preserving the value.
This way, I pushed the full key names to disk in the form of an int -> full_name mapping.
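Sketched very roughly (the helper name and hashes below are made up for illustration), the interning looks something like this:
<code>
use Digest::MD5 qw(md5);

my %md5_to_int;   # md5(key) -> small integer, kept in memory
my %int_to_name;  # integer  -> full key name, later pushed to disk
my $next_id = 0;

sub intern_key {
    my $name = shift;
    my $md5  = md5($name);                 # 16-byte binary digest
    $md5_to_int{$md5} ||= do {
        $int_to_name{ ++$next_id } = $name;
        $next_id;
    };
    return $md5_to_int{$md5};
}

# count CR entries by integer id instead of by full citation string
$out->{'CR+'}->{ intern_key($_) }++ foreach @{ $rec->{CR} };
</code>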
The on-disk storage had two iterations; in the first one I used BerkeleyDB. It saved so much memory that the whole database file could fit in /dev/shm :-)
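With the BerkeleyDB module from CPAN, pushing the int -> full_name mapping to disk is just a tied hash; something like the following (the filename here is made up):
<code>
use BerkeleyDB;

# tie the id -> full key name mapping to an on-disk BerkeleyDB file
tie my %int_to_name, 'BerkeleyDB::Hash',
    -Filename => '/data/isi/cr-keys.db',
    -Flags    => DB_CREATE
    or die "cannot open BerkeleyDB file: $BerkeleyDB::Error";

$int_to_name{42} = 'WHITE KL, 1967, NEW ENGL J MED, V277, P516';
print $int_to_name{42}, "\n";
</code>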
But storing fields on disk did come with a huge performance penalty on my query time. It took longer than 3 seconds to complete a query, and I wasn't prepared to admit defeat. To speed it up, it was logical to shard the data across machines. Even better, I could control the worst possible query time by changing the size of the shard on each node to adjust for different systems.
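The split itself can be as simple as dealing records out to nodes, with per-node shard size as the tuning knob; a minimal sketch (node names and weights below are hypothetical):
<code>
# weight = how many records a node gets per round; slower boxes get fewer,
# which bounds their per-query work and thus the worst-case query time
my %weight = ( node1 => 3, node2 => 2, node3 => 1 );
my %shard;                     # node name -> records assigned to it

my @deal = map { ($_) x $weight{$_} } sort keys %weight;
my $i = 0;
foreach my $rec ( @records ) {
    push @{ $shard{ $deal[ $i++ % @deal ] } }, $rec;
}
</code>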
To implement communication between nodes, I first implemented a fancy protocol, but then decided to just ship Storable objects directly over the socket. Well, not quite directly, since I'm using ssh with compression to speed up the network transfer. Since the traffic is mostly bulk (sending data to a node or receiving results back), compression improves performance on my 100 Mb/s network by about 30%.
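The wire format then boils down to Storable over an ssh pipe; roughly along these lines (the node name, worker script and variable names are placeholders):
<code>
use IPC::Open2;
use Storable qw(nstore_fd fd_retrieve);

# start a worker on the remote node over ssh with compression (-C)
my ( $from_node, $to_node );
my $pid = open2( $from_node, $to_node, qw(ssh -C node1 ./worker.pl) );

nstore_fd( \@records_for_node, $to_node );   # ship the Perl structure as-is
close $to_node;                              # tell the worker we're done sending

my $results = fd_retrieve($from_node);       # e.g. hashref of aggregated counts
waitpid( $pid, 0 );
</code>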
The whole idea is to have fast throw-away calculations on data which comes from semi-formatted text files (e.g. Apache logs), so your note about the limited lifespan is very much right. This was a design decision. With this model, I can start on a single machine until I fill up memory (or the query becomes too slow), and then spread the data across other machines until I have the whole dataset available, or scale out further for query speed.