in reply to DBI::SQLite slowness
Assumption: this data you are de-duping is downloaded fresh, daily from TwitFace.
The idea of loading 180 million records into a db on disk in order to de-dup it is ridiculous if you are in any way concerned with speed.
The following shows a 10-line Perl script de-duping a 200-million line, 2.8 GB file of 12-digit numbers in a little over 3 1/2 minutes (roughly 213 seconds by the timestamps below), using less than 30 MB of RAM to do so:
```
C:\test>dir 1054929.dat
20/09/2013  04:22     2,800,000,000 1054929.dat

C:\test>wc -l 1054929.dat
200000000 1054929.dat

C:\test>head 1054929.dat
100112321443
100135127486
100110839892
100098464584
100098900542
100048844759
100090430059
100018238859
100132791659
100027638968

C:\test>1054929 1054929.dat | wc -l
1379647642.87527
1379647855.6311
113526721
```
That's processing the 12-digit numbers at a rate of just under 1 million per second.
You cannot even load the data into the DB at 1/100th of that rate, never mind get the de-duped data back out.
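The script itself is not reproduced above, so here is a minimal sketch of one way to hit that kind of memory footprint, assuming the values cluster just above 100,000,000,000 as the head output suggests; the $BASE constant, the stderr timestamps and the variable names are illustrative, not the actual code:

```perl
#!/usr/bin/perl
# Illustrative sketch only -- not the script from the run above.
# Assumes (as the sample data suggests) that every value sits in a
# narrow band just above 100,000,000,000, so a vec() bit vector of
# the offsets stays in the tens of megabytes.
use strict;
use warnings;
use Time::HiRes qw( time );

my $BASE = 100_000_000_000;   # assumed lower bound of the values
my $seen = '';                # bit vector; vec() grows it on demand

print STDERR time(), "\n";    # start timestamp (stderr, so it is not
                              # swallowed by a downstream "| wc -l")
while( my $line = <> ) {
    chomp $line;
    my $off = $line - $BASE;          # map the value into the vector
    next if vec( $seen, $off, 1 );    # seen before: drop the duplicate
    vec( $seen, $off, 1 ) = 1;        # first sighting: mark it...
    print $line, "\n";                # ...and pass it through
}
print STDERR time(), "\n";    # end timestamp
```

If the values really do span only ~135 million distinct offsets, as the head output hints, the bit vector tops out around 17 MB, which is how a 200-million line file can be de-duped without touching disk and in well under 30 MB of RAM.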