in reply to Re: Index a file with pack for fast access
in thread Index a file with pack for fast access
I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file,
Firstly, I didn't advocate it; I simply supplied an answer to the OP's question.
But there are several circumstances under which I might use it (and have).
Feeding all the data into a DB will require (at least) double the disk space, even if only temporarily.
SQLite is better in this regard than most other DBs, but it still requires substantially more space than creating an external index to the existing file.
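To put rough numbers on that: for the 1GB, 16-million-record file used in the timings below, a second copy of the data inside the database file means roughly another gigabyte, while an external index of pack-'N' offsets is 16 million x 4 bytes, i.e. around 64MB.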
SQLite's .import command expects character-delimited fields (tabs, commas, pipes, or similar) within newline-delimited records.
It cannot handle (without preprocessing) importing data in other record layouts.
The Perl indexer can handle those, and many other variations, in a single pass.
SQLite's .import file table command can take days to import 16 million records from a 1GB file. I can index it with Perl in 80 seconds.
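For concreteness, here is a minimal sketch of that kind of indexer. It assumes newline-delimited records and a data file under 4GB, and packs one 32-bit offset per record with 'N' so that it matches the reader script further down; the script name and usage line are illustrative, not the exact script used for the 80-second figure.

#! perl -sw
# Minimal offset indexer (sketch): one pack-'N' 32-bit offset per record.
# Assumes newline-delimited records and a data file smaller than 4GB.
# (-l deliberately omitted so print doesn't append newlines to the binary index.)
use strict;

my $file = $ARGV[ 0 ] or die "Usage: buildIndex <datafile>\n";

open my $in,  '<:raw', $file        or die "$file: $!";
open my $idx, '>:raw', "$file.idx"  or die "$file.idx: $!";

my $offset = 0;
while( my $line = <$in> ) {
    print {$idx} pack 'N', $offset;    # record where this line starts
    $offset += length $line;
}

close $idx;
close $in;

Run as buildIndex 1GB.csv and it writes 1GB.csv.idx alongside the data, which is the "$ARGV[ 0 ].idx" file the reader script below expects.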
If you discover the appropriate set of magic incantations -- which the SQLite folk seem to take perverse pleasure in keeping hidden -- then their bulk loader can be made to get much closer to Perl's performance, but then adding an index takes hours.
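For anyone chasing those incantations, the usual candidates are wrapping the inserts in a single transaction, relaxing the synchronous/journal settings, and creating indexes only after the load. A sketch of that via DBI follows; the table layout, column name and file names are my own assumptions, not the schema used for the timings below:

#! perl -sw
# Sketch of the usual SQLite bulk-load speedups via DBI.
# The table/column layout and input file name are illustrative assumptions.
use strict;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do( 'PRAGMA synchronous = OFF' );      # don't fsync on every write
$dbh->do( 'PRAGMA journal_mode = MEMORY' );  # keep the rollback journal in RAM

$dbh->do( 'CREATE TABLE IF NOT EXISTS onegb ( data TEXT )' );
my $sth = $dbh->prepare( 'INSERT INTO onegb ( data ) VALUES ( ? )' );

$dbh->begin_work;                            # one big transaction, not one per row
open my $in, '<', '1GB.csv' or die $!;
while( my $line = <$in> ) {
    chomp $line;
    $sth->execute( $line );
}
close $in;
$dbh->commit;

# Create indexes *after* the bulk load; building them row by row is far slower.
$dbh->do( 'CREATE INDEX IF NOT EXISTS onegb_data ON onegb ( data )' );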
Once the data is loaded and indexed, here is what random fetches by ROWID through DBI look like:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use DBI;

our $N //= 1e3;

my $start = time;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '' ) or die $DBI::errstr;
my $sth = $dbh->prepare( 'select * from onegb where ROWID = ?' ) or die $dbh->errstr;

for( 1 .. $N ) {
    my $rowID = 1 + int( rand 16*1024**2-1 );    # pick a random row out of ~16 million
    $sth->execute( $rowID ) or die $sth->errstr;
    my $data = $sth->fetch or die $sth->errstr;
    my @row = @{ $data };
}

printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;

__END__
c:\test>1gbdb -N=1e6
Ave: 0.000711 seconds/record
Even then, it still falls short of using the Perl-built index:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $N //= 100;

my $start = time;

# Slurp the whole index of packed 32-bit offsets into memory.
open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

# Fetch record $n: seek to its packed offset and read one line.
sub getRecordN {
    my( $fh, $n ) = @_;
    seek $fh, unpack( 'N', substr $idx, $n * 4, 4 ), 0;
    return scalar <$fh>;
}

scalar <ARGV>;    # read one line so the magic ARGV handle is opened on $ARGV[0]

my @lines; $#lines = $N; $#lines = 0;

for( 1 .. $N ) {
    my $toRead = int rand( length( $idx ) / 4 );
    my $line = getRecordN( \*ARGV, $toRead );
#   push @lines, "$toRead : $offset : ", $line;
}

printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;

__END__
c:\test>readIndexedFile -N=1e6 1GB.csv
Ave: 0.000481 seconds/record
SQLite is obviously more powerful, but if you do not need that power, it is of no benefit. And it comes at the cost of reduced flexibility, unless you drive it from Perl, at which point you're back to populating it via the DBI interface, with the inherent slowdown.