in reply to Re: Index a file with pack for fast access
in thread Index a file with pack for fast access
I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file,
Firstly, I didn't advocate it; I simply supplied an answer to the OP's question.
But there are several circumstances under which I might use it (and have).
Feeding all the data into a DB will require (at least) double the disk space, even if only temporarily.
SQLite is better in this regard than most other DBs, but it still requires substantially more space than creating an external index to the existing file.
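To put rough numbers on that: for the 1GB, 16-million-record file used in the timings below, a second copy of the data inside the database file means roughly another gigabyte, while an external index of pack-'N' offsets is 16 million x 4 bytes, i.e. around 64MB.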
SQLite's .import command expects character-delimited fields (tabs, commas, pipes, or similar) within newline-delimited records.
It cannot handle (without preprocessing) importing data in other record layouts.
The Perl indexer can handle those, and many other variations, in a single pass.
SQLite's .import file table command can take days to import 16 million records from a 1GB file. I can index it with Perl in 80 seconds.
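For concreteness, here is a minimal sketch of that kind of indexer. It assumes newline-delimited records and a data file under 4GB, and packs one 32-bit offset per record with 'N' so that it matches the reader script further down; the script name and usage line are illustrative, not the exact script used for the 80-second figure.

#! perl -sw
# Minimal offset indexer (sketch): one pack-'N' 32-bit offset per record.
# Assumes newline-delimited records and a data file smaller than 4GB.
# (-l deliberately omitted so print doesn't append newlines to the binary index.)
use strict;

my $file = $ARGV[ 0 ] or die "Usage: buildIndex <datafile>\n";

open my $in,  '<:raw', $file        or die "$file: $!";
open my $idx, '>:raw', "$file.idx"  or die "$file.idx: $!";

my $offset = 0;
while( my $line = <$in> ) {
    print {$idx} pack 'N', $offset;    # record where this line starts
    $offset += length $line;
}

close $idx;
close $in;

Run as buildIndex 1GB.csv and it writes 1GB.csv.idx alongside the data, which is the "$ARGV[ 0 ].idx" file the reader script below expects.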
If you discover the appropriate set of magic incantations -- which the SQLite folk seem to take perverse pleasure in keeping hidden -- then their bulk loader can be made to get much closer to Perl's performance, but then adding an index takes hours.
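For anyone chasing those incantations, the usual candidates are wrapping the inserts in a single transaction, relaxing the synchronous/journal settings, and creating indexes only after the load. A sketch of that via DBI follows; the table layout, column name and file names are my own assumptions, not the schema used for the timings below:

#! perl -sw
# Sketch of the usual SQLite bulk-load speedups via DBI.
# The table/column layout and input file name are illustrative assumptions.
use strict;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do( 'PRAGMA synchronous = OFF' );      # don't fsync on every write
$dbh->do( 'PRAGMA journal_mode = MEMORY' );  # keep the rollback journal in RAM

$dbh->do( 'CREATE TABLE IF NOT EXISTS onegb ( data TEXT )' );
my $sth = $dbh->prepare( 'INSERT INTO onegb ( data ) VALUES ( ? )' );

$dbh->begin_work;                            # one big transaction, not one per row
open my $in, '<', '1GB.csv' or die $!;
while( my $line = <$in> ) {
    chomp $line;
    $sth->execute( $line );
}
close $in;
$dbh->commit;

# Create indexes *after* the bulk load; building them row by row is far slower.
$dbh->do( 'CREATE INDEX IF NOT EXISTS onegb_data ON onegb ( data )' );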
Once the data is loaded and indexed, here is what random fetches by ROWID through DBI look like:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use DBI;

our $N //= 1e3;

my $start = time;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '' ) or die $DBI::errstr;
my $sth = $dbh->prepare( 'select * from onegb where ROWID = ?' ) or die $dbh->errstr;

for( 1 .. $N ) {
    my $rowID = 1 + int( rand 16*1024**2-1 );    # pick a random row out of ~16 million
    $sth->execute( $rowID ) or die $sth->errstr;
    my $data = $sth->fetch or die $sth->errstr;
    my @row = @{ $data };
}

printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;

__END__
c:\test>1gbdb -N=1e6
Ave: 0.000711 seconds/record
Even then, it still falls short of using the Perl-built index:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

our $N //= 100;

my $start = time;

# Slurp the whole index of packed 32-bit offsets into memory.
open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;

# Fetch record $n: seek to its packed offset and read one line.
sub getRecordN {
    my( $fh, $n ) = @_;
    seek $fh, unpack( 'N', substr $idx, $n * 4, 4 ), 0;
    return scalar <$fh>;
}

scalar <ARGV>;    # read one line so the magic ARGV handle is opened on $ARGV[0]

my @lines; $#lines = $N; $#lines = 0;

for( 1 .. $N ) {
    my $toRead = int rand( length( $idx ) / 4 );
    my $line = getRecordN( \*ARGV, $toRead );
#   push @lines, "$toRead : $offset : ", $line;
}

printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;

__END__
c:\test>readIndexedFile -N=1e6 1GB.csv
Ave: 0.000481 seconds/record
SQLite is obviously more powerful, but if you do not need that power, it is of no benefit. And it comes at the cost of reduced flexibility, unless you drive it from Perl, at which point you're back to populating it via the DBI interface, with the inherent slowdown.