Re: Index a file with pack for fast access
by moritz (Cardinal) on Dec 16, 2011 at 19:07 UTC
If the rest of your file looks the same, you don't even need an index. Each line that you pasted has 34 characters (assuming that the newline is a single line feed, and not CR LF as on Windows), so to get the $n-th line (counting lines from 0), you can just do:
my $bytes_per_line = 34;                   # fixed record length, newline included
my $pos = $line_number * $bytes_per_line;
open(IN, '<', $file) or die $!;            # $file and $line_number assumed set
seek(IN, $pos, 0);
my $line = <IN>;
If you need to access by the first column (and not line number), subtract the value of the first row before calculating the line number.
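A minimal sketch of that keyed variant (assuming the first column is numeric and the keys are consecutive; the 34-byte record length is from the sample, and $file/$key are just placeholders):
use strict;
use warnings;

my ( $file, $key ) = @ARGV;
my $bytes_per_line = 34;                 # fixed record length, newline included

open my $in, '<', $file or die "Can't open $file: $!";
my $first_line = <$in>;
my ( $first_key ) = split ' ', $first_line;   # key in the first column of row 0
my $line_number = $key - $first_key;          # consecutive keys => zero-based row index

seek $in, $line_number * $bytes_per_line, 0 or die "seek failed: $!";
print scalar <$in>;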
I updated the original post. Not all of the lines are 34 bytes long and I should have posted a better cross section of the file. Sorry.
Re: Index a file with pack for fast access
by BrowserUk (Patriarch) on Dec 16, 2011 at 23:07 UTC
This builds the index for the file named on the command line:
#! perl -slw
use strict;
open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;
syswrite INDEX, pack( 'N', 0 ), 4;                     # offset of the first line
syswrite INDEX, pack( 'N', tell *ARGV ), 4 while <>;   # offset of each subsequent line
close INDEX;
And this loads the appropriate index file for its input argument and then reads 100 records at random:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
our $N //= 100;
open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;
my $start = time;
open DAT, '<', $ARGV[ 0 ] or die $!;
for( 1 .. $N ) {
    my $toRead = int rand( length( $idx ) / 4 );            # pick a random record number
    my $offset = unpack 'N', substr $idx, $toRead * 4, 4;   # look up its byte offset
    seek DAT, $offset, 0;
    my $line = <DAT>;
    # print $line;
}
close DAT;
printf "Ave. %.6f seconds/record\n", ( time() -$start ) / $N;
And here is a console log with timings of indexing a 1GB file containing 16 million records and then reading 100 records at random via that index:
[23:03:42.25] c:\test>indexFile 1GB.csv
[23:05:08.24] c:\test>readIndexedFile 1GB.csv
Ave. 0.003699 seconds/record
[23:05:40.38] c:\test>readIndexedFile 1GB.csv
Ave. 0.003991 seconds/record
Hi BrowserUk,
Thanks for your helpful response! I've created the index as you indicated above in your code, using this:
open(IN, $oneper) or die "Can't open file $oneper for reading: $!\n";
open(INDEX, ">:raw", "$file.idx") or die "Can't open $file.idx for read/write: $!\n";
syswrite INDEX, pack('N', 0), 4;
while (<IN>) {
    syswrite INDEX, pack('N', tell INDEX), 4;
}
close INDEX;
I've written a script to read the file, as follows:
open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!
+";
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;
open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
foreach my $lineNum (sort {$a cmp $b} keys %todo) {
my $offset = unpack 'N', substr $idx, $lineNum * 4, 4;
print "offset is $offset for linenum $lineNum\n<br>";
seek FILE, $offset, 0;
my $line = <FILE>;
print "found line $line\n";
}
The start of my file is:
1 NoResults
2 NoResults
3 13 32446841 0
4 13 32447221 0
5 7 91839109 1
6 7 91747130 1
7 7 91779556 1
8 7 92408328 0
9 7 92373453 0
10 7 92383887 0
11 7 11364200 0
12 7 11337163 0
When I supply lineNum 3 it gives me back:
offset is 12 for linenum 3
found line 2 NoResults
What have I done wrong? :( It feels like it's not indexing an entire line?
Your line numbers start at 1, but the index entries start at 0, so subtract one from the line number before indexing into the index:
open INDEX, "<:raw","$index" or die "Can't open $index for reading: $!
+";
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;
open FILE, "<$oneper" or die "Can't open $oneper for reading: $!";
foreach my $lineNum (sort {$a cmp $b} keys %todo) {
my $offset = unpack 'N', substr $idx, ( $lineNum - 1 ) * 4, 4; ##
+ Modified!!
print "offset is $offset for linenum $lineNum\n<br>";
seek FILE, $offset, 0;
my $line = <FILE>;
print "found line $line\n";
}
That should do the trick.
Re: Index a file with pack for fast access
by BrowserUk (Patriarch) on Dec 16, 2011 at 19:52 UTC
open(INDEX, "+>$file.idx") or die "Can't open $file.idx for read/write: $!\n";</c>
If you are using Windows -- and it will do no harm even if you are not -- one thing you should be doing is calling binmode on your index file.
pack 'N' can produce values containing bytes that look like various control characters -- \r, \n, ^Z, etc. -- that will cause Windows to do nasty things to your index file unless you tell it that the file will contain binary data.
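For example (either form works; the second is the same raw layer the scripts above use):
open(INDEX, "+>$file.idx") or die "Can't open $file.idx for read/write: $!\n";
binmode INDEX;   # no CRLF translation, no ^Z end-of-file handling

# or, equivalently, request the raw layer at open time:
open(INDEX, '+>:raw', "$file.idx") or die "Can't open $file.idx for read/write: $!\n";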
Re: Index a file with pack for fast access
by juster (Friar) on Dec 16, 2011 at 23:38 UTC
Asking for line 3 using this code gives me back:
size is 4
offset is 12
found line 39109 1 (incorrect data that appears to be from the middle of line 5)
There is an error in your logic. The offset for line 3 should be 8: line 1 is at offset 0, line 2 at 4, line 3 at 8. The line numbers are one-based and the index entries are zero-based, so the offset into the index should be entry size * (line - 1), i.e. 4 * (line - 1).
A similar off-by-one error is in your index-writing code. You never write the offset to the last line of the text file. If I were writing this I would copy BrowserUk's example and simply write an offset of 0 to the index file straight away.
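Something like this (a sketch, using the IN/INDEX handles and the $idx/$lineNum variables from your code):
syswrite INDEX, pack( 'N', 0 ), 4;             # offset of line 1, written straight away
while ( <IN> ) {
    syswrite INDEX, pack( 'N', tell IN ), 4;   # offset of the line after this one
}
close INDEX;                                   # (the final entry is just the EOF offset)

# lookup: line numbers are one-based, index entries are zero-based
my $offset = unpack 'N', substr $idx, 4 * ( $lineNum - 1 ), 4;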
Re: Index a file with pack for fast access
by sundialsvc4 (Abbot) on Dec 21, 2011 at 00:19 UTC
Now, this is an interesting thread, and I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file, or (b) at least indexing the data using such a mechanism. I mean ... without(!!) debating the validity of your chosen solution to this problem, why did you choose it? I am genuinely interested in your response.
I would very sincerely like to know, BrowserUK, why you advocate constructing an index, as it were, “manually,” instead of either (a) putting the data into (say...) a SQLite database file,
Firstly, I didn't advocate it. I simply supplied an answer to the OP's question.
But there are several circumstances under which I might use it (and have).
- If the file is huge, i.e. > 4GB.
Feeding all the data into a DB will require (at least) double the disk space, even if only temporarily.
SQLite is better in this regard than most other DBs, but it still requires substantially more space than creating an external index to the existing file.
- If the file contains records that are not in a convenient format for bulk importation to the DB.
SQLite's .import command expects char-delimited (tab, comma, pipe or similar) fields within newline-delimited records.
It cannot handle (without preprocessing) importing data from:
- Fixed-length, undelimited fields in a binary file.
- Multi-line records, e.g. FASTA files.
The Perl indexer can handle both of those, and many other variations, in a single pass (see the sketch at the end of this post).
- If I was in a hurry.
SQLite's .import file table command can take days to import 16 million records from a 1GB file. I can index it with Perl in 80 seconds.
If you discover the appropriate set of magic incantations -- which the SQLite folk seem to take perverse pleasure in keeping hidden -- then their bulk loader can be made to get much closer to Perl's performance, but then adding an index takes hours.
- Once built, SQLite's random access performance is very good:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
use DBI;
our $N //= 1e3;
my $start = time;
my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db','','' )
or die DBI::errstr;
my $sth = $dbh->prepare( 'select * from onegb where ROWID = ?' )
or die DBI->errstr;
for( 1 .. $N ) {
my $rowID = 1 + int( rand 16*1024**2-1 );   # random ROWID in [ 1 .. ~16 million ]
$sth->execute( $rowID )
or die $sth->errstr;
my $data = $sth->fetch
or die $sth->errstr;
my @row = @{ $data };
}
printf "Ave: %.6f seconds/record\n", ( time() - $start ) / $N;
__END__
c:\test>1gbdb -N=1e6
Ave: 0.000711 seconds/record
But it still falls short of using the Perl-built index:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
our $N //= 100;
my $start = time;
open INDEX, '<:raw', "$ARGV[ 0 ].idx" or die $!;
my $len = -s( INDEX );
sysread INDEX, my( $idx ), $len;
close INDEX;
sub getRecordN {
my( $fh, $n ) = @_;
seek $fh, unpack( 'N', substr $idx, $n * 4, 4 ), 0;
return scalar <$fh>;
}
scalar <ARGV>;                            # read one line to open the magic ARGV filehandle
my @lines; $#lines = $N; $#lines = 0;     # preallocate (only used if the push below is enabled)
for( 1 .. $N ) {
my $toRead = int rand( length( $idx ) / 4 );
my $line = getRecordN( \*ARGV, $toRead );
# push @lines, "$toRead : $offset : ", $line;
}
printf "Ave: %.6f seconds/record\n", ( time() -$start ) / $N;
__END__
c:\test>readIndexedFile -N=1e6 1GB.csv
Ave: 0.000481 seconds/record
SQLite is obviously more powerful, but if you do not need that power, it is of no benefit. And it comes at the cost of less flexibility, unless you drive it from Perl -- at which point you are populating it via the DBI interface, with the inherent slowdown.
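For completeness, here is the kind of multi-line-record indexer mentioned above -- a sketch only, assuming FASTA-style input where each record starts with a '>' header line, and using the same one-packed-'N'-offset-per-record index format:
#! perl -slw
use strict;

open INDEX, '>:raw', "$ARGV[ 0 ].idx" or die $!;
open IN, '<', $ARGV[ 0 ] or die $!;

my $pos = 0;
while( <IN> ) {
    syswrite INDEX, pack( 'N', $pos ), 4 if m/^>/;   # a new record starts at this offset
    $pos = tell IN;
}
close IN;
close INDEX;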
Interesting. (Upvoted.) Of course it is understood that you are answering, not advocating, but I found the answer interesting and informative.
As an aside, one characteristic of SQLite that “bit me bad” at first is the way that system handles transactions. Basically, you must have one, because if you don't, SQLite will physically verify every single disk write by reading the information back again. That certainly can produce the “hours or days” problem, and wrapping the load in a transaction just as dramatically relieves it. I admit that I tend towards SQL-based systems mainly so that I can subsequently run queries against them. Perhaps I do not use hand-built searching techniques enough. Thanks for your example.
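For the record, the usual incantation from Perl looks something like this (a sketch; the onegb table name is from the example above, and the four comma-separated fields are an assumption):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=1gb.db', '', '', { RaiseError => 1 } );

# trade durability for speed during the bulk load
$dbh->do( 'PRAGMA synchronous = OFF' );
$dbh->do( 'PRAGMA journal_mode = MEMORY' );

$dbh->begin_work;    # one transaction for the whole load: one sync instead of one per row
my $sth = $dbh->prepare( 'INSERT INTO onegb VALUES ( ?, ?, ?, ? )' );
while ( <> ) {
    chomp;
    $sth->execute( split /,/ );
}
$dbh->commit;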
... would very sincerely like to know, BrowserUK, ...
Perhaps you can ask BrowserUk directly, by responding to one of his posts?