in reply to Re^4: search a large text file
in thread search a large text file

If you use a disk-based hash, for example a DBM::Deep hash, you won't run out of memory. The main reason to use a disk-based hash is that it can grow bigger than your memory. Another reason is that the hash is persistent: its contents stay on disk after your script exits.

For sample code, just look at the DBM::Deep perldoc page under 'Synopsis'. Basically, you call DBM::Deep->new to link a hash reference to a file on disk, and after that you use it like any other hash, except that everything you store in it is (behind the scenes) transferred directly to disk.
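
As a rough illustration of that usage (a minimal sketch, not a copy of the Synopsis): the file name "ngrams.db" and the key and values stored are made up for the example, and it assumes DBM::Deep is installed from CPAN.

    use strict;
    use warnings;
    use DBM::Deep;

    # "ngrams.db" is a hypothetical file name; it is created on first use
    # and reopened with its old contents on every later run.
    my $db = DBM::Deep->new("ngrams.db");

    # Store and read values through the returned hash reference.
    $db->{despite} = [ 17, 18 ];        # nested data is written to disk too
    push @{ $db->{despite} }, 42;       # array operations work as usual

    print "despite: @{ $db->{despite} }\n";
    print "key is present\n" if exists $db->{despite};

Note that $db is a reference, so elements are reached with $db->{key} rather than $db{key}.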

Re^6: search a large text file
by perl_lover_always (Acolyte) on Feb 10, 2011 at 09:45 UTC
    I tried to use it, but when I use it this way the results are not correct. Why?
    sub to_hash {
        my $file = shift;
        my $db = DBM::Deep->new( "$file.db" );
        open(FILE, "<$file");
        foreach $l (<FILE>) {
            my ($ngram,$line) = split /\t/, $l;
            push(@{ $db->{$ngram} }, $line);
        }
        close FILE;
        return $db;
    }
    For example, when I search for a key, I get the correct value a few times instead of, say, one or two times!

      Do you mean "many times instead of one or two times"? In English, "few" already means about one or two. I can't see the rest of your script, so I can only guess:

      1) Did you call the sub "to_hash" more than once? "to_hash" should be executed once and then never again, and by "once" I mean once ever, not once per execution of the script. Whenever you want to search, just use "my $db = DBM::Deep->new( "$file.db" );" and start searching. Remember that the file $file.db is permanent on your disk and keeps its contents between invocations of your script. Because "to_hash" pushes onto whatever is already stored, calling it twice stores every value twice.

      Additionally you might want to add "$db->clear()" to your "to_hash" subroutine so that even if you have to call it twice (because the source file changed), you get an empty hash before filling it.
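
      The two points above could be sketched roughly as follows. This is only an illustration under assumptions: the name "open_index" is made up, the added chomp and the three-argument open are not in the posted code, and the input is assumed to be tab-separated "ngram<TAB>line" text.

      use strict;
      use warnings;
      use DBM::Deep;

      # Build the index. Run this only when the source file has changed.
      sub to_hash {
          my ($file) = @_;
          my $db = DBM::Deep->new("$file.db");
          $db->clear();    # drop entries left over from earlier runs

          open my $fh, '<', $file or die "Cannot open $file: $!";
          while (my $l = <$fh>) {
              chomp $l;    # assumption: the trailing newline is unwanted
              my ($ngram, $line) = split /\t/, $l;
              push @{ $db->{$ngram} }, $line;
          }
          close $fh;
          return $db;
      }

      # For searching, reopen the existing file; the source is not re-read.
      # "open_index" is a hypothetical helper name.
      sub open_index {
          my ($file) = @_;
          return DBM::Deep->new("$file.db");
      }

      With the clear() call in place, running "to_hash" a second time replaces the stored values instead of appending to them.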

      2) Maybe your search routine prints out more than you want.

        Very simple:
        my $file_in_en=shift;
        my $hash_en=to_hash($file_in_en);
        print "@{$hash_en{'despite'}}";
        Results:
        17 18 18 18 18 18 18 18 18 18 18 18
        Expected result:
        18
        When I use a normal hash in this way, I get the correct result:
        my $file_in_en=shift;
        my %hash_en=to_hash($file_in_en);
        print "@{$hash_en{'despite'}}";
        sub to_hash {
            my %hash;
            my $file = shift;
            open(FILE, "<$file");
            foreach $l (<FILE>) {
                my ($ngram,$line) = split /\t/, $l;
                push(@{ $hash{$ngram} }, $line);
            }
            close FILE;
            return %hash;
        }