Baz has asked for the wisdom of the Perl Monks concerning the following question:

I'm using a text file as a database, and the file has grown to about 150 kBytes. I can't reach the information at the end of the file, so I'm guessing that perhaps I need to seek within the file before using read (as is done in C for large files).
A loop is used to scan the file line by line for a certain symbol, but the search only reaches as far as line 4300, and more lines exist after this.
open(INF,"data.txt"); ## Open read file @userdata = <INF>; ## Put into an array close(INF); $linenum=0; foreach $line (@userdata) { print "$linenum\n"; if ($userdata[$linenum] =~ /¦$tail/) { ********** Do Stuff - But currently not working is symbol beyond l +ine 4300 } $linenum++; }
Any ideas, anyone? Thanks.

Replies are listed 'Best First'.
Re: File Seeking
by maverick (Curate) on Nov 10, 2001 at 20:50 UTC
    Well, I don't want to be critical, but this is really an inefficient way to go about it...and it's only going to get worse as your number of records increases. I would suggest moving your system to a database of one flavor or another. There are several very good open source ones (MySQL and PostgreSQL are two). What you seem to be doing is equivalent to a 'LIKE' operation in a database.
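
    For illustration, a rough DBI sketch of that kind of LIKE lookup (the table and column names here are made up, and you'd need DBI plus a driver such as DBD::mysql installed):

        use strict;
        use DBI;

        # connect to a hypothetical database called "records";
        # adjust the DSN, user and password for your setup
        my $dbh = DBI->connect("dbi:mysql:records", "user", "password",
                               { RaiseError => 1 });

        my $tail = "some string";

        # let the database do the searching; "lines" and "text" are
        # placeholder names for your table and column
        my $sth = $dbh->prepare("SELECT text FROM lines WHERE text LIKE ?");
        $sth->execute("%$tail%");

        while (my ($text) = $sth->fetchrow_array) {
            # do stuff with each matching row
        }

        $dbh->disconnect;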

    Aside from that: you're reading the entire contents of your data file into memory and then doing stuff with the matching rows. Depending on the length of each record, that could be a LOT of memory usage...and may be the source of your woes. Try this instead:

    use strict;    # this will help you catch a LOT of errors

    my $tail = "some string";

    open(INF, "data.txt") or die "Couldn't open data file: $!";
    my $linenum = 0;
    while (<INF>) {
        print "$linenum\n";
        if ($_ =~ /¦$tail/) {    # same ¦ symbol as in the original post
            # do stuff
        }
        $linenum++;
    }
    close(INF);
    you have the same basic functionality, but you're only placing one row into memory at a time.

    HTH

    /\/\averick
    perl -l -e "eval pack('h*','072796e6470272f2c5f2c5166756279636b672');"

(jeffa) Re: File Seeking
by jeffa (Bishop) on Nov 10, 2001 at 20:38 UTC
    Looks like it's time for you to 'upgrade' to either Berkeley DB files or a full-blown database, because it only gets tougher from here. ;)

    In the meantime, File::ReadBackwards might be of use to you.
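
    A minimal File::ReadBackwards sketch, assuming the module is installed from CPAN:

        use File::ReadBackwards;

        # open data.txt so that lines come back last-line-first
        my $bw = File::ReadBackwards->new("data.txt")
            or die "Couldn't open data.txt: $!";

        while (defined(my $line = $bw->readline)) {
            # $line is the next line, working backwards from the end
        }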

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    F--F--F--F--F--F--F--F--
    (the triplet paradiddle)
    
Re: File Seeking
by thraxil (Prior) on Nov 10, 2001 at 23:40 UTC

    while i would definitely agree with the other monks that you're going to want to migrate to a real database solution pretty soon, there is an intermediate solution that's likely to help you out if you're on unix.

    open your file through grep (or egrep). instead of opening it and searching in perl, let grep do the search and just deal with its results. grep is vastly more efficient at searching than your perl code will ever be. i.e., something like:

    my $term = "foobar"; open(INF,"grep $term data.txt|") or die "couldn't open file: $!"; while (<INF>) { # do some stuff with $_ } close INF;

    of course, if you're getting the search term from users, you've got to be careful about dangerous shell characters and such.
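
    one rough way to handle that (just a sketch) is to quotemeta the term before it hits the shell, and tell grep to treat it as a fixed string:

        my $term = "foobar";

        # backslash-escape anything that isn't a word character before
        # interpolating it into the command line; -F means fixed-string
        # match and -e keeps a leading dash from looking like an option
        my $safe = quotemeta($term);

        open(INF, "grep -F -e $safe data.txt |")
            or die "couldn't open pipe: $!";
        while (<INF>) {
            # do some stuff with $_
        }
        close INF;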

    i had to use this trick once when i had no control over the format of the "database". it will scan through a 20MB file in the blink of an eye without using much memory at all, whereas the all-perl solution might take several minutes and fill your RAM.

    once again, this isn't by any means the best solution but if you have no control over the format of the data, it can speed things up a lot.

    update: fixed a typo. thanks ChemBoy

    anders pearson

Re: File Seeking
by Amoe (Friar) on Nov 10, 2001 at 20:43 UTC
    I would think you could read through the file with a while (<INF>) loop. Somewhere in the perldocs it says that reading a file in list context (the <> operator assigned to an array) is extremely memory-heavy for large files. If that doesn't work, then you should check out the seek function.
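
    For reference, here's a tiny sketch of tell/seek, which work on byte offsets rather than line numbers (the marker pattern is just a placeholder):

        use Fcntl qw(SEEK_SET);    # symbolic name for "offset from start of file"

        open(INF, "data.txt") or die "Couldn't open data file: $!";

        my $pos = 0;
        while (<INF>) {
            last if /some marker/;    # stop once the marker line is found
            $pos = tell(INF);         # byte offset of the start of the next line
        }

        # later, jump straight back to the marker line without rereading
        seek(INF, $pos, SEEK_SET) or die "Seek failed: $!";
        my $marker_line = <INF>;

        close(INF);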

    --
    my one true love
Re: File Seeking
by {NULE} (Hermit) on Nov 11, 2001 at 04:12 UTC
    Hi,

    I wrote an HL7 Browser using Perl/Tk that is regularly asked to slurp in 100+ MB files. Not only that, but portions of that data are parsed and loaded into HList widgets. Obviously it consumes 200, 300 or more MB of system memory, but the point is that if your loop is not surviving past line 4300, then there is another issue. The 100 MB files I deal with have hundreds of thousands of lines in them.

    I'd definitely agree that if you can think of a better way to handle your situation then you should do so, but I would hate to see you go through a bunch of conversion work and then find that wasn't the real problem. It might be helpful if we could take a gander at more of your code here. Another thought is that perhaps your file has an untimely EOF marker in it.
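
    If you want to test that last guess, a quick scan for a stray control-Z (the old DOS end-of-file marker, which can cut text-mode reads short on Windows) might look something like this:

        open(INF, "data.txt") or die "Couldn't open data file: $!";
        binmode(INF);    # read raw bytes so nothing gets treated as end-of-file
        while (<INF>) {
            print "control-Z found on line $.\n" if /\x1A/;
        }
        close(INF);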

    As a more permanent solution I like maverick's idea of using MySQL or PostgreSQL, but another method that might work and is far simpler to implement is a GDBM database. This is another type of file that I have seen work well even as it grows to tens of MB in size. (It works fine at the 150+ MB size, but can take hours to do a reorganization.) Do a search for GDBM_File for more information. Yet another approach might be to use the filesystem to break your information up into directories to make it a little faster to parse.
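
    A bare-bones GDBM_File sketch, assuming gdbm is available on your system (the key and value here are only examples):

        use GDBM_File;

        # tie a hash to an on-disk GDBM file, creating it if necessary
        tie my %db, 'GDBM_File', 'data.gdbm', &GDBM_WRCREAT, 0640
            or die "Couldn't tie GDBM file: $!";

        $db{'some key'} = 'some value';    # writes go straight to disk
        print $db{'some key'}, "\n";       # lookups don't load the whole file

        untie %db;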

    Good luck, however you decide to proceed,
    {NULE}
    --
    http://www.nule.org

Re: File Seeking
by brother ab (Scribe) on Nov 11, 2001 at 16:30 UTC

    Maybe Array::FileReader could help you. By the way, the documentation of that module answers exactly your question.
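
    If memory serves, its interface is a tied array, roughly like the sketch below, but do check the module's own documentation for the exact synopsis:

        use Array::FileReader;

        # tie an array to the file; lines are only read as they are indexed
        tie my @lines, 'Array::FileReader', 'data.txt';

        print $lines[4300];    # fetches line 4301 (arrays are zero-based) on demand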

    -- brother ab