Faster Flat File

Buzz has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Faster Flat File by particle (Vicar) on Feb 03, 2002 at 04:32 UTC
probably one of the best things you could do to your code is to change your database read/processing. right now, you have two steps:

1) load the database into memory (an array)
2) process each line and print something

reading the data into an array takes unnecessary time and memory in your example. instead, open the file, and process it using a while loop, like `while(<DATABASE>) ...`.

that being said, there are a few other things you should do. here are some of them:

> use strict! it will save you many, many problems. and use warnings, too, either by -w on the shebang line, or, if you have perl >= 5.6.0, `use warnings;`.
> when you open and close files, check that they were successful, and die (or warn, croak, etc.) otherwise.
> don't use $a and $b as variable names, as these are generally reserved for sort.
> indent properly. your code is missing a closing brace, and is quite difficult to read without proper indentation.
> did you include enough code? what are $s, $r, $m, $p? i'm not sure why you're using the g modifier on your matches.
> if you're not using $rec later on, you don't need it. you can use $_ instead. i wasn't sure, so i left it in your code in my sample below.

here's a list of notes you might want to read:

while or foreach?
Opening files
Use strict warnings and diagnostics or die
perlre

here's an example of the kind of code i'm talking about. i haven't tested it, so i can't say it works. i've removed

    #!/usr/local/bin/perl -w
    use strict;
    $|++;

    use FileHandle;

    my $path     = "/path/to";
    my $database = "database";
    my $DATABASE = new FileHandle;

    ## what are these? what are they initialized to?
    my ($m, $p, $r, $s);

    open($DATABASE, "<", "$path/$database")
        or die("ERROR: cannot open database $path/$database! $!");

    while (<$DATABASE>) {
        my $rec = $_;
        chomp($rec);
        ## not good variable names, be descriptive
        my ($a, $b, $c, $d, $e, $f) = split(/\|/, $rec);
        if ($s =~ /$d/g) {
            if ($r =~ /$e/g) {
                ## the posted code had $fa here; presumably $f was meant
                if ($m =~ /$f/g) {
                    print "$b\n";
                    if ($c eq "Y") {
                        print "$a - tal\n";
                    }
                    if ($p eq "Y") {
                        print "$e";
                    }
                }
            }
            ## ... do more stuff
        }
        ## ... do more stuff
    }

    close($DATABASE)
        or die("ERROR: cannot close database $path/$database! $!");

~Particle
(ar0n) Re: Faster Flat File by ar0n (Priest) on Feb 03, 2002 at 04:39 UTC
I'd probably try something like this (untested):

    #!/usr/bin/perl -w
    use strict;
    use IO::File;
    use Text::CSV_XS;

    # $path, $database, $s, $r, $m and $p are assumed to be declared earlier
    {
        my $fh = IO::File->new("$path/$database", "r")
            or die "Can't open file: $!\n";
        my $csv = Text::CSV_XS->new({ sep_char => '|' });

        # getline returns undef at end of file, so test the row before dereferencing it
        while ( my $row = $csv->getline($fh) ) {
            my ($a, $b, $c, $d, $e, $f) = @$row;
            if ( -1 != index($s, $d) && -1 != index($r, $e) && -1 != index($m, $f) ) {
                print "$b\n";
                if ($c eq "Y") {
                    print "$a - tal\n";
                }
                elsif ($p eq "Y") {
                    print "$e";
                }
            }
        }
    }

I used index, since it's generally faster than a regex, and you seemed to be simply searching for a substring, not an actual regex. Also, Text::CSV_XS is XS (compiled C), which is usually quite fast (faster than Perl, at any rate).

Update: As merlyn pointed out recently, index need not be faster than a regex (thanks for pointing that out, blakem).

[ ar0n -- want job (boston) ]
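To make the substring-versus-regex distinction above concrete, here is a minimal sketch in plain Perl. The three helper names are made up for illustration; only `index` and the `\Q...\E` quoting escape are standard Perl. It assumes the goal is a literal substring test, which is a guess about the original poster's intent.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $needle occurs literally anywhere in $haystack.
sub contains_literal {
    my ($haystack, $needle) = @_;
    return index($haystack, $needle) != -1;
}

# The same literal test written as a regex; \Q...\E quotes any
# metacharacters in $needle so they match themselves.
sub contains_quoted_regex {
    my ($haystack, $needle) = @_;
    return $haystack =~ /\Q$needle\E/ ? 1 : 0;
}

# Unquoted interpolation treats $needle as a pattern, so a "." in the
# data matches any character, not just a literal dot.
sub contains_as_pattern {
    my ($haystack, $needle) = @_;
    return $haystack =~ /$needle/ ? 1 : 0;
}
```

With the needle `.d`, the first two helpers only accept a haystack that really contains a dot followed by `d`, while the third also accepts `abcxdef`. An unquoted needle such as `(` would fail to compile as a pattern at run time, which is another reason to prefer `index` or `\Q...\E` when the data is not meant to be a regex.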
Re: Faster Flat File by wmono (Friar) on Feb 03, 2002 at 04:15 UTC
Slurping an entire file into an array requires serious memory, if the file is of any decent size. Try this construct instead:

    open(DATABASE, "<", "file") or die "Cannot open file: $!";
    while (my $rec = <DATABASE>) {
        # Do stuff
    }
    close DATABASE;

That'll at least keep your memory consumption down. If you were going into swap before, it might even make things run faster. Good luck!
Re: Faster Flat File by rob_au (Abbot) on Feb 03, 2002 at 06:54 UTC
Another direction which you may want to look at is your data storage itself. The storage of data within a list structure, such as a flat-text file, means that all subsequent data searches will take O(n), that is, the time taken to search the dataset will scale linearly with the growing size of the dataset. Alternatively, if you were to migrate your data storage to a DBM or serialised hash structure, your subsequent lookup times by key would be O(1), that is, constant time, irrespective of the size of your dataset (i.e. a lot faster).

Also, the migration of your dataset to a hash structure would allow for the establishment of more complex data structures than currently allowable. For example, what happens currently when your data contains a `|` character?

Update - You could also make use of Tie::Hash::Approx written by OeufMayo in place of your `$s =~ /$d/g` match if you index your hash by the value in the variable `$d` (assuming it is unique).

`perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'`
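The hash-based lookup described above could be sketched as follows. This is only an illustration under stated assumptions: the records are pipe-separated, one column holds a unique key, and the whole index fits in memory. `build_index` and the sample records are made up; a persistent version would tie the hash to a DBM file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read pipe-separated records from $fh and return
# a hash ref mapping the value in column $key_col (0-based) to an
# array ref holding all fields of that record.
sub build_index {
    my ($fh, $key_col) = @_;
    my %index;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;                 # skip blank lines
        my @fields = split /\|/, $line, -1;       # -1 keeps trailing empty fields
        next unless defined $fields[$key_col];
        $index{ $fields[$key_col] } = \@fields;   # a later duplicate key overwrites an earlier one
    }
    return \%index;
}

# Demonstration on an in-memory file containing made-up records.
my $data = "1001|Widget|Y|red\n1002|Gadget|N|blue\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $index = build_index($fh, 0);
close $fh;

# One hash lookup instead of a scan over every line.
print $index->{1002}[1], "\n" if exists $index->{1002};
```

Building the index still reads every line once; the saving comes from repeated lookups afterwards. Note that this only helps exact-key lookups, so it does not replace a substring or regex match over a field.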
Re: Faster Flat File by belg4mit (Prior) on Feb 03, 2002 at 05:13 UTC
Try using Sprite, DBD::Sprite, or DBD::CSV. The advantage of the latter two is that it will be easier to switch to a different DBI handler later. This will also give you the obvious speed increases that others have pointed out, though it may be slower than a completely home-made solution.

UPDATE: Clarified first |P.

`-- perl -pe "s/\b;([st])/'\1/mg"`
Re: Faster Flat File by dws (Chancellor) on Feb 03, 2002 at 20:41 UTC
"Any help, guidance or enlightement that may be offered will be greatly appreciated."

First, where is the `$fa` in `if ($m =~ /$fa/g){` coming from? The shape of the code suggests that you may have meant to use `$f` instead. (If you're not already using `strict`, please consider doing so. It saves a lot of minor typo grief.)

For faster access to the file, your OS may provide some exploitable capabilities. If you're running on a *nix system that includes shared memory, you can slurp the file into shared memory. It might be marginally faster to access it from there, since you can bypass the overhead of opening the file.

As your file grows, linear scans get more expensive. The typical way around this is to index the file so that you don't need to read the entire thing to find the pieces you're looking for. Unfortunately, by using regexes to score hits, you make indexing difficult.

Is `if ($s =~ /$d/g){` really how you want to work that comparison? Depending on your data, it could give you a lot of false hits. If the intent is to match a search term `$s` if it is a word in `$d`, then `if ( $s =~ /\b$d\b/i ) {` is more accurate. That also puts you in a better position to build a search index from the individual words in `$d`.
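A search index built from individual words, as suggested above, might look roughly like this. It is a minimal sketch, assuming pipe-separated records with the searchable text in a single column and an index small enough to hold in memory. `build_word_index` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map each lowercased word found in column $col
# (0-based) to an array ref of the line numbers that contain it.
sub build_word_index {
    my ($fh, $col) = @_;
    my %index;
    my $line_no = 0;
    while (my $line = <$fh>) {
        $line_no++;
        chomp $line;
        my @fields = split /\|/, $line, -1;
        next unless defined $fields[$col];
        my %seen;                                  # record each word once per line
        for my $word (map { lc $_ } $fields[$col] =~ /(\w+)/g) {
            push @{ $index{$word} }, $line_no unless $seen{$word}++;
        }
    }
    return \%index;
}

# Demonstration on an in-memory file containing made-up records.
my $data = "a|red apple pie\nb|green apple\nc|Red wine\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $word_index = build_word_index($fh, 1);
close $fh;

# Look a search term up by exact (case-insensitive) word.
my $term = lc 'Apple';
print "$term: @{ $word_index->{$term} || [] }\n";
```

A whole-word lookup then becomes a single hash access rather than a regex match against every record. The trade-off is that only complete words can be found this way; partial matches would still need a scan.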