chiragshukla has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I have a relatively large datafile (100MB or so). All lines from this file are to be processed and inserted into a MySQL table, **except** those lines that are already in the table.

For example, the ids F1, FTDR, RJIER and KERF are already present in MySQL, and I don't want to process lines from the datafile that begin with those IDs. This is what I do at the moment:

# populate DISTINCT ids from table into a variable
# separated by pipe symbol; variable looks like below
# and contains 15000 distinct ids
$compare = "F1|FTDR|RJIER|KERF....";

# open the file and loop
open (MFIL, "<datafile") or die ("ouch");
while ($lines = <MFIL>) {
    chomp ($lines);
    $lines =~ m/^$compare/o && next;
    print $lines, "\n";
}
close (MFIL);

This takes over 5 minutes on a 384MB RAM box running FC4 kernel 2.6.11 and Perl v.5.8.6.

If I instead process every line from the file and handle the MySQL duplicate-key errors manually, the file takes about 3 minutes. I am sure there must be a better/faster way to do this. Please feel free to suggest options.
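
(For reference, a minimal sketch of what that manual duplicate-handling could look like, assuming a PRIMARY KEY or UNIQUE index on the id column; the DSN, credentials, and table/column names are placeholders.)

use strict;
use warnings;
use DBI;

# Placeholders: adjust the DSN, credentials, table and column names.
my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 0, PrintError => 0 })
    or die $DBI::errstr;

my $sth = $dbh->prepare('INSERT INTO mytable (id, payload) VALUES (?, ?)')
    or die $dbh->errstr;

open my $fh, '<', 'datafile' or die "ouch: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($id, $payload) = split /\s+/, $line, 2;   # assumes the id leads each line
    # A duplicate key makes execute() return false; skip that row and move on.
    $sth->execute($id, $payload)
        or warn "skipped $id: " . $sth->errstr . "\n";
}
close $fh;
$dbh->disconnect;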

Thank you, all.

Regards,
Chirag Shukla

Replies are listed 'Best First'.
Re: large file and processing efficiency
by Fletch (Bishop) on Dec 07, 2005 at 18:32 UTC

    Populate a hash with the distinct existing ids, not a regex. Then use some variant of exists $existing{ $id } to determine if you want to ignore the line or not.
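
    A minimal sketch of that hash-based filter, assuming the id leads each line and the existing ids are pulled via DBI (the DSN, credentials, and table/column names are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Placeholders: adjust the DSN, credentials, table and column names.
    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                           { RaiseError => 1 });

    # Pull the ~15000 distinct ids once and key a hash on them.
    my %existing;
    my $ids = $dbh->selectcol_arrayref('SELECT DISTINCT id FROM mytable');
    @existing{@$ids} = ();

    open my $fh, '<', 'datafile' or die "ouch: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\s+/, $line, 2;   # assumes the id starts each line
        next if exists $existing{$id};      # O(1) hash lookup, no 15000-branch regex
        print $line, "\n";                  # process / insert the new line here
    }
    close $fh;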

    Addendum: And if you've got a really large number of IDs, you might want to populate a hash-on-disk (using, say, DB_File or the like) rather than slurping them all into RAM. That would also let you keep the existing IDs around between runs (you insert new keys into it as you insert into the database) and save you from having to extract the current active IDs from your RDBMS on every run (but then you run the risk of getting out of sync with the RDBMS' contents and of having data duplicated in more than one place, so keep that in mind).
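
    A rough sketch of that tied-hash variant, again assuming the id leads each line (the file name 'seen_ids.db' is a placeholder):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk BTree so the seen ids persist between runs.
    my %seen;
    tie %seen, 'DB_File', 'seen_ids.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "cannot tie seen_ids.db: $!";

    open my $fh, '<', 'datafile' or die "ouch: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\s+/, $line, 2;
        next if exists $seen{$id};
        # ... insert the row into MySQL here ...
        $seen{$id} = 1;    # remember the id for this run and the next
    }
    close $fh;
    untie %seen;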

      Hello Fletch,

      Using hashes cut the processing time down very significantly. Hashes did the trick in 20 seconds!

      Thanks for the nice suggestion.

      Regards,
      Chirag Shukla.

      Thanks Fletch, Roy,

      How did I forget hashes? Probably because I am new to Perl.

      Good point about hash-on-disk. Will try out different options and will let you know how things worked out. Roy's suggestion of the 'GrandFather' thread is very interesting too.

      Thanks for your suggestions.

      Regards,
      Chirag.
Re: large file and processing efficiency
by Roy Johnson (Monsignor) on Dec 07, 2005 at 18:38 UTC
Re: large file and processing efficiency
by beachbum (Beadle) on Dec 07, 2005 at 19:54 UTC
    You could also move more of the processing to the RDBMS by dropping the entire datafile into a temporary table and then issuing something like:
    insert into realtable
    select t.*
    from temptable t
    left join realtable r on t.ID = r.ID
    where r.ID is null
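
    One way that might look from Perl, staging the file with LOAD DATA and letting the anti-join filter out existing IDs (the DSN, credentials, and table names are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Placeholders throughout: DSN, credentials, table and column names.
    my $dbh = DBI->connect('DBI:mysql:database=mydb;mysql_local_infile=1',
                           'user', 'pass', { RaiseError => 1 });

    # Stage the whole datafile, then let MySQL filter the duplicates.
    $dbh->do('CREATE TEMPORARY TABLE temptable LIKE realtable');
    $dbh->do(q{LOAD DATA LOCAL INFILE 'datafile' INTO TABLE temptable});

    # Anti-join: copy only the rows whose ID is not already in realtable.
    $dbh->do(q{
        INSERT INTO realtable
        SELECT t.*
        FROM temptable t
        LEFT JOIN realtable r ON t.ID = r.ID
        WHERE r.ID IS NULL
    });

    $dbh->disconnect;
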
      Yes, that's a good idea too. Thanks, beachbum.

      Regards
      Chirag.