chiragshukla has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks,

I have a relatively large datafile (100MB or so). All lines from this file are to be processed and inserted into a MySQL table, **except** those lines that are already in the table.

For example, the ids F1, FTDR, RJIER and KERF are already present in MySQL, and I don't want to process lines from the datafile that begin with those IDs. This is what I do at the moment:

# populate DISTINCT ids from table into a variable
# separated by pipe symbol; variable looks like below
# and contains 15000 distinct ids
$compare = "F1|FTDR|RJIER|KERF....";

# open the file and loop
open (MFIL, "<datafile") or die ("ouch");
while ($lines = <MFIL>) {
    chomp ($lines);
    $lines =~ m/^$compare/o && next;
    print $lines, "\n";
}
close (MFIL);

This takes over 5 minutes on a 384MB RAM box running FC4 kernel 2.6.11 and Perl v.5.8.6.

If I instead process every line from the file and handle the MySQL duplicate-key errors manually, the file takes about 3 minutes. I am sure there must be a better/faster way to do this. Please feel free to suggest options.
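
(For reference, a minimal sketch of what that manual duplicate-handling could look like, assuming a PRIMARY KEY or UNIQUE index on the id column; the DSN, credentials, and table/column names are placeholders.)

use strict;
use warnings;
use DBI;

# Placeholders: adjust the DSN, credentials, table and column names.
my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                       { RaiseError => 0, PrintError => 0 })
    or die $DBI::errstr;

my $sth = $dbh->prepare('INSERT INTO mytable (id, payload) VALUES (?, ?)')
    or die $dbh->errstr;

open my $fh, '<', 'datafile' or die "ouch: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($id, $payload) = split /\s+/, $line, 2;   # assumes the id leads each line
    # A duplicate key makes execute() return false; skip that row and move on.
    $sth->execute($id, $payload)
        or warn "skipped $id: " . $sth->errstr . "\n";
}
close $fh;
$dbh->disconnect;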

Thank you, all.

Regards,
Chirag Shukla

Replies are listed 'Best First'.
Re: large file and processing efficiency
by Fletch (Bishop) on Dec 07, 2005 at 18:32 UTC

    Populate a hash with the distinct existing ids, not a regex. Then use some variant of exists $existing{ $id } to determine if you want to ignore the line or not.
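
    A minimal sketch of that hash-based filter, assuming the id leads each line and the existing ids are pulled via DBI (the DSN, credentials, and table/column names are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Placeholders: adjust the DSN, credentials, table and column names.
    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'pass',
                           { RaiseError => 1 });

    # Pull the ~15000 distinct ids once and key a hash on them.
    my %existing;
    my $ids = $dbh->selectcol_arrayref('SELECT DISTINCT id FROM mytable');
    @existing{@$ids} = ();

    open my $fh, '<', 'datafile' or die "ouch: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\s+/, $line, 2;   # assumes the id starts each line
        next if exists $existing{$id};      # O(1) hash lookup, no 15000-branch regex
        print $line, "\n";                  # process / insert the new line here
    }
    close $fh;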

    Addendum: And if you've got a really large number of IDs, you might want to populate a hash-on-disk (using, say, DB_File or the like) rather than slurping them all into RAM. That would also let you keep the existing IDs around between runs (you insert new keys into it as you insert into the database) and save you from having to extract the current active IDs from your RDBMS on every run (but then you run the risk of getting out of sync with the RDBMS' contents and of having data duplicated in more than one place, so keep that in mind).
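
    A rough sketch of that tied-hash variant, again assuming the id leads each line (the file name 'seen_ids.db' is a placeholder):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk BTree so the seen ids persist between runs.
    my %seen;
    tie %seen, 'DB_File', 'seen_ids.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "cannot tie seen_ids.db: $!";

    open my $fh, '<', 'datafile' or die "ouch: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\s+/, $line, 2;
        next if exists $seen{$id};
        # ... insert the row into MySQL here ...
        $seen{$id} = 1;    # remember the id for this run and the next
    }
    close $fh;
    untie %seen;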

      Hello Fletch,

      Using hashes cut the processing time down very significantly. Hashes did the trick in 20 seconds!

      Thanks for the nice suggestion.

      Regards,
      Chirag Shukla.

      Thanks Fletch, Roy,

      How did I forget hashes? Probably because I am new to Perl.

      Good point about hash-on-disk. Will try out different options and will let you know how things worked out. Roy's suggestion of the 'GrandFather' thread is very interesting too.

      Thanks for your suggestions.

      Regards,
      Chirag.
Re: large file and processing efficiency
by Roy Johnson (Monsignor) on Dec 07, 2005 at 18:38 UTC
Re: large file and processing efficiency
by beachbum (Beadle) on Dec 07, 2005 at 19:54 UTC
    You could also move more of the processing to the RDBMS by dropping the entire datafile into a temporary table and then issuing something like:
    insert into realtable
    select t.*
    from temptable t
    left join realtable r on t.ID = r.ID
    where r.ID is null
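
    One way that might look from Perl, staging the file with LOAD DATA and letting the anti-join filter out existing IDs (the DSN, credentials, and table names are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Placeholders throughout: DSN, credentials, table and column names.
    my $dbh = DBI->connect('DBI:mysql:database=mydb;mysql_local_infile=1',
                           'user', 'pass', { RaiseError => 1 });

    # Stage the whole datafile, then let MySQL filter the duplicates.
    $dbh->do('CREATE TEMPORARY TABLE temptable LIKE realtable');
    $dbh->do(q{LOAD DATA LOCAL INFILE 'datafile' INTO TABLE temptable});

    # Anti-join: copy only the rows whose ID is not already in realtable.
    $dbh->do(q{
        INSERT INTO realtable
        SELECT t.*
        FROM temptable t
        LEFT JOIN realtable r ON t.ID = r.ID
        WHERE r.ID IS NULL
    });

    $dbh->disconnect;
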
      Yes, that's a good idea too. Thanks, beachbum.

      Regards
      Chirag.