in reply to Needed Performance improvement in reading and fetching from a file

So what you are basically doing is:

1. Checking if column 2 has been seen already - if so, next line
2. Else, do some processing on the row and record the value of column 2 as seen

This is a pretty common thing to do and can be super fast. As already pointed out, the easiest win is to use a hash to keep track of which col 2 values you've already seen. That brings each check down to roughly O(1) instead of O(N).
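A minimal sketch of that approach, assuming tab-separated columns and a hypothetical process_row() standing in for whatever per-row work you actually do:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;
    open my $fh, '<', 'the_file.txt' or die "Can't open the_file.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @cols = split /\t/, $line;      # adjust the split to match your data
        next if $seen{ $cols[1] }++;       # hash lookup + increment, roughly O(1) per line
        process_row(@cols);
    }
    close $fh;

    sub process_row {
        my @cols = @_;
        # placeholder: your existing per-row processing goes here
        print "processing: $cols[1]\n";
    }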

20k rows isn't a lot - the hash should be all you need. If you find yourself dealing with a LOT of records (say half a million) you can get really cheap use of multiple CPUs/cores (assuming you have them) by splitting the work into two scripts: one strips out all the lines with duplicated col 2 values, so the second can skip that step entirely. Pipe the output of the first into the second and Unix will run the two processes in parallel for you - assuming you're on a Unix OS, that is. Something like:

cat the_file.txt | remove_duplicates.pl | process_data.pl
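The first script is just a streaming filter. A minimal sketch, again assuming tab-separated input (process_data.pl then reads the already-deduplicated lines on STDIN and does the real work):

    #!/usr/bin/perl
    # remove_duplicates.pl - pass through only lines whose column 2
    # hasn't been seen yet; everything else is dropped.
    use strict;
    use warnings;

    my %seen;
    while ( my $line = <STDIN> ) {
        my $col2 = ( split /\t/, $line )[1];   # adjust the split to match your data
        print $line unless $seen{$col2}++;
    }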