in reply to Verifying data in large number of textfiles

You could put the data in a database, adding a linenum and a filenum field if necessary. Then, all you'd have to do is:

- foreach line $linenum
  - Compare the number of records returned by "SELECT * WHERE LINENUM=$linenum" to the number of records returned by "SELECT DISTINCT * WHERE LINENUM=$linenum". If they're different, there are duplicate records.
- end
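
For illustration, here is a minimal sketch of the database route using DBI with SQLite; the database file, the `lines` table, and its columns (`filenum`, `linenum`, `content`) are assumptions, not anything from the original post. It also does the COUNT(*) vs. COUNT(DISTINCT) comparison in a single grouped query rather than looping over line numbers:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Assumed schema: lines(filenum INTEGER, linenum INTEGER, content TEXT),
# one row per line of every original file.
my $dbh = DBI->connect('dbi:SQLite:dbname=lines.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# For each line number, compare the total record count with the distinct
# record count; a mismatch means that line number has duplicate records.
my $sth = $dbh->prepare(q{
    SELECT linenum, COUNT(*), COUNT(DISTINCT content)
    FROM lines
    GROUP BY linenum
    HAVING COUNT(*) <> COUNT(DISTINCT content)
});
$sth->execute;

while (my ($linenum, $total, $distinct) = $sth->fetchrow_array) {
    print "line $linenum: $total records, only $distinct distinct\n";
}
$dbh->disconnect;
```

Any row it prints is a line number for which at least two files carry identical records.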

The same approach can be taken without a database. It involves regrouping all the files so that line1.dat contains the first line of every original file, line2.dat contains the second line of every original file, etc. Pseudo-code:

- foreach original file
  - $linenum = 1;
  - while not eof
    - append the line to file "line${linenum}.dat"
    - $linenum++;
  - end
- end
- foreach line###.dat file
  - Compare the number of lines returned by 'cat line###.dat | sort | uniq' with the number of lines in line###.dat. If they're different, there are duplicate records.
- end
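
A self-contained Perl version of that pseudo-code might look like the sketch below; the original files are assumed to be named on the command line, the line${linenum}.dat files are written to the current directory, and the `sort | uniq` comparison is done in memory with a hash instead of shelling out:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $max_linenum = 0;

# Regroup: lineN.dat collects the Nth line of every original file.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Can't read $file: $!";
    my $linenum = 1;
    while (my $line = <$in>) {
        open my $out, '>>', "line$linenum.dat"
            or die "Can't append to line$linenum.dat: $!";
        print {$out} $line;
        close $out;
        $max_linenum = $linenum if $linenum > $max_linenum;
        $linenum++;
    }
    close $in;
}

# Compare the total line count to the unique line count in each
# regrouped file (the same check as `sort | uniq`, done in memory).
for my $linenum (1 .. $max_linenum) {
    my $dat = "line$linenum.dat";
    open my $in, '<', $dat or die "Can't read $dat: $!";
    my $total = 0;
    my %seen;
    while (my $line = <$in>) {
        $total++;
        $seen{$line}++;
    }
    close $in;
    my $unique = keys %seen;
    print "$dat has duplicates ($total lines, $unique unique)\n"
        if $total != $unique;
}
```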

A completely different approach is to convert your CSV files to fixed-length field files. Then you can easily compare an arbitrary line in one file to the same line in another file by using seek().
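
As a rough sketch of the seek() idea, assuming every record has been padded to a known fixed length during the CSV conversion (the 80 bytes below is a made-up value), you could read record N of any file directly and compare it against the same record in another file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical fixed record length, including the trailing newline.
my $record_len = 80;

# Read record number $linenum (1-based) from a fixed-length file
# without scanning the lines before it.
sub read_record {
    my ($filename, $linenum) = @_;
    open my $fh, '<', $filename or die "Can't read $filename: $!";
    seek $fh, ($linenum - 1) * $record_len, 0
        or die "Can't seek in $filename: $!";
    read $fh, my $record, $record_len;
    close $fh;
    return $record;
}

# Compare line 42 of two files directly.
my $a = read_record('file1.dat', 42);
my $b = read_record('file2.dat', 42);
print "line 42 differs\n" if $a ne $b;
```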