in reply to Verifying data in large number of textfiles
You might try using a diff utility, but with 5000 files that's a lot of comparisons. Most diffs do, however, have options that let you ignore some whitespace and other "inconsequential" differences (e.g. GNU diff's -w and -B switches).
You could create a Digest::MD5 digest for each file and then compare those. If the files are identical, the MD5s will be too. However, even small whitespace differences will change the digest, which may be too sensitive for your purpose.
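A minimal sketch of that approach — the glob pattern and the duplicate-grouping report are illustrative, not part of your setup:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Group files by their MD5 digest; files with byte-identical
# content land in the same bucket.
my %by_digest;
for my $file ( glob '*.txt' ) {            # illustrative file list
    open my $fh, '<', $file or die "Can't open $file: $!";
    binmode $fh;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    push @{ $by_digest{$md5} }, $file;
}

# Any bucket holding more than one name is a set of exact duplicates.
for my $md5 ( sort keys %by_digest ) {
    my @files = @{ $by_digest{$md5} };
    print "$md5: @files\n" if @files > 1;
}
```

Note that this finds only byte-for-byte matches; a single changed space puts two files in different buckets.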
I think I would try preprocessing the files to strip all the whitespace, and then produce an MD5 from the result. That still won't guarantee you catch files that hold the same data set in different formatting (eg. 10.0 -v- 10 -v- 1e1 etc.), but it should cut down the spurious mismatches considerably.
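One way to do that preprocessing — slurp each file, delete every whitespace character, and digest what's left (a sketch; the subroutine name is mine):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# MD5 of a file's content with all whitespace removed, so files
# that differ only in spacing or line breaks compare equal.
sub squashed_md5 {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    local $/;                  # slurp mode: read the whole file
    my $data = <$fh>;
    $data =~ s/\s+//g;         # strip spaces, tabs, newlines
    return md5_hex($data);
}
```

As noted above, this still treats 10 and 10.0 as different; normalising numbers would need a further pass tailored to your data format.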
Beyond that, a lot depends on the size, nature and format of the data and the reliability of your ripping process.
Replies are listed 'Best First'.

Re^2: Verifying data in large number of textfiles
by dchandler (Sexton) on Aug 18, 2004 at 01:36 UTC
by BrowserUk (Patriarch) on Aug 18, 2004 at 03:04 UTC