in reply to Re^2: Building a new file by filtering a randomized old file on two fields
That is a very important point, and it is one of the key reasons why I recommended testing each retrieved record against a regex: if any assumption (such as this one) on which the logic depends is found not to hold, the program must die. Many data files of this kind are fixed-length.
Another trick, which works just about as well, is to randomly select a byte position within the file, seek() to that position, read a line of data and throw it away, then read the next line and keep that one, verifying, of course, that this second line appears plausible. The seek() is presumed to have dropped us smack-dab into the middle of a record, so the first line read is only a fragment. For this trick to actually work, there must not be funky multi-character issues. In other words, the program must be able to somehow “land on its feet” when it reads the second record.
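Here is a rough sketch of that seek-and-discard trick, not a tested production routine. The sub name random_record, its arguments, and the wrap-around to the first record when the seek lands in the last line are all my own assumptions for illustration; the regex check is the "must die" test described above.

```perl
use strict;
use warnings;

# Hypothetical helper: return one pseudo-randomly chosen record from $path,
# or die if that record does not match the expected pattern $record_re.
sub random_record {
    my ($path, $record_re) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $size = -s $fh;
    die "$path is empty\n" unless $size;

    # Jump to a random byte offset; we are probably in mid-record now.
    seek($fh, int(rand($size)), 0) or die "Cannot seek in $path: $!";

    my $discarded = <$fh>;    # partial line: read it and throw it away
    my $line      = <$fh>;    # the next full line is the candidate

    # Assumption: if we ran off the end of the file, wrap to the first record.
    unless (defined $line) {
        seek($fh, 0, 0) or die "Cannot seek in $path: $!";
        $line = <$fh>;
    }
    close $fh;

    die "No record found in $path\n" unless defined $line;
    chomp $line;

    # Sanity check: if the assumption does not hold, the program must die.
    die "Implausible record: '$line'\n" unless $line =~ $record_re;
    return $line;
}
```

It would be called as something like random_record('data.txt', qr/^[A-Z]{3}\d{3}$/), where both the file name and the pattern are made up. Note that with variable-length records this favors whichever record follows a long line, since a long line is a bigger target for the seek(); with fixed-length records every record is equally likely.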
The algorithm reference looks very nice. Thanks for sharing.