in reply to Building a new file by filtering a randomized old file on two fields
The fact that you have multi-gigabyte files might, in this case, actually work in your favor: it lets you use a simpleton algorithm that can nonetheless produce effective results.
I would suggest treating the files as random-access files containing fixed-length records. (The end-of-line indicator could be one or two bytes depending on the system. Just read one line and note the file position after you do so ... that's your record length in bytes. Wanna tie the file to an array instead? Feel free.™ It doesn't matter.)
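For what it's worth, here is a minimal sketch of that record-length trick, assuming a hypothetical file name 'big_input.txt' and genuinely fixed-length lines (the :raw layer keeps byte counts and seeks exact):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read exactly one line; tell() then reports how far the file pointer moved,
# which is the record length in bytes, end-of-line byte(s) included.
open my $fh, '<:raw', 'big_input.txt' or die "open: $!";
my $first_line = <$fh>;
my $rec_len    = tell $fh;
close $fh;
print "record length: $rec_len bytes\n";
```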
Let's get to work. Generate a list of random integers (modulo the size of the file in records) that is some "small x" times as long as the number of records you need. Sort the list and drop any duplicates (a hash makes the de-duping trivial). Duplicates are unlikely anyway, so losing them can be ignored; you will still have enough. Now, seek to and read each of those records, anticipating that this, by itself, will already contain an acceptable random sample. (Suspiciously apply a regular expression to each line that you know must pass, and die() if it does not, because a failure would indicate a bug ... and there's always one more bug.) This is your bucket of candidates.
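Here is one way that draw could look. A sketch only, assuming the same hypothetical 'big_input.txt', a target of 500 records, a "small x" of 4, and a stand-in /\S/ where your real "must always pass" pattern belongs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ( $need, $x ) = ( 500, 4 );    # sample size and "small x"

open my $fh, '<:raw', 'big_input.txt' or die "open: $!";
my $first   = <$fh>;                        # one line gives us the record length
my $rec_len = tell $fh;
my $n_recs  = int( ( -s $fh ) / $rec_len );

# Random record numbers, de-duplicated with a hash, then sorted.
my %seen;
$seen{ int rand $n_recs } = 1 for 1 .. $need * $x;
my @picks = sort { $a <=> $b } keys %seen;

my @candidates;
for my $rec (@picks) {
    seek $fh, $rec * $rec_len, 0 or die "seek: $!";
    read( $fh, my $line, $rec_len ) == $rec_len
        or die "short read at record $rec";
    # Sanity check: /\S/ stands in for whatever pattern must always match
    # your data; a failure here means a bug, so die loudly.
    $line =~ /\S/ or die "record $rec looks wrong: $line";
    push @candidates, $line;
}
close $fh;
```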
Now, walk through this bucket like a fisherman, throwing acceptable records into another bucket (the result), and skipping those which for any reason do not qualify. (Since the candidates were already picked at random from the file, you do not have to pick at random again from that list; "double random" is not "more random." Just walk the list from head to tail.) If you fill the result bucket before running out of candidates, you win. If you reach the end of the candidates without filling it, you lose ... just draw again and try once more. If you try and fail several times, die() in disgrace. But you probably won't. (If you do, bump up your "small x.")
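The filter pass might look something like this, continuing from the sketch above (it assumes @candidates and $need from there, and a hypothetical passes_filter() standing in for whatever your real two-field test is):

```perl
# Hypothetical two-field test: the field positions, the separator, and the
# criteria here are all placeholders for whatever your real requirements are.
sub passes_filter {
    my ($line) = @_;
    my @field = split /\t/, $line;           # assumes tab-separated fields
    return defined $field[3]
        && $field[1] eq 'SOME_VALUE'
        && $field[3] > 10;
}

my @result;
for my $candidate (@candidates) {
    next unless passes_filter($candidate);   # skip records that don't qualify
    push @result, $candidate;
    last if @result == $need;                # bucket full: "you win"
}

# Ran out of candidates before filling the bucket? Draw again (and, after
# several failed attempts, die in disgrace and bump up "small x").
die "only got ", scalar(@result), " of $need; try again\n" if @result < $need;
```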
You have, apparently, millions of records to draw from, and you need a comparatively small sample (500). If the draw is truly random, so that any record anywhere has an equal probability of being picked, then within a draw of, say, 2000 records from such a file you will nearly always find enough records that qualify, pretty much no matter how stringent your criteria may be. This is a simpleton algorithm that, I predict, will get the job done quite nicely, because you have a large population to draw from ... even for fairly large samples.
Note that a sequential walk through the file, "flipping a coin each time," is not the same, statistically. Any practical implementation stops as soon as its quota is filled, so records near the head of the file are far more likely to be chosen than records near the tail. It is important that the initial draw be properly pseudo-random, with each and every record having an equal probability of being selected for initial consideration. Hence: use random access.