in reply to Building a new file by filtering a randomized old file on two fields

The fact that you have multi-gigabyte files might in this case work in your favor, allowing you to use a simpleton algorithm that may nonetheless produce effective results.

I would suggest treating the files as random-access files containing fixed-length records.   (The end-of-line indicator could be one or two bytes depending on the system.   Just read one line and note the file position after you do so ... that’s your record length in bytes.   Wanna tie it to the file with Tie::File?   Feel free.™   Doesn’t matter.)
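
In Perl, that first step might look something like this (the filename is of course made up; adjust to taste):

    use strict;
    use warnings;

    # Measure the record length by reading one line and asking where we are.
    open my $fh, '<', 'bigfile.dat' or die "open: $!";   # hypothetical filename
    my $first_line  = <$fh>;                             # read exactly one record
    my $record_len  = tell $fh;                          # bytes per record, end-of-line included
    my $num_records = int( (-s $fh) / $record_len );
    print "record length: $record_len bytes; about $num_records records\n";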

Let’s get to work.   Generate a list of random integers (modulo the size of the file in records) that is some “small x” times as long as the number of records you need.   Sort the list, then drop any duplicates, which the sort has made adjacent.   Duplicates are unlikely anyway and can simply be discarded.   (You will still have enough.)   Now, seek to and read each of those records, anticipating that this, by itself, will contain an acceptable random sample.   (Suspiciously apply to each line a regular expression that you know should match, and die() if it does not, because that would indicate a bug ... and there’s always one more bug.)   Here is your bucket of candidates.
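
Here is a sketch of that pull, assuming fixed-length, tab-delimited records; the sanity-check pattern is only a placeholder for whatever you know must be true of every line:

    use strict;
    use warnings;

    sub draw_candidates {
        my ($how_many) = @_;

        open my $fh, '<', 'bigfile.dat' or die "open: $!";   # hypothetical filename
        <$fh>;                                               # one record, to measure it
        my $record_len  = tell $fh;
        my $num_records = int( (-s $fh) / $record_len );

        # The random pull: over-draw, sort, squeeze out the (unlikely) duplicates.
        my @indices = map { int rand $num_records } 1 .. $how_many;
        my %u;
        @indices = grep { !$u{$_}++ } sort { $a <=> $b } @indices;

        my @candidates;
        for my $i (@indices) {
            seek $fh, $i * $record_len, 0 or die "seek: $!";
            my $line = <$fh>;
            # Suspicion pays: if a record does not look the way we assume,
            # an assumption is broken and the program must die.
            defined $line && $line =~ /\t/ or die "record $i fails sanity check";
            push @candidates, $line;
        }
        return @candidates;
    }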

Now, walk through this bucket like a fisherman, throwing acceptable records into another bucket (the result ...) and skipping those which, for any reason, do not qualify.   (Since the list of candidates was picked at random from the file, you do not have to pick at random from it again:   “double random” is not “more random.”)   You can simply walk the randomly-pulled list from head to tail.   If you fill the results-bucket first, “you win.”   If you reach the end of the randomly-pulled records without filling the bucket, you lose ... just try again.   If you try and fail several times, die() in disgrace.   But you probably won’t.   (If you do, bump up your “small x.”)
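
And the fisherman pass, assuming the draw_candidates() sub sketched above; qualifies() stands in for the OP’s two-field test, so the column positions and values shown here are pure invention:

    use strict;
    use warnings;

    sub qualifies {
        my ($line) = @_;
        chomp $line;
        my ($type, $score) = ( split /\t/, $line )[ 1, 3 ];   # made-up columns
        return( defined $type and defined $score
                and $type eq 'WANTED' and $score >= 10 );     # made-up criteria
    }

    my $need      = 500;   # size of the final sample
    my $small_x   = 4;     # over-draw factor
    my $max_tries = 5;

    my @result;
    for my $try (1 .. $max_tries) {
        my @candidates = draw_candidates($need * $small_x);
        @result = grep { qualifies($_) } @candidates;   # head-to-tail, no re-randomizing
        last if @result >= $need;                       # bucket full: you win
        warn "try $try: only ", scalar(@result), " of $need qualified\n";
    }
    die "gave up after $max_tries tries ... bump up your small x\n" if @result < $need;
    @result = @result[ 0 .. $need - 1 ];                # trim to exactly $need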

You have, apparently, millions of records from which to randomly draw, and you need a comparatively small sample (500) from that source.   If yours is a truly random draw, where any record anywhere has an equal probability of being picked, then within a draw of, say, 2000 records from such a file you will nearly always find enough records that qualify, unless your criteria are so stringent that only a small fraction of the file passes them.   (If even one record in three qualifies, a draw of 2000 yields roughly 667 qualifiers on average, and coming up short of 500 is vanishingly unlikely.)   This is a simpleton algorithm that, I predict, will get the job done quite nicely, even for large samples, because you have a large population to draw from.

Note that a sequential walk through the file, “flipping a coin each time,” is not the same, statistically.   Any practical implementation would heavily favor the head of the file:   if you accept each record with some fixed probability and stop as soon as the quota is filled, the bulk of the file is never even considered.   It is important that the initial draw from the file be uniformly pseudo-random, with each and every record having an equal probability of being selected for initial consideration.   Hence, use random access.

Re^2: Building a new file by filtering a randomized old file on two fields
by Anonymous Monk on Apr 30, 2014 at 11:57 UTC
    fixed-length records

    Just a note, I believe the OP didn't say that the lines are fixed-length.

    Note that a sequential walk through the file, "flipping a coin each time," is not the same, statistically. Any actual implementation would heavily favor the head of the file.

    I believe this would help: http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

      That is a very important point, and it is one of the key reasons why I recommended testing each retrieved record against a regex ... if any assumption (such as this) upon which the logic depends is found not to hold, the program must die.   Many data-files of this kind are fixed-length.

      Another trick, which actually works just about as well, is to randomly select a byte position within the file, seek() to that position, read a line of data and throw it away, then read the next line and keep that.   Verifying, of course, that the second line appears plausible.   The seek() is presumed to have dropped us smack-dab into the middle of a record, and (for this trick to actually work ...) there must not be funky multi-byte character issues.   In other words, the program must be able to somehow “land on its feet” when it reads that second record.
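
      A sketch, if it helps (the filename is invented; the plausibility test is a placeholder for whatever your real records must look like):

          use strict;
          use warnings;

          open my $fh, '<', 'bigfile.dat' or die "open: $!";
          my $size = -s $fh;

          sub random_line {
              seek $fh, int(rand $size), 0 or die "seek: $!";
              <$fh>;                                       # the (probably partial) line we landed in
              my $keep = <$fh>;                            # the next, complete line
              return random_line() unless defined $keep;   # fell off the end: just try again
              # "Land on its feet": insist the line looks plausible before trusting it.
              $keep =~ /\t/ or die "implausible line after seek ... assumption broken";
              return $keep;
          }

          print random_line() for 1 .. 5;

      One nuance:   with variable-length lines, a record’s chance of being chosen is proportional to the length of the line before it, which is close enough to uniform so long as line lengths do not vary wildly.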

      The algorithm reference looks very nice.   Thanks for sharing.

        If the file uses UTF-8 encoding, then multibyte characters won't prevent finding the end of line. Or, more correctly, multibyte codepoints. Bytes below 128 only ever appear as single-byte (ASCII) codepoints; every byte of a multibyte codepoint, whether leading or continuation, is greater than 127. Thus a newline byte can never occur inside a multibyte codepoint, and you can always find the start of the next line even if your random position lands you in the middle of a codepoint. Perl will be able to find the end of line. Other multibyte encodings are likely to cause problems.

        (FYI, there are also multi-codepoint characters, such as combining sequences, but you only need to worry about those after you have found a whole line of text - if at all.)
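
        A tiny illustration (the filename is invented): read the file in raw mode, land anywhere, throw away the line you landed in, and decode the next one. Because the newline byte (0x0A) can never occur inside a multibyte UTF-8 codepoint, the readline always stops at a true line boundary; a strict decode() then confirms nothing was mangled.

            use strict;
            use warnings;
            use Encode qw(decode);

            open my $fh, '<:raw', 'bigfile_utf8.txt' or die "open: $!";
            my $size = -s $fh;
            seek $fh, int(rand $size), 0 or die "seek: $!";
            <$fh>;                                                    # discard the (possibly split) line
            defined( my $bytes = <$fh> ) or die "ran off the end of the file";
            my $line = decode( 'UTF-8', $bytes, Encode::FB_CROAK );   # dies on malformed bytes
            print $line;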