in reply to Re: Building a new file by filtering a randomized old file on two fields
in thread Building a new file by filtering a randomized old file on two fields

fixed-length records

Just a note, I believe the OP didn't say that the lines are fixed-length.

Note that a sequential walk through the file, "flipping a coin each time," is not statistically equivalent to drawing a true random sample. Any practical implementation that stops once it has collected enough records would heavily favor the head of the file.
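To make the bias concrete, here is a small simulation sketch. It assumes the walk keeps each record with a fixed probability and stops as soon as it has enough records; the function names and the parameter values (1000 records, 10 wanted, 5% coin) are made up for illustration and are not taken from the original question.

```python
import random

def coin_flip_sample(n_records, k, p, rng):
    """Walk records 0..n_records-1 in order, keep each one with
    probability p, and stop as soon as k records have been kept."""
    kept = []
    for i in range(n_records):
        if rng.random() < p:
            kept.append(i)
            if len(kept) == k:
                break
    return kept

def head_fraction(sampler, n_records=1000, trials=2000):
    """Fraction of all kept records that came from the first half."""
    head = total = 0
    for _ in range(trials):
        for i in sampler(n_records):
            total += 1
            head += i < n_records // 2
    return head / total

rng = random.Random(1)  # fixed seed so the run is repeatable

# Sequential coin flipping that stops after 10 records.
biased = head_fraction(lambda n: coin_flip_sample(n, 10, 0.05, rng))

# A uniform sample of 10 record numbers, for comparison.
uniform = head_fraction(lambda n: rng.sample(range(n), 10))

print(f"coin-flip walk: {biased:.2f} of picks from the first half")
print(f"uniform sample: {uniform:.2f} of picks from the first half")
```

Under these assumptions the coin-flip walk draws almost all of its picks from the first half of the file, while the uniform sample draws about half of them from there.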

I believe this would help: http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

Re^3: Building a new file by filtering a randomized old file on two fields
by locked_user sundialsvc4 (Abbot) on Apr 30, 2014 at 16:37 UTC

    That is a very important point, and it is one of the key reasons why I recommended testing each retrieved record against a regex: if any assumption (such as this one) upon which the logic depends is found not to hold, the program must die. Many data files of this kind are fixed-length.

    Another trick, which actually works just about as well, is to randomly select byte positions within the file, then seek() to each position, read a line of data and throw it away, then read the next line and keep that, verifying, of course, that the second line appears plausible. The seek() is presumed to have dropped us smack-dab into the middle of a record, and (for this trick to actually work ...) there must not be funky multibyte-character issues. In other words, the program must be able to somehow “land on its feet” when it reads the second record.
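A minimal sketch of the seek-and-discard trick, written in Python rather than Perl. The function name, the `max_tries` limit, and the caller-supplied `pattern` are assumptions made for illustration. Two caveats apply to this sketch: the first line of the file can never be returned, and a line that follows a long line is picked more often than one that follows a short line.

```python
import os
import random
import re

def random_line_by_seek(path, pattern, rng=random, max_tries=100):
    """Seek to a random byte offset, throw away the line we landed in,
    and return the next complete line (as bytes).

    `pattern` is a compiled bytes regex describing a plausible record.
    If the line does not match, an assumption has failed, so we die.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("file is empty")
    with open(path, "rb") as fh:
        for _ in range(max_tries):
            fh.seek(rng.randrange(size))
            fh.readline()          # discard the (probably partial) line
            line = fh.readline()   # the next complete line
            if not line:           # landed in the last line; try again
                continue
            if not pattern.fullmatch(line.rstrip(b"\r\n")):
                raise ValueError(f"implausible record: {line!r}")
            return line
    raise RuntimeError("no complete line found")
```

It would be called with something like `random_line_by_seek("data.txt", re.compile(rb"\d+,\w+"))`, where both the file name and the record pattern are placeholders.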

    The algorithm reference looks very nice.   Thanks for sharing.

      If the file uses UTF-8 encoding, then multibyte characters won't prevent finding the end of line. Or, more correctly, multibyte codepoints. A single-byte (ASCII) codepoint is always less than 128, and that includes the newline. Every byte of a multibyte codepoint, lead byte and continuation bytes alike, is always greater than 127, and continuation bytes (0x80 to 0xBF) can be told apart from lead bytes. Thus you can always find the next codepoint even if your random position lands you in the middle of a codepoint. Perl will be able to find the end of line. Other multibyte encodings are likely to cause problems.

      (FYI, there are multi-codepoint characters, but you only need to worry about that after you find a whole line of text, if at all.)
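The resynchronization property can be sketched as follows, in Python rather than Perl. In UTF-8 a continuation byte always has the bit pattern 10xxxxxx, so skipping such bytes reaches the start of the next codepoint, and the newline byte 0x0A never appears inside a multibyte sequence. The helper names here are hypothetical.

```python
def next_codepoint_start(data: bytes, pos: int) -> int:
    """Return the smallest index >= pos that is not a UTF-8
    continuation byte (bit pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

def rest_of_line(data: bytes, pos: int) -> str:
    """Decode from the next codepoint boundary up to the end of line."""
    start = next_codepoint_start(data, pos)
    end = data.find(b"\n", start)
    if end == -1:
        end = len(data)
    return data[start:end].decode("utf-8")

# "naive cafe" with accented i and e, followed by a second line.
data = "na\u00efve caf\u00e9\nsecond l\u00edne\n".encode("utf-8")

# Offset 3 is the second byte of the two-byte accented i.
print(rest_of_line(data, 3))
```

Starting from offset 3, which falls in the middle of a two-byte codepoint, the sketch skips one continuation byte and decodes the remainder of the first line cleanly.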