fixed-length records
Just a note, I believe the OP didn't say that the lines are fixed-length.
Note that a sequential walk through the file, "flipping a coin each time," is not the same, statistically. Any actual implementation would heavily favor the head of the file.
I believe this would help: http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
In reply to Re^2: Building a new file by filtering a randomized old file on two fields
by Anonymous Monk
in thread Building a new file by filtering a randomized old file on two fields
by mbp
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |