in thread Building a new file by filtering a randomized old file on two fields
fixed-length records
Just a note, I believe the OP didn't say that the lines are fixed-length.
Note that a sequential walk through the file, "flipping a coin each time," is not statistically the same as drawing a uniform random sample of a fixed size. An implementation that stops as soon as it has collected enough records would heavily favor the head of the file.
I believe this would help: http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
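For what it's worth, one common way to get a uniform sample in a single sequential pass, without needing fixed-length lines or knowing the line count up front, is reservoir sampling (Algorithm R). I am not claiming this is what the linked post describes. It is just a minimal Python sketch of the general idea, and the function name `reservoir_sample` and its parameters are made up for illustration.

```python
import random

def reservoir_sample(lines, k, rng=random):
    """Pick k items uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration: 'lines' can be any iterable,
    e.g. an open file handle; 'rng' is anything with a randint() method.
    """
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Item i (0-based) replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = line
    return sample
```

Usage would look something like `with open("old.txt") as fh: picked = reservoir_sample(fh, 1000)`, where the file name and sample size are placeholders. Every line ends up in the result with the same probability, so neither the head nor the tail of the file is favored. Filtering on the two fields would still have to happen either before or after this step.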
Re^3: Building a new file by filtering a randomized old file on two fields
by locked_user sundialsvc4 (Abbot) on Apr 30, 2014 at 16:37 UTC
by RonW (Parson) on Apr 30, 2014 at 17:07 UTC