Re^3: Building a new file by filtering a randomized old file on two fields

That is a very important point, and it is one of the key reasons why I recommended testing each retrieved record against a regex ... if any assumption (such as this) upon which the logic depends is found not to hold, the program must die. Many data-files of this kind are fixed-length.

Another trick, which actually works just about as well, is to randomly select byte positions within the file, then seek() to each position, read a line of data and throw it away, then read the next line and keep that, verifying, of course, that the second line appears plausible. The seek() is presumed to have dropped us smack-dab into the middle of a record, and (for this trick to actually work ...) there must not be funky multibyte-character issues. In other words, the program must be able to somehow “land on its feet” when it reads the second record.
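A minimal sketch of that seek-and-discard trick might look like the following. The names random_record and $looks_valid are made up for illustration, the wrap-around at end of file is my own assumption about how to handle a seek that lands in the last record, and the pattern you pass in has to match whatever your real record format is.

```perl
use strict;
use warnings;

# Hypothetical helper: seek to a random byte offset, throw away the
# (probably partial) line found there, and keep the next complete line.
# $looks_valid is an assumed regex describing a plausible record.
sub random_record {
    my ($fh, $size, $looks_valid) = @_;

    seek($fh, int(rand($size)), 0) or die "seek failed: $!";
    my $partial = <$fh>;    # likely a fragment of a record; discarded
    my $record  = <$fh>;    # the next complete line

    # Assumption: if we landed in the last record, wrap to the first one.
    unless (defined $record) {
        seek($fh, 0, 0) or die "seek failed: $!";
        $record = <$fh>;
    }
    die "file is empty\n" unless defined $record;

    chomp $record;
    # If the record does not look right, an assumption has failed: die.
    die "implausible record: '$record'\n" unless $record =~ $looks_valid;
    return $record;
}

# Usage (file name and pattern are placeholders):
#   open my $fh, '<:raw', 'old_file.txt' or die "open failed: $!";
#   my $size = -s $fh;
#   my $rec  = random_record($fh, $size, qr/^\w+,\w+$/);
```

Note that a record is picked whenever the seek lands anywhere in the record before it, so the selection is only uniform when all records have the same length.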

The algorithm reference looks very nice. Thanks for sharing.

Replies are listed 'Best First'.
Re^4: Building a new file by filtering a randomized old file on two fields
by RonW (Parson) on Apr 30, 2014 at 17:07 UTC

If the file uses UTF-8 encoding, then multibyte characters (or, more correctly, multibyte codepoints) won't prevent finding the end of line. Every byte of a multibyte codepoint is greater than 127: the lead byte is 192 or higher, and each continuation byte falls in the range 128 to 191. Single-byte (ASCII) codepoints, including the newline, are always less than 128. Thus a newline byte can never appear inside a multibyte codepoint, and you can always find the next codepoint even if your random position lands you in the middle of one. Perl will be able to find the end of line. Other multibyte encodings are likely to cause problems.

(FYI, there are multi-codepoint characters, but you only need to worry about that after you find a whole line of text, if at all.)
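
To make the byte ranges concrete, here is a rough sketch that classifies a byte by its role in UTF-8 and skips forward to the start of the next codepoint. The helper names utf8_byte_role and next_codepoint_start are invented for this example, and the code assumes the input is a raw byte string containing valid UTF-8.

```perl
use strict;
use warnings;

# Classify one byte value (0..255) by its role in UTF-8.
# Hypothetical helper; assumes the data is valid UTF-8.
sub utf8_byte_role {
    my ($byte) = @_;
    return 'ascii'        if $byte < 0x80;              # 0xxxxxxx
    return 'continuation' if ($byte & 0xC0) == 0x80;    # 10xxxxxx
    return 'lead';                                      # 11xxxxxx
}

# Return the first offset at or after $pos that is not a continuation
# byte, i.e. the start of the next codepoint (or the end of the string).
sub next_codepoint_start {
    my ($bytes, $pos) = @_;
    $pos++ while $pos < length($bytes)
              && (ord(substr($bytes, $pos, 1)) & 0xC0) == 0x80;
    return $pos;
}

# "a", then U+00E9 encoded as the two bytes C3 A9, then a newline.
my $line = "a\xC3\xA9\n";
print utf8_byte_role(0x0A), "\n";            # ascii
print utf8_byte_role(0xA9), "\n";            # continuation
print next_codepoint_start($line, 2), "\n";  # 3
```

Since the newline is an 'ascii' byte and every byte of a multibyte codepoint is not, a plain line read on the raw bytes finds the true end of line wherever the seek lands.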
