Category: Utility Scripts
Author/Contact Info
Description:

I frequently find the need to test programs on *real* data. But some of the datasets I have to deal with are rather ... large.

So this program lets me generate much smaller datasets to develop with.

Update: 20090306: I edited the first print statement to remove the trace information.

#!/usr/bin/perl -w
=head1 Simple Data Sampler

This program extracts a set of random lines from the file(s)
specified.


=head1 Usage:

sample_lines.pl [<Option>*] <InFile> <InFile>*

=head1 Options:

=over

=item -per-thousand <p>

Control number of lines in the sample--try to keep <p> lines
for every thousand seen. I<NOTE:> We don't try to enforce the
number of lines per thousand to this value, we just use it to
choose when to print a line (with possible contiguous lines).

=item -contiguous <c>

Keep <c> lines after each line selected to print (so we always
get at least <c> contiguous lines, default=0.
I<NOTE:> See -contig-max

=item -contig-max <cm>

Randomizes number of contiguous lines to print after selected
lines (see -contiguous).  Print between <c> and <cm> lines
after each selected line.

=item -minimum-skip <ms>

Minimum number of lines to skip between selected lines.

=back

The options are implemented very simply, as this isn't supposed
to be the "ultimate data sampler", just a simple way to get a
random set of lines from a text file.

=cut

use strict;
use warnings;
use Getopt::Long;

#####
# Handle command-line options
#####

my $contig_min;
my $contig_max;
my $minimum_skip;
my $per_thousand = 6.5;
my $result = GetOptions (
        "contiguous=i" => \$contig_min,
        "contig-max=i" => \$contig_max,
        "minimum-skip=i" => \$minimum_skip,
        "per-thousand=i" => \$per_thousand,
);
if (defined $contig_max) {
        $contig_min = 0 unless defined $contig_min;
        $contig_max = $contig_min if $contig_max < $contig_min;
}
if (defined $contig_min) {
        $contig_max = $contig_min if !defined $contig_max;
}

$per_thousand = $per_thousand / 1000.0;


#####
# Sample the data
#####

while (my $InFile = shift) {
        open INF, '<', $InFile or die "Can't open '$InFile': $!\n";
        while (<INF>) {
                next if $per_thousand < rand;
                print;
                if (defined $contig_min) {
                        print scalar <INF> for 0 .. $contig_min + rand
+($contig_max-$contig_min);
                }
                if (defined $minimum_skip) {
                        <INF> for (1 .. $minimum_skip);
                }
        }
        close INF or die "...closing '$InFile': $!\n";
}
Replies are listed 'Best First'.
Re: Data Sampler (Extract sample from large text file)
by baxy77bax (Deacon) on Mar 09, 2009 at 15:19 UTC
    i have the same problems constantly, but it never came to my mind to write something like this, i just cut and paste random peaces of a file into a test file and that is it.

    So big ++ for this script and effort! :)