Data Sampler (Extract sample from large text file)

Category:	Utility Scripts
Author/Contact Info
Description:	I frequently find the need to test programs on real data. But some of the datasets I have to deal with are rather ... large. So this program lets me generate much smaller datasets to develop with. Update: 20090306: I edited the first print statement to remove the trace information.
#!/usr/bin/perl -w =head1 Simple Data Sampler This program extracts a set of random lines from the file(s) specified. =head1 Usage: sample_lines.pl [<Option>] <InFile> <InFile> =head1 Options: =over =item -per-thousand <p> Control number of lines in the sample--try to keep <p> lines for every thousand seen. I<NOTE:> We don't try to enforce the number of lines per thousand to this value, we just use it to choose when to print a line (with possible contiguous lines). =item -contiguous <c> Keep <c> lines after each line selected to print (so we always get at least <c> contiguous lines, default=0. I<NOTE:> See -contig-max =item -contig-max <cm> Randomizes number of contiguous lines to print after selected lines (see -contiguous). Print between <c> and <cm> lines after each selected line. =item -minimum-skip <ms> Minimum number of lines to skip between selected lines. =back The options are implemented very simply, as this isn't supposed to be the "ultimate data sampler", just a simple way to get a random set of lines from a text file. =cut use strict; use warnings; use Getopt::Long; ##### # Handle command-line options ##### my $contig_min; my $contig_max; my $minimum_skip; my $per_thousand = 6.5; my $result = GetOptions ( "contiguous=i" => \$contig_min, "contig-max=i" => \$contig_max, "minimum-skip=i" => \$minimum_skip, "per-thousand=i" => \$per_thousand, ); if (defined $contig_max) { $contig_min = 0 unless defined $contig_min; $contig_max = $contig_min if $contig_max < $contig_min; } if (defined $contig_min) { $contig_max = $contig_min if !defined $contig_max; } $per_thousand = $per_thousand / 1000.0; ##### # Sample the data ##### while (my $InFile = shift) { open INF, '<', $InFile or die "Can't open '$InFile': $!\n"; while (<INF>) { next if $per_thousand < rand; print; if (defined $contig_min) { print scalar <INF> for 0 .. $contig_min + rand +($contig_max-$contig_min); } if (defined $minimum_skip) { <INF> for (1 .. $minimum_skip); } } close INF or die "...closing '$InFile': $!\n"; }

Comment on Data Sampler (Extract sample from large text file) Download Code

Replies are listed 'Best First'.
Re: Data Sampler (Extract sample from large text file) by baxy77bax (Deacon) on Mar 09, 2009 at 15:19 UTC
i have the same problems constantly, but it never came to my mind to write something like this, i just cut and paste random peaces of a file into a test file and that is it. So big ++ for this script and effort! :)	[reply]