#!/usr/bin/perl -w

=head1 Simple Data Sampler

This program extracts a set of random lines from the file(s) specified.

=head1 Usage:

    sample_lines.pl [<Option>*] <InFile> <InFile>*

=head1 Options:

=over

=item -per-thousand <p>

Control the number of lines in the sample: try to keep <p> lines for every
thousand seen. <p> may be fractional (default=6.5).

I<NOTE:> We don't try to enforce the number of lines per thousand to this
value, we just use it to choose when to print a line (with possible
contiguous lines).

=item -contiguous <c>

Keep <c> lines after each line selected to print (so we always get at least
<c> contiguous lines), default=0.

I<NOTE:> See -contig-max

=item -contig-max <cm>

Randomizes the number of contiguous lines to print after selected lines (see
-contiguous). Print between <c> and <cm> lines after each selected line.

=item -minimum-skip <ms>

Minimum number of lines to skip between selected lines.

=back

The options are implemented very simply, as this isn't supposed to be the
"ultimate data sampler", just a simple way to get a random set of lines from
a text file.

=cut

use strict;
use warnings;
use Getopt::Long;

#####
# Handle command-line options
#####
my $contig_min;
my $contig_max;
my $minimum_skip;
my $per_thousand = 6.5;

# -per-thousand is declared "=f" so fractional rates (like the default 6.5)
# are accepted on the command line.
GetOptions(
    "contiguous=i"   => \$contig_min,
    "contig-max=i"   => \$contig_max,
    "minimum-skip=i" => \$minimum_skip,
    "per-thousand=f" => \$per_thousand,
) or die "Usage: sample_lines.pl [<Option>*] <InFile> <InFile>*\n";

# Make sure both ends of the contiguous range are set and ordered
if (defined $contig_max) {
    $contig_min = 0 unless defined $contig_min;
    $contig_max = $contig_min if $contig_max < $contig_min;
}
if (defined $contig_min) {
    $contig_max = $contig_min if !defined $contig_max;
}

# Convert "lines per thousand" into a per-line probability
$per_thousand = $per_thousand / 1000.0;

#####
# Sample the data
#####
while (defined(my $InFile = shift)) {
    open my $inf, '<', $InFile or die "Can't open '$InFile': $!\n";
    while (<$inf>) {
        # Select each line with probability $per_thousand
        next if rand() > $per_thousand;
        print;

        if (defined $contig_min) {
            # Print between $contig_min and $contig_max following lines
            my $extra = $contig_min + int rand($contig_max - $contig_min + 1);
            for (1 .. $extra) {
                my $line = <$inf>;
                last unless defined $line;    # ran out of input
                print $line;
            }
        }

        if (defined $minimum_skip) {
            # Discard lines so selections are at least $minimum_skip apart
            for (1 .. $minimum_skip) {
                last unless defined(my $skipped = <$inf>);
            }
        }
    }
    close $inf or die "Error closing '$InFile': $!\n";
}

In reply to Data Sampler (Extract sample from large text file) by roboticus
