
Taking BrowserUK's advice a little further, I think that relative to the OP code, you can replace lines 21-80 with the following (28 lines):

open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n";
my $minlength = 1<<20;
my %distrbtn;

while(<IN1>) {
    next if />/;
    chomp;
    my $len = length();
    $distrbtn{$len}++;
    $minlength = $len if ( $minlength > $len );
}
close IN1;
$minlength -= 3;

open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n";
my $header;
my @source;
my %source_lengths;

while(<IN2>) {
    chomp;
    if ( />/ ) {
        $header = $_;
    }
    elsif ( length() >= $minlength ) {
        push @source, $header;
        push @source, $_;
        push @{$source_lengths{length()}}, $#source;
    }
}
close IN2;

(updated to fix how header lines are skipped in the first input file)

That handles three issues: First, your "%distrbtn_hash" (declared in my code here as "%distrbtn") can be built directly while reading the first input file - this eliminates four redundant arrays and a lot of unnecessary code.

Second, (I don't know whether this makes any difference regarding your second input file, but) the "@source" array doesn't need to store any strings that are shorter than the shortest line (minus 3) found in your first input file. (I'm assuming that you really want to keep the header strings with their associated data strings from file2.)

Third, since your "EXTRACT" block seems to be trying to locate source strings of particular lengths for each of the string lengths found in the first input, it will make things a lot easier if you index the source strings according to their lengths - that is what the "%source_lengths" hash is doing.

That way, as you loop over the lengths found in the first input file, you know exactly how many entries from the second file have a suitable length, and can choose from that set of sources randomly, and know exactly where to find each entry in the @source array according to its length.
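
As a small, self-contained illustration of that lookup (untested against your real data): the sample records and the pick_sources helper below are made up for the example, and List::Util's shuffle is only a stand-in for whatever random selection you actually want.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Made-up sample data laid out like @source above:
# header, string, header, string, ...
my @source = (
    '>seq1', 'ACGTACGT',
    '>seq2', 'ACGTAC',
    '>seq3', 'TTGGCCAA',
    '>seq4', 'GATTACA',
);

# Index the data strings (odd offsets) by length, as %source_lengths does.
my %source_lengths;
for ( my $i = 1; $i < @source; $i += 2 ) {
    push @{ $source_lengths{ length( $source[$i] ) } }, $i;
}

# Hypothetical helper: pick up to $want random entries of exactly $len chars.
# Returns a list of [ header, string ] pairs.
sub pick_sources {
    my ( $len, $want ) = @_;
    my @offsets = shuffle @{ $source_lengths{$len} || [] };
    splice @offsets, $want if @offsets > $want;    # drop the surplus
    return map { [ $source[ $_ - 1 ], $source[$_] ] } @offsets;
}

for my $pair ( pick_sources( 8, 1 ) ) {
    my ( $header, $string ) = @$pair;
    print "$header\n$string\n";
}
```

Because the hash stores offsets into @source rather than copies of the strings, the header for any chosen string is always the element just before it.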

I don't understand your selection criteria well enough to finish that part of the code, but it might start with something like this:

my $max_source_length = ( sort {$b<=>$a} keys %source_lengths )[0];

for my $key ( sort {$a<=>$b} keys %distrbtn ) {
    my $size = $key - 3;
    my $freq = $distrbtn{$key};

# find the first set of source strings of equal or greater size:

    my $source_key = $key;
    while  ( $source_key <= $max_source_length and
             not exists( $source_lengths{$source_key} )) {
        $source_key++;
    }
    if ( $source_key > $max_source_length ) {
        die "We can't do this: strings from $input2 aren't long enough";
    }
    ...
}

The point about that last "if" condition is that it's not clear to me that the "input2" data will necessarily satisfy all the selection criteria.

UPDATE: In case you're confused or unsure about using a hash of arrays, the next snippet (which could be placed after the last "if" condition above) might help clarify:

    my @usable_sources = @{$source_lengths{$source_key}};
    printf "for an input1 string of length %d, we can choose from %d input2 strings\n", $size, scalar @usable_sources;

# in case we want to add more sources that happen to be longer:
    $source_key++;
    while ( $source_key <= $max_source_length and
            scalar @usable_sources < $freq ) {
        push @usable_sources, @{$source_lengths{$source_key}}
            if exists( $source_lengths{$source_key} );
        $source_key++;    # move on to the next length, or the loop never ends
    }
    if ( $freq > @usable_sources ) {
        warn "We ran short of desired frequency for length $size\n";
    }
    elsif ( $freq < @usable_sources ) {
        # do something to randomly remove items from @usable_sources...
    }
    for my $offset ( @usable_sources ) {
        my $header = $source[$offset-1];
        my $string = $source[$offset];
        # do whatever...
    }

Can't really say anything more unless you can show us some sample data from each input, with some desired outputs (that is what GrandFather was asking for - not command-line syntax). It would also be helpful to have some statistics about the "really large file": if page faults are a problem (because the data and index info are larger than available RAM), there are other ways to index into a large data file without using huge in-memory hashes and arrays (and hashes of arrays).
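
One such approach, sketched here on the assumption that the second file holds simple two-line records (a ">" header line followed by one data line): keep only the byte offset of each header in memory, grouped by the length of its data string, and use seek to pull a record off disk when you need it. The temp file, the sample records and the fetch_record helper are invented so the example runs on its own; tell and seek are core Perl.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented sample standing in for the large second input file.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} ">seq1\nACGTACGT\n>seq2\nACGTAC\n>seq3\nTTGGCCAA\n";
close $tmp;

open( my $in, '<', $path ) or die "Can't read $path: $!\n";

# Map each string length to the byte offsets of the matching header lines.
my %offsets_by_length;
my $header_pos;
while (1) {
    my $pos  = tell($in);    # offset of the line about to be read
    my $line = <$in>;
    last unless defined $line;
    chomp $line;
    if ( $line =~ /^>/ ) {
        $header_pos = $pos;
    }
    elsif ( defined $header_pos ) {
        push @{ $offsets_by_length{ length($line) } }, $header_pos;
    }
}

# Hypothetical helper: fetch one record (header plus string) from disk.
sub fetch_record {
    my ($offset) = @_;
    seek( $in, $offset, 0 ) or die "seek failed: $!\n";
    chomp( my $header = <$in> );
    chomp( my $string = <$in> );
    return ( $header, $string );
}

for my $offset ( @{ $offsets_by_length{8} } ) {
    my ( $header, $string ) = fetch_record($offset);
    print "$header $string\n";
}
```

This still keeps one number per record in memory, so it only helps if the strings themselves are what is eating the RAM.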

(please note that none of the above is tested; also: updated to spell BrowserUK's name correctly)

(And one last update to add a missing close-curly in the last snippet - which is still untested.)

In reply to Re: Speeding up stalled script by graff
in thread Speeding up stalled script by onlyIDleft