(updated to fix how header lines are skipped in the first input file)

    open(IN1, '<', $input1) or die "Can't read source file $input1 : $!\n";
    my $minlength = 1<<20;
    my %distrbtn;
    while(<IN1>) {
        next if />/;
        chomp;
        my $len = length();
        $distrbtn{$len}++;
        $minlength = $len if ( $minlength > $len );
    }
    close IN1;
    $minlength -= 3;

    open(IN2, '<', $input2) or die "Can't read source file $input2 : $!\n";
    my $header;
    my @source;
    my %source_lengths;
    while(<IN2>) {
        chomp;
        if ( />/ ) {
            $header = $_;
        }
        elsif ( length() >= $minlength ) {
            push @source, $header;
            push @source, $_;
            push @{$source_lengths{length()}}, $#source;
        }
    }
    close IN2;

That handles three issues: First, your "%distrbtn_hash" (declared in my code here as "%distrbtn") can be built directly while reading the first input file - this eliminates four redundant arrays and a lot of unnecessary code.
Second, (I don't know whether this makes any difference regarding your second input file, but) the "@source" array doesn't need to store any strings that are shorter than the shortest line (minus 3) found in your first input file. (I'm assuming that you really want to keep the header strings with their associated data strings from file2.)
Third, since your "EXTRACT" block seems to be trying to locate source strings of particular lengths for each of the string lengths found in the first input, it will make things a lot easier if you index the source strings according to their lengths - that is what the "%source_lengths" hash is doing.
That way, as you loop over the lengths found in the first input file, you know exactly how many entries from the second file have a suitable length, and can choose from that set of sources randomly, and know exactly where to find each entry in the @source array according to its length.
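To make the indexing idea concrete, here is a minimal, self-contained sketch using made-up data. The headers and strings are invented for illustration, and "pick_random" is a hypothetical helper, not something from your script. It builds the same kind of @source / %source_lengths structures as the code above, then picks entries of a given length at random with "shuffle" from the core List::Util module:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Invented sample records: each header line is followed by its data line.
my @lines = (
    '>seq1', 'ACGTACGT',
    '>seq2', 'ACGT',
    '>seq3', 'TTGGCCAA',
    '>seq4', 'GGGGCCCCAA',
);

my ( $header, @source, %source_lengths );
for (@lines) {
    if ( /^>/ ) {
        $header = $_;
    }
    else {
        push @source, $header, $_;
        # $#source is now the index of the data string;
        # its header sits at index - 1
        push @{ $source_lengths{ length $_ } }, $#source;
    }
}

# Hypothetical helper: return up to $want randomly chosen [header, string]
# pairs whose data string has exactly length $len.
sub pick_random {
    my ( $len, $want ) = @_;
    my @offsets = shuffle @{ $source_lengths{$len} || [] };
    splice @offsets, $want if @offsets > $want;
    return map { [ $source[ $_ - 1 ], $source[$_] ] } @offsets;
}

my @picked = pick_random( 8, 1 );
print "$picked[0][0]\n$picked[0][1]\n";
```

Shuffling the list of offsets and truncating it is probably the simplest way to get a random subset without repeats; whether that matches your actual selection criteria is an assumption on my part.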
I don't understand your selection criteria well enough to finish that part of the code, but it might start with something like this:

    my $max_source_length = ( sort {$b<=>$a} keys %source_lengths )[0];
    for my $key ( sort {$a<=>$b} keys %distrbtn ) {
        my $size = $key - 3;
        my $freq = $distrbtn{$key};
        # find the first set of source strings of equal or greater size:
        my $source_key = $key;
        while ( $source_key <= $max_source_length and not exists( $source_lengths{$source_key} )) {
            $source_key++;
        }
        if ( $source_key > $max_source_length ) {
            die "We can't do this: strings from $input2 aren't long enough";
        }
        ...
    }

The point about that last "if" condition is that it's not clear to me that the "input2" data will necessarily satisfy all the selection criteria.
UPDATE: In case you're confused or unsure about using a hash of arrays, the next snippet (which could be placed after the last "if" condition above) might help clarify:

    my @usable_sources = @{$source_lengths{$source_key}};
    printf "for an input1 string of length %d, we can choose from %d input2 strings\n",
        $size, scalar @usable_sources;
    # in case we want to add more sources that happen to be longer:
    $source_key++;
    while ( $source_key <= $max_source_length and scalar @usable_sources < $freq ) {
        push @usable_sources, @{$source_lengths{$source_key}}
            if exists($source_lengths{$source_key});
        $source_key++;  # step to the next length, otherwise this loop never ends
    }
    if ( $freq > @usable_sources ) {
        warn "We ran short of desired frequency for length $size\n";
    }
    elsif ( $freq < @usable_sources ) {
        # do something to randomly remove items from @usable_sources...
    }
    for my $offset ( @usable_sources ) {
        my $header = $source[$offset-1];
        my $string = $source[$offset];
        # do whatever...
    }

Can't really say anything more unless you can show us some sample data from each input, with some desired outputs (that is what GrandFather was asking for - not command-line syntax). It would also be helpful to have some statistics about the "really large file": if page faults are a problem (because the data and index info are larger than available RAM), there are other ways to index into a large data file without using huge in-memory hashes and arrays (and hashes of arrays).
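As one illustration of indexing a large file without holding its strings in memory (a sketch only, not tested against your data): record just the byte offset of each header line, keyed by the length of the data line that follows it, and "seek" back to fetch a record when it is actually needed. The file contents below are invented and written to a temporary file so the example is self-contained; "fetch_record" is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Invented sample data standing in for the large second input file.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} ">seq1\nACGTACGT\n>seq2\nACGT\n>seq3\nTTGGCCAA\n";
close $tmp;

open( my $in, '<', $path ) or die "Can't read $path : $!\n";

# Pass 1: keep only the byte offsets of header lines, keyed by data length.
my %offsets_by_length;
my $header_pos;
while (1) {
    my $pos  = tell $in;      # position before reading the next line
    my $line = <$in>;
    last unless defined $line;
    chomp $line;
    if ( $line =~ /^>/ ) {
        $header_pos = $pos;
    }
    else {
        push @{ $offsets_by_length{ length $line } }, $header_pos;
    }
}

# Hypothetical helper: jump to a stored offset and read one header/data pair.
sub fetch_record {
    my ($pos) = @_;
    seek( $in, $pos, 0 ) or die "seek failed: $!\n";
    chomp( my $header = <$in> );
    chomp( my $string = <$in> );
    return ( $header, $string );
}

# Fetch the second record whose data line is 8 characters long.
my ( $h, $s ) = fetch_record( $offsets_by_length{8}[1] );
print "$h\n$s\n";
```

With this layout, memory use grows with the number of records rather than with the total size of the strings, at the cost of one seek and two reads per record fetched.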
(please note that none of the above is tested; also: updated to spell BrowserUK's name correctly)
(And one last update to add a missing close-curly in the last snippet - which is still untested.)
In reply to "Re: Speeding up stalled script" by graff, in thread "Speeding up stalled script" by onlyIDleft