in reply to Re^5: Refactoring a large script in thread Refactoring a large script
Re: $matchesfasta vs. $matches{$fasta}
That was a copy and paste problem on my part. Along with the 'enigmatic' $h. I was originally using $h as the variable that became $fasta_id. But I replaced it because I realized no one else would have a clue what it was for. So, that's 2 mysteries resolved.
As for the sorting issue you brought up next - I had no idea that sort was so memory intensive. Thanks for clearing that up for me.
It's not easy to tell in the absence of test data how big this hash is, or how many times these loops iterate, but given the sizes of the data files mentioned in the OP, and the timing information you've supplied above, it appears that they may be substantial.
Yeah, these hashes get big. The input to this sub is a HoAoA. In a typical run of the program, that HoAoA has about 22,000 key/value pairs; each upper-level array has only around 3-5 elements, but the lower-level arrays have several hundred elements each. So, safe to say the hash is rather large.
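To give you a feel for the shape of what goes in (made-up keys and numbers, obviously, and far smaller than the real thing):

use strict;
use warnings;

# Made-up keys and values, just to show the shape of the HoAoA this sub receives.
my %HoAoA = (
    'fasta_id_00001' => [            # ~22,000 keys like this one
        [ 101, 250, 433 ],           # 3-5 upper-level arrays per key,
        [ 512, 733, 901 ],           # each really holding several hundred elements
    ],
    'fasta_id_00002' => [
        [ 88,  149 ],
        [ 300, 417, 620 ],
        [ 999, 1024 ],
    ],
);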
The next thing I looked at was the section of code where you 'band pass' filter the matches into a temporary array.
Thanks for the grep idea! I hadn't even thought of that. As for the caveat that it may not beat the early loop exit, I'll try them both and benchmark them, as you suggested. That said, I think many of your other changes may help enough that it won't be necessary, but we'll see.
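For what it's worth, this is roughly how I plan to run the comparison (throwaway data, thresholds, and names here, not the real ones from the sub):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Throwaway, pre-sorted data and band limits, just to compare the two approaches.
my @matches = sort { $a <=> $b } map { int rand 10_000 } 1 .. 10_000;
my ( $lower, $upper ) = ( 2_000, 8_000 );

cmpthese( -2, {
    # grep always scans the whole list
    grep_filter => sub {
        my @in_band = grep { $_ >= $lower && $_ <= $upper } @matches;
        return scalar @in_band;
    },
    # an explicit loop can bail out early because the data is sorted
    early_exit => sub {
        my @in_band;
        for my $m (@matches) {
            next if $m < $lower;
            last if $m > $upper;
            push @in_band, $m;
        }
        return scalar @in_band;
    },
} );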
Finally (for now), you use this construct:

%{ $sets{ $fasta_id }[ $setscounter ] } = ();

to empty arrays in several places. Whilst there is nothing wrong with this, I think it is clearer, and may be slightly quicker, to use

undef $sets{ $fasta_id }[ $setscounter ];
Actually, I was deliberately trying to avoid the undef, specifically because I'm still slightly confused about how to remove array elements that are undef. When I was originally using the undef line you suggested, Data::Dumper showed that some arrays consisted only of undef, a significant number of others had undef as their last element, and so on. The roundabout way I settled on was the best I could jerry-rig with my minimal understanding of exists/undef at the time. If you (or anyone else) can think of an efficient way to excise any undef elements from an array, and also to remove elements that are themselves references to an array containing only a single undef, I would love it.
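In case it helps frame the question, this is the kind of cleanup I'm after (just a sketch with a throwaway array, not code from the actual script):

use strict;
use warnings;

my @mixed = ( 1, undef, [ undef ], [ 2, 3 ], undef, 'keep' );

# Drop plain undef elements, and drop elements that are references
# to an array holding nothing but a single undef.
my @cleaned = grep {
    defined $_
        && !( ref $_ eq 'ARRAY'
              && @$_ == 1
              && !defined $_->[0] )
} @mixed;

# @cleaned is now ( 1, [ 2, 3 ], 'keep' )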
Finally, the global variables that looked suspect. I mentioned $h earlier. $num is a user-defined number that winds up being the number of elements in the upper-level arrays of the HoAoA; in the output of the sub, an HoAoHoA, it is the number of key/value pairs in each subhash. The @fastarray array is essentially identical to the @SortedKeys array you created for me. The reason I kept those two variables global is that essentially every one of my 30-odd subs uses them as-is, so I felt it would be better to keep them global than to pass and dereference them in every sub. Now, considering that @fastarray starts off with around 22,000 elements and eventually gets reduced to about 2,000, I don't know which would be faster: passing it constantly, or keeping it global. Thoughts?
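If I do go the passing route, my understanding is that passing a reference hands the sub a single scalar rather than copying all 22,000 elements, so it would look something like this (hypothetical sub name, stand-in data):

use strict;
use warnings;

# Hypothetical sub name, just to show the calling pattern.
sub process_ids {
    my ( $ids_ref, $num ) = @_;          # only the reference is copied here,
    for my $fasta_id ( @{$ids_ref} ) {   # not the 22,000 elements themselves
        # ... per-ID work would go here ...
    }
    return;
}

my @fastarray = ( 'id_0001', 'id_0002' );    # stand-in for the real array
my $num       = 3;
process_ids( \@fastarray, $num );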
Anyway, this response is getting rambling, as I wrote it in chunks over the course of a day, so I'll wrap it up here. Just let me say again: thank you so much for your help!
Matt
P.S. - I am replacing the huge block of code on my scratchpad with a heavily commented version of this particular sub, though without the changes you have mentioned here. I'll be trying those out tomorrow.