in reply to Optimize my foreach loop / code

You have a single loop, so the amount of work grows linearly with the number of filenames you run through. Your lookups use hash keys, which means that as your data set grows the lookup times won't grow significantly. You are pushing onto a couple of arrays, which won't cause any significant growing pains for a data set of 36,000 file names, assuming you have a computer made this decade.

So about all that leaves, within the code you showed, is how much time it takes to run the regexes on the filenames. In the case of your first regex, there's no need to capture, and no need for quantifiers: m|.+/.+| matches exactly the same strings as m|./.|. The reason is that if "one or more" characters on either side of a slash match, then one character on either side of the slash will also match, and vice versa. So that regex leaves a small amount of room for optimization.
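
For illustration, here's the kind of quick Benchmark comparison I have in mind. The capturing pattern is only a stand-in for a regex like yours, and the sample paths are invented:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Made-up sample paths standing in for the real file list.
    my @names = map { "dir$_/subdir/file$_.img" } 1 .. 10_000;

    cmpthese( -1, {
        capturing => sub { my $n = 0; for (@names) { $n++ if m|(.+)/(.+)| } },
        minimal   => sub { my $n = 0; for (@names) { $n++ if m|./.| } },
    } );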

Are you convinced, through actual testing, that this is the segment of code that takes all the time? Profiling would tell you, but even the minimal change of adding a time() call on either side of the loop would tell you. It's possible that we're looking at the wrong code here.
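
A minimal sketch of what I mean, assuming the loop in question is the one over your filenames (a profiler such as Devel::NYTProf would give a far more detailed picture, but even this would confirm where the time goes):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my @filenames = ( 'dir1/file1.img', 'dir2/file2.img' );  # stand-in for your real list

    my $start = time;
    for my $name (@filenames) {
        # ... your existing hash lookups, regex matches, and pushes go here ...
    }
    printf "Loop took %.3f seconds\n", time - $start;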

If it turns out that this really is your bottleneck, see if you can further reduce the number of files you have to iterate over. Maybe your listing doesn't need to be quite as inclusive.


Dave

Re^2: Optimize my foreach loop / code
by rmocster (Novice) on Aug 25, 2016 at 22:00 UTC

    Thanks for your reply.

    It is this loop that takes most of the time. Unfortunately, the input image array can grow to as many as 400k elements (files). I am hoping there is a better way to improve the hash assignment and/or the regexes.

    Best!

      As I tried to illustrate, you will not find optimizations for this existing work-flow that yield an order-of-magnitude improvement. It is unlikely that you could even cut the time in half.

      What if you built an index from each file as it comes in, rather than processing a huge batch of files all at once? Gather whatever meta-data you need on each file as it arrives, and push that data into a database you can query as needed. This spreads the computational workload over a longer period of time and makes tallying the results very fast.
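
      A rough sketch of that idea, using SQLite through DBI (requires DBD::SQLite; the table layout and the meta-data captured here are just placeholders for whatever your tallying actually needs):

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=images.db', '', '',
                                  { RaiseError => 1, AutoCommit => 1 } );

          $dbh->do(q{
              CREATE TABLE IF NOT EXISTS images (
                  path     TEXT PRIMARY KEY,
                  dir      TEXT,
                  added_at INTEGER
              )
          });

          # Call this once per file as it arrives, instead of scanning 400k names later.
          sub index_file {
              my ($path) = @_;
              my ($dir)  = $path =~ m|^(.+)/|;   # whatever meta-data your reporting needs
              $dbh->do( 'INSERT OR REPLACE INTO images (path, dir, added_at) VALUES (?, ?, ?)',
                        undef, $path, $dir, time );
          }

          # Tallying later becomes a quick query instead of a pass over the whole list.
          my ($count) = $dbh->selectrow_array(
              'SELECT COUNT(*) FROM images WHERE dir = ?', undef, 'some/dir'
          );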


      Dave