It may help to record Time::HiRes timestamps before the glob, the first foreach, the @permCounts assignment, and the second foreach, then run the program against datasets of different sizes and see which stage slows down the most.
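A minimal sketch of that kind of instrumentation follows. The helper name timed and both workloads are stand-ins I made up for illustration, not your actual glob and loops:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # replaces time() with a sub-second version

# Hypothetical helper: run a code block, report its wall-clock time
# on STDERR, and hand back the elapsed time plus the block's results.
sub timed {
    my ( $label, $code ) = @_;
    my $start   = time;
    my @result  = $code->();
    my $elapsed = time - $start;
    printf STDERR "%-20s %.6f s\n", $label, $elapsed;
    return ( $elapsed, @result );
}

# Stand-in workloads; swap in the real stages of the program.
my ( $t_build, @perms ) =
    timed( 'build @perms', sub { glob( '{a,b,c,d}' x 2 ) } );
my ( $t_zero, @permCounts ) =
    timed( 'zero @permCounts', sub { (0) x @perms } );
```

Running this with a larger alphabet or a higher ngram order in the stand-ins should show which stage's time grows fastest.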

Even without any actual profiling data, I do see a candidate for optimization. How about treating each of your @observed strings as a base-(sizeof $alph) "number" and generating your @hits array from that? You are currently scanning the @perms list scalar(@observed) times, an O(N*m)¹ algorithm (where N is the size of the @perms list and m is the size of @observed). If you can replace the scan (the first foreach loop) with a function that converts the $obs value to an index directly, each lookup becomes O(1)¹ and the whole pass O(m), which, given the right conditions, can be faster than scanning the @perms list every time.

In your example, the @observed values are two-digit, base-26 numbers. A basic algorithm for the conversion would be something like:

use Test::More q(no_plan);

# ngram2number
#
# Convert an ngram into a number given a
# hashref containing the alphabet conversion,
# and the ngram to convert.
#
# An area for improvement would be to cache the
# $base ** ( $position++ ) results.
#
# Untested (ok, now it is), no warranty, blah blah blah
#
# Update: error in code - added reverse to correct
# Update: Multiply current digit, not add; scalar keys %alphabet
# Update: Added testing commands
#
sub ngram2number {
    my ( $alphabet, $ngram ) = @_;
    my $results  = 0;
    my $position = 0;
    my $base     = scalar( keys %$alphabet );
    for my $c ( reverse split( //, $ngram ) ) {
        $results += ( $base ** ( $position++ ) ) * $alphabet->{ $c };
    }
    $results;
}

my $hex = { map { ( $_ => hex( $_ ) ) } ( '0'..'9', 'a'..'f' ) };
is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
    for ( '0'..'9', 'a'..'f' );
is( ngram2number( $hex, $_ ), hex( $_ ), "$_ in hex matches" )
    for ( glob '{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f}'x2 );

Depending on the size of the alphabet and the length of the ngram, the resulting number may need to be a bignum or a float.
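For a small case such as the two-letter, 26-symbol example, native integers are ample and the converted number can index the counts array directly. The sketch below assumes a lowercase a..z alphabet and made-up @observed data. It also swaps in Horner's rule for the conversion, which avoids recomputing $base ** $position for every character:

```perl
use strict;
use warnings;

# Same idea as ngram2number above, repeated so the sketch is
# self-contained, but written with Horner's rule: shift the running
# total left by one digit, then add the next digit's value.
sub ngram2number {
    my ( $alphabet, $ngram ) = @_;
    my $base   = scalar keys %$alphabet;
    my $result = 0;
    $result = $result * $base + $alphabet->{$_} for split //, $ngram;
    return $result;
}

# Assumed alphabet: lowercase a..z mapped to 0..25.
my @letters  = ( 'a' .. 'z' );
my %alphabet = map { $letters[$_] => $_ } 0 .. $#letters;

my $order      = 2;
my @observed   = qw( ab zz ab aa );    # stand-in data
my @permCounts = (0) x ( scalar(@letters) ** $order );

# One O(1) index computation per observed ngram; no scan of @perms.
$permCounts[ ngram2number( \%alphabet, $_ ) ]++ for @observed;
```

Here "aa" lands in slot 0, "ab" in slot 1, and "zz" in slot 675, the same order in which a glob over the alphabet would enumerate them.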

One other area (as mentioned above) for optimization could be the assignment of the zeros to the @permCounts array, and even the use of the @perms array at all. Depending on the size of the alphabet and the order of the ngram, these arrays could grow large enough to start swapping (or even exhaust memory), which will be a performance killer. Since the @perms (and thus the @permCounts) array grows exponentially, in the form sizeof( alphabet ) ** $ngramOrder, your memory use grows very fast when using your presented method of calculating the answer.
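One way to sidestep that growth, offered as an alternative sketch rather than a drop-in for your code, is to count into a hash keyed by the ngram itself, so that storage is proportional to the number of distinct ngrams observed instead of to every possible permutation. The @observed data here is made up:

```perl
use strict;
use warnings;

my @observed = qw( ab zz ab aa );    # stand-in data

# Count only the ngrams that actually occur; nothing is allocated
# for permutations that never appear.
my %count;
$count{$_}++ for @observed;

# An unseen ngram reads as zero without ever being stored.
for my $ngram (qw( ab zz qq )) {
    printf "%s => %d\n", $ngram, $count{$ngram} || 0;
}
```

The trade-off is that the zero entries are implicit, so any later step that expects one slot per permutation would need to supply the default itself.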

Footnotes:

1 - I think that I have this correct, but I would welcome being checked on it. I treated the cost of iterating across each character of the ngram for the string comparison, and the cost of iterating across each character in the ngram2number function, as effectively cancelling each other out, to within a constant multiplier.

--MidLifeXis

In reply to Re: counting instances of one array in another array by MidLifeXis
in thread counting instances of one array in another array by jsmagnuson
