That's definitely a good idea, but I can't profile my program on the embedded machine because it doesn't have a profiling module (Devel::Profile) installed, and I'm not able to install one on the box (company policy; it's pretty much a pain in the a$$).
Basically, the way I figured it out was by writing a few test scripts and adding features to them one at a time to see when the slowdown occurs. From there I narrowed it down to the point where I try to use a module.
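
For anyone who wants to try the same kind of narrowing down without a profiler, here is a minimal sketch of such a test script. It is only an illustration, not my exact script: it assumes Time::HiRes is available on the box (it ships with most Perl 5.8+ installs, but a stripped-down embedded build may lack it), and time_module_load and the default module name POSIX are placeholders I made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: load a module at run time and return how long
# the load took, in seconds.
sub time_module_load {
    my ($module) = @_;
    my $start = [gettimeofday];
    # String eval so the require runs after the timer has started,
    # rather than at compile time as a plain "use" would.
    eval "require $module; 1"
        or die "Could not load $module: $@";
    return tv_interval($start);
}

# Placeholder default: pass the suspect module name as the first argument.
my $module  = @ARGV ? $ARGV[0] : 'POSIX';
my $elapsed = time_module_load($module);
printf "%s loaded in %.3f s\n", $module, $elapsed;
```

Run it once per suspect module and compare the numbers. If Time::HiRes isn't installed either, the built-in time() does the same job with whole-second resolution, which is still enough to spot a load that takes several seconds.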