G'day Maire,
I see ++toolic has identified the issue with your syntax error;
I want to discuss another aspect of your code.
Given you're talking about "stores all of the text from files held in a specific folder",
it sounds like there's a lot of data,
and %corpus may contain hundreds (thousands? millions?) of key-value pairs.
Returning all that data and using it to create a new hash may be very inefficient.
Without knowing how much data is involved,
what &getCorpus is doing internally,
or how %mycorpus is subsequently used (beyond printing its key-value pairs),
I'm not in a position to provide concrete advice;
however, I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash.
Here are a couple of benchmarks
to give you an idea of just how inefficient your current code might be.
(Note: I typically run benchmarks a minimum of five times; discard outlier results;
then look for representative results in the remainder.)
In the first test, I use a hash with a thousand key-value pairs,
and compare (a) returning the hash as is and assigning that to a new hash;
(b) returning a reference to the hash and assigning that to a new scalar; and,
(c) returning a new, anonymous hashref and assigning that to a new scalar.
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark 'cmpthese';

my %big_hash = map { $_ => 1 } 1 .. 1_000;

sub _get_original_hash { return %big_hash }
sub _get_ref_to_hash   { return \%big_hash }
sub _get_anon_hashref  { return { %big_hash } }

cmpthese 0 => {
    orig => sub { my %hash = _get_original_hash() },
    ref  => sub { my $ref  = _get_ref_to_hash() },
    anon => sub { my $anon = _get_anon_hashref() },
};
Here's a representative result:
         Rate    orig    anon     ref
orig   4495/s      --     -5%   -100%
anon   4756/s      6%      --   -100%
ref 5554928/s 123471% 116697%      --
As you can see, returning a reference to the original hash is orders of magnitude faster than the other two methods.
Although it may appear that "anon" is marginally faster than "orig", do not draw any conclusions from this:
they're too close to call as demonstrated in the next test.
Repeating the above with a million key-value pairs
my %big_hash = map { $_ => 1 } 1 .. 1_000_000;
gives this representative result:
          Rate       anon       orig        ref
anon    1.23/s         --        -4%      -100%
orig    1.28/s         4%         --      -100%
ref  5557956/s 451583832% 434909964%         --
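Again, "ref" is orders of magnitude faster than the other two methods;
its rate is essentially unchanged from the first test (about 5.5 million per second in both),
because returning a reference costs the same regardless of how big the hash is,
whereas "orig" and "anon" both copy every key-value pair.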
This time, however, "orig" appears to be marginally faster than "anon"
(which really just confirms that they're "too close to call").
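Speed isn't the only difference between "ref" and "anon": they also behave differently when the returned data is modified. Here's a minimal sketch of that, using a tiny stand-in hash (the keys and values are made up for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Tiny stand-in for the real data (illustrative values only).
my %big_hash = (alpha => 1, beta => 2);

sub _get_ref_to_hash  { return \%big_hash }
sub _get_anon_hashref { return { %big_hash } }

# A reference points at the original hash, so writes go straight through.
my $ref = _get_ref_to_hash();
$ref->{gamma} = 3;

# The anonymous hashref is a shallow copy taken at call time;
# changing it leaves the original untouched.
my $anon = _get_anon_hashref();
$anon->{delta} = 4;

printf "original has gamma: %s\n", exists($big_hash{gamma}) ? 'yes' : 'no';
printf "original has delta: %s\n", exists($big_hash{delta}) ? 'yes' : 'no';
```

So if callers must not be able to change the original data, the copy made by "anon" (or "orig") may be worth its cost; otherwise "ref" is probably the better choice.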
I've only demonstrated a basic principle.
You should write your own benchmarks using realistic data
(which you haven't described in your OP and I can only guess at).
If you do decide to return a hashref, your code might look more like this:
my $mycorpus = getCorpus('C:\Users\li\test');
print "$_ : $mycorpus->{$_}\n" for sort keys %$mycorpus;
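I haven't seen &getCorpus, so I can't say what it would take to change it. Purely as an illustration, here's a sketch of a routine that builds its hash locally and returns a reference to it. The name get_corpus_ref, and the idea of keying each file's text by its filename, are my assumptions, not anything from your code:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;

# Hypothetical stand-in for getCorpus(): maps each plain file's name
# to its full text and returns a reference to the hash.
sub get_corpus_ref {
    my ($dir) = @_;
    my %corpus;

    opendir my $dh, $dir or die "Can't open directory '$dir': $!";
    for my $name (readdir $dh) {
        my $path = File::Spec->catfile($dir, $name);
        next unless -f $path;    # skips '.', '..' and subdirectories

        open my $fh, '<', $path or die "Can't read '$path': $!";
        local $/;                # slurp mode: read the whole file at once
        $corpus{$name} = <$fh>;
        close $fh;
    }
    closedir $dh;

    return \%corpus;             # one scalar returned; the data is not copied
}

# Usage: pass the folder as the first command-line argument.
if (@ARGV) {
    my $corpus = get_corpus_ref($ARGV[0]);
    printf "%s : %d characters\n", $_, length($corpus->{$_})
        for sort keys %$corpus;
}
```

Because %corpus is a lexical inside the sub, each call hands back a reference to a fresh hash, so there's no shared state to worry about here.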
For printing all the keys and their values, consider one of the dumper modules.
My preference is the CPAN module Data::Dump;
the core module Data::Dumper is also popular;
others exist but those are the only two I've had much experience with.
Using Data::Dump, you could get much the same output as the two lines above,
but you'd only need this one line of code:
dd getCorpus('C:\Users\li\test');
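If you'd rather stick to core modules, something along these lines with Data::Dumper should give similar output. This is only a sketch: the hashref below is made-up stand-in data, where your code would use whatever getCorpus returns.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # emit hash keys in sorted order
$Data::Dumper::Indent   = 1;   # simple, fixed-width indentation

# Stand-in data (illustrative only); in the original code this would be
# the hashref returned by getCorpus().
my $mycorpus = { 'a.txt' => 'first text', 'b.txt' => 'second text' };

print Dumper($mycorpus);
```

Without Sortkeys, the keys come out in Perl's internal hash order, which can differ from run to run.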
And just a couple of other, unrelated points:
"This subroutine has been written and used by someone with a lot more coding experience than me, so I can be sure that it is not the problem."
That's a dangerous assumption: the best programmer in the world is not infallible and can have an off-day.
Check it yourself: you may learn something; you might spot an error.
"... I'm not sure about the etiquette of making someone else's code available without their consent."
If the code's in the public domain, post it or link to it, and provide suitable attribution.
Consent should not be required (unless the author specified this requirement).
Notification may be a courtesy but could also be a nuisance:
consider whether that's appropriate on a case-by-case basis.
If it's not in the public domain, you should gain consent first;
you may also need to alter sensitive data,
e.g. post strings like "password", "credit_card_number", or "client_contact_details" instead of the real values.
If you're posting a lot of code here,
consider wrapping it in <spoiler>...</spoiler> or <readmore>...</readmore> tags.
See "Writeup Formatting Tips" for more about that.