Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Storing output of a subroutine into an hash and then printing hash

by Maire (Scribe)
on Oct 18, 2017 at 14:09 UTC ( [id://1201594]=perlquestion: print w/replies, xml ) Need Help??

Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I have a subroutine that essentially stores all of the text from files held in a specific folder into a hash (%corpus). The idea is that you specify the folder when you call the subroutine. This subroutine has been written and used by someone with a lot more coding experience than me, so I can be sure that it is not the problem.

What I want to do is 1) call the subroutine (getCorpus); 2) specify the folder that I want it to work on; 3) instruct it to store its output into a new hash (%mycorpus), and then I want to print out the contents of this hash.

I’ve been browsing the PerlMonk archives for any similar situations to see if I could solve the problem myself, and I’ve managed to cobble together the code below from various places (notably Re: easiest way to print the content of a hash?), but I am quite obviously making some big mistakes that I don't have the skills, as of yet, to spot (as evidenced by the various error messages it returns (reproduced below the code)

I haven't uploaded the original subroutine here because I'm not sure about the etiquette of making someone else's code available without their consent. But essentially the subroutine comes first and then I write:

%mycorpus = getCorpus("C:\Users\li\test") #line 1 foreach (sort keys %mycorpus) { print "$_ : $mycorpus{$_}\n"; }

In line 1, I am trying to call the subroutine (named getCorpus), tell getCorpus what folder I want it to operate on, and store the output of the subroutine into a new hash named %mycorpus. Then in lines 2 and 3, I attempt to print the contents of %mycorpus. The error messages returned for this particular attempt are as follows:

Operator or semicolon missing before %mycorpus at get_corpus.pl line + 51. Ambiguous use of % resolved as operator % at get_corpus.pl line 51. Can't modify modulus (%) in scalar assignment at get_corpus.pl line 52 +, near ") foreach " syntax error at get_corpus.pl line 52, near ") {" Execution of get_corpus.pl aborted due to compilation errors.
Any guidance on this would be greatly appreciated.

Replies are listed 'Best First'.
Re: Storing output of a subroutine into an hash and then printing hash
by toolic (Bishop) on Oct 18, 2017 at 14:17 UTC
    You need to add a semicolon after the right paren:
    # v %mycorpus = getCorpus("C:\Users\li\test"); #line 1
      Semi-colons catch me out every time, thanks for that!
      Thank you!
Re: [OT]: Storing output of a subroutine into an hash and then printing hash
by AnomalousMonk (Archbishop) on Oct 18, 2017 at 15:05 UTC

    You probably just used the  "C:\Users\li\test" string as a quick example, but be aware that escapes (backslashes) in double-quoted strings are not passive. (I use  qq{...} in the following example instead of  "..." because Windoze command line doesn't like double-quotes.) The string you posted has three major pitfalls:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $weird = qq{C:\Users\li\test}; print qq{weird double-quotish interpolation '$weird'}; ;; my $cool = qq{C:/Users/li/test}; print qq{forward slashes '$cool'}; ;; my $single = 'C:\Users\li\test\\'; print qq{non-interpolating single quotes '$single'}; " weird double-quotish interpolation 'C:SERSI EST' forward slashes 'C:/Users/li/test' non-interpolating single quotes 'C:\Users\li\test\'
    Perl treats forward and backward slashes in directory paths equivalently for internal use; it's usually best to use forward slashes in such paths. Also note that the final  \ (backslash) in the single-quote example must be escaped.

    See Quote and Quote-like Operators.


    Give a man a fish:  <%-{-{-{-<

      Ah, I wasn't aware of this (and I think it explains the problem with another script that I've been having issues with). Thanks!
Re: Storing output of a subroutine into an hash and then printing hash
by kcott (Archbishop) on Oct 18, 2017 at 23:25 UTC

    G'day Maire,

    I see ++toolic has identified the issue with your syntax error; I want to discuss another aspect of your code.

    Given you're talking about "stores all of the text from files held in a specific folder", it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. Returning all that data and using it to create a new hash may be very inefficient.

    Without knowing how much data is involved, what &getCorpus is doing internally, or how %mycorpus is subsequently used (beyond printing its key-value pairs), I'm not in a position to provide concrete advice; however, I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash.

    Here's a couple of Benchmarks to give you an idea of just how inefficient your current code might be. (Note: I typically run benchmarks a minimum of five times; discard outlier results; then look for representative results in the remainder.)

    In the first test, I use a hash with a thousand key-value pairs, and compare (a) returning the hash as is and assigning that to a new hash; (b) returning a reference to the hash and assigning that to a new scalar; and, (c) returning a new, anonymous hashref and assigning that to a new scalar.

    #!/usr/bin/env perl use strict; use warnings; use Benchmark 'cmpthese'; my %big_hash = map { $_ => 1 } 1 .. 1_000; sub _get_original_hash { return %big_hash } sub _get_ref_to_hash { return \%big_hash } sub _get_anon_hashref { return { %big_hash } } cmpthese 0 => { orig => sub { my %hash = _get_original_hash() }, ref => sub { my $ref = _get_ref_to_hash() }, anon => sub { my $anon = _get_anon_hashref() }, };

    Here's a representative result:

    Rate orig anon ref orig 4495/s -- -5% -100% anon 4756/s 6% -- -100% ref 5554928/s 123471% 116697% --

    As you can see, returning a reference to the original hash is orders of magnitude faster than the other two methods. Although it may appear that "anon" is marginally faster than "orig", do not draw any conclusions from this: they're too close to call as demonstrated in the next test.

    Repeating the above with a million key-value pairs

    my %big_hash = map { $_ => 1 } 1 .. 1_000_000;

    gives this representative result:

    anon 1.23/s -- -4% -100% orig 1.28/s 4% -- -100% ref 5557956/s 451583832% 434909964% --

    Again, "ref" is orders of magnitude faster than the other two methods. This time, however, "orig" appears to be marginally faster than "anon" (which really just confirms that they're "too close to call").

    I've only demonstrated a basic principle. You should write your own benchmarks using realistic data (which you haven't described in your OP and I can only guess at).

    If you do decide to return a hashref, your code might look more like this:

    my $mycorpus = getCorpus('C:\Users\li\test'); print "$_ : $mycorpus{$_}\n" for sort keys %$mycorpus;

    For printing all the keys and their values, consider one of the dumper modules. My preference is the CPAN module Data::Dump; the core module Data::Dumper is also popular; others exist but those are the only two I've had much experience with. Using Data::Dump, you could get much the same output as the two lines above, but you'd only need this one line of code:

    dd getCorpus('C:\Users\li\test');

    And just a couple of other, unrelated points:

    "This subroutine has been written and used by someone with a lot more coding experience than me, so I can be sure that it is not the problem."

    That's a dangerous assumption: the best programmer in the world is not infallible and can have an off-day. Check it yourself: you may learn something; you might spot an error.

    "... I'm not sure about the etiquette of making someone else's code available without their consent."

    If the code's in the public domain, post it or link to it, and provide suitable attribution. Consent should not be required (unless the author specified this requirement). Notification may be a courtesy but could also be a nuisance: consider whether that's appropriate on a case-by-case basis. If it's not in the public domain, you should gain consent first; you may also need to alter sensitive data, e.g. post strings like "password", "credit_card_number", or "client_contact_details" instead of the real values.

    If you're posting a lot of code here, consider wrapping it in <spoiler>...</spoiler> or <readmore>...</readmore> tags. See "Writeup Formatting Tips" for more about that.

    — Ken

      Thank you very much for your incredibly helpful and detailed answer to my rather vague enquiry!

      I had never heard of the Benchmark module before, but it seems like a valuable resource; cheers for the pointer!

      At the moment, I am only working with very small amounts of data, but you are correct that eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method.

      Thanks again!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1201594]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (3)
As of 2024-03-29 06:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found