Tutelage, part two

ctp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Tutelage, part two by davido (Cardinal) on Jan 04, 2004 at 09:21 UTC
In place of your last code paragraph (starting with foreach $key...), try this: `{ local $, = "\n"; print( ( sort { $hash{$b} <=> $hash{$a} } keys %hash )[0..9] ); }` [download] ...if that's too funky for ya-all, you could just do this: `foreach my $key ( (sort {$hash{$b} <=> $hash{$a} } keys %hash )[0..9] +) { print $key, "\n"; }` [download] Both of the preceeding methods take a slice of a list. This assumes that there actually ARE ten words in the list. None of the methods mentioned try to do anything about the possibility of there being more than one word that appears N number of times. The order, in such a case, is undefined. If you want to define it, change the sort to look like this: `sort { $hash{$b} <=> $hash{$a} or $a cmp $b } keys $hash` [download] This will sort by frequency first, and alphabetically second, by short-circuiting with the 'or' to the alphabetic comparison when the frequencies are equal. Also, this wasn't mentioned in response to your previous "Tutlage" question, but it's looked on favorably in the Monastery to use node titles that describe the question, not the nature of the request for help or level of knowledge of the requestor. Titles like "Newbie help", "Need help", or "Tutelage" don't give anyone any idea what it is they're going to be reading (or choosing not to read, whichever the case may be). ...just a friendly tip. ;) Dave	[reply] [d/l] [select]
Re: Tutelage, part two by edan (Curate) on Jan 04, 2004 at 09:19 UTC
Try something like this: `my $top_X = 10; foreach $key (sort { $hash{$a} <=> $hash{$b} } keys %hash) { print "$key\t\t= $hash{$key}\n"; last if --$top_X <= 0; }` [download] -- 3dan	[reply] [d/l]
Re: Re: Tutelage, part two by tilly (Archbishop) on Jan 04, 2004 at 23:22 UTC
Style note. I would take the opportunity to throw my in front of $key and explain how strict.pm could be used to catch potential typos. You could also mention that it is important to indent consistently (as you did in your snippet).	[reply]
Re: Re: Re: Tutelage, part two by edan (Curate) on Jan 05, 2004 at 07:14 UTC
Well.... I think this sort of goes back to the answering the bigger question question. I know that you said over there that you are the 'answer the bigger question' type, so I suppose your comment here is in character. But I think 'just answering the question' is okay too, and that's what I did here - the OP was not looking for style points, and even mentioned that s/he knows that s/he should be using strict, etc. So I chose to simply answer the question that was asked... -- 3dan	[reply]
Re: Tutelage, part two by diotalevi (Canon) on Jan 04, 2004 at 17:51 UTC
This is a general code critique. # Always use strict especially when you are new. Its a boon to debuggi +ng and it helps you from screwing up in some interesting ways (but mo +stly it does some typo-checking) use strict; use warnings; open (TEXTFILE, 'text.txt') or die ("Cannot open file : $!"); # $/ is normally "\n" which is what readline() uses to decide when to +stop reading. You want to assign undef here so that readline() will p +ull in the entire file and not just the first line of the file. $/ = undef; # strict will force you to declare your variables. Here I use my() to +make $big_string be a lexical. my $big_string = <TEXTFILE>; $big_string = lc ($big_string); # I re-wrote this as a tr/// expression so it would be so ugly as a re +gular expression. If you really wanted to use s/// then s/[()]//g wou +ld have been preferred. $big_string =~ tr/()//d; # I declared @words with my() and moved the plus sign out of the chara +cter class. You meant to split on repeated white-space but you actual +ly just indicated that plus signs should also be split on. Moving the + + quantifier onto the character class indicates that any repeated ma +tching of the class should be smushed together. my @words = split (/[,.\s]+/, $big_string); # I replaced your for/push loop with a grep which only passes words th +at are three characters or longer. You should also note that in your +original expression that anything that matches {4,} is also going to +match {3,}. If you really meant to do that then just the single {3,} +would have been good enough. Also, instead of saying [a-zA-Z] you cou +ld have written /[a-z]{3,}/i which is case insensitive. my @gt_three_char_words = grep /[a-z]{3,}/i, @words; # Rename %hash to %word_counter and declare it. my %word_counter; foreach (@gt_three_char_words) { $word_counter{$_}++; } # Sort the list of words so you can then extract whatever portion you +like. @gt_three_char_words = sort { # Sort the word list by occurrances ( $word_counter{$a} <=> $word_counter{$b} ) \|\| # Then alphabetically ( lc( $a ) cmp lc( $b ) } @gt_three_char_words # Print the print join( '', map( "$_: $word_count{$_}\n", @gt_three_char_words[ 0 .. 9 ] ) ) . "\n"; [download] Now here is how I would have written the entire thing. use strict; use warnings; use Getopt::Long; use File::Slurp; my $filename = 'text.txt'; my $min_word_length = 3; my $start_word = 0; my $words_to_display = 10; GetOptions( "file=s" => \$filename, "min=i" => \$min_word_length, "start=i" => \$start_word, "words=i" => \$words_to_display ); my $original_text = lc read_file( $filename ); $original_text =~ tr/()//d; my @words = grep /[a-z]{$min_word_length,}/i, split /[,.\s]+/, $original_text; my %word_counter; $word_counter{ $_ } ++ for @words; @words = sort { $word_counter{ $a } <=> $word_counter{ $b } \|\| $a cmp $b } @ words; my @words_to_print = @words[ $start_word .. ( $start_word + $words_to_display - 1 ) ]; for my $word ( @words_to_print ) { print "$word: $word_counter{ $word }\n"; } [download]	[reply] [d/l] [select]
Re: Re: Tutelage, part two by ysth (Canon) on Jan 04, 2004 at 20:12 UTC
`print join( '', map( "$_: $word_count{$_}\n", @gt_three_char_words[ 0 .. 9 ] ) ) . "\n";` [download] No offense intended, but that is truly ugly. Why not a simple for loop: `for my $word (@gt_three_char_words[ 0 .. 9]) { print "$word: $word_count{$word}\n"; }` [download] Now here is how I would have written the entire thing. You've got a couple of errors there: $word where you meant $_ in the increment loop and an extraneous ( after split.	[reply] [d/l] [select]
Re: Re: Re: Tutelage, part two by diotalevi (Canon) on Jan 04, 2004 at 20:30 UTC
Ugly? Huh. I thought the foreach/print was uglier than the print/map. But then on consideration I would have probably written that as as list instead. `print +(map "$_: $word_count{ $_ }\n", @gt_three_char_words[ 0 .. 9 ] ), "\n";` [download]	[reply] [d/l]
Re: Re: Tutelage, part two by chanio (Priest) on Jan 04, 2004 at 21:01 UTC
Very illuminating brother! To be able to teach is a virtue that surpasses the technical skills.	[reply]
Re: Re: Tutelage, part two by ctp (Beadle) on Jan 05, 2004 at 04:17 UTC
Wow - lots to chew on there. Thanks. A couple of responses/questions/etc: Always use strict especially when you are new. Check out my response to that a few posts up...I'm working on getting to the place where I will use strict, but I'm not quite there yet. I replaced your for/push loop with a grep which only passes words that are three characters or longer That started out to capture all words of 4 or more letterrs, but words like dog's and cat's were skipped. I added the second part, `[a-zA-Z]{3,}'`, to cature those...it seems to work as far as I can tell. I re-wrote this as a tr/// expression -snip- Ooh - I like that a lot better...thanks!	[reply] [d/l]
Re: Tutelage, part two by dreadpiratepeter (Priest) on Jan 04, 2004 at 17:24 UTC
btw - yes, we have already covered that I should be using strict; and declaring my variables, that I should really be slurping my text file (although the script currently does what it should thru some kludgy perl magic ;), and a few other things I need to fix before I turn it in. :-) The fundamental change that you need to make to become a good Perl prgrammer (and a good programmer in general), is that you have to use strict before you code, and declare your variables as you write them, not at the end. You are driving blind. When things don't work, you don't know why. With use strict, 90% of all your errors will be caught by the compiler. Why make things harder on yourself? As for the problem at hand, you don't need two loops, you need one, but you need to exit the loop when you have gone through it 10 times. Try: `my $ctr=1; foreach $key (sort {$hash{$b} <=> $hash{$a}} keys %hash) { print "$key\t\t= $hash{$key}\n"; last if $ctr++ == 10; }` [download] `last` tells the interpreter to exit the loop unconditionally. -pete "Worry is like a rocking chair. It gives you something to do, but it doesn't get you anywhere."	[reply] [d/l] [select]
Re: Re: Tutelage, part two by ctp (Beadle) on Jan 05, 2004 at 00:36 UTC
Thank you for the code snippet, and the advice. I basically just started learning Perl, and I have found that I am of two minds on the use of strict; as a beginner. I now have a script that works, and will get me a good grade - for the beginner's assignment at hand. When I turn on strict; I get several lines of admonishment...yes, very important admomishment, I know...but I also have no output to see if I am even headed in the right direction, and strict; tells me a bunch of new stuff I don't understand on top of it all. I am barely at the point of remembering which part of a hash is the key and which is the value, and I'm just now barely figuring out how `$hash{$_}++` does what it does :-D What I have taken to doing is putting in strict; and then commenting it out until I want to hear from it. As I learn more I know I will want to hear from it all the time, but for now I'm happy that a simple foreach loop even works as written. Thanks again!	[reply] [d/l]
Re: Re: Re: Tutelage, part two by diotalevi (Canon) on Jan 05, 2004 at 04:43 UTC
Here's the cheat sheet on strict. Declare global variables with our() (or declare them up at the top with 'use vars ...'. and everything other variable with my(). The rest of strict's functions will stop you from writing bugs so if you run into other things then your code is already incorrect. `use vars ( '$SOME_GLOBAL' ); $SOME_GLOBAL = 42; our $ANOTHER_GLOBAL = 'fnord'; sub a_function { # A variable declared just for the inside of a_function. my $some_param = shift; print "$some_param\n"; }` [download]	[reply] [d/l]
Re: Tutelage, part two by Zaxo (Archbishop) on Jan 04, 2004 at 17:28 UTC
For `@words = split (/[,.\s+]/, $big_string);`, I think you mean, `@words = split (/[,.\s]+/, $big_string);` and you may want to add more punctuation to the character class. After Compline, Zaxo	[reply] [d/l] [select]
Re: Re: Tutelage, part two by ctp (Beadle) on Jan 05, 2004 at 04:06 UTC
Cool - did it - thanks!	[reply]
Re: Tutelage, part two by Courage (Parson) on Jan 04, 2004 at 10:39 UTC
You'll better place `use strict;` instead of `use warnings;` in top of your code and be sure to follow coding guidelines in "perldoc perlstyle", it is especially usefull for novices Courage, the Cowardly Dog	[reply] [d/l] [select]
Re: Tutelage, part two by ysth (Canon) on Jan 04, 2004 at 20:42 UTC
One thing that bothers me is that you use different criteria for spliting into words and determining the length. In the one case, you are using whatever you get after spliting on certain punctuation; in the other, you are checking only the 26 basic letters. Before you write the code, you should consider what you consider a "word". This is actually a fairly complex task. I would approach it as something like: A "word" is a contiguous sequence of letters or digits with zero or more hyphens or single quotes in the inside. (There are words that begin or end with single quotes, but it isn't possible to easily distinguish those from quoted words.) With a definition like that, you can pick them out more easily with something like `@words = m/((?:\w\|\b[-']\b){4,})/g` than by splitting on non-word characters. One problem with that is that \w will include _. To avoid that, use `[^\W_]` instead of \w and `(?<=[^\W_])[-'](?=[^\W_])` instead of the other part (untested). Another problem is accented characters. If you want to include them, either utf8::encode($big_string) before using it or use locale. And if your input has words split over lines with a hyphen, you may want to `s/-\n//g` your input. If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing.	[reply] [d/l] [select]
Re: Re: Tutelage, part two by ctp (Beadle) on Jan 05, 2004 at 04:23 UTC
All great suggestions - thanks! If you have test input, you really ought to print out what you are getting as words and compare it to the input to see what special cases you may be missing. I do indeed have test input that I have become quite intimate with while testing this script. This is how I discovered that nnn'n words were being skipped. I am digging thru the sample file, and I haven't found anything not getting picked up yet...but of course I probably have a lucky set of words in my file (a random text file from work)	[reply]