What's the best way to do a pattern search like this?

supernewbie has asked for the wisdom of the Perl Monks concerning the following question:

Re: What's the best way to do a pattern search like this? by MeowChow (Vicar) on Jul 20, 2001 at 10:12 UTC
Neglecting for a moment that the devil is in the details:

    sub word_count {
        my %h;
        $h{$_}++ for pop =~ /\w+/g;
        %h;
    }

    ## Example ##
    use Data::Dumper;
    my $s = 'fee fi fo fo fi fee fo fum fum bar baz';
    my %h = word_count($s);
    print Dumper \%h;

You may want to replace the regex with something like `/[a-z]+(?:'[a-z]+)?/gi`, in order to properly count contractions such as "don't".

MeowChow
s aamecha.s a..a\u$&owag.print
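A small sketch of how the two patterns above differ on text with apostrophes. The helper name `count_matches` and the sample sentence are made up for illustration and are not part of the original post.

```perl
use strict;
use warnings;

# Count how often each match of $pattern occurs in $text.
# count_matches() is a hypothetical helper, not code from the thread.
sub count_matches {
    my ($text, $pattern) = @_;
    my %count;
    $count{$_}++ for $text =~ /$pattern/g;   # list context: all matches
    return %count;
}

# Made-up sample text containing contractions.
my $text = "don't stop, I won't stop";

my %plain = count_matches($text, qr/\w+/);
my %apos  = count_matches($text, qr/[a-z]+(?:'[a-z]+)?/i);

print "plain: ", join(', ', map { "$_=$plain{$_}" } sort keys %plain), "\n";
print "apos:  ", join(', ', map { "$_=$apos{$_}" }  sort keys %apos),  "\n";
```

With `\w+` the word don't is counted as don and t; the second pattern keeps it whole.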
Re: Re: What's the best way to do a pattern search like this? by tachyon (Chancellor) on Jul 20, 2001 at 10:56 UTC
supernewbie wanted an explanation of MeowChow's sub.

First, declare the sub:

    sub word_count {

Next we declare a lexically scoped hash called %h. The % indicates that this is a hash, and the h is a typical MeowChow explanatory long var name :-)

    my %h;

This is a bit of very idiomatic Perl:

    $h{$_}++ for pop =~ /\w+/g;

It is fairly easy to understand if you read it R->L. The expression

    pop =~ /\w+/g

pop()s the last value off @_, which is the array of arguments passed to a subroutine called like `mysub(@myarray)`. With a single argument, as here, this gets us the value passed to the sub. We then use a regular expression to match \w+, which is a run of word characters (letters, digits and underscore, as many in a row as possible) but not whitespace. Because this is evaluated in LIST context by the `for`, it returns a list of words, which the `for` iterates over, assigning each value to the magical `$_` variable. Finally we use our hash to count the occurrences of each word (code). A hash stores key/value pairs; the key we are using is $_. The `++` part increments the value of `$h{$_}` by one each time we see the key.

    %h

A Perl sub returns the last value evaluated, so this is shorthand for the more usual `return %h`.

Hope this helps

cheers

tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
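For comparison, here is a spelled-out sketch of the same sub with one step per line. The name `word_count_verbose` is made up for this illustration; the behaviour is intended to match the one-liner described above.

```perl
use strict;
use warnings;

# A long-hand equivalent of the one-line word_count.
sub word_count_verbose {
    my $text  = pop @_;              # last argument passed to the sub
    my @words = $text =~ /\w+/g;     # list context: every run of word characters
    my %h;
    foreach my $word (@words) {
        $h{$word}++;                 # a missing key starts at undef, ++ makes it 1
    }
    return %h;                       # explicit form of the bare trailing %h
}

my %h = word_count_verbose('fee fi fo fo fi fee fo fum fum bar baz');
print "$_ => $h{$_}\n" for sort keys %h;
```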
Re: What's the best way to do a pattern search like this? by tachyon (Chancellor) on Jul 20, 2001 at 10:33 UTC
Here is an example for you using a hash:

    # declare our vars
    my (%codes, @array_codes);
    # undef input record sep to get all data at once
    local $/;
    # make an array of codes by splitting DATA on whitespace
    @array_codes = split /\s+/, <DATA>;
    # map the codes to a hash, counting duplicates
    # using a for loop for efficiency
    foreach my $code_key (@array_codes) {
        $codes{$code_key}++;
    }
    # print it out (print rather than printf, so a % in the data is not read as a format)
    print "$_\t$codes{$_}\n" for keys %codes;

    __DATA__
    baaba ba abab abab abab baaba baaba babaa.
    abab aaba ba abab ba.
    bababab abab abab ba aaba.
    ba bababab aaba abab babaa baaba ba baaba.
    aaba ba bababab ba bababab abab ba aaba abab baaba abab.
    ba abab abab ba.

Note that `map { ... } @array`, when used only for its side effects, is just another way of writing `for (@array) { ... }`.

To do it to a file, all you need to do is something like:

    sub count_codes {
        my $file = shift;
        my %codes;    # local to the sub, so counts do not leak between calls
        open(FILE, '<', $file) or die "Oops, perl says $!\n";
        local $/;
        my @array_codes = split /\s+/, <FILE>;
        close FILE;
        foreach my $code_key (@array_codes) {
            $codes{$code_key}++;
        }
        print "$_\t$codes{$_}\n" for keys %codes;
    }

    # call sub
    count_codes("/path/to/myfile.txt");

You have some full stops in there which I have assumed are part of the codes. If they are not, you will need to filter them out using a regex in our for loop like this:

    foreach my $code_key (@array_codes) {
        $code_key =~ s/[.]//g;
        $codes{$code_key}++;
    }

If you want to filter out more characters, add them to the character class between the [ ].

cheers

tachyon

Update: Removed lazy and inefficient map and replaced with proper for loop. Even typed foreach to remind me not to be so slack.
Re: Re: What's the best way to do a pattern search like this? by MeowChow (Vicar) on Jul 20, 2001 at 11:07 UTC
If I don't mention it, someone else will. Don't suggest the use of map in a void context. You are taking the trouble to build a whole return list, which you just throw away. It is more efficient and idiomatic to use `for` for such tasks.

MeowChow
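To make the point concrete, here is a minimal sketch with made-up sample data: both forms produce the same counts, but only the `for` version states that no return list is wanted.

```perl
use strict;
use warnings;

my @codes = qw(ba abab ba aaba ba);   # made-up sample data

# map used only for its side effect; the list it returns is ignored
my %by_map;
map { $by_map{$_}++ } @codes;

# the for statement modifier does the same work without a return list
my %by_for;
$by_for{$_}++ for @codes;

print "$_\t$by_map{$_}\t$by_for{$_}\n" for sort keys %by_for;
```

Perl releases from 5.8.1 onward no longer build the list when map is called in void context (per that release's perldelta), so on newer perls the remaining argument is mostly about readability.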
Re: Re: Re: What's the best way to do a pattern search like this? by tachyon (Chancellor) on Jul 20, 2001 at 11:28 UTC
Good point, I'll update the code. It's too much Golf you know, shaving those two chars by using map instead of for.

cheers

tachyon
Re: Re: What's the best way to do a pattern search like this? by supernewbie (Beadle) on Jul 20, 2001 at 13:33 UTC
I tried your method. Everything works great, except the program returns something like:

    ba.    1
    ba     2
    ........

Should I do a s/\./ / on file.txt before processing it through your function? What if there are other things like ? ! : ; " ' ( ) etc.?
Re: Re: Re: What's the best way to do a pattern search like this? by davorg (Chancellor) on Jul 20, 2001 at 13:57 UTC
You just need to adjust the regex a little.

    my @array_codes = split /\s+/, <FILE>;

assumes that you're interested in all non-whitespace characters. Changing it to:

    my @array_codes = split /\W+/, <FILE>;

means that you're only interested in word characters (where word chars are A-Z, a-z, 0-9 and '_').

--
<http://www.dave.org.uk>
Perl Training in the UK <http://www.iterative-software.com>
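A short sketch of the difference, using a fragment of the sample data from earlier in the thread. The last three lines illustrate one caveat of splitting on `\W+`; the quoted input there is made up.

```perl
use strict;
use warnings;

my $line = 'abab aaba ba abab ba. bababab abab';   # fragment of the sample data

my @on_space    = split /\s+/, $line;   # keeps "ba." with its full stop
my @on_non_word = split /\W+/, $line;   # punctuation acts as a separator

print join('|', @on_space), "\n";
print join('|', @on_non_word), "\n";

# Caveat: text that starts with a non-word character yields a leading empty field.
my @fields = split /\W+/, '"ba" abab';
print scalar(@fields), " fields, first is '", $fields[0], "'\n";
```

If the input can begin with punctuation, the empty first element would otherwise be counted as a code, so it may be worth skipping empty strings before incrementing the hash.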
Re: Re: Re: What's the best way to do a pattern search like this? by tachyon (Chancellor) on Jul 20, 2001 at 14:01 UTC
Hi, you have two options. If you wish to retain ultimate control, split on whitespace and filter the elements in @array_codes using this (as above):

    $code_key =~ s/[.?!:;"'()]//g;

This filters out everything in the char class. Alternatively, you can just grab word characters (alphanumerics and underscore) in the first place like this:

    @array_codes = <DATA> =~ m/\w+/g;

cheers

tachyon
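The two options side by side, as a rough sketch on a made-up input string. The `length` guard in the first loop is an addition for this example, so that a token consisting only of punctuation is not counted as an empty code.

```perl
use strict;
use warnings;

my $text = 'ba. "abab" (ba) aaba; ba!';   # made-up sample input

# Option 1: split on whitespace, then delete the listed characters
my %filtered;
foreach my $code_key (split /\s+/, $text) {
    $code_key =~ s/[.?!:;"'()]//g;
    $filtered{$code_key}++ if length $code_key;   # skip tokens that were all punctuation
}

# Option 2: pull out runs of word characters directly
my %matched;
$matched{$_}++ for $text =~ /\w+/g;

print "$_\t$filtered{$_}\t$matched{$_}\n" for sort keys %filtered;
```

On this input both give the same counts. They diverge once the data contains a character that is neither a word character nor in the character class, such as a hyphen, which option 1 keeps and option 2 treats as a separator.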
Re: Re: Re: Re: What's the best way to do a pattern search like this? by supernewbie (Beadle) on Jul 20, 2001 at 14:16 UTC
Re: Re: Re: Re: Re: What's the best way to do a pattern search like this? by tachyon (Chancellor) on Jul 20, 2001 at 14:59 UTC
Re: What's the best way to do a pattern search like this? by CharlesClarkson (Curate) on Jul 20, 2001 at 10:58 UTC
Some things to ponder:

How should the algorithm handle hyphenated words? Should pre-paid become pre and paid, or remain pre-paid? Will any words wrap to the next line using a hyphen?

Are there any slang or shortcut words in the file? How should b4 be handled?

Is the file short or long? Should the algorithm read the entire file into memory, or would it be better to process each line?

How might you handle dates: 500 A.D., c. 1500 bc.

And what about other abbreviations: Mr. Jr. Ave. etc. e.g.

HTH,
Charles K. Clarkson
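One possible set of answers, sketched under stated assumptions: hyphenated words and possessives stay whole, case is ignored, and the input is read one line at a time. The sample text is made up, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Made-up sample text; a real script would open a file instead.
my $text = "pre-paid plans beat paid plans\nMr. Smith's pre-paid plan\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my %count;
while (my $line = <$fh>) {             # only one line is held in memory at a time
    # letters, optionally joined by internal hyphens or apostrophes
    $count{lc $1}++ while $line =~ /([a-z]+(?:['-][a-z]+)*)/gi;
}
close $fh;

print "$_\t$count{$_}\n" for sort keys %count;
```

This sketch does not rejoin words that wrap across lines with a hyphen, drops digits (so b4 becomes b), and splits abbreviations such as A.D. into separate letters; each of those would need its own rule.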
