Re: Newbie Q:How do I compare items within a string?
by Limbic~Region (Chancellor) on May 07, 2006 at 13:16 UTC
PerlGrrl,
Welcome to the Monastery. Let's assume for a second that the reason you used an array is because you want to preserve order. You can use a disposable hash to find the unique counts.
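my @words;    # Defined elsewhere
my %uniq;
++$uniq{$_} for @words;      # First pass: count every word
for my $word (@words) {      # Second pass: report in order of first appearance
    if (exists $uniq{$word}) {
        print "Word: $word\tCount: $uniq{$word}\n";
        delete $uniq{$word};    # Removing the key keeps repeats from printing again
    }
}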
FYI, if you don't need to preserve the order then the array is unnecessary (just use the hash), and that would make it a frequently asked question.
Thnx! You've helped heaps...Unfortunately, preserving order is important, as is matching across mixed case...I guess that's why I was having so much trouble. I knew I had to use a hash, I just wasn't sure how.
And thanks for the welcome - will try not to wear it out...
PerlGrrl,
Well, to make it case insensitive, play around with lc.
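A minimal sketch of that suggestion, assuming the words are already split into a list: key the counting hash on lc $word so that "The" and "the" share one entry, and keep a separate array for the order of first appearance. The helper name ordered_counts and the sample words are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [word, count] pairs in order of first
# appearance, treating "The" and "the" as the same word.
sub ordered_counts {
    my @words = @_;
    my (%count, @order);
    for my $word (@words) {
        my $key = lc $word;                        # fold case before counting
        push @order, $key unless $count{$key}++;   # remember first sighting only
    }
    return map { [ $_, $count{$_} ] } @order;
}

my @words = qw(The cat saw the Cat);   # sample data, not from the thread
printf "Word: %s\tCount: %d\n", @$_ for ordered_counts(@words);
```

Note that the words are reported in lowercase; if the original spelling matters, store the first-seen form alongside the count.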
welcome, and though I've only seen it recommended once here recently, you might want to consider Tie::IxHash -- I did, where preserving order was crucial but a hash was the reasonable solution to avoid having to iterate across 2 arrays concurrently -- and it bailed me out nicely!
Re: Newbie Q:How do I compare items within a string?
by McDarren (Abbot) on May 07, 2006 at 13:23 UTC
In a situation like this, a hash is just the ticket.
You haven't given any sample data, but here is an example of how you could do it using the last few lines of your question as the "@words"
#!/usr/bin/perl -w
use strict;
use Data::Dumper::Simple;
my @words = qw(to check now across my string whether any elements in my string are repeated,
and if so, how many times. I've read alot about manipulating arrays, but
they're all based on arrays that you create yourself, rather than arrays
created by opening a textfile, so I'm not sure how to manipulate my array.
Any help would be much appreciated.);
my %unique_words;
for (@words) {
    $unique_words{$_}++;
}
print Dumper(%unique_words);
The above will print each unique "word", and how many times it appears.
Cheers,
Darren :)
Re: Newbie Q:How do I compare items within a string?
by TedPride (Priest) on May 07, 2006 at 18:21 UTC
If you want the original text, marked up to show words that appear more than once:
use strict;
use warnings;
my %c;
$_ = join '', <DATA>;
$c{lc($1)}++ while m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g;
s/([a-zA-Z]+(?:'[a-zA-Z]+)?)/$1 . ($c{lc($1)} > 1 ? "[$c{lc($1)}]" : '')/eg;
print;
__DATA__
I need to know how to compare items within a string... I have dropped a textfile into an array, but now I need to check whether words in that text are repeated throughout. I have split the text; as I only want the text to be manipulated. Maybe it's better to split it like this; So anyway, I basically need to check now across my string whether any elements in my string are repeated, and if so, how many times. I've read alot about manipulating arrays, but they're all based on arrays that you create yourself, rather than arrays created by opening a textfile, so I'm not sure how to manipulate my array. Any help would be much appreciated.
If you just want a list of all the words that appear more than once, in order of first appearance:
use strict;
use warnings;
my (%c, @w);
$_ = lc join '', <DATA>;
while (m/([a-z]+(?:'[a-z]+)?)/g) {
    push @w, $1 if !$c{$1}++;
}
for (@w) {
    print "$_ : $c{$_}\n" if $c{$_} > 1;
}
Ted Pride,
Hi. I've found that your suggested solution has given me the exact output that I was after. Just had to make a minor tweak so it would recognise all words, and output in the format I was after...Thanks so much to everyone for their help.
Re: Newbie Q:How do I compare items within a string?
by Zaxo (Archbishop) on May 07, 2006 at 17:16 UTC
Instead of splitting into an intermediate array, you can extract the word positions directly from a regex scan. It goes like this:
my $string =
q(So anyway, I basically need to check now across my
string whether any elements in my string are repeated, and
if so, how many times. I've read alot about manipulating
arrays, but they're all based on arrays that you create
yourself, rather than arrays created by opening a textfile,
so I'm not sure how to manipulate my array. Any help would
be much appreciated.);
my %positions;
# pos() with no argument reads $_, so name $string explicitly
push @{$positions{lc($1)}}, pos($string) - length($1)
    while $string =~ /([A-Za-z']+)/g;
{
    local $_;
    print "$_\t@{$positions{$_}}\n"
        for keys %positions;
}
That hash gives you, for each word found, a reference to an array of string positions in ascending order. In scalar context, the referenced arrays give the word count (for example, scalar @{$positions{string}}).
Another advantage is that you get to say directly what a word character is, instead of defining what separates words. I used that to include contractions (at the cost of messing up any single-quoted passages).
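To make the positions-and-counts idea concrete, here is a small sketch on a short sample string. It takes the start offset from $-[1], which perlvar documents as the start of the first capture group; the helper name word_positions and the sample text are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each lowercased word to an array ref of its
# zero-based start offsets in the given string.
sub word_positions {
    my ($string) = @_;
    my %positions;
    push @{ $positions{ lc $1 } }, $-[1]
        while $string =~ /([A-Za-z']+)/g;
    return \%positions;
}

my $positions = word_positions("my string, My array");   # sample text
for my $word (sort keys %$positions) {
    my @offsets = @{ $positions->{$word} };
    # An array in scalar context yields its length, i.e. the word count
    printf "%s\tcount=%d\toffsets=%s\n", $word, scalar @offsets, "@offsets";
}
```

Because the scan runs left to right, each offset list comes out in ascending order without any sorting.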
I am not sure that you need to subtract the length of the word you have just matched from the position in your pos() - length($1). I have been playing around, combining elements of your solution and TedPride's, to come up with text annotated with occurrence number, total occurrences and offset. My suspicions were raised when the first word "I" came up with an offset of -1. Here's the code without the subtraction
and here's the output
Empirically, this seems to work giving zero-based offsets. The documentation is rather terse but says that it returns the position where the last match left off, implying that your subtraction would be necessary. Strange.
Cheers, JohnGG
push @{$positions{lc($1)}}, $-[1]
while $string =~ /([A-Za-z']+)/g;
The difference in indexing is that your code is matching on separator characters instead of word characters. The end of your first match is the start of my second.
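A small sketch of that difference, using a made-up sample string and two hypothetical helpers. When the regex matches words, pos() reports where each word ended and $-[1] where it began; when the regex matches separators, pos() lands on the start of the following word, so no subtraction is needed but the first word (at offset 0) is never reported.

```perl
use strict;
use warnings;

# Hypothetical helper: [word, start, end] for every word matched.
sub word_spans {
    my ($text) = @_;
    my @spans;
    push @spans, [ $1, $-[1], pos($text) ]
        while $text =~ /([A-Za-z']+)/g;
    return @spans;
}

# Hypothetical helper: offsets at which each run of separators ends.
sub separator_ends {
    my ($text) = @_;
    my @ends;
    push @ends, pos($text) while $text =~ /[^A-Za-z']+/g;
    return @ends;
}

my $text = "so I'm not sure";   # sample text, not from the thread
printf "%-5s start=%-2d end=%d\n", @$_ for word_spans($text);
print "separator matches end at: @{[ separator_ends($text) ]}\n";
```

The separator offsets coincide with the start offsets of the second and later words, which is why both approaches appear to give sensible zero-based positions.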
Re: Newbie Q:How do I compare items within a string?
by johngg (Canon) on May 07, 2006 at 16:22 UTC
This is one way you could check for repeated words, counting them while also keeping track of their order. I have changed your split slightly to cope with a comma and a space next to each other, and also with newlines. This solution uses a hash keyed by the lowercased word; the value for each word is another hash holding the number of occurrences and the list of positions at which the word occurs.
use strict;
use warnings;
use Data::Dumper;
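{
    local $/ = undef;    # slurp mode: read everything from DATA at once
    $_ = <DATA>;
}
my @words = split /[., \n]+/;    # splits $_ on runs of dots, commas, spaces, newlines
print "$_\n" for @words;
my $rhWords = {};
my $order = 0;
foreach my $word (@words)
{
    my $lcWord = lc $word;
    push @{$rhWords->{$lcWord}->{order}}, ++ $order;    # 1-based position in the word list
    $rhWords->{$lcWord}->{count} ++;
}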
my $dd = Data::Dumper->new([$rhWords], [qw(rhWords)]);
print $dd->Dumpxs();
__END__
The quick brown fox jumps over the lazy dog, the slow brown duck swims
over the muddy pond. The chicken is pecking in the yard until the quick
brown fox decides he is hungry.
When run this produces
I hope this is of use. Cheers, JohnGG