Re^2: Newbie Q:How do I compare items within a string?

I am not sure that you need to subtract the length of the word you have just matched from the position, as in your `pos() - length($1)`. I have been playing around, combining elements of your solution and TedPride's, to come up with text annotated with occurrence number, total occurrences and offset. My suspicions were raised when the first word, "I", came up with an offset of -1.

Here's the code without the subtraction:

use strict;
use warnings;

my $string;
{
    local $/;
    $string = <DATA>;
}

# First pass: for each lower-cased word, record its running word numbers and total count.
my @words = split /[.,;:?! \n]+/, $string;

my $rhWords = {};
my $order = 0;
foreach my $word (@words)
{
    my $lcWord = lc $word;
    push @{$rhWords->{$lcWord}->{order}}, ++ $order;
    $rhWords->{$lcWord}->{count} ++;
}

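# Second pass: append [occurrence/total/offset] to every word that appears more than once.
# $^R holds the result of the last (?{ ... }) block, i.e. the running occurrence number.
my %found = ();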
$string =~ s
   {
      ([^.,;:?! \n]+)(?{++ $found{lc $1}})
   }
   {
      $1 .
      (
         $rhWords->{lc $1}->{count} > 1 ?
         "[$^R/$rhWords->{lc $1}->{count}/@{[pos()]}]" :
         ""
      )
   }xeg;

print "\n$string\n";

__END__
I need to know how to compare items within a string... I 
have dropped a textfile into an array, but now I need to check 
whether words in that text are repeated throughout. I have split 
the text; as I only want the text to be manipulated. Maybe it's 
better to split it like this; So anyway, I basically need to check 
now across my string whether any elements in my string are repeated, 
and if so, how many times. I've read a lot about manipulating arrays, 
but they're all based on arrays that you create yourself, rather 
than arrays created by opening a textfile, so I'm not sure how to 
manipulate my array. Any help would be much appreciated.

and here's the output:

I[1/6/0] need[1/3/2] to[1/7/7] know how[1/3/15] to[2/7/19] compare items within a[1/4/43] string[1/3/45]... I[2/6/55]
have[1/2/58] dropped a[2/4/71] textfile[1/2/73] into an array[1/2/90], but[1/2/97] now[1/2/101] I[3/6/105] need[2/3/107] to[3/7/112] check[1/2/115]
whether[1/2/122] words in[1/2/136] that[1/2/139] text[1/3/144] are[1/2/149] repeated[1/2/153] throughout. I[4/6/174] have[2/2/176] split[1/2/181]
the[1/2/188] text[2/3/192]; as I[5/6/201] only want the[2/2/213] text[3/3/217] to[4/7/222] be[1/2/225] manipulated. Maybe it's
better to[5/7/260] split[2/2/263] it like this; So[1/3/283] anyway, I[6/6/294] basically need[3/3/306] to[6/7/311] check[2/2/314]
now[2/2/321] across my[1/3/332] string[2/3/335] whether[2/2/342] any[1/2/350] elements in[2/2/363] my[2/3/366] string[3/3/369] are[2/2/376] repeated[2/2/380],
and if so[2/3/398], how[2/3/402] many times. I've read a[3/4/428] lot about manipulating arrays[1/3/453],
but[2/2/462] they're all based on arrays[2/3/487] that[2/2/494] you create yourself, rather
than arrays[3/3/533] created by opening a[4/4/559] textfile[2/2/561], so[3/3/571] I'm not sure how[3/3/587] to[7/7/591]
manipulate my[3/3/606] array[2/2/609]. Any[2/2/616] help would be[2/2/631] much appreciated.

Empirically, this seems to work, giving zero-based offsets. The documentation for pos() is rather terse, but it says that pos() returns the position where the last match left off, which implies that your subtraction would be necessary. Strange.
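For comparison, here is a minimal sketch of how pos() relates to the match-offset arrays in a plain m//g loop, outside any s///e replacement. The sample string and the sub name match_offsets are made up for illustration. After a successful scalar-context m//g, pos($string) is the offset just past the match, so the start of the word is pos($string) - length($1), which should agree with $-[1]. Note that pos() with no argument refers to $_, so the sketch names $string explicitly.

```perl
use strict;
use warnings;

# Hypothetical helper: return [word, start, end] for each word, where
# start is derived from pos() and cross-checked against @- and @+.
sub match_offsets {
    my ($string) = @_;
    my @rows;
    while ($string =~ /([A-Za-z']+)/g) {
        my $end   = pos($string);         # offset just past the match
        my $start = $end - length($1);    # start of the captured word
        die "start mismatch" unless $start == $-[1];    # @- holds start offsets
        die "end mismatch"   unless $end   == $+[1];    # @+ holds end offsets
        push @rows, [$1, $start, $end];
    }
    return @rows;
}

printf "%-5s start=%2d end=%2d\n", @$_ for match_offsets("I need to know");
```

In this setting the subtraction is needed to get the start of the word; whether the same holds for the value pos() reports inside the replacement part of s///e is exactly what the output above calls into question.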

Cheers,

JohnGG

Re^3: Newbie Q:How do I compare items within a string? by Zaxo (Archbishop) on May 09, 2006 at 09:35 UTC
A tidier alternative to my `pos() - length($1)` is to consult `@-`.

push @{$positions{lc($1)}}, $-[1] while $string =~ /([A-Za-z']+)/g;

The difference in indexing is that your code is matching on separator characters instead of word characters. The end of your first match is the start of my second.

After Compline,
Zaxo
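
A minimal sketch of that one-liner in context, assuming a short made-up sample string; word_positions is an illustrative name, not something from the thread. Each lower-cased word maps to the list of zero-based start offsets at which it occurs.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner: map each lower-cased word
# to the zero-based start offsets of all its occurrences.
sub word_positions {
    my ($string) = @_;
    my %positions;
    push @{ $positions{ lc $1 } }, $-[1] while $string =~ /([A-Za-z']+)/g;
    return \%positions;
}

my $positions = word_positions("I need to know how to compare. I need help.");
for my $word (sort keys %$positions) {
    my @at = @{ $positions->{$word} };
    next unless @at > 1;    # report repeated words only
    printf "%s: %d times at %s\n", $word, scalar @at, join ',', @at;
}
```

Because `$-[1]` is the start of the first capture group, no arithmetic on pos() is needed.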
Re^4: Newbie Q:How do I compare items within a string? by johngg (Canon) on May 09, 2006 at 10:23 UTC
I don't think that's the difference. I `split` on separator characters when forming the array `@words`, but I negate the character class when doing the `s{ ... }{ ... }xeg` to add the annotation. Thus, like you, I am pulling out words, but by capturing one or more non-separator characters.

Cheers,

JohnGG

Update: I substituted your pattern

([A-Za-z']+)(?{++ $found{lc $1}})

for my pattern

([^.,;:?! \n]+)(?{++ $found{lc $1}})

and the results were identical.
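
One way to check this kind of equivalence, sketched under the assumption that the text contains only letters, apostrophes and the listed separators (as the sample text does). The helper name starts_for and the sample sentence are made up. It collects word and start-offset pairs for each pattern and compares the two lists.

```perl
use strict;
use warnings;

# Hypothetical helper: list "word@offset" for every match of $re in $string.
# $re is expected to contain one capture group around the word.
sub starts_for {
    my ($re, $string) = @_;
    my @hits;
    push @hits, $1 . '@' . $-[1] while $string =~ /$re/g;
    return @hits;
}

my $text = "I need to know, and I need it now; that's all I need.";

my @by_letters    = starts_for(qr/([A-Za-z']+)/,    $text);
my @by_separators = starts_for(qr/([^.,;:?! \n]+)/, $text);

print "@by_letters" eq "@by_separators" ? "identical\n" : "different\n";
```

Text containing digits or hyphens would make the two patterns diverge, since the negated class accepts those characters and the letter class does not.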