in reply to Newbie Q:How do I compare items within a string?

Instead of splitting into an intermediate array, you can extract the word positions directly from a regex scan. It goes like this:

    my $string = q(So anyway, I basically need to check now across my string whether any elements in my string are repeated, and if so, how many times. I've read alot about manipulating arrays, but they're all based on arrays that you create yourself, rather than arrays created by opening a textfile, so I'm not sure how to manipulate my array. Any help would be much appreciated.);
    my %positions;
    push @{$positions{lc($1)}}, pos() - length($1)
        while $string =~ /([A-Za-z']+)/g;
    {
        local $_;
        print "$_\t@{$positions{$_}}\n" for keys %positions;
    }

That hash gives you, for each word found, a reference to an array of its string positions in ascending order. Evaluated in scalar context, each referenced array gives that word's count.

Another advantage is that you get to say directly what a word character is, instead of defining what separates the words. I used that to include contractions (at the cost of messing up any single-quoted passages).
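
To make the counting step concrete, here is a minimal sketch that reports only the repeated words. The sample text is made up for illustration, and the sketch names the variable explicitly with pos($string), since a bare pos() consults $_ rather than $string.

```perl
use strict;
use warnings;

# Hypothetical sample text, not the string from the post.
my $string = q(the cat saw the dog and the dog saw the cat);

my %positions;
# pos($string) is the offset just past the match; subtracting the
# length of the word gives the offset where the word starts.
push @{ $positions{ lc($1) } }, pos($string) - length($1)
    while $string =~ /([A-Za-z']+)/g;

# Report only the words that occur more than once.
for my $word ( sort keys %positions ) {
    my $count = @{ $positions{$word} };    # array in scalar context = count
    next if $count < 2;
    print "$word\t$count\t@{ $positions{$word} }\n";
}
```

Each output line holds the word, its number of occurrences and its zero-based offsets, separated by tabs.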

After Compline,
Zaxo

Re^2: Newbie Q:How do I compare items within a string?
by johngg (Canon) on May 09, 2006 at 09:16 UTC
    I am not sure that you need to subtract the length of the word you have just matched from the position in your pos() - length($1). I have been playing around combining elements of your solution and TedPride's to come up with text annotated with occurrence no., total occurrences and offset. My suspicions were raised when the first word "I" came up with an offset of -1.

    Here's the code without the subtraction

    and here's the output

    Empirically, this seems to work, giving zero-based offsets. The documentation is rather terse but says that pos returns the position where the last match left off, implying that your subtraction would be necessary. Strange.

    Cheers,

    JohnGG

      A tidier alternative to my pos() - length($1) is to consult @- .

      push @{$positions{lc($1)}}, $-[1] while $string =~ /([A-Za-z']+)/g;
      The difference in indexing is that your code is matching on separator characters instead of word characters. The end of your first match is the start of my second.
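
      A minimal sketch, using a made-up sample string, comparing the two ways of getting a start offset. It also shows one possible source of a -1 offset, offered as an assumption rather than something this thread confirms: called with no argument, pos looks at $_ rather than at $string, so with $_ undefined the subtraction yields minus the word length.

      ```perl
      use strict;
      use warnings;

      # Hypothetical sample text.
      my $string = q(I think I can);

      my ( @from_pos, @from_minus, @bare );
      while ( $string =~ /([A-Za-z']+)/g ) {
          push @from_pos,   pos($string) - length($1);  # end of match minus word length
          push @from_minus, $-[1];                      # start offset of capture group 1
      }

      {
          no warnings 'uninitialized';
          local $_;    # $_ is undef here, so bare pos() returns undef
          push @bare, pos() - length($1) while $string =~ /([A-Za-z']+)/g;
      }

      print "via pos(\$string): @from_pos\n";
      print 'via @-:           ', "@from_minus\n";
      print "via bare pos():   @bare\n";
      ```

      The first two lists agree, while the third holds only negative numbers, one per word.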

      After Compline,
      Zaxo

        I don't think that's the difference. I split on separator characters when forming the array @words but I negate the character class when doing the s{ ... }{ ...}xeg to add the annotation. Thus, like you, I am pulling out words but by capturing one or more non-separator characters.

        Cheers,

        JohnGG

        Update: I substituted your pattern

        ([A-Za-z']+)(?{++ $found{lc $1}})

        for my pattern

        ([^.,;:?! \n]+)(?{++ $found{lc $1}})

        and the results were identical.
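
        Identical results are what one would expect when the text holds nothing but letters, apostrophes and the listed separators. A minimal sketch, with a made-up sample string and without the (?{ ... }) counting block, of where the two character classes would part ways:

        ```perl
        use strict;
        use warnings;

        # Hypothetical sample text chosen to include a digit and a hyphen.
        my $text = "It's 9 o'clock, a well-known time.";

        my @letters  = $text =~ /([A-Za-z']+)/g;       # word characters listed explicitly
        my @non_seps = $text =~ /([^.,;:?! \n]+)/g;    # anything that is not a separator

        print join( '|', @letters ),  "\n";
        print join( '|', @non_seps ), "\n";
        ```

        The first pattern drops the digit and splits the hyphenated word; the second keeps both as they stand.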