Aquilae has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks.

I have been fidgeting with code and searching online but can't seem to arrive at the correct answer for my problem.

I am given a string. I need to determine whether or not that string exists within a provided array. However, the string may only be the first part of an element within the array.

For example, if I have @WORDS = ("trying", "helping", "doing"), and I am given the string $_ = "help", I would want to return true, because the "helping" element in the @WORDS array starts with "help".

One thing to note is that the provided string must match at the beginning of a word in the array. So if there were an element "whelp" in the array, that would not resolve true.

I hope I have explained this properly and succinctly. I have tried using grep {/^$_.*/} @WORDS, but that does not work. I have also tried using ($_ =~ m/^@WORDS/) to no avail as well.

Any help on this matter would be greatly appreciated!

Re: Searching for a string within an array using regex
by AnomalousMonk (Archbishop) on Aug 14, 2014 at 16:51 UTC

    Two paths. See index, also quotemeta for  \Q \E interpolation modifiers. (Update: Note that, of course, this approach will not work if the sub-string being searched for is longer than the target string; e.g.,  'helper' will not match to anything in the  @WORDS example array below.)

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @WORDS = qw(trying helping doing do xdo help hel xhelp);
     my $s = 'help';
     ;;
     my @hits = grep m{ \A \Q$s\E }xms, @WORDS;
     dd \@hits;
     ;;
     $s = 'do';
     @hits = grep index($_, $s) == 0, @WORDS;
     dd \@hits;
     ;;
     dd \@WORDS;
    "
    ["helping", "help"]
    ["doing", "do"]
    ["trying", "helping", "doing", "do", "xdo", "help", "hel", "xhelp"]

    Update: If the strings being compared can be of any length relative to each other, here's a way to find strings that are identical at their beginnings. (This can easily be changed to find strings identical at their ends by throwing in a couple of reverse function calls.)

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @words = qw(he xhelper hello hel help helper helping);
     my $t = 'helper';
     ;;
     my @hits = grep overlap($_, $t, 3), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 4), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 6), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 7), @words;
     dd \@hits;
     ;;
     sub overlap {
       my ($s, $t, $min) = @_;
       ;;
       return ($s ^ $t) =~ m{ \A \x00{$min,} }xms;
       }
    "
    ["hello", "hel", "help", "helper", "helping"]
    ["help", "helper", "helping"]
    ["helper"]
    []
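
    The remark about reverse can be sketched roughly as follows. The name overlap_end and the word list are made up for illustration; the XOR test itself is the same one used in overlap().

```perl
use strict;
use warnings;

# Hypothetical counterpart to overlap(): counts identical characters
# from the END of both strings by reversing them first.
sub overlap_end {
    my ($s, $t, $min) = @_;

    # Scalar assignment puts reverse in scalar context, so it reverses
    # the characters of the string rather than a list.
    my $rs = reverse $s;
    my $rt = reverse $t;

    # Identical leading characters XOR to "\x00"; require at least $min.
    return ($rs ^ $rt) =~ m{ \A \x00{$min,} }xms;
}

my @words = qw(doing ping helping whelping king);   # sample data
my @hits  = grep { overlap_end($_, 'helping', 4) } @words;
print "@hits\n";
```

    With this sample list, only the words that share at least four trailing characters with 'helping' should get through.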

Re: Searching for a string within an array using regex
by Anonymous Monk on Aug 14, 2014 at 17:14 UTC
    grep {/^$_.*/} @WORDS

    Within the block of a grep, $_ is an alias for the current element of the list being inspected. So in your example, it'll be "trying", "helping" etc. A regular expression like /.../ without a target (as opposed to $string=~/.../) searches $_ by default, so with your regex, you're testing $_ against itself, which isn't what you want. (Also, while it's not wrong, /^$x.*/ can in this case be written more simply as /^$x/.)
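
    A small sketch of the difference, assuming the search string is kept in its own variable (the name $search is just for illustration):

```perl
use strict;
use warnings;

my @WORDS  = qw(trying helping doing whelp);
my $search = 'help';   # hypothetical name; the point is that it is not $_

# Inside the block, $_ is each element in turn, so /^$_/ tests every
# element against itself and every element passes.
my @self_match = grep { /^$_/ } @WORDS;

# With a separate variable, each element is compared to the search string.
my @prefix_match = grep { /^\Q$search\E/ } @WORDS;

print "self:   @self_match\n";
print "prefix: @prefix_match\n";
```

    The first grep returns the whole list unchanged, which is why the original attempt appeared to "match everything".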

    $_ =~ m/^@WORDS/

    Here, @WORDS is interpolated into the regular expression, so your regular expression actually becomes /^trying helping doing/. Also, here you'd be searching $_ for @WORDS, which is the wrong way around.
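
    To see the interpolation in action, here is a short sketch. The alternation at the end is one possible way to express "starts with any of these words", not something from the original post:

```perl
use strict;
use warnings;

my @WORDS = qw(trying helping doing);

# An array interpolates into a regex the way it does into a
# double-quoted string: elements joined with $" (a space by default).
my $re = qr/^@WORDS/;

print "matches the joined string\n" if 'trying helping doing' =~ $re;
print "does not match 'helping'\n"  unless 'helping' =~ $re;

# One way to get "starts with any element" is an explicit alternation.
my $alt    = join '|', map { quotemeta } @WORDS;
my $any_re = qr/^(?:$alt)/;

print "alternation matches 'helping'\n" if 'helping' =~ $any_re;
```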

    So your first approach with grep was good, the only mistake being with the $_ variable. Here's some code in which @result will contain only the element "helping":

    my @WORDS = ("trying", "helping", "doing", "whelp");
    my $SEARCH = "help";
    my @result = grep {/^\Q$SEARCH\E/} @WORDS;
    print "$_\n" for @result;

    (For the meaning of \Q...\E, see quotemeta.)
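
    Why the quoting matters becomes visible once the search string contains a regex metacharacter. A minimal sketch, using a URL-like prefix and a made-up candidate string:

```perl
use strict;
use warnings;

my $prefix = '/help.html';     # contains '.', a regex metacharacter

my $raw    = qr/^$prefix/;     # unquoted: '.' matches any character
my $quoted = qr/^\Q$prefix\E/; # quoted: '.' matches only a literal dot

my $candidate = '/helpXhtml';  # made-up string for illustration

print "raw pattern matches $candidate\n"    if     $candidate =~ $raw;
print "quoted pattern rejects $candidate\n" unless $candidate =~ $quoted;
```

    For plain words like "help" the two patterns behave the same, so the quoting only starts to matter with input such as file names or URLs.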

Re: Searching for a string within an array using regex
by Aquilae (Novice) on Aug 14, 2014 at 17:37 UTC

    Thank you for your responses. I just realized that I actually offered an incorrect example. Let me share some actual code so you can see what I am working with and it will make more sense.

    I read in lines from a file. The file contains one URL per line each starting with "/". For instance, "/help.html" or "/index.html" or "/help.html?ri=all". Each line gets stored into an array @line_entry.

    Once all lines have been parsed into @line_entry, I iterate through $line_entry[$i]. At each iteration I do the following:

    given ($line_entry[$i]) {
        when ($_ ~~ @INDEX_PAGE) {
            $INDEX_PAGE_COUNT++;
        }
        when ($_ =~ m/@HELP_PAGE/) {
            $HELP_PAGE_COUNT++;
        }
    }

    In the case of @INDEX_PAGE, I can get away with that type of when condition because the only index URLs that can be parsed from the file are the two that appear in the array @INDEX_PAGE = ("/", "/index.html"). No regex is required since there is no "/index.html?something". It is guaranteed to be either "/" or "/index.html".

    However, the array @HELP_PAGE = ("/help.html") is defined as such, but the entries found in the URL file may be "/help.html" or "/help.html?sometexthere". So I am trying to find the proper way to check whether an element of the array is a prefix of the string I'm currently examining, and not the other way around as I suggested in my initial post.

    Thus, if the string currently being analyzed is ANYTHING that begins with "/help.html" (the element in the @HELP_PAGE array), it should resolve true.

    Thank you again for your continued assistance!

      use warnings;
      use strict;
      use List::Util qw/first/;

      my %INDEX_PAGE = map {$_=>1} '/','/index.html';
      my @HELP_PAGE = map {qr/^\Q$_\E/} '/help.html';

      my @line_entry = ('/help.html','/index.html','/help.html?ri=all');
      for my $le (@line_entry) {
          if (exists $INDEX_PAGE{$le}) {
              print "$le is an index page\n"
          }
          elsif (first {$le=~$_} @HELP_PAGE) {
              print "$le is a help page\n"
          }
          else {
              print "$le is unknown\n"
          }
      }

      # Output:
      # /help.html is a help page
      # /index.html is an index page
      # /help.html?ri=all is a help page

      • For the exact match, I'm using a hash, since that should be faster than a brute-force search via grep or similar.
      • For the regex match, I'm using qr// to precompile the regular expressions.
      • Instead of grep, I'm using first from List::Util, because that will stop after the first match.
      • I'm staying away from given/when/~~ because those are marked experimental and their implementation may change in the future.

      How many "pages" do you have? Because if it's more than a few, your code is still going to get very repetitive, and an even more general-purpose solution will probably be more maintainable in the long run.

        Improvement: The brute-force search of the regular expressions is actually not needed; one can just precompile the whole thing as one regular expression. (If you wanted to generalize even more, the same thing could be done for the index pages as well.)

        use warnings;
        use strict;

        my %INDEX_PAGE = map {$_=>1} '/','/index.html';
        my $HELP_PAGE = join '|', map {quotemeta} '/help.html';
        $HELP_PAGE = qr/^(?:$HELP_PAGE)/;

        my @line_entry = ('/help.html','/index.html','/help.html?ri=all');
        for my $le (@line_entry) {
            if (exists $INDEX_PAGE{$le}) {
                print "$le is an index page\n"
            }
            elsif ($le=~$HELP_PAGE) {
                print "$le is a help page\n"
            }
            else {
                print "$le is unknown\n"
            }
        }
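
        If there are more than a few page types, one possible generalization is to drive everything from a table and count by label. This is only a sketch: the @PAGES layout, the exact/prefix flag, and the classify helper are all assumptions made for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical page table: [ label, exact-match flag, URLs... ].
my @PAGES = (
    [ 'index', 1, '/', '/index.html' ],   # must match the whole URL
    [ 'help',  0, '/help.html' ],         # may be followed by anything
);

# Precompile one regex per page. Exact pages get a \z anchor,
# prefix pages are left open at the end.
my @RULES = map {
    my ($label, $exact, @urls) = @$_;
    my $alt = join '|', map { quotemeta } @urls;
    [ $label, $exact ? qr/\A(?:$alt)\z/ : qr/\A(?:$alt)/ ];
} @PAGES;

# Return the label of the first rule that matches, or 'unknown'.
# Assumes the URL has already been chomp'ed.
sub classify {
    my ($url) = @_;
    for my $rule (@RULES) {
        my ($label, $re) = @$rule;
        return $label if $url =~ $re;
    }
    return 'unknown';
}

my %count;
$count{ classify($_) }++
    for '/help.html', '/index.html', '/help.html?ri=all', '/';

print "$_: $count{$_}\n" for sort keys %count;
```

        Adding a new page type then means adding one row to the table rather than another elsif branch and another counter variable.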