Searching for a string within an array using regex

Aquilae has asked for the wisdom of the Perl Monks concerning the following question:

Re: Searching for a string within an array using regex by AnomalousMonk (Archbishop) on Aug 14, 2014 at 16:51 UTC
Two paths. See index, also quotemeta for `\Q \E` interpolation modifiers. (Update: Note that, of course, this approach will not work if the sub-string being searched for is longer than the target string; e.g., `'helper'` will not match to anything in the `@WORDS` example array below.)

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @WORDS = qw(trying helping doing do xdo help hel xhelp);
     my $s = 'help';
     ;;
     my @hits = grep m{ \A \Q$s\E }xms, @WORDS;
     dd \@hits;
     ;;
     $s = 'do';
     @hits = grep index($_, $s) == 0, @WORDS;
     dd \@hits;
     ;;
     dd \@WORDS;
    "
    ["helping", "help"]
    ["doing", "do"]
    ["trying", "helping", "doing", "do", "xdo", "help", "hel", "xhelp"]

Update: If the strings being compared can be of any length relative to each other, here's a way to find strings that are identical at their beginnings. (This can easily be changed to find strings identical at their ends by throwing in a couple of reverse function calls.)

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @words = qw(he xhelper hello hel help helper helping);
     my $t = 'helper';
     ;;
     my @hits = grep overlap($_, $t, 3), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 4), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 6), @words;
     dd \@hits;
     ;;
     @hits = grep overlap($_, $t, 7), @words;
     dd \@hits;
     ;;
     sub overlap {
       my ($s, $t, $min) = @_;
       ;;
       return ($s ^ $t) =~ m{ \A \x00{$min,} }xms;
       }
    "
    ["hello", "hel", "help", "helper", "helping"]
    ["help", "helper", "helping"]
    ["helper"]
    []
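The remark about reverse can be made concrete. What follows is only a sketch of one way to do it: `overlap` is copied from the reply above, while `overlap_end` and the word list are hypothetical names and data introduced for illustration. It assumes plain byte strings, since the string XOR trick is not meant for wide characters.

```perl
use strict;
use warnings;

# Same technique as above: XOR the two strings and require at least
# $min leading NUL bytes, i.e. at least $min identical leading characters.
sub overlap {
    my ($s, $t, $min) = @_;
    return ($s ^ $t) =~ m{ \A \x00{$min,} }xms;
}

# Hypothetical helper: compare the ends instead by reversing both
# strings before handing them to overlap().
sub overlap_end {
    my ($s, $t, $min) = @_;
    return overlap(scalar reverse($s), scalar reverse($t), $min);
}

# Illustrative data: keep the words sharing at least 3 trailing
# characters with 'helping'.
my @words = qw(ping helping help sing xing);
my @hits  = grep { overlap_end($_, 'helping', 3) } @words;
print "@hits\n";
```

Reversing both arguments turns a common suffix into a common prefix, so the original `overlap` can be reused unchanged.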
Re: Searching for a string within an array using regex by Anonymous Monk on Aug 14, 2014 at 17:14 UTC
`grep {/^$_./} @WORDS`

Within the block of a grep, `$_` is an alias for the current element of the list being inspected. So in your example, it'll be "trying", "helping" etc. A regular expression like `/.../` without a target (as opposed to `$string=~/.../`) searches `$_` by default, so with your regex, you're testing `$_` against itself, which isn't what you want. (Also, while it's not wrong, `/^$x./` can in this case be written more simply as `/^$x/`.)

`$_ =~ m/^@WORDS/`

Here, `@WORDS` is interpolated into the regular expression, so your regular expression actually becomes `/^trying helping doing/`. Also, here you'd be searching `$_` for `@WORDS`, which is the wrong way around.

So your first approach with grep was good, the only mistake being with the `$_` variable. Here's some code in which `@result` will contain only the element `"helping"`:

    my @WORDS = ("trying", "helping", "doing", "whelp");
    my $SEARCH = "help";
    my @result = grep {/^\Q$SEARCH\E/} @WORDS;
    print "$_\n" for @result;

(For the meaning of `\Q...\E`, see quotemeta.)
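Both points above (array interpolation and the implicit `$_` target) can be checked with a small script. This is just an illustrative sketch; the test strings are made up for the demonstration.

```perl
use strict;
use warnings;

my @WORDS = ("trying", "helping", "doing");

# An array interpolated into a regex is joined with the list separator $",
# which defaults to a single space, giving /^trying helping doing/.
my $pattern = qr/^@WORDS/;
print "matches the joined string\n"
    if "trying helping doing" =~ $pattern;
print "does not match a single word\n"
    unless "helping" =~ $pattern;

# A match with no explicit target is applied to $_.
for (@WORDS) {
    print "$_ starts with 'do'\n" if /^do/;
}
```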
Re: Searching for a string within an array using regex by Aquilae (Novice) on Aug 14, 2014 at 17:37 UTC
Thank you for your responses. I just realized that I actually offered an incorrect example. Let me share some actual code so you can see what I am working with and it will make more sense.

I read in lines from a file. The file contains one URL per line, each starting with "/". For instance, "/help.html" or "/index.html" or "/help.html?ri=all". Each line gets stored into an array `@line_entry`. Once all lines have been parsed into `@line_entry`, I iterate through `$line_entry[$i]`. At each iteration I do the following:

    given ($line_entry[$i]) {
        when ($_ ~~ @INDEX_PAGE) { $INDEX_PAGE_COUNT++; }
        when ($_ =~ m/@HELP_PAGE/) { $HELP_PAGE_COUNT++; }
    }

In the case of `@INDEX_PAGE`, the reason I can get away with using that type of when condition is that the only possible URLs that can be parsed from the file are the two that appear in the array `@INDEX_PAGE = ("/", "/index.html")`. There is no regex required since there is no "/index.html?something". It is guaranteed to be either "/" or "/index.html".

However, the array `@HELP_PAGE = ("/help.html")` is defined as such, but the entries that can be found in the URL file may be "/help.html" or they could be "/help.html?sometexthere". So I am trying to find the proper means to see if an element in an array is a prefix of the string I'm currently examining, and not the other way around as I suggested in my initial post. Thus, if the string currently being analyzed is ANYTHING that begins with "/help.html" (the element in the `@HELP_PAGE` array), it should resolve true.

Thank you again for your continued assistance!
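The reversed check described here, where an array element must be a prefix of the current line, can be sketched without a regex by using index. `is_help_page` is a hypothetical helper name and the sample URLs are taken from the examples in the post; this is only one possible way to express the test.

```perl
use strict;
use warnings;

my @HELP_PAGE = ('/help.html');

# Hypothetical helper: true if any entry of @HELP_PAGE is a prefix of $url.
# index() returns 0 when the entry is found at the very start of $url.
sub is_help_page {
    my ($url) = @_;
    return scalar grep { index($url, $_) == 0 } @HELP_PAGE;
}

for my $url ('/help.html', '/help.html?ri=all', '/index.html') {
    printf "%-20s %s\n", $url, is_help_page($url) ? 'help' : 'not help';
}
```

Because index compares literal characters, no escaping of `.` or `?` in the URLs is needed.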
Re^2: Searching for a string within an array using regex by Anonymous Monk on Aug 14, 2014 at 18:02 UTC
    use warnings;
    use strict;
    use List::Util qw/first/;

    my %INDEX_PAGE = map {$_=>1} '/','/index.html';
    my @HELP_PAGE = map {qr/^\Q$_\E/} '/help.html';

    my @line_entry = ('/help.html','/index.html','/help.html?ri=all');

    for my $le (@line_entry) {
        if (exists $INDEX_PAGE{$le})
            { print "$le is an index page\n" }
        elsif (first {$le=~$_} @HELP_PAGE)
            { print "$le is a help page\n" }
        else
            { print "$le is unknown\n" }
    }

    # Output:
    # /help.html is a help page
    # /index.html is an index page
    # /help.html?ri=all is a help page

For the exact match, I'm using a hash, since that should be faster than a brute-force search via grep or similar. For the regex match, I'm using `qr//` to precompile the regular expressions. Instead of grep, I'm using `first` from List::Util, because that will stop after the first match. I'm staying away from `given`/`when`/`~~` because those are marked experimental and their implementation may change in the future.

How many "pages" do you have? Because if it's more than a few, your code is still going to get very repetitive, and an even more general-purpose solution will probably be more maintainable in the long run.
Re^3: Searching for a string within an array using regex by Anonymous Monk on Aug 14, 2014 at 18:14 UTC
Improvement: The brute-force search of the regular expressions is actually not needed; one can just precompile the whole thing as one regular expression. (If you wanted to generalize even more, the same thing could be done for the index pages as well.)

    use warnings;
    use strict;

    my %INDEX_PAGE = map {$_=>1} '/','/index.html';
    # Join with a bare '|' so that multiple entries become regex alternatives
    my $HELP_PAGE = join '|', map {quotemeta} '/help.html';
    $HELP_PAGE = qr/^(?:$HELP_PAGE)/;

    my @line_entry = ('/help.html','/index.html','/help.html?ri=all');

    for my $le (@line_entry) {
        if (exists $INDEX_PAGE{$le})
            { print "$le is an index page\n" }
        elsif ($le=~$HELP_PAGE)
            { print "$le is a help page\n" }
        else
            { print "$le is unknown\n" }
    }
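As a rough sketch of the generalization mentioned above, the index pages can get a precompiled regex too, with every category kept in one table. The names `build_matcher`, `@CATEGORIES` and `classify`, and the extra `/other` URL, are hypothetical and only serve the illustration. The counters mirror what the original question was trying to collect.

```perl
use strict;
use warnings;

# Build one anchored regex from a list of literal strings.
# If $exact is true the whole string must match, otherwise only its beginning.
sub build_matcher {
    my ($exact, @literals) = @_;
    my $alt = join '|', map { quotemeta } @literals;
    return $exact ? qr/\A(?:$alt)\z/ : qr/\A(?:$alt)/;
}

# Hypothetical category table, checked in order.
my @CATEGORIES = (
    [ index => build_matcher(1, '/', '/index.html') ],
    [ help  => build_matcher(0, '/help.html') ],
);

# Return the name of the first category whose regex matches, or 'unknown'.
sub classify {
    my ($url) = @_;
    for my $cat (@CATEGORIES) {
        my ($name, $re) = @$cat;
        return $name if $url =~ $re;
    }
    return 'unknown';
}

# Count how many URLs fall into each category.
my %count;
$count{ classify($_) }++
    for '/help.html', '/index.html', '/help.html?ri=all', '/other';
print "$_: $count{$_}\n" for sort keys %count;
```

Adding another page type then means adding one row to the table rather than another `elsif` branch.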