Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I have a regex question. I have a regex in a script, and I'm having problems using it to match a new string. The problem is that the new string has several "/" in it. An example is "/strt/codes/76453". Here's the code that I'm trying to use to match it:

push @matches, $_ if /^.*?(?:\b|_)$parse1(?:\b|_).*?(?:\b|_)$parse2(?:\b|_).*?$/m;

I tried \/\ in the $parse1 file with no success. The other problem is that the #s at the end of the string are all different, so they're hard to match. I currently have nothing in $parse2. Any guidance on how to parse for the sought-after string would be greatly appreciated. Thanks in advance.

Replies are listed 'Best First'.
Re: regex question
by Athanasius (Archbishop) on Feb 28, 2017 at 06:34 UTC

    I think the problem you’re seeing is probably caused by failure to match \b. From “Assertions” in perlre#Regular-Expressions:

    A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

    So, if there is no underscore character, the expression (?:\b|_) attempts to match \b, but this fails if $parse1 begins with a forward slash, since that character is not a “word” character (unless the preceding character happens to be a “word” character, which is unlikely in this scenario).
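
    A minimal sketch of that behaviour, using the sample string from the question. The value '/strt' for $parse1 is an assumption, since the actual contents of the parse1 file were not shown:

```perl
use strict;
use warnings;

my $line   = '/strt/codes/76453';   # sample string from the question
my $parse1 = '/strt';               # assumed term, not from the original post

# \b before the leading "/" fails at the start of the string: the imaginary
# character before the string and the "/" itself both count as \W.
my $with_boundary = $line =~ /(?:\b|_)$parse1/ ? 1 : 0;

# The same term without the boundary assertion matches.
my $without_boundary = $line =~ /$parse1/ ? 1 : 0;

# A term that starts with a word character does get a boundary after the "/".
my $word_term = $line =~ /(?:\b|_)strt/ ? 1 : 0;

print "with \\b: $with_boundary, without: $without_boundary, word term: $word_term\n";
```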

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: regex question
by choroba (Cardinal) on Feb 28, 2017 at 06:18 UTC
    It's quite unclear what you're after. Anyway, if I set
    my $parse1 = qr{/}; my $parse2 = qr{\d+};

    the regex seems to match. qr is documented in perlop; it creates a regex from the enclosed string, which you can later use to match on its own or as part of a larger regex. \d matches any digit.
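
    For illustration, a small sketch of that check against the sample string from the question. The second and third test lines are made up for contrast and are not from the original post:

```perl
use strict;
use warnings;

my $parse1 = qr{/};
my $parse2 = qr{\d+};

# The regex from the question, with both compiled pieces interpolated.
my $pattern = qr/^.*?(?:\b|_)$parse1(?:\b|_).*?(?:\b|_)$parse2(?:\b|_).*?$/m;

# First line is from the question; the other two are hypothetical.
my @lines   = ( '/strt/codes/76453', '/strt/codes/', 'no slashes 123' );
my @matches = grep { $_ =~ $pattern } @lines;

print "$_\n" for @matches;
```

    Only the first line should be printed: the second has no digits for $parse2, and the third has no slash for $parse1.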

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Sorry I was unclear. Here's some more info. I have a folder with several text files. In those text files I have lines that I'm trying to match and extract. Here is an example of some of the lines from one of the files:

      /search/detail/1164321 1.html /rsearch/detail/1164327 1.html /search/detail/1164639 1.html /search/detail/1164903 1.html /search/detail/1165763 1.html /search/detail/1191549 1.html /search/detail/1195169 1.html /search/detail/1195781 1.html /search/detail/1196405 1.html /search/detail/1196439

      I have two files that my script references, parse1.txt and parse2.txt to get the strings to match. They currently look like this:

      Parse1
      http
      https

      Parse2
      .com
      .gov
      .edu

      I'm trying to use this bit of code to match strings like '/search/detail/1196439', where before I was just looking to match valid webpages that started with http or https and ended with .com, .gov, or .edu. The problem is that the leading '/' is messing me up. Here's more of my code:

      my $calls_dir2 = "$response/Bing/1Parsed/Html";
      my $parsed_dir = "$response/Bing/1Parsed/Html2";
      unless ( -d $parsed_dir ) {
          make_path( $parsed_dir, { verbose => 1, mode => 0755 } );
      }
      open( my $fh2, '<', $parse1file ) or die $!;
      chomp( my @parse_terms1 = <$fh2> );
      close($fh2);
      open( $fh2, '<', $parse2file ) or die $!;
      print "parse1file=$parse1file\n";
      print "parse2file=$parse2file\n";
      for my $parse1 (@parse_terms1) {
          seek( $fh2, 0, 0 );
          while ( my $parse2 = <$fh2> ) {
              chomp($parse2);
              print "$parse1 $parse2\n";
              my $wanted = $parse1 . $parse2;
              my @files  = glob "$calls_dir2/*.txt";
              printf "Got %d files\n", scalar @files;
              for my $file (@files) {
                  open my $in_fh, '<', $file;
                  my $basename = fileparse($file);
                  my ($prefix) = $basename =~ /^(.{9})/;
                  my $rnumber  = rand(1999);
                  print $prefix, "\n";
                  my @matches;
                  while (<$in_fh>) {
                      #push @matches, $_ if /^.*?(?:\b|_)$parse1(?:\b|_).*?(?:\b|_)$parse2(?:\b|_).*?$/m;
                      push @matches, $_ if /^.*?(?:|_)$parse1(?:|_).*?(?:|_)$parse2(?:|_).*?$/m;
                      #push @matches, $_ if m/^($parse1)$/i;
                      #push @matches, $_ if m/^'$parse1'$/i;
                      #m/^yes$/i
                  }
                  if ( scalar @matches ) {
                      make_path($parsed_dir);
                      open my $out_fh, '>', "$parsed_dir/${basename}.$wanted.$rnumber.txt" or die $!;
                      $out_fh->autoflush(1);
                      print $out_fh $_ for @matches;
                      print "$out_fh \n";
                      close $out_fh;
                  }
              }
          }
      }

      Please let me know if you have enough info now. If not, I'm more than happy to provide more. Thanks in advance for the assistance!

        This was re-posted as a root node here. Normally I might consider one of the two for reaping, but nobody has replied to this node yet, so I'm just posting this notice.

Re: regex question (updated)
by haukex (Archbishop) on Feb 28, 2017 at 08:20 UTC
    the #s at the end of the string

    When asking for help with regular expressions, please show several pieces of example input; for example, we don't know what these #s are. Even better, use the following template. It will help you while you work on the regex as well:

    use warnings;
    use strict;
    use Test::More;

    my $regex = qr/foo(.+)/;
    like "foobar", $regex;
    ok "foobar" =~ $regex;
    is $1, "bar";
    unlike "quzbaz", $regex;
    # ... lots more test cases here!

    done_testing;

    Update: Added demonstration of how to test $1 etc. to the above code.
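
    As a sketch of how the template might be filled in for this thread, here it is applied to some of the sample lines the asker posted in the follow-up. The regex itself is only a guess at what is wanted (a "/search/detail/" path followed by digits) and is not taken from the original script:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical target pattern: "/search/detail/" followed by a number.
my $regex = qr{^/search/detail/(\d+)$};

like   '/search/detail/1164321', $regex, 'detail path matches';
ok     '/search/detail/1196439' =~ $regex, 'match succeeds in this scope';
is     $1, '1196439', 'captured the trailing number';
unlike '1.html', $regex, 'bare file name does not match';
unlike '/rsearch/detail/1164327', $regex, 'different leading path does not match';

done_testing;
```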

Re: regex question
by Monk::Thomas (Friar) on Feb 28, 2017 at 09:34 UTC

    You can try to defuse $parse1 by using alternate separators, e.g.

    s{}{} or m{}

    or enclosing the string in \Q\E

    s/ \Q$string\E /  ... / or m/ \Q$string\E /
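
    A minimal sketch of both suggestions together. The variable names $term and $prefix are made up for the example; the values are taken from the sample data posted in the thread:

```perl
use strict;
use warnings;

my $term = '.com';    # sample term as listed for parse2.txt

# Interpolated as-is, "." is a metacharacter that matches any character.
my $unquoted = 'telecom' =~ m{$term\z} ? 1 : 0;

# Inside \Q...\E the "." is escaped and only matches a literal dot.
my $quoted = 'telecom' =~ m{\Q$term\E\z} ? 1 : 0;

# With m{} delimiters, a prefix full of slashes needs no hand-written backslashes.
my $prefix = '/search/detail/';
my ($id) = '/search/detail/1196439' =~ m{^\Q$prefix\E(\d+)\z};

print "unquoted: $unquoted, quoted: $quoted, id: $id\n";
```

    Note that \Q...\E treats the whole term as literal text, so it only fits if the terms in the parse files are meant as plain strings rather than as regex fragments.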