regex question

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:


Re: regex question
by Athanasius (Archbishop) on Feb 28, 2017 at 06:34 UTC

I think the problem you’re seeing is probably caused by failure to match \b. From “Assertions” in perlre#Regular-Expressions:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

So, if there is no underscore character, the expression (?:\b|_) attempts to match \b, but this fails if $parse1 begins with a forward slash, since that character is not a “word” character, unless the preceding character happens to be a “word” character, which is unlikely in this scenario.
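To make the failure concrete, here is a minimal sketch. The values of $parse1 and the sample line are assumptions for illustration; the thread does not show the real ones.

```perl
use strict;
use warnings;

# Hypothetical values for illustration only.
my $parse1 = '/search';
my $line   = '/search/detail/1164321';

# \b needs a \w on one side and a \W on the other. At the start of the
# string the imaginary character counts as \W and '/' is also \W, so
# there is no boundary in front of the leading slash.
my $with_boundary    = $line =~ /^.*?(?:\b|_)\Q$parse1\E/ ? 1 : 0;
my $without_boundary = $line =~ /^.*?\Q$parse1\E/         ? 1 : 0;

# If a "word" character precedes the slash, \b does match there.
my $preceded = 'abc/search/detail/1' =~ /(?:\b|_)\Q$parse1\E/ ? 1 : 0;

print "with \\b: $with_boundary, without: $without_boundary, ",
      "preceded by a word character: $preceded\n";
```

Only the first pattern fails, which fits the symptom of a leading '/' never being found.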

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: regex question
by choroba (Cardinal) on Feb 28, 2017 at 06:18 UTC

my $parse1 = qr{/};
my $parse2 = qr{\d+};

With these definitions, the regex seems to match. qr is documented in perlop; it creates a regex from the enclosed string, which you can later use to match on its own or as part of a larger regex. \d matches any digit.
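As a rough sketch of both uses of qr, assuming a sample line like the ones shown elsewhere in the thread ($line, $has_slash and $id are names made up for this example):

```perl
use strict;
use warnings;

my $parse1 = qr{/};      # a literal forward slash
my $parse2 = qr{\d+};    # one or more digits

# Hypothetical input line for illustration.
my $line = '/search/detail/1164321';

# A qr// object can be used directly as a pattern ...
my $has_slash = $line =~ $parse1 ? 1 : 0;

# ... or interpolated into a larger regex, here capturing the trailing digits.
my ($id) = $line =~ m{$parse1($parse2)\z};

print "has slash: $has_slash, id: $id\n";
```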

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,


Re^2: regex question

by Anonymous Monk on Feb 28, 2017 at 16:52 UTC

Sorry I was unclear. Here's some more info. I have a folder with several text files. In those text files I have lines that I'm trying to match and extract. Here is an example of some of the lines from one of the files:

/search/detail/1164321
1.html
/rsearch/detail/1164327
1.html
/search/detail/1164639
1.html
/search/detail/1164903
1.html
/search/detail/1165763
1.html
/search/detail/1191549
1.html
/search/detail/1195169
1.html
/search/detail/1195781
1.html
/search/detail/1196405
1.html
/search/detail/1196439

I have two files that my script references, parse1.txt and parse2.txt to get the strings to match. They currently look like this:

Parse1

http
https

Parse2

.com
.gov
.edu

I'm trying to use this bit of code to match '/search/detail/1196439', where before I was only looking to match valid web pages that started with http or https and ended with .com, .gov, or .edu. The problem is that the leading '/' is messing me up. Here's more of my code:

 my $calls_dir2 = "$response/Bing/1Parsed/Html";
    
    my $parsed_dir = "$response/Bing/1Parsed/Html2";
    unless ( -d $parsed_dir  ) {
        make_path( $parsed_dir , { verbose => 1, mode => 0755 } );
    }

    open( my $fh2, '<', $parse1file ) or die $!;
    chomp( my @parse_terms1 = <$fh2> );
    close($fh2);

    open( $fh2, '<', $parse2file ) or die $!;
    
    print "parse1file=$parse1file\n";
    print "parse2file=$parse2file\n";

    for my $parse1 (@parse_terms1) {
        seek( $fh2, 0, 0 );

        while ( my $parse2 = <$fh2> ) {
            chomp($parse2);
            print "$parse1 $parse2\n";

            my $wanted = $parse1 . $parse2;

            my @files = glob "$calls_dir2/*.txt";

            printf "Got %d files\n", scalar @files;

            for my $file (@files) {

                open my $in_fh, '<', $file;
                my $basename = fileparse($file);
                my ($prefix) = $basename =~ /^(.{9})/;
                my $rnumber  = rand(1999);
                print $prefix, "\n";

                my @matches;
                while (<$in_fh>) {

                    #push @matches, $_ if /^.*?(?:\b|_)$parse1(?:\b|_).*?(?:\b|_)$parse2(?:\b|_).*?$/m;
                    
                    push @matches, $_ if /^.*?(?:|_)$parse1(?:|_).*?(?:|_)$parse2(?:|_).*?$/m;
                    
                    #push @matches, $_ if m/^($parse1)$/i;
                    #push @matches, $_ if m/^'$parse1'$/i;
                    #m/^yes$/i
                    
                }

                if ( scalar @matches ) {
                    make_path($parsed_dir);
                    open my $out_fh, '>',
                        "$parsed_dir/${basename}.$wanted.$rnumber.txt" or die $!;
                    $out_fh->autoflush(1);
                    print $out_fh $_ for @matches;
                    print "$out_fh \n";
                    close $out_fh;
                }
            }
        }
    }

Please let me know if you have enough info now. If not, I'm more than happy to provide more. Thanks in advance for the assistance!


Re^3: regex question

by haukex (Archbishop) on Mar 01, 2017 at 08:56 UTC

This was re-posted as a root node here. Normally I might consider one of the two for reaping, but nobody has replied to this node yet, so I'm just posting this notice.


Re: regex question (updated)
by haukex (Archbishop) on Feb 28, 2017 at 08:20 UTC

the #s at the end of the string

When asking for help with regular expressions, please show several pieces of example input; for example, we don't know what these #s are. Even better, use the following template; it will help you while you work on the regex as well:

use warnings;
use strict;
use Test::More;

my $regex = qr/foo(.+)/;

like "foobar", $regex;
ok "foobar" =~ $regex;
is $1, "bar";
unlike "quzbaz", $regex;
# ... lots more test cases here!

done_testing;

Update: Added demonstration of how to test $1 etc. to the above code.
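For instance, the template might be filled in like this for the '/search/detail/...' lines posted elsewhere in the thread. The pattern is only a guess at what is wanted, not a confirmed requirement:

```perl
use strict;
use warnings;
use Test::More tests => 4;

# Hypothetical pattern: the detail path followed by a numeric id.
my $regex = qr{^/search/detail/(\d+)\z};

like   '/search/detail/1164321', $regex, 'detail path matches';
ok     '/search/detail/1196439' =~ $regex, 'second path matches';
is     $1, '1196439', 'captured the numeric id';
unlike '1.html', $regex, 'file name line does not match';
```

Adding the lines that should not match (such as '1.html') as unlike cases tends to catch patterns that are too loose.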


Re: regex question
by Monk::Thomas (Friar) on Feb 28, 2017 at 09:34 UTC

You can try to defuse $parse1 by using alternate delimiters, e.g.

s{}{}

m{}

or enclosing the string in \Q\E

s/ \Q$string\E / ... /

m/ \Q$string\E /
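A small sketch of the \Q...\E approach, with made-up terms standing in for the contents of parse1.txt and parse2.txt ($parse1, $parse2 and @lines here are assumptions for the example):

```perl
use strict;
use warnings;

# Hypothetical search terms and input lines.
my $parse1 = '/search/detail/';
my $parse2 = '.com';

my @lines = ('/search/detail/1196439', '1.html', 'http://exampleXcom/');

# \Q...\E makes every character of the interpolated string literal.
my @hits = grep { /\Q$parse1\E\d+/ } @lines;

# Without quoting, '.' is a wildcard, so '.com' also matches 'Xcom'.
my $unquoted = $lines[2] =~ /$parse2/     ? 1 : 0;
my $quoted   = $lines[2] =~ /\Q$parse2\E/ ? 1 : 0;

print "hits: @hits\n";
print "unquoted: $unquoted, quoted: $quoted\n";
```

As far as I know, a slash inside an interpolated variable does not end the pattern, so the quoting is what does most of the work here; the alternate delimiters mainly help when slashes are typed directly into the pattern.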
