mmartin has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Ok, so this is driving me insane and I don't know why it's not matching. I am searching through a list of URLs and can't
figure out why this REGEX does not match any of the lines from the input...?

Originally, the regex contained a list of directories or paths that were part of the URLs, which I wanted to skip over. But
when one group of the regex didn't match, I changed it for testing to ONLY match that one particular string, and
yet it still does NOT match. I'm guessing I am missing something pretty obvious, maybe my brain isn't working today....

The input file contains:
        *FYI, I trimmed down the Input file so there wasn't 100+ lines here...
fakesite.com/fake-url/fake_picture.jpg
http://www.google-analytics.com/collect
http://www.fakesite.com/files-png/10453229-7.png
http://assets.pinterest.com/js/pinit.js
http://assets.pinterest.com/js/pinit_main.js
http://www.fakesite.com/files-png/10455009-1.png
http://tacoda.at.atwola.com/atx/sync/addthis/addt/default?
http://epiv.cardlytics.com/c/?pr=11018&prc=550
fakesite.com/images/fake_picture.jpg
http://www.rating-system.com/webservice/RatingService.svc/GetReviews
http://www.fakesite.com/files-png/1045997098.png
http://safebrowsing-cache.google.com/safebrowsing/rd/ChFnb29nLXBo
http://dnn506yrbagrg.cloudfront.net/pages/scripts/022/131.js?39616

And below is my code... The REGEX I am having trouble with WAS inside the Inner-most if statement. I will just copy/paste that REGEX
right below HERE since it made the code hard to read with the line continuation stuff. This one seemed to work correctly except
for this one portion of it. Also, I think I found a cleaner way to write my Original REGEX, could someone tell me if they are equivalent?

Original REGEX:
.*\/css\/.*|.*\/js\/.*|.*\/images\/.*|.*\/covers\/.*|.*\/pdf\/.*|.*\/jpg\/.*|.*\/MultiTrack\/.*|.*\/files-png\/.*|.*\/newissues\/.*|.*\/files\/.*|.*\/video\/.*|.*\/mp3\/.*|.*\/audio\/.*|.*\/invoices\/.*|.*\/MYFILES\/.*|.*\/newsclubs\/.*|.*\/scorch\/.*|.*\/images\/myfile\/.*|.*\/video\/.*|.*\/mp3\/.*|.*\/audio\/.*

Is this REGEX Equivalent to the Original One Above?:
.*\/(css|js|images|covers|pdf|jpg|MultiTrack|files-png|newissues|files|video|mp3|audio|invoices|MYFILES|newsclubs|scorch|images\/myfile|video|mp3|audio)\/.*

Quick explanation of the code... The 1st if/REGEX in the while loop makes sure I'm only checking URLs that contain the correct
domain, whether or not they begin with "http://" and/or "www.". The original 2nd REGEX was supposed to exclude/ignore any URLs
that match the domain but then contain one of the PATHs listed in the if statement. For that REGEX, I needed to include the
forward slashes "\/" at the start and end of the PATH names since those words could possibly appear as something other than a PATH,
like within a filename or something like that...

Here's the code:
#!/usr/bin/perl

use strict;
use warnings;

my $input_file = "/home/User/Documents/fakeDir/test_urls.txt";
my @RESULTS;

open(INPUT, "< $input_file")
    or die "Error: There was an error opening the input file: $!\n\n";

my $x = 0;

while (<INPUT>) {
    my $line = "$_";
    chomp($line);

    ### INCLUDE STATEMENT:
    if ($line =~ /^(http:\/\/)?(www\.)?fakesite.com.*/g) {

        print "CHECKING --> '$line'\n";

        ### EXCLUDE STATEMENT:
        #     *Original Regex went here instead:
        #        --> if ($line !~ /$ORINGAL-REGEX/g)
        if ($line =~ /files-png/g) {
            print "\t\tFOUND IT....\n\n";
            $RESULTS[$x] = "$line";
            $x++;
        }
        else {
            print "\t\tNOT FOUND....\n\n";
        }
    }
}

close INPUT;


Here is my Output from the Code Above:
      *The 1st if REGEX seems to work just fine finding the domain but not the other...
CHECKING --> 'fakesite.com/fake-url/fake_picture.jpg'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/10453229-7.png'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/10455009-1.png'
                NOT FOUND....

CHECKING --> 'fakesite.com/images/fake_picture.jpg'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/1045997098.png'
                NOT FOUND....

As you can see, given the data in the input file, 3 of the lines in the output above should have matched. The ones I
think should have matched are:
1 --> http://www.fakesite.com/files-png/10453229-7.png
2 --> http://www.fakesite.com/files-png/10455009-1.png
3 --> http://www.fakesite.com/files-png/1045997098.png


Can anyone tell what I'm doing wrong with the regex? I figured something as simple as /files-png/g should easily match...
I am at a loss... If anyone has any thoughts or suggestions please feel free.

Thanks in Advance,
Matt

Replies are listed 'Best First'.
Re: Why this Simple REGEX does Not Match?
by choroba (Cardinal) on Mar 12, 2015 at 22:39 UTC
    The regex is fine, but the modifier is not. You're using the /g modifier in scalar context. When the match at line 19 (the INCLUDE check) succeeds, the modifier remembers the position where the match ended, which is the end of the string because the pattern ends in .*, and the match at line 26 (the EXCLUDE check) tries to start from there and fails. Remove the /g modifiers and everything will work.
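
    The effect described above can be reproduced in a few lines. This is only a minimal sketch: two_step_match is a hypothetical helper written for illustration, not part of the original script, and the dot in fakesite\.com is escaped here although the question's regex left it unescaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: runs the two checks from the question,
# with or without the /g modifier on both matches.
sub two_step_match {
    my ($line, $use_g) = @_;
    if ($use_g) {
        # Scalar-context /g: a successful match leaves pos($line) at the
        # end of the match, and the next /g match resumes from there.
        return 0 unless $line =~ /^(http:\/\/)?(www\.)?fakesite\.com.*/g;
        return $line =~ /files-png/g ? 1 : 0;
    }
    # Without /g every match starts at the beginning of the string.
    return 0 unless $line =~ /^(http:\/\/)?(www\.)?fakesite\.com.*/;
    return $line =~ /files-png/ ? 1 : 0;
}

my $url = 'http://www.fakesite.com/files-png/10453229-7.png';
print "with /g:    ", two_step_match($url, 1), "\n";   # prints 0
print "without /g: ", two_step_match($url, 0), "\n";   # prints 1
```

    The only difference between the two branches is the modifier, so the failing result can be attributed to /g rather than to the pattern itself.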
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Why this Simple REGEX does Not Match?
by roboticus (Chancellor) on Mar 12, 2015 at 22:42 UTC

    mmartin:

    If you have a regex like:

    if ($foo =~ /.*\/jpg\/.*/) { ... }

    It's looking for a string containing "/jpg/". Notice that slash at the end. It won't match "http://fakesite/blarf.jpg" because of the slash.

    By the way, using m{..regex..} allows you to avoid the backslashes used to escape your forward slashes. So, the regex m{.*/jpg/.*} is better written as m{/jpg/}. So your regex looks like it could be simplified to:

    m{(css|js|images|covers|pdf|jpg|MultiTrack|files-png|newissues|files|video|mp3|audio|invoices|MYFILES|newsclubs|scorch|images\/myfile|video|mp3|audio).*}

    assuming you're just looking for paths ending in one of the specified bits. You'd probably do well to read and understand perldoc perlre.
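
    On the question of whether the long alternation and the grouped form are equivalent: as a yes/no test they should be, as long as the surrounding slashes are kept. The sketch below is one way to check that on a few URLs. The names @dirs, matches_long and matches_short are made up for this example, and the last two sample URLs are taken from elsewhere in this thread rather than from the input file.

```perl
use strict;
use warnings;

# Directory names copied from the question, duplicates dropped.
my @dirs = qw(css js images covers pdf jpg MultiTrack files-png newissues
              files video mp3 audio invoices MYFILES newsclubs scorch
              images/myfile);

# Long form: one ".*/NAME/.*" alternative per directory, as in the original.
my $long    = join '|', map { '.*/' . quotemeta($_) . '/.*' } @dirs;
my $long_re = qr{$long};

# Short form: a single non-capturing group between two slashes.
my $alt      = join '|', map { quotemeta($_) } @dirs;
my $short_re = qr{/(?:$alt)/};

sub matches_long  { my ($url) = @_; return $url =~ $long_re  ? 1 : 0 }
sub matches_short { my ($url) = @_; return $url =~ $short_re ? 1 : 0 }

my @urls = (
    'http://www.fakesite.com/files-png/10453229-7.png',
    'fakesite.com/fake-url/fake_picture.jpg',
    'fakesite.com/images/fake_picture.jpg',
    'http://www.fakesite/jpg/blah-blah-blah',
    'http://fakesite/blarf.jpg',
);

# Print both verdicts side by side; the two columns should always agree.
for my $url (@urls) {
    printf "%d %d %s\n", matches_long($url), matches_short($url), $url;
}
```

    The leading and trailing .* add nothing to an unanchored match, which is why the short form can drop them. The group is written as (?:...) because nothing is captured here; a plain (...) would give the same yes/no answer.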

    Update: a little edit in the first couple minutes, trying to get it to read well.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Why this Simple REGEX does Not Match?
by gpapkala (Acolyte) on Mar 12, 2015 at 22:38 UTC
    Hello, just remove /g in the if ($line =~ /files-png/g)
Re: Why this Simple REGEX does Not Match? (re debug match globally pos)
by Anonymous Monk on Mar 12, 2015 at 22:53 UTC

    Add use re 'debug'; and see for yourself :) It's because you're using m//g (match globally). The global flag affects the pos the regex starts matching from; see pos.

    Code with m//g:
    perl -Mre=debug -le " $_=q{fakesite.com/f/files-url/blah.png}; if(m{fakesite.com.*}g){ warn 1; if(/files-url/g){ die 2 }} "

    Code without m//g:
    perl -Mre=debug -le " $_=q{fakesite.com/f/files-url/blah.png}; if(m{fakesite.com.*}){ warn 1; if(/files-url/){ die 2 }} "
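
    For a quieter look than the re 'debug' output, pos() can also be inspected directly. This is a rough sketch using the same sample string as the one-liners; the variable names are invented for the example.

```perl
use strict;
use warnings;

# Sample string borrowed from the one-liners above.
my $line = 'fakesite.com/f/files-url/blah.png';

# 1) Scalar-context /g match: .* runs to the end of the string,
#    so pos() is left at length($line).
my $first = $line =~ m{fakesite\.com.*}g ? 1 : 0;
my $pos_after_first = pos($line);

# 2) The next /g match starts searching at pos(), finds nothing,
#    and the failure resets pos() to undef.
my $second = $line =~ /files-url/g ? 1 : 0;
my $pos_after_second = pos($line);

# 3) With pos() reset, the same match starts at 0 again and succeeds.
my $third = $line =~ /files-url/g ? 1 : 0;

printf "first=%d pos=%s\n",  $first,  $pos_after_first;
printf "second=%d pos=%s\n", $second,
    defined($pos_after_second) ? $pos_after_second : 'undef';
printf "third=%d\n", $third;
```

    Step 3 is why the problem can look intermittent: a failed /g match clears the remembered position, so an identical match right after it behaves differently.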


Re: Why this Simple REGEX does Not Match?
by mmartin (Monk) on Mar 13, 2015 at 16:16 UTC
    Hey All, thanks for the replies!

    Ahh ha, I thought it had to be something simple that I was missing... Thanks!

    gpapkala/choroba,
    Thanks guys, I thought there was something I was missing there. I did not know that the //g modifier affected
    the start position when searching. I always just thought that it allowed you to match the pattern any number of times in a file.
    I thought that without the //g the pattern would stop after finding the 1st occurrence, but it occurs to me now that I'm searching
    one line at a time and not over the entire file at once, Duhh... Thanks!
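
    For contrast, here is roughly where //g does what it was expected to do: finding every occurrence within a single string. A small sketch, with $text, @all and @starts made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical string with the same path segment appearing twice.
my $text = 'a/js/b/js/c';

# List context: /g returns every match in the string at once.
my @all = $text =~ m{/js/}g;

# Scalar context: each iteration resumes at pos() from the previous match;
# @- holds match offsets, so $-[0] is where the current match starts.
my @starts;
while ($text =~ m{/js/}g) {
    push @starts, $-[0];
}

print scalar(@all), " matches, starting at offsets @starts\n";
# prints: 2 matches, starting at offsets 1 6
```

    For a plain "does this line contain X" test, as in the script, no modifier is needed at all.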


    roboticus,
    Thanks for the reply. Sorry, guess I didn't mention that ALL the patterns in the regex are Paths/Directories within the
    Domain's URLs I'm searching over, so it's not necessarily searching for the file extension, but the Path itself of:
          --> "http://www.fakesite/jpg/blah-blah-blah"
    But I see your point...


    AA,
    Oh cool, have not used that use statement before, I'll give it a shot, and thanks for the explanations!


    Thanks Again guys for all the replies, very much appreciated!

    Thanks Again,
    Matt

Re: Why this Simple REGEX does Not Match?
by mmartin (Monk) on Mar 13, 2015 at 20:54 UTC
    For the sake of closure, I just wanted to say I made the change removing the 'g' from //g. And it works perfectly now...

    Thanks again to everyone who commented, much appreciated!

    Thanks,
    Matt