Hello Monks,

OK, so this is driving me insane and I don't know why it's not matching. I am searching through a list of URLs and can't
figure out why this REGEX does not match any of the lines from the input.

Originally, the REGEX contained a list of directories or paths that appear in the URLs and that I wanted to skip over. But
when one group of that REGEX didn't match, I changed the code for testing so that it ONLY tries to match that one particular
string, and yet it still does NOT match. I'm guessing I am missing something pretty obvious; maybe my brain isn't working today...

The input file contains:
        *FYI, I trimmed down the input file so there weren't 100+ lines here...
fakesite.com/fake-url/fake_picture.jpg
http://www.google-analytics.com/collect
http://www.fakesite.com/files-png/10453229-7.png
http://assets.pinterest.com/js/pinit.js
http://assets.pinterest.com/js/pinit_main.js
http://www.fakesite.com/files-png/10455009-1.png
http://tacoda.at.atwola.com/atx/sync/addthis/addt/default?
http://epiv.cardlytics.com/c/?pr=11018&prc=550
fakesite.com/images/fake_picture.jpg
http://www.rating-system.com/webservice/RatingService.svc/GetReviews
http://www.fakesite.com/files-png/1045997098.png
http://safebrowsing-cache.google.com/safebrowsing/rd/ChFnb29nLXBo
http://dnn506yrbagrg.cloudfront.net/pages/scripts/022/131.js?39616

And below is my code. The REGEX I am having trouble with WAS inside the innermost if statement. I'm pasting that REGEX
separately right below, since it made the code hard to read with the line continuations. It seemed to work correctly except
for this one portion of it. Also, I think I found a cleaner way to write my original REGEX; could someone tell me if the two are equivalent?

Original REGEX:
.*\/css\/.*|.*\/js\/.*|.*\/images\/.*|.*\/covers\/.*|.*\/pdf\/.*|.*\/jpg\/.*|.*\/MultiTrack\/.*|.*\/files-png\/.*|.*\/newissues\/.*|.*\/files\/.*|.*\/video\/.*|.*\/mp3\/.*|.*\/audio\/.*|.*\/invoices\/.*|.*\/MYFILES\/.*|.*\/newsclubs\/.*|.*\/scorch\/.*|.*\/images\/myfile\/.*|.*\/video\/.*|.*\/mp3\/.*|.*\/audio\/.*

Is this REGEX Equivalent to the Original One Above?:
.*\/(css|js|images|covers|pdf|jpg|MultiTrack|files-png|newissues|files|video|mp3|audio|invoices|MYFILES|newsclubs|scorch|images\/myfile|video|mp3|audio)\/.*
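
One way to check whether the two patterns agree is to run both against the same strings and compare the yes/no results. The sketch below does that with a few URLs from the input file plus two made-up ones (the "docs/images.html" and "images/myfile" entries are assumptions added for illustration). It only compares whether each pattern matches, not what gets captured, and agreement on a handful of samples is evidence rather than proof.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Long form, copied from the original pattern. The /x flag is only there
# so the pattern can be wrapped across several lines.
my $long = qr{
    .*\/css\/.* | .*\/js\/.* | .*\/images\/.* | .*\/covers\/.* | .*\/pdf\/.*
  | .*\/jpg\/.* | .*\/MultiTrack\/.* | .*\/files-png\/.* | .*\/newissues\/.*
  | .*\/files\/.* | .*\/video\/.* | .*\/mp3\/.* | .*\/audio\/.*
  | .*\/invoices\/.* | .*\/MYFILES\/.* | .*\/newsclubs\/.* | .*\/scorch\/.*
  | .*\/images\/myfile\/.* | .*\/video\/.* | .*\/mp3\/.* | .*\/audio\/.*
}x;

# Condensed form, copied from the second pattern.
my $short = qr{
    .*\/(css|js|images|covers|pdf|jpg|MultiTrack|files-png|newissues|files
        |video|mp3|audio|invoices|MYFILES|newsclubs|scorch|images\/myfile
        |video|mp3|audio)\/.*
}x;

my @samples = (
    'http://www.fakesite.com/files-png/10453229-7.png',
    'fakesite.com/images/fake_picture.jpg',
    'fakesite.com/fake-url/fake_picture.jpg',
    'http://assets.pinterest.com/js/pinit.js',
    'fakesite.com/docs/images.html',      # hypothetical: name inside a filename
    'fakesite.com/images/myfile/a.png',   # hypothetical: nested listed path
);

# Collect every sample on which the two patterns give different answers.
my @disagree;
for my $url (@samples) {
    my $l = $url =~ $long  ? 1 : 0;
    my $s = $url =~ $short ? 1 : 0;
    printf "%-50s long=%d short=%d\n", $url, $l, $s;
    push @disagree, $url if $l != $s;
}

my $verdict = @disagree ? "Patterns disagree on: @disagree"
                        : "Patterns agree on all samples";
print "$verdict\n";
```

Because neither pattern is anchored, the leading and trailing .* do not change whether a string matches, so a shorter pattern of the form \/(?:css|js|...)\/ should give the same yes/no answer too.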

Quick explanation of the code: the 1st if/REGEX in the while loop makes sure I'm only checking URLs that contain the correct
domain, whether or not they begin with "http://" and/or "www.". The original 2nd REGEX was supposed to exclude/ignore any URLs
that match the domain but also contain one of the PATHs listed in the if statement. For that REGEX, I needed to include the
forward slashes "\/" at the start and end of each PATH name, since those words could also appear as something other than a PATH,
such as part of a filename.
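
The two-step check described above could be sketched roughly as follows. The sub name keep_url and the shortened directory list are placeholders for illustration, and the sample URLs are taken from the input file. The sketch uses plain boolean matches without the /g modifier, which is a deliberate difference from the script in this post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Domain check: optional "http://", optional "www.", then the domain.
my $domain = qr{^(?:http://)?(?:www\.)?fakesite\.com};

# Hypothetical, shortened list of path names to skip. The slashes on both
# sides keep a name inside a filename from counting as a directory.
my $skip_dirs = qr{/(?:css|js|images|files-png)/};

# True when the URL is on the domain and contains none of the skipped paths.
sub keep_url {
    my ($url) = @_;
    return 0 unless $url =~ $domain;       # INCLUDE: correct domain only
    return 0 if     $url =~ $skip_dirs;    # EXCLUDE: listed directories
    return 1;
}

my @urls = (
    'fakesite.com/fake-url/fake_picture.jpg',
    'http://www.google-analytics.com/collect',
    'http://www.fakesite.com/files-png/10453229-7.png',
    'http://assets.pinterest.com/js/pinit.js',
    'fakesite.com/images/fake_picture.jpg',
);

my @results = grep { keep_url($_) } @urls;
print "$_\n" for @results;
```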

Here's the code:
#!/usr/bin/perl
use strict;
use warnings;

my $input_file = "/home/User/Documents/fakeDir/test_urls.txt";
my @RESULTS;

open(INPUT, "< $input_file") or die "Error: There was an error opening the input file: $!\n\n";

my $x = 0;
while (<INPUT>) {
    my $line = "$_";
    chomp($line);

    ### INCLUDE STATEMENT:
    if ($line =~ /^(http:\/\/)?(www\.)?fakesite.com.*/g) {
        print "CHECKING --> '$line'\n";

        ### EXCLUDE STATEMENT:
        #   *Original Regex went here instead:
        #       --> if ($line !~ /$ORINGAL-REGEX/g)
        if ($line =~ /files-png/g) {
            print "\t\tFOUND IT....\n\n";
            $RESULTS[$x] = "$line";
            $x++;
        }
        else {
            print "\t\tNOT FOUND....\n\n";
        }
    }
}
close INPUT;


Here is my Output from the Code Above:
      *The 1st if REGEX seems to work just fine finding the domain but not the other...
CHECKING --> 'fakesite.com/fake-url/fake_picture.jpg'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/10453229-7.png'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/10455009-1.png'
                NOT FOUND....

CHECKING --> 'fakesite.com/images/fake_picture.jpg'
                NOT FOUND....

CHECKING --> 'http://www.fakesite.com/files-png/1045997098.png'
                NOT FOUND....

As you can see from the output above, none of the lines matched, but given the data in the input file, 3 of them should have.
The ones I think should have matched are:
1 --> http://www.fakesite.com/files-png/10453229-7.png
2 --> http://www.fakesite.com/files-png/10455009-1.png
3 --> http://www.fakesite.com/files-png/1045997098.png


Can anyone tell what I'm doing wrong with the REGEX? I figured something as simple as /files-png/g should easily match...
I am at a loss. If anyone has any thoughts or suggestions, please feel free to share.
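
A minimal sketch for narrowing this down, assuming the /g modifier on both matches plays a part. In scalar context, a successful m//g records where it stopped in pos() of the matched variable, and the next m//g on that same variable starts searching from that position. The sketch runs the same two tests on one URL from the input, saves pos($line) in between, and then repeats both tests without /g for comparison. The variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'http://www.fakesite.com/files-png/10453229-7.png';

# Same two tests as in the script, both with /g.
my $domain_g  = ($line =~ /^(http:\/\/)?(www\.)?fakesite.com.*/g) ? 1 : 0;
my $pos_after = pos($line);    # offset where the first match stopped
my $path_g    = ($line =~ /files-png/g) ? 1 : 0;

# Same two tests without /g, on a fresh copy of the string.
my $copy         = $line;
my $domain_plain = ($copy =~ /^(http:\/\/)?(www\.)?fakesite.com.*/) ? 1 : 0;
my $path_plain   = ($copy =~ /files-png/) ? 1 : 0;

my $pos_text = defined $pos_after ? $pos_after : 'undef';
printf "string length: %d\n", length $line;
printf "with /g:    domain=%d pos=%s path=%d\n", $domain_g, $pos_text, $path_g;
printf "without /g: domain=%d path=%d\n", $domain_plain, $path_plain;
```

If pos comes out equal to the string length, the trailing .* in the first pattern has consumed the whole line, leaving nothing for the second /g match to search.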

Thanks in Advance,
Matt


In reply to Why this Simple REGEX does Not Match? by mmartin
