regex problem

grasshopper!!! has asked for the wisdom of the Perl Monks concerning the following question:

I am having trouble parsing links out of a web page's source. The regexp picks up the beginning of one similar match and the second part of another. I have tried lookbehinds and lookaheads to no avail. The false match is the first match; the rest are correct. Thanks in advance.

use strict;
use warnings;
use WWW::Curl::Easy;
my $curl = WWW::Curl::Easy->new;

$curl->setopt(CURLOPT_HEADER,1);
$curl->setopt(CURLOPT_URL, 'http://www.reddit.com/r/wallpapers.rss');
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
# Starts the actual request
my $retcode = $curl->perform;
# Looking at the results...
if ($retcode == 0){
  print("Transfer went ok\n\n");
  my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
  # judge result and next action based on $response_code
  open(F ,">/home/philip/Desktop/reg-out.txt");
  print F "$response_body\n\n";
  close(F);
  my @urls=$response_body =~ m{(http://b.thumbs.redditmedia.com/(?!png)(?<!png).+?(?!png)(?<!png)\.jpg)}gi;
  #print("Received response: $response_body\n\n");
  print scalar @urls ."\n\n";
  $" ="\n\n";
  
  print "@urls\n";

  #`feh --bg-seamless $urls[$number]`;
} else {
  # Error code, type of error, error message
  print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
}

Replies are listed 'Best First'.
Re: regex problem by AnomalousMonk (Archbishop) on Nov 03, 2015 at 22:16 UTC
What stevieb said. Handing someone a bunch of code and saying "This doesn't work. Please figure out how it should work and fix it" is not likely to be productive unless you also hand over a bunch of money.

That said, one note: In the regex

    my @urls=$response_body =~ m{(http://b.thumbs.redditmedia.com/ ... }gi;

the sub-pattern `b.thumbs.redditmedia.com` has embedded `.` (dot) metacharacters that match anything (except a newline, unless the `/s` switch is asserted, which it isn't). Here's the effect:

    c:\@Work\Perl\monks>perl -wMstrict -le "for my $str (qw(aXbXc a.b.c)) { printf qq{for '$str' }; print $str =~ m{ a.b.c }xms ? 'match' : 'NO match'; } "
    for 'aXbXc' match
    for 'a.b.c' match

Now try meta-quoting in some way, e.g.:

    c:\@Work\Perl\monks>perl -wMstrict -le "for my $str (qw(aXbXc a.b.c)) { printf qq{for '$str' }; print $str =~ m{ \Qa.b.c\E }xms ? 'match' : 'NO match'; } "
    for 'aXbXc' NO match
    for 'a.b.c' match

Give a man a fish: `<%-{-{-{-<`
Re^2: regex problem by grasshopper!!! (Beadle) on Nov 04, 2015 at 21:28 UTC
I was just wondering how to parse the jpg links out of the reddit wallpapers page, following a commandline fu bash script, just for fun. The main problem is how to reject a match that matches at the beginning and the end but, because of .+?, matches falsely in the middle. I know you can reject individual characters with a [^ahi]-type expression, but I don't know how to reject a string in the middle of a large amount of data. The following shows the problem match:

http://b.thumbs.redditmedia.com/HUX1reWBCHSIQunAgKXYkb8nXEXY6cw0cTizkTcEw4U.png

random html etc alot

http://b.thumbs.redditmedia.com/bqYiA dIiTp01k7ca6UIjpWSJqOjHGeTv7JPwko4WrEQ.jpg

Rejecting the png but matching random file names in a mess of data is what I'm struggling to do. Thanks for any help. This is just for fun, so no sweat.
Re^3: regex problem solved by grasshopper!!! (Beadle) on Nov 05, 2015 at 00:18 UTC
Have solved the problem, but I don't understand why one lookahead works and another does not.

use strict;
use warnings;
use WWW::Curl::Easy;
my $curl = WWW::Curl::Easy->new;

$curl->setopt(CURLOPT_HEADER,1);
$curl->setopt(CURLOPT_URL, 'http://www.reddit.com/r/wallpapers.rss');
my $response_body;
$curl->setopt(CURLOPT_WRITEDATA,\$response_body);
# Starts the actual request
my $retcode = $curl->perform;
# Looking at the results...
if ($retcode == 0){
  print("Transfer went ok\n\n");
  my $response_code = $curl->getinfo(CURLINFO_HTTP_CODE);
  my @urls=$response_body =~ m{(http://b.thumbs.redditmedia\.com/(?:(?!png).)*?\.jpg)}gi;
  $" ="\n\n";
  print "@urls\n";
} else {
  # Error code, type of error, error message
  print("An error happened: $retcode ".$curl->strerror($retcode)." ".$curl->errbuf."\n");
}

Thank you to all who helped.
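On why one lookahead works and the other does not: a bare `(?!png)` is tested only at the single position where it appears, so the `.+?` that follows is still free to run across "png" further along. Wrapping it as `(?:(?!png).)*?` repeats the test before every character consumed. The following is a minimal sketch of that difference, not the code from the thread; the sample string and the short file names AAA and BBB are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data modelled on the problem match quoted above.
my $html = 'http://b.thumbs.redditmedia.com/AAA.png random html '
         . 'http://b.thumbs.redditmedia.com/BBB.jpg';

# quotemeta escapes the dots so they match only literal dots.
my $host = quotemeta 'http://b.thumbs.redditmedia.com/';

# Lookahead tested once, immediately after the slash; .+? may then
# consume "png" and everything else up to the first ".jpg".
my @once = $html =~ m{($host(?!png).+?\.jpg)}gi;

# Lookahead tested before every single character, so the match
# cannot cross any occurrence of "png".
my @every = $html =~ m{($host(?:(?!png).)*?\.jpg)}gi;

print "once:  $_\n" for @once;
print "every: $_\n" for @every;
```

With this sample, `@once` holds one string that runs from the .png URL through to the end of the .jpg URL, while `@every` holds only the BBB.jpg URL. One caveat of the second pattern is that it also rejects a genuine .jpg whose file name happens to contain "png".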
Re^4: regex problem solved by AnomalousMonk (Archbishop) on Nov 05, 2015 at 00:48 UTC
Re: regex problem by stevieb (Canon) on Nov 03, 2015 at 21:14 UTC
Hey grasshopper!!!, Could you please provide us with some sample links, and then some expected output (including both urls that should match, and a couple that should not)?
Re: regex problem by ExReg (Priest) on Nov 03, 2015 at 22:30 UTC
It is unclear what you are looking for. Are you looking for .jpg files under http://b.thumbs.redditmedia.com/ that do not have png in their names?
Re^2: regex problem by ExReg (Priest) on Nov 03, 2015 at 23:36 UTC
If indeed you are looking for .jpg files under http://b.thumbs.redditmedia.com/ that do not have png in their names, it might be easiest to just get rid of your lookaheads and lookbehinds: `my @urls=$response_body =~ m{(http://b.thumbs.redditmedia.com/.+?\.jpg)}gi;` and then add another line to get rid of the entries in the just-created @urls that have png in them. Regexes that must match one thing but must not match another are prone to caveats and are not blazingly intuitive.
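A sketch of the match-then-filter idea, with one swapped-in change that should be named plainly: instead of `.+?`, the first pass uses a character class that stops at whitespace, quotes and angle brackets. With `.+?`, the first match on the sample in this thread would start at the .png URL and end at the following .jpg, so filtering out entries containing png would discard the genuine .jpg along with it. The sample markup and the file names AAA and BBB below are assumptions for illustration, not real feed content.

```perl
use strict;
use warnings;

# Hypothetical sample markup; the real feed may be laid out differently.
my $html = '<img src="http://b.thumbs.redditmedia.com/AAA.png"> random html '
         . '<img src="http://b.thumbs.redditmedia.com/BBB.jpg">';

# Step 1: collect every thumbnail URL, each one ending at the first
# whitespace, quote or angle bracket (an assumed URL boundary).
my @all = $html =~ m{(http://b\.thumbs\.redditmedia\.com/[^\s"'<>]+)}gi;

# Step 2: keep only the entries that end in .jpg.
my @jpgs = grep { /\.jpg\z/i } @all;

print "$_\n" for @jpgs;
```

Splitting the work this way keeps each regex simple: the first only decides where a URL ends, and the second only decides which URLs to keep.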
