regex not matching how I want it to :(

glwa has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex not matching how I want it to :( by Corion (Patriarch) on Oct 18, 2018 at 19:35 UTC
Are you sure that you are running the code that you posted? Because for me your code works and matches each `href=` attribute separately. Also see Regexp::Debugger for interactively stepping through a regular expression to see how it behaves.
Re^2: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 20:46 UTC
After some trial and error I found a solution myself: `$line=~/<a href=\"([^<br>]*?)\.htm\">/ig;`
Re^3: regex not matching how I want it to :( by Laurent_R (Canon) on Oct 18, 2018 at 22:25 UTC
I don't see why the code in your OP did not work; it seems to me it should. And you're saying that your latest code matches what you want (though I don't think it does so for the right reason, it is probably a happy coincidence; more on this below). So there isn't much more help to be provided, since your problem is solved. I would suggest, however, that you avoid quantifiers such as * or + with the match-all dot (i.e. the `.*` and `.+` patterns) when possible, and even `.*?` and `.+?`, although these latter two are much less dangerous in terms of matching more than you want. It is often safer and better to be more specific about the characters you want to match, using the appropriate character class. In the case in point, a regex like `/<a href=\"(\w*?)\.htm\">/ig` would probably be safer because, with `\w+` or `\w*`, you're guaranteed to match only alphanumeric characters (and possibly underscores), so you know for sure that you're not gonna match any tag-opening or tag-closing characters (angle brackets), backslashes, quote marks, etc. As for the latest code you posted, `[^<br>]*?` doesn't do what you probably think it does, even though you report that the result happens to be what you want. `[^<br>]` is a negated character class that matches everything except the following individual characters: `< > b r`. I somewhat doubt that is what you really meant. The fact that it contains the `<` character (and will therefore stop matching at the first tag-opening character) is probably enough to save your day in this specific case, but be aware that it won't work if the string you want to retrieve contains any "b" or any "r". HTH.
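To make the point about `[^<br>]` concrete, here is a small sketch, not code from the thread. The helper `stems`, the sample link `bar.htm`, and the alternative class `[^"]*` are my own assumptions for illustration. It runs glwa's character class and a class that merely stops at the closing quote over the same input:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect every capture that the
# given compiled regex finds in $html.
sub stems {
    my ($re, $html) = @_;
    my @found;
    while ($html =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

# Sample input; "bar.htm" is an assumed name chosen because it contains
# both a "b" and an "r".
my $html = '<a href="test1.htm"> test1</a><br> <a href="bar.htm"> bar</a><br>';

# glwa's class: excludes the four individual characters <, b, r and >.
my $br_class = qr/<a href="([^<br>]*?)\.htm">/i;

# Assumed alternative: anything up to the closing quote of the attribute.
my $quote_class = qr/<a href="([^"]*)\.htm">/i;

print join(',', stems($br_class, $html)), "\n";     # prints "test1"
print join(',', stems($quote_class, $html)), "\n";  # prints "test1,bar"
```

The first pattern silently skips `bar.htm` because the class refuses the letters "b" and "r", while the second one picks up both links.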
Re^2: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 19:47 UTC
I spent so many hours on this single issue, and so many hours in total with Perl today, that no, I am no longer sure I am running the code I posted. I have a string with multiple "a href" and I want to replace each one of them with a different URL. I have a feeling that the while loop is causing some problem here.
Re^3: regex not matching how I want it to :( by Corion (Patriarch) on Oct 18, 2018 at 19:56 UTC
Maybe the following program helps you debug the regex: `#!perl -w use strict; # use Regexp::Debugger; for my $line (<DATA>) { while ( $line=~/<a href=\"(.*?)\.htm\">/ig ) { print "$1\n"; }; } __DATA__ <a href="test1.htm"> test1</a><br> <a href="test2.htm"> test2</a><br><a href="test3.htm"> test3</a><br>` For me this outputs: `test1 test2 test3`
Re^3: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 20:10 UTC
Okay, once again, sorry, I am really tired. I stripped all the unnecessary stuff out of my code and this is what I came up with; the result is very strange to me: `$line='<p><a href="test1.htm"> test1</a><br> <a href="test2.htm"> test2</a><br> <a href="test3.htm"> test3</a><br> <a href="test4.htm"> test4</a><br>'; while ( $line=~/<a href=\"(.*?)\.htm\">/ig ) { $tmp=$1; print "LINE: $line\n"; print "TMP: $tmp KK\n\n"; $line=~s/<a href=\"$tmp.htm\">/<a href=\"\/xxx.html\">/i; }`
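One plausible reading of the strange result (my interpretation, not something confirmed in the replies shown here): the `s///` inside the loop modifies `$line`, and modifying the target string resets the `//g` search position. The next iteration therefore scans from the start again, the already rewritten `/xxx.html"` no longer satisfies `\.htm\"`, and `.*?` stretches on to the next `.htm">`. A minimal sketch of a way around it is to do the whole rewrite in a single `s///g` pass. The helper names `new_url` and `rewrite_links` and the URL scheme are hypothetical, since the thread does not say how the new URLs are chosen:

```perl
use strict;
use warnings;

# Hypothetical mapping from the old link stem to the new URL; only a
# placeholder, the real rule is not given in the thread.
sub new_url {
    my ($stem) = @_;
    return "/$stem.html";
}

# Rewrite every <a href="NAME.htm"> in one pass. s///g walks the string
# itself, so nothing modifies $line behind a running while (//g) loop.
# The /e flag evaluates the replacement as Perl code for each match.
sub rewrite_links {
    my ($line) = @_;
    $line =~ s{<a href="([^"]*)\.htm">}{'<a href="' . new_url($1) . '">'}gie;
    return $line;
}

my $line = '<p><a href="test1.htm"> test1</a><br> <a href="test2.htm"> test2</a><br>';
print rewrite_links($line), "\n";
# prints: <p><a href="/test1.html"> test1</a><br> <a href="/test2.html"> test2</a><br>
```

The class `[^"]*` is used instead of `.*?` so that a capture can never run past the closing quote of the attribute.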
Re^4: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 20:25 UTC
Re^5: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 20:28 UTC
Some notes below your chosen depth have not been shown here
Re: regex not matching how I want it to :( by haukex (Archbishop) on Oct 19, 2018 at 08:00 UTC
I would very much recommend not parsing HTML with regular expressions - see Parsing HTML/XML with Regular Expressions. For example, see HTML::LinkExtor: `use HTML::LinkExtor; my $p = HTML::LinkExtor->new(); $p->parse_file("test.html"); my @hrefs = map { {@$_[1..$#$_]}->{href} } $p->links;` On your sample data, `@hrefs` is `("test1.htm", "test2.htm", "test3.htm")`.
Re: regex not matching how I want it to :( by glwa (Acolyte) on Oct 18, 2018 at 19:30 UTC
OMG, I do not know what happened with my post formatting, sorry. `while ( $line=~/<a href=\"(.*?)\.htm\">/ig ) {` on `<a href="test1.htm"> test1</a><br> <a href="test2.htm"> test2</a><br> <a href="test3.htm"> test3</a><br>` is matching `test1"> test1</a><br> <a href="test2` and not `test2`. OK, someone corrected my formatting in the original post, thank you.