in reply to Pattern matching and deriving the data between the "(double quotes) in HTML tag

G'day sp4rperl,

Welcome to the Monastery.

I see tybalt89 has provided a fix for your specific problem and Athanasius has provided an explanation of that fix along with some additional information.

As a general rule for matching between delimiters, consider simply finding the start delimiter and then matching everything which follows that isn't the end delimiter. So, your captures would look like ([^"]*). I find this:

Here are some quick examples showing same/different delimiter pairs matching some/no enclosed text:

    $ perl -E 'my ($s, $e) = qw{" "}; q{a"b"c} =~ /$s([^$e]*)/; say "|$1|"'
    |b|
    $ perl -E 'my ($s, $e) = qw{" "}; q{a""c} =~ /$s([^$e]*)/; say "|$1|"'
    ||
    $ perl -E 'my ($s, $e) = qw{< >}; q{a<b>c} =~ /$s([^$e]*)/; say "|$1|"'
    |b|
    $ perl -E 'my ($s, $e) = qw{< >}; q{a<>c} =~ /$s([^$e]*)/; say "|$1|"'
    ||
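The same idea can be wrapped in a small subroutine. This is only a sketch: the name `between` is hypothetical (it is not from this thread), and it assumes both delimiters are single characters. It adds `\Q...\E` so that a delimiter such as `]` or `^` is not misread as regex syntax when it is interpolated into the character class.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the text that follows
# the first $start delimiter, up to but not including the next $end
# delimiter. Returns undef if $start is not found.
# Assumption: both delimiters are single characters.
sub between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes regex metacharacters such as ']' or '^'.
    return $text =~ /\Q$start\E([^\Q$end\E]*)/ ? $1 : undef;
}

print '|', between('a"b"c', '"', '"'), "|\n";   # |b|
print '|', between('a[b]c', '[', ']'), "|\n";   # |b|
```

Like the one-liners, this does not insist that the end delimiter is actually present; append `\Q$end\E` to the pattern if an unterminated string should count as no match.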

Here are a few more examples, with embedded newlines, showing:

  1. ([^"]*) capturing text as is.
  2. (.*?) capturing nothing as is.
  3. (.*?) capturing text when the 's' modifier is added.

    $ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s([^$e]*)/; say "|$1|"'
    |b
    |
    $ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s(.*?)$e/; say "|$1|"'
    ||
    $ perl -E 'my ($s, $e) = qw{" "}; qq{a"b\n"c} =~ /$s(.*?)$e/s; say "|$1|"'
    |b
    |

When dealing with data where the enclosed text may include an escaped delimiter (e.g. "abc\"xyz") neither the (.*?) nor the ([^"]*) will work (for that example, both will capture 'abc\'). In these cases, you'll need a somewhat more complex regular expression: see perlre: Quantifiers and search for 'the typical "match a double-quoted string" problem'. [Note: You won't have this issue with HTML.]
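A minimal sketch of such an expression, along the lines of the possessive-quantifier pattern discussed in perlre, with a capture group added around the body. The variable names and the sample string are illustrative assumptions only, and possessive quantifiers need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Body of a double-quoted string: runs of characters that are neither a
# quote nor a backslash, or a backslash followed by any single character.
# The /s modifier lets an escaped newline count as "any single character".
my $quoted = qr/"((?:[^"\\]++|\\.)*+)"/s;

# Sample input (an assumption for illustration). Inside single quotes,
# \" stays as a literal backslash followed by a double quote.
my $text = 'say "abc\"xyz" now';

if ($text =~ $quoted) {
    print "|$1|\n";   # |abc\"xyz|
}
```

The capture keeps the backslash; removing the escapes afterwards is a separate step.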

— Ken