Need help using regex to extract multiple matches

SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

Re: Need help using regex to extract multiple matches by Your Mother (Archbishop) on Nov 26, 2019 at 06:19 UTC
This is probably easy. If you show some work, you might get better answers. This might do the trick, run from the command line on a file containing the web page. If you’re on Windows, you might need to reverse/escape the quotes.

`perl -0nE 'say join "\n", /data-src-hq="([^"]+)/g' {filename}`

That is not robust and comes with caveats, but it’s the kind of thing that might fit your problem, assuming you described your problem properly and your data is tame. :P It’s also overly terse for a good example. If you provide a full skeleton of what you’re already doing, with sample data, you’ll likely get a better solution with less idiomatic Perl involved.
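For readers who find the one-liner too terse, here is a minimal sketch of roughly the same idea written out as a script. In the one-liner, `-0` sets the input record separator to the NUL character (so a file without NUL bytes arrives as one record), `-n` wraps the code in a read loop, and `-E` enables `say`. The sketch swaps in an undefined `$/` to slurp the input instead. The helper name `extract_hq` and the sample data are assumptions made for illustration, not part of the original reply.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: slurp a filehandle and return every
# data-src-hq="..." value found in its contents.
sub extract_hq {
    my ($fh) = @_;
    local $/;                       # undefined record separator: read it all at once
    my $html = <$fh>;
    return () unless defined $html;
    return $html =~ /data-src-hq="([^"]+)/g;    # list context + /g: every capture
}

# Demo on an in-memory "file" (made-up sample data) so no input file is needed.
my $page = qq{<img data-src-hq="a.jpg">\n<img data-src-hq="b.jpg">\n};
open my $fh, '<', \$page or die "cannot open in-memory file: $!";
say join "\n", extract_hq($fh);
close $fh;
```

To run it against a real page, open the file by name instead of the in-memory string. The same caveats apply: this is a regex over tame data, not an HTML parser.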
Re: Need help using regex to extract multiple matches by AnomalousMonk (Archbishop) on Nov 26, 2019 at 07:14 UTC
And the obligatory variation using the `\K` operator (see Lookaround Assertions in Extended Patterns in perlre), introduced in Perl version 5.10:

`c:\@Work\Perl\monks>perl
use 5.010; # needs perl version 5.10+ for \K operator
use strict;
use warnings;

my $string =
  '... data-src-hq="qwe" ... ' .
  '... data-src-hq="asd" ... ' .
  '... data-src-hq="zxc" ...';

my @matches = $string =~ m{ data-src-hq=" \K [^"]+ (?= ") }xmsg;
print map "'$_' ", @matches;
__END__
'qwe' 'asd' 'zxc'`

Give a man a fish: `<%-{-{-{-<`
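As a small illustrative sketch (the variable names and sample string are made up here), the `\K` form and the capture-group form can be run side by side to see that they return the same list. With no capture groups, a `/g` match in list context returns each whole match, and `\K` trims the reported match so that it starts after the attribute prefix.

```perl
use 5.010;    # \K needs perl 5.10 or later
use strict;
use warnings;

my $string = '... data-src-hq="qwe" ... data-src-hq="asd" ...';

# Capture-group form: the parentheses select what is returned.
my @captured = $string =~ m{ data-src-hq=" ([^"]+) " }xg;

# \K form: text matched before \K is dropped from the reported match,
# and the lookahead (?= ") requires a closing quote without consuming it.
my @kept = $string =~ m{ data-src-hq=" \K [^"]+ (?= ") }xg;

print "capture: @captured\n";    # capture: qwe asd
print "\\K:      @kept\n";       # \K:      qwe asd
```

Which form to prefer is largely a matter of taste for a case this simple.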
Re: Need help using regex to extract multiple matches by GrandFather (Saint) on Nov 26, 2019 at 06:22 UTC
Something like:

`use strict;
use warnings;

my $webStr = <<EOS;
data-src-hq="image location1"
<p>other stuff</p>
<p>data-src-hq="image location2"</p>
EOS

my @matches = $webStr =~ /data-src-hq="([^"]+)"/g;

print join "\n", @matches, '';`

Prints:

`image location1
image location2`

Note the /g switch on the regex.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
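To see what the /g switch changes, here is a short sketch that runs the same regex in list context with and without it. The single-line sample string is a stand-in for the heredoc above.

```perl
use strict;
use warnings;

my $webStr = 'data-src-hq="image location1" <p>data-src-hq="image location2"</p>';

# Without /g, a list-context match returns the captures of the first match only.
my @first_only = $webStr =~ /data-src-hq="([^"]+)"/;

# With /g, it returns the captures from every match in the string.
my @all = $webStr =~ /data-src-hq="([^"]+)"/g;

print scalar(@first_only), " match without /g\n";    # 1 match without /g
print scalar(@all),        " matches with /g\n";     # 2 matches with /g
```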
Re^2: Need help using regex to extract multiple matches by SergioQ (Scribe) on Dec 02, 2019 at 03:36 UTC
Thank you for the solution, may I ask a little more? I've read up on regex, but sometimes I think I have a mild reading disorder. Could you break down for me how the solution worked?

Here's how I believe I read your solution:

`img src="` is the first part to match
`[^"]` match everything but a quote
`+"` stop when you hit a quote
`()` return only what matches within the brackets

Am also curious what's the difference between `+"` and `+?"`, since both seem to work.

Thank you again,
SergioQ
Re^3: Need help using regex to extract multiple matches by AnomalousMonk (Archbishop) on Dec 02, 2019 at 07:54 UTC
I'm assuming you're referring to the regex in GrandFather's reply: `/data-src-hq="([^"]+)"/g`

First, let me draw your attention to YAPE::Regex::Explain, which can explain regexes that do not use regex operators or features added after Perl version 5.6:

`c:\@Work\Perl\monks>perl
use strict;
use warnings;

use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new(qr/data-src-hq="([^"]+)"/)->explain;
__END__
The regular expression:

(?-imsx:data-src-hq="([^"]+)")

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  data-src-hq="            'data-src-hq="'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^"]+                    any character except: '"' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------`

There are also on-line regex explainers.

Now let me address your narration.

`img src="` is the first part to match

Ok.

`[^"]` match everything but a quote

I would word this as "match a single character from the class of all characters except a `"` (double-quote)." It's important to realize that the `[...]` regex operator defines a character class or set (see Character Classes and other Special Escapes in perlre and also this topic in perlretut, perlrequick and perlrecharclass), and that all by itself, any `[...]` matches only a single character.
`+"` stop when you hit a quote

I would quarrel with this description. The `+` quantifier (see Quantifiers in perlre; see also the topic of quantifiers in perlretut and perlrequick) is associated with the expression before it, i.e., `[^"]+`, and I would read it as "match one or more characters from the class/set of all characters except a double-quote." Again, the double-quote is not directly associated with the `+` quantifier in your `+"`, but see below, because they can be related.

`()` return only what matches within the brackets

Ok.

Am also curious what's the difference between +" and +?" since both seem to work

Again, note that the `+` or `+?` quantifiers affect the preceding `[^"]` character class, not the double-quote that follows. In the `/data-src-hq="([^"]+)"/g` match regex, the final `"` (double-quote) is not absolutely needed because `[^"]+` will match as much as possible until it either hits a `"` or the end of the string. (I would still tend to use it because I like the feeling of security that well-defined boundaries give me. Also, a final `"` in the match will prevent a match with a "runaway" quote in a string in which the closing `"` is missing.) However, if you use a `[^"]+?` "lazy" or "non-greedy" expression instead, the final `"` becomes vital to matching the entire contents of the double-quoted substring. Try this:

`c:\@Work\Perl\monks>perl
use strict;
use warnings;

my $s = 'foo "xyzzy" bar';

print qq{+? (lazy) quantifier with final ": matched '$1' \n}
  if $s =~ /"([^"]+?)"/;
print qq{+? (lazy) quantifier without final ": matched '$1' \n}
  if $s =~ /"([^"]+?)/;
__END__
+? (lazy) quantifier with final ": matched 'xyzzy'
+? (lazy) quantifier without final ": matched 'x'`

A lazy quantifier matches the minimum necessary for an overall match. A final `"` in the regex is necessary in this case to capture the entire quoted substring.
Take a look at this and be sure you understand what's going on, i.e., the difference between lazy and greedy matching.

Give a man a fish: `<%-{-{-{-<`
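One more hedged sketch on greedy versus lazy matching, using a string with two quoted substrings. The `.+` and `.+?` patterns are not taken from the thread; they are added here only to make the contrast with the negated character class visible, and the sample string is made up.

```perl
use strict;
use warnings;

my $s = 'foo "one" bar "two" baz';

# Greedy .+ grabs as much as it can, then backs off to the LAST quote.
my ($greedy) = $s =~ /"(.+)"/;         # one" bar "two

# Lazy .+? extends only until the first closing quote allows a match.
my ($lazy) = $s =~ /"(.+?)"/;          # one

# The negated class can never cross a quote, so greedy is safe here.
my ($negated) = $s =~ /"([^"]+)"/;     # one

print "greedy : $greedy\n";
print "lazy   : $lazy\n";
print "negated: $negated\n";
```

This is one reason `[^"]+` is a common choice for quoted values: the character class itself sets the boundary, so greedy and lazy quantifiers give the same result once the closing quote is in the pattern.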
Re: Need help using regex to extract multiple matches by kcott (Archbishop) on Nov 26, 2019 at 06:26 UTC
G'day SergioQ,

You capture multiple matches with a regex by using the '`g`' modifier. Example code:

`#!/usr/bin/env perl
use strict;
use warnings;

my $string = '... data-src-hq="qwe" ... '
    . '... data-src-hq="asd" ... '
    . '... data-src-hq="zxc" ...';

my @matches = $string =~ /data-src-hq="([^"]+)"/g;

print "@matches\n";`

Output:

`qwe asd zxc`

-- Ken
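The '`g`' modifier also works in scalar context, where each evaluation of the match picks up where the previous one stopped. A minimal sketch, assuming you want to handle the matches one at a time instead of collecting them in one statement (the `@found` array is introduced here only for illustration):

```perl
use strict;
use warnings;

my $string = '... data-src-hq="qwe" ... data-src-hq="asd" ... data-src-hq="zxc" ...';

my @found;
while ( $string =~ /data-src-hq="([^"]+)"/g ) {
    # $1 holds the current capture; pos() is the offset just past this match,
    # which is where the next iteration resumes searching.
    push @found, $1;
    printf "%-3s ends at offset %d\n", $1, pos($string);
}

print "@found\n";    # qwe asd zxc
```

The loop form is handy when each match needs its own processing or when the position of each match matters.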