This is probably easy. If you show some work, you might get better answers. This *might* do the trick when run from the command line on a file containing the web page. If you’re on Windows, you might need to reverse or escape the quotes.
perl -0nE 'say join "\n", /data-src-hq="([^"]+)/g' {filename}
That is not robust and comes with caveats but it’s the kind of thing that might fit your problem. Assuming you described your problem properly and your data is tame. :P It’s also overly terse for a good example. If you provide a full skeleton of what you’re already doing with sample data, you’ll likely get a better solution with less idiomatic Perl involved.
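For comparison, here is roughly what that one-liner does when written out as a script. It is only a sketch: the sub name extract_hq is made up for illustration, and it slurps the whole file, so it assumes the page fits comfortably in memory.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every value of a
# data-src-hq="..." attribute found in the given text.
sub extract_hq {
    my ($html) = @_;
    # Called in list context, a /g match returns all group-1 captures.
    return $html =~ /data-src-hq="([^"]+)/g;
}

# Only read a file when a file name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    print "$_\n" for extract_hq($html);
}
```

Like the one-liner, this is still plain pattern matching with the same caveats, not HTML parsing.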
c:\@Work\Perl\monks>perl
use 5.010; # needs perl version 5.10+ for \K operator
use strict;
use warnings;
my $string = '... data-src-hq="qwe" ... '
. '... data-src-hq="asd" ... '
. '... data-src-hq="zxc" ...';
my @matches = $string =~ m{ data-src-hq=" \K [^"]+ (?= ") }xmsg;
print map "'$_' ", @matches;
__END__
'qwe' 'asd' 'zxc'
Give a man a fish: <%-{-{-{-<
use strict;
use warnings;
my $webStr = <<EOS;
data-src-hq="image location1"
<p>other stuff</p>
<p>data-src-hq="image location2"</p>
EOS
my @matches = $webStr =~ /data-src-hq="([^"]+)"/g;
print join "\n", @matches, '';
Prints:
image location1
image location2
Note the /g switch on the regex.
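To see what /g changes, here is a small side-by-side sketch. The sample string is invented for the demonstration; the regex is the one above.

```perl
use strict;
use warnings;

my $webStr = 'data-src-hq="one" data-src-hq="two"';

# Without /g, a list-context match stops after the first hit.
my @first_only = $webStr =~ /data-src-hq="([^"]+)"/;

# With /g, the same match returns every capture in the string.
my @all = $webStr =~ /data-src-hq="([^"]+)"/g;

print scalar(@first_only), " match without /g: @first_only\n";
print scalar(@all), " matches with /g: @all\n";
```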
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Thank you for the solution, may I ask a little more? I've read up on regex, but sometimes I think I have a mild reading disorder. Could you break down for me how the solution worked:
Here's how I believe I read your solution:
img src=" is the first part to match
[^"] match everything but a quote
+" stop when you hit a quote
() return only what matches within the brackets
Am also curious what's the difference between +" and +?" since both seem to work
Thank you again
SergioQ
I'm assuming you're referring to the regex in GrandFather's reply:
/data-src-hq="([^"]+)"/g
First, let me draw your attention to YAPE::Regex::Explain, which can explain regexes that do not use regex operators or features added after Perl version 5.6:
c:\@Work\Perl\monks>perl
use strict;
use warnings;
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(qr/data-src-hq="([^"]+)"/)->explain;
__END__
The regular expression:
(?-imsx:data-src-hq="([^"]+)")
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  data-src-hq="            'data-src-hq="'
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^"]+                    any character except: '"' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  "                        '"'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
There are also on-line regex explainers.
Now let me address your narration.
img src=" is the first part to match
Ok.
[^"] match everything but a quote
I would word this as match a single character from the class of all characters except a " (double-quote). It's important to realize that the [...] regex operator defines a character class or set (see Character Classes and other Special Escapes in perlre and also this topic in perlretut, perlrequick and perlrecharclass), and that all by itself, any [...] matches only a single character.
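A quick way to convince yourself that a bare [...] consumes exactly one character is to anchor it at both ends. This is just an illustrative sketch; the test strings are made up.

```perl
use strict;
use warnings;

# Anchored with \A and \z, [^"] can only match a string that is
# exactly one character long, and that character must not be a ".
my %result;
for my $t ('x', 'xy', '"') {
    $result{$t} = $t =~ /\A[^"]\z/ ? 'matches' : 'does not match';
    print "'$t' $result{$t}\n";
}
```

Only 'x' matches: 'xy' is two characters, and '"' is the one character the class excludes.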
+" stop when you hit a quote
I would quarrel with this description. The + quantifier (see Quantifiers in perlre; see also the topic of quantifiers in perlretut and perlrequick) is associated with the expression before it, i.e., [^"]+ and I would read it as match one or more characters from the class/set of all characters except a double-quote. Again, the double-quote is not directly associated with the + quantifier in your +", but see below, because the two can be related.
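Here is a small sketch of that reading, using an invented test string. The + repeats the character class in front of it; a " written after it is just a separate literal that has to match next.

```perl
use strict;
use warnings;

my $s = 'abc"def';

# [^"]+ : one or more non-quote characters; stops before the quote.
my ($run) = $s =~ /([^"]+)/;

# [^"]+" : the same run, then one literal double-quote after it.
my ($run_and_quote) = $s =~ /([^"]+")/;

print "class with +         : '$run'\n";
print "class with +, then \" : '$run_and_quote'\n";
```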
() return only what matches within the brackets
Ok.
Am also curious what's the difference between +" and +?" since both seem to work
Again, note that the + or +? quantifiers affect the preceding [^"] character class, not the double-quote that follows. In the /data-src-hq="([^"]+)"/g match regex, the final " (double-quote) is not absolutely needed because [^"]+ will match as much as possible until it either hits a " or the end of the string. (I would still tend to use it because I like the feeling of security that well-defined boundaries give me. Also, a final " in the match will prevent a match with a "runaway" quote in a string in which the closing " is missing.) However, if you use a [^"]+? "lazy" or "non-greedy" expression instead, the final " becomes vital to matching the entire contents of the double-quoted substring. Try this:
c:\@Work\Perl\monks>perl
use strict;
use warnings;
my $s = 'foo "xyzzy" bar';
print qq{+? (lazy) quantifier with final ": matched '$1' \n} if $s =~ /"([^"]+?)"/;
print qq{+? (lazy) quantifier without final ": matched '$1' \n} if $s =~ /"([^"]+?)/;
__END__
+? (lazy) quantifier with final ": matched 'xyzzy'
+? (lazy) quantifier without final ": matched 'x'
A lazy quantifier matches the minimum necessary for an overall match. A final " in the regex is necessary in this case to capture the entire quoted substring. Take a look at this and be sure you understand what's going on, i.e., the difference between lazy and greedy matching.
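As for why +" and +?" both seem to work in your case: with [^"] the class itself can never run past a quote, so greedy and lazy end up in the same place once the final " is in the regex. The difference shows when the quantified thing can match a quote, as with . (dot). A sketch with an invented string:

```perl
use strict;
use warnings;

my $s = 'say "one" then "two"';

# Greedy dot grabs as much as it can, then backs off to the LAST quote.
my ($greedy_dot) = $s =~ /"(.+)"/;

# Lazy dot takes as little as it can, stopping at the NEXT quote.
my ($lazy_dot) = $s =~ /"(.+?)"/;

# With the negated class, greedy and lazy give the same result here.
my ($greedy_cls) = $s =~ /"([^"]+)"/;
my ($lazy_cls)   = $s =~ /"([^"]+?)"/;

print "greedy .+     : '$greedy_dot'\n";
print "lazy   .+?    : '$lazy_dot'\n";
print "greedy [^\"]+  : '$greedy_cls'\n";
print "lazy   [^\"]+? : '$lazy_cls'\n";
```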
#!/usr/bin/env perl
use strict;
use warnings;
my $string = '... data-src-hq="qwe" ... '
. '... data-src-hq="asd" ... '
. '... data-src-hq="zxc" ...';
my @matches = $string =~ /data-src-hq="([^"]+)"/g;
print "@matches\n";
Output:
qwe asd zxc