Re: Use global flag when declaring regular expressions with qr?
by ikegami (Patriarch) on Oct 28, 2024 at 14:59 UTC
|
if ( $row =~ $re )
with
while ( $row =~ /$re/g )
Certain flags control how a pattern is parsed or compiled. Since these are associated with the pattern, they can be provided to qr// and (?:).
These are: msixpodualn.
(It's surprising to find "p" on this list. It's weird. But "p" doesn't do anything since 5.20, so we can pretend it's not there.)
Certain flag affects how the operator itself works. These are provided to the operator itself.
For m//, these are: gc.
For s///, these are: gcer.
msixpodualn can also be provided to these, but they won't be recursively applied to embedded compiled patterns. (The same goes for qr//.)
Therefore, you want to add "g" to the match operator implied by $row =~ $re, and it would result in using $row =~ /$re/g.
Note that /$re/ doesn't cause the pattern to be stringified and re-compiled as it would be if it was interpolated into a larger pattern.
| [reply] [d/l] [select] |
Re: Use global flag when declaring regular expressions with qr?
by ysth (Canon) on Oct 28, 2024 at 14:53 UTC
|
For the flags like /g that affect the operator, not the regular expression, you need to just interpolate the re:
if ($row =~ /$re/gc) {
(Internally, I believe this is able to just use the regular expression without having to recompile it.)
You want the /c flag to not reset the match position on failure.
--
A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |
| [reply] [d/l] |
|
|
| [reply] [d/l] [select] |
|
|
| [reply] |
Re: Use global flag when declaring regular expressions with qr?
by ikegami (Patriarch) on Oct 28, 2024 at 15:47 UTC
|
Unrelated tips:
- I hope you're using use strict; use warnings; or equivalent, but simply left it out for brevity.
- It's good that you included the file name in the error message, but it would be nicer if you included the error cause ($!) too.
- next if $row eq ""; and chomp $row; don't actually achieve anything here.
- my $qfn = $ARGV[0]; would be less weird than foreach ($ARGV[0]). But why not go for the flexibility of for my $qfn ( @ARGV )? At which point, the loop and the file handling stuff can be replaced with while ( <> ).
- Adding use v5.10; or higher would allow you to use the nicer say $1; instead of print "$1\n";.
- You could use $& instead of $1 and the capturing parens. (This makes no difference.)
use v5.10;
use strict;
use warnings;
my $re = ...;
while ( <> ) {
say $1 while /$re/g;
}
| [reply] [d/l] [select] |
|
|
I hope you're using use strict; use warnings; or equivalent, but simply left it out for brevity.
Yes, of course, that's one of the very first things that I learned! The rest of the code is just a bunch of regular expressions. Although I'm not using use v5.10, didn't know it was necessary. I would've guessed that the code runs at the same version as the interpreter does.
Thank you for all these tips. With the main issue out of the way, all that's left is refactor into a nice looking script with more options. On that note, how would I implement it such that I could either provide any number of files, or read from standard input? This is what I came up with (should I open a new question for this?):
sub match_url {
my $row = shift @_;
while ($row =~ /$re/g) {
say "$&";
}
}
if ( !@ARGV || $ARGV[0] eq "-") {
@ARGV = <STDIN>;
chomp(@ARGV);
match_url(@ARGV);
} else {
# Do I need to handle the error case with this syntax?
while (<>) {
chomp $_;
match_url($_);
}
}
Again, thank you all for your input! | [reply] [d/l] [select] |
|
|
| [reply] [d/l] |
|
|
|
|
sub match_url {
my $row = shift; # No @_ required, that is the default
my @matches = ($row =~ m/$re/g) or return;
# do something with @matches
}
if (!@ARGV or $ARGV[0] eq "-") {
chomp (@ARGV = <STDIN>); # chomp can be done in one go
match_url ($_) for @ARGV; # Your match_url takes one arg, not a li
+st
}
else {
# Do I need to handle the error case with this syntax?
while (<>) {
chomp $_; # Is chomp really needed? match_url probably ignores
+ trailing EOL in $re
match_url ($_);
}
}
for vs while
($str =~ m/$re/g) (note the parens) returns all matches in one list
use 5.012003;
use warnings;
my $str = "abacadabra";
my $rex = qr{(a\w)};
say "-- say";
say for ($str =~ m/$rex/g);
say "-- while";
while ($str =~ m/$rex/g) {
say $1;
}
->
-- say
ab
ac
ad
ab
-- while
ab
ac
ad
ab
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] [select] |
|
|
|
|
|
|
|
|
|
|
while ( <> ) already defaults to STDIN.
$ t() { perl -Mv5.10 -e'while (<>) { chomp; say "<$_>"; }' "$@"; }
$ t \
<( echo 'Contents of virtual file 1' ) \
<( echo 'Contents of virtual file 2' )
<Contents of virtual file 1>
<Contents of virtual file 2>
$ echo 'Contents of virtual file' | t
<Contents of virtual file>
| [reply] [d/l] [select] |
|
|
Re: Use global flag when declaring regular expressions with qr?
by unmatched (Sexton) on Oct 28, 2024 at 15:10 UTC
|
Oh, I see. Thank you both, that seems to do the trick just fine :)
This makes me curious about something I read regarding the qr operator, and that is that it's supposed to compile the pattern provided as it's being saved into the variable. But, wouldn't this approach of enclosing it again with /$re/g, as in this case, defeat the purpose of it? Are there any penalties in terms of performance? Not that I actually care much about it right now, I'm just curious.
Thanks again!
PS: Do I need to mark this question as solved somewhere?
| [reply] [d/l] [select] |
|
|
wouldn't this approach of enclosing it again with /$re/g, as in this case, defeat the purpose of it?
No. $x =~ $y is already short for $x =~ m/$y/, so it doesn't add a penalty. And as per the last line of my answer, it's not stringified and recompiled when the entire pattern is a compiled regex, so the purpose isn't defeated.
As for marking as "solved", you just did :)
| [reply] [d/l] [select] |
|
|
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $re = qr/^ *(?:say|print)/;
my $s = qq/(?^:^ *(?:say|print))/;
open my $in, '-|', qw{ perldoc perlfunc } or die $!;
my @lines = <$in>;
push @lines, @lines for 1 .. 5;
say scalar @lines;
use Benchmark qw{ cmpthese };
my ($ir, $is);
cmpthese(-3, {
regex => sub { $ir = 0; /$re/ and ++$ir for @lines },
string => sub { $is = 0; /$s/ and ++$is for @lines },
});
$ir == $is or die;
__END__
259616
Rate regex string
regex 13.5/s -- -23%
string 17.6/s 31% --
Is it because of dereferencing of the regex object?
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] [select] |
|
|
Awesome, thanks very much. I'll be sure to read up on that!
| [reply] |
Re: Use global flag when declaring regular expressions with qr?
by Discipulus (Canon) on Oct 29, 2024 at 11:25 UTC
|
Hello and welcome to the monastery unmatched,
as I told you in the cb, better the question better answers.. and it happened :)
as was not mentioned you can also do:
# $opts = 'g'; DONT WORK
# $rex = eval qq(qr/here/$opts);
See also: /o is dead, long live qr//! and Use global flag when declaring regular expressions with qr?
L*
UPDATED because it doesnt work with g ..sorry for the noise
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
| [reply] [d/l] [select] |
|
|
Unknown regexp modifier "/g" at (eval 1) line 1, at end of line
As mentioned elsewhere, g is one of the flags that is a property of the match/substitution operator, not of the regex, and can't be used with qr.
--
A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |
| [reply] [d/l] |
|
|
| [reply] [d/l] |
|
|
Thank you, Discipulus! You were absolutely right, I tried to make a best question as I could to not be too lengthy but also avoid the back and forth of asking for more details, etc. And I have to say that I'm very grateful for the many replies and useful tips!
| [reply] |
Re: Use global flag when declaring regular expressions with qr?
by unmatched (Sexton) on Oct 30, 2024 at 17:28 UTC
|
Thank you all for your help!
I wanted to share the final script (at least at its current state) and get some feedback from you if at all possible. You can probably tell that I'm more used to writing Bash scripts, I'm sure I'm breaking a few Perl's style guidelines but here it goes. The goal
#!/bin/perl
use v5.36;
use strict;
use warnings;
use Getopt::Long qw/GetOptions/;
my %files_with_matches = ();
my %options = (
'line_numbers' => 0,
'files_with_matches' => 0,
'print_filename' => -1,
);
sub parse_args {
GetOptions(
'line-numbers' => \$options{line_numbers},
'files-with-matches' => \$options{files_with_matches},
'print-filename!' => \$options{print_filename},
);
if (!@ARGV or $ARGV[0] eq "-") {
$options{line_numbers} = 0;
$options{files_with_matches} = 0;
$options{print_filename} = 0;
} else {
my $total_files = scalar @ARGV;
# -1 indicates this option was not modified. Fallback to
# default values depending on how many files are provided.
if ($options{print_filename} == -1) {
$options{print_filename} = $total_files == 1 ? 0 : 1;
}
}
}
# see: https://www.rfc-editor.org/rfc/rfc3986#appendix-A
my $hexdig = qr/[a-f0-9]/i;
my $pct_encoded = qr/%${hexdig}{2}/;
my $unreserved = qr/[a-z0-9\-\._~]/i;
my $sub_delims = qr/[!\$&'\(\)\*\+,;=]/;
my $pchar = qr/(${unreserved}|${pct_encoded}|${sub_delims}|:|\@)/n;
my $dec_oct = qr((\d)|(\d\d)|(1\d\d)|(2[0-4]\d)|(25[0-5]))n;
my $ipv4 = qr(${dec_oct}\.${dec_oct}\.${dec_oct}\.${dec_oct});
my $h16 = qr(${hexdig}{1,4});
my $ls32 = qr((${h16} : ${h16}) | ${ipv4})xn;
my $ipv6 = qr(
( (${h16} :){6} ${ls32})
| ( :: (${h16} :){5} ${ls32})
| (( ${h16})? :: (${h16} :){4} ${ls32})
| (((${h16} :){0,1} ${h16})? :: (${h16} :){3} ${ls32})
| (((${h16} :){0,2} ${h16})? :: (${h16} :){2} ${ls32})
| (((${h16} :){0,3} ${h16})? :: (${h16} :){1} ${ls32})
| (((${h16} :){0,4} ${h16})? :: ${ls32})
| (((${h16} :){0,5} ${h16})? :: ${h16} )
| (((${h16} :){0,6} ${h16})? :: )
)xn;
my $ipvf = qr/v${hexdig}{1,}\.(${unreserved}|${sub_delims}|:){1,}/xn;
my $ip_literal = qr{\[(${ipv6}|${ipvf})\]}n;
my $re = qr(
\b(
([a-z][a-z0-9+-.]*)
://
(
(
(${unreserved}|${pct_encoded}|${sub_delims}|:)*@
)?
(
${ip_literal}
| ${ipv4}
| (${unreserved}|${pct_encoded}|${sub_delims})*
)
(:\d+)?
)
(
(\/ ( ${pchar}+(\/ ${pchar}*)*)?)
| (\/ ${pchar}*)*
| (${pchar}+(\/ ${pchar}*)*)
| (${pchar}{0})
)
(\?(${pchar}|\/|\?)*)?
(\#(${pchar}|\/|\?)*)?
)\b
)xin;
sub match_url {
my $row = shift;
my $file = shift;
while ($row =~ /$re/g) {
if ($options{files_with_matches}) {
if (! $files_with_matches{$file}) {
$files_with_matches{$file} = 1;
say $file;
}
return;
}
print "$file " if $options{print_filename};
print "$. " if $options{line_numbers};
say "$&";
}
}
sub main {
parse_args();
if (!@ARGV or $ARGV[0] eq "-") {
chomp(@ARGV = <STDIN>);
match_url($_) for @ARGV;
return;
}
# I went back to this form instead of while(<>) as I wanted to
# provide the filename to the match_url subroutine. This was the
# only way I knew how to do that.
foreach (@ARGV) {
open my $fh, '<', $_ or die("Error opening file $_ ($!)\n");
while (my $row = <$fh>) {
chomp $row;
match_url($row, $_);
}
close($fh);
}
}
main();
EDIT:
Fixed the issue, script works now perfectly (as far as I can tell anyway :D) | [reply] [d/l] |