unmatched has asked for the wisdom of the Perl Monks concerning the following question:

Hello!

I'm just learning Perl and as an exercise I'm writing a small script that reads a file and outputs every URL it can find. I know there are modules for this, this is just for learning purposes, and to brush up on regular expressions. I came up with the following regular expression:

my $re = qr( ( (?:[a-z][a-z0-9+-.]*) :// (?: (?: [a-z0-9._~!:;=&'\$\(\)\*\+\-\,]+@ )? (?: \[${ipv6}\] | ${ipv4} | [a-z0-9._~!;=&'\$\(\)\*\+\-\,]+ ) ) ) )xi; foreach ($ARGV[0]) { open my $fh, '<', $_ or die("Error opening file $_.\n"); while (my $row = <$fh>) { chomp $row; next if $row eq ""; if ($row =~ $re) { print "$1\n"; } } close($fh); }

As you can see, I'm using qr to define the regular expression, as it's composed of other regular expressions defined in the code (omitted here for brevity). This gives me the most flexibility to later on refactor this script to make it more general purpose, or at least that is the idea.

The file is read line by line, comparing against $re, and correctly printing the first URL it finds on that line. And that's the issue, it only finds the first match even when there are multiple URLs on that line. Typically, this is where I'd use the global flag, except that apparently I cannot use it with qr as I get an error: Unknown regexp modifier "/g".

I've been reading about this but haven't been able to figure out a way to search the entire line to capture all matches. I tried using the s flag, different delimiters for qr, in case that made any difference, and of course tried modifying $re to use operators like + and *, but without any results.

So, I don't know if I'm misunderstanding the problem that I need to solve, or I just don't know enough about Perl to use it effectively. I would say the issue is that declaring regular expressions with qr is not what I need for this particular case but I'm just not sure. Any ideas? Thank you!

Replies are listed 'Best First'.
Re: Use global flag when declaring regular expressions with qr?
by ikegami (Patriarch) on Oct 28, 2024 at 14:59 UTC

    You would replace

    if ( $row =~ $re )
    with
    while ( $row =~ /$re/g )

    Certain flags control how a pattern is parsed or compiled. Since these are associated with the pattern, they can be provided to qr// and (?:).

    These are: msixpodualn.

    (It's surprising to find "p" on this list. It's weird. But "p" doesn't do anything since 5.20, so we can pretend it's not there.)


    Certain flag affects how the operator itself works. These are provided to the operator itself.

    For m//, these are: gc.

    For s///, these are: gcer.

    msixpodualn can also be provided to these, but they won't be recursively applied to embedded compiled patterns. (The same goes for qr//.)


    Therefore, you want to add "g" to the match operator implied by $row =~ $re, and it would result in using $row =~ /$re/g.

    Note that /$re/ doesn't cause the pattern to be stringified and re-compiled as it would be if it was interpolated into a larger pattern.

Re: Use global flag when declaring regular expressions with qr?
by ysth (Canon) on Oct 28, 2024 at 14:53 UTC
    For the flags like /g that affect the operator, not the regular expression, you need to just interpolate the re:
    if ($row =~ /$re/gc) {
    (Internally, I believe this is able to just use the regular expression without having to recompile it.)

    You want the /c flag to not reset the match position on failure.

    --
    A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |

      Good job catching that if ( $row =~ /$re/g ) doesn't make sense. But the solution to finding all matches is replacing the if with a while, not adding "c".

        Oh, right. Not sure what I was thinking.
        --
        A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |
Re: Use global flag when declaring regular expressions with qr?
by ikegami (Patriarch) on Oct 28, 2024 at 15:47 UTC

    Unrelated tips:

    • I hope you're using use strict; use warnings; or equivalent, but simply left it out for brevity.
    • It's good that you included the file name in the error message, but it would be nicer if you included the error cause ($!) too.
    • next if $row eq ""; and chomp $row; don't actually achieve anything here.
    • my $qfn = $ARGV[0]; would be less weird than foreach ($ARGV[0]). But why not go for the flexibility of for my $qfn ( @ARGV )? At which point, the loop and the file handling stuff can be replaced with while ( <> ).
    • Adding use v5.10; or higher would allow you to use the nicer say $1; instead of print "$1\n";.
    • You could use $& instead of $1 and the capturing parens. (This makes no difference.)
    use v5.10; use strict; use warnings; my $re = ...; while ( <> ) { say $1 while /$re/g; }
      I hope you're using use strict; use warnings; or equivalent, but simply left it out for brevity.

      Yes, of course, that's one of the very first things that I learned! The rest of the code is just a bunch of regular expressions. Although I'm not using use v5.10, didn't know it was necessary. I would've guessed that the code runs at the same version as the interpreter does.

      Thank you for all these tips. With the main issue out of the way, all that's left is refactor into a nice looking script with more options. On that note, how would I implement it such that I could either provide any number of files, or read from standard input? This is what I came up with (should I open a new question for this?):

      sub match_url { my $row = shift @_; while ($row =~ /$re/g) { say "$&"; } } if ( !@ARGV || $ARGV[0] eq "-") { @ARGV = <STDIN>; chomp(@ARGV); match_url(@ARGV); } else { # Do I need to handle the error case with this syntax? while (<>) { chomp $_; match_url($_); } }

      Again, thank you all for your input!

        The use VERSION says that the perl running the code must be at least that version and enables the bundle of features recommended for that version. For 5.10, this is bareword_filehandles, indirect, multidimensional, say, state, and switch. See feature for details.
        --
        A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |

        Some operations can be shortened even further. It is up to the reader to decide if that increases or decreases maintainability :)

        sub match_url { my $row = shift; # No @_ required, that is the default my @matches = ($row =~ m/$re/g) or return; # do something with @matches } if (!@ARGV or $ARGV[0] eq "-") { chomp (@ARGV = <STDIN>); # chomp can be done in one go match_url ($_) for @ARGV; # Your match_url takes one arg, not a li +st } else { # Do I need to handle the error case with this syntax? while (<>) { chomp $_; # Is chomp really needed? match_url probably ignores + trailing EOL in $re match_url ($_); } }

        for vs while

        ($str =~ m/$re/g) (note the parens) returns all matches in one list

        use 5.012003; use warnings; my $str = "abacadabra"; my $rex = qr{(a\w)}; say "-- say"; say for ($str =~ m/$rex/g); say "-- while"; while ($str =~ m/$rex/g) { say $1; } -> -- say ab ac ad ab -- while ab ac ad ab

        Enjoy, Have FUN! H.Merijn
        while ( <> ) already defaults to STDIN.
        $ t() { perl -Mv5.10 -e'while (<>) { chomp; say "<$_>"; }' "$@"; } $ t \ <( echo 'Contents of virtual file 1' ) \ <( echo 'Contents of virtual file 2' ) <Contents of virtual file 1> <Contents of virtual file 2> $ echo 'Contents of virtual file' | t <Contents of virtual file>
Re: Use global flag when declaring regular expressions with qr?
by unmatched (Sexton) on Oct 28, 2024 at 15:10 UTC

    Oh, I see. Thank you both, that seems to do the trick just fine :)

    This makes me curious about something I read regarding the qr operator, and that is that it's supposed to compile the pattern provided as it's being saved into the variable. But, wouldn't this approach of enclosing it again with /$re/g, as in this case, defeat the purpose of it? Are there any penalties in terms of performance? Not that I actually care much about it right now, I'm just curious.

    Thanks again!

    PS: Do I need to mark this question as solved somewhere?

      wouldn't this approach of enclosing it again with /$re/g, as in this case, defeat the purpose of it?

      No. $x =~ $y is already short for $x =~ m/$y/, so it doesn't add a penalty. And as per the last line of my answer, it's not stringified and recompiled when the entire pattern is a compiled regex, so the purpose isn't defeated.

      As for marking as "solved", you just did :)

        > it's not stringified and recompiled

        Interestingly, using a string is still faster than using a compiled regex (even when I wrap the string in (?:^...) like the compilation does).

        #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $re = qr/^ *(?:say|print)/; my $s = qq/(?^:^ *(?:say|print))/; open my $in, '-|', qw{ perldoc perlfunc } or die $!; my @lines = <$in>; push @lines, @lines for 1 .. 5; say scalar @lines; use Benchmark qw{ cmpthese }; my ($ir, $is); cmpthese(-3, { regex => sub { $ir = 0; /$re/ and ++$ir for @lines }, string => sub { $is = 0; /$s/ and ++$is for @lines }, }); $ir == $is or die; __END__ 259616 Rate regex string regex 13.5/s -- -23% string 17.6/s 31% --

        Is it because of dereferencing of the regex object?

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
        Awesome, thanks very much. I'll be sure to read up on that!
Re: Use global flag when declaring regular expressions with qr?
by Discipulus (Canon) on Oct 29, 2024 at 11:25 UTC
    Hello and welcome to the monastery unmatched,

    as I told you in the cb, better the question better answers.. and it happened :)

    as was not mentioned you can also do:

    # $opts = 'g'; DONT WORK # $rex = eval qq(qr/here/$opts);

    See also: /o is dead, long live qr//! and Use global flag when declaring regular expressions with qr?

    L*

    UPDATED because it doesnt work with g ..sorry for the noise

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Gives:
      Unknown regexp modifier "/g" at (eval 1) line 1, at end of line
      As mentioned elsewhere, g is one of the flags that is a property of the match/substitution operator, not of the regex, and can't be used with qr.
      --
      A math joke: r = | |csc(θ)|+|sec(θ)| |-| |csc(θ)|-|sec(θ)| |

      If you're building a match or substitute operator, generally better to use String::Substitution than eval.

      Thank you, Discipulus! You were absolutely right, I tried to make a best question as I could to not be too lengthy but also avoid the back and forth of asking for more details, etc. And I have to say that I'm very grateful for the many replies and useful tips!

Re: Use global flag when declaring regular expressions with qr?
by unmatched (Sexton) on Oct 30, 2024 at 17:28 UTC

    Thank you all for your help!

    I wanted to share the final script (at least at its current state) and get some feedback from you if at all possible. You can probably tell that I'm more used to writing Bash scripts, I'm sure I'm breaking a few Perl's style guidelines but here it goes. The goal

    #!/bin/perl use v5.36; use strict; use warnings; use Getopt::Long qw/GetOptions/; my %files_with_matches = (); my %options = ( 'line_numbers' => 0, 'files_with_matches' => 0, 'print_filename' => -1, ); sub parse_args { GetOptions( 'line-numbers' => \$options{line_numbers}, 'files-with-matches' => \$options{files_with_matches}, 'print-filename!' => \$options{print_filename}, ); if (!@ARGV or $ARGV[0] eq "-") { $options{line_numbers} = 0; $options{files_with_matches} = 0; $options{print_filename} = 0; } else { my $total_files = scalar @ARGV; # -1 indicates this option was not modified. Fallback to # default values depending on how many files are provided. if ($options{print_filename} == -1) { $options{print_filename} = $total_files == 1 ? 0 : 1; } } } # see: https://www.rfc-editor.org/rfc/rfc3986#appendix-A my $hexdig = qr/[a-f0-9]/i; my $pct_encoded = qr/%${hexdig}{2}/; my $unreserved = qr/[a-z0-9\-\._~]/i; my $sub_delims = qr/[!\$&'\(\)\*\+,;=]/; my $pchar = qr/(${unreserved}|${pct_encoded}|${sub_delims}|:|\@)/n; my $dec_oct = qr((\d)|(\d\d)|(1\d\d)|(2[0-4]\d)|(25[0-5]))n; my $ipv4 = qr(${dec_oct}\.${dec_oct}\.${dec_oct}\.${dec_oct}); my $h16 = qr(${hexdig}{1,4}); my $ls32 = qr((${h16} : ${h16}) | ${ipv4})xn; my $ipv6 = qr( ( (${h16} :){6} ${ls32}) | ( :: (${h16} :){5} ${ls32}) | (( ${h16})? :: (${h16} :){4} ${ls32}) | (((${h16} :){0,1} ${h16})? :: (${h16} :){3} ${ls32}) | (((${h16} :){0,2} ${h16})? :: (${h16} :){2} ${ls32}) | (((${h16} :){0,3} ${h16})? :: (${h16} :){1} ${ls32}) | (((${h16} :){0,4} ${h16})? :: ${ls32}) | (((${h16} :){0,5} ${h16})? :: ${h16} ) | (((${h16} :){0,6} ${h16})? :: ) )xn; my $ipvf = qr/v${hexdig}{1,}\.(${unreserved}|${sub_delims}|:){1,}/xn; my $ip_literal = qr{\[(${ipv6}|${ipvf})\]}n; my $re = qr( \b( ([a-z][a-z0-9+-.]*) :// ( ( (${unreserved}|${pct_encoded}|${sub_delims}|:)*@ )? ( ${ip_literal} | ${ipv4} | (${unreserved}|${pct_encoded}|${sub_delims})* ) (:\d+)? ) ( (\/ ( ${pchar}+(\/ ${pchar}*)*)?) | (\/ ${pchar}*)* | (${pchar}+(\/ ${pchar}*)*) | (${pchar}{0}) ) (\?(${pchar}|\/|\?)*)? (\#(${pchar}|\/|\?)*)? )\b )xin; sub match_url { my $row = shift; my $file = shift; while ($row =~ /$re/g) { if ($options{files_with_matches}) { if (! $files_with_matches{$file}) { $files_with_matches{$file} = 1; say $file; } return; } print "$file " if $options{print_filename}; print "$. " if $options{line_numbers}; say "$&"; } } sub main { parse_args(); if (!@ARGV or $ARGV[0] eq "-") { chomp(@ARGV = <STDIN>); match_url($_) for @ARGV; return; } # I went back to this form instead of while(<>) as I wanted to # provide the filename to the match_url subroutine. This was the # only way I knew how to do that. foreach (@ARGV) { open my $fh, '<', $_ or die("Error opening file $_ ($!)\n"); while (my $row = <$fh>) { chomp $row; match_url($row, $_); } close($fh); } } main();
    EDIT: Fixed the issue, script works now perfectly (as far as I can tell anyway :D)