InfiniteSilence has asked for the wisdom of the Perl Monks concerning the following question:

I have a directory with the following files:
1.1.1.txt 1.1.txt 1.txt 2.txt 3.txt
My objective is to print off only the files with names of the format nnnnn.txt where n is a number. This prints too many:
perl -e "print grep {/\d+\.txt/} glob('*.txt');"
1.1.1.txt1.1.txt1.txt2.txt3.txt
I suspect the problem lies with the regex. The original grep is looking for n.txt, so let's test just the digit part against a plain string $s:
perl -e "$s=qq(1.1.1); if ($s=~/\d+/) {print 'it is a number'};"
it is a number
But it isn't a 'number'. I still want to be able to match files with the type 200000.txt at some point. I thought this problem had to do with greediness, so I tried this:
perl -e "$s=qq(1.1.1); if ($s=~/\d+?/) {print 'it is a number'};"
it is a number
To no avail. What is the correct regex and why didn't this work?

Celebrate Intellectual Diversity

Re: regex matching number
by Paladin (Vicar) on Sep 10, 2003 at 18:13 UTC
    /\d+/ matches any string that has 1 or more digits anywhere in it. If you want a string with only digits, you need to anchor the regex: /^\d+$/. Or in your case, match digits followed by a period and "txt": /^\d+\.txt$/

    That says "match the start of the string, followed by 1 or more digits, then a period, then txt, then the end of the string."
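
A minimal sketch of the difference between the two patterns, assuming the file names from the question are hard-coded in a list (instead of coming from glob) so that it runs anywhere. The variable names are made up for illustration:

```perl
use strict;
use warnings;

# File names from the original question, hard-coded here instead of glob('*.txt').
my @files = qw(1.1.1.txt 1.1.txt 1.txt 2.txt 3.txt);

# Unanchored: succeeds if digits followed by ".txt" appear anywhere in the name.
my @unanchored = grep { /\d+\.txt/ } @files;

# Anchored: the whole name must be digits followed by ".txt".
my @anchored = grep { /^\d+\.txt$/ } @files;

print "unanchored: @unanchored\n";
print "anchored:   @anchored\n";
```

The unanchored grep keeps all five names, while the anchored one keeps only 1.txt, 2.txt and 3.txt.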

Re: regex matching number
by davido (Cardinal) on Sep 10, 2003 at 18:41 UTC
    Your problem is that a match succeeds if what you're matching for is found in the string. That sounds like an obvious statement, but think about the ramifications. The following will result in a match:

    my $string = "abcdefDaveghijkl";
    if ( $string =~ /Dave/ ) {
        print "I found Dave.\n";
    }

    It matches because Dave is found within the string you're searching.

    Now to look at your situation with the following example:

    my $string = "1.23.45";
    print "Found a number.\n" if $string =~ /\d+/;

    A number is, of course, found within that string. The only real qualification you've given is that the match has to contain at least one and possibly more numeric digits. '1' qualifies.

    Now to show an example closer to your issue.

    my $string = "1.234.567.txt";
    print "Found a good filename.\n" if $string =~ /\d+\.txt/;

    You've now told the regexp engine to return true if anywhere in the string, it finds one or more digits followed by a period (or dot, or decimal), followed by the literal letters 'txt'. Given that description, the 567.txt portion of the string triggers a match, and you've done nothing to prevent the regexp engine from accepting the match. Just as /Dave/ can be found in part of a string, 567.txt can as well. For example:

    my $string = "1.234.567.txt";
    print "Match.\n" if $string =~ /567\.txt/;

    Well, 567.txt does exist within $string, doesn't it? So of course it matches. \d+\.txt is not so different, except that it will match any number, not just 567.
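
To see for yourself which part of the string triggered the match, one option is to look at the match offsets Perl records in the built-in arrays @- and @+. This is only a sketch; $matched and $start are illustrative names:

```perl
use strict;
use warnings;

my $string = "1.234.567.txt";
my ( $matched, $start ) = ( '', -1 );

if ( $string =~ /\d+\.txt/ ) {
    # @- and @+ hold the start and end offsets of the last successful match.
    $start   = $-[0];
    $matched = substr( $string, $-[0], $+[0] - $-[0] );
    print "Matched '$matched' starting at offset $start.\n";
}
```

Here the engine gives up on "1." and "234." because neither is followed by txt, and settles on the 567.txt at offset 6.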

    What you really need to do is tell the regexp engine that your filename has to contain only numbers preceding the '.txt' extension. As with anything in Perl, there is more than one way to do it. The simplest way is to anchor the match to the beginning of the string.

    /^\d+\.txt/

    But that would also match '4567.txtish', or even '5678.txt.90210', because you've left the door open for trailing stuff. If there is any chance of some 'whitespace' characters being padded at the beginning or end of the string, you should take that into consideration, and also should prevent unwanted characters past the '.txt' extension by anchoring at the end of the string as well:

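    my $testname = " 456.txt";
    my $filename;
    # The parentheses around $filename put the match in list context,
    # so the captured text (not just a true value) is assigned.
    if ( ($filename) = $testname =~ /^\s*?(\d+\.txt)\s*?$/ ) {
        print "Found a file named $filename.\n";
    }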

    Now you've gotten a lot more robust. The preceding regexp will accept only filenames of the form nnnn.txt (where nnnn is any number of numeric digits). If the filename is preceded or followed by whitespace, that whitespace is permitted but ignored, and you're anchored to both the beginning and the end of the string. That prevents a filename like 1234.txtish from being matched. Though probably unnecessary in this case, I also forced non-greed in my matches that absorbed whitespace. The reason I did this wasn't necessarily to make YOUR regexp more robust, but to reinforce a good habit that makes most regexps more robust. In fact, in Larry Wall's description of regexps in Perl 6, he discusses the fact that non-greedy matching ought to be the default, and greedy matching ought to be the exception in Perl 6 regular expressions. Non-greedy matches are, more often than not, what you're looking for. And finally, this match assigns the portion that you consider to be a filename to the scalar variable $filename.
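
A small sketch of the capture-and-anchor idea wrapped in a helper. The sub name numeric_txt_name is hypothetical, and the pattern uses plain greedy \s*, which accepts the same strings here because both ends are anchored:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bare file name, or undef if the input
# is not digits followed by ".txt" (optionally surrounded by whitespace).
sub numeric_txt_name {
    my ($candidate) = @_;

    # In list context a successful match returns its captures, so the
    # parentheses around $name are what pick up the contents of $1.
    my ($name) = $candidate =~ /^\s*(\d+\.txt)\s*$/;
    return $name;
}

for my $candidate ( " 456.txt", "1234.txtish", "5678.txt.90210", "200000.txt  " ) {
    my $name = numeric_txt_name($candidate);
    printf "%-18s => %s\n", "'$candidate'", defined $name ? "'$name'" : "no match";
}
```

Only the first and last candidates come back with a name; the other two fail because of the trailing characters after .txt.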

    Hope this helps!

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      This is the answer I was looking for. I did not understand why anchoring (using ^ or $) would have made this result any different until I read davido's response. I get it now.

      Celebrate Intellectual Diversity

Re: regex matching number
by DrHyde (Prior) on Sep 10, 2003 at 18:36 UTC
    If you only want to match a bunch of digits followed by .txt, then
    /^\d+\.txt$/
    should do the trick. The ^ and $ "anchor" the regex to the beginning and end of the string, so that matches the beginning of the string, followed by one or more digits, followed by '.txt', followed by the end of the string. If you define 1.1.txt as being OK (and the way I read it, 1.1 is a number), then:
    /^\d+(\.\d+)?\.txt$/
    will work. (...)? matches zero or one instances of whatever's inside the brackets, making the decimal point and any digits after it optional. And finally, to optionally allow negative numbers:
    /^-?\d+(\.\d+)?\.txt$/
    The brackets in those last two, which I'm really just using for grouping, will have the side-effect of capturing their contents into the $1 variable. You can avoid that by inserting more line-noise into the pattern, which I left out here in the interests of clarity. Search for the word 'clustering' in the perlre manpage for details.
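
For reference, the "clustering" construct in perlre is (?:...), which groups without capturing. A minimal sketch of the last pattern rewritten that way, with made-up test names:

```perl
use strict;
use warnings;

# (?:...) groups without capturing, so no $1 is set by this pattern.
my $number_txt = qr/^-?\d+(?:\.\d+)?\.txt$/;

my @names = qw(1.txt 1.1.txt 1.1.1.txt -3.14.txt 2.txtish);
my @ok    = grep { $_ =~ $number_txt } @names;

print "accepted: @ok\n";
```

Of these names, 1.txt, 1.1.txt and -3.14.txt are accepted; 1.1.1.txt and 2.txtish are not.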

    A completely different approach would be to use perl's automatic conversion between strings and numbers. If you evaluate a scalar that begins with a number in a numberish way - eg by adding it to 0 - the result is the number at the beginning of the string. For instance:

    print 0+"1.1.1.txt";   # 1.1
    print 0+"-3.14.foo";   # -3.14
    or if you only want integers ...
    print int("1.1.1.txt");   # 1
    print int("-3.14.foo");   # -3
    Adapting your code to use this method - and to round your numbers properly - is left as an exercise for the reader :-)
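
One possible take on that exercise, as a sketch rather than a recommendation: convert the name to an integer, rebuild the expected file name from it, and keep the file only if the two agree. The sub name is hypothetical, and the approach assumes names without leading zeros (007.txt would be rejected because it rebuilds as 7.txt):

```perl
use strict;
use warnings;

# Hypothetical helper built on Perl's string-to-number conversion.
sub is_whole_number_txt {
    my ($name) = @_;

    # Numifying a string like "1.1.1.txt" warns under "use warnings",
    # so switch off just that warning category inside this sub.
    no warnings 'numeric';
    my $n = int( 0 + $name );

    # Keep the name only if rebuilding it from the integer gives it back.
    return "$n.txt" eq $name;
}

my @files = qw(1.1.1.txt 1.1.txt 1.txt 2.txt 3.txt 200000.txt);
my @kept  = grep { is_whole_number_txt($_) } @files;

print "kept: @kept\n";
```

The anchored regex is shorter and clearer for this job; the sketch only shows that the numeric route can be made to work.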
Re: regex matching number
by InfiniteSilence (Curate) on Sep 10, 2003 at 19:47 UTC
    Come to think of it, it would have been good if I were able to debug a regex:
    perl -de "$s=qq(1a2a3a4a5a); if ($s=~/(\d+)/){print $1};"
    Default die handler restored.
    Loading DB routines from perl5db.pl version 1.07
    Editor support available.
    Enter h or `h h' for help, or `perldoc perldebug' for more help.
    main::(-e:1): $s=qq(1a2a3a4a5a); if ($s=~/(\d+)/){print $1};
    DB<1> n
    main::(-e:1): $s=qq(1a2a3a4a5a); if ($s=~/(\d+)/){print $1};
    DB<1> p $_
    DB<2>
    Is there any way to do this?

    Celebrate Intellectual Diversity

      Works for me, that is, I am able to play with conditions and see what happens as I experiment.
        DB<1> $s=qq(1a2a3a4a5a); if ($s=~/(\d+)/){print $1};
      1
        DB<2> $s=qq(1a2a3a4a5a); while ($s=~/(\d+)/g){print $1};
      12345
        DB<3> $s=q(1a22b333c4567d89);
      
        DB<4> $re=qr~(\d+)~;
      
        DB<5> print $1 while $s =~ /$re/g;
      122333456789
        DB<6> $re=qr~(\d{3,})~;
      
        DB<7> print $1 while $s =~ /$re/g;
      3334567
      
      Just keep playing with inputs and patterns until you understand what is going on   (or until you get something that works and then put 'study' on your to-do list ;-)

      And strangely enough, shortly before/after your question, in another thread antirice offered a suggestion to use a tool to try explaining what a RE means - see Re: Pattern Matching Question.   Hey, that became next on my todo list!
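
Besides experimenting in the debugger, the core re pragma can trace how a pattern is compiled and executed. A minimal sketch, assuming you run it from a script and read the trace on STDERR (the trace is verbose and its format varies between Perl versions):

```perl
use strict;
use warnings;

my $s = "1a2a3a4a5a";
my $found;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    ($found) = $s =~ /(\d+)/;
}
print "captured: $found\n";
```

The trace shows each position the engine tries, which makes it easier to see why an unanchored pattern succeeds partway through a string.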