PerlMonks  

Question regarding a regex

by CrashBlossom (Beadle)
on Jul 22, 2021 at 18:34 UTC ( [id://11135314]=perlquestion )

CrashBlossom has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
I would like some help in understanding the regular expression in the following code:
    sub TextFile {
        return 0 if (! -f $_[0]);
        return 0 if (! -r $_[0]);
        open FH, "<" . $_[0];
        my $block = " " x 4096;
        my $bytesread = sysread FH, $block, 4096;
        close FH;
        if (! defined $bytesread) {
            print "*** ERROR: TextFile: $_[0]: $!\n";
            return 0;
        }
        return $block =~ /^[\r\n\t -~]*$/s;
    }
It attempts to guess whether a file is a text file based on what it sees in the first 4096 characters. I ran across it while seeking an alternative to -T to check for a text file. It seems to work in my tests, but I don't understand why because I don't understand what the regex is matching.

My understanding is that [] defines a character class. A ^ before [] means negation. * means zero or more. $ means end of line. So if I put that all together, it seems to mean that
$block =~ /^[\r\n\t -~]*$/s
is true if $block does not include any of \r\n\t -~ before an end of line. But that doesn't make sense. I'm also mystified by the inclusion of the characters -~ in the character class.

Can anyone unpack all this for me?

I am running strawberry perl 5.30 on windows 10.

Thanks!

Replies are listed 'Best First'.
Re: Question regarding a regex
by jdporter (Paladin) on Jul 22, 2021 at 19:10 UTC
    the characters -~ in the character class

    Unless the hyphen is the very first character, it signifies a range. In this case, it's the range ' ' (space) through '~' (tilde) - which is essentially the entire range of printable ASCII characters.

    So what this test is saying: is there a "line" of "text" - where "line" is defined by the ^ and $ anchors, and "text" is defined as "all the printable ASCII characters, plus the selected whitespace characters newline, carriage return, tab, and space."
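    The range claim can be checked by counting what the class accepts. This is only an illustrative sketch; the variable names are made up for the example, and it assumes an ASCII-based platform.

```perl
use strict;
use warnings;

# Every character from ' ' (0x20) through '~' (0x7E), built by hand.
my @printable = map { chr } 0x20 .. 0x7E;

# The ASCII code points (0..127) that the class from the question accepts.
my @accepted = grep { chr($_) =~ /\A[\r\n\t -~]\z/ } 0 .. 127;

printf "range ' '..'~' holds %d characters\n", scalar @printable;
printf "class accepts %d of 128 ASCII code points\n", scalar @accepted;
```

    The second count is three higher than the first because \r, \n and \t sit outside the space-to-tilde range and are listed separately in the class.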

      > Unless the hyphen is the very first character, it signifies a range.

      or the very last.

        DB<1> p '-' =~ /[a-]/
      1
        DB<2> p '-' =~ /[a- ]/
      Invalid [] range "a- " in re ..
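      The same rules can be tried outside the debugger. A small sketch (variable names are invented for the example) comparing a hyphen at the start, at the end, in the middle, and escaped:

```perl
use strict;
use warnings;

# Hyphen last in the class: literal, so the class is 'a' or '-'.
my $last  = '-' =~ /[a-]/ ? 1 : 0;
# Hyphen first in the class: also literal.
my $first = '-' =~ /[-a]/ ? 1 : 0;
# Hyphen between two characters: a range, here 'a' through 'c'.
my $range_hit  = 'b' =~ /[a-c]/ ? 1 : 0;
my $range_miss = '-' =~ /[a-c]/ ? 1 : 0;
# A backslash-escaped hyphen is literal anywhere in the class.
my $escaped = '-' =~ /[a\-c]/ ? 1 : 0;

print "last=$last first=$first range_hit=$range_hit ",
      "range_miss=$range_miss escaped=$escaped\n";
```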

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: Question regarding a regex
by hippo (Bishop) on Jul 22, 2021 at 19:08 UTC
    A ^ before [] means negation.

    That's a misunderstanding. The ^ at the start means match the start of the string. A ^ inside [] would be negation.
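    The two meanings of ^ are easy to compare side by side. A minimal sketch, with made-up test strings and variable names:

```perl
use strict;
use warnings;

# ^ outside the brackets: anchor. "Does the string START with a or b?"
my $anchor_hit  = 'abc' =~ /^[ab]/ ? 1 : 0;   # starts with 'a'
my $anchor_miss = 'cab' =~ /^[ab]/ ? 1 : 0;   # starts with 'c'

# ^ right after the [: negation. "Is there a character that is NOT a or b?"
my $negate_hit  = 'abc'  =~ /[^ab]/ ? 1 : 0;  # 'c' qualifies
my $negate_miss = 'abab' =~ /[^ab]/ ? 1 : 0;  # nothing but a and b

# Both at once: the string starts with something other than a or b.
my $both = 'cab' =~ /^[^ab]/ ? 1 : 0;

print "anchor: $anchor_hit $anchor_miss  ",
      "negate: $negate_hit $negate_miss  both: $both\n";
```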


    🦛

Re: Question regarding a regex
by eyepopslikeamosquito (Archbishop) on Jul 22, 2021 at 23:21 UTC

    > I ran across it while seeking an alternative to -T to check for a text file

    FYI, it seems the OP ran across this function at this stackoverflow question.

Re: Question regarding a regex
by AnomalousMonk (Archbishop) on Jul 23, 2021 at 00:40 UTC

    Perl can help you understand this kind of thing on your own. I find YAPE::Regex::Explain handy for explaining "simple" regexes. Note that there is no support for regex syntax added after Perl version 5.6.

    Win8 Strawberry 5.30.3.1 (64) Thu 07/22/2021 20:17:20
    C:\@Work\Perl\monks
    >perl -wMstrict -MYAPE::Regex::Explain -e
    "print YAPE::Regex::Explain->new(qr/^[\r\n\t -~]*$/s)->explain;"
    The regular expression:

    (?s-imx:^[\r\n\t -~]*$)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?s-imx:                 group, but do not capture (with . matching
                             \n) (case-sensitive) (with ^ and $ matching
                             normally) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      [\r\n\t -~]*             any character of: '\r' (carriage return),
                               '\n' (newline), '\t' (tab), ' ' to '~' (0
                               or more times (matching the most amount
                               possible))
    ----------------------------------------------------------------------
      $                        before an optional \n, and the end of the
                               string
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
    There are, of course, any number of on-line regex parsers, explainers and checkers, but I can't think of a good one ATM. Google is your friend.
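    The "before an optional \n" part of the $ explanation is worth seeing in action. A short sketch (test strings and variable names are invented) comparing $ with \z, which matches only at the very end of the string:

```perl
use strict;
use warnings;

my $text = "abc\n";

# $ may match just before one trailing newline, so this succeeds.
my $dollar = $text =~ /^abc$/ ? 1 : 0;

# \z matches only at the true end of the string, so this fails.
my $strict = $text =~ /^abc\z/ ? 1 : 0;

# $ tolerates one trailing newline, not two.
my $two_newlines = "abc\n\n" =~ /^abc$/ ? 1 : 0;

print "dollar=$dollar strict=$strict two_newlines=$two_newlines\n";
```

    For the regex in the question the distinction hardly matters, since \n is already a member of the character class.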


    Give a man a fish:  <%-{-{-{-<

Re: Question regarding a regex
by Anonymous Monk on Jul 22, 2021 at 19:26 UTC

    No, the caret only negates the character set if it appears immediately after the left square bracket. Outside square brackets, it matches the beginning of the string or immediately after a newline. Also: the  -~ sequence inside square brackets means 'any character between space and tilde, inclusive.'

    So in words, the regular expression specifies 'all characters in the first 4096 (or end-of-file, whichever comes first) are "\r", "\n", "\t", or characters in the range space to tilde, inclusive, in your machine's native encoding'.

    Off-topic comments:

    • The -T operator makes a heuristic guess at whether your file is ASCII or UTF-8 text. Just say return -T $_[0];.
    • The special file handle _ (i.e. underscore) means "whatever file was last tested" under any recent Perl, and can be faster because it makes use of the same stat() structure.
    • The three-argument form of open() is preferred because it handles file names with strange characters better. In your case it would be open FH, "<", $_[0].
    • You should probably ensure that your open() succeeded. Something like open FH, "<", $_[0] or die "Failed to open $_[0]: $!"; is the usual idiom.
    • People usually use lexical file handles rather than bareword file handles these days because they get closed automatically when they go out of scope (say, if you throw an unexpected exception). In your code that would look like open my $fh, "<", $_[0] ....
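    Taken together, the suggestions above might look something like the sketch below. It is not the original author's code: the name text_file, the warn-and-return-0 error handling, the binmode call and the temp-file demo are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical rewrite of TextFile using the advice in the list above.
sub text_file {
    my ($path) = @_;

    # -r _ reuses the stat data gathered by the preceding -f test.
    return 0 unless -f $path && -r _;

    # Three-argument open with a lexical handle; report failure
    # instead of silently carrying on.
    open my $fh, '<', $path or do {
        warn "*** ERROR: text_file: $path: $!\n";
        return 0;
    };
    binmode $fh;    # assumption: we want the raw bytes

    my $block = '';
    my $bytesread = sysread $fh, $block, 4096;
    if (!defined $bytesread) {
        warn "*** ERROR: text_file: $path: $!\n";
        return 0;
    }

    # $fh is closed automatically when it goes out of scope.
    return $block =~ /^[\r\n\t -~]*$/s ? 1 : 0;
}

# Demo with two throwaway files.
my ($tfh, $text_name) = tempfile(UNLINK => 1);
binmode $tfh;
print {$tfh} "plain text\r\nwith a tab\t\n";
close $tfh;

my ($bfh, $bin_name) = tempfile(UNLINK => 1);
binmode $bfh;
print {$bfh} "\x00\x01\x02binary";
close $bfh;

printf "text file: %d, binary file: %d\n",
    text_file($text_name), text_file($bin_name);
```

    Unlike the original, this version starts from an empty buffer rather than 4096 spaces; sysread overwrites the buffer either way.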
      Outside square brackets, it [caret] matches the beginning of the string or immediately after a newline.

      By default, ^ (caret) outside a character class matches only at the beginning of a string, exactly as \A does. Caret (outside a character class) also matches immediately after an embedded newline if the /m modifier is asserted. See Modifiers in perlre.
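      A minimal sketch of the difference, using an invented two-line string:

```perl
use strict;
use warnings;

my $s = "one\ntwo";

# Without /m, ^ matches only at the start of the string.
my $default = $s =~ /^two/ ? 1 : 0;

# With /m, ^ also matches right after the embedded newline.
my $multi = $s =~ /^two/m ? 1 : 0;

# \A always means start of string, whether or not /m is present.
my $abs = $s =~ /\Atwo/m ? 1 : 0;

print "default=$default multi=$multi abs=$abs\n";
```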


      Give a man a fish:  <%-{-{-{-<

Re: Question regarding a regex
by BillKSmith (Monsignor) on Jul 23, 2021 at 02:31 UTC
    Depending on your definition of 'text file', this test could give a false negative if the file contained any non-ASCII 'text'. It could give a false positive if every byte of a 'binary' file happened to contain the code of an ASCII character. (Can you tolerate an extremely rare error?)
    Bill
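    Both failure modes can be reproduced with in-memory strings. This is a sketch only: the sample data is invented, and the "7-bit readings" buffer is simply bytes folded into the space-to-tilde range.

```perl
use strict;
use warnings;

my $re = qr/^[\r\n\t -~]*$/s;

# Plain ASCII text: accepted.
my $ascii = "naive cafe\n";

# The UTF-8 bytes for "naive cafe" written with i-diaeresis and e-acute.
# Still text to a human, but bytes 0xC3, 0xAF and 0xA9 are outside the class.
my $utf8 = "na\xC3\xAFve caf\xC3\xA9\n";

# "Binary" data whose every byte happens to land between ' ' and '~'.
my $readings = join '', map { chr(0x20 + $_ % 95) } 0 .. 255;

my $ascii_ok    = $ascii    =~ $re ? 1 : 0;
my $utf8_ok     = $utf8     =~ $re ? 1 : 0;   # false negative
my $readings_ok = $readings =~ $re ? 1 : 0;   # false positive

print "ascii=$ascii_ok utf8=$utf8_ok readings=$readings_ok\n";
```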
      It could give a false positive if every byte of a 'binary' file happened to contain the code of an ASCII character. (Can you tolerate an extremely rare error?)

      I know only one binary file that intentionally contains only ASCII characters: The EICAR test file. Here it is, in its full glory:

      X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

      It is a DOS executable that just prints the embedded text "EICAR-STANDARD-ANTIVIRUS-TEST-FILE!" and then exits.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        Clever. I was thinking of a binary data file (Perhaps a sequence of readings from a 7-bit A/D converter).
        Bill

        impressive! and another form of golf is born.

Re: Question regarding a regex
by CrashBlossom (Beadle) on Jul 23, 2021 at 21:55 UTC
    Thanks for all the responses. My misunderstanding of the regex is a bit embarrassing. Thanks for setting me straight, and for providing some tidbits of perl knowledge that will probably come in handy in the future.

    Case closed.

Node Type: perlquestion [id://11135314]
Front-paged by Corion