in reply to regex for unicode email addresses

G'day Aldebaran,

For testing these email addresses, you could try Regexp::Pattern::Email.

I used this common alias of mine:

$ alias perlu
alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

Here's my test code and output.

$ perlu '
    use Regexp::Pattern;

    my @addrs = (
        q{Elmer Fudd},
        q{Daffy Duck},
        q{Alternate},
        q{Phone},
        q{No},
        q{7/13/2017},
        q{Yes},
        q{9/09/2006},
        q{daffy@gmail.com},
        q{Elmer.am@gmail.com},
        q{12/5/2019},
        q{бесполезное.использование.кота@gmail.com},
        q{kobernIU@hotmail.comp},
        q{drüben@msn.com},
        q{manilow@barry76@gmail.com},
        q{moc.liamg@نالی بلی},
        q{時髦的貓@gmail.com},
        q{pen@ничего.net},
        q{last@nothing.nyet},
    );

    for my $addr (@addrs) {
        if ($addr =~ re("Email::email_address")) {
            say "OK:  |$addr|";
        }
        else {
            say "NOK: |$addr|";
        }
    }
'
NOK: |Elmer Fudd|
NOK: |Daffy Duck|
NOK: |Alternate|
NOK: |Phone|
NOK: |No|
NOK: |7/13/2017|
NOK: |Yes|
NOK: |9/09/2006|
OK:  |daffy@gmail.com|
OK:  |Elmer.am@gmail.com|
NOK: |12/5/2019|
OK:  |бесполезное.использование.кота@gmail.com|
OK:  |kobernIU@hotmail.comp|
OK:  |drüben@msn.com|
OK:  |manilow@barry76@gmail.com|
OK:  |moc.liamg@نالی بلی|
OK:  |時髦的貓@gmail.com|
OK:  |pen@ничего.net|
OK:  |last@nothing.nyet|
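
Note that manilow@barry76@gmail.com and moc.liamg@نالی بلی both come out as OK. The test above is an unanchored match, so the pattern only has to match somewhere in the string; my guess is that barry76@gmail.com is the part that matched in the first case. Here's a minimal sketch of that effect. It uses a deliberately simple stand-in pattern ($simple_re is made up for this example and is not the module's regex) so that it runs without installing anything. With the real module the same idea would be $addr =~ /\A$re\z/ with my $re = re("Email::email_address"); I haven't rerun the list that way.

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Stand-in pattern (an assumption for illustration only, NOT the
# Regexp::Pattern::Email regex): text, one "@", text, with no
# whitespace or further "@" on either side.
my $simple_re = qr/[^\s\@]+\@[^\s\@]+/;

# True if the pattern matches anywhere inside the string.
sub matches_anywhere {
    my ($addr) = @_;
    return $addr =~ $simple_re ? 1 : 0;
}

# True only if the pattern accounts for the whole string.
sub matches_whole {
    my ($addr) = @_;
    return $addr =~ /\A$simple_re\z/ ? 1 : 0;
}

for my $addr ('daffy@gmail.com', 'manilow@barry76@gmail.com', 'moc.liamg@نالی بلی') {
    printf "anywhere=%d whole=%d |%s|\n",
        matches_anywhere($addr), matches_whole($addr), $addr;
}
```

Whether anchoring is what you want depends on the job: for pulling addresses out of free text the unanchored form is fine, for validating a single field it probably isn't.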

— Ken

Re^2: regex for unicode email addresses
by Aldebaran (Curate) on Mar 10, 2022 at 20:01 UTC
    For testing these email addresses, you could try Regexp::Pattern::Email.

    Thx, kcott, it seems to do the trick:

    $ ./2.email.kcott.pl 
    NOK: |Elmer Fudd
    |
    NOK: |Daffy Duck
    |
    NOK: |Alternate
    |
    NOK: |Phone
    |
    NOK: |No
    |
    NOK: |7/13/2017
    |
    NOK: |Yes
    |
    NOK: |9/09/2006
    |
    OK:  |daffy@gmail.com
    |
    OK:  |Elmer.am@gmail.com
    |
    NOK: |12/5/2019
    |
    OK:  |бесполезное.использование.кота@gmail.com
    |
    OK:  |kobernIU@hotmail.comp
    |
    OK:  |drüben@msn.com
    |
    OK:  |manilow@barry76@gmail.com
    |
    OK:  |moc.liamg@نالی بلی
    |
    OK:  |時髦的貓@gmail.com
    |
    OK:  |pen@ничего.net 
    |
    OK:  |last@nothing.nyet
    |
    NOK: |
    |
    cardinality: 10
    Elmer.am@gmail.com
     daffy@gmail.com
     drüben@msn.com
     kobernIU@hotmail.comp
     last@nothing.nyet
     manilow@barry76@gmail.com
     moc.liamg@نالی بلی
     pen@ничего.net 
     бесполезное.использование.кота@gmail.com
     時髦的貓@gmail.com
    
    $ cat 2.email.kcott.pl
    
    #!/usr/bin/perl
    use v5.028;    # strictness implied
    use warnings;
    use Path::Tiny;

    binmode STDOUT, ":utf8";

    # to install: cpanm Regexp::Pattern::Email
    use Regexp::Pattern;

    my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
    my $file_out = path('/home/pi/Documents/curate/1.kcott.email.output.txt');

    my @addrs = $file_in->lines_utf8;
    my @matching;

    for my $addr (@addrs) {
        if ( $addr =~ re("Email::email_address") ) {
            say "OK:  |$addr|";
            push( @matching, $addr );
        }
        else {
            say "NOK: |$addr|";
        }
    }

    @matching = sort(@matching);
    say "cardinality: ", scalar @matching;
    my $string = join( " ", @matching );
    say "$string";
    $file_out->spew_utf8($string);
    __END__
    $
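
The closing | landing on its own line in the output above, and the extra NOK: | entry at the end, look like the result of each line still carrying its newline (plus a blank last line in the input file): lines_utf8 returns lines with their endings intact unless asked otherwise. Path::Tiny documents a chomp option for this, $file_in->lines_utf8( { chomp => 1 } ). The sketch below shows the same effect with core Perl only, reading from an in-memory string; the sample data and the read_lines helper are made up for the example.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: return the lines of $text, chomped on request.
sub read_lines {
    my ($text, $chomp) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @lines = <$fh>;          # each element still ends in "\n"
    close $fh;
    chomp @lines if $chomp;     # strip the trailing newline from every element
    return @lines;
}

my $sample = "daffy\@gmail.com\nElmer.am\@gmail.com\n";

say "raw:     |$_|" for read_lines($sample, 0);
say "chomped: |$_|" for read_lines($sample, 1);
```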

    This seems to accomplish its task, but I had a side-effect on this platform that I'm struggling to understand. Output was to be marshaled by Path::Tiny. What I ended up with every time I ran it was the proper output plus a phantom file like:

    1.kcott.email.output.txt93601288741312

    , of zero size, that appeared in my file explorer. I don't even know what to call that on this raspberry pi, even having looked through its menus. When I selected them and hit the delete key, I got:

    1.kcott.email.output.txt323160262002: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt323160262002”: No such file or directory
    1.kcott.email.output.txt3642662573981: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt3642662573981”: No such file or directory
    1.kcott.email.output.txt35531339026259: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35531339026259”: No such file or directory
    1.kcott.email.output.txt35631638814375: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35631638814375”: No such file or directory
    1.kcott.email.output.txt93601288741312: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt93601288741312”: No such file or directory

    , and the terminal with ls -al showed nothing of them. I took a screenshot to prove to myself that it was happening.

    Is there an io layer going on that I'm not accounting for?
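
A guess, not verified on a Pi: no extra IO layer needs to be involved. As far as I know, spew in Path::Tiny writes atomically by default, meaning it writes to a temporary file next to the target and then renames that over the target. A file manager watching the directory could catch the temporary name for an instant; by the time you select it and hit delete it has already been renamed, which would fit the "No such file or directory" messages and the empty ls -al. The numeric suffixes would also fit a process ID followed by a random number, but that is me reading the names, not something I checked against your installed version. The write-then-rename idea looks roughly like this in core Perl; the atomic_write helper and its temp-name scheme are my own for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write $content to $target via a temporary sibling file.
# Returns the temporary name so the caller can see it is gone afterwards.
sub atomic_write {
    my ($target, $content) = @_;
    my $temp = $target . $$ . int( rand( 2**31 ) );    # illustrative naming only
    open my $fh, '>:encoding(UTF-8)', $temp or die "Cannot open $temp: $!";
    print {$fh} $content;
    close $fh or die "Cannot close $temp: $!";
    # rename() swaps the finished file into place; until then the
    # temporary name is briefly visible in the directory.
    rename $temp, $target or die "Cannot rename $temp to $target: $!";
    return $temp;
}

my $dir    = tempdir( CLEANUP => 1 );
my $target = File::Spec->catfile( $dir, 'output.txt' );
my $temp   = atomic_write( $target, "daffy\@gmail.com\n" );

say "target exists:    ", ( -e $target ? 'yes' : 'no' );
say "temp file exists: ", ( -e $temp   ? 'yes' : 'no' );
```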

    Anyways, the world will keep spinning despite this. Curious as I am, I took a look inside Regexp-Pattern-Email/source/lib/Regexp/Pattern/Email.pm

    How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module:

    pat => qr((?:(?^:(?:(?^:(?>(?^:(?^:(?>(?^:(?>(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s*\((?:\s*(?^:(?^: +(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|\.|\s +*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))+"\s*))+))|(?>(?^:(?^:(?>(? +^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))* +\s*\)\s*))|(?>\s+))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s* +\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|( +?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^ +\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[ +^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))+))?)(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*<(?^:(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\ +\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\ +[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(? +>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s* +))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^ +:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(? 
+^:[^\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s +*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))| +(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x +7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+)) +|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?> +\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*) +)|(?>\s+))*\[(?:\s*(?^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\] +(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|) +)*\s*\)\s*))|(?>\s+))*))))>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+) +)|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))|(?^:(?^:(?^:(?>( +?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|)) +*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[ +^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:( +?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(? +>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))| +))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))*"(?^ +:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\ +s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[ +^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x0 +0-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s] ++)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(? +>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*\[(?:\s*(? +^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\](?^:(?^:(?>\s*\((?:\s +*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+)) +*)))))(?>(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))*)))),

    Why does this have to be so complicated?

      How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module ...

      The docs say the regexp is taken from Email::Address, which makes it a lot clearer how it is put together: I guess you'd start at line 137: our $addr_spec  = qr/$local_part\@$domain/; and work backwards from there.
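
To make the "work backwards" point concrete: a wall of nested groups like the one quoted above is what you get when small qr// pieces are interpolated into bigger ones and the final result is stringified, because each interpolated piece shows up as its own (?^:...) group. Here's a much-simplified sketch. The variable names echo the ones in Email::Address, but the definitions are my own cut-down versions (a reduced ASCII character set, no comments, quoted strings or domain literals), so this is not the real pattern.

```perl
use strict;
use warnings;
use feature 'say';

# Simplified building blocks (assumptions for illustration, NOT the
# Email::Address definitions).
my $atext      = qr/[A-Za-z0-9_+-]/;               # one allowed character
my $dot_atom   = qr/$atext+(?:\.$atext+)*/;        # words separated by dots
my $local_part = $dot_atom;
my $domain     = $dot_atom;
my $addr_spec  = qr/$local_part\@$domain/;         # same shape as the line cited above

# True if the whole string is a local part, an "@" and a domain.
sub is_addr_spec {
    my ($candidate) = @_;
    return $candidate =~ /\A$addr_spec\z/ ? 1 : 0;
}

# Stringifying the composed regex shows the nested (?^:...) groups.
say "pattern: $addr_spec";
say "$_ => ", is_addr_spec($_) for 'daffy@gmail.com', 'Daffy Duck';
```

Five short definitions already print as a fairly dense one-liner; the real module layers comments, folding whitespace, quoted strings and display names on top, which is where the length comes from.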

      We also read there that this is implementing RFC 2822, which mandates what a valid email address consists of. At 51 pages - that describe the whole message format, not just email addresses - that's pretty light as standards documents go. :)