
For testing these email addresses, you could try Regexp::Pattern::Email.

Thx, kcott, it seems to do the trick:

$ ./2.email.kcott.pl 
NOK: |Elmer Fudd
|
NOK: |Daffy Duck
|
NOK: |Alternate
|
NOK: |Phone
|
NOK: |No
|
NOK: |7/13/2017
|
NOK: |Yes
|
NOK: |9/09/2006
|
OK:  |daffy@gmail.com
|
OK:  |Elmer.am@gmail.com
|
NOK: |12/5/2019
|
OK:  |бесполезное.использование.кота@gmail.com
|
OK:  |kobernIU@hotmail.comp
|
OK:  |drüben@msn.com
|
OK:  |manilow@barry76@gmail.com
|
OK:  |moc.liamg@نالی بلی
|
OK:  |時髦的貓@gmail.com
|
OK:  |pen@ничего.net 
|
OK:  |last@nothing.nyet
|
NOK: |
|
cardinality: 10
Elmer.am@gmail.com
 daffy@gmail.com
 drüben@msn.com
 kobernIU@hotmail.comp
 last@nothing.nyet
 manilow@barry76@gmail.com
 moc.liamg@نالی بلی
 pen@ничего.net 
 бесполезное.использование.кота@gmail.com
 時髦的貓@gmail.com

$ cat 2.email.kcott.pl
#!/usr/bin/perl
use v5.028;    # strictness implied
use warnings;
use Path::Tiny;
binmode STDOUT, ":utf8";

# to install: cpanm Regexp::Pattern::Email
use Regexp::Pattern;

my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.kcott.email.output.txt');

my @addrs = $file_in->lines_utf8;
my @matching;

for my $addr (@addrs) {
    if ( $addr =~ re("Email::email_address") ) {
        say "OK:  |$addr|";
        push( @matching, $addr );
    }
    else {
        say "NOK: |$addr|";
    }
}

@matching = sort(@matching);
say "cardinality: ", scalar @matching;
my $string = join( " ", @matching );
say "$string";
$file_out->spew_utf8($string);

__END__
$

This seems to accomplish its task, but I had a side-effect on this platform that I'm struggling to understand. Output was to be marshaled by Path::Tiny. What I ended up with every time I ran it was the proper output plus a phantom file like:

1.kcott.email.output.txt93601288741312

, of zero size, that appeared in my file explorer. I don't even know what to call that on this raspberry pi, even having looked through its menus. When I selected them and hit the delete key, I got:

1.kcott.email.output.txt323160262002: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt323160262002”: No such file or directory
1.kcott.email.output.txt3642662573981: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt3642662573981”: No such file or directory
1.kcott.email.output.txt35531339026259: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35531339026259”: No such file or directory
1.kcott.email.output.txt35631638814375: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35631638814375”: No such file or directory
1.kcott.email.output.txt93601288741312: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt93601288741312”: No such file or directory

, and the terminal with ls -al showed nothing of them. I took a screenshot to prove to myself that it was happening.

Is there an io layer going on that I'm not accounting for?
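
One possible explanation, offered as an assumption rather than something confirmed in this thread: Path::Tiny documents spew as an atomic write, meaning the data goes to a temporary file in the same directory and that file is then renamed over the target. If the temporary name is the target path with the process ID and a random integer appended (my recollection of the implementation, not checked against the version installed here), a file manager could briefly notice the temporary file and keep a stale entry after the rename. The sketch below imitates that write-then-rename pattern with core Perl only; atomic_spew is a hypothetical helper, not Path::Tiny's code.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory so the sketch never touches real files.
my $dir    = tempdir( CLEANUP => 1 );
my $target = "$dir/output.txt";

# Hypothetical helper: write to a temp file next to the target,
# then rename it into place. The naming scheme (path . PID . random)
# is an assumption about how such temp names might be built.
sub atomic_spew {
    my ( $path, $data ) = @_;
    my $temp = $path . $$ . int( rand( 2**31 ) );
    open my $fh, '>:encoding(UTF-8)', $temp or die "open $temp: $!";
    print {$fh} $data or die "print $temp: $!";
    close $fh or die "close $temp: $!";
    rename $temp, $path or die "rename $temp -> $path: $!";
    return $temp;    # returned only so the caller can inspect the name
}

my $temp = atomic_spew( $target, "daffy\@gmail.com\n" );

print "temp name was:      $temp\n";
print "temp still exists?  ", ( -e $temp   ? "yes" : "no" ), "\n";
print "target exists?      ", ( -e $target ? "yes" : "no" ), "\n";
```

After the rename the temporary name no longer refers to anything on disk, which would be consistent with ls -al showing nothing and the delete attempt reporting "No such file or directory".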

Anyways, the world will keep spinning despite this. Curious as I am, I took a look inside Regexp-Pattern-Email/source/lib/Regexp/Pattern/Email.pm

How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module:

pat => qr((?:(?^:(?:(?^:(?>(?^:(?^:(?>(?^:(?>(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s*\((?:\s*(?^:(?^: +(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|\.|\s +*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))+"\s*))+))|(?>(?^:(?^:(?>(? +^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))* +\s*\)\s*))|(?>\s+))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s* +\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|( +?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^ +\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[ +^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))+))?)(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*<(?^:(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\ +\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\ +[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(? +>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s* +))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^ +:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(? 
+^:[^\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s +*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))| +(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x +7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+)) +|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?> +\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*) +)|(?>\s+))*\[(?:\s*(?^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\] +(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|) +)*\s*\)\s*))|(?>\s+))*))))>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+) +)|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))|(?^:(?^:(?^:(?>( +?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|)) +*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[ +^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:( +?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(? +>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))| +))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))*"(?^ +:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\ +s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[ +^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x0 +0-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s] ++)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(? +>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*\[(?:\s*(? +^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\](?^:(?^:(?>\s*\((?:\s +*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+)) +*)))))(?>(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))*)))),

Why does this have to be so complicated?

Re^3: regex for unicode email addresses
by hv (Prior) on Mar 11, 2022 at 00:55 UTC
    How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module ...

    The docs say the regexp is taken from Email::Address, which makes it a lot clearer how it is put together: I guess you'd start at line 137: our $addr_spec  = qr/$local_part\@$domain/; and work backwards from there.
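
    To illustrate how a pattern like that ends up looking the way it does, here is a minimal sketch that builds an address pattern from small qr// pieces and then prints the compiled result. The fragment names ($atom, $dot_atom) and the simplified grammar are my own assumptions for illustration; the real module defines many more pieces (comments, quoted strings, domain literals).

```perl
use strict;
use warnings;

# Simplified building blocks, loosely following the RFC 2822 grammar names.
# These are illustrative assumptions, not the definitions used by Email::Address.
my $atom     = qr/[^\x00-\x1F\x7F()<>\[\]:;\@\\,."\s]+/;
my $dot_atom = qr/$atom(?:\.$atom)*/;

# Same shape as the line quoted above: local part, literal @, domain.
my $local_part = $dot_atom;
my $domain     = $dot_atom;
my $addr_spec  = qr/$local_part\@$domain/;

# Each interpolated qr// object is embedded as its own (?^:...) group,
# so the stringified pattern grows quickly with every level of nesting.
print "$addr_spec\n";

for my $candidate ( 'daffy@gmail.com', 'Elmer Fudd' ) {
    my $verdict = $candidate =~ /\A$addr_spec\z/ ? "OK" : "NOK";
    print "$verdict: $candidate\n";
}
```

    Reading the handful of named fragments is far easier than reading the flattened output, which is why starting from the source of the pattern helps.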

    We also read there that this is implementing RFC 2822, which mandates what a valid email address consists of. At 51 pages - that describe the whole message format, not just email addresses - that's pretty light as standards documents go. :)