regex for unicode email addresses

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Happy March! There is an unusual amount of madness going on, and I find that by attending to my own mundane problems, I can be marginally helpful to everyone else. I turn to Perl for solutions to problems that would otherwise baffle me. Unfortunately, it's thick stuff, and the bafflement is always nearby when I start to work with Unicode.

One of the tasks I was assigned was to compile an email list that originated as a .pdf. I looked through the PDF:: family on CPAN and did not see a way to slurp out a column directly, so I availed myself of open source software called calibre, wherein I was able to save the document as txt. I've worked up an SSCCE-sized list to imitate it, with the addition that Unicode could be involved.

Elmer Fudd
Daffy Duck
Alternate
Phone
No
7/13/2017
Yes
9/09/2006
daffy@gmail.com
Elmer.am@gmail.com
12/5/2019
бесполезное.использование.кота@gmail.com
kobernIU@hotmail.comp
drüben@msn.com
manilow@barry76@gmail.com
moc.liamg@نالی بلی
時髦的貓@gmail.com
pen@ничего.net
last@nothing.nyet

Q1) Is there a perl or posix standard for what comprises a valid email?

Q2) Are unicode characters allowed in every part?

This is my current script:

#!/usr/bin/perl 

use v5.028;    # strictness implied
use warnings;
use Path::Tiny;
binmode STDOUT, ":utf8";
my $file_in = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.sscce.email.output.txt');
my @lines = $file_in->lines_utf8;
say @lines;
my @matching;
for my $line (@lines){
 
    if ( $line =~ /([A-Za-z0-9._]+\@[a-z0-9.-]+)/){
      push( @matching, $1 );
    }
}

@matching=sort(@matching);
say @matching;
say "cardinality: ", scalar @matching;
my $string = join( " ", @matching );
$file_out->spew_utf8( $string );
__END__

I would like to extend this so that I'm getting the unicode values too. I threw one from a right to left language (Urdu) in, and I was surprised at how hard it fought me in gedit.

Q3) Do people in Pakistan have emails that read from right to left?

Q4) Does a person handle right to left data fundamentally differently in things like a regex?

So, this is my current output, which I will put in pre tags, so that you can see the characters:

$ ./3.ee.pl 
Elmer Fudd
Daffy Duck
Alternate
Phone
No
7/13/2017
Yes
9/09/2006
daffy@gmail.com
Elmer.am@gmail.com
12/5/2019
бесполезное.использование.кота@gmail.com
kobernIU@hotmail.comp
drüben@msn.com
manilow@barry76@gmail.com
moc.liamg@نالی بلی
時髦的貓@gmail.com
pen@ничего.net 
last@nothing.nyet


Elmer.am@gmail.comben@msn.comdaffy@gmail.comkobernIU@hotmail.complast@nothing.nyetmanilow@barry76
cardinality: 6
$

I'm looking to extend it to "conforming" unicode values, whatever that is.
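
One way the extension could be sketched, as an assumption rather than a vetted answer: swap the ASCII character classes in the script's regex for the Unicode properties `\p{L}` (letters) and `\p{N}` (numbers). The helper name `extract_addresses` and the sample lines are made up for illustration, and the pattern is a loose extractor, not a validator of "conforming" addresses.

```perl
use v5.28;
use warnings;
use utf8;                               # this source contains literal non-ASCII text
binmode STDOUT, ':encoding(UTF-8)';

# Loose extractor, not a validator: letters and digits from any script,
# plus '.' and '_' in the local part and '.' and '-' in the domain.
my $addr_re = qr/([\p{L}\p{N}._]+\@[\p{L}\p{N}.-]+)/;

# Hypothetical helper: return the first address-like token found on each line.
sub extract_addresses {
    my @found;
    for my $line (@_) {
        push @found, $1 if $line =~ $addr_re;
    }
    return @found;
}

# Sample lines modelled on the list above.
my @sample = (
    "Elmer Fudd\n",
    "drüben\@msn.com\n",
    "時髦的貓\@gmail.com\n",
    "pen\@ничего.net \n",
);
say for extract_addresses(@sample);
```

Like the original pattern, this stops at the second `@` in `manilow@barry76@gmail.com` and at the space in the right-to-left entry, and it ignores combining marks (`\p{M}`), so it should be treated as a starting point only.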

Thanks for your comment,


Replies are listed 'Best First'.
Re: regex for unicode email addresses by kcott (Archbishop) on Mar 03, 2022 at 09:36 UTC
G'day Aldebaran,

For testing these email addresses, you could try Regexp::Pattern::Email.

I used this common alias of mine:

$ alias perlu
alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

Here's my test code and output.

$ perlu '
    use Regexp::Pattern;
    my @addrs = (
        q{Elmer Fudd},
        q{Daffy Duck},
        q{Alternate},
        q{Phone},
        q{No},
        q{7/13/2017},
        q{Yes},
        q{9/09/2006},
        q{daffy@gmail.com},
        q{Elmer.am@gmail.com},
        q{12/5/2019},
        q{бесполезное.использование.кота@gmail.com},
        q{kobernIU@hotmail.comp},
        q{drüben@msn.com},
        q{manilow@barry76@gmail.com},
        q{moc.liamg@نالی بلی},
        q{時髦的貓@gmail.com},
        q{pen@ничего.net},
        q{last@nothing.nyet},
    );
    for my $addr (@addrs) {
        if ($addr =~ re("Email::email_address")) {
            say "OK: |$addr|";
        }
        else {
            say "NOK: |$addr|";
        }
    }
'
NOK: |Elmer Fudd|
NOK: |Daffy Duck|
NOK: |Alternate|
NOK: |Phone|
NOK: |No|
NOK: |7/13/2017|
NOK: |Yes|
NOK: |9/09/2006|
OK: |daffy@gmail.com|
OK: |Elmer.am@gmail.com|
NOK: |12/5/2019|
OK: |бесполезное.использование.кота@gmail.com|
OK: |kobernIU@hotmail.comp|
OK: |drüben@msn.com|
OK: |manilow@barry76@gmail.com|
OK: |moc.liamg@نالی بلی|
OK: |時髦的貓@gmail.com|
OK: |pen@ничего.net|
OK: |last@nothing.nyet|

-- Ken
Re^2: regex for unicode email addresses by Aldebaran (Curate) on Mar 10, 2022 at 20:01 UTC
> For testing these email addresses, you could try Regexp::Pattern::Email.

Thx, kcott, it seems to do the trick:

$ ./2.email.kcott.pl
NOK: |Elmer Fudd |
NOK: |Daffy Duck |
NOK: |Alternate |
NOK: |Phone |
NOK: |No |
NOK: |7/13/2017 |
NOK: |Yes |
NOK: |9/09/2006 |
OK: |daffy@gmail.com |
OK: |Elmer.am@gmail.com |
NOK: |12/5/2019 |
OK: |бесполезное.использование.кота@gmail.com |
OK: |kobernIU@hotmail.comp |
OK: |drüben@msn.com |
OK: |manilow@barry76@gmail.com |
OK: |moc.liamg@نالی بلی |
OK: |時髦的貓@gmail.com |
OK: |pen@ничего.net |
OK: |last@nothing.nyet |
NOK: | |
cardinality: 10
Elmer.am@gmail.com daffy@gmail.com drüben@msn.com kobernIU@hotmail.comp last@nothing.nyet manilow@barry76@gmail.com moc.liamg@نالی بلی pen@ничего.net бесполезное.использование.кота@gmail.com 時髦的貓@gmail.com
$ cat 2.email.kcott.pl
#!/usr/bin/perl

use v5.028;    # strictness implied
use warnings;
use Path::Tiny;
binmode STDOUT, ":utf8";

# to install: cpanm Regexp::Pattern::Email
use Regexp::Pattern;
my $file_in = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.kcott.email.output.txt');
my @addrs = $file_in->lines_utf8;
my @matching;
for my $addr (@addrs) {
    if ( $addr =~ re("Email::email_address") ) {
        say "OK: |$addr|";
        push( @matching, $addr );
    }
    else {
        say "NOK: |$addr|";
    }
}
@matching = sort(@matching);
say "cardinality: ", scalar @matching;
my $string = join( " ", @matching );
say "$string";
$file_out->spew_utf8($string);
__END__
$

This seems to accomplish its task, but I had a side-effect on this platform that I'm struggling to understand. Output was to be marshaled by Path::Tiny. What I ended up with every time I ran it was the proper output plus a phantom file like: `1.kcott.email.output.txt93601288741312` , of zero size, that appeared in my file explorer. I don't even know what to call that on this raspberry pi, even having looked through its menus.
When I selected them and hit the delete key, I got:

1.kcott.email.output.txt323160262002: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt323160262002”: No such file or directory
1.kcott.email.output.txt3642662573981: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt3642662573981”: No such file or directory
1.kcott.email.output.txt35531339026259: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35531339026259”: No such file or directory
1.kcott.email.output.txt35631638814375: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35631638814375”: No such file or directory
1.kcott.email.output.txt93601288741312: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt93601288741312”: No such file or directory

and the terminal with `ls -al` showed nothing of them. I took a screenshot to prove to myself that it was happening. Is there an io layer going on that I'm not accounting for? Anyways, the world will keep spinning despite this.

Curious as I am, I took a look inside Regexp-Pattern-Email/source/lib/Regexp/Pattern/Email.pm. How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module:

pat => qr((?:(?^:(?:(?^:(?>(?^:(?^:(?>(?^:(?>(?^:(?>(?^:(?^:(?>\s$(? +:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s ++))[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s$(?:\s(?^:(?^: +(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))))\|\.\|\s +"(?^:(?^:[^\\"])\|(?^:\$?^:[^\x0A\x0D])))+"\s))+))\|(?>(?^:(?^:(?>(? 
+^:(?^:(?>\s\((?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))* +\s$\s))\|(?>\s+))[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s +$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|( +?>\s+))))\|(?^:(?>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\( +?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))"(?^:(?^:[^\\"])\|(?^:\$?^:[^ +\x0A\x0D])))"(?^:(?^:(?>\s\((?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[ +^\x0A\x0D]))\|))\s$\s))\|(?>\s+)))))+))?)(?^:(?>(?^:(?^:(?>\s$(? +:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s ++))<(?^:(?^:(?^:(?>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\ +\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))(?^:(?>[^\x00-\x1F\x7F()<>\ +[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)))(?^:(?^:(? +>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s +))\|(?>\s+))))\|(?^:(?>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^ +:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))"(?^:(?^:[^\\"])\|(?^:\$? +^:[^\x0A\x0D])))"(?^:(?^:(?>\s\((?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\( +?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+)))))\@(?^:(?^:(?>(?^:(?^:(?>\s +$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\| +(?>\s+))(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x +7F()<>\[\]:;@\\,."\s]+)))(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+)) +\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))))\|(?^:(?>(?^:(?^:(?> +\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s) +)\|(?>\s+))\[(?:\s(?^:(?^:[^\[\]\\])\|(?^:\$?^:[^\x0A\x0D]))))\s\] +(?^:(?^:(?>\s\((?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|) +)\s$\s))\|(?>\s+))))))>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+) +)\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+)))))\|(?^:(?^:(?^:(?>( +?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|)) +\s$\s))\|(?>\s+))(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[ +^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)))(?^:(?^:(?>\s$(?:\s(?^:(?^:( +?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))))\|(?^:(? 
+>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\| +))\s$\s))\|(?>\s+))"(?^:(?^:[^\\"])\|(?^:\$?^:[^\x0A\x0D])))"(?^ +:(?^:(?>\s\((?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\ +s$\s))\|(?>\s+)))))\@(?^:(?^:(?>(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[ +^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))(?^:(?>[^\x0 +0-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s] ++)))(?^:(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D +]))\|))\s$\s))\|(?>\s+))))\|(?^:(?>(?^:(?^:(?>\s$(?:\s(?^:(?^:(? +>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+))\[(?:\s(? +^:(?^:[^\[\]\\])\|(?^:\$?^:[^\x0A\x0D]))))\s\](?^:(?^:(?>\s\((?:\s +(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D]))\|))\s$\s))\|(?>\s+)) +)))))(?>(?^:(?>\s$(?:\s(?^:(?^:(?>[^()\\]+))\|(?^:\\(?^:[^\x0A\x0D +]))\|))\s$\s)))))),

Why does this have to be so complicated?
Re^3: regex for unicode email addresses by hv (Prior) on Mar 11, 2022 at 00:55 UTC
> How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of otherwise short module ...

The docs say the regexp is taken from Email::Address, which makes it a lot clearer how it is put together: I guess you'd start at line 137: `our $addr_spec = qr/$local_part\@$domain/;` and work backwards from there.

We also read there that this is implementing RFC 2822, which mandates what a valid email address consists of. At 51 pages - that describe the whole message format, not just email addresses - that's pretty light as standards documents go. :)
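
To illustrate the "work backwards" idea, here is a small sketch of how a long pattern can grow out of short `qr//` pieces. The definitions of `$atext` and `$dot_atom` below are simplified stand-ins, not the real Email::Address source; the character class is copied from the stringified pattern quoted above, and `looks_like_addr_spec` is a hypothetical helper.

```perl
use v5.28;
use warnings;

# Simplified stand-ins; NOT the real Email::Address definitions.
# atext: anything except control characters, "specials" and whitespace.
my $atext      = qr/[^\x00-\x1F\x7F()<>\[\]:;\@\\,."\s]/;
# dot-atom: runs of atext separated by single dots.
my $dot_atom   = qr/$atext+(?:\.$atext+)*/;
my $local_part = $dot_atom;
my $domain     = $dot_atom;
my $addr_spec  = qr/$local_part\@$domain/;

# Hypothetical helper: anchored check against the composed pattern.
# Callers are assumed to have chomped the string first.
sub looks_like_addr_spec {
    my ($str) = @_;
    return $str =~ /\A$addr_spec\z/ ? 1 : 0;
}

# Stringifying the result shows the nested (?^:...) wrappers that each
# interpolated qr// object contributes.
say "$addr_spec";
say looks_like_addr_spec('daffy@gmail.com');
```

Two things seem worth noting, both offered tentatively. Because the character class is negated, any non-ASCII character counts as atext, which would be consistent with the Unicode addresses being reported OK earlier in the thread. And unlike the unanchored match used there, this helper anchors both ends, so `manilow@barry76@gmail.com` is rejected.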
Re: regex for unicode email addresses by soonix (Chancellor) on Mar 03, 2022 at 09:54 UTC
I was not aware that Unicode is allowed in the localpart, but apparently both Wikipedia and RFC 5322 say it is, if encoded as UTF-8. I assume domain names are to be encoded as IDN. Of course, neither WP nor the RFC is authoritative for POSIX or Perl, but most probably POSIX is based upon some RFC, and the Perl modules are, too.
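
A minimal sketch of the distinction drawn here, using only core Perl: split an address at its last `@` and flag which side contains non-ASCII characters. The helper name `classify_address` and its hash keys are invented for this example. As far as I know the internationalized-email extensions themselves are specified in RFC 6531 and RFC 6532; the sketch does no IDN conversion, which would need something like Net::IDN::Encode from CPAN.

```perl
use v5.28;
use warnings;

# Hypothetical helper: split at the LAST '@' and report which parts
# contain characters outside the ASCII range.
sub classify_address {
    my ($addr) = @_;
    my $at = rindex( $addr, '@' );

    # Require at least one character on each side of the '@'.
    return if $at < 1 || $at == length($addr) - 1;

    my $local  = substr( $addr, 0, $at );
    my $domain = substr( $addr, $at + 1 );
    return {
        local          => $local,
        domain         => $domain,
        local_unicode  => ( $local  =~ /[^\x00-\x7F]/ ? 1 : 0 ),   # would need UTF-8 aware mail handling
        domain_unicode => ( $domain =~ /[^\x00-\x7F]/ ? 1 : 0 ),   # would need IDN encoding
    };
}

# Addresses modelled on the thread's sample list, written with escapes
# so the output stays ASCII.
for my $addr (
    'daffy@gmail.com',
    "dr\x{fc}ben\@msn.com",
    "pen\@\x{43d}\x{438}\x{447}\x{435}\x{433}\x{43e}.net",
  )
{
    my $info = classify_address($addr) or next;
    say "local_unicode=$info->{local_unicode} domain_unicode=$info->{domain_unicode}";
}
```

Splitting at the last `@` is a design choice for this sketch, so `manilow@barry76@gmail.com` yields the domain `gmail.com`; it says nothing about whether the local part is valid.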
Re: regex for unicode email addresses by LanX (Saint) on Mar 03, 2022 at 09:49 UTC
> I looked through the PDF:: family at cpan and did not see a way to slurp out a column directly,

I don't think your solution works well, you are showing us a mix of multiple columns.

As long as the fonts are not scrambled, you can use a proper solution like described here: Parsing PDFs by text position?

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re: regex for unicode email addresses by Anonymous Monk on Mar 03, 2022 at 09:14 UTC
regexp common email

https://metacpan.org/pod/Regexp::Common::Email::Address
https://metacpan.org/pod/Email::Address