in reply to Re: regex for unicode email addresses
in thread regex for unicode email addresses
Thx, kcott, it seems to do the trick:
$ ./2.email.kcott.pl
NOK: |Elmer Fudd |
NOK: |Daffy Duck |
NOK: |Alternate |
NOK: |Phone |
NOK: |No |
NOK: |7/13/2017 |
NOK: |Yes |
NOK: |9/09/2006 |
OK: |daffy@gmail.com |
OK: |Elmer.am@gmail.com |
NOK: |12/5/2019 |
OK: |бесполезное.использование.кота@gmail.com |
OK: |kobernIU@hotmail.comp |
OK: |drüben@msn.com |
OK: |manilow@barry76@gmail.com |
OK: |moc.liamg@نالی بلی |
OK: |時髦的貓@gmail.com |
OK: |pen@ничего.net |
OK: |last@nothing.nyet |
NOK: | |
cardinality: 10
Elmer.am@gmail.com
daffy@gmail.com
drüben@msn.com
kobernIU@hotmail.comp
last@nothing.nyet
manilow@barry76@gmail.com
moc.liamg@نالی بلی
pen@ничего.net
бесполезное.использование.кота@gmail.com
時髦的貓@gmail.com
$ cat 2.email.kcott.pl
#!/usr/bin/perl
use v5.028;    # strictness implied
use warnings;
use Path::Tiny;

binmode STDOUT, ":utf8";

# to install: cpanm Regexp::Pattern::Email
use Regexp::Pattern;

my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.kcott.email.output.txt');

my @addrs = $file_in->lines_utf8;
my @matching;

for my $addr (@addrs) {
    if ( $addr =~ re("Email::email_address") ) {
        say "OK: |$addr|";
        push( @matching, $addr );
    }
    else {
        say "NOK: |$addr|";
    }
}

@matching = sort(@matching);
say "cardinality: ", scalar @matching;
my $string = join( " ", @matching );
say "$string";
$file_out->spew_utf8($string);
__END__
$
This seems to accomplish its task, but I had a side effect on this platform that I'm struggling to understand. Output was to be marshaled by Path::Tiny. Every time I ran it, I ended up with the proper output plus a phantom file like:
1.kcott.email.output.txt93601288741312, of zero size, that appeared in my file explorer. I don't even know what to call that on this Raspberry Pi, even having looked through its menus. When I selected the phantom files and hit the delete key, I got:
1.kcott.email.output.txt323160262002: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt323160262002”: No such file or directory
1.kcott.email.output.txt3642662573981: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt3642662573981”: No such file or directory
1.kcott.email.output.txt35531339026259: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35531339026259”: No such file or directory
1.kcott.email.output.txt35631638814375: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35631638814375”: No such file or directory
1.kcott.email.output.txt93601288741312: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt93601288741312”: No such file or directory
The terminal, with ls -al, showed nothing of them. I took a screenshot to prove to myself that it was happening.
Is there an I/O layer going on that I'm not accounting for?
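One possible explanation, offered as a guess rather than a diagnosis: Path::Tiny documents spew (and so spew_utf8) as atomic, meaning the data goes to a temporary file in the same directory, which is then renamed over the target. A file explorer that refreshes at the wrong moment could list that short-lived name and then fail to stat it. The sketch below imitates the idea with core Perl only. The atomic_spew helper, the output.txt name, and the pid-plus-random suffix are assumptions for illustration; the suffix merely resembles the digits in the phantom names and has not been checked against the installed Path::Tiny source.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory so the sketch does not touch real files.
my $dir    = tempdir( CLEANUP => 1 );
my $target = "$dir/output.txt";    # hypothetical target name

# Write $data to a temp file next to $path, then rename it into place.
# Returns the temporary name so the caller can inspect it afterwards.
sub atomic_spew {
    my ( $path, $data ) = @_;
    my $tmp = $path . $$ . int( rand( 2**31 ) );    # assumed naming scheme
    open my $fh, '>:encoding(UTF-8)', $tmp or die "open $tmp: $!";
    print {$fh} $data;
    close $fh or die "close $tmp: $!";
    rename $tmp, $path or die "rename $tmp -> $path: $!";
    return $tmp;
}

my $tmp_name = atomic_spew( $target, "daffy\@gmail.com\n" );

print "temporary name used: $tmp_name\n";
print "temporary name still exists: ", ( -e $tmp_name ? "yes" : "no" ), "\n";
print "target size in bytes: ", ( -s $target ), "\n";
```

If this is what is happening, the phantom entries are stale listings in the file explorer, not files on disk, which would agree with ls -al showing nothing.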
Anyways, the world will keep spinning despite this. Curious as I am, I took a look inside Regexp-Pattern-Email/source/lib/Regexp/Pattern/Email.pm.
How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module:
pat => qr((?:(?^:(?:(?^:(?>(?^:(?^:(?>(?^:(?>(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s*\((?:\s*(?^:(?^: +(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|\.|\s +*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))+"\s*))+))|(?>(?^:(?^:(?>(? +^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))* +\s*\)\s*))|(?>\s+))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s* +\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|( +?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^ +\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[ +^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))+))?)(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*<(?^:(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\ +\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\ +[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(? +>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s* +))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^ +:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(? 
+^:[^\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s +*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))| +(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x +7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+)) +|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?> +\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*) +)|(?>\s+))*\[(?:\s*(?^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\] +(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|) +)*\s*\)\s*))|(?>\s+))*))))>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+) +)|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))|(?^:(?^:(?^:(?>( +?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|)) +*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[ +^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:( +?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(? +>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))| +))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))*"(?^ +:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\ +s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[ +^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x0 +0-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s] ++)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(? +>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*\[(?:\s*(? +^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\](?^:(?^:(?>\s*\((?:\s +*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+)) +*)))))(?>(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))*)))),
Why does this have to be so complicated?
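A plausible reading, and only a guess about how the module's pattern was produced: the many (?^: groups are what Perl emits when a compiled qr// object is turned back into a string, so a pattern built by interpolating small named pieces into bigger ones looks like this once it is dumped. The sketch below builds a deliberately tiny address pattern that way. The names $atom, $dot_atom and $addr are made up for illustration, and the pattern is nowhere near the real grammar for addresses.

```perl
use strict;
use warnings;

# Small named building blocks, each compiled with qr//.
my $atom     = qr/[^\s\@.]+/;                # run of non-space, non-@, non-dot
my $dot_atom = qr/$atom(?:\.$atom)*/;        # atoms separated by dots
my $addr     = qr/$dot_atom\@$dot_atom/;     # local part, @, domain

# Stringifying a qr// object shows the wrappers Perl adds per piece.
print "atom:     $atom\n";
print "dot_atom: $dot_atom\n";
print "addr:     $addr\n";
printf "addr is %d characters long\n", length "$addr";

# The composed pattern still works as an ordinary regex.
for my $candidate ( 'daffy@gmail.com', 'Elmer Fudd' ) {
    my $verdict = $candidate =~ /\A$addr\z/ ? 'OK' : 'NOK';
    print "$verdict: |$candidate|\n";
}
```

Three short lines of source already print as a nest of groups. Add optional comments, whitespace, quoted strings and bracketed domains at every position, as the dump above appears to do, and the flattened result grows to what is shown in Email.pm even if the generating source is readable.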
Re^3: regex for unicode email addresses
by hv (Prior) on Mar 11, 2022 at 00:55 UTC