regex for unicode email addresses

by Aldebaran (Curate)
on Mar 03, 2022 at 07:42 UTC

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

Happy March! An unusual amount of madness is going on, and I find that by attending to my own mundane problems, I can be marginally helpful to everyone else. I turn to Perl for solutions to problems that would otherwise baffle me. Unfortunately, it's thick stuff, and the bafflement is always nearby when I start to work with Unicode.

One of the tasks I was assigned was to compile an email list that originated as a .pdf. I looked through the PDF:: family on CPAN and did not see a way to slurp out a column directly, so I availed myself of open source software called calibre, wherein I was able to save the document as txt. I've worked up an SSCCE-sized list to imitate it, with the addition that Unicode could be involved.

Elmer Fudd
Daffy Duck
Alternate
Phone
No
7/13/2017
Yes
9/09/2006
daffy@gmail.com
Elmer.am@gmail.com
12/5/2019
бесполезное.использование.кота@gmail.com
kobernIU@hotmail.comp
drüben@msn.com
manilow@barry76@gmail.com
moc.liamg@نالی بلی
時髦的貓@gmail.com
pen@ничего.net
last@nothing.nyet

Q1) Is there a perl or posix standard for what comprises a valid email?

Q2) Are unicode characters allowed in every part?

This is my current script:

#!/usr/bin/perl
use v5.028;    # strictness implied
use warnings;
use Path::Tiny;

binmode STDOUT, ":utf8";

my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.sscce.email.output.txt');

my @lines = $file_in->lines_utf8;
say @lines;

my @matching;
for my $line (@lines) {
    if ( $line =~ /([A-Za-z0-9._]+\@[a-z0-9.-]+)/ ) {
        push( @matching, $1 );
    }
}

@matching = sort(@matching);
say @matching;
say "cardinality: ", scalar @matching;
my $string = join( " ", @matching );
$file_out->spew_utf8($string);
__END__

I would like to extend this so that I'm getting the unicode values too. I threw one from a right to left language (Urdu) in, and I was surprised at how hard it fought me in gedit.

Q3) Do people in Pakistan have emails that read from right to left?

Q4) Does a person handle right to left data fundamentally differently in things like a regex?
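On Q4, a minimal sketch, assuming the text has already been decoded to Perl characters: strings are held in logical (typing) order, and right-to-left display is left to the editor or terminal, so a regex sees the first-typed character first. The helper name `codepoints` is hypothetical, and the word is just three Arabic-script letters built from escapes.

```perl
use v5.28;
use warnings;

# Hypothetical helper: list the code points of a decoded string in stored order.
sub codepoints {
    my ($text) = @_;
    return map { sprintf 'U+%04X', ord } split //, $text;
}

# Three Arabic-script letters (BEH, LAM, FARSI YEH), written as escapes so the
# example does not depend on how an editor displays right-to-left text.
my $word = "\x{0628}\x{0644}\x{06CC}";

say join ' ', codepoints($word);                      # U+0628 U+0644 U+06CC
say 'starts with BEH'   if $word =~ /\A\x{0628}/;     # \A anchors at the first stored character
say 'all Arabic script' if $word =~ /\A\p{Arabic}+\z/;
```

Nothing in the pattern syntax changes for right-to-left scripts; only the on-screen rendering does.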

So, this is my current output, which I will put in pre tags, so that you can see the characters:

$ ./3.ee.pl 
Elmer Fudd
Daffy Duck
Alternate
Phone
No
7/13/2017
Yes
9/09/2006
daffy@gmail.com
Elmer.am@gmail.com
12/5/2019
бесполезное.использование.кота@gmail.com
kobernIU@hotmail.comp
drüben@msn.com
manilow@barry76@gmail.com
moc.liamg@نالی بلی
時髦的貓@gmail.com
pen@ничего.net 
last@nothing.nyet


Elmer.am@gmail.comben@msn.comdaffy@gmail.comkobernIU@hotmail.complast@nothing.nyetmanilow@barry76
cardinality: 6
$ 

I'm looking to extend it to "conforming" unicode values, whatever that is.

Thanks for your comment,
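One way the ASCII-only character classes in the script above could be widened, offered as a sketch and not as a validator: the Unicode properties `\p{L}` (letters) and `\p{N}` (numbers) stand in for `A-Za-z0-9`. The sub name `extract_candidates` and the decision to also allow `+` and `-` in the local part are assumptions, and the sample list is inlined so that only core Perl is needed.

```perl
use v5.28;
use warnings;
use utf8;                               # the source below contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return the first address-like token on each line.
sub extract_candidates {
    my @lines = @_;
    my @found;
    for my $line (@lines) {
        # \p{L} = any letter, \p{N} = any number, in any script
        push @found, $1
            if $line =~ /([\p{L}\p{N}._+-]+\@[\p{L}\p{N}.-]+)/;
    }
    return @found;
}

my @sample = (
    'Phone',
    'daffy@gmail.com',
    'drüben@msn.com',
    '時髦的貓@gmail.com',
    'pen@ничего.net',
);
say for extract_candidates(@sample);
```

Like the original pattern, this keeps only the first match per line, so `manilow@barry76@gmail.com` would still come out as `manilow@barry76`.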

Replies are listed 'Best First'.
Re: regex for unicode email addresses
by kcott (Archbishop) on Mar 03, 2022 at 09:36 UTC

    G'day Aldebaran,

    For testing these email addresses, you could try Regexp::Pattern::Email.

    I used this common alias of mine:

    $ alias perlu
    alias perlu='perl -Mstrict -Mwarnings -Mautodie=:all -Mutf8 -C -E'

    Here's my test code and output.

    $ perlu '
        use Regexp::Pattern;
    
        my @addrs = (
            q{Elmer Fudd},
            q{Daffy Duck},
            q{Alternate},
            q{Phone},
            q{No},
            q{7/13/2017},
            q{Yes},
            q{9/09/2006},
            q{daffy@gmail.com},
            q{Elmer.am@gmail.com},
            q{12/5/2019},
            q{бесполезное.использование.кота@gmail.com},
            q{kobernIU@hotmail.comp},
            q{drüben@msn.com},
            q{manilow@barry76@gmail.com},
            q{moc.liamg@نالی بلی},
            q{時髦的貓@gmail.com},
            q{pen@ничего.net},
            q{last@nothing.nyet},
        );
    
        for my $addr (@addrs) {
            if ($addr =~ re("Email::email_address")) {
                say "OK:  |$addr|";
            }
            else {
                say "NOK: |$addr|";
            }
        }
    '
    NOK: |Elmer Fudd|
    NOK: |Daffy Duck|
    NOK: |Alternate|
    NOK: |Phone|
    NOK: |No|
    NOK: |7/13/2017|
    NOK: |Yes|
    NOK: |9/09/2006|
    OK:  |daffy@gmail.com|
    OK:  |Elmer.am@gmail.com|
    NOK: |12/5/2019|
    OK:  |бесполезное.использование.кота@gmail.com|
    OK:  |kobernIU@hotmail.comp|
    OK:  |drüben@msn.com|
    OK:  |manilow@barry76@gmail.com|
    OK:  |moc.liamg@نالی بلی|
    OK:  |時髦的貓@gmail.com|
    OK:  |pen@ничего.net|
    OK:  |last@nothing.nyet|
    

    — Ken

      For testing these email addresses, you could try Regexp::Pattern::Email.

      Thx, kcott, it seems to do the trick:

      $ ./2.email.kcott.pl 
      NOK: |Elmer Fudd
      |
      NOK: |Daffy Duck
      |
      NOK: |Alternate
      |
      NOK: |Phone
      |
      NOK: |No
      |
      NOK: |7/13/2017
      |
      NOK: |Yes
      |
      NOK: |9/09/2006
      |
      OK:  |daffy@gmail.com
      |
      OK:  |Elmer.am@gmail.com
      |
      NOK: |12/5/2019
      |
      OK:  |бесполезное.использование.кота@gmail.com
      |
      OK:  |kobernIU@hotmail.comp
      |
      OK:  |drüben@msn.com
      |
      OK:  |manilow@barry76@gmail.com
      |
      OK:  |moc.liamg@نالی بلی
      |
      OK:  |時髦的貓@gmail.com
      |
      OK:  |pen@ничего.net 
      |
      OK:  |last@nothing.nyet
      |
      NOK: |
      |
      cardinality: 10
      Elmer.am@gmail.com
       daffy@gmail.com
       drüben@msn.com
       kobernIU@hotmail.comp
       last@nothing.nyet
       manilow@barry76@gmail.com
       moc.liamg@نالی بلی
       pen@ничего.net 
       бесполезное.использование.кота@gmail.com
       時髦的貓@gmail.com
      
      $ cat 2.email.kcott.pl
      
      #!/usr/bin/perl
      use v5.028;    # strictness implied
      use warnings;
      use Path::Tiny;

      binmode STDOUT, ":utf8";

      # to install: cpanm Regexp::Pattern::Email
      use Regexp::Pattern;

      my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
      my $file_out = path('/home/pi/Documents/curate/1.kcott.email.output.txt');

      my @addrs = $file_in->lines_utf8;
      my @matching;

      for my $addr (@addrs) {
          if ( $addr =~ re("Email::email_address") ) {
              say "OK:  |$addr|";
              push( @matching, $addr );
          }
          else {
              say "NOK: |$addr|";
          }
      }

      @matching = sort(@matching);
      say "cardinality: ", scalar @matching;
      my $string = join( " ", @matching );
      say "$string";
      $file_out->spew_utf8($string);
      __END__
      $
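The closing `|` landing on its own line in the output above suggests that each line still carries its trailing newline when it is tested and stored. A minimal sketch of stripping it first, using only core Perl; `clean_lines` is a hypothetical helper, and the trimming of surrounding whitespace is an added assumption. With Path::Tiny the same effect should be available by passing `{ chomp => 1 }` to `lines_utf8`.

```perl
use v5.28;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: drop trailing newlines, trim whitespace, skip empty lines.
sub clean_lines {
    my @lines = @_;
    chomp @lines;                       # remove the trailing newline from each element
    s/\A\s+|\s+\z//g for @lines;        # trim leading and trailing whitespace
    return grep { length } @lines;      # discard lines that are now empty
}

my @raw = ("daffy\@gmail.com\n", "pen\@ничего.net \n", "\n");
say "|$_|" for clean_lines(@raw);
# |daffy@gmail.com|
# |pen@ничего.net|
```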

      This seems to accomplish its task, but I had a side-effect on this platform that I'm struggling to understand. Output was to be marshaled by Path::Tiny. What I ended up with every time I ran it was the proper output plus a phantom file like:

      1.kcott.email.output.txt93601288741312

      , of zero size, that appeared in my file explorer. I don't even know what to call that on this raspberry pi, even having looked through its menus. When I selected them and hit the delete key, I got:

      1.kcott.email.output.txt323160262002: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt323160262002”: No such file or directory
      1.kcott.email.output.txt3642662573981: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt3642662573981”: No such file or directory
      1.kcott.email.output.txt35531339026259: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35531339026259”: No such file or directory
      1.kcott.email.output.txt35631638814375: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt35631638814375”: No such file or directory
      1.kcott.email.output.txt93601288741312: Error when getting information for file “/home/pi/Documents/curate/1.kcott.email.output.txt93601288741312”: No such file or directory

      , and the terminal with ls -al showed nothing of them. I took a screenshot to prove to myself that it was happening.

      Is there an io layer going on that I'm not accounting for?
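A possible explanation, offered as an assumption and not a diagnosis: as far as I know, Path::Tiny's spew family writes to a temporary file next to the target and then renames it into place, and the temporary name is the target path with the process ID and a random number appended. That would fit names like 1.kcott.email.output.txt93601288741312, and a file manager that noticed the temporary file before the rename could go on showing a stale entry. The sketch below imitates that pattern with core functions only; `atomic_spew` is a hypothetical name, not Path::Tiny's code.

```perl
use v5.28;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write to a temporary sibling file, then rename over the target.
sub atomic_spew {
    my ($target, $content) = @_;
    my $temp = $target . $$ . int( rand( 2**31 ) );   # target + PID + random number
    open my $fh, '>:encoding(UTF-8)', $temp or die "open $temp: $!";
    print {$fh} $content;
    close $fh or die "close $temp: $!";
    rename $temp, $target or die "rename $temp: $!";  # the temporary name vanishes here
    return $temp;
}

my $dir    = tempdir( CLEANUP => 1 );
my $target = File::Spec->catfile( $dir, 'output.txt' );
my $temp   = atomic_spew( $target, "daffy\@gmail.com\n" );

say 'target exists: ', ( -e $target ? 'yes' : 'no' );   # yes
say 'temp exists:   ', ( -e $temp   ? 'yes' : 'no' );   # no
```

If this is what is happening, the zero-size entries are leftovers in the file manager's view only, which would agree with ls -al showing nothing.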

      Anyways, the world will keep spinning despite this. Curious as I am, I took a look inside Regexp-Pattern-Email/source/lib/Regexp/Pattern/Email.pm

      How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module:

      pat => qr((?:(?^:(?:(?^:(?>(?^:(?^:(?>(?^:(?>(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s*\((?:\s*(?^:(?^: +(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|\.|\s +*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))+"\s*))+))|(?>(?^:(?^:(?>(? +^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))* +\s*\)\s*))|(?>\s+))*[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?^:(?^:(?>\s* +\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|( +?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^ +\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[ +^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))+))?)(?^:(?>(?^:(?^:(?>\s*\((? +:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s ++))*<(?^:(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\ +\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\ +[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(? +>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s* +))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^ +:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(? 
+^:[^\x0A\x0D])))*"(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\( +?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s +*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))| +(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x +7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+)) +|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?> +\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*) +)|(?>\s+))*\[(?:\s*(?^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\] +(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|) +)*\s*\)\s*))|(?>\s+))*))))>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+) +)|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*)))|(?^:(?^:(?^:(?>( +?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|)) +*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[ +^\x00-\x1F\x7F()<>\[\]:;@\\,."\s]+)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:( +?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(? +>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))| +))*\s*\)\s*))|(?>\s+))*"(?^:(?^:[^\\"])|(?^:\\(?^:[^\x0A\x0D])))*"(?^ +:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\ +s*\)\s*))|(?>\s+))*)))\@(?^:(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[ +^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*(?^:(?>[^\x0 +0-\x1F\x7F()<>\[\]:;@\\,."\s]+(?:\.[^\x00-\x1F\x7F()<>\[\]:;@\\,."\s] ++)*))(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))|(?>\s+))*))|(?^:(?>(?^:(?^:(?>\s*\((?:\s*(?^:(?^:(? +>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+))*\[(?:\s*(? +^:(?^:[^\[\]\\])|(?^:\\(?^:[^\x0A\x0D]))))*\s*\](?^:(?^:(?>\s*\((?:\s +*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D]))|))*\s*\)\s*))|(?>\s+)) +*)))))(?>(?^:(?>\s*\((?:\s*(?^:(?^:(?>[^()\\]+))|(?^:\\(?^:[^\x0A\x0D +]))|))*\s*\)\s*))*)))),

      Why does this have to be so complicated?

        How on earth could anyone or anything figure out what is going on in the regex that lies in the middle of an otherwise short module ...

        The docs say the regexp is taken from Email::Address, which makes it a lot clearer how it is put together: I guess you'd start at line 137: our $addr_spec  = qr/$local_part\@$domain/; and work backwards from there.

        We also read there that this is implementing RFC 2822, which mandates what a valid email address consists of. At 51 pages - that describe the whole message format, not just email addresses - that's pretty light as standards documents go. :)
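To make the "work backwards" idea concrete, here is a much simplified sketch of how such a pattern can be assembled from small named pieces. The variable names echo the `$local_part`, `$domain` and `$addr_spec` mentioned above, but the definitions are assumptions for illustration: comments, quoted strings and domain literals are left out, so this is not the real Email::Address code. `looks_like_addr` is a hypothetical helper.

```perl
use v5.28;
use warnings;

# One character that may appear in an atom: anything except controls,
# whitespace and the RFC "specials".
my $atext      = qr/[^\x00-\x1F\x7F()<>\[\]:;\@\\,."\s]/;
my $dot_atom   = qr/$atext+(?:\.$atext+)*/;       # atoms joined by single dots
my $local_part = $dot_atom;                       # simplified: no quoted strings
my $domain     = $dot_atom;                       # simplified: no [domain literals]
my $addr_spec  = qr/$local_part\@$domain/;

# Hypothetical helper: the whole string must be one addr-spec.
sub looks_like_addr {
    my ($candidate) = @_;
    return $candidate =~ /\A$addr_spec\z/ ? 1 : 0;
}

for my $candidate ( 'daffy@gmail.com', 'manilow@barry76@gmail.com', 'Phone' ) {
    say looks_like_addr($candidate) ? 'OK:  ' : 'NOK: ', "|$candidate|";
}

# Printing the compiled pattern shows how nested qr// pieces expand into
# the kind of (?^:...) soup seen in the module.
say $addr_spec;
```

Each `qr//` object is wrapped in its own `(?^:...)` group when interpolated, which accounts for much of the visual noise in the flattened pattern.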

Re: regex for unicode email addresses
by soonix (Canon) on Mar 03, 2022 at 09:54 UTC

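    I was not aware that Unicode is allowed in the local part, but apparently both Wikipedia and the RFCs say it is, if encoded as UTF-8 (strictly, RFC 5322 itself is ASCII-only; the Unicode allowance comes from its internationalization extensions, RFC 6531 and RFC 6532). I assume domain names are to be encoded as IDN.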

    Of course, neither WP nor RFC are authoritative for POSIX or Perl, but most probably POSIX is based upon some RFC, and the Perl modules are, too.
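Since the local part and the domain are treated differently (UTF-8 for one, IDN for the other), it can help to see which side of an address actually contains non-ASCII characters. A minimal sketch, assuming already-decoded strings and splitting at the last `@`; `non_ascii_parts` is a hypothetical helper and performs no validation.

```perl
use v5.28;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: report which parts of an address contain non-ASCII characters.
sub non_ascii_parts {
    my ($addr) = @_;
    # Greedy .+ means the split happens at the last @ in the string.
    my ( $local, $domain ) = $addr =~ /\A(.+)\@([^\@]+)\z/s
        or return;
    my @parts;
    push @parts, 'local part' if $local  =~ /\P{ASCII}/;
    push @parts, 'domain'     if $domain =~ /\P{ASCII}/;
    return @parts;
}

for my $addr ( 'daffy@gmail.com', 'drüben@msn.com', 'pen@ничего.net' ) {
    my @parts = non_ascii_parts($addr);
    say $addr, ': ', ( @parts ? join( ', ', @parts ) : 'ASCII only' );
}
# daffy@gmail.com: ASCII only
# drüben@msn.com: local part
# pen@ничего.net: domain
```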

Re: regex for unicode email addresses
by LanX (Saint) on Mar 03, 2022 at 09:49 UTC

    > I looked through the PDF:: family at cpan and did not see a way to slurp out a column directly,

    I don't think your solution works well; you are showing us a mix of multiple columns.

    As long as the fonts are not scrambled, you can use a proper solution like described here:

    Parsing PDFs by text position?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: regex for unicode email addresses
by Anonymous Monk on Mar 03, 2022 at 09:14 UTC
