Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
Happy March! An unusual amount of madness is going on, and I find that by attending to my own mundane problems, I can be marginally helpful to everyone else. I turn to Perl for solutions to problems that would otherwise baffle me. Unfortunately, it's thick stuff, and the bafflement is always nearby when I start to work with Unicode.
One of the tasks I was assigned was to compile an email list that originated as a .pdf. I looked through the PDF:: family on CPAN and did not see a way to slurp out a column directly, so I availed myself of the open source software calibre, with which I was able to save the document as txt. I've worked up an SSCCE-sized list to imitate it, with the addition that Unicode could be involved.
Elmer Fudd Daffy Duck Alternate Phone No 7/13/2017 Yes 9/09/2006 daffy@gmail.com Elmer.am@gmail.com 12/5/2019 бесполезное.использование.кота@gmail.com kobernIU@hotmail.comp drüben@msn.com manilow@barry76@gmail.com moc.liamg@نالی بلی 時髦的貓@gmail.com pen@ничего.net last@nothing.nyet
Q1) Is there a Perl or POSIX standard for what comprises a valid email address?
Q2) Are Unicode characters allowed in every part?
This is my current script:
#!/usr/bin/perl
use v5.028;    # strictness implied
use warnings;
use Path::Tiny;

binmode STDOUT, ":utf8";

my $file_in  = path("/home/pi/Documents/curate/1.sscce.email.txt");
my $file_out = path('/home/pi/Documents/curate/1.sscce.email.output.txt');

my @lines = $file_in->lines_utf8;
say @lines;

my @matching;
for my $line (@lines){
  if ( $line =~ /([A-Za-z0-9._]+\@[a-z0-9.-]+)/){
    push( @matching, $1 );
  }
}
@matching=sort(@matching);
say @matching;
say "cardinality: ", scalar @matching;

my $string = join( " ", @matching );
$file_out->spew_utf8( $string );
__END__
I would like to extend this so that I'm getting the Unicode values too. I threw in one from a right-to-left language (Urdu), and I was surprised at how hard it fought me in gedit.
Q3) Do people in Pakistan have email addresses that read from right to left?
Q4) Does a person handle right-to-left data fundamentally differently in things like a regex?
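On the right-to-left question, my working assumption is that Unicode text is stored in logical (typing) order and that only the display layer reorders it, so a regex should see the first typed character first, whatever gedit shows. Here is a small sketch to check that assumption. The address is hypothetical and is built from code points so that no editor can rearrange it; `split_logical` is just a name I made up for the test.

```perl
use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical address: a three-letter Urdu local part followed by an
# ASCII domain, spelled out as code points in logical (typing) order.
my $addr = "\x{0628}\x{0644}\x{06CC}\@gmail.com";

# Split an address whose local part is in Arabic script.
# Returns (local, domain) on success, or an empty list.
sub split_logical {
    my ($s) = @_;
    return $s =~ /\A(\p{Arabic}+)\@([a-z0-9.-]+)\z/ ? ( $1, $2 ) : ();
}

my ( $local, $domain ) = split_logical($addr);

# The regex anchors at the first stored character, not the leftmost
# character on screen.
printf "first code point: U+%04X\n", ord $addr;
print "local part length: ", length($local), "\n";
print "domain: $domain\n";
```

If that assumption holds, the pattern itself needs no special right-to-left handling; the difficulty is in how terminals and editors draw the line.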
So, this is my current output, which I will put in pre tags, so that you can see the characters:
$ ./3.ee.pl
Elmer Fudd Daffy Duck Alternate Phone No 7/13/2017 Yes 9/09/2006 daffy@gmail.com Elmer.am@gmail.com 12/5/2019 бесполезное.использование.кота@gmail.com kobernIU@hotmail.comp drüben@msn.com manilow@barry76@gmail.com moc.liamg@نالی بلی 時髦的貓@gmail.com pen@ничего.net last@nothing.nyet
Elmer.am@gmail.comben@msn.comdaffy@gmail.comkobernIU@hotmail.complast@nothing.nyetmanilow@barry76
cardinality: 6
$
I'm looking to extend it to "conforming" unicode values, whatever that is.
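To make the goal concrete, here is a minimal sketch of the direction I have in mind. It assumes that "conforming" can be loosened to "letters, marks, digits and a little punctuation on both sides of exactly one @, with at least one dot in the domain". That is my own working definition, not a check against any RFC, and the sample list and the name `extract_emails` are only for illustration.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal UTF-8 strings
binmode STDOUT, ':encoding(UTF-8)';

# Assumption: Unicode properties stand in for the ASCII ranges.
my $local = qr/[\p{L}\p{M}\p{N}._+-]+/;
my $label = qr/[\p{L}\p{M}\p{N}](?:[\p{L}\p{M}\p{N}-]*[\p{L}\p{M}\p{N}])?/;

# The lookbehind and lookahead keep the match from starting or ending
# in the middle of a longer token, which rejects a@b@c.d style input.
my $email = qr/(?<![\p{L}\p{M}\p{N}._+\-\@])
               ($local\@$label(?:\.$label)+)
               (?![\p{L}\p{M}\p{N}\@])/x;

# Return every address-like token found in the given lines.
sub extract_emails {
    my @found;
    for my $line (@_) {
        push @found, $line =~ /$email/g;
    }
    return @found;
}

my @sample = (
    'daffy@gmail.com',
    'drüben@msn.com',
    'manilow@barry76@gmail.com',    # two @ signs, should be skipped
    '時髦的貓@gmail.com',
    'pen@ничего.net',
);
print "$_\n" for sort( extract_emails(@sample) );
```

This still accepts things like kobernIU@hotmail.comp, since a pattern alone cannot know which top-level domains exist.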
Thanks for your comment,
Replies are listed 'Best First'.

Re: regex for unicode email addresses
by kcott (Archbishop) on Mar 03, 2022 at 09:36 UTC
by Aldebaran (Curate) on Mar 10, 2022 at 20:01 UTC
by hv (Prior) on Mar 11, 2022 at 00:55 UTC

Re: regex for unicode email addresses
by soonix (Chancellor) on Mar 03, 2022 at 09:54 UTC

Re: regex for unicode email addresses
by LanX (Saint) on Mar 03, 2022 at 09:49 UTC

Re: regex for unicode email addresses
by Anonymous Monk on Mar 03, 2022 at 09:14 UTC