in reply to RegExp help

You do know that *johnsmith@yahoo.com is actually valid syntax, don't you?

Here's a regexp:

    $pat = qr {
        (?(DEFINE)
          (?<address>         (?&mailbox) | (?&group))
          (?<mailbox>         (?&name_addr) | (?&addr_spec))
          (?<name_addr>       (?&display_name)? (?&angle_addr))
          (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
          (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                                                 (?&CFWS)?)
          (?<display_name>    (?&phrase))
          (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

          (?<addr_spec>       (?&local_part) \@ (?&domain))
          (?<local_part>      (?&dot_atom) | (?&quoted_string))
          (?<domain>          (?&dot_atom) | (?&domain_literal))
          (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                        \] (?&CFWS)?)
          (?<dcontent>        (?&dtext) | (?&quoted_pair))
          (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

          (?<atext>           (?&ALPHA) | (?&DIGIT) | [-!#\$%&'*+/=?^_`{|}~])
          (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
          (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
          (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

          (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
          (?<quoted_pair>     \\ (?&text))

          (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
          (?<qcontent>        (?&qtext) | (?&quoted_pair))
          (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                               (?&FWS)? (?&DQUOTE) (?&CFWS)?)

          (?<word>            (?&atom) | (?&quoted_string))
          (?<phrase>          (?&word)+)

          # Folding white space
          (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
          (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
          (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
          (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
          (?<CFWS>            (?: (?&FWS)? (?&comment))*
                              (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

          # No whitespace control
          (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

          (?<ALPHA>           [A-Za-z])
          (?<DIGIT>           [0-9])
          (?<CRLF>            \x0d \x0a)
          (?<DQUOTE>          ")
          (?<WSP>             [\x20\x09])
        )

        (?&address)
    }x;

    while (<DATA>) {
        chomp;
        use 5.010;
        say $_ if /^$pat$/;
    }

    __DATA__
    *johnsmith@yahoo.com
    hello12@.com
    hello@gmail..com

This prints only *johnsmith@yahoo.com, since it is the only entry that is syntactically valid.

Re^2: RegExp help
by heatblazer (Scribe) on Mar 27, 2012 at 03:42 UTC

    Thanks for the awesome example; however, it's too much for me to understand yet.

      There's little I can offer to make it more understandable: the syntax of email addresses is complex. Just be glad that we're living in a post-5.10 world: now we can use named rules and recursion, which allow us to translate BNF grammars to regular expressions almost mechanically. In one (both?) of the editions of "Mastering Regular Expressions", Jeffrey Friedl gives a pre-5.10 regular expression to match email addresses. That one is far, far more complex (and doesn't allow comments nested beyond a certain depth (2, IIRC)).
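
      To make the "almost mechanical" translation concrete, here is a minimal sketch on a toy grammar for nested comments. The two rules are a simplified assumption for illustration (the full pattern above also handles folding white space and quoted pairs), and the helper name is_comment is hypothetical.

```perl
use strict;
use warnings;
use 5.010;

# Toy grammar (a simplified assumption, not the full RFC rule set):
#   comment = "(" *( ctext / comment ) ")"
#   ctext   = any character except "(", ")" and "\"
# Each BNF rule becomes one named group inside (?(DEFINE) ...);
# a reference to a rule becomes (?&name), including the recursive one.
my $comment = qr{
    (?(DEFINE)
        (?<ctext>   [^()\\] )
        (?<comment> \( (?: (?&ctext) | (?&comment) )* \) )
    )
    (?&comment)
}x;

# Hypothetical helper: true if the whole string is one balanced comment.
sub is_comment {
    my ($text) = @_;
    return $text =~ /\A$comment\z/ ? 1 : 0;
}

for my $candidate ('(plain)', '(one (two (three)))', '(unbalanced (oops)') {
    say "${candidate}: ", is_comment($candidate) ? 'match' : 'no match';
}
```

      The first two strings match and the last one does not, because its outer parenthesis is never closed. Nothing in the pattern limits the nesting depth, which is what the recursive (?&comment) reference buys you.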

      You may want to look at RFC 822, or one of its descendants, for the grammar of email addresses. It's my understanding that the regexp I gave was constructed from the grammar given in one of those RFCs. (I don't recall which one, and the file t/re/reg_email.t doesn't say where it comes from.)

        Well, I am working through some tutorials now and trying to parse your previous example (go easy on me tho :). To begin with, I've started with something as simple as this (using a simple divide and conquer technique):

        my $match;
        if ( $match = split(/\s/, 'somemail@yahu.com') == 1 ) {
            # do we have a whole string with no spaces in it?
            # if so, check for '@'
            if ( $match = split("@", 'somemail@yahu.com') == 2 ) {
                # if it's split in 2 then there must be 1 @ sign
                # then do some other nested checks
                # with regex for validating mail, e.g. are there any
                # dots or hash signs at the end of the domain etc.
            }
        }
        else {
            print "Mail verification form failed!\n";
            # call back the mail form
        }
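
        The same two checks can be sketched as a small sub. This is only a coarse pre-filter under stated assumptions, not a validator: the name passes_basic_checks is hypothetical, and the sketch rejects some addresses the grammar allows (a quoted local part may contain spaces or '@') while accepting some it forbids, such as hello12@.com.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: a coarse pre-filter, not an RFC-compliant validator.
sub passes_basic_checks {
    my ($candidate) = @_;

    # Step 1: reject anything that contains whitespace.
    return 0 if $candidate =~ /\s/;

    # Step 2: count the '@' signs; tr/// returns the number of matches.
    my $at_signs = ($candidate =~ tr/@//);
    return 0 unless $at_signs == 1;

    # Step 3: both sides of the '@' must be non-empty.
    # The -1 limit keeps a trailing empty field, so 'name@' is caught.
    my ($local, $domain) = split /\@/, $candidate, -1;
    return 0 unless length $local and length $domain;

    return 1;
}

for my $candidate ('somemail@yahu.com', 'some mail@yahu.com', 'a@b@c') {
    say "${candidate}: ", passes_basic_checks($candidate) ? 'ok' : 'rejected';
}
```

        Counting with tr/// avoids relying on how split drops trailing empty fields, which is why the '@' count and the split are kept as separate steps.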

        Now I'll keep going with regexes until I master them, because I really want to know what is going on in there and how.