Re: RegExp help

You do know that *johnsmith@yahoo.com is actually valid syntax, don't you?

Here's a regexp:

my $pat = qr {
    (?(DEFINE)
      (?<address>         (?&mailbox) | (?&group))
      (?<mailbox>         (?&name_addr) | (?&addr_spec))
      (?<name_addr>       (?&display_name)? (?&angle_addr))
      (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
      (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ;
                                             (?&CFWS)?)
      (?<display_name>    (?&phrase))
      (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

      (?<addr_spec>       (?&local_part) \@ (?&domain))
      (?<local_part>      (?&dot_atom) | (?&quoted_string))
      (?<domain>          (?&dot_atom) | (?&domain_literal))
      (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                    \] (?&CFWS)?)
      (?<dcontent>        (?&dtext) | (?&quoted_pair))
      (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
      
      (?<atext>           (?&ALPHA) | (?&DIGIT) | [-!#\$%&'*+/=?^_`{|}~])
      (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
      (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
      (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

      (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
      (?<quoted_pair>     \\ (?&text))
    
      (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
      (?<qcontent>        (?&qtext) | (?&quoted_pair))
      (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                           (?&FWS)? (?&DQUOTE) (?&CFWS)?)

      (?<word>            (?&atom) | (?&quoted_string))
      (?<phrase>          (?&word)+)
      
      # Folding white space
      (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
      (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
      (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
      (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
      (?<CFWS>            (?: (?&FWS)? (?&comment))*
                          (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

      # No whitespace control
      (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
      
      (?<ALPHA>           [A-Za-z])
      (?<DIGIT>           [0-9])
      (?<CRLF>            \x0d \x0a)
      (?<DQUOTE>          ")
      (?<WSP>             [\x20\x09])
    )
      
    (?&address)
}x;
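use 5.010;    # named rules, (?(DEFINE)...) and say all need perl 5.10+
while (<DATA>) {
    chomp;
    say if /^$pat$/;    # print only the lines that are a valid address
}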
__DATA__
*johnsmith@yahoo.com
hello12@.com
hello@gmail..com

This will print *johnsmith@yahoo.com as this is the only entry that's syntactically correct.
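
To see why the other two entries are rejected, it may help to try a single rule in isolation. The sketch below copies only the atext and dot_atom_text rules from the pattern above (ALPHA and DIGIT are inlined as [A-Za-z0-9] for brevity) and applies them to the pieces on either side of the @. The helper name is_dot_atom_text is made up for this illustration; it is not part of the original pattern.

```perl
use strict;
use warnings;
use 5.010;

# Only two rules from the full grammar; ALPHA and DIGIT are inlined.
my $dot_atom_text = qr{
    (?(DEFINE)
      (?<atext>          [A-Za-z0-9] | [-!#\$%&'*+/=?^_`{|}~])
      (?<dot_atom_text>  (?&atext)+ (?: \. (?&atext)+)*)
    )
    \A (?&dot_atom_text) \z
}x;

# Hypothetical helper, for illustration only.
sub is_dot_atom_text {
    my ($str) = @_;
    return $str =~ $dot_atom_text ? 1 : 0;
}

for my $part ('*johnsmith', 'yahoo.com', '.com', 'gmail..com') {
    say "$part: ", (is_dot_atom_text($part) ? 'ok' : 'rejected');
}
```

Every dot has to have at least one atext character on both sides, so .com and gmail..com are rejected, while * is an ordinary atext character and *johnsmith passes.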

Re^2: RegExp help by heatblazer (Scribe) on Mar 27, 2012 at 03:42 UTC
Thanks for the awesome example, however it's too much for me to understand yet.
Re^3: RegExp help by JavaFan (Canon) on Mar 27, 2012 at 09:06 UTC
There's little I can offer to make it more understandable: the syntax of email addresses is complex. Just be glad that we're living in a post-5.10 world: now we can use named rules and recursion, which allow us to translate BNF grammars to regular expressions almost mechanically. In one (both?) of the editions of "Mastering Regular Expressions", Jeffrey Friedl gives a pre-5.10 regular expression to match email addresses. That one is far, far more complex (and doesn't allow comments nested beyond a certain depth (2, IIRC)). You may want to look at RFC 822, or one of its descendants, for the grammar of email addresses. My understanding is that the regexp I gave was constructed from the grammar given in one of those RFCs. (I don't recall which one, and the file `t/re/reg_email.t` doesn't say where it comes from.)
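As a minimal sketch of that mechanical translation, take a toy grammar with a single recursive rule for nested comments. The grammar is a simplified stand-in made up for this illustration, not a quote from any RFC, and the helper name is_comment is likewise hypothetical. Each BNF rule becomes one named group inside (?(DEFINE)...), and each reference to a rule becomes a (?&name) call.

```perl
use strict;
use warnings;
use 5.010;

# Toy grammar (an assumption for illustration, not taken from an RFC):
#   comment = "(" *( ctext / comment ) ")"
#   ctext   = any character other than a parenthesis
my $comment = qr{
    (?(DEFINE)
      (?<comment>  \( (?: (?&ctext) | (?&comment) )* \) )
      (?<ctext>    [^()] )
    )
    \A (?&comment) \z
}x;

# Hypothetical helper, for illustration only.
sub is_comment {
    my ($str) = @_;
    return $str =~ $comment ? 1 : 0;
}

for my $str ('(a)', '(a (b (c)))', '(a (b)', 'a)') {
    say "$str: ", (is_comment($str) ? 'match' : 'no match');
}
```

Because the comment rule calls itself, the nesting depth is not fixed in the pattern: the first two strings match and the two unbalanced ones do not.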
Re^4: RegExp help by heatblazer (Scribe) on Mar 27, 2012 at 14:17 UTC
Well, I am going through some tutorials now and trying to parse your previous example (go easy on me tho :). I've started with something simple like this, using a divide and conquer technique:

my $match;
if ( $match = split(/\s/, 'somemail@yahu.com') == 1 ) {
    # do we have a whole string with no spaces in it?
    # if so, check for '@'
    if ( $match = split("@", 'somemail@yahu.com') == 2 ) {
        # if it splits into 2 pieces then there must be 1 @ sign
        # then do some other nested checks
        # with regex for validating mail, e.g. are there any
        # dots or hash signs at the end of the domain etc.
    }
}
else {
    print "Mail verification form failed!\n";
    # call back the mail form
}

Now I'll keep going with regex to master it, because I really want to know what is going on there and how.
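One thing to watch with the split-based count: split drops trailing empty fields, so the number of pieces is not always one more than the number of separators. The sketch below compares the piece count with a direct character count using tr. The helper name count_at and the extra test addresses are made up for this illustration.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: count the "@" characters in a string.
# tr with an empty replacement list only counts; it does not modify.
sub count_at {
    my ($addr) = @_;
    return ($addr =~ tr/@//);
}

for my $addr ('somemail@yahu.com', 'a@b@', 'a@@b') {
    my @pieces = split /\@/, $addr;    # trailing empty fields are dropped
    say "$addr: ", scalar(@pieces), " pieces, ", count_at($addr), " \@ signs";
}
```

Here 'a@b@' still splits into two pieces although it contains two @ signs, so a test for exactly two pieces would accept it, while the tr count would not.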
Re^5: RegExp help by JavaFan (Canon) on Mar 27, 2012 at 14:58 UTC
Re^6: RegExp help by heatblazer (Scribe) on Mar 27, 2012 at 15:17 UTC