slydog has asked for the wisdom of the Perl Monks concerning the following question:

Oh I wish I was a master of regex, but I guess I lack the mental capacity to formulate complex matches! So, I manage an email content filtering system (ActiveState PureMessage) which provides filtering for several banks. Right now we have the usual filters in place (bad words, job search, etc.), but now the banks want to be able to filter on account numbers. This would be easy if they followed a standard numbering convention, but no, the most they can give me is that an account number is x digits long. Fine, I say: since the filtering is based on regular expressions, I can write one for that. The problem is that when a user sends mail from a client like Outlook, the filter produces many false positives, because the client adds formatting tags to the body of the message (why the formatting tags use number strings is beyond me).

So I have set out to write a regex that can take care of this problem, but after a hundred or so different variations I just cannot get it right. So I was wondering if any of the Perl gurus might be able to help me. This is what I need the regex to do, in a nutshell:

Searching for a 7 digit string:

<SPAN class=3D319263020-11082003>Test Hello</SPAN>   # NO MATCH
<SPAN class=3D319263020-11082003>1234567</SPAN>      # MATCH

Re: Email content filter and RegEx!
by liz (Monsignor) on Aug 12, 2003 at 14:24 UTC
    If "x" is a fixed number, why don't you:
    m#\b\d{x}\b#s;
    ?

    Liz
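
A minimal sketch of how the word-boundary pattern above behaves on the two sample lines from the question. The helper name `account_re` and the loop over `x = 7` and `x = 8` are illustrative assumptions, not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical helper: build the word-boundary pattern for an x-digit number.
sub account_re {
    my ($x) = @_;
    return qr/\b\d{$x}\b/;
}

# The two sample lines from the original question.
my @samples = (
    '<SPAN class=3D319263020-11082003>Test Hello</SPAN>',
    '<SPAN class=3D319263020-11082003>1234567</SPAN>',
);

for my $x (7, 8) {
    my $re = account_re($x);
    for my $line (@samples) {
        my $verdict = $line =~ $re ? 'MATCH' : 'NO MATCH';
        print "x=$x  $verdict  $line\n";
    }
}
```

With `x = 7` only the second sample matches. With `x = 8` both lines match, because `11082003` inside the class attribute sits between `-` and `>`, so the word boundaries alone do not keep the pattern out of the tag.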

Re: Email content filter and RegEx!
by monktim (Friar) on Aug 12, 2003 at 13:57 UTC
    Does a tag always follow the number? This might work.
    use strict;
    use warnings;

    while (<DATA>) {
        print "$_";
        if ($_ =~ /\d{7}</) {
            print "\tMATCH\n";
        }
        else {
            print "\tNO MATCH\n";
        }
    }
    __DATA__
    <SPAN class=3D319263020-11082003>Test Hello</SPAN>
    <SPAN class=3D319263020-11082003>1234567</SPAN>
Re: Email content filter and RegEx!
by Mr. Muskrat (Canon) on Aug 12, 2003 at 14:01 UTC
    Show us some of your variations and you will get a better response. What have you tried that didn't work?
Re: Email content filter and RegEx!
by simonm (Vicar) on Aug 12, 2003 at 16:18 UTC
    I'm not familiar with PureMessage, but if you can do more than just regular expressions, I think there's an easy enough solution available: if a message is in HTML format, use HTML::FormatText or an equivalent module to make a plain-text-only copy, then run your regex against that.
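
A rough sketch of the "make a plain-text copy first" idea, assuming the filter can run arbitrary Perl. To stay within core modules it swaps in a naive regex tag stripper where the reply suggests HTML::FormatText, so treat it as a stand-in rather than a real HTML parser. The helper names `plain_text` and `has_account_number` are made up for this example.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: crude plain-text view of a quoted-printable HTML part.
sub plain_text {
    my ($raw) = @_;
    my $html = decode_qp($raw);   # =3D becomes "=", =20 becomes a space
    $html =~ s/<[^>]*>/ /g;       # naive tag removal, NOT a real HTML parser
    return $html;
}

# Hypothetical helper: look for a standalone run of exactly $digits digits
# in the text that is left once the tags are gone.
sub has_account_number {
    my ($raw, $digits) = @_;
    return plain_text($raw) =~ /\b\d{$digits}\b/ ? 1 : 0;
}

for my $line (
    '<SPAN class=3D319263020-11082003>Test Hello</SPAN>',
    '<SPAN class=3D319263020-11082003>1234567</SPAN>',
) {
    print has_account_number($line, 7) ? "MATCH     " : "NO MATCH  ", $line, "\n";
}
```

Because the digits inside the tags are removed before matching, the digit count no longer matters for the tag false positives. Attribute values that contain `>` or malformed markup would defeat the naive stripper, which is where a proper HTML module earns its keep.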
Re: Email content filter and RegEx!
by slydog (Novice) on Aug 12, 2003 at 17:54 UTC
    I think I might have found a solution:

    /(\d{7})(?![^<]*>)/

    I would like to know if anyone has any thoughts on this.
    Thanks for everyone's help.
      OK, this may not be the best solution for what I am looking for, since it can still produce a false positive or even offer a way to hide a number. It should work for now, though, as long as it decreases the number of false positives for the client. Guess I will have to keep working on it.
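
A small sketch of the lookahead pattern from this post, run against the two sample lines from the question plus two made-up inputs that illustrate the caveats mentioned above. The extra strings and the `check` helper are assumptions added for illustration only.

```perl
use strict;
use warnings;

# Seven digits that are NOT followed by a ">" before the next "<",
# i.e. digits that do not appear to sit inside a tag.
my $re = qr/(\d{7})(?![^<]*>)/;

# Hypothetical helper: return the captured digits, or undef if nothing matched.
sub check {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

my @cases = (
    '<SPAN class=3D319263020-11082003>Test Hello</SPAN>',  # sample from the question
    '<SPAN class=3D319263020-11082003>1234567</SPAN>',     # sample from the question
    'ticket 123456789 closed',    # made-up: a longer digit run still matches
    'balance on 1234567 > 500',   # made-up: a later ">" hides the number
);

for my $text (@cases) {
    my $hit = check($text);
    print defined $hit ? "MATCH ($hit)" : "NO MATCH", "  $text\n";
}
```

The two question samples behave as wanted. The third case matches on the first seven digits of a nine-digit run because the pattern has no boundaries around `\d{7}`, and the fourth is skipped because any `>` later in the text, with no `<` in between, looks like the end of a tag to the lookahead.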
Re: Email content filter and RegEx!
by slydog (Novice) on Aug 12, 2003 at 16:13 UTC
    I have tried many different regular expressions and right now it looks like this:

    /\b(?:[\d]{8})\b/i

    Below is part of an email message that was caught by the regex above, even though the message that was sent does not contain any numbers. I just grabbed the part of the message with the formatting tags that tripped the filter. The PureMessage filtering system looks at the raw message and filters each part (HEAD, BODY, etc.) separately. So when Outlook adds HTML formatting tags to the email, some of the tags contain number strings that trip the regex. I am looking for a regex that can skip anything inside an HTML tag and just match what I need.
    <META content=3D"MSHTML 6.00.2800.1170" name=3DGENERATOR></HEAD>
    <BODY>
    <DIV><SPAN class=3D049350615-12082003><FONT face=3DArial color=3D#0000ff =
    size=3D2>Please=20
    remove me from your list!</FONT></SPAN></DIV>
    <BLOCKQUOTE dir=3Dltr style=3D"MARGIN-RIGHT: 0px">
    <DIV class=3DOutlookMessageHeader dir=3Dltr align=3Dleft><FONT =
    face=3DTahoma=20
    size=3D2>-----Original Message-----<BR>
      Does this help?
      /[<.*?>]*\d{7}[<.*?>]*/