http://qs1969.pair.com?node_id=326134

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm new to Perl and have to update a former employee's code, adding support for more things matching a 'regular expression'.
The expression is
/.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/
and try as I might, all I can figure out is that this is a totally random string of letters, symbols and slashes.

So I was wondering how support could be added while keeping the current functionality. I'm not entirely sure what its current purpose is, other than to check a login prompt (which I gathered from the code and from my assignment to make it support a wider range of prompts). It should also support a string like hostname# or #, hostname% or %, and hostname$ or $, as well as user@hostname in front of any of those characters, possibly with those varied outputs surrounded by [ ]. I know I don't know how to do this type of thing, but I'm not sure how hard it might be for people who have been doing this a long time. After searching through the wealth of information out there, I think I will need to work with Perl for a bit longer before I can make any sense of this sub-language it has within it.

Thanks a bunch
Jack

Replies are listed 'Best First'.
Re: This looks like someone sneezed and hit the keyboard
by Roger (Parson) on Feb 03, 2004 at 07:05 UTC
    use YAPE::Regex::Explain;
    $regex = qr/.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/;
    print YAPE::Regex::Explain->new($regex)->explain;

    Let me show you my (colourful) command prompt... :-) which looks like: ($hostname)$fullpath>
    export PS1="\[\033[1;37m\](\[\033[1;32m\]`uname -n`\[\033[1;37m\])\[\033[1;36m\]\$PWD\[\033[1;37m\]>\[\033[0m\] "

    # ANSI colour commands
    # \[\033[1;37m\] => set colour to white (37)
    # \[\033[1;32m\] => set colour to green (32)
    # \[\033[1;36m\] => set colour to cyan (36)
    # ...
    # \[\033[0m\]    => set colour back to normal

      In one of my first projects using Perl (I had an internship at the time), someone wanted me to parse the output of top and another Unix tool. Both used ANSI codes, so I backed away. It looks like this guy didn't back away, so I'll give him a gold star for bravery. However, he loses his gold star for not commenting his code and for using an ugly regex without the /x modifier.

      It's really amazing how many people (and in good open-source programs, too) forget to add line comments here and there when they could greatly help. I am not asking for flower-box style comments, just an occasional "now we parse the ANSI terminal prompt" kind of comment.

      Long story short, people who never comment their code and module implementations need to be shot :)

        I have real trouble remembering to use whitespace, comments, and /x in my regexes. I just don't have the habit (yet!) while writing code. Almost all the /x's that end up in my code are added after the fact. I'm almost ready to decide just to put an /x on all regexes (to help develop the habit), but I know that will get me strange looks.
      wow. I'd spend all my votes to upvote this for a week if I could.
      I've always thought of regexes as a sort of black art (and still do to a certain degree), and I've always wanted something that would just explain in plain English what the heck a regex means when you read it. This could be my ticket (and possibly MANY others' as well) to finally get a grip on regexes.


      Very funny Scotty... Now PLEASE beam down my PANTS!
        Have you considered reading the O'Reilly Press book "Mastering Regular Expressions" by Jeffrey E. F. Friedl? The first few chapters explain in detail how to read regexes, step by step.
      Hi, There is an article on perl.com that might be useful: http://www.perl.com/pub/a/2004/01/16/regexps.html
      Regexes are programs. It takes time to learn a new language, and that's why it might look difficult at the beginning. With time and practice that kind of regex becomes (almost) clear.
      Using /x and commenting is very important, but having the right support from the tools you use is also important. Here is a little HTML document that shows your regex colored. I couldn't get it to show directly in this answer, so you'll have to copy and paste :-(:
      <HTML> <HEAD> <TITLE>Smed generated dump</TITLE> </head> <body bgcolor="#FFFFFF">
      <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT>
      <FONT color=#f00000 style="BACKGROUND-COLOR: #ffffff"> / </FONT>
      <FONT color=#ffff00 style="BACKGROUND-COLOR: #ff0000"> .* </FONT>
      <FONT color=#ffffff style="BACKGROUND-COLOR: #ff0000"> ( </FONT>
      <FONT color=#ffff00 style="BACKGROUND-COLOR: #643296"> [\$#\%&gt;~] </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #00ff00"> | </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \@ </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #afeeee"> \w </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> ~ </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \$ </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #00ff00"> | </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> e </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> 0 </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> m </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \] </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> </FONT>
      <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> 0 </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> m </FONT>
      <FONT color=#ffffff style="BACKGROUND-COLOR: #ff0000"> ) </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #afeeee"> \s </FONT>
      <FONT color=#f00000 style="BACKGROUND-COLOR: #f0f0ff"> ? </FONT>
      <FONT color=#f00000 style="BACKGROUND-COLOR: #ffffff"> / </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT>
      <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT>
      </body> </HTML>
      If your text editor supported this, you would have fewer problems getting into regexes. There are a few tools for working on regexes, and they do the coloring as well.

      Cheers, Nadim (NKH).
      my (colourful) command prompt... :-)

      Or, simpler, with \h instead of `uname -n`:

      export PS1="\[\033[1;37m\](\[\033[1;32m\]\u@\h\[\033[1;37m\])\[\033[1;36m\]\$PWD\[\033[1;37m\]>\[\033[0m\] "

      which shows (username@host)path>

Re: This looks like someone sneezed and hit the keyboard
by davido (Cardinal) on Feb 03, 2004 at 07:06 UTC
    I'm going to re-enter your RE as though it has the /x modifier so that it's easier to comment on what it's doing... here goes:

    /
        .*                          # Match any quantity of any character (or
                                    # none at all)
        (                           # group together, and capture.
            [\$#\%>~]               # match any one of the following: $#%>~
        |                           # OR
            \@\w~\$                 # match literal @, a word character, ~, and $
        |                           # OR
            \\\[\\e\[0m\\\][ ]\[0m  # match "\[\e[0m\] [0m" (the space is
                                    # written as [ ] so it survives the x flag)
        )                           # end capturing and grouping.
        \s?                         # match a single optional whitespace
    /x                              # End regexp.

    So you put that all together and you get a regexp that will match a pretty weird looking string.

    The following strings should match (and MANY others too):

    "Hi, I'm Dave\[\e[0m\] [0m"

    "121#@$14asdf$"

    "@h~$"

    Looks pretty peculiar to me.


    Dave

      Okay, I guess this is a newbie question ... It looks to me like the /.* opening to the regex is greedy and would grab everything that was applied to it, leaving nothing for the rest of the regex to match. In other words, anything/everything would give a match.

      Other, wiser monks have not mentioned this, so I'm assuming I've missed something. Why isn't my assumption true?

      Update: Thanks to bunnyman, ysth and MCS for their gentle instruction.

      -Theo-
      (so many nodes and so little time ... )

        No, everything in the regex must match, not just the first part of it, and the part in the middle with the (one|two|three) must match too.

        The thing that you must remember is that regexes can backtrack -- if they get to the end of the string without having matched yet, they can go back a few letters and try again.

        So the .* part will first try to match the entire string, because it is greedy. Then the middle part (one|two|three) must match, but there is nothing left in the string, and we must backtrack and try again. First we try going one letter back, then two, and eventually we either find the match or we backtrack all the way to the start and then there is no match.
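        A minimal sketch of that backtracking, using a simplified version of the pattern and a made-up prompt string (both are assumptions for illustration, not the original code):

```perl
use strict;
use warnings;

# Hypothetical prompt line with a trailing space; not taken from the real system.
my $line = 'user@host:~$ ';

# Simplified pattern: a greedy (.*) followed by one required prompt character.
my ($prefix, $captured);
if ($line =~ /(.*)([\$#%>~])\s?/) {
    ($prefix, $captured) = ($1, $2);
}

print "prefix:   '$prefix'\n";
print "captured: '$captured'\n";
```

        Here .* first swallows the whole line, then gives characters back one at a time until the character class can match the final $, so the prefix ends up as everything before that $.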

        The reason most people say you shouldn't use .* is that it can match nothing (or everything), so matching just .* is pointless: it will match everything (including nothing). However, if you were looking for "hi", some amount of text, and then "there", you could use:

        $line =~ /hi.*there/;

        and it would match. Of course it's greedy and might not be exactly what you wanted but there are times when it is needed. However, it is overused a lot and usually something better can be used.
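        To see what "greedy" means in practice, here is a small sketch comparing .* with the non-greedy .*? on a made-up line (the sample text and the capture groups are my additions):

```perl
use strict;
use warnings;

my $line = 'hi bob there, and over there';   # made-up sample text

my ($greedy) = $line =~ /hi(.*)there/;    # .* runs up to the LAST "there"
my ($lazy)   = $line =~ /hi(.*?)there/;   # .*? stops at the FIRST "there"

print "greedy: '$greedy'\n";
print "lazy:   '$lazy'\n";
```

        Both patterns match the line; they only differ in how much text ends up between "hi" and "there".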

        To answer your question though, /.* doesn't grab everything because it has required stuff after it. If you try to match /.*some text/, it has to find "some text" or it will fail. However, if you try to match something like /.*\d?/, it could match nothing, since the \d is optional.

        Because for the match to succeed, one of the three (optA|optB|optC) options has to match. With the .* at the front, it will basically start at the end of the string and work backwards until it finds one of the alternatives.

        The \s? at the end is useless though (unless $& is used).
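        A quick sketch of that last point, assuming a simplified pattern and a made-up prompt: the capture in $1 is the same with or without the trailing \s?, and only the overall match in $& changes.

```perl
use strict;
use warnings;

my $re_with    = qr/.*([\$#%>~])\s?/;   # simplified pattern with the trailing \s?
my $re_without = qr/.*([\$#%>~])/;      # the same pattern without it

my $line = 'host# ';   # hypothetical prompt with a trailing space

my ($cap_with, $whole_with, $cap_without, $whole_without);
if ($line =~ $re_with)    { ($cap_with,    $whole_with)    = ($1, $&) }
if ($line =~ $re_without) { ($cap_without, $whole_without) = ($1, $&) }

print "with \\s?:    \$1='$cap_with' \$&='$whole_with'\n";
print "without \\s?: \$1='$cap_without' \$&='$whole_without'\n";
```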

Re: This looks like someone sneezed and hit the keyboard
by dws (Chancellor) on Feb 03, 2004 at 07:23 UTC
    One way to make sense of your former employee's code is to use the /x modifier on the regex, which lets you throw in whitespace for formatting without screwing up semantics (except that you'll need to encode whitespace as \s -- thanks, grinder). With this, the regex will look something like
    m/ .* ( [\$#\%>~] | \@\w~\$ | \\ \[ \\ e \[ 0m \\ \] \s \[ 0m ) \s? /x
    which basically says
    • skip past as many characters as possible, then
    • match one of three things
      • a single character that is one of $ # % > ~
      • a four-character string beginning with @, followed by a single "word" character, followed by ~ and $
      • the string "\[\e[0m\] [0m"
    • allow for an optional trailing space

    With the "one of these three things" going into the Perl variable $1. In short, this regular expression isn't matching what you think it's supposed to be matching. The third alternative looks like it's intended to capture an escape sequence of some sort.

    So yeah, it looks pretty messy.
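    To check the breakdown above, here is a small sketch that runs the original regex against a few made-up lines (the sample strings are assumptions, not output from the real system) and records what lands in $1:

```perl
use strict;
use warnings;

# The regex from the original post, unchanged.
my $re = qr/.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/;

# Made-up sample lines; single quotes keep the backslashes literal.
my @samples = ('router#', 'user@host:~$ ', '@h~$', 'Hi\[\e[0m\] [0m', 'login: ');

my %captured;
for my $line (@samples) {
    $captured{$line} = $line =~ $re ? $1 : undef;
    printf "%-20s => %s\n", "'$line'",
        defined $captured{$line} ? "'$captured{$line}'" : 'no match';
}
```

    One thing this seems to show: for '@h~$' the capture is just '$', not the whole four characters. Because .* is greedy, the first alternative gets to match the final $ before the second alternative is ever reached, so as far as I can tell the second alternative never ends up in $1.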

      use the /x modifier on the regex, which lets you throw in whitespace for formatting without screwing up semantics

      ... with one significant caveat: spaces lose their semantics.

      The RE /foo bar/ is not the same as /foo bar/x. The latter is equivalent to /foobar/. And there is a space in the OP's RE, which will therefore be incorrect if /x is blindly applied.

      The choices are either to escape the space with a backslash (which is difficult to read, especially if the backslash-space winds up at the end of a line) or replace it with \s, which is not semantically equivalent (it can match tabs or newlines as well). There is always [\ ] but I'm not sure it's a win.
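      A short sketch of those choices side by side, using the /foo bar/ example from above (the variable names are mine):

```perl
use strict;
use warnings;

my $text = 'foo bar';

my $plain   = $text =~ /foo bar/    ? 1 : 0;   # no /x: the space is literal
my $x_naive = $text =~ /foo bar/x   ? 1 : 0;   # /x drops the space: pattern is "foobar"
my $x_esc   = $text =~ /foo\ bar/x  ? 1 : 0;   # backslash-space keeps it literal
my $x_class = $text =~ /foo[ ]bar/x ? 1 : 0;   # a character class keeps it literal

# \s is looser than a literal space: it also accepts a tab.
my $tab_s     = "foo\tbar" =~ /foo\sbar/x  ? 1 : 0;
my $tab_class = "foo\tbar" =~ /foo[ ]bar/x ? 1 : 0;

print "plain=$plain x_naive=$x_naive x_esc=$x_esc x_class=$x_class\n";
print "tab with \\s=$tab_s, tab with [ ]=$tab_class\n";
```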

      I would bet the original author probably meant to group the or's together, and in that case it could be made a little faster by changing the first ( to (?: (which makes it not save the match into $1). Unless of course there is a $1 soon after the regex... in that case it was probably meant to capture it.
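      As a rough sketch of where this could go for the prompts listed in the question (hostname#, %, user@hostname$, optionally wrapped in [ ]), here is one possible pattern. It is only a guess at the intended shapes, not the original author's code: $prompt_re and prompt_char are hypothetical names, and it deliberately ignores the ANSI alternative. It uses (?: ) for grouping and keeps a single capture for the prompt character.

```perl
use strict;
use warnings;

# Hypothetical prompt matcher; the accepted shapes are an assumption
# based on the examples in the question.
my $prompt_re = qr/
    ^
    \[?                 # optional opening bracket
    (?: \w+ \@ )?       # optional user name and at-sign, grouped but not captured
    [\w.-]*             # optional host name
    \]?                 # optional closing bracket
    ( [\$#%>] )         # the prompt character itself, captured
    \s*                 # optional trailing whitespace
    $
/x;

# Returns the prompt character, or undef if the line is not a prompt.
sub prompt_char {
    my ($line) = @_;
    return $line =~ $prompt_re ? $1 : undef;
}

for my $line ('hostname#', '%', 'user@hostname$ ', '[user@hostname]# ', 'Password: ') {
    my $char = prompt_char($line);
    printf "%-20s => %s\n", "'$line'", defined $char ? "'$char'" : 'not a prompt';
}
```

      Anchoring with ^ and $ is a design choice here: it avoids treating any line that merely contains a # or $ somewhere as a prompt, at the cost of rejecting prompts with extra text such as a path.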

Re: This looks like someone sneezed and hit the keyboard
by Sol-Invictus (Scribe) on Feb 03, 2004 at 10:54 UTC
    this is expecting to parse a command line prompt which uses ANSI escapes:
    ANSI Color Codes in brief:

      0   to restore default color
      1   for brighter colors
      4   for underlined text
      5   for flashing text
      30  for black foreground
      31  for red foreground
      32  for green foreground
      33  for yellow (or brown) foreground
      34  for blue foreground
      35  for purple foreground
      36  for cyan foreground
      37  for white (or gray) foreground
      40  for black background
      41  for red background
      42  for green background
      43  for yellow (or brown) background
      44  for blue background
      45  for purple background
      46  for cyan background
      47  for white (or gray) background

    you use the above codes together with an escape sequence like this (replace the '#' with the colour code of your choice):

      \e[#m

    Once you've used an escape all subsequent text will be affected until you use the reset escape

      \e[0m

    so if I want to format a part of a line of text, instead of:

      print "This boring old line of text was supposed to have red text\n";

    do this:

      print "This new improved, brighter, more interesting line of text has \e[31mred text\e[0m\n";

    If you want to use two escapes on the same piece of text use one of these ';' :

      \e[#;#m

      print "This \e[5m new improved \e[0m, \e[5m brighter \e[0m, more \e[5minteresting \e[0m line of text has \e[31;5mflashing red text\e[0m\n";

    \e[0m (the reset string) requires escaping in the regex, or at least the '\' and '[' do. In pseudo code the regex would read like this :

    match anything (.*) grab the rest up to \e[0m

    try running it on a couple of the ANSI-formatted strings I gave as examples and you'll see how it's working
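    One caveat worth checking when you try this: in a regex, \e matches the real ESC character, while \\e (as written in the original regex) matches a backslash followed by the letter e. A small sketch with made-up strings (the names here are assumptions for illustration):

```perl
use strict;
use warnings;

my $real_escape  = "prompt>\e[0m ";   # double quotes: \e becomes the ESC character
my $literal_text = 'prompt>\e[0m ';   # single quotes: a backslash followed by "e"

my $esc_re = qr/\e\[0m/;    # ESC, then the literal characters [0m
my $lit_re = qr/\\e\[0m/;   # a literal backslash, then e[0m

my %result = (
    esc_vs_real    => ($real_escape  =~ $esc_re ? 1 : 0),
    lit_vs_real    => ($real_escape  =~ $lit_re ? 1 : 0),
    esc_vs_literal => ($literal_text =~ $esc_re ? 1 : 0),
    lit_vs_literal => ($literal_text =~ $lit_re ? 1 : 0),
);
print "$_ => $result{$_}\n" for sort keys %result;
```

    So a pattern written with \\e would only match the text of a PS1 definition, not the bytes a terminal actually sends, which may or may not be what the original author wanted.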

Re: This looks like someone sneezed and hit the keyboard
by MCS (Monk) on Feb 03, 2004 at 16:51 UTC

    While it can be very intimidating because you don't know what it means, it's really not that hard to read. I would suggest the book "Mastering Regular Expressions" by Jeffrey Friedl. It helped me make sense of what was once gibberish. Others have already explained what it does so I won't go over that again, but I don't think regular expressions should be feared. A read through "Mastering Regular Expressions" should make a master out of anyone.

    I also want to say that I disagree with the notion that you should always use /x to make your code clearer. If you do, you are relying on what the comments say, not on what the regex says. To me, when you spread it out like that, it makes it easier to comment but harder to actually read and find errors. I think it's all the whitespace around the regex that makes it harder for me to understand. I agree most code is under-commented, but /x tends to lead to overcommenting by people who don't really understand regexes.

    ps. I have no affiliation with the author or the publishers other than I bought the book and loved it