GhodMode has asked for the wisdom of the Perl Monks concerning the following question:

     I'm going through a list of files in a directory and I want to take action on some of them based on a regular expression.
How can I find all of the files that...?

  1. begin with "PH"
  2. followed by any number of any kind of character
  3. followed by any characters that are not "H-000" or "I-000"
  4. followed by the end of the file name
Here's my regular expression:
/^PH.*?[^(H\-000)(IF\-000)]$/

Notes: Here's a sample script I was using to try to test it...
    #!/usr/local/bin/perl -w
    # patterntest.pl
    while (<>) {
        chomp;
        if ( /^PH.*?[(H\-000)(IF\-000)]$/ ) {
            print "{$`}[$&]{$'}\n";
        }
    }
     So, I "ls -1 | patterntest.pl" to test.

Invulnerable. Unlimited XP. Unlimited Votes. I must be...
        GhodMode

Replies are listed 'Best First'.
Re: old Perl regex problem
by BrowserUk (Patriarch) on Aug 01, 2002 at 21:02 UTC

    limited testing...

    while(<>) { chomp and print if /^PH.*$/ and !/[HI]-000$/; }
Re: old Perl regex problem
by sauoq (Abbot) on Aug 01, 2002 at 21:34 UTC

    First, I don't think the regular expression you have works at all. BrowserUk's version works nicely.

    I don't remember if look-behind was supported in 5.004. If so, you might use something like: /^PH.*(?<![HI]-000)$/

    One question: is a file named "PH-000" ok, or not? From your description, I'd think it was but the solutions so far (including the one above) don't allow it. This would: /^PH(.*)/ and $1 !~ /[HI]-000$/;
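
    To make the difference concrete, here is a small sketch that runs both approaches over a few names. The names are made up for illustration (only PH-000 comes from the question above), and the hash names are mine.

```perl
use strict;
use warnings;

# Illustrative names only; "PH-000" is the edge case discussed above.
my @names = ('PH-000', 'PHabcH-000', 'PHabcI-000', 'PHabcE-000', 'XXabcE-000');

my %lookbehind;   # result of the single look-behind pattern
my %two_step;     # result of capture-then-test

for my $name (@names) {
    $lookbehind{$name} = $name =~ /^PH.*(?<![HI]-000)$/ ? 1 : 0;
    $two_step{$name}   = ($name =~ /^PH(.*)/ and $1 !~ /[HI]-000$/) ? 1 : 0;
    printf "%-12s lookbehind=%d two_step=%d\n",
        $name, $lookbehind{$name}, $two_step{$name};
}
```

    The only name on which the two should disagree is PH-000: the look-behind sees the H of "PH" as part of "H-000", while the two-step version only tests what follows "PH".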

    -sauoq
    "My two cents aren't worth a dime.";
    

           It works on 5.6.1. Someone tested it for me.
           Using the info that I put in the original post, PH-000 should not be ok. I didn't really think about that too much because I know the format of the actual files that I'm going through, but my original regex would probably be better represented with ".+" after the "PH" instead of ".*?".
      Here's an actual file name: PH0022080209401500001PE-000
           BrowserUk's version would work great, but I'm reading the pattern from a configuration file and I can't change anything outside of the slashes.
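
           A minimal sketch of that constraint, assuming the configuration value is simply the text between the slashes. The variable names and the way the value is obtained are hypothetical; the pattern is the look-behind one suggested above.

```perl
use strict;
use warnings;

# Hypothetical: the text between the slashes, as it might arrive from the config file.
my $pattern_text = '^PH.*(?<![HI]-000)$';

# The surrounding code is fixed; only $pattern_text can change.
my $re = qr/$pattern_text/;

my @names    = ('PH0022080209401500001PE-000', 'PH0024080209401400001PH-000');
my @selected = grep { $_ =~ $re } @names;

print "$_\n" for @selected;
```

           With a setup like this, everything has to be expressed in one pattern, which is why a two-regex answer can't be dropped in.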

      Invulnerable. Unlimited XP. Unlimited Votes. I must be...
              GhodMode
        I doubt it worked under 5.6.1. Perhaps it was tested, but then the test was insufficient.
        /^PH.*?[^(H\-000)(IF\-000)]$/
        will reject any file name ending in a 0, including
        PH0022080209401500001PE-000
        yet that is a legal file name according to your specifications.
        $ /opt/perl/5.6.1/bin/perl -wle 'print "Reject" unless "PH0022080209401500001PE-000" =~ /^PH.*?[^(H\-000)(IF\-000)]$/'
        Reject
        $
        Abigail
Re: old Perl regex problem
by crenz (Priest) on Aug 01, 2002 at 21:07 UTC

    How about dividing the problem into two regexes?

    while (<>) { chomp; next if (/(H|IF)-000$/ or !/^PH/); print "$_\n"; }

    Second thought: I just noticed that your description and your regexes differ, so I'm not sure whether you want to match IF-000 or I-000. If you want the latter, use

    next if (/[HI]-000$/ or !/^PH/);

    But I guess that's obvious.

Re: old Perl regex problem
by Abigail-II (Bishop) on Aug 02, 2002 at 11:43 UTC
    I wonder why none of the people who responded to this post actually read the post carefully. They all come up with solutions that use two regexes. What's so hard to understand about:
    I'm actually reading the pattern from a configuration file. Everything outside of the slashes is not changeable.

    You have a misunderstanding about the meaning of [ ] inside a regular expression. [ ] is a character class, and matches exactly one character. Inside you either list the characters that are allowed to match, or the characters that aren't allowed to match. [^(H\-000)(IF\-000)] means the same as [^(H\-0)IF] and means "match a single character; the character could be anything except a (, an H, a dash, a 0, a ), an I or an F".
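
    One way to convince yourself of this is to compare the two classes character by character. This is only a sketch: it checks printable ASCII, and the variable names are mine.

```perl
use strict;
use warnings;

# Each pattern matches a string consisting of exactly one character from the class.
my $original = qr/^[^(H\-000)(IF\-000)]$/;
my $reduced  = qr/^[^(H\-0)IF]$/;

my @printable = map { chr($_) } 32 .. 126;

# Characters on which the two classes disagree (there should be none).
my @differ   = grep { ($_ =~ $original) xor ($_ =~ $reduced) } @printable;

# Characters the original class refuses to match.
my @excluded = grep { $_ !~ $original } @printable;

print "differing characters: ", scalar(@differ), "\n";
print "excluded by the class: @excluded\n";
```

    The parentheses, the repeated zeros and the apparent grouping make no difference; the class is just a set of single characters.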

    If I understand your requirements, you are looking for all files that start with PH, and do not end with either "H-000" or "I-000". The following regex ought to work:

    /^PH(?:.{0,4}|.*(?![HI]-000).{5})$/
    It works with 5.004_02.
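
    A quick check along these lines can be sketched as follows. The names are made up for illustration, and the hash name is mine.

```perl
use strict;
use warnings;

my $re = qr/^PH(?:.{0,4}|.*(?![HI]-000).{5})$/;

# Illustrative names only.
my @names = ('PHabcE-000', 'PHabcH-000', 'PHabcI-000', 'PHabcIF-000', 'PH', 'XXabcE-000');

my %accepted;
for my $name (@names) {
    $accepted{$name} = $name =~ $re ? 1 : 0;
    printf "%-12s %s\n", $name, $accepted{$name} ? 'match' : 'no match';
}
```

    Note that a name ending in "IF-000" is accepted here, since its last five characters are "F-000", which is neither "H-000" nor "I-000".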

    Abigail

           Thank you. I understand better now. I had to look up the "?:" and "?!". I'm going to play with it a little.
           I've also thought of something like /^PH.+[^HI]F?\-000/
           Some examples of the file names are

      PH0024080209401400001PH-000
      PH0026072913114200001IF-000
      PH0029072911352700001AF-000
           I wonder why it worked on 5.6.1? I'll have to double-check that.

      Invulnerable. Unlimited XP. Unlimited Votes. I must be...
              GhodMode
        /^PH.+[^HI]F?\-000/ will accept a file called PHfooIF-000. After all, the .+ can match the 'fooI', the [^HI] can match the 'F', and the F? matches nothing.
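
        A quick way to see which part of the name each piece of the pattern consumes is to add capturing parentheses. The captures below are added purely for inspection and the variable names are mine; the pattern is otherwise unchanged.

```perl
use strict;
use warnings;

my $name = 'PHfooIF-000';

# Capturing groups added only to expose what each piece consumed.
my ($any, $not_hi, $opt_f) = $name =~ /^PH(.+)([^HI])(F?)\-000/;

if (defined $any) {
    printf "matched: .+ took '%s', [^HI] took '%s', F? took '%s'\n",
        $any, $not_hi, $opt_f;
}
else {
    print "no match\n";
}
```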

        Abigail

      Close, but that still matches PH-000 due to the .{0,4} clause. I think that
      /^PH(?:.{0,3}|.*(?![HI]-000).{5})$/
      does it.
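
      A small side-by-side sketch of the two variants, using made-up names (apart from PH-000) and my own variable names:

```perl
use strict;
use warnings;

my $with_04 = qr/^PH(?:.{0,4}|.*(?![HI]-000).{5})$/;   # original
my $with_03 = qr/^PH(?:.{0,3}|.*(?![HI]-000).{5})$/;   # shortened first alternative

my (%r04, %r03);
for my $name ('PH-000', 'PHabcd', 'PHabcE-000', 'PHabcH-000') {
    $r04{$name} = $name =~ $with_04 ? 1 : 0;
    $r03{$name} = $name =~ $with_03 ? 1 : 0;
    printf "%-12s {0,4}=%d {0,3}=%d\n", $name, $r04{$name}, $r03{$name};
}
```

      One side effect worth knowing: with {0,3}, any name that has exactly four characters after "PH" is rejected, not just PH-000. Whether that matters depends on the real file names.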
      -sauoq
      "My two cents aren't worth a dime.";
      
        but that still matches PH-000
        That's because I wrote the regex before he revealed that the things could overlap. However, later he suggests that the file names are all pretty long, so PH-000 can't happen anyway.

        Abigail

           I pasted your regex into my script exactly and it didn't work either :(. I found one that did, though.
           First, I want to make sure I understand yours. Please tell me if the following is correct...
      /^PH(?:.{0,4}|.*(?![HI]-000).{5})$/

      1. starts with PH
      2. (?:) these parentheses aren't memory parentheses
      3. followed by 0 to 4 of any character or zero or more of any character
      4. (?!) return true if "H-000" or "I-000" would not match next
      5. followed by 5 of any character
      6. followed by the end of the line
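
      For reference, the same pattern can be spread out with the /x flag so each piece carries its own comment. This is a sketch of one reading of the regex, not a different regex; the sample names are made up. The point it tries to make visible is that the | splits the whole group into two alternatives, and the final $ applies to both.

```perl
use strict;
use warnings;

# Same pattern as above, laid out with the x flag so comments are allowed.
my $re = qr/
    ^PH                 # starts with PH
    (?:                 # grouping only, nothing is captured
        .{0,4}          # first alternative: at most four more characters
      |                 # OR
        .*              # second alternative: any run of characters,
        (?![HI]-000)    #   a spot where neither H-000 nor I-000 comes next,
        .{5}            #   and then exactly five characters
    )
    $                   # end of the name, required by both alternatives
/x;

# Made-up names, for illustration only.
my %accepted;
for my $name ('PHabcE-000', 'PHabcH-000', 'PHabcI-000', 'PHabc') {
    $accepted{$name} = $name =~ $re ? 1 : 0;
    printf "%-12s %s\n", $name, $accepted{$name} ? 'match' : 'no match';
}
```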

           At first, I thought it failed because of the .{5} part. So, I changed it to {4,5} because the possibilities are "?H-000" (only 4 characters after the H) or "I?-000". That still didn't work. I was confused about why you used the (?:) and the quantifier and the or after the first wildcard dot, so I removed them. It still didn't work.

           After all that I re-started. I counted the characters between the PH and the part I wanted to check (\d{19}) and tried /^PH\d{19}[^I][^H]-000$/. It's a little easier because all of these files have a fixed format. That did work, but I wasn't sure I needed to quantify the characters between the PH and the end of the file name. So, I ended up with /^PH.*[^I][^H]-000$/ which works fine.

           I'm not sure why I couldn't get yours to work. I haven't used regex extensions or assertions before, so I want to understand them better.

      Many thanks for your input.

      Invulnerable. Unlimited XP. Unlimited Votes. I must be...
              GhodMode
        That will fail on a file name ending in "IQ-000", as you demand that the sixth character from the end isn't equal to an I.

        You seem to be changing your requirements over time. This makes it hard to be helpful. As things stand now, I suggest:

        /^PH.*(?:[^I][^H]|I[^HF])-000$/
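
        As a rough check, the sketch below runs this pattern and /^PH.*[^I][^H]-000$/ over the sample names given earlier in the thread, plus one made-up name ending in "IQ-000". The variable names are mine.

```perl
use strict;
use warnings;

my $simple    = qr/^PH.*[^I][^H]-000$/;
my $suggested = qr/^PH.*(?:[^I][^H]|I[^HF])-000$/;

my @names = (
    'PH0024080209401400001PH-000',
    'PH0026072913114200001IF-000',
    'PH0029072911352700001AF-000',
    'PH0022080209401500001PE-000',
    'PH0000000000000000000IQ-000',   # made up, to exercise the IQ-000 case
);

my (%by_simple, %by_suggested);
for my $name (@names) {
    $by_simple{$name}    = $name =~ $simple    ? 1 : 0;
    $by_suggested{$name} = $name =~ $suggested ? 1 : 0;
    printf "%s simple=%d suggested=%d\n",
        $name, $by_simple{$name}, $by_suggested{$name};
}
```

        The two patterns should agree on every name except the one ending in "IQ-000".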
        Abigail