in reply to Regex: Capturing and optionally replacing

Forget about doing everything in one regexp. It just makes it hard to read and hard to maintain.

while (<DATA>) {
    chomp;
    my ($host, $domain, $test) = /^([^,]+),([^.]+)\.(.+)$/;
    next if not defined $host;
    $domain =~ s/,/./g;
    next if substr($domain, -4) eq '.net';
    print("Host: $host Test: $test\n");
}

Update: This should be faster:

while (<DATA>) {
    print("Host: $1 Test: $2\n") if /^([^,]+),[^.]+(?<!,net)\.(.+)$/;
}
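
To see what the lookbehind is doing, here is a minimal sketch run against a few made-up lines. The sample data (assumed to be in host,domain,tld.test form) and the `parse_line` helper are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (host, test), or an empty list if the
# line is rejected. Uses the same pattern as the one-liner above.
sub parse_line {
    my ($line) = @_;
    # ([^,]+)     host: everything up to the first comma
    # ,[^.]+      the comma-separated domain, up to the first period
    # (?<!,net)   fail if that domain part ends in ",net"
    # \.(.+)      the period, then the test name
    return $line =~ /^([^,]+),[^.]+(?<!,net)\.(.+)$/ ? ($1, $2) : ();
}

# Made-up sample lines in the assumed input format.
for my $line ('hostc,my-domain,com.testa',
              'hoste,my-domain,net.testxyz',
              'hosta.testa') {
    my ($host, $test) = parse_line($line);
    print defined $host ? "Host: $host Test: $test\n" : "skipped: $line\n";
}
```

Note that a line with no domain at all is skipped too, because the pattern requires a comma after the host.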

Update: I just realized you said it needs to process 6000 files a minute. That's 100 files a second! That seems excessive. You may want to rethink your design.

Re^2: Regex: Capturing and optionally replacing
by McDarren (Abbot) on Dec 08, 2005 at 16:11 UTC
    Oh, okay... fair enough.

    But just to clarify... this is the output I am currently getting:

    Host:hosta-sel-kr-1 Test:testa
    Host:hostb-sel-kr-1 Test:testb
    Host:hostc-sel-kr-1,my-domain,com Test:testa
    Host:hostd-sel-kr-1,my-domain,com Test:testc
    Host:hoste-sel-kr-1 Test:testxyz
    Host:hosta-mel-au-1 Test:testabc
    Host:hosta-mel-au-1 Test:testdef
    Host:hostxyz Test:testabc
    Host:someotherhost Test:someothertest

    and this is the output I want:

    Host:hosta-sel-kr-1 Test:testa
    Host:hostb-sel-kr-1 Test:testb
    Host:hostc-sel-kr-1.my-domain.com Test:testa
    Host:hostd-sel-kr-1.my-domain.com Test:testc
    Host:hoste-sel-kr-1 Test:testxyz
    Host:hosta-mel-au-1 Test:testabc
    Host:hosta-mel-au-1 Test:testdef
    Host:hostxyz Test:testabc
    Host:someotherhost Test:someothertest

    (Notice that the only difference is that the commas in the com domain names have been replaced with periods)

    So your solution doesn't actually give the desired result.

      Based on the variable name, I thought you were only printing the host without the domain.

      And when you said "skip it", I thought you meant the line, not the domain.

      while (<DATA>) {
          my ($host, $domain, $test) = /^([^,.]+)(,[^.]+|)\.(.+)$/;
          $domain =~ s/,/./g;
          $domain = '' if $domain =~ /\.net$/;   # test $domain, not $_
          print("Host:$host$domain Test:$test\n");
      }

      Turns out

      $domain =~ s/^.*\.net$//;
      takes the same amount of time as
      $domain = '' if substr($domain, -4) eq '.net';
      but both are slightly slower than
      $domain = '' if /\.net$/;
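
      A comparison like this can be set up with the core Benchmark module. The following is only a sketch: the sample domain strings and sub names are made up, and the third variant is written with an explicit `$domain =~` binding so that all three subs test the same variable. Timings will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample values: one that should be blanked, one that should not.
my @domains = ('.my-domain.net', '.my-domain.com');

# Each hypothetical helper returns the domain, blanked if it ends in ".net".
sub blank_with_s {
    my ($domain) = @_;
    $domain =~ s/^.*\.net$//;
    return $domain;
}

sub blank_with_substr {
    my ($domain) = @_;
    $domain = '' if substr($domain, -4) eq '.net';
    return $domain;
}

sub blank_with_match {
    my ($domain) = @_;
    $domain = '' if $domain =~ /\.net$/;
    return $domain;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    s_whole   => sub { blank_with_s($_)      for @domains },
    substr_eq => sub { blank_with_substr($_) for @domains },
    match_end => sub { blank_with_match($_)  for @domains },
});
```

      The sub-call overhead is the same for all three variants, so it narrows the relative differences without changing the ordering.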

      Update: 17% faster:

      while (<DATA>) {
          chomp;
          my ($host, $test) = split(/\./, $_, 2);
          $host =~ s/,/./g;
          $host =~ s/\..*\.net$//;
          print("Host:$host Test:$test\n");
      }

      Update: If I change
      $host =~ s/\..*\.net$//;
      to
      $host =~ s/\.my-domain\.net$//;
      my version is 6% faster than yours (with $host =~ s/,/./g; added).
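
      For anyone who wants to try the split approach on its own, here it is wrapped in a sub with a few sample lines. The input format (host,domain,tld.test) is inferred from the output shown earlier in the thread; the `format_line` name and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split-based loop body.
sub format_line {
    my ($line) = @_;
    chomp $line;
    # Everything before the first period is the host plus optional domain;
    # the rest is the test name.
    my ($host, $test) = split /\./, $line, 2;
    $host =~ s/,/./g;          # commas in the domain become periods
    $host =~ s/\..*\.net$//;   # drop a trailing .net domain entirely
    return "Host:$host Test:$test";
}

# Made-up sample lines in the assumed input format.
print format_line($_), "\n" for (
    'hosta-sel-kr-1.testa',
    'hostc-sel-kr-1,my-domain,com.testa',
    'hoste-sel-kr-1,my-domain,net.testxyz',
);
```

      The .com domain keeps its name with periods in place of commas, while the .net domain is dropped and only the bare host is printed.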

      Update: Benchmark code

        Great, thanks for that :)

        I guess the lesson I've learned here is to never forget the KISS principle ;)

        Cheers,
        Darren :)

        Okay... this one kept me awake last night :(

        Because I'd still like to know... how could I have re-written the original expression to get rid of the unwanted commas?

        Can somebody please put me out of my misery?
        (I promise I won't use it in production :D)