Regex: Capturing and optionally replacing

McDarren has asked for the wisdom of the Perl Monks concerning the following question:

Howdy :)

I have a set of filenames that appear as follows:

hosta-sel-kr-1,my-domain,net.testa
hostb-sel-kr-1,my-domain,net.testb
hostc-sel-kr-1,my-domain,com.testa
hostd-sel-kr-1,my-domain,com.testc
hoste-sel-kr-1,my-domain,net.testxyz
hosta-mel-au-1,my-domain,net.testabc
hosta-mel-au-1,my-domain,net.testdef
hostxyz.testabc
someotherhost.someothertest

The format of each filename is:

a hostname
followed by an (optional) domain name
followed by a period
followed by a name of a test

I need to extract the hostname and the test name from each filename, however:

If the domain name is .net, I need to drop it (Update: Just the domain name, not the whole filename)
If the domain name is .com, I need to keep it AND replace the commas with periods.
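
The rules above can be sketched as a small helper. This is only an illustration, not code from the thread: `parse_filename` is a hypothetical name, and the sketch assumes that a domain, when present, always has the comma-separated shape `,name,tld` and that the caller has already removed any trailing newline. The question only names .net and .com, so treating every non-.net domain like .com is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): apply the stated
# rules to one filename and return (host, test), or an empty list on failure.
sub parse_filename {
    my ($filename) = @_;

    # The first period separates host[,domain] from the test name.
    my ($host, $test) = split /\./, $filename, 2;
    return unless defined $test && length $host && length $test;

    # Rule 1: a domain ending in ",net" is dropped entirely.
    # Assumes the domain looks like ",name,net".
    $host =~ s/,.+,net\z//;

    # Rule 2: any domain that remains is kept, with its commas
    # turned into periods.
    $host =~ tr/,/./;

    return ($host, $test);
}

for my $name ('hosta-sel-kr-1,my-domain,net.testa',
              'hostc-sel-kr-1,my-domain,com.testa',
              'hostxyz.testabc') {
    my ($host, $test) = parse_filename($name) or next;
    print "Host:$host Test:$test\n";
}
# Prints:
# Host:hosta-sel-kr-1 Test:testa
# Host:hostc-sel-kr-1.my-domain.com Test:testa
# Host:hostxyz Test:testabc
```

Splitting on the first period keeps the two rules as separate, readable steps; `tr/,/./` is used for the comma replacement because it is a plain character translation and needs no regexp.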

I have written an expression that does everything except replace the commas, and this is where I am stuck.
I know I could simply post-process each filename with s/,/./g, but that's not very elegant, and I'm sure it can be done within the expression.

So I have two questions:

How do I replace the commas within the expression?
Can the expression be made more efficient? (In production, this will run every minute and will process approx 6000 files on each run)

Here is the code I have so far:

#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    my ($host, $test) = ($_ =~
      m/
        (                      # Start first capture
          [\w\-]+              # One or more alphanum or hyphens
          (?:                  # Non-capturing group
            ,my-domain,com     # Literal string
          )?                   # Make it optional
        )                      # End of first capture
          (?:                  # Non-capturing group
            [\w\-,]+           # One or more alphanum, hyphens or commas
          )?                   # Make it optional
        \.                     # A literal period
        (                      # Start second capture
          [a-z]+               # One or more lowercase chars
        )                      # End second capture
     /x)
        or print "Cannot parse $_\n" and next;

    print "Host:$host Test:$test\n";
}

__DATA__
hosta-sel-kr-1,my-domain,net.testa
hostb-sel-kr-1,my-domain,net.testb
hostc-sel-kr-1,my-domain,com.testa
hostd-sel-kr-1,my-domain,com.testc
hoste-sel-kr-1,my-domain,net.testxyz
hosta-mel-au-1,my-domain,net.testabc
hosta-mel-au-1,my-domain,net.testdef
hostxyz.testabc
someotherhost.someothertest

Any advice would be greatly appreciated.
Thanks in advance,
Darren :)


Replies are listed 'Best First'.
Re: Regex: Capturing and optionally replacing by ikegami (Patriarch) on Dec 08, 2005 at 15:55 UTC
Forget about doing everything in one regexp. It just makes it hard to read and hard to maintain.

while (<DATA>) {
    chomp;
    my ($host, $domain, $test) = /^([^,]+),([^.]+)\.(.+)$/;
    next if not defined $host;
    $domain =~ s/,/./g;
    next if substr($domain, -4) eq '.net';
    print("Host: $host Test: $test\n");
}

Update: This should be faster:

while (<DATA>) {
    print("Host: $1 Test: $2\n")
        if /^([^,]+),[^.]+(?<!,net)\.(.+)$/;
}

No need to `chomp` since `.` won't match newline without the `s` modifier. (The `$` will absorb it.)
No need to convert the commas to periods since we don't care about the domain.
No need for two conditions. We can check for .com domain names right in the regexp.
No need to assign the return values of the regexp to variables.

Update: I just realized you said it needs to process 6000 files a minute. That's 100 files a second! That seems excessive. You may want to rethink your design.
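
To make the behaviour of the faster regexp concrete, here is a small sketch that runs it against three of the sample filenames. The wrapper name `match_non_net` is hypothetical and exists only for this illustration; the regexp itself is the one from the reply.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regexp from the reply above. In list
# context a successful match returns its captures; a failed match returns
# an empty list.
sub match_non_net {
    my ($line) = @_;
    return $line =~ /^([^,]+),[^.]+(?<!,net)\.(.+)$/;
}

# Lines keep their trailing newline on purpose: "." does not match "\n"
# without /s, and "$" can match just before a final newline.
for my $line ("hostc-sel-kr-1,my-domain,com.testa\n",
              "hosta-sel-kr-1,my-domain,net.testa\n",
              "hostxyz.testabc\n") {
    my @caps = match_non_net($line);
    (my $shown = $line) =~ s/\n\z//;
    print(@caps ? "$shown => Host: $caps[0] Test: $caps[1]\n"
                : "$shown => no match\n");
}
# Prints:
# hostc-sel-kr-1,my-domain,com.testa => Host: hostc-sel-kr-1 Test: testa
# hosta-sel-kr-1,my-domain,net.testa => no match
# hostxyz.testabc => no match
```

The negative lookbehind `(?<!,net)` rejects any line whose domain ends in `,net`, and the mandatory comma means a filename without a domain does not match at all. The domain is never captured, so it cannot appear in the output.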
Re^2: Regex: Capturing and optionally replacing by McDarren (Abbot) on Dec 08, 2005 at 16:11 UTC
Oh, okay... fair enough. But just to clarify, this is the output I am currently getting:

Host:hosta-sel-kr-1 Test:testa
Host:hostb-sel-kr-1 Test:testb
Host:hostc-sel-kr-1,my-domain,com Test:testa
Host:hostd-sel-kr-1,my-domain,com Test:testc
Host:hoste-sel-kr-1 Test:testxyz
Host:hosta-mel-au-1 Test:testabc
Host:hosta-mel-au-1 Test:testdef
Host:hostxyz Test:testabc
Host:someotherhost Test:someothertest

and this is the output I want:

Host:hosta-sel-kr-1 Test:testa
Host:hostb-sel-kr-1 Test:testb
Host:hostc-sel-kr-1.my-domain.com Test:testa
Host:hostd-sel-kr-1.my-domain.com Test:testc
Host:hoste-sel-kr-1 Test:testxyz
Host:hosta-mel-au-1 Test:testabc
Host:hosta-mel-au-1 Test:testdef
Host:hostxyz Test:testabc
Host:someotherhost Test:someothertest

(Notice that the only difference is that the commas in the com domain names have been replaced with periods.)

So your solution doesn't actually give the desired result.
Re^3: Regex: Capturing and optionally replacing by ikegami (Patriarch) on Dec 08, 2005 at 16:37 UTC
Based on the variable name, I thought you were only printing the host without the domain. And when you said "skip it", I thought you meant the line, not the domain.

while (<DATA>) {
    my ($host, $domain, $test) = /^([^,.]+)(,[^.]+|)\.(.+)$/;
    $domain =~ s/,/./g;
    $domain = '' if $domain =~ /\.net$/;
    print("Host:$host$domain Test:$test\n");
}

Turns out `$domain =~ s/^.*\.net$//;` takes the same amount of time as `$domain = '' if substr($domain, -4) eq '.net';` but both are slightly slower than `$domain = '' if $domain =~ /\.net$/;`

Update: 17% faster:

while (<DATA>) {
    chomp;
    my ($host, $test) = split(/\./, $_, 2);
    $host =~ s/,/./g;
    $host =~ s/\..*\.net$//;
    print("Host:$host Test:$test\n");
}

Update: If I change `$host =~ s/\..*\.net$//;` to `$host =~ s/\.my-domain\.net$//;` my version is 6% faster than yours (with `$host =~ s/,/./g;` added).
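
For readers who want to repeat this kind of comparison, the following is a minimal sketch using the core `Benchmark` module. It is not the benchmark code from the reply: the sub names and the sample domains are made up for illustration, and each candidate is written so that it tests `$d` explicitly. Timings depend on the machine and Perl version, so no figures are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample domains as they look after commas have been turned into periods
# (made up for this sketch, modelled on the filenames in the question).
my @domains = ('.my-domain.net', '.my-domain.com');

# Three candidate ways to blank a domain that ends in ".net".
sub clear_subst  { my ($d) = @_; $d =~ s/^.*\.net$//;                 return $d }
sub clear_substr { my ($d) = @_; $d = '' if substr($d, -4) eq '.net'; return $d }
sub clear_match  { my ($d) = @_; $d = '' if $d =~ /\.net$/;           return $d }

# Run each candidate for at least one CPU second and print a table of
# rates and relative differences.
cmpthese(-1, {
    subst  => sub { clear_subst($_)  for @domains },
    substr => sub { clear_substr($_) for @domains },
    match  => sub { clear_match($_)  for @domains },
});
```

A negative first argument to `cmpthese` asks for a minimum CPU time per candidate rather than a fixed iteration count, which tends to give steadier numbers for very short snippets like these.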
Re^4: Regex: Capturing and optionally replacing by McDarren (Abbot) on Dec 08, 2005 at 17:13 UTC
Re^4: Regex: Capturing and optionally replacing by McDarren (Abbot) on Dec 09, 2005 at 00:34 UTC
Re^5: Regex: Capturing and optionally replacing by ikegami (Patriarch) on Dec 09, 2005 at 06:18 UTC