McDarren has asked for the wisdom of the Perl Monks concerning the following question:

Howdy :)

I have a set of filenames that appear as follows:

hosta-sel-kr-1,my-domain,net.testa
hostb-sel-kr-1,my-domain,net.testb
hostc-sel-kr-1,my-domain,com.testa
hostd-sel-kr-1,my-domain,com.testc
hoste-sel-kr-1,my-domain,net.testxyz
hosta-mel-au-1,my-domain,net.testabc
hosta-mel-au-1,my-domain,net.testdef
hostxyz.testabc
someotherhost.someothertest

The format of each filename is:

I need to extract the hostname and the test name from each filename, however:

I have written an expression that does everything except replace the commas, and this is where I am stuck.
I know I could simply post-process each file with s/,/\./;, but that's not very elegant and I'm sure it can be done within the expression.
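
For reference, a minimal sketch of that post-processing step, assuming the hostname has already been captured into a variable (the name $host and the sample value are only illustrative). Without the /g modifier, s/,/./ changes just the first comma; tr/,/./ is one way to change all of them:

```perl
use strict;
use warnings;

my $host = 'hostc-sel-kr-1,my-domain,com';

# s/,/./ without /g changes only the first comma.
(my $first_only = $host) =~ s/,/./;

# tr/,/./ translates every comma to a period.
(my $all = $host) =~ tr/,/./;

print "$first_only\n";   # hostc-sel-kr-1.my-domain,com
print "$all\n";          # hostc-sel-kr-1.my-domain.com
```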

So I have two questions:

  1. How do I replace the commas within the expression?
  2. Can the expression be made more efficient? (In production, this will run every minute and will process approx 6000 files on each run)
Here is the code I have so far:
#!/usr/bin/perl -w
use strict;

while (<DATA>) {
    my ($host, $test) = ($_ =~ m/
        (                       # Start first capture
            [\w\-]+             # One or more alphanumerics or hyphens
            (?:                 # Non-capturing group
                ,my-domain,com  # Literal string
            )?                  # Make it optional
        )                       # End of first capture
        (?:                     # Non-capturing group
            [\w\-,]+            # One or more alphanumerics, hyphens or commas
        )?                      # Make it optional
        \.                      # A literal period
        (                       # Start second capture
            [a-z]+              # One or more lowercase chars
        )                       # End second capture
    /x) or print "Cannot parse $_\n" and next;
    print "Host:$host Test:$test\n";
}
__DATA__
hosta-sel-kr-1,my-domain,net.testa
hostb-sel-kr-1,my-domain,net.testb
hostc-sel-kr-1,my-domain,com.testa
hostd-sel-kr-1,my-domain,com.testc
hoste-sel-kr-1,my-domain,net.testxyz
hosta-mel-au-1,my-domain,net.testabc
hosta-mel-au-1,my-domain,net.testdef
hostxyz.testabc
someotherhost.someothertest

Any advice would be greatly appreciated.
Thanks in advance,
Darren :)

Replies are listed 'Best First'.
Re: Regex: Capturing and optionally replacing
by ikegami (Patriarch) on Dec 08, 2005 at 15:55 UTC

    Forget about doing everything in one regexp. It just makes it hard to read and hard to maintain.

    while (<DATA>) {
        chomp;
        my ($host, $domain, $test) = /^([^,]+),([^.]+)\.(.+)$/;
        next if not defined $host;
        $domain =~ s/,/./g;
        next if substr($domain, -4) eq '.net';
        print("Host: $host Test: $test\n");
    }

    Update: This should be faster:

    while (<DATA>) {
        print("Host: $1 Test: $2\n")
            if /^([^,]+),[^.]+(?<!,net)\.(.+)$/;
    }

    • No need to chomp since . won't match newline without the s modifier. (The $ will absorb it.)
    • No need to convert the commas to periods since we don't care about the domain.
    • No need for two conditions. We can check for .com domain names right in the regexp.
    • No need to assign the return values of the regexp to variables.
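
    As a rough illustration of the points above, the lookbehind version can be tried on lines that still carry their trailing newline. The sub name parse_non_net is made up for this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (host, test) for lines whose domain
# does not end in ",net", or an empty list otherwise.
sub parse_non_net {
    my ($line) = @_;
    # (?<!,net) is a negative lookbehind: the literal period must not
    # be preceded by ",net". No chomp is needed, because "." does not
    # match "\n" and "$" can match just before a trailing newline.
    return $line =~ /^([^,]+),[^.]+(?<!,net)\.(.+)$/ ? ($1, $2) : ();
}

for my $line ("hostc-sel-kr-1,my-domain,com.testa\n",
              "hosta-sel-kr-1,my-domain,net.testa\n") {
    my ($host, $test) = parse_non_net($line);
    print defined $host ? "Host: $host Test: $test\n" : "skipped\n";
}
# Host: hostc-sel-kr-1 Test: testa
# skipped
```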

    Update: I just realized you said it needs to process 6000 files a minute. That's 100 files a second! That seems excessive. You may want to rethink your design.

      Oh, okay... fair enough.

      But just to clarify... this is the output I am currently getting:

      Host:hosta-sel-kr-1 Test:testa
      Host:hostb-sel-kr-1 Test:testb
      Host:hostc-sel-kr-1,my-domain,com Test:testa
      Host:hostd-sel-kr-1,my-domain,com Test:testc
      Host:hoste-sel-kr-1 Test:testxyz
      Host:hosta-mel-au-1 Test:testabc
      Host:hosta-mel-au-1 Test:testdef
      Host:hostxyz Test:testabc
      Host:someotherhost Test:someothertest

      and this is the output I want:

      Host:hosta-sel-kr-1 Test:testa
      Host:hostb-sel-kr-1 Test:testb
      Host:hostc-sel-kr-1.my-domain.com Test:testa
      Host:hostd-sel-kr-1.my-domain.com Test:testc
      Host:hoste-sel-kr-1 Test:testxyz
      Host:hosta-mel-au-1 Test:testabc
      Host:hosta-mel-au-1 Test:testdef
      Host:hostxyz Test:testabc
      Host:someotherhost Test:someothertest

      (Notice that the only difference is that the commas in the com domain names have been replaced with periods)

      So your solution doesn't actually give the desired result.

        Based on the variable name, I thought you were only printing the host without the domain.

        And when you said "skip it", I thought you meant the line, not the domain.

        while (<DATA>) {
            my ($host, $domain, $test) = /^([^,.]+)(,[^.]+|)\.(.+)$/;
            $domain =~ s/,/./g;
            # Test $domain explicitly; a bare /\.net$/ would match against $_.
            $domain = '' if $domain =~ /\.net$/;
            print("Host:$host$domain Test:$test\n");
        }

        Turns out

        $domain =~ s/^.*\.net$//;
        takes the same amount of time as
        $domain = '' if substr($domain, -4) eq '.net';
        but both are slightly slower than
        $domain = '' if /\.net$/;

        Update: 17% faster:

        while (<DATA>) {
            chomp;
            my ($host, $test) = split(/\./, $_, 2);
            $host =~ s/,/./g;
            $host =~ s/\..*\.net$//;
            print("Host:$host Test:$test\n");
        }
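
        A small sketch that wraps the split-based version in a sub, so it can be checked against a few of the sample filenames. The name parse_name is invented here, and the sketch assumes every name contains at least one period:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the split-based approach.
sub parse_name {
    my ($name) = @_;
    chomp $name;
    # Split on the first period only: host part, then test name.
    my ($host, $test) = split /\./, $name, 2;
    $host =~ s/,/./g;          # commas become periods
    $host =~ s/\..*\.net$//;   # drop the domain for .net hosts
    return ($host, $test);
}

for my $name ('hosta-sel-kr-1,my-domain,net.testa',
              'hostc-sel-kr-1,my-domain,com.testa',
              'hostxyz.testabc') {
    my ($host, $test) = parse_name($name);
    print "Host:$host Test:$test\n";
}
# Host:hosta-sel-kr-1 Test:testa
# Host:hostc-sel-kr-1.my-domain.com Test:testa
# Host:hostxyz Test:testabc
```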

        Update: If I change
        $host =~ s/\..*\.net$//;
        to
        $host =~ s/\.my-domain\.net$//;
        my version is 6% faster than yours (with $host =~ s/,/./g; added).
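
        For anyone wanting to repeat this kind of comparison, here is a minimal sketch using the core Benchmark module. It is not the benchmark that produced the figures above: the sub names and the three sample filenames are assumptions, and the numbers will vary with the machine and the perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @names = (
    'hosta-sel-kr-1,my-domain,net.testa',
    'hostc-sel-kr-1,my-domain,com.testa',
    'hostxyz.testabc',
);

# Generic pattern: strip any domain ending in .net.
sub generic {
    my ($host, $test) = split /\./, $_[0], 2;
    $host =~ s/,/./g;
    $host =~ s/\..*\.net$//;
    return "$host $test";
}

# Literal pattern: strip only .my-domain.net.
sub literal {
    my ($host, $test) = split /\./, $_[0], 2;
    $host =~ s/,/./g;
    $host =~ s/\.my-domain\.net$//;
    return "$host $test";
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    generic => sub { generic($_) for @names },
    literal => sub { literal($_) for @names },
});
```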

        Update: Benchmark code