So, here's an interesting problem I've run into. I have a script for pulling hostnames in a two-node cluster. I hacked it together a bit by shelling out to egrep to check that the hostnames have been added to /etc/hosts on both nodes.

So I may have an /etc/hosts that looks like this:

192.168.1.199 hostname62a.domain.com hostname62a
192.168.1.200 hostname62b.domain.com hostname62b
192.168.1.201 hostname62.domain.com hostname62
192.168.2.144 hostname62amgt.domain.com hostname62amgt
192.168.2.145 hostname62bmgt.domain.com hostname62bmgt

Here's a snip of my code:

my $ha1 = "hostname62a";
my $ha2 = "hostname62b";

# Shell out to egrep; \b is doubled so it survives Perl's string quoting
my $cmd1 = "egrep -i \"\\b$ha1\\b|\\b$ha2\\b\" /etc/hosts";

open(HOSTS1, "$cmd1|");
while (<HOSTS1>) {
    chomp;
    push(@hosts_ha1, $_);   # keep each matching /etc/hosts line
}
close(HOSTS1);

I used word boundaries (\b) to make sure I only find what I'm looking for. Normally, this would return something like below:

192.168.1.199 hostname62a.domain.com hostname62a
192.168.1.200 hostname62b.domain.com hostname62b

This is what I want. Just the two hostnames.

The hostnames themselves follow whatever standard the customer sets, so we have little control over what they name their stuff. But usually the above code works well for just pulling out the hostnames. We do control how they format the names in /etc/hosts by providing a script interface, so how stuff is laid out in /etc/hosts is pretty constant.

Now here's the problem: (\b) boundaries work pretty well most of the time. But we have one customer that named his stuff like this:

192.168.1.199 hostname62a.domain.com hostname62a
192.168.1.200 hostname62b.domain.com hostname62b
192.168.1.201 hostname62.domain.com hostname62
192.168.2.144 hostname62a-r.domain.com hostname62a-r
192.168.2.145 hostname62b-r.domain.com hostname62b-r

So the above egrep statement finds these:

192.168.1.199 hostname62a.domain.com hostname62a
192.168.1.200 hostname62b.domain.com hostname62b
192.168.2.144 hostname62a-r.domain.com hostname62a-r
192.168.2.145 hostname62b-r.domain.com hostname62b-r

This is because "-" isn't a word character, so "\b" still matches between the "a" and the "-", and the "-r" names slip through. I have no idea how to craft the right expression to match just the hostnames I want. I do have customers that name their stuff like below:

192.168.1.2 hostname-node1.domain.com hostname-node1
192.168.1.3 hostname-node2.domain.com hostname-node2
192.168.1.4 hostname-node1mgt.domain.com hostname-node1mgt
192.168.1.5 hostname-node2mgt.domain.com hostname-node2mgt

Which will return:

192.168.1.2 hostname-node1.domain.com hostname-node1
192.168.1.3 hostname-node2.domain.com hostname-node2

So I can't split on the "-". Ugh, even now my head hurts thinking about this issue. Does anyone have any idea for some nifty Perl regex that could solve my problem?
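
For illustration only, here is a minimal sketch of one direction that might work, done in Perl itself rather than through egrep. It assumes the short hostname always appears as its own whitespace-separated field on the line, which seems to hold for the script-generated /etc/hosts layouts shown above. The helper name match_hosts and the hard-coded sample lines are hypothetical, not part of the real script. Instead of \b, it uses the lookarounds (?<!\S) and (?!\S), so a name only matches when it has whitespace or the start/end of the line on both sides.

```perl
use strict;
use warnings;

# Sample lines copied from the "-r" customer's /etc/hosts shown above.
my @lines = (
    '192.168.1.199 hostname62a.domain.com hostname62a',
    '192.168.1.200 hostname62b.domain.com hostname62b',
    '192.168.1.201 hostname62.domain.com hostname62',
    '192.168.2.144 hostname62a-r.domain.com hostname62a-r',
    '192.168.2.145 hostname62b-r.domain.com hostname62b-r',
);

# Hypothetical helper: return the lines in which one of the given
# names appears as a complete whitespace-delimited field.
sub match_hosts {
    my ($names, $lines) = @_;

    # quotemeta escapes anything special in a customer-chosen name
    my $alt = join '|', map { quotemeta($_) } @$names;

    # (?<!\S) : not preceded by a non-space character
    # (?!\S)  : not followed by a non-space character
    # so "hostname62a" cannot match inside "hostname62a-r" or
    # "hostname62a.domain.com", only as the bare short name.
    my $re = qr/(?<!\S)(?:$alt)(?!\S)/i;

    return grep { $_ =~ $re } @$lines;
}

my @found = match_hosts([ 'hostname62a', 'hostname62b' ], \@lines);
print "$_\n" for @found;
```

With the sample data this should keep only the hostname62a and hostname62b lines. The same helper, under the same assumption, should also cope with names such as hostname-node1, since the "-" is just part of the field being compared and the "mgt" variants fail the trailing (?!\S) check.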


In reply to Perl regex and word boundaries by MeatLips
