murugu has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

I am currently processing the text in a file. I want to change the hyphenated words in the file into another format, as shown below.

input

my father-in-law is a chemist who is currently working with &alpha;phenol-acetate.

output

my father<->in<->law is a chemist who is currently working with &alpha;phenol<->acetate.

In the above input, the hyphens should be replaced with <->. We have another file, file2, which contains a list of hyphenated words. We have to replace the text in file1 that matches the text present in file2.

The problem is that when we wrote a regular expression to match the hyphenated words, it matched all of them except the words starting with an ampersand.

the regular expression is

m/\b[;&\w]+([-\w;&]+)+\b/gsi;

Why is our code not matching the words that start with an ampersand? Is there any other way to solve this? If so, we need the regular expression.

---Murugesan and Prasad---

(z) Re: Regex not matching &
by zigdon (Deacon) on Mar 18, 2004 at 15:42 UTC

    I think the problem is that \b matches only between a \W and a \w character - since '&' isn't in the \w class, \b fails to match between the space and the '&'.

    Try replacing the \b with [^\w&][\w&] and adjusting the rest of the regexp accordingly.

    See perlre for more details
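
A small sketch of the point above, using a shortened version of the sample sentence from the question. The pattern is a simplified, hyphen-requiring variant of the original, and the variable names are only illustrative:

```perl
use strict;
use warnings;

my $text = 'working with &alpha;phenol-acetate.';

# \b needs a \w character on exactly one side. Between the space and
# the '&' both neighbours are \W, so the match cannot start at '&'.
my ($with_b) = $text =~ /(\b[;&\w]+-[-;&\w]+\b)/;

# Without the leading \b the match is free to start at the '&'.
my ($without_b) = $text =~ /([;&\w]+-[-;&\w]+\b)/;

print "with \\b:    $with_b\n";       # alpha;phenol-acetate
print "without \\b: $without_b\n";    # &alpha;phenol-acetate
```

With the leading \b the match silently drops the '&' rather than failing outright, which is easy to miss when only checking whether anything matched.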

    -- zigdon

Re: Regex not matching &
by Abigail-II (Bishop) on Mar 18, 2004 at 15:45 UTC
    Well, you require the match to start at a switch between word and non-word characters. However, both a space and an & are non-word characters. So you fail to match "words" starting with an ampersand.

    I'm not giving you an alternate regexp, as it isn't quite clear to me what exactly you consider to be a word or not.

    Abigail

      Actually, we have two files. File 1 is an XML file. File 2 is a text file which consists of hyphenated words.

      FILE-I

      <element id="10">&alpha;phenol-acetate and Ace-tone and 5-ethyl-alcohol</element>

      File-II

      Ace-tone

      &alpha;phenol-acetate

      I want to replace the hyphenated words in file 1 that are also present in file 2, with their hyphens changed into <->.

      Is the problem clear now?

      We want the regular expression.

      Thanks for your kind reply.

        Eh, no. The problem is not clear. Any problem that considers matching "words" isn't clear unless there is a clear (sic) definition of what a word is. Your original code allows "words" to contain semi-colons, dashes and ampersands. Are you considering ;-; to be a word? Is father-in-law&mother-in-law one or two words? Etc, etc.

        Once you have a clear definition of what you are going to consider words, writing a regex is likely to be easy.
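
One possible sketch, under the assumption (not confirmed anywhere in the thread) that a "word" is exactly an entry of file 2, and that an entry should not match when it is embedded in a longer hyphenated token. The in-memory data stands in for the two files, and mark_hyphens is a hypothetical helper name:

```perl
use strict;
use warnings;

# Stand-ins for the contents of file 2 and file 1 (sample data from the thread).
my @words = ('Ace-tone', '&alpha;phenol-acetate');
my $xml   = '<element id="10">&alpha;phenol-acetate and Ace-tone and 5-ethyl-alcohol</element>';

# Longest entries first, so a shorter entry cannot pre-empt a longer one.
# quotemeta escapes any regex metacharacters in the list entries.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @words;

# Hypothetical helper: turn every hyphen of one listed word into <->.
sub mark_hyphens {
    my ($word) = @_;
    $word =~ s/-/<->/g;
    return $word;
}

# The lookarounds are the assumed "word" definition: no word character,
# hyphen, '&' or ';' directly before, no word character or hyphen after.
$xml =~ s/(?<![-\w&;])($alternation)(?![-\w])/mark_hyphens($1)/ge;

print "$xml\n";
```

Because only listed words are touched, 5-ethyl-alcohol keeps its plain hyphens, which a blanket substitution over every hyphen would not do.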

        Abigail

Re: Regex not matching &
by CombatSquirrel (Hermit) on Mar 18, 2004 at 16:24 UTC
    How about this:
    $_ = 'my father-in-law is a chemist who is currently working with &alpha;phenol-acetate.';
    s/\b-\b/<->/g;
    print;
    Hope it helps...
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.
Re: Regex not matching &
by Anomynous Monk (Scribe) on Mar 18, 2004 at 18:51 UTC
    \b is only going to find a word boundary for you if you agree with what \w thinks is a word character. If you want to have an expanded notion of what is a word, using \b is not going to do what you want.

    In this case, I don't see any need for it at all, though. Just removing the \b's should do what you want. Perhaps add [;&\w] at the end if you don't want it to end with a hyphen.

    Your regex will set $1 to everything from the first hyphen on, or to the last character of the "word" if there were no hyphens. That seems kind of bizarre. Are you using $1? If so, exactly which part of the word do you want it to contain? If not, you can simplify it to m/[;&\w][-;&\w]+/g, or if you mean to only match hyphenated words, m/[;&\w]+-[-;&\w]+/g.

    Your //s and //i flags have no effect with your regex.
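
A short sketch of both observations, using words from the sample sentence in the question. The hash and array names are only illustrative:

```perl
use strict;
use warnings;

# The original pattern from the question, minus the /s and /i flags.
my $original = qr/\b[;&\w]+([-\w;&]+)+\b/;

# Record what $1 holds for a hyphenated and a plain word.
my %capture;
for my $word ('father-in-law', 'chemist') {
    $capture{$word} = $1 if $word =~ $original;
}
print "$_ => \$1 = '$capture{$_}'\n" for sort keys %capture;
# chemist => $1 = 't'
# father-in-law => $1 = '-in-law'

# The simplified hyphenated-only pattern, with no \b and no capture group.
my $text  = 'my father-in-law is a chemist who is currently working with &alpha;phenol-acetate.';
my @found = $text =~ /[;&\w]+-[-;&\w]+/g;
print "$_\n" for @found;
# father-in-law
# &alpha;phenol-acetate
```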

      Thanks for all of your responses.

      Right now it is working just by removing the starting \b. I have finished the work now.

      Once again thanks guys

      --Murugesan--

        so how would I just match the & character?
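
A minimal sketch for the question above. The '&' character has no special meaning in a Perl pattern, so it can be written literally. The sample string is made up for illustration:

```perl
use strict;
use warnings;

my $text = 'Tom & Jerry &alpha;phenol';

# '&' is an ordinary character in a pattern and needs no escaping.
# The empty-list assignment counts the matches.
my $count = () = $text =~ /&/g;

# The same literal '&' can start a larger pattern, here an entity
# such as &alpha; (an '&', word characters, then a ';').
my @entities = $text =~ /(&\w+;)/g;

print "ampersands: $count\n";      # 2
print "entities: @entities\n";     # &alpha;
```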