REGEX Help

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,
1. We have a string "Olympus xModel 3.1mp". In this
string we want the REGEX to remove the chars "mp" (i.e. the
characters appearing after 3.1) and not the "mp" that is
part of the brand Olympus.

2. We have another string "Olympus yModel 3x Zoom
Optical 5x Zoom Digital". As you can see, "Zoom" appears
twice in this string. Here we want to remove the
word "Zoom" and also the word preceding each
"Zoom", i.e. 3x and 5x. But before removing the 3x and
5x, we also want to make sure that the word contains the
character "x" and that all its other characters
are numeric. There could also be
more occurrences of "Zoom" in the string. The REGEX should
remove all occurrences of "Zoom" and the
word appearing immediately before "Zoom", provided it
contains only numbers (not necessarily 3 or 5) and the character "x".

Thanks a lot.

Regards,
Habib

Re: REGEX Help by GrandFather (Saint) on Jun 24, 2006 at 07:59 UTC
In the first case it is probably sufficient to ensure that the preceding character is a digit and that the "mp" is not followed by a word character, so `s/(?<=\d)mp\b//g` should handle that one. The second case is actually even simpler, with no zero-width look-behind assertion required: `s/\s+\d+x Zoom//g`. Update: fixed broken regex. Was `s/(><=\d)mp\b//g`.
DWIM is Perl's answer to Gödel
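To see the two substitutions from this reply side by side, here is a minimal sketch run against the two sample strings from the question. The sub names `strip_megapixel_suffix` and `strip_zoom_specs` are made up for illustration and do not appear in the thread.

```perl
use strict;
use warnings;

# Remove "mp" only where it directly follows a digit and ends a word,
# so the "mp" inside "Olympus" is left alone.
sub strip_megapixel_suffix {
    my ($text) = @_;
    $text =~ s/(?<=\d)mp\b//g;
    return $text;
}

# Remove every "<digits>x Zoom" together with the whitespace before it.
sub strip_zoom_specs {
    my ($text) = @_;
    $text =~ s/\s+\d+x Zoom//g;
    return $text;
}

print strip_megapixel_suffix("Olympus xModel 3.1mp"), "\n";
# prints: Olympus xModel 3.1

print strip_zoom_specs("Olympus yModel 3x Zoom Optical 5x Zoom Digital"), "\n";
# prints: Olympus yModel Optical Digital
```

Because the zoom pattern consumes the leading whitespace, no doubled spaces are left behind where a zoom specification was removed.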
Re^2: REGEX Help by Anonymous Monk on Jun 24, 2006 at 08:29 UTC
Thanks a lot. The second problem is resolved, but the first issue still persists. What could be missing in this solution? `$original_text="Olympus xModel 3.1mp"; $original_text=~ s/(><=\d)mp\b//g; print "\n$original_text";` Regards, Habib
Re^3: REGEX Help by GrandFather (Saint) on Jun 24, 2006 at 08:35 UTC
The absence of a stupid typo! Sorry, the regex should have been `$original_text=~ s/(?<=\d)mp\b//g;`. By the way, I strongly recommend that you add `use strict; use warnings;` to any script you write. You then need to declare variables using `my`, but many problems get found before they bite hard. You should also browse the following documentation, probably in the order given: perlretut, perlre and perlreref.
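As a rough illustration of the `use strict; use warnings;` advice, the sketch below runs the corrected substitution with a `my` declaration, then compiles a small snippet containing a deliberately misspelled variable name to show the kind of mistake strict reports. The misspelling `$orignal_text` and the `eval` wrapper are invented for this example. Note that the original `(><=\d)` typo is itself a valid regex (a group matching the literal characters `><=` followed by a digit), so strict and warnings would not have flagged that particular slip.

```perl
use strict;
use warnings;

# Corrected substitution, with the variable declared via "my".
my $original_text = "Olympus xModel 3.1mp";
$original_text =~ s/(?<=\d)mp\b//g;
print "$original_text\n";
# prints: Olympus xModel 3.1

# A snippet with a misspelled variable name, compiled at run time so that
# the failure can be caught and reported instead of aborting this script.
my $snippet = q{
    use strict;
    my $original_text = "Olympus xModel 3.1mp";
    $orignal_text =~ s/(?<=\d)mp\b//g;   # typo: "orignal"
    1;
};
my $ok = eval $snippet;
my $strict_error = $ok ? '' : $@;
print "strict rejected the snippet: $strict_error" if $strict_error;
```

Without strict, the misspelled name would silently refer to a separate, empty global variable and the substitution would appear to do nothing.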
Re^3: REGEX Help by rsriram (Hermit) on Jun 26, 2006 at 07:08 UTC
Hi, Your regex `$original_text=~ s/(><=\d)mp\b//g;` is not complete. You do not have a replacement string. Try this instead: `$original_text=~ s/(.+)mp/$1/g;` Sriram
Re: REGEX Help by rsriram (Hermit) on Jun 24, 2006 at 09:37 UTC
Hi, Try this, `$str="Olympus xModel 3.1mp";` `$str =~ s/(.+)mp/$1/g;` `print $str;` `$str="Olympus yModel 3x Zoom Optical 5x Zoom Digital";` `$str =~ s/([0-9]+)x Zoom//g;` `print $str;` Sriram
Re^2: REGEX Help by dsheroh (Monsignor) on Jun 24, 2006 at 14:31 UTC
Your first regex only works in this case because of the quirk in the data that "3.1mp" happens to be at the end of the string. The greediness of `(.+)` causes it to actually remove the last "mp" in the string, regardless of its context. This also means that if the model is just "3.1" instead of "3.1mp", it will remove the "mp" in "Olympus". The second regex is more reliable, but would get false positives if they released a line of cameras with "Super Duper T101x Zoominess(tm)", turning it into "Super Duper Tiness(tm)". It also leaves two consecutive spaces when removing a zoom specification. (Both of these can be fixed by adding a space at the beginning of the regex.)
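The failure cases described in this reply can be reproduced with a short sketch. The sub names are hypothetical, and `zoom_strip_spaced` is one reading of "adding a space at the beginning of the regex".

```perl
use strict;
use warnings;

# Greedy pattern from the earlier reply: (.+) grabs as much as it can,
# so the LAST "mp" in the string is removed, whatever its context.
sub greedy_strip {
    my ($text) = @_;
    $text =~ s/(.+)mp/$1/g;
    return $text;
}

# Zoom pattern from the earlier reply, without and with a leading space.
sub zoom_strip {
    my ($text) = @_;
    $text =~ s/([0-9]+)x Zoom//g;
    return $text;
}

sub zoom_strip_spaced {
    my ($text) = @_;
    $text =~ s/ ([0-9]+)x Zoom//g;
    return $text;
}

print greedy_strip("Olympus xModel 3.1"), "\n";
# prints: Olyus xModel 3.1

print zoom_strip("Super Duper T101x Zoominess(tm)"), "\n";
# prints: Super Duper Tiness(tm)

print zoom_strip_spaced("Super Duper T101x Zoominess(tm)"), "\n";
# prints: Super Duper T101x Zoominess(tm)

print zoom_strip_spaced("Olympus yModel 3x Zoom Optical 5x Zoom Digital"), "\n";
# prints: Olympus yModel Optical Digital
```

The leading space only guards the front of the match. A string such as " 101x Zoominess" would still match, so a trailing `\b` after `Zoom` might be a further safeguard, though that goes beyond what the reply suggests.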