PerlMonks |
EXTENDED PATTERNS IN REGULAR EXPRESSIONS - Reference to Tutorial
The section on extended patterns in perlre reads more as a reference than as a tutorial. Perl programmers who aspire to learn the language thoroughly (like me) find that difficult, so I am attempting to elucidate the extended patterns of regular expressions alone.

The extended patterns are so named because they extend the existing syntax, which is already similar to that of awk and sed. An extended pattern is written as an opening parenthesis, a question mark, and then the extended pattern itself, for example (?=\|).

The Perl documentation gives two reasons for choosing the question mark as the indicator for the extended patterns. The first is that question marks are rare in older regular expressions; the second is that a question mark makes one stop and think. I find the same holds when it is used to make a quantifier non-greedy: one stops to think.

Some extended patterns are considered experimental and may be removed at any time, so I shall discuss only the others.

(?imsx-imsx)

This extended pattern embeds one or more pattern-match modifiers in the regular expression itself. For example, (?i) makes the match case insensitive. It can be combined with the other patterns.

Practical use: This type of extended pattern is useful for matching dynamic file contents or user input strings, where we do not know or control the contents. As a practical scenario, let us assume we have an Oracle instance with a schema that has a column named cust_code. This column has two parts: one is a fixed set of characters, and the other is a free-text part that names the sub-contractors a customer has.
Let us say a condition must match one part of the string exactly, with case intact, and the other part case insensitively (in any mix of capital and small letters). The pattern to be matched is then a combination of case-sensitive and case-insensitive text. To enforce such a pattern in the regular expression we could use SILICONEX((?i)siliconex-wafers).

SILICONEX is the case-sensitive text at the start of the pattern. The other part, ((?i)siliconex-wafers), is the extended pattern that matches siliconex-wafers case insensitively, in small or capital letters. A cust_code with the value "SILICONEXSILICONEX-WAFERS" therefore matches as well, along with every other mix of small and capital letters in that part. By contrast, a pattern with (?i) at its very start, such as (?i)SILICONEXsiliconex-wafers, matches any mix of small and capital letters across the whole string and not only the exact string "SILICONEXsiliconex-wafers".
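The two cases described above can be tried out with a short sketch. The original code did not survive on this page, so the patterns below are a reconstruction from the description; the ^ and $ anchors and the helper names matches_mixed and matches_any_case are assumptions added for illustration.

```perl
use strict;
use warnings;

# Assumed pattern: SILICONEX must match exactly. (?i) applies only until the
# end of the enclosing group, so siliconex-wafers matches in any case.
sub matches_mixed {
    my ($cust_code) = @_;
    return $cust_code =~ /^SILICONEX((?i)siliconex-wafers)$/ ? 1 : 0;
}

# Assumed pattern: (?i) at the very start makes the whole pattern
# case insensitive.
sub matches_any_case {
    my ($cust_code) = @_;
    return $cust_code =~ /(?i)^SILICONEXsiliconex-wafers$/ ? 1 : 0;
}

for my $code ('SILICONEXsiliconex-wafers',
              'SILICONEXSILICONEX-WAFERS',
              'siliconexSILICONEX-WAFERS') {
    printf "%-26s mixed=%d any_case=%d\n",
        $code, matches_mixed($code), matches_any_case($code);
}
```

Only the last value, with the lower-case siliconex prefix, should be rejected by the mixed pattern while still being accepted by the fully case-insensitive one.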
The value "siliconexSILICONEX-WAFERS" would also match the pattern indicated by the above code, because the case-insensitive modifier (?i) at the start makes the whole pattern case insensitive, so every letter in $pattern, "SILICONEXsiliconex-wafers", matches in either case.

(?:pattern)

This type of pattern is for clustering: it groups sub-patterns together without capturing them. (?:pattern) is purely a grouping syntax; unlike plain parentheses, it does not create a backreference.

Practical use: The address text from a user-entered form can be any free text. Likewise, when you migrate from Informix to Oracle, columns such as an address or a manager note that hold free text can contain control-M (^M) characters, which are carriage returns and so split the value across several lines. It can also happen the other way around: a free-text column value with carriage returns shows ^M characters when extracted to a Unix ASCII file that has to be loaded into the Oracle schema. For such complex strings the pattern can be difficult to write without the flexibility that an extended pattern match gives you.

The (?:pattern) match can be used along with the previous (?imsx-imsx) through the syntax (?imsx-imsx:pattern) to create a more flexible extended pattern. Consider a string from a schema or a free-text column that contains such characters.
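As a minimal sketch of the grouping forms, assuming a hypothetical column value in $sub_contract that contains a carriage return (^M), a "|" and the word NOTE in mixed case: the helper name split_on_note and the sample data are inventions for illustration, not the original code of this node.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise the separators, then split on NOTE.
sub split_on_note {
    my ($sub_contract) = @_;

    # (?:...) clusters the alternatives (a carriage return with an optional
    # newline, or a pipe) without creating a capture group.
    (my $clean = $sub_contract) =~ s/(?:\r\n?|\|)/ /g;

    # (?i:...) makes only NOTE case insensitive. Because the group does not
    # capture, split does not return the separators as extra fields.
    return split /\s*(?i:NOTE)\s*/, $clean;
}

my @parts = split_on_note("Wafer lot 7\rnote fragile|NOTE ship by air");
print join(' / ', @parts), "\n";   # prints: Wafer lot 7 / fragile / ship by air
```

Writing the separator as ((?i)NOTE) with capturing parentheses would make split return each matched NOTE as a field of its own, which is usually not wanted here.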
This involves both checking for ^M or | and changing it to a space; a - in the modifiers actually switches the option off. You then split on "NOTE", matched case insensitively, in the split function: split(/(?i)NOTE/, $sub_contract);

(?=pattern) (?<=pattern) (?!pattern) (?<!pattern)

These are the forward (look_ahead) and backward (look_behind) assertions. Each also has a positive and a negative form, depending on whether you assert that the pattern does or does not match in the forward or backward direction.
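The four assertions can be compared side by side with a small sketch. The foo/bar sample strings and the helper names are assumptions chosen for illustration; each helper returns the text that was actually consumed, which never includes the asserted part.

```perl
use strict;
use warnings;

# Hypothetical helpers; each returns the matched text, or undef on failure.
sub ahead      { return $_[0] =~ /(foo)(?=bar)/  ? $1 : undef }  # foo followed by bar
sub not_ahead  { return $_[0] =~ /(foo)(?!bar)/  ? $1 : undef }  # foo not followed by bar
sub behind     { return $_[0] =~ /(?<=foo)(bar)/ ? $1 : undef }  # bar preceded by foo
sub not_behind { return $_[0] =~ /(?<!foo)(bar)/ ? $1 : undef }  # bar not preceded by foo

for my $s (qw(foobar foobaz bazbar)) {
    printf "%-7s ahead=%s not_ahead=%s behind=%s not_behind=%s\n", $s,
        map { defined $_ ? $_ : '-' }
            ahead($s), not_ahead($s), behind($s), not_behind($s);
}
```

For "foobar" only the two positive assertions succeed, for "foobaz" only the negative look_ahead does, and for "bazbar" only the negative look_behind does.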
Then there is the concept of fixed-width and variable-width assertions. The positive and negative forward (look_ahead) assertions allow variable-width patterns. The positive and negative backward (look_behind) assertions allow fixed-width patterns alone.

Practical use: The data extracted from the schemas always ends up as a delimited file on the OS.

What really happens, though, is that if two empty fields occur together, a plain pattern match does not do what is required in both places. What actually happens is as follows.
The regular expression that is usually tried first does add a space inside the first occurrence of xx, making it x x, but the next xx, which immediately follows the first, gets no space. This is because the first |xx| is matched and rewritten as |x x|. The second empty field is also delimited by | on both sides, but it shares its leading | with the first match. The regular expression has already consumed that part of the string, so it skips the second occurrence and leaves it unspaced. This is what the look_ahead and look_behind assertions handle.
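The difference can be reproduced with a short sketch. The original substitutions are missing from this page, so both regular expressions and the helper names naive_fix and lookaround_fix are reconstructions from the description and should be read as assumptions.

```perl
use strict;
use warnings;

# Assumed first attempt: the delimiters are part of the match, so the | that
# closes one field is consumed and cannot open the next one.
sub naive_fix {
    my ($line) = @_;
    $line =~ s/\|xx\|/|x x|/g;
    return $line;
}

# Assumed fix: the delimiters are only asserted, never consumed, so adjacent
# fields can share a |.
sub lookaround_fix {
    my ($line) = @_;
    $line =~ s/(?<=\|)xx(?=\|)/x x/g;
    return $line;
}

my $record = 'a|xx|xx|b';
print naive_fix($record), "\n";        # prints: a|x x|xx|b
print lookaround_fix($record), "\n";   # prints: a|x x|x x|b
```

Both look_behind and look_ahead here assert a single "|" character, so the fixed-width restriction on look_behind is not an issue in this case.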
The pattern with the look_ahead and look_behind assertions looks for an xx that follows a "|" and is followed by a "|". The assertions consume nothing, so $& is just the xx, and the substitution changes it to x x, which solves the problem.
Great monks like tachyon and BrowserUk helped me in an earlier node regarding this extended pattern. The edge cases in the string are not solved by the fixed-width match in the first code, hence we use the look_ahead assertion, which may be of variable width.

The extended patterns which might be removed:

(?{code}), which allows a block of code to be included in place of "code"
(??{code}), which evaluates the "code" at run time and treats the result as a pattern, as if it had been there before run time
(?(condition)yes-pattern|no-pattern), which is a conditional pattern match
(?>pattern), which is called an independent subexpression and matches what a stand-alone pattern would match if it were anchored at the given position

I have left out the elucidation of these and point to perlre instead, because of the experimental nature of the patterns. Fellow monks have nodes in which they use these patterns of regular expressions. I consider this document complete only with comments and explanations of anything extra to the tutorial content outlined here.
Edit by tye, add READMORE In reply to Extended Patterns of Regular Expressions by OM_Zen