Accelerated Regex Tutorial

Regular expressions (regex or regexp for short) do look screwy, that's a fact. The reason, for me, is that you cannot guess their meaning as easily as you can intuitively read program code in a language you have not quite mastered. Any guessing attempt will be defeated until you are explicitly told about the very special meaning of a bunch of commonplace characters like . * + ? | ( [ { \ ^ $ and obviously the closing ) ] } . Count them! We have only 11 special characters, no more. If you know them, you'll understand and develop 95% of the regular expressions that you'll need. You will spend 5 extra minutes on the others, usually looking up the meaning of a seldom-used flag in the reference documentation.

None of the online regex tutorials I have visited follows the progression in understanding that, I personally think, is the most appropriate. So I wrote my own tutorial with the definite objective that, in roughly one hour, you'll become quite good at regular expressions. I even hope that most of you will start to love them for their absolutely magical power, and the enjoyable mind-teasing opportunities that they offer.

One last piece of advice: this is a tutorial, not reference documentation. As suggested above, there is a progression where each lesson builds on the previous one. Just read through and take time to understand, or read faster, but it would be wise not to skip a section.

Lesson 1: What are regexes used for?

A regular expression (regex) is basically a pattern that you will match against an input string. The very first error is to consider matching as the purpose of using a regex. It is not! Matching is a technique, and the purpose for which the technique will be used is, to my knowledge, one of the following four: identification, extraction, validation, or cut-out (segmentation).
The next time you read or develop a regex, know the goal: identify? extract? validate? or cut? This may very much influence the design of the right regular expression pattern, or speed up the reading.

Lesson 2: Special characters

The screwiness of a regular expression pattern will quickly fade away as you learn to spot the special characters in it, in order to discover the boundaries of sub-expressions and how these sub-expressions relate to each other (in sequence? like A-or-B? as an optional bit? as repeated stuff?) in the making of the final pattern.

Pattern matching mechanisms are not complex: it's all about describing a sequence of string pieces that are mandatory, optional, or repeating, or exclude each other, and defining which precise character, or set of characters, is expected at every position. The inventors of regular expressions opted for a very compact notation which is both magical and a challenge for the eye. Such compactness and the use of commonplace characters are, to my understanding, behind most of the apparent complexity.

So you shall train your eye to recognize . * + ? | ( [ { \ ^ $ as characters with a special meaning. And every time you see one of ( [ { look for the corresponding ) ] } as they always go in pairs, possibly nested as in (A[B]C) but never overlapping as in {A[B}C]. In a moment, we will learn the effects of each of them, calmly, and in turn.

There is not much more to learn! Take the time to consider the list again: . dot, * star, + plus, ? question mark, | pipe, ( brackets (parentheses), [ square brackets, { curly brackets, \ backslash, ^ caret (formally a circumflex accent), and the $ dollar sign. And do compare them to their peers ! / : ; , - ' # " ~ & @ = % _ and space, which are not special characters.

Lesson 3: Regular expressions without special characters!

Before getting saturated with special characters, what happens if we don't use any of them?
Well, we get the basic and essential pattern matching behavior: simply matching character against character, case sensitive by default. Let's consider the input string ABCaBCABBACCCAAABCCC and the regular expression ABC.

Side question: is there a standard way of quoting a regular expression? For instance, shall we write the above as "ABC"? Reply: No! It depends entirely on the language or tool dealing with regular expressions. For instance, in Java and C# you will write it "ABC", in PHP it becomes 'ABC', but when used in an awk or sed command in UNIX it will be written by default as /ABC/, whereas when passed as an argument to the grep command you'll specify "ABC" or 'ABC' or even ABC (without any quotes because there is no space character inside this regex, spaces being interpreted as delimiters of command line arguments unless included within single or double quotes).

In the reverseXSL software, you can use any character as delimiter; indeed, the first non-space character of a regex specification automatically becomes the delimiter. But then you shall not use that character elsewhere in the regex: there is no way to escape it. For instance, a regex in DEF files can appear as "ABC", or 'ABC', or #ABC#, or xABCx (this last one is not very clever!). You may also use |ABC|, or (ABC(, or )ABC) and so forth, but it is not recommended because | and () and the others are special characters often needed inside a regex.

Within this tutorial, strings and regular expressions are quoted with background colors.

So, if we match the regex ABC onto ABCaBCABBACCCAAABCCC we get two matches, the leading ABC and the ABC inside the trailing AAABCCC, and not over aBC because matching is case sensitive by default.

Remembering that ! / : ; , - ' # " ~ & @ = % _ and space are not special characters, we can solve the following examples:
In the third case, remember that " and ' are characters like any others. Forget about string delimiters in programming languages; we deal here with regular expressions. In the last case, remember that the special characters . * + ? | ( [ { \ ^ $ seen in the input string are only special on the regex side. Characters within the input string shall always be considered as plain, simple characters that we may like to match.

In practice

Are regular expressions without any special characters of any use? Yes, for identification: such regexes indicate that a given tag or value exists in your input, and that may just be enough to drive a processing decision. In the reverseXSL parser there is also a segment cut mode CUT-ON-"regex". For instance, CUT-ON-"--" applied to ABC--DEF--G-H--IJK yields the segmented pieces ABC , DEF , G-H , and IJK .

Lesson 4: Matching the start and end of the input

In most of the examples above, we matched string portions somewhere in the middle of the input. We can also force a regular expression to match only at the very beginning or the very end of the input string.
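As a minimal sketch of what anchoring does, assuming Python's re module as a stand-in for whatever regex library you use (the helper name found is made up for this illustration): ^ pins the pattern to the start of the input and $ pins it to the end.

```python
import re

def found(regex, text):
    """Return True when the regex matches somewhere in text."""
    return re.search(regex, text) is not None

anchor_results = {
    "ABC anywhere in XABCZ": found(r"ABC", "XABCZ"),   # no anchor: matches in the middle
    "^ABC on XABCZ": found(r"^ABC", "XABCZ"),          # input does not begin with ABC
    "^ABC on ABCZ": found(r"^ABC", "ABCZ"),            # input begins with ABC
    "ABC$ on XABCZ": found(r"ABC$", "XABCZ"),          # input does not end with ABC
    "ABC$ on XABC": found(r"ABC$", "XABC"),            # input ends with ABC
}

for label, outcome in anchor_results.items():
    print(label, "->", outcome)
```

Only the first, third and last cases should report a match.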
Expert note: in MULTILINE mode (cf. the advanced tutorial) we can use ^ and $ to match intermediate line boundaries.

So you shall understand the effect of the following regular expressions:
Regular expressions like ABC^ , $ABC or A^B$C would not cause any error; they are simply pointless, as they can never match anything.

The last case (^MyRegex$) is quite interesting: you force the pattern to match the entire input string or nothing. Regular expression software libraries actually provide means of testing whether a pattern matches the entire input string, or only a subset of it, without having to explicitly supply the ^ and $ in the former case. For instance, in Java, the matcher.matches() method returns true when the pattern "ABC" is applied to the input "ABC" and false when the same pattern is applied to "XABCZ". However, the alternative matcher.find() method returns true in the latter case (finding "ABC" in "XABCZ"). In other words, a matcher.find() using "^myRegex$" is equivalent to matcher.matches() with "myRegex" alone.

The conclusion is that find-operations are more appropriate for identification, whereas full-match-operations fit validation well. So, if you are using software that drives some processing based on the regex library find() method, you are advised to frame the regular expressions used for validation purposes with explicit ^ and $, as in ^validValueRegex$. Conversely, software built over the regex matches() method may require you to extend the regular expressions used for identification purposes with .* (which matches any string, see further).

In practice

For identification purposes, in addition to the default matching-anywhere-in-input, we now have the capability to match something that begins-with, ends-with, or matches-exactly.
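The find versus full-match distinction can be tried out with a small sketch. It assumes Python's re module, where re.search plays the role of Java's matcher.find() and re.fullmatch the role of matcher.matches(); the two wrapper functions are illustrative names, not part of any library.

```python
import re

def find(regex, text):
    """Like Java's matcher.find(): succeed if the regex matches anywhere."""
    return re.search(regex, text) is not None

def matches(regex, text):
    """Like Java's matcher.matches(): succeed only if the whole input matches."""
    return re.fullmatch(regex, text) is not None

find_plain = find("ABC", "XABCZ")              # identification: the tag is present
matches_plain = matches("ABC", "XABCZ")        # validation: the extra X and Z are rejected
matches_exact = matches("ABC", "ABC")          # the whole input is exactly ABC
find_framed = find("^ABC$", "XABCZ")           # explicit ^ and $ make find behave like matches
matches_padded = matches(".*ABC.*", "XABCZ")   # .* padding makes matches behave like find

print(find_plain, matches_plain, matches_exact, find_framed, matches_padded)
```

Framing with ^ and $ or padding with .* is only needed when the library call you are given does not already behave the way your purpose requires.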
Within reverseXSL software, regular expressions immediately following a SEG (segment) or GRP (group) definition keyword are only used for identification purposes and are built over find-operations.

For validation purposes, leaving portions of the input string outside the scope of the validation pattern is obviously not a good strategy. Therefore, only regular expressions like ^MyValidationPattern$ make sense, unless you use those API functions that enforce an entire match (making ^ and $ implicit). Within reverseXSL software, regular expressions immediately following a D (data element) definition keyword are used for validation and extraction purposes (see further), and are built over full-match-operations. The use of ^ and $ as in ^myExp$ is implicit and may be omitted.

Lesson 5: Matching characters in lists

By default, a regular expression matches characters in the pattern against the same exact character in the input string, case sensitive. If we want to open up the possibility of accepting several characters at any position, we can use the following notations:
Expert note: the DOTALL mode entitles . to also match new line characters (cf. the advanced tutorial).

Regular expressions also feature backslash notations for numerous built-in character ranges, like \d for the set of digits (0 to 9), \w for any word character (A to Z, a to z, 0 to 9 and _), \s for whitespace (tab, space, CR, LF, FF), \p{ASCII} for all 7-bit ASCII (with binary representations 0 to 127), and \p{Punct} for any punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~). We shall not bother with them now. The exhaustive inventory is copied in the documentation.

Have you noted that the ^ which previously meant start-of-input is now overloaded with the meaning of negating the following character list, but only when used in [^ ]? Similarly, - which is a regular character gets a role in range notations, but only within the scope of [ ] and [^ ] .

Let us illustrate the use of character lists with a few examples:
The last two examples are worth some comments:
In practice

With cut-out purposes in mind, a character list may allow specifying all possible delimiters; for instance when
in order to separate all fields as in:
The expression [: /-] matches single characters, hence delimits an empty field in between the two // . Within reverseXSL, the segment cut mode CUT-ON-"regex" is just about using a regex to specify field delimiters. On the other hand, the use of character lists for validation purposes is obvious.
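To get a feel for cutting on a regex, here is a minimal sketch that uses Python's re.split as a stand-in. It only imitates the idea and is not how reverseXSL implements CUT-ON; the second input string is a made-up sample chosen to contain two consecutive delimiters.

```python
import re

# Literal delimiter, as in the CUT-ON-"--" example of Lesson 3.
pieces = re.split(r"--", "ABC--DEF--G-H--IJK")

# Character list as delimiter: any single ':', space, '/' or '-'.
# The sample input below is hypothetical, not taken from the tutorial.
sample = "2024/01//15 10:30-UTC"
fields = re.split(r"[: /-]", sample)

print(pieces)
print(fields)
```

Because [: /-] matches one character at a time, the two consecutive slashes in the sample delimit an empty field, which shows up as an empty string in the result.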

Lesson 6: Optional and repeating characters

So far, we are only capable of building fixed-length regular expression patterns where any position may match a fixed character, a [ ] list of characters, or . for any character. We can also hook the pattern to the ^ start of the input string, or the $ end, or both ^ $ (forcing an entire match). Obviously, we need means to vary the length of the pattern by letting characters repeat a variable number of times.
Let us consider a few samples:
The last example is worth a comment: .* can also match an empty input string whereas .+ requires at least one character as input. Note that, like the . , the + , ? and * also lose their special status when used within [ ] or [^ ] range specifications: they can only match a single + ? or * character.

In practice

We now have powerful means of validating data.
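A small validation sketch, assuming Python's re.fullmatch as the full-match operation; the helper is_valid and the field rule (one or more uppercase letters followed by an optional digit) are hypothetical and serve only as illustration.

```python
import re

def is_valid(regex, value):
    """Validate by requiring the regex to match the entire value."""
    return re.fullmatch(regex, value) is not None

# .* accepts an empty value, .+ insists on at least one character.
empty_with_star = is_valid(r".*", "")
empty_with_plus = is_valid(r".+", "")

# Hypothetical field rule: one or more uppercase letters, then an optional digit.
rule = r"[A-Z]+[0-9]?"
checks = {value: is_valid(rule, value) for value in ["ABC", "ABC7", "ABC78", "7"]}

# Inside [ ] the characters + ? * are plain characters matching themselves, once.
single_plus = is_valid(r"[+?*]", "+")
double_plus = is_valid(r"[+?*]", "++")

print(empty_with_star, empty_with_plus, checks, single_plus, double_plus)
```

ABC78 is rejected because ? allows at most one trailing digit, and 7 is rejected because + demands at least one letter.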
The identification of interchange and segment/record structures is often based on tags at certain positions.
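As an illustration of identifying records by a tag at a fixed position, here is a sketch in Python. The record layout (three characters of sequence number followed by a three-letter tag) and the sample records are assumptions invented for this example.

```python
import re

# Hypothetical layout: any 3 characters, then the tag INV starting at position 4.
INVOICE_TAG = re.compile(r"^...INV")

records = ["001INV2024-0042", "002ORD2024-0043", "INV003"]

# search() is a find-operation; the ^ anchor ties the tag to its position.
invoice_records = [record for record in records if INVOICE_TAG.search(record)]

print(invoice_records)
```

The last record does contain INV, but not at the expected position, so it is not identified as an invoice.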

Lesson 7: Consuming input characters in greedy or reluctant mode

How many matches do we have of ... in ABCDEFG ? We could indeed think of the five ABC, BCD, CDE, DEF, EFG, but the right answer is only ABC and DEF , G being left out because 3 characters are no longer available for a third match. Indeed, as matching progresses through the input string, characters matched by the pattern are 'consumed'. In the above example, the first match takes ABC, so the next attempt starts from D and takes DEF, and the next attempt starts from G and fails, with too few characters left to match the pattern of 3 . any-chars once more.

Assume now that we try to match ....? meaning 3 chars plus an optional fourth. Will we get ABC plus DEFG or ABCD plus EFG ? The answer is ABCD and then EFG , because matching is greedy by default; in other words, it takes as much as it can match as soon as possible.

Note that another formally correct solution to matching ....? over ABCDEFG is ABC plus DEF because the fourth character is optional. This last result is actually obtained by enforcing the reluctant mode; in other words, it takes as little as it can while trying to match optional and repeating elements. The default mode is greedy, and reluctant mode is invoked by adding an extra ? next to the optional and repeating indicators described in the previous lesson. The above pattern becomes ....?? which reads ... 3 any-chars plus . any-char ? optional and ? reluctant.

If we match ABCD plus EFG with ....? , and ABC plus DEF (no G!) with ....?? , how can we match ABC plus DEFG ? The reply (based on what we have seen so far) is ....??G? , try it!

The notation for reluctant matches becomes:
The reluctant modifier (an extra ?) is one of the most powerful features of regular expressions. A few examples comparing greedy and reluctant outcomes will immediately clarify the point.
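The ABCDEFG walk-through from this lesson can be replayed with a short sketch, assuming Python's re module (whose documentation calls the reluctant mode "non-greedy"). The angle-bracket input at the end is a made-up sample added to show a reluctant star.

```python
import re

text = "ABCDEFG"

three_chars = re.findall(r"...", text)      # matched characters are consumed: G is left over
greedy = re.findall(r"....?", text)         # the optional 4th char is taken whenever possible
reluctant = re.findall(r"....??", text)     # the optional 4th char is skipped whenever possible
mixed = re.findall(r"....??G?", text)       # reluctant 4th char, then a greedy optional G

# Hypothetical sample: greedy .* runs to the last >, reluctant .*? stops at the first one.
tags_greedy = re.findall(r"<.*>", "<a><b>")
tags_reluctant = re.findall(r"<.*?>", "<a><b>")

print(three_chars, greedy, reluctant, mixed)
print(tags_greedy, tags_reluctant)
```

Each findall call resumes after the characters consumed by the previous match, which is why the greedy and reluctant variants carve up the same input differently.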
In practice

Reluctant mode is useless in validating data, because validation is intrinsically about one big full match, checking that the data value entirely complies with a given pattern. There is no reason to reluctantly match less than the entire input data value. By the same token, identification purposes do not need the reluctant modifier because, in principle, if you have several identification possibilities, regular optional markers will cater for the variants: you do not need to favor one identification match over another when all are valid by definition!

However, there are activities where the reluctant mode makes a great difference and is even indispensable: data value extraction, and cut-out (segmentation). These activities require help from capturing groups, which we investigate in the second part of this tutorial.