Many tutorials I found while trying to learn Regular Expressions were too advanced, too theoretical, and not practical enough to be of much help to me. Perhaps I just didn’t want to learn badly enough, and quite possibly the tutorials I read were great and I just wasn’t ready to become grasshopper to the Master Ninja RegEx Masters. In any case, I am writing this tutorial fully aware that there are more advanced, more technical, and perhaps more accurate tutorials available.
Technical accuracy is important, but technically advanced examples are not the goal here; the goal is approachable Regular Expressions.
My focus will be on PHP, as that is the language I use to implement my regular expressions. In particular, we’ll be using PCRE (Perl Compatible Regular Expressions), which PHP exposes through its preg_ functions.
Let’s cover some basics, and we’re going to be task oriented.
I want to say, “here’s how you accomplish this task”, not “hey, look at the cool things I can do.”
Task:
Capture words between two markers.
Solution:
This is one of the most basic uses of Regular Expressions, and one of the easiest to start with.
Let’s start with some example text:
“this is my text, we want to capture text, between commas”
We know we want to match on a comma, so our regular expression starts with one:
,
That’s great and all, but we can do this using strpos!
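For a single literal character, the plain string functions really are enough. Here is a minimal sketch, assuming the sample text above (the variable names are only for illustration), showing that strpos and the one-character pattern find the same comma:

```php
<?php
// Sample text from the example above.
$text = "this is my text, we want to capture text, between commas";

// strpos returns the zero-based offset of the first comma (or false if none).
$offset = strpos($text, ',');

// preg_match with PREG_OFFSET_CAPTURE stores [matched text, offset] in $m[0].
preg_match('/,/', $text, $m, PREG_OFFSET_CAPTURE);
$regexOffset = $m[0][1];

// Both approaches point at the same position in the string.
var_dump($offset === $regexOffset);
```

So far the regular expression buys us nothing over strpos; the difference only shows up once special characters come into play.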
Now comes the power of Regular Expressions, and the meaning of special characters.
The lowly period (.) is a powerful tool. It means “Match a single character that is not a line break.”
This is exactly what we want to do, because we don’t care what is between our commas, we just want to capture it.
Regular Expression:
,.
I can hear you now, “But we want to match all characters between commas, and the period only matches a single character!”
How right you are.
Remember DOS? Know what wild cards are? Don’t get confused, because this is not one of those.
Introducing the asterisk (*).
The asterisk is one of several repetition characters which we’ll cover in more detail later.
The asterisk means, “Match the preceding token zero or more times.”
So, the period means match any character, and the asterisk means zero or more times, so…
Regular Expression:
,.*
There. That matches the comma, and all characters after it.
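To try this out, here is a small sketch that runs the expression against the sample text with preg_match (the variable names are my own, chosen for illustration):

```php
<?php
$text = "this is my text, we want to capture text, between commas";

// The comma matches literally, then .* consumes every remaining
// character on the line (the dot only stops at a line break).
preg_match('/,.*/', $text, $m);

// $m[0] holds the full text matched by the expression.
echo $m[0], "\n"; // , we want to capture text, between commas
```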
But wait, we need to stop capturing when we hit another comma.
Regular Expression:
,.*,
There. That does it.
But wait, what if our text was actually:
“this is my text, we want to, capture text, between commas”
The Regular Expression ,.*, doesn’t cut it. It doesn’t match the different groups delimited by commas.
We would expect to get:
“ we want to”
“ capture text”
What we actually get is:
“, we want to, capture text,”
This is because the asterisk is greedy: it matches as many characters as it possibly can. Yes, greedy. Before you know it, it will be out gambling and come home drunk.
Here is how the Regular Expression engine works on the Regular Expression we have come up with so far.
“I’m looking for a comma, ok I found one. Now I’m looking for any character that is not a line break, ok I found one, now what? Oh, repeat.”
And at this point the engine just keeps on going even though it hits another comma. It does this because the period means “Match ANY character that is not a line break”, and a comma qualifies. Only when the engine reaches the end of the string does the dot fail, because there are no more characters to find. At that point it starts backtracking, trying to find a comma (the next part of our Regular Expression).
And so the engine continues, “Ok, I reached the end of the string. Now I’m going to go backwards until I find a comma.”
Which finds the last comma in the string. Obviously NOT what we wanted.
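The greedy behavior is easy to reproduce. This is a minimal sketch using preg_match on the sample text, with illustrative variable names:

```php
<?php
$text = "this is my text, we want to, capture text, between commas";

// Greedy: .* first runs to the end of the string, then the engine
// backtracks until the final comma in the pattern can match.
// That ends up being the LAST comma in the text.
preg_match('/,.*,/', $text, $m);

echo $m[0], "\n"; // , we want to, capture text,
```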
Fortunately Regular Expressions allow us to change the greediness to laziness.
We do this by adding a question mark. The question mark is an overloaded character in Regular Expressions, and it’s important to realize it may be used many times in a Regular Expression and mean totally different things depending on where it is used.
Using a question mark after the asterisk makes the asterisk, and therefore the dot it repeats, lazy. And just like it sounds, laziness can be slower, because the engine checks the rest of the expression after every single character.
The question mark means, “Repeat the dot as few times as possible.”
So let me describe how this works, and although internally the engine doesn’t work exactly like this, the effect is the same.
After each match of the dot, the engine tries to match the next part of the Regular Expression. If that fails, it goes back to the previous token. In this case, if it fails to match the final comma, it goes back to the dot, matches one more character, and tries the comma again.
So here is our Regular Expression so far:
,.*?,
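Here is the lazy version as a short sketch against the same sample text (the variable names are illustrative only):

```php
<?php
$text = "this is my text, we want to, capture text, between commas";

// Lazy: .*? grows one character at a time and stops as soon as
// the comma that follows it in the pattern can match.
preg_match('/,.*?,/', $text, $m);

echo $m[0], "\n"; // , we want to,
```

The only change from the greedy expression is the question mark, yet the match now ends at the first closing comma instead of the last one.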
But there is still a problem.
If I want to capture all text between commas, this will still not work.
Consider this sample input text:
“this is my text, we want to, capture text, between commas,more text”
We will only match:
“, we want to,”
“, between commas,”
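The text between the two matches is ignored even though it is between commas. This happens because each match consumes its closing comma, so that comma is no longer available as the opening comma of the next match.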
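To see every match rather than just the first one, we can use preg_match_all. This sketch assumes the sample input above and uses variable names of my own choosing:

```php
<?php
$text = "this is my text, we want to, capture text, between commas,more text";

// preg_match_all collects every non-overlapping match; $all[0] is the
// list of full matches. Each match swallows its closing comma, so
// " capture text" is skipped: its opening comma is already used up.
$count = preg_match_all('/,.*?,/', $text, $all);

var_dump($count);   // int(2)
print_r($all[0]);   // ", we want to," and ", between commas,"
```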
For this we’ll need to Look Behind.
Look arounds are somewhat complicated and will be explored in more depth in a future article, but we can use a simple example here.
A look behind makes the Regular Expression engine step backwards temporarily to check whether the text inside the look behind expression matches, without making that text part of the match.
Here is how the look behind for our example is expressed:
(?<=,)
Remember when I said the question mark was used for different things? Now you know. Here the question mark only marks a special construct, and has nothing to do with greediness or laziness.
The < means look behind, and the character after it says what to require: = means the text must match, ! means it must not match.
So, now our Regular Expression looks like this:
(?<=,).*?,
It’s beginning to look weird and unless you knew what each piece did, you would probably have no clue. Welcome to the elite! You now know a little about Regular Expressions!
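As a quick check, here is a sketch that runs the look behind version with preg_match_all on the earlier sample input (the variable names are illustrative):

```php
<?php
$text = "this is my text, we want to, capture text, between commas,more text";

// (?<=,) only checks that a comma sits just before the match start;
// the comma itself is not consumed, so the closing comma of one match
// can serve as the look behind for the next.
$count = preg_match_all('/(?<=,).*?,/', $text, $all);

var_dump($count);   // int(3)
print_r($all[0]);   // " we want to,", " capture text,", " between commas,"
```

The trailing "more text" is not matched because no comma follows it.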
There’s still a problem.
What if our data has line break characters in it? What if we want to match across line breaks? PHP’s Perl Compatible Regular Expression Engine can handle this. This is not really part of the Regular Expression; it is a matching mode the Regular Expression engine is placed into.
For this, we use the /s mode, which is the single line mode. In single line mode, the dot also matches line break characters.
To use this expression in PHP you would need to format it like:
'/(?<=,).*?,/s'
The “/” characters delimit the regular expression, i.e. they are the characters between which the regular expression is placed.
Following the ending slash / we place an s to set the matching mode for the Perl Compatible Regular Expression Engine to single line.
There are several other modes that can be used. For example, for matching where we don’t care about upper vs lower case, you can use the letter “i” to signify case Insensitivity. (More on this later.)
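The effect of a mode is easiest to see by running the same expression with and without it. This is a minimal sketch on a short made-up string (the sample text and variable names are my own assumptions, not part of the example above):

```php
<?php
// A short string with a line break between the two commas.
$text = "one,two\nthree,four";

// Without /s the dot refuses to cross the line break, so nothing matches.
$without = preg_match_all('/(?<=,).*?,/', $text, $a);

// With /s the dot matches the line break too.
$with = preg_match_all('/(?<=,).*?,/s', $text, $b);

var_dump($without); // int(0)
var_dump($with);    // int(1)

// The i mode ignores case: preg_match returns 1 on a match, 0 otherwise.
$sensitive   = preg_match('/TEXT/', 'my text');
$insensitive = preg_match('/TEXT/i', 'my text');

var_dump($sensitive);   // int(0)
var_dump($insensitive); // int(1)
```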
So, now this works.
With this input data:
“this is my text, we want to, capture text, between commas,more text
,this is test 2, and more ,and
more, and, yet again, more.”
Our matches are (two of them contain a line break, because the dot now matches line breaks):
“ we want to,”
“ capture text,”
“ between commas,”
“more text
,”
“this is test 2,”
“ and more ,”
“and
more,”
“ and,”
“ yet again,”
Still, there is a problem.
We’re not actually capturing the output between the commas.
Allow me to introduce you to the concept of capturing groups.
Placing parentheses around part of a Regular Expression creates a capturing group. The text matched by the expression inside the parentheses will be accessible in the array that PHP fills in for us.
Don’t confuse the parentheses around the look behind expression with capturing groups; they are different. The look behind expression doesn’t actually return any text, it just restricts where a match is allowed to start.
If all we want to capture is the text between the commas, we can use parens around this part of the expression.
So, now the most up to date Regular Expression:
(?<=,)(.*?),
When executed in PHP, like this:
<?php
$data = "this is my text, we want to, capture text, between commas,more text
,this is test 2, and more ,and
more, and, yet again, more.";
preg_match_all( '/(?<=,)(.*?),/s', $data, $groups );
?>
Note that we call preg_match_all rather than preg_match, because preg_match stops after the first match.
$groups[0] contains the entire matches
$groups[1] contains the matches from the first capturing group
Each of these is an array, so:
$groups[0][0] contains “ we want to,”
$groups[1][0] contains “ we want to”
And so on.
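To confirm that the look behind parentheses do not create a capturing group, here is a small sketch on a deliberately short, made-up string (the string and variable names are assumptions for illustration):

```php
<?php
$text = "a,b,c";

// Two pairs of parentheses, but only (.*?) is a capturing group.
preg_match('/(?<=,)(.*?),/', $text, $m);

// $m has exactly two entries: the full match and group 1.
var_dump(count($m)); // int(2)
var_dump($m[0]);     // string(2) "b,"
var_dump($m[1]);     // string(1) "b"
```

If the look behind counted as a group, the array would have a third entry.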
The complete script:
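<?php
$data = "this is my text, we want to, capture text, between commas,more text
,this is test 2, and more ,and
more, and, yet again, more.";
// preg_match_all finds every match; preg_match would stop after the first.
preg_match_all( '/(?<=,)(.*?),/s', $data, $groups );
// $groups[1] holds the text captured by the first capturing group.
var_dump( $groups[1] );
for ( $ix = 0; $ix < count( $groups[1] ); $ix++ )
{
    var_dump( $groups[1][$ix] );
}
?>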
