Regular expression library

import re

Finding the first instance in the text

text = "The phone number given in the helpline is 408-999-4567"
pattern = 'phone'
re.search(pattern, text)
<re.Match object; span=(4, 9), match='phone'>

If a match is found, search returns a Match object that holds the location of the match. Note: it only reports the first instance in the text.

Span is the starting and ending index of the match. Indexing starts from zero, and the ending index is exclusive: it points one past the last matched character.

match=re.search(pattern, text)
match
<re.Match object; span=(4, 9), match='phone'>

.span() gives the span of the match, .start() gives the start index, and .end() gives the end index.

match.span()
(4, 9)
match.start()
4
match.end()
9
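
When the pattern does not occur in the text, re.search returns None instead of a Match object, so calling .span() on the result would raise an error. The following is a minimal sketch of guarding against that; the pattern "email" and the variable names are illustrative assumptions, not part of the original example.

```python
import re

text = "The phone number given in the helpline is 408-999-4567"

# re.search returns None when the pattern does not occur in the text
missing = re.search("email", text)

# Guard against None before calling .span(), .start() or .end()
found = re.search("phone", text)
location = found.span() if found else None
missing_location = missing.span() if missing else None

print(location)          # (4, 9)
print(missing_location)  # None

# The span can be used to slice the matched text out of the original string
matched_text = text[found.start():found.end()]
print(matched_text)      # phone
```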

Find all instances in the text

text1 = "My phone is a hi-tech phone. The phone is dual band, with the lastest phone-tech processor"
matches = re.findall("phone", text1)
matches
['phone', 'phone', 'phone', 'phone']
len(matches)
4
 
for match in re.finditer('phone', text1):
    print(match.span())
(3, 8)
(22, 27)
(33, 38)
(70, 75)

To get the matched text, use the .group() method. Here match still refers to the last match from the loop above.

match.group()
'phone'
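
Since re.finditer yields Match objects, the matched text and its position can be collected in one pass. This is a small sketch on a shortened version of the sentence above; the name hits is an assumption for illustration.

```python
import re

text1 = "My phone is a hi-tech phone."

# Each item yielded by finditer is a Match object, so .group(),
# .start() and .end() are all available inside the loop
hits = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text1)]

print(hits)  # [('phone', 3, 8), ('phone', 22, 27)]
```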

Identifiers in Regex

text 
'The phone number given in the helpline is 408-999-4567'

If we want to find a phone number with the pattern xxx-xxx-xxxx, we can use the digit identifier \d for each position.

re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text).group()
'408-999-4567'

Quantifiers in Regex

Instead of repeating the identifier, we can use quantifiers to do the same thing.

Character identifiers, for reference:

Character   Description                                  Example Pattern Code   Example Match
---------   ------------------------------------------   --------------------   -------------
\d          A digit                                      file_\d\d              file_25
\w          Alphanumeric (letters, digits, underscore)   \w-\w\w\w              A-b_1
\s          White space                                  a\sb\sc                a b c
\D          A non-digit                                  \D\D\D                 ABC
\W          Non-alphanumeric                             \W\W\W\W\W             *-+=)
\S          Non-whitespace                               \S\S\S\S               Yoyo

re.search(r'\d{3}-\d{3}-\d{4}', text).group()
'408-999-4567'
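
As a quick check that the quantifier form is only a shorter spelling of the repeated identifiers, the two patterns can be run side by side. This is a sketch; the names long_form and short_form are illustrative.

```python
import re

text = "The phone number given in the helpline is 408-999-4567"

# Repeated identifiers: one \d per digit
long_form = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text).group()

# Quantifiers: {n} repeats the preceding identifier exactly n times
short_form = re.search(r'\d{3}-\d{3}-\d{4}', text).group()

print(long_form == short_form)  # True
print(short_form)               # 408-999-4567
```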

Using parentheses in a regex, we can create groups within the matched data.

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern, text)
results.group()
'408-999-4567'

Each pair of parentheses in the regex pattern is a group, which can be retrieved by its number (starting from 1).

results.group(1)
'408'
results.group(2)
'999'
results.group(3)
'4567'
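
Groups can also be fetched together rather than one at a time. The sketch below uses .groups() and, going slightly beyond what the text covers, named groups via the (?P<name>...) syntax; the group names area, prefix and line are assumptions chosen for illustration.

```python
import re

text = "The phone number given in the helpline is 408-999-4567"

results = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)

# .groups() returns every captured group as a tuple
all_groups = results.groups()
print(all_groups)  # ('408', '999', '4567')

# .group() accepts several group numbers and returns a tuple of them
first, last = results.group(1, 3)
print(first, last)  # 408 4567

# Named groups (hypothetical names) make the pieces self-describing
named = re.search(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})', text)
print(named.group('area'))  # 408
parts = named.groupdict()
print(parts)  # {'area': '408', 'prefix': '999', 'line': '4567'}
```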

Or operator |

re.search(r"man|woman", "This man is a good person")
<re.Match object; span=(5, 8), match='man'>
re.search(r"man|woman", "This woman is a good person")
<re.Match object; span=(5, 10), match='woman'>
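
When both alternatives occur in the same text, re.search still returns only the leftmost match, while re.findall returns every match in order. A short sketch, with a made-up sentence:

```python
import re

pattern = r"man|woman"
sentence = "This woman met a man"

# search stops at the leftmost position where either alternative matches
first = re.search(pattern, sentence).group()
print(first)     # woman

# findall collects all non-overlapping matches, left to right
all_hits = re.findall(pattern, sentence)
print(all_hits)  # ['woman', 'man']
```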

Wildcard characters

re.findall(r".at", "The fat cat ate the peta bread and sat on the rattop and splat")
['fat', 'cat', ' at', 'sat', 'rat', 'lat']

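Every match is exactly three characters long: a single period matches one wildcard character (any character except a newline) before 'at'. That is why ' at' (from ' ate') and 'lat' (from 'splat') are matched as well.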

re.findall(r"..at", "The fat cat ate the peta bread and sat on the rattop and splat")
[' fat', ' cat', ' sat', ' rat', 'plat']
re.findall(r"\S+at", "The fat cat ate the peta bread and sat on the rattop and splat")
['fat', 'cat', 'sat', 'rat', 'splat']

Here \S+ matches one or more non-whitespace characters, so each match is a run of non-whitespace characters ending in 'at'.
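
The difference between the single wildcard and \S+ is easier to see on a shorter sentence. The sketch below uses an illustrative string and also shows that the period does not match a newline by default.

```python
import re

sentence = "The cat sat flat"

# One period: exactly one character (of any kind) before 'at'
one_char = re.findall(r".at", sentence)
print(one_char)    # ['cat', 'sat', 'lat']

# \S+ : one or more non-whitespace characters before 'at'
whole_run = re.findall(r"\S+at", sentence)
print(whole_run)   # ['cat', 'sat', 'flat']

# By default the period does not match a newline character
after_newline = re.findall(r".at", "\nat")
print(after_newline)  # []
```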

Starts with and ends with

^ : starts with, $ : ends with

re.findall(r'\d$', "This ends with a number 2")
['2']
re.findall(r'^\d', "5 is the number of choice")
['5']
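
The anchors only succeed at the very start or end of the string, and combining them checks the whole string against a pattern. A minimal sketch with illustrative inputs:

```python
import re

# ^\d matches only when the string starts with a digit
starts = re.findall(r'^\d', "5 is the number of choice")
not_at_start = re.findall(r'^\d', "The number of choice is 5")
print(starts, not_at_start)  # ['5'] []

# \d$ matches only when the string ends with a digit
ends = re.findall(r'\d$', "The number of choice is 5")
print(ends)                  # ['5']

# Using both anchors requires the entire string to fit the pattern
exact = re.search(r'^\d{3}-\d{3}-\d{4}$', "408-999-4567") is not None
padded = re.search(r'^\d{3}-\d{3}-\d{4}$', "call 408-999-4567") is not None
print(exact, padded)         # True False
```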

Exclusion

Square brackets with a leading caret, [^...], are used to exclude characters.

phrase = "there are 3 numbers 34 insides 5 this sentence."
re.findall(r'[^\d]+', phrase)
['there are ', ' numbers ', ' insides ', ' this sentence.']
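
For digits specifically, the exclusion [^\d] is equivalent to the identifier \D, and the complementary pattern \d+ picks out the numbers that were skipped. A small sketch using the same phrase:

```python
import re

phrase = "there are 3 numbers 34 insides 5 this sentence."

# [^\d] and \D both mean "any character that is not a digit"
excluded = re.findall(r'[^\d]+', phrase)
non_digits = re.findall(r'\D+', phrase)
print(excluded == non_digits)  # True

# The complement: runs of digits only
digits = re.findall(r'\d+', phrase)
print(digits)                  # ['3', '34', '5']
```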

Removing the punctuation

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
test_phrase
'This is a string! But it has punctuation. How can we remove it?'
re.findall(r'[^!.? ]+', test_phrase)
['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

Putting it together

clean = ' '.join(re.findall(r'[^!.? ]+', test_phrase))
clean
'This is a string But it has punctuation How can we remove it'
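
An alternative to finding the allowed pieces and joining them is to delete the unwanted characters directly with re.sub. This is a sketch of that approach, not the method used above; for this particular phrase it gives the same result.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Approach from the text: keep runs of allowed characters, then join them
clean_join = ' '.join(re.findall(r'[^!.? ]+', test_phrase))

# Alternative: replace each punctuation character with an empty string
clean_sub = re.sub(r'[!.?]', '', test_phrase)

print(clean_sub)                # This is a string But it has punctuation How can we remove it
print(clean_sub == clean_join)  # True
```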
 
text3 = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
text3
'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text3)
['hyphen-words', 'long-ish']

Note: Difference between [] and ()

The [] construct in a regex is essentially shorthand for an | on all of the contents. For example [abc] matches a, b or c. Additionally the - character has special meaning inside of a []. It provides a range construct. The regex [a-z] will match any letter a through z.

The () construct is a grouping construct establishing a precedence order (it also has impact on accessing matched substrings but that's a bit more of an advanced topic). The regex (abc) will match the string "abc".
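
The contrast between the two constructs can be sketched on a few short strings (chosen here purely for illustration). Note that when a pattern contains a group, re.findall returns the text captured by the group.

```python
import re

# [abc] matches any ONE of the characters a, b or c
any_one = re.findall(r'[abc]', "cab")
print(any_one)    # ['c', 'a', 'b']

# (abc) matches the exact sequence "abc"
sequence = re.findall(r'(abc)', "cab abc")
print(sequence)   # ['abc']

# Inside [], a dash defines a range: [a-c] is the same set as [abc]
ranged = re.findall(r'[a-c]+', "cab abc xyz")
print(ranged)     # ['cab', 'abc']
```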

 
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"
re.search(r'cat(fish|nap|claw)',text).group()
'catfish'
re.search(r'cat(fish|nap|claw)',texttwo).group()
'catnap'
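re.search(r'cat(fish|nap|claw)',textthree)

The last search displays nothing because it returns None: 'caterpillar' contains 'cat', but none of the alternatives fish, nap, or claw follows it.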
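
Because the third search finds nothing, calling .group() on its result directly would fail. A minimal sketch of running the same pattern over all three sentences safely; the names sentences and found are assumptions for illustration.

```python
import re

pattern = re.compile(r'cat(fish|nap|claw)')

sentences = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

found = []
for sentence in sentences:
    m = pattern.search(sentence)
    # m is None when no alternative matches, so check before using it
    found.append(m.group() if m else None)

print(found)  # ['catfish', 'catnap', None]

# group(1) holds only the alternative that matched inside the parentheses
suffix = pattern.search(sentences[0]).group(1)
print(suffix)  # fish
```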
Quantifiers, for reference:

Character   Description                  Example Pattern Code   Example Match
---------   --------------------------   --------------------   --------------
+           Occurs one or more times     Version \w-\w+         Version A-b1_1
{3}         Occurs exactly 3 times       \D{3}                  abc
{2,4}       Occurs 2 to 4 times          \d{2,4}                123
{3,}        Occurs 3 or more times       \w{3,}                 anycharacters
*           Occurs zero or more times    A*B*C*                 AAACC
?           Occurs once or not at all    plurals?               plural
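
Each row of the quantifier table can be tried out directly. The sketch below runs the table's example patterns against short test strings; the test strings other than the table's own example matches are assumptions for illustration.

```python
import re

# + : one or more (the underscore counts as a \w character)
plus = re.search(r'Version \w-\w+', 'Version A-b1_1').group()
print(plus)          # Version A-b1_1

# {3} : exactly three non-digits
exactly_three = re.fullmatch(r'\D{3}', 'abc') is not None
print(exactly_three)  # True

# {2,4} : between two and four digits, as many as possible
two_to_four = re.findall(r'\d{2,4}', '1 12 123 12345')
print(two_to_four)   # ['12', '123', '1234']

# {3,} : three or more word characters
three_plus = re.findall(r'\w{3,}', 'a ab abc abcd')
print(three_plus)    # ['abc', 'abcd']

# * : zero or more, so the missing B is allowed
star = re.fullmatch(r'A*B*C*', 'AAACC') is not None
print(star)          # True

# ? : the preceding character is optional
optional = re.findall(r'plurals?', 'plural plurals')
print(optional)      # ['plural', 'plurals']
```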