Madhuka: Regular Expressions with Python

Regular expressions are a powerful language for matching text patterns. This post explains regular expressions using Python, whose "re" module provides regular expression support.

match = re.search(pattern, string)

search() returns a match object if the search succeeds and None otherwise.

import re

string1 = 'Regular expressions are a powerful language for matching text patterns'
match = re.search(r'matching \w\w\w\w', string1)
# The if-statement after search() tests whether it succeeded
if match:
    print('found', match.group())  # found matching text
else:
    print('did not find')

The code above searches for the pattern 'matching ' followed by four word characters.

Regular Expressions

a, X, 9, < -- ordinary characters just match themselves exactly

\b -- boundary between word and non-word

. (a period) -- matches any single character except newline '\n'

\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]

\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]

\S -- (uppercase S) matches any non-whitespace character

\d -- decimal digit [0-9]

^ = start, $ = end -- match the start or end of the string

\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash

import re

match = re.search(r'\d\d\d', 'No 505/3')
print(match.group())  # 505
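The other patterns in the list can be tried the same way. The following is a small sketch; the sample sentence and the variable names are made up for illustration and are not from the original post.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns above.
text = 'Order 66 shipped to 12 Main St.'

# \d\d matches the first pair of adjacent digits.
first_number = re.search(r'\d\d', text).group()

# \b anchors the match at word boundaries, so 'Main' is found as a whole word.
whole_word = re.search(r'\bMain\b', text).group()

# \s matches one whitespace character; \S+ matches the non-whitespace run after it.
after_space = re.search(r'\s\S+', text).group()

# ^ anchors at the start of the string, $ at the end; \. is a literal period.
starts_with_order = re.search(r'^Order', text) is not None
ends_with_period = re.search(r'\.$', text) is not None

print(first_number, whole_word, repr(after_space))
print(starts_with_order, ends_with_period)
```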

Repetition

+ -- 1 or more occurrences of the pattern to its left

* -- 0 or more occurrences of the pattern to its left

? -- match 0 or 1 occurrences of the pattern to its left

import re

match = re.search(r'pa+', 'occurrences of the pattern')
print(match.group())  # pa

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible. Therefore '+' and '*' are said to be "greedy".
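The two rules, leftmost first and then greedy, can be seen in a short sketch. The strings 'pig piiig' and 'piiig' are illustrative inputs chosen here, not taken from the post.

```python
import re

# Leftmost: the search stops at the first position where the pattern can match,
# so the single 'i' in 'pig' wins over the longer run in 'piiig'.
leftmost = re.search(r'i+', 'pig piiig').group()

# Greedy: at the position where it matches, + takes as many characters as it can.
greedy_plus = re.search(r'i+', 'piiig').group()

# * allows zero occurrences, so it succeeds with an empty match at position 0
# instead of moving on to the 'i' characters.
greedy_star = re.search(r'i*', 'piiig').group()

print(repr(leftmost), repr(greedy_plus), repr(greedy_star))
```

The last case is a common surprise: a pattern that can match nothing will happily match nothing at the very start of the string.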

import re

text = 'my email madhuka@gmail.com and my friend'
# \. matches a literal period; an unescaped . would match any character
match = re.search(r'\w*@\w+\.\w*', text)
if match:
    print(match.group())  # madhuka@gmail.com

Group feature

The "group" feature of a regular expression allows you to pick out parts of the matching text.

import re

text = 'my email madhuka@gmail.com and my friend'
# Parentheses mark the groups: (1) the user name, (2) the host
match = re.search(r'(\w*)@(\w+\.\w*)', text)
if match:
    print(match.group())   # madhuka@gmail.com (the whole match)
    print(match.group(1))  # madhuka
    print(match.group(2))  # gmail.com
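When a pattern has several groups, match.groups() returns all of them at once as a tuple, which unpacks neatly into named variables. A minimal sketch follows; the character class [\w.-]+ and the names user and host are choices made for this example.

```python
import re

text = 'my email madhuka@gmail.com and my friend'
# [\w.-]+ allows letters, digits, underscores, dots and dashes on both sides of @.
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    # groups() returns every parenthesized group as a tuple.
    user, host = match.groups()
    print(user)
    print(host)
```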

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings.

import re

# We have a text with many email addresses
text = 'James madhuka@google.com, foo bar dilan@abc.com blah jamesw@wso2.com'
# findall() returns ['madhuka@google.com', 'dilan@abc.com', 'jamesw@wso2.com']
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
for email in emails:
    print(email)
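findall() also works together with groups: if the pattern contains two or more parenthesized groups, findall() returns a list of tuples, one tuple per match, instead of a list of strings. A short sketch using the same sample text (the names pairs, user and host are illustrative):

```python
import re

text = 'James madhuka@google.com, foo bar dilan@abc.com blah jamesw@wso2.com'
# Two groups in the pattern, so each match becomes a (user, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
for user, host in pairs:
    print(user, host)
```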

File

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you: feed the whole file text into findall() and it returns all the matches in a single list.

import re

# Open the file; the with-statement closes it when the block ends
with open('D:/Research/python/test.txt', 'r') as f:
    strings = re.findall(r'pyth..', f.read())

for string in strings:
    print(string)
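To try this without a file of your own, the sketch below writes a small temporary file first and then runs the same findall() call on it. The file contents are invented for the example, and the temporary file stands in for test.txt.

```python
import os
import re
import tempfile

# Create a throwaway file so the example is self-contained (contents are made up).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('python is fun\npythonic code\nmonty pyth\n')
    path = tmp.name

try:
    # Read the whole file as one string and let findall() scan all of it.
    with open(path, 'r') as f:
        strings = re.findall(r'pyth..', f.read())
finally:
    os.remove(path)

print(strings)
```

The trailing 'pyth' on the last line is not returned, because '.' does not match the newline that follows it.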

Madhuka

Thursday, October 9, 2014
