April 8, 2020
REGULAR EXPRESSIONS
Regular Expressions are special sequences of characters that describe a pattern of text that is to be matched.
Alternation uses the pipe symbol, |, and allows us to match the text preceding or following the |.
Character Sets are denoted by a pair of brackets [ ], and let us match one character from a series of characters.
Wildcards are represented by the period or dot, ., and match any single character (letter, number, symbol, or whitespace). In most regex flavors the dot does not match a newline by default.
Ranges, such as [a-z] or [0-9], allow us to specify a span of consecutive characters inside a character set.
Shorthand Character Classes, like \w, \d, and \s, stand for word characters, digit characters, and whitespace characters, respectively.
Groupings are denoted with parentheses, ( ), and group parts of a regular expression together. They allow us to limit alternation to part of a regex.
Fixed Quantifiers are represented by curly braces, { }, and let us indicate an exact number, or a range, of times the preceding character should appear.
Optional Quantifiers are indicated by the question mark, ?, and allow us to mark a character in a regex as optional, meaning it can appear either 0 or 1 time.
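A minimal sketch of the building blocks so far, using Python's built-in re module. The sample patterns and the matches helper are illustrative choices, not part of the original notes.

```python
import re

# Alternation: the whole pattern is "cat" OR "dog".
alternation = re.compile(r"cat|dog")
# Character set: exactly one of b, c, or r, followed by "at".
char_set = re.compile(r"[bcr]at")
# Wildcard: any single character between "c" and "t" (not a newline by default).
wildcard = re.compile(r"c.t")
# Ranges: one letter from a to f, then one digit from 0 to 9.
ranged = re.compile(r"[a-f][0-9]")
# Shorthand classes: a word character, a digit, then a whitespace character.
shorthand = re.compile(r"\w\d\s")
# Grouping: the alternation applies only inside the parentheses.
grouped = re.compile(r"I love (cats|dogs)")


def matches(pattern, text):
    """Hypothetical helper: True when the entire text matches the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(alternation, "dog"))       # True
print(matches(char_set, "hat"))          # False, "h" is not in the set
print(matches(wildcard, "c9t"))          # True
print(matches(ranged, "g7"))             # False, "g" is outside a-f
print(matches(shorthand, "a1 "))         # True
print(matches(grouped, "I love dogs"))   # True
print(matches(grouped, "dogs"))          # False, "I love " is required
```

Without the parentheses, r"I love cats|dogs" would match either "I love cats" or just "dogs", which is why grouping matters for alternation.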
Kleene Star is denoted by an asterisk, *, and is a quantifier that matches the preceding character 0 or more times.
Kleene Plus is denoted by the plus sign, +, and matches the preceding character 1 or more times.
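The quantifiers above can be compared side by side. This is a small sketch with Python's re module; the example words and the full_match helper are assumptions chosen for illustration.

```python
import re


def full_match(pattern, text):
    """Hypothetical helper: True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Fixed quantifier, exact count: exactly three digits.
print(full_match(r"\d{3}", "123"))        # True
print(full_match(r"\d{3}", "12"))         # False

# Fixed quantifier, range: between two and four "o" characters.
print(full_match(r"ro{2,4}ar", "rooar"))  # True
print(full_match(r"ro{2,4}ar", "roar"))   # False, only one "o"

# Optional quantifier: the "u" may appear 0 or 1 time.
print(full_match(r"colou?r", "color"))    # True
print(full_match(r"colou?r", "colour"))   # True

# Kleene star: "e" may appear 0 or more times.
print(full_match(r"me*ow", "mow"))        # True
print(full_match(r"me*ow", "meeeow"))     # True

# Kleene plus: "e" must appear at least once.
print(full_match(r"me+ow", "mow"))        # False
print(full_match(r"me+ow", "meow"))       # True
```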
The Anchor symbols, hat (^) and dollar sign ($), are used to match text at the start and end of a string, respectively.
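A short sketch of anchors in Python's re module. It uses re.search, which scans the whole string, so the anchors are what pin the match to the start or end. The sample sentences are made up for illustration.

```python
import re

# ^ pins the match to the start of the string.
starts_with_monkeys = re.compile(r"^Monkeys")
# $ pins the match to the end of the string.
ends_with_bananas = re.compile(r"bananas$")
# Both anchors together require the whole string to fit the pattern.
whole_sentence = re.compile(r"^Monkeys like to eat bananas$")

print(bool(starts_with_monkeys.search("Monkeys like to eat bananas")))  # True
print(bool(starts_with_monkeys.search("Spider Monkeys are small")))     # False
print(bool(ends_with_bananas.search("Monkeys like to eat bananas")))    # True
print(bool(ends_with_bananas.search("bananas are yellow")))             # False
print(bool(whole_sentence.search("Monkeys like to eat bananas a lot"))) # False
```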
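Glob can find multiple files by matching their names against a pattern. It uses shell-style wildcards (*, ?, and [ ]) rather than full regular expressions, so * here means "any run of characters", not "0 or more of the preceding character".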
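import glob
import pandas as pd

# Collect every CSV file whose name starts with "file".
files = glob.glob("file*.csv")

# Read each file into a DataFrame, then stack them into one.
df_list = []
for filename in files:
    data = pd.read_csv(filename)
    df_list.append(data)
df = pd.concat(df_list)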
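Glob patterns follow shell wildcard rules, which differ from regex syntax. The sketch below contrasts the two, using the standard library's fnmatch module (which applies the same shell-style wildcard matching) and re. The file name "file1.csv" is an assumed example.

```python
import fnmatch
import re

name = "file1.csv"  # assumed example file name

# As a shell-style pattern, * matches any run of characters.
glob_style = fnmatch.fnmatchcase(name, "file*.csv")

# Read as a regex, the same text means "fil", zero or more "e",
# any one character, then "csv", so it does not match "file1.csv".
same_text_as_regex = re.fullmatch(r"file*.csv", name) is not None

# A regex with the intended meaning spells out ".*" and escapes the dot.
equivalent_regex = re.fullmatch(r"file.*\.csv", name) is not None

print(glob_style)          # True
print(same_text_as_regex)  # False
print(equivalent_regex)    # True
```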