April 10, 2020
DATA CLEANING WITH PANDAS
Dealing with Duplicates
To check for duplicates, we can use the Pandas method .duplicated(). It returns a Boolean Series that is True for every row that repeats an earlier row.
df.duplicated()
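As a rough illustration of what .duplicated() gives back, here is a minimal sketch. The fruits DataFrame and its values are made up for the example; only the .duplicated() call itself comes from the text above.

```python
import pandas as pd

# Hypothetical sample data: the second and third rows are identical.
fruits = pd.DataFrame({
    "item": ["apple", "banana", "banana", "apple"],
    "price": [1.0, 0.5, 0.5, 1.2],
})

# True marks a row that repeats an earlier row in every column.
dup_mask = fruits.duplicated()

# Summing the Boolean mask counts the duplicate rows.
n_duplicates = int(dup_mask.sum())

# Boolean indexing shows the duplicate rows themselves.
duplicate_rows = fruits[dup_mask]
print(duplicate_rows)
```

Note that the two apple rows are not flagged, because their prices differ and .duplicated() compares whole rows unless told otherwise.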
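We can use Pandas .drop_duplicates() to remove all rows that are duplicates of another row. It returns a new DataFrame rather than changing df in place, so assign the result if you want to keep it.
df = df.drop_duplicates()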
If we want to remove every row with a duplicate value in the item column, we can specify a subset.
fruits = fruits.drop_duplicates(subset=['item'])
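By default, this keeps the first occurrence of each duplicate and drops the rest. Pass keep='last' to keep the last occurrence instead, or keep=False to drop every copy.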
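To see how the choice of which occurrence to keep plays out, here is a small sketch using the keep parameter of .drop_duplicates(). The sample fruits data is invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: "banana" and "apple" each appear twice.
fruits = pd.DataFrame({
    "item": ["apple", "banana", "banana", "apple"],
    "price": [1.0, 0.5, 0.6, 1.2],
})

# Default behaviour: keep the first row seen for each item.
first_kept = fruits.drop_duplicates(subset=["item"])

# keep="last" keeps the last row seen for each item instead.
last_kept = fruits.drop_duplicates(subset=["item"], keep="last")

# keep=False drops every row whose item appears more than once.
none_kept = fruits.drop_duplicates(subset=["item"], keep=False)
```

The original row labels are preserved in each result, so you may want to call .reset_index(drop=True) afterwards if you need a clean 0, 1, 2, ... index.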
Splitting By Index
If a column holds dates as strings in the format MMDDYYYY, you can create month, day, and year columns by slicing each string by index.
# Create a month column
df['month'] = df.birthday.str[0:2]
# Create a day column
df['day'] = df.birthday.str[2:4]
# Create a year column
df['year'] = df.birthday.str[4:]
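Here is a self-contained sketch of the slicing approach. The birthday values are invented, and the astype(str) and zfill(8) step is an added precaution on my part, in case the column was read in as numbers and lost a leading zero. The pd.to_datetime line at the end is an alternative to slicing, not something the steps above require.

```python
import pandas as pd

# Hypothetical sample data in MMDDYYYY format.
df = pd.DataFrame({"birthday": ["04102020", "12251999"]})

# .str slicing needs strings; cast and pad back to 8 characters
# in case a leading zero was dropped when the data was loaded.
df["birthday"] = df["birthday"].astype(str).str.zfill(8)

# Slice each string by position.
df["month"] = df["birthday"].str[0:2]
df["day"] = df["birthday"].str[2:4]
df["year"] = df["birthday"].str[4:]

# Alternative: parse into real dates, which also validates the values.
parsed = pd.to_datetime(df["birthday"], format="%m%d%Y")
```

The sliced columns are still strings. Convert them with .astype(int) if you need to do arithmetic on them.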
Suppose you have a column called 'type' that contains values like "admin_US" or "user_Kenya".
You can split these values on the underscore '_'.
# Create the 'str_split' column (each value becomes a list of pieces)
df['str_split'] = df.type.str.split('_')
# Create the 'user_type' column
df['user_type'] = df.str_split.str.get(0)
# Create the 'country' column
df['country'] = df.str_split.str.get(1)
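If you do not need the intermediate list column, the same result can be reached in one step with the expand option of .str.split(). This is a sketch of one possible shortcut rather than the only way to do it. The sample values are the ones from the text, and n=1 is an added assumption so that only the first underscore is used as the separator.

```python
import pandas as pd

# Sample values taken from the examples in the text.
df = pd.DataFrame({"type": ["admin_US", "user_Kenya"]})

# expand=True returns one column per piece instead of a list;
# n=1 splits on the first underscore only.
df[["user_type", "country"]] = df["type"].str.split("_", n=1, expand=True)

print(df)
```

Using df["type"] with brackets instead of df.type is a small safety habit: bracket access always refers to the column, even when a column name happens to match a DataFrame attribute.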