Muller Unlimited

April 10, 2020

Case Muller

DATA CLEANING WITH PANDAS

Dealing with Duplicates

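To check for duplicates, we can use the pandas method .duplicated(). It returns a Boolean Series that is True for every row that repeats an earlier row.
    df.duplicated()
We can use .drop_duplicates() to remove all rows that are exact duplicates of another row. It returns a new DataFrame and leaves the original unchanged.
    df.drop_duplicates()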
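As a minimal sketch of the two calls above, assuming a small made-up DataFrame (the column names and values here are hypothetical, not from any real dataset):

```python
import pandas as pd

# Hypothetical sample data; rows 0 and 2 are identical.
df = pd.DataFrame({
    "item": ["apple", "banana", "apple", "cherry"],
    "price": [1.0, 0.5, 1.0, 2.0],
})

# Boolean Series: True for each row that repeats an earlier row.
dup_mask = df.duplicated()

# Summing the mask counts the duplicate rows.
n_dupes = int(dup_mask.sum())

# New DataFrame without the repeated rows; df itself is not modified.
deduped = df.drop_duplicates()

print(dup_mask.tolist())
print(n_dupes, len(df), len(deduped))
```

Counting the duplicates before dropping them is a cheap way to see how much data the cleanup is about to remove.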
If we want to treat rows as duplicates whenever they share a value in the item column, regardless of what the other columns contain, we can specify a subset.
    fruits = fruits.drop_duplicates(subset=['item'])
By default, this keeps the first occurrence of each duplicated value and drops the later ones.
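The keep parameter of drop_duplicates controls which occurrence survives. The sketch below uses a hypothetical fruits DataFrame to compare the default with the two other options, keep='last' and keep=False:

```python
import pandas as pd

# Hypothetical data: "apple" appears three times with different prices.
fruits = pd.DataFrame({
    "item": ["apple", "apple", "banana", "apple"],
    "price": [1.0, 1.2, 0.5, 0.9],
})

# Default (keep="first"): the first row for each item survives.
first_kept = fruits.drop_duplicates(subset=["item"])

# keep="last": the last row for each item survives.
last_kept = fruits.drop_duplicates(subset=["item"], keep="last")

# keep=False: every row whose item appears more than once is dropped.
none_kept = fruits.drop_duplicates(subset=["item"], keep=False)

print(first_kept)
print(last_kept)
print(none_kept)
```

Which option is right depends on the data. If later rows are corrections of earlier ones, keep='last' is usually the safer choice.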

Splitting By Index

If a column stores dates as fixed-width strings in the form MMDDYYYY, you can create separate columns by slicing each string by index.
    # Create a month column
    df['month'] = df.birthday.str[0:2]
    # Create a day column
    df['day'] = df.birthday.str[2:4]
    # Create a year column
    df['year'] = df.birthday.str[4:]
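Slicing by index only works if every value is a string of the same width. Here is a sketch of the full round trip, assuming (hypothetically) that the birthday column was read in as integers, so the leading zero of the month was lost and has to be restored first:

```python
import pandas as pd

# Hypothetical data: MMDDYYYY values read as integers, so 04101990 became 4101990.
df = pd.DataFrame({"birthday": [4101990, 12251985]})

# Convert to strings and pad back to 8 characters so the slices line up.
df["birthday"] = df["birthday"].astype(str).str.zfill(8)

# Slice each string by position.
df["month"] = df["birthday"].str[0:2]
df["day"] = df["birthday"].str[2:4]
df["year"] = df["birthday"].str[4:]

# Alternative: parse the whole string into a datetime in one step.
parsed = pd.to_datetime(df["birthday"], format="%m%d%Y")

print(df)
print(parsed.dt.year.tolist())
```

The sliced columns are still strings. If you need numbers or real dates, convert them with astype(int) or use the pd.to_datetime result instead.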
Suppose you have a column called 'type' that contains values like "admin_US" or "user_Kenya". You can split these values on the underscore '_' and then pull each piece into its own column.
    # Create the 'str_split' column (each value becomes a list of pieces)
    df['str_split'] = df['type'].str.split('_')
    # Create the 'user_type' column from the first piece
    df['user_type'] = df['str_split'].str.get(0)
    # Create the 'country' column from the second piece
    df['country'] = df['str_split'].str.get(1)
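The same result can be reached without the intermediate list column by passing expand=True to str.split, which returns the pieces as separate columns. This is a sketch on made-up data. The n=1 argument is an added assumption that each value should be split only at its first underscore:

```python
import pandas as pd

# Hypothetical data in the "role_country" format described above.
df = pd.DataFrame({"type": ["admin_US", "user_Kenya"]})

# expand=True returns a DataFrame with one column per piece;
# n=1 limits the split to the first underscore.
pieces = df["type"].str.split("_", n=1, expand=True)

# Assign the two pieces to named columns.
df["user_type"] = pieces[0]
df["country"] = pieces[1]

print(df)
```

With n=1, a value such as "user_South_Africa" would yield "user" and "South_Africa" rather than three pieces.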