Muller Unlimited

April 10, 2020

DATA CLEANING WITH PANDAS

Dealing with Duplicates

  1. To check for duplicates, we can use the Pandas method .duplicated(), which returns a Boolean Series marking each row that repeats an earlier row.

    1. df.duplicated()

  2. We can use Pandas .drop_duplicates() to remove all rows that are duplicates of another row. It returns a new DataFrame, so assign the result to keep it.

    1. df = df.drop_duplicates()

  3. If we want to remove every row that repeats a value in the item column, even when the other columns differ, we can specify a subset.

    1. fruits = fruits.drop_duplicates(subset=['item'])

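    2. By default, this keeps the first occurrence of each duplicated value. Pass keep='last' to keep the last occurrence instead, or keep=False to drop every duplicated row.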
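
The three steps above can be sketched together as follows. The fruits data and its price column are made up for illustration; only the item column name comes from the notes above.

```python
import pandas as pd

# Hypothetical sample data; the values and the 'price' column are assumptions.
fruits = pd.DataFrame({
    "item": ["apple", "banana", "apple", "cherry", "banana"],
    "price": [1.00, 0.50, 1.00, 3.00, 0.60],
})

# Boolean Series: True for each row that repeats an earlier, identical row.
dup_mask = fruits.duplicated()

# Drop fully identical rows (here, only the second apple row).
no_exact_dups = fruits.drop_duplicates()

# Drop rows that repeat an earlier value in 'item',
# keeping the first occurrence of each item.
unique_items = fruits.drop_duplicates(subset=["item"])

print(dup_mask)
print(no_exact_dups)
print(unique_items)
```

Note that the two banana rows have different prices, so they only count as duplicates once we restrict the comparison to the item column.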

Splitting By Index

  1. If a column (here, birthday) holds fixed-width strings in the form MMDDYYYY, you can make the following columns by slicing each string by index.

    1. # Create a month column

    2. df['month'] = df.birthday.str[0:2]

    3. # Create a day column

    4. df['day'] = df.birthday.str[2:4]

    5. # Create a year column

    6. df['year'] = df.birthday.str[4:]

  2. Suppose you have a column called 'type' which contains values like "admin_US" or "user_Kenya".

    1. You can split these values on the underscore '_'.

      1. # Create the 'str_split' column

      2. df['str_split'] = df.type.str.split('_')

      3. # Create the 'user_type' column

      4. df['user_type'] = df.str_split.str.get(0)

      5. # Create the 'country' column

      6. df['country'] = df.str_split.str.get(1)
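
Both kinds of splitting can be tried on a small example. This is a minimal sketch: the sample rows are invented, and it assumes birthday is stored as strings so that leading zeros are preserved.

```python
import pandas as pd

# Hypothetical sample data; the values are assumptions for illustration.
df = pd.DataFrame({
    "birthday": ["01152000", "12311999"],
    "type": ["admin_US", "user_Kenya"],
})

# Split by index: fixed-width slices of MMDDYYYY.
df["month"] = df["birthday"].str[0:2]
df["day"] = df["birthday"].str[2:4]
df["year"] = df["birthday"].str[4:]

# Split by delimiter: each value becomes a list such as ['admin', 'US'].
df["str_split"] = df["type"].str.split("_")

# Pull individual pieces out of each list by position.
df["user_type"] = df["str_split"].str.get(0)
df["country"] = df["str_split"].str.get(1)

print(df[["month", "day", "year", "user_type", "country"]])
```

Bracket access such as df["type"] is used here instead of df.type because it also works for column names that clash with DataFrame attributes or contain spaces. The new columns are still strings; if birthday were read in as integers, it would first need converting, for example with df["birthday"].astype(str).str.zfill(8).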