August 21, 2020
Logistic Regression Project
Predict the Titanic Survivors
Load the Data
Use pandas to load the .csv file:
pd.read_csv('file_name')
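A minimal sketch of the loading step; since the real file isn't available here, a small inline CSV (with made-up rows) stands in for `pd.read_csv('passengers.csv')`:

```python
import io
import pandas as pd

# In the project you would pass a file name to pd.read_csv;
# here a tiny inline CSV substitutes for the real Titanic file.
csv_data = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22.0\n"
    "2,1,1,female,38.0\n"
)
passengers = pd.read_csv(csv_data)
print(passengers.shape)  # (2, 5)
```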
Clean the Data
Use the .map() function to convert the string labels to numbers:
passengers['Sex'] = passengers['Sex'].map({'male': 0, 'female': 1})
Replace missing values (NaN) using the .fillna() function:
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())
Store “First Class” passengers using the .apply() function:
passengers['First Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 1 else 0)
Do the same for “Second Class” (it is used as a feature below):
passengers['Second Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 2 else 0)
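The cleaning steps above can be run end to end on a toy frame (made-up rows standing in for the Titanic data):

```python
import pandas as pd

# Toy frame standing in for the loaded Titanic data.
passengers = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 38.0],
    'Pclass': [3, 1, 2],
})

# Map string labels to numbers.
passengers['Sex'] = passengers['Sex'].map({'male': 0, 'female': 1})

# Fill missing ages with the column mean (mean of 22 and 38 is 30).
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())

# Binary indicator columns for first and second class.
passengers['First Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 1 else 0)
passengers['Second Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 2 else 0)

print(passengers['Age'].tolist())  # [22.0, 30.0, 38.0]
```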
Select and Split Data
features = passengers[['Sex', 'Age', 'First Class', 'Second Class']]
survived = passengers['Survived']
Save the train test split results as variables
train_features, test_features, train_labels, test_labels = train_test_split(features, survived)
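A runnable sketch of the split, using a synthetic feature matrix in place of the Titanic columns; test_size and random_state are assumptions (scikit-learn defaults to a 25% test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels standing in for the Titanic columns.
features = np.arange(20).reshape(10, 2)
survived = np.array([0, 1] * 5)

# test_size=0.25 reserves a quarter of the rows for evaluation;
# random_state makes the split reproducible.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, survived, test_size=0.25, random_state=0)

print(train_features.shape, test_features.shape)
```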
Normalize the Data
Create a standard scaler
scaler = StandardScaler()
To determine scaling factors and apply the scaling to the feature data:
train_features = scaler.fit_transform(train_features)
To apply the same scaling to the test data:
test_features = scaler.transform(test_features)
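The fit_transform / transform distinction in a small sketch (toy numbers, not the Titanic data): the scaler learns its mean and standard deviation from the training set only, then reuses them on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_features = np.array([[2.0, 20.0]])

scaler = StandardScaler()

# fit_transform: learn mean/std from the training data, then scale it.
train_scaled = scaler.fit_transform(train_features)

# transform: reuse the training statistics on the test data,
# so the test set never influences the scaling factors.
test_scaled = scaler.transform(test_features)

print(test_scaled)  # [[0. 0.]] because 2.0 and 20.0 are the training means
```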
Create and Evaluate the Model Using Logistic Regression
model = LogisticRegression()
model.fit(train_features, train_labels)
print(model.score(train_features, train_labels))
Do the same for the test data:
print(model.score(test_features, test_labels))
Compare the two scores: similar train and test scores suggest the model generalizes well.
Look at the coefficient data (model.coef_) to see which feature is the most useful
Sex is the most useful feature, followed by First Class
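Fitting, scoring, and reading coefficients can be sketched on synthetic data (here the first feature fully determines the label, so its coefficient should dominate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first column strongly predicts the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

print(model.score(X, y))  # accuracy on the data the model was fit on
print(model.coef_)        # one coefficient per feature
# A larger |coefficient| means the feature matters more,
# provided the features are on the same scale (hence StandardScaler).
```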
Predict with the Model
Use samples with the same features (sex, age, class, …)
Transform the data using the scaler.
sample_passengers = scaler.transform(sample_passengers)
Make a prediction using the normalized sample
print(model.predict(sample_passengers))
OR
print(model.predict_proba(sample_passengers)) to see the probability of each class
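The whole prediction step as a runnable sketch; the training rows and the sample passengers are made up, with columns in the assumed order [Sex, Age, First Class, Second Class]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data in place of the scaled Titanic features.
X = np.array([[0, 22.0, 0, 0], [1, 38.0, 1, 0], [1, 26.0, 0, 1], [0, 35.0, 0, 0]])
y = np.array([0, 1, 1, 0])

scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X), y)

# Hypothetical sample passengers: [Sex, Age, First Class, Second Class].
sample_passengers = np.array([[0, 20.0, 0, 0], [1, 17.0, 1, 0]])
sample_passengers = scaler.transform(sample_passengers)

print(model.predict(sample_passengers))        # class labels (0 = died, 1 = survived)
print(model.predict_proba(sample_passengers))  # [P(died), P(survived)] per row
```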