August 21, 2020
Logistic Regression Project
Predict the Titanic Survivors
Load the Data
Use pandas to load the .csv file:
pd.read_csv('file_name')
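A minimal sketch of the loading step; since the real file isn't available here, a small inline CSV (with made-up rows) stands in for `pd.read_csv('passengers.csv')`:

```python
import io
import pandas as pd

# In the project you would pass a file name to pd.read_csv;
# here a tiny inline CSV substitutes for the real Titanic file.
csv_data = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22.0\n"
    "2,1,1,female,38.0\n"
)
passengers = pd.read_csv(csv_data)
print(passengers.shape)  # (2, 5)
```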
Clean the Data
Use the .map() function to convert the string labels to numbers:
passengers['Sex'] = passengers['Sex'].map({'male': 0, 'female': 1})
Replace missing values (NaN) using the .fillna() function:
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())
Store “First Class” passengers using the .apply() function:
passengers['First Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 1 else 0)
Do the same for “Second Class” (it is used as a feature below):
passengers['Second Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 2 else 0)
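The cleaning steps above can be run end to end on a toy frame (made-up rows standing in for the Titanic data):

```python
import pandas as pd

# Toy frame standing in for the loaded Titanic data.
passengers = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Age': [22.0, None, 38.0],
    'Pclass': [3, 1, 2],
})

# Map string labels to numbers.
passengers['Sex'] = passengers['Sex'].map({'male': 0, 'female': 1})

# Fill missing ages with the column mean (mean of 22 and 38 is 30).
passengers['Age'] = passengers['Age'].fillna(passengers['Age'].mean())

# Binary indicator columns for first and second class.
passengers['First Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 1 else 0)
passengers['Second Class'] = passengers['Pclass'].apply(lambda p: 1 if p == 2 else 0)

print(passengers['Age'].tolist())  # [22.0, 30.0, 38.0]
```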
Select and Split Data
features = passengers[['Sex', 'Age', 'First Class', 'Second Class']]
survived = passengers['Survived']
Save the train test split results as variables
train_features, test_features, train_labels, test_labels = train_test_split(features, survived)
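A runnable sketch of the split, using a synthetic feature matrix in place of the Titanic columns; test_size and random_state are assumptions (scikit-learn defaults to a 25% test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels standing in for the Titanic columns.
features = np.arange(20).reshape(10, 2)
survived = np.array([0, 1] * 5)

# test_size=0.25 reserves a quarter of the rows for evaluation;
# random_state makes the split reproducible.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, survived, test_size=0.25, random_state=0)

print(train_features.shape, test_features.shape)
```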
Normalize the Data
Create a standard scaler
scaler = StandardScaler()
To determine scaling factors and apply the scaling to the feature data:
train_features = scaler.fit_transform(train_features)
To apply the same scaling to the test data:
test_features = scaler.transform(test_features)
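The fit_transform / transform distinction in a small sketch (toy numbers, not the Titanic data): the scaler learns its mean and standard deviation from the training set only, then reuses them on the test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_features = np.array([[2.0, 20.0]])

scaler = StandardScaler()

# fit_transform: learn mean/std from the training data, then scale it.
train_scaled = scaler.fit_transform(train_features)

# transform: reuse the training statistics on the test data,
# so the test set never influences the scaling factors.
test_scaled = scaler.transform(test_features)

print(test_scaled)  # [[0. 0.]] because 2.0 and 20.0 are the training means
```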
Create and Evaluate the Model Using Logistic Regression
model = LogisticRegression()
model.fit(train_features, train_labels)
print(model.score(train_features, train_labels))
Do the same for the test data:
print(model.score(test_features, test_labels))
Compare the two scores: similar train and test scores suggest the model generalizes well.
Look at the coefficient data (model.coef_) to see which feature is the most useful
Sex is the most useful feature, followed by First Class
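Fitting, scoring, and reading coefficients can be sketched on synthetic data (here the first feature fully determines the label, so its coefficient should dominate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the first column strongly predicts the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

print(model.score(X, y))  # accuracy on the data the model was fit on
print(model.coef_)        # one coefficient per feature
# A larger |coefficient| means the feature matters more,
# provided the features are on the same scale (hence StandardScaler).
```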
Predict with the Model
Use samples with the same features (sex, age, class, …)
Transform the data using the scaler.
sample_passengers = scaler.transform(sample_passengers)
Make a prediction using the normalized sample
print(model.predict(sample_passengers))
OR
print(model.predict_proba(sample_passengers)) to see the probability of each class
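The whole prediction step as a runnable sketch; the training rows and the sample passengers are made up, with columns in the assumed order [Sex, Age, First Class, Second Class]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data in place of the scaled Titanic features.
X = np.array([[0, 22.0, 0, 0], [1, 38.0, 1, 0], [1, 26.0, 0, 1], [0, 35.0, 0, 0]])
y = np.array([0, 1, 1, 0])

scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X), y)

# Hypothetical sample passengers: [Sex, Age, First Class, Second Class].
sample_passengers = np.array([[0, 20.0, 0, 0], [1, 17.0, 1, 0]])
sample_passengers = scaler.transform(sample_passengers)

print(model.predict(sample_passengers))        # class labels (0 = died, 1 = survived)
print(model.predict_proba(sample_passengers))  # [P(died), P(survived)] per row
```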