Kaggle Titanic: Machine Learning from Disaster

The Dataset

The Titanic dataset records who lived and who died when the ship sank in 1912, a disaster that killed more than 1,500 people. The training set provides details for 891 passengers, including name, sex, passenger class, and fare, and most importantly a Survived label: 0 for died, 1 for survived. Our task is to mine these 891 passenger records for predictive features, build a model, and use it to predict survival for the 418 passengers in the test set.

Dataset download link

Data Preprocessing

Loading the Data

# load the datasets (pandas and numpy are used throughout)
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
full_data = [train, test]
# preview the first 5 rows of the training set
train.head()

[Figure: output of train.head()]

# inspect the training and test sets
train.info()
test.info()

[Figure: output of train.info() and test.info()]
The info output shows that Age, Cabin, and Embarked have missing values, which we will need to fill or drop later.
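To quantify the gaps before deciding how to handle them, one quick check (a small sketch over the columns of the standard Kaggle files):

# percentage of missing values per column, most incomplete first
(train.isnull().mean() * 100).sort_values(ascending=False).round(1)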

Feature Engineering

Pclass

train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean()

[Figure: survival rate by Pclass]

Name

From this feature we can extract each passenger's title.

for dataset in full_data:
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
pd.crosstab(train['Sex'], train['Title'])

[Figure: crosstab of Sex by Title]
Some titles belong to only a handful of passengers, so we merge them into a single Rare category. There are two ways to do this.
Approach 1:

for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Capt', 'Col', 'Don', 'Dr', 'Jonkheer',\
        'Lady', 'Major', 'Rev', 'Sir', 'the Countess', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

[Figure: survival rate by Title]

Approach 2:

# count titles across both datasets so every title is present in the lookup
title_counts = pd.concat([d['Title'] for d in full_data]).value_counts()
for dataset in full_data:
    dataset['Title'] = dataset['Title'].apply(lambda x: 'Rare' if title_counts[x] < 10 else x)

Of these two approaches, the first is more careful: besides grouping rare titles, it also corrects mislabeled categories (Mlle and Mme are French equivalents of Miss and Mrs). The second simply merges every title whose count falls below a threshold (here, 10). It is more concise, but it cannot detect mislabeled categories.
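Whichever approach is used, it is worth confirming the consolidation worked; a quick check (assuming the Title column built above):

# after consolidation only a handful of title categories should remain
print(train['Title'].value_counts())
print(test['Title'].value_counts())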

Sex

train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean()

[Figure: survival rate by Sex]

Age

This feature has many missing values. We compute its mean and standard deviation and fill the gaps with random integers drawn from [mean - std, mean + std], then cut Age into 5 equal-width bins.

for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()

    # draw random ages within one standard deviation of the mean
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    # use .loc to avoid pandas chained-assignment issues
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)

    dataset['AgeBin'] = pd.cut(dataset['Age'], 5)

train[['AgeBin', 'Survived']].groupby(['AgeBin'], as_index=False).mean()

[Figure: survival rate by AgeBin]

SibSp and Parch

From SibSp (number of siblings and spouses aboard) and Parch (number of parents and children aboard) we can derive a new feature, FamilySize.

for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean()

[Figure: survival rate by FamilySize]

From FamilySize we can derive yet another feature: whether the passenger traveled alone.

for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

[Figure: survival rate by IsAlone]

Fare

This feature has a small number of missing values in the test set, which we fill with the training-set median. We then split Fare into 4 quantile bins.

for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
train[['FareBin', 'Survived']].groupby(['FareBin'], as_index=False).mean()

[Figure: survival rate by FareBin]

Embarked

This feature has a few missing values, which we fill with the most frequent value (the mode).

for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()

[Figure: survival rate by Embarked]

Ticket and Cabin

Ticket is an essentially arbitrary identifier with no obvious relation to survival, so we drop it. Cabin has too many missing values to fill reliably, so we drop it as well.

drop_columns = ['Ticket', 'Cabin']
for dataset in full_data:
    dataset.drop(drop_columns, axis=1, inplace=True)

Finally, run train.isnull().sum() to verify that every missing value has been handled.
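Since the same preprocessing was applied to both files, it is worth checking the test set too; a quick sketch:

# verify that no missing values remain in either dataset
for dataset in full_data:
    print(dataset.isnull().sum())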

Encoding Features as Numbers

First, print the training set after feature engineering:
[Figure: training set after feature engineering]

Sex, Embarked, Title, AgeBin, and FareBin are all non-numeric. Most algorithms cannot consume such values directly, so we encode them as integers.

from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for dataset in full_data:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
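Note that LabelEncoder imposes an arbitrary integer order on the categories; tree-based models tolerate this, but for linear models a one-hot encoding is a common alternative. A minimal sketch (the train_onehot name is hypothetical):

# alternative: one-hot encode the string categories instead of integer codes
train_onehot = pd.get_dummies(train, columns=['Sex', 'Embarked', 'Title'])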

Importing and Pre-Training All Models

Different models perform differently on the same dataset, and deciding which model to use for a given problem is one of the hard parts of machine learning. Rather than guessing, we can import every plausible model and see how each algorithm performs with its default parameters.

Import the required libraries

from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from sklearn import model_selection  # used below for ShuffleSplit and cross_validate
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

Define the list of machine learning algorithms

MLA = [
    # Ensemble methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    # Gaussian processes
    gaussian_process.GaussianProcessClassifier(),
    # GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    # Naive Bayes
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    # Nearest neighbors
    neighbors.KNeighborsClassifier(),
    # SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    # Trees
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    # Discriminant analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),
    # xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    XGBClassifier()
]

Pre-train and print the algorithm comparison

sklearn.model_selection.ShuffleSplit generates cross-validation splits of the dataset (see the official documentation for details) and can serve as an alternative to train_test_split. Think of the data as 10 equal parts: in each of the 10 splits, 6 parts go to training, 3 parts to testing, and 1 part is left out.

cv_split = model_selection.ShuffleSplit(n_splits=10, test_size=.3, train_size=.6, random_state=0)
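As a sanity check on those proportions (a sketch; with 891 training rows, each split should contain roughly 534 training and 268 test indices):

# inspect the size of the first of the 10 random splits
train_idx, test_idx = next(cv_split.split(train))
print(len(train_idx), len(test_idx))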

Create the header of the algorithm comparison table, plus a table to collect every model's predictions.

# the target column (the Survived label)
Target = ['Survived']

# create table to compare MLA metrics
MLA_columns = ['MLA Name', 'MLA Parameters', 'MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD', 'MLA Time']
MLA_compare = pd.DataFrame(columns=MLA_columns)

# create table to compare MLA predictions
MLA_predict = train[Target]

Start training

# feature columns used for training -- assumed here to be the encoded features built above
Feature_x_bin = ['Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code',
                 'FamilySize', 'IsAlone', 'AgeBin_Code', 'FareBin_Code']

# index through MLA and save performance to table
row_index = 0
train_X, train_y = train[Feature_x_bin], train[Target]
for alg in MLA:

    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA Name'] = MLA_name
    MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())

    # score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate
    cv_results = model_selection.cross_validate(alg, train_X, train_y, cv=cv_split, return_train_score=True)

    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean()
    # if this is a non-biased random sample, then +/-3 standard deviations (std) from the mean should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std() * 3  # let's know the worst that can happen!

    # save MLA predictions - see section 6 for usage
    alg.fit(train_X, train_y)
    MLA_predict[MLA_name] = alg.predict(train_X)

    row_index += 1

# print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
MLA_compare.sort_values(by=['MLA Test Accuracy Mean'], ascending=False, inplace=True)
MLA_compare

[Figure: algorithm comparison table]

Plot the comparison chart:

import matplotlib.pyplot as plt
import seaborn as sns
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')

[Figure: accuracy comparison bar chart]

Model Selection
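Since MLA_compare is already sorted by mean test accuracy, a simple starting point is to take the top-ranked algorithm as the baseline model; a minimal sketch:

# pick the algorithm with the best mean cross-validated test accuracy
best_name = MLA_compare.iloc[0]['MLA Name']
best_alg = next(alg for alg in MLA if alg.__class__.__name__ == best_name)
print('Baseline model:', best_name)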
