Titanic Survival Prediction Using Tensorflow in Python

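In this article, we will learn to predict the survival chances of Titanic passengers from given information such as their sex and age. Since this is a classification task, we will use a random forest classifier.

This experiment has three main steps.

Feature Engineering
Imputation
Training and Prediction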

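Dataset

The dataset for this experiment is freely available on Kaggle. Download it from this link: https://www.kaggle.com/competitions/titanic/data?select=train.csv. The download consists of three CSV files: gender_submission.csv, train.csv and test.csv.

Importing Libraries and Initial Setup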

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
%matplotlib inline
warnings.filterwarnings('ignore')

Now let us read the training and test data into pandas data frames.

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
 
# To know number of columns and rows
train.shape
# (891, 12)

To get information about each column, such as its data type and the number of non-null entries, we use the df.info() function.

train.info()

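Now let us check whether the dataset contains any NULL values. This can be done with the isnull() function; summing its result gives the number of missing values in each column.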

train.isnull().sum()
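
The same check can be tried on a small frame to see what it reports. The sketch below uses a toy table that is made up for illustration (it is not taken from train.csv) and also computes the share of missing values per column, which is an addition to the steps above.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not rows from train.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Share of missing values in each column, as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)

print(missing_counts)
print(missing_pct)
```

A column with a high share of missing values, like Cabin here, is a candidate for being dropped or reduced to a simple indicator.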

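Visualization

Now let us visualize the data with some pie charts and bar charts to get a better understanding of it.

Let us first plot the number of survivors and the number of passengers who died.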

f, ax = plt.subplots(1, 2, figsize=(12, 4))
train['Survived'].value_counts().plot.pie(
    explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=False)
ax[0].set_title('Survivors (1) and the dead (0)')
ax[0].set_ylabel('')
sns.countplot(x='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survivors (1) and the dead (0)')
plt.show()

Sex feature

f, ax = plt.subplots(1, 2, figsize=(12, 4))
train[['Sex', 'Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survivors by sex')
sns.countplot(x='Sex', hue='Survived', data=train, ax=ax[1])
ax[1].set_ylabel('Quantity')
ax[1].set_title('Survived (1) and deceased (0): men and women')
plt.show()
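
The numbers behind these two plots can also be computed directly. The following is a minimal sketch on a toy frame (made-up rows, not the Kaggle data): the mean of a 0/1 Survived column per group is that group's survival rate, and a cross-tabulation gives the counts that a count plot with hue would draw.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = survival rate of that group
survival_rate = toy.groupby("Sex")["Survived"].mean()

# Counts of each outcome per sex (rows: Sex, columns: Survived)
counts = pd.crosstab(toy["Sex"], toy["Survived"])

print(survival_rate)
print(counts)
```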

Feature Engineering

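Now let us see which columns we should drop and/or modify in order to make predictions on the test data. The main tasks in this step are to drop unnecessary features and to convert string data into numerical categories so that the model can be trained on them.

We start by dropping the Cabin feature, since little further useful information can be extracted from it. Before dropping it, we create a new column, CabinBool, that records whether a cabin value was given for the passenger or not.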

# Create a new column cabinbool indicating
# if the cabin value was given or was NaN
train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))
 
# Delete the column 'Cabin' from test
# and train dataset
train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)
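
To see what the indicator column looks like, here is a small sketch on made-up rows. The comparison of survival rates at the end is an extra check added for illustration; it is not part of the steps above.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, "E46", np.nan],
    "Survived": [1, 0, 1, 0],
})

# 1 if a cabin value is present, 0 if it is NaN
toy["CabinBool"] = toy["Cabin"].notnull().astype("int")

# Survival rate for passengers with and without a recorded cabin
rate_by_cabin = toy.groupby("CabinBool")["Survived"].mean()

print(toy)
print(rate_by_cabin)
```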

We can also drop the Ticket feature, since it is unlikely to yield any useful information.

train = train.drop(['Ticket'], axis=1)
test = test.drop(['Ticket'], axis=1)

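There are missing values in the Embarked feature. We will replace the null values with "S", since more passengers embarked at "S" than at either of the other two ports.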

# replacing the missing values in
# the Embarked feature with S
train = train.fillna({"Embarked": "S"})
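
The choice of "S" can be derived from the data rather than hard-coded. The sketch below, on made-up values, counts the ports and fills the gaps with the most frequent one; the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S"]})

# value_counts() ignores missing values by default
port_counts = toy["Embarked"].value_counts()

# mode() returns a Series; its first entry is the most frequent port
most_common_port = toy["Embarked"].mode()[0]

# Fill the missing entries with the most frequent port
toy = toy.fillna({"Embarked": most_common_port})

print(port_counts)
print(toy)
```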

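We will now bin the ages: passengers whose ages fall in the same range are placed in the same age group. This leaves us with far fewer distinct values and turns Age into a categorical feature, which can make it easier for the model to use. Missing ages are first filled with -0.5 so that they fall into a separate "Unknown" group.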

# sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels=labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels=labels)
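
To make the bin boundaries concrete, the sketch below applies the same bins and labels to a few hand-picked ages (chosen for illustration). pd.cut uses right-inclusive intervals by default, so an age of exactly 5 falls in (0, 5] and is labelled 'Baby', not 'Child'.

```python
import numpy as np
import pandas as pd

bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager',
          'Student', 'Young Adult', 'Adult', 'Senior']

# Hand-picked ages; the missing one becomes -0.5 and lands in (-1, 0]
ages = pd.Series([np.nan, 0.42, 5, 12, 18.5, 35, 61]).fillna(-0.5)

# Intervals are right-inclusive: (-1, 0], (0, 5], (5, 12], ...
groups = pd.cut(ages, bins, labels=labels)

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```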

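Next, we extract each passenger's title from the Name column in both the training and test sets and group the titles into the same set of categories in both. We then assign a numerical value to each title group so that the model can be trained on it.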

# create a combined group of both datasets
combine = [train, test]
 
# extract a title for each Name in the
# train and test datasets
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
 
pd.crosstab(train['Title'], train['Sex'])
 
# replace various titles with more common names
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major',
                                                 'Rev', 'Jonkheer', 'Dona'],
                                                'Rare')
 
    # 'Lady' was already mapped to 'Rare' above, so listing it here has no effect
    dataset['Title'] = dataset['Title'].replace(
        ['Countess', 'Lady', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
 
train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
 
# map each of the title groups to a numerical value
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3,
                 "Master": 4, "Royal": 5, "Rare": 6}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
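
The extraction step relies on names having the form "Surname, Title. Given names". The sketch below runs the same regular expression on a few invented names in that format (they are not passengers from the dataset) and then applies a shortened version of the grouping and mapping.

```python
import pandas as pd

# Invented names in the "Surname, Title. Given names" format
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Major)",
    "Poe, Mlle. Anne",
    "Loe, Master. Tim",
    "Noe, Dr. Sam",
])

# Capture the word that follows a space and ends with a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Merge spelling variants and lump uncommon titles together
normalized = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
normalized = normalized.replace(["Dr", "Rev", "Col", "Major"], "Rare")

# Map each group to a number; unmapped titles would become 0
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3,
                 "Master": 4, "Royal": 5, "Rare": 6}
codes = normalized.map(title_mapping).fillna(0).astype(int)

print(pd.DataFrame({"Title": titles, "Group": normalized, "Code": codes}))
```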

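Now, using the title information, we can fill in the missing age values: each passenger with an unknown age is given the most common age group among passengers with the same title.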

mr_age = train[train["Title"] == 1]["AgeGroup"].mode()  # Young Adult
miss_age = train[train["Title"] == 2]["AgeGroup"].mode()  # Student
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode()  # Adult
master_age = train[train["Title"] == 4]["AgeGroup"].mode()  # Baby
royal_age = train[train["Title"] == 5]["AgeGroup"].mode()  # Adult
rare_age = train[train["Title"] == 6]["AgeGroup"].mode()  # Adult
 
age_title_mapping = {1: "Young Adult", 2: "Student",
                     3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}
 
# Replace each "Unknown" age group with the most common age group
# for that passenger's title. Using .loc avoids chained assignment,
# which does not reliably write back to the data frame.
for dataset in (train, test):
    unknown = dataset["AgeGroup"] == "Unknown"
    dataset.loc[unknown, "AgeGroup"] = dataset.loc[unknown, "Title"].map(
        age_title_mapping)
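
The title-to-age-group mapping above is written out by hand from the mode() results. As an alternative, it could be built programmatically. The sketch below does this on made-up rows; excluding the "Unknown" rows before taking the mode is a design choice of this sketch, so that "Unknown" can never be picked as the fill value.

```python
import pandas as pd

# Toy data, made up for illustration (Title codes: 1 = Mr, 2 = Miss)
toy = pd.DataFrame({
    "Title": [1, 1, 1, 2, 2, 2, 1, 2],
    "AgeGroup": ["Young Adult", "Young Adult", "Adult",
                 "Student", "Student", "Child",
                 "Unknown", "Unknown"],
})

# Only rows with a known age group vote for the most common group
known = toy[toy["AgeGroup"] != "Unknown"]
age_title_mapping = (
    known.groupby("Title")["AgeGroup"]
    .agg(lambda s: s.mode().iloc[0])
    .to_dict()
)

# Fill the unknown rows from the derived mapping
unknown = toy["AgeGroup"] == "Unknown"
toy.loc[unknown, "AgeGroup"] = toy.loc[unknown, "Title"].map(age_title_mapping)

print(age_title_mapping)
print(toy)
```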

Now assign a numerical value to each age category. Once the ages have been mapped to categories, we no longer need the Age feature, so we drop it.

# map each Age value to a numerical value
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3,
               'Student': 4, 'Young Adult': 5, 'Adult': 6,
               'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)
 
train.head()
 
# dropping the Age feature for now, might change
train = train.drop(['Age'], axis=1)
test = test.drop(['Age'], axis=1)

Drop the Name feature, since it holds no further useful information now that the titles have been extracted.

train = train.drop(['Name'], axis=1)
test = test.drop(['Name'], axis=1)

Assign numerical values to the categories of the Sex and Embarked features.

sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)
 
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

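Fill in the missing Fare value in the test set with the mean fare of the passenger's Pclass, then group the fares into four bands.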

for x in range(len(test["Fare"])):
    if pd.isnull(test.loc[x, "Fare"]):
        pclass = test.loc[x, "Pclass"]  # Pclass = 3
        # .loc writes to the data frame directly (no chained assignment)
        test.loc[x, "Fare"] = round(
            train[train["Pclass"] == pclass]["Fare"].mean(), 4)
 
# map Fare values into groups of
# numerical values
train['FareBand'] = pd.qcut(train['Fare'], 4,
                            labels=[1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4,
                           labels=[1, 2, 3, 4])
 
# drop Fare values
train = train.drop(['Fare'], axis=1)
test = test.drop(['Fare'], axis=1)
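
Note that calling pd.qcut separately on the training and test sets produces band edges that can differ between the two, so the same fare may end up in different bands. One possible alternative, sketched here on made-up fares, is to learn the quartile edges on the training fares and reuse them for the test fares. This is a variation on the step above, not the approach used in this article.

```python
import numpy as np
import pandas as pd

# Made-up fares for illustration
train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 7.9, 15.5])
test_fare = pd.Series([7.0, 14.0, 30.0, 512.0])

# Learn the quartile edges on the training fares only
train_band, edges = pd.qcut(train_fare, 4, labels=[1, 2, 3, 4],
                            retbins=True)

# Open the outer edges so test fares outside the training range still fit
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

# Apply the training edges to the test fares
test_band = pd.cut(test_fare, bins=open_edges, labels=[1, 2, 3, 4])

print(edges)
print(test_band.tolist())
```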

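We have now completed the feature engineering.

Model Training

We will use a random forest as the algorithm for model training. Before that, we split the labelled data 80:20 into a training set and a validation set. For this we use train_test_split() from the sklearn library.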

from sklearn.model_selection import train_test_split
 
# Drop the Survived and PassengerId
# column from the trainset
predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(
    predictors, target, test_size=0.2, random_state=0)
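
With 891 rows and test_size=0.2, the split yields 712 training rows and 179 validation rows. The sketch below checks this on randomly generated stand-in data of the same length (not the real features). The stratify=y argument is an optional addition that keeps the share of survivors similar in both parts; the split above does not use it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with the same number of rows as train.csv
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=891),
    "Sex": rng.integers(0, 2, size=891),
})
y = pd.Series(rng.integers(0, 2, size=891), name="Survived")

# 80:20 split; stratify=y is an optional extra, not used in the article
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(x_train.shape, x_val.shape)
print(round(y_train.mean(), 3), round(y_val.mean(), 3))
```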

Now import the random forest classifier from sklearn's ensemble module and fit it on the training set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
randomforest = RandomForestClassifier()
 
# Fit the training data along with its output
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
 
# Find the accuracy score of the model
acc_randomforest = round(accuracy_score(y_val, y_pred) * 100, 2)
print(acc_randomforest)
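
A single validation split gives one accuracy figure that depends on how the rows happened to be divided and on the forest's random seed. As a hedged sketch of a more stable estimate, the code below fixes random_state and uses 5-fold cross-validation. Cross-validation is an additional technique not used in this article, and the data here is synthetic, so the printed score says nothing about the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: three small integer features and a
# label that is a simple function of the first two
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 3).astype(int)

# Fixing random_state makes the forest reproducible between runs
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on each of 5 folds, then the mean as a percentage
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
mean_acc = round(scores.mean() * 100, 2)

print(scores)
print(mean_acc)
```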

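In this run, we obtained an accuracy of 83.25%. The exact figure can vary between runs, because RandomForestClassifier() is created here without a fixed random_state.

Prediction

We are given the test dataset, on which we have to make predictions. To do so, we pass the test dataset to our trained model and save the result as a CSV file with two columns, PassengerId and Survived. PassengerId holds the passenger IDs from the test data, and Survived is either 0 or 1.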

ids = test['PassengerId']
predictions = randomforest.predict(test.drop('PassengerId', axis=1))
 
# set the output as a dataframe and convert
# to csv file named resultfile.csv
output = pd.DataFrame({'PassengerId': ids, 'Survived': predictions})
output.to_csv('resultfile.csv', index=False)
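
Before uploading the file, it can be worth checking its shape and contents. The sketch below runs a few such checks on a small hand-made output frame; the IDs and predictions are invented for illustration.

```python
import pandas as pd

# Invented IDs and predictions, standing in for the real output
output = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Expected columns, in order
columns_ok = list(output.columns) == ["PassengerId", "Survived"]

# Every prediction is 0 or 1, and no passenger appears twice
values_ok = output["Survived"].isin([0, 1]).all()
ids_unique = output["PassengerId"].is_unique

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = output.to_csv(index=False)

print(columns_ok, values_ok, ids_unique)
print(csv_text)
```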