Titanic: Machine Learning from Disaster

Link: GitHub source code

Question: build a predictive model that answers "What sorts of people were more likely to survive?" using passenger data (e.g., name, age, gender, socio-economic class).

1. Importing Packages and Datasets
```python
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
```

Note: when running in a Kaggle notebook, drop the leading `.` from the path in `pd.read_csv('./kaggle/input/titanic/train.csv')`. Both the training set and the test set need to be read in.

```python
train = pd.read_csv('./kaggle/input/titanic/train.csv')
test = pd.read_csv('./kaggle/input/titanic/test.csv')
allData = pd.concat([train, test], ignore_index=True)
# dataNum = train.shape[0]
# featureNum = train.shape[1]
train.info()
```

2. Data Overview
Overview

Running train.info() prints the overall structure of the dataset:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
```

Running train.head() shows a few sample rows.
Features

| Variable | Definition | Key |
|:-:|:-:|:-:|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
3. Visual Data Analysis
Gender feature: Sex

Women had a far higher survival rate than men.

```python
# Sex
sns.countplot(x='Sex', hue='Survived', data=train)
plt.show()
```

Class feature: Pclass
The higher the passenger's class, the higher the survival rate.

```python
# Pclass
sns.barplot(x='Pclass', y='Survived', data=train)
plt.show()
```

Family-size feature: FamilySize = SibSp + Parch + 1

Passengers with a moderately sized family had the highest survival rate.
```python
# FamilySize = SibSp + Parch + 1
allData['FamilySize'] = allData['SibSp'] + allData['Parch'] + 1
sns.barplot(x='FamilySize', y='Survived', data=allData)
plt.show()
```

Embarkation-port feature: Embarked
Survival rates differ by port of embarkation.

```python
# Embarked
sns.countplot(x='Embarked', hue='Survived', data=train)
plt.show()
```

Age feature: Age
Young children and passengers in their prime had higher survival rates.

```python
# Age
sns.stripplot(x='Survived', y='Age', data=train, jitter=True)
plt.show()
```

Age-survival density:
```python
facet = sns.FacetGrid(train, hue='Survived', aspect=2)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlabel('Age')
plt.ylabel('density')
plt.show()
```

Children have a noticeably different survival rate from the population as a whole, so the author treats ages 10 and under as children and gives them a separate label (see the sketch below; the full version appears in the data-cleaning section).
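A minimal sketch of that flag, mirroring the code used later in the cleaning step:

```python
# Mark passengers aged 10 or under as children (full version in the data-cleaning section)
allData['Child'] = allData['Age'].apply(lambda x: 1 if x <= 10 else 0)
```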
Fare feature: Fare

The higher the fare, the higher the survival rate.

```python
# Fare
sns.stripplot(x='Survived', y='Fare', data=train, jitter=True)
plt.show()
```

Name feature: Name
Title feature: Title

The title is extracted from the honorific that precedes each given name.

```python
# Name
allData['Title'] = allData['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
pd.crosstab(allData['Title'], allData['Sex'])
```

Counting the titles in each group:
```python
TitleClassification = {'Officer': ['Capt', 'Col', 'Major', 'Dr', 'Rev'],
                       'Royalty': ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'],
                       'Mrs': ['Mme', 'Ms', 'Mrs'],
                       'Miss': ['Mlle', 'Miss'],
                       'Mr': ['Mr'],
                       'Master': ['Master', 'Jonkheer']}
for title in TitleClassification.keys():
    cnt = 0
    for name in TitleClassification[title]:
        cnt += allData.groupby(['Title']).size()[name]
    print(title, ':', cnt)
```

Assigning the labels:
```python
TitleClassification = {'Officer': ['Capt', 'Col', 'Major', 'Dr', 'Rev'],
                       'Royalty': ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'],
                       'Mrs': ['Mme', 'Ms', 'Mrs'],
                       'Miss': ['Mlle', 'Miss'],
                       'Mr': ['Mr'],
                       'Master': ['Master', 'Jonkheer']}
TitleMap = {}
for title in TitleClassification.keys():
    TitleMap.update(dict.fromkeys(TitleClassification[title], title))
allData['Title'] = allData['Title'].map(TitleMap)
```

Survival rates differ across titles.
```python
sns.barplot(x='Title', y='Survived', data=allData)
plt.show()
```

Ticket-number feature: Ticket
Some groups of passengers share a ticket number (adjacent berths); passengers sharing a ticket number have a higher survival rate.

```python
# Ticket
TicketCnt = allData.groupby(['Ticket']).size()
allData['SameTicketNum'] = allData['Ticket'].apply(lambda x: TicketCnt[x])
sns.barplot(x='SameTicketNum', y='Survived', data=allData)
plt.show()
# allData['SameTicketNum']
```

Bivariate / multivariate analysis
Any two (or more) features can be analysed together.

Bivariate analysis: Pclass & Age

```python
# Pclass & Age
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train, split=True)
plt.show()
```

Bivariate analysis: Age & Sex
```python
# Age & Sex
sns.swarmplot(x='Age', y='Sex', data=train, hue='Survived')
plt.show()
```

4. Data Cleaning and Outlier Handling
Discrete data

With existing labels -- One-Hot encoding

Sex, Pclass, and Embarked already come with well-defined category values (int, float, or string), so they can be expanded directly with get_dummies into multiple indicator columns, increasing the feature dimensionality. Embarked has a few missing values, which are filled with an estimate derived from the rest of the data.
```python
# Sex
allData = allData.join(pd.get_dummies(allData['Sex'], prefix='Sex'))
# Pclass
allData = allData.join(pd.get_dummies(allData['Pclass'], prefix='Pclass'))
# Embarked
allData[allData['Embarked'].isnull()]  # inspect the missing values
allData.groupby(by=['Pclass', 'Embarked']).Fare.mean()  # Pclass=1, Embarked=C, median 76
allData['Embarked'] = allData['Embarked'].fillna('C')
allData = allData.join(pd.get_dummies(allData['Embarked'], prefix='Embarked'))
```
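For readers new to get_dummies, here is a tiny, self-contained illustration (the toy frame is hypothetical, not part of the notebook) of how one categorical column becomes one indicator column per category:

```python
import pandas as pd

# Hypothetical toy frame, only to show what get_dummies produces
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})
print(pd.get_dummies(toy['Sex'], prefix='Sex'))
# Produces Sex_female and Sex_male indicator columns (0/1 or bool, depending on the pandas version)
```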
No existing labels -- design labels first, then One-Hot

FamilySize, Name, and Ticket have to be processed over the combined dataset before labels can be assigned.
```python
# FamilySize
def FamilyLabel(s):
    if (s == 4):
        return 4
    elif (s == 2 or s == 3):
        return 3
    elif (s == 1 or s == 7):
        return 2
    elif (s == 5 or s == 6):
        return 1
    elif (s < 1 or s > 7):
        return 0

allData['FamilyLabel'] = allData['FamilySize'].apply(FamilyLabel)
allData = allData.join(pd.get_dummies(allData['FamilyLabel'], prefix='Fam'))

# Name
TitleLabelMap = {'Mr': 1.0, 'Mrs': 5.0, 'Miss': 4.5, 'Master': 2.5, 'Royalty': 3.5, 'Officer': 2.0}
def TitleLabel(s):
    return TitleLabelMap[s]
# allData['TitleLabel'] = allData['Title'].apply(TitleLabel)
allData = allData.join(pd.get_dummies(allData['Title'], prefix='Title'))

# Ticket
def TicketLabel(s):
    if (s == 3 or s == 4):
        return 3
    elif (s == 2 or s == 8):
        return 2
    elif (s == 1 or s == 5 or s == 6 or s == 7):
        return 1
    elif (s < 1 or s > 8):
        return 0

allData['TicketLabel'] = allData['SameTicketNum'].apply(TicketLabel)
allData = allData.join(pd.get_dummies(allData['TicketLabel'], prefix='TicNum'))
```
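A quick way to check that these hand-designed labels really separate survival rates is to group by them; a minimal sketch using the columns created above:

```python
# Mean survival rate per designed label (test rows with missing Survived are ignored by mean())
print(allData.groupby('FamilyLabel')['Survived'].mean())
print(allData.groupby('TicketLabel')['Survived'].mean())
```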
Continuous data

Age and Fare are standardized to shrink the value range and speed up gradient descent.
```python
# Age
allData['Child'] = allData['Age'].apply(lambda x: 1 if x <= 10 else 0)  # child flag
allData['Age'] = (allData['Age'] - allData['Age'].mean()) / allData['Age'].std()  # standardize
allData['Age'].fillna(value=0, inplace=True)  # fill missing values
# Fare
allData['Fare'] = allData['Fare'].fillna(25)  # fill missing values
# cap extreme fares in the training rows (.loc avoids the no-op chained assignment)
allData.loc[allData['Survived'].notnull(), 'Fare'] = allData.loc[allData['Survived'].notnull(), 'Fare'].apply(lambda x: 300.0 if x > 500 else x)
allData['Fare'] = allData['Fare'].apply(lambda x: (x - allData['Fare'].mean()) / allData['Fare'].std())
```

Removing unused features
Dropping features that are no longer needed reduces model complexity.

```python
# Drop unused features
allData.drop(['Cabin', 'PassengerId', 'Ticket', 'Name', 'Title', 'Sex', 'SibSp', 'Parch',
              'FamilySize', 'Embarked', 'Pclass', 'FamilyLabel', 'SameTicketNum', 'TicketLabel'],
             axis=1, inplace=True)
```

Re-splitting the training/test sets
The training and test sets were merged at the start for convenience; they are now split apart again according to whether Survived is missing.

```python
# Re-split the dataset
train_data = allData[allData['Survived'].notnull()]
test_data = allData[allData['Survived'].isnull()]
test_data = test_data.reset_index(drop=True)

xTrain = train_data.drop(['Survived'], axis=1)
yTrain = train_data['Survived']
xTest = test_data.drop(['Survived'], axis=1)
```
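As a quick sanity check on the split (a sketch, not in the original notebook): the training frame should have 891 rows and the test frame 418, matching the Kaggle files.

```python
# Expect (891, n_features) and (418, n_features)
print(xTrain.shape, xTest.shape)
```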
Feature correlation analysis

This step gives feedback on whether the chosen features are effective or redundant; if something looks wrong, the earlier feature engineering can be revised.

```python
# Correlation between features
Correlation = pd.DataFrame(allData[allData.columns.to_list()])
colormap = plt.cm.viridis
plt.figure(figsize=(24, 22))
sns.heatmap(Correlation.astype(float).corr(), linewidths=0.1, vmax=1.0, cmap=colormap,
            linecolor='white', annot=True, square=True)
plt.show()
```
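As a numeric complement to the heatmap, the features can also be ranked by their correlation with the target; a minimal sketch, assuming the train_data frame built above:

```python
# Rank features by absolute correlation with Survived (sketch only)
corr_with_target = train_data.corr()['Survived'].abs().sort_values(ascending=False)
print(corr_with_target.head(10))
```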
5. Model Building and Parameter Tuning

Importing the model packages:

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
```

The author chooses a random-forest classifier and tunes its hyperparameters with a grid search.
```python
pipe = Pipeline([('select', SelectKBest(k=10)),
                 ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 100, 5)),
              'classify__max_depth': list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)
gsearch.fit(xTrain, yTrain)
print(gsearch.best_params_, gsearch.best_score_)
```

This takes a while to run; when it finishes, the result is:

```
{'classify__max_depth': 6, 'classify__n_estimators': 70} 0.8790924679681529
```

Building the model
Train the model with the parameters found above.

```python
rfc = RandomForestClassifier(n_estimators=70, max_depth=6, random_state=10, max_features='sqrt')
rfc.fit(xTrain, yTrain)
```
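Before exporting, it can be worth sanity-checking the tuned forest with a plain cross-validation run; a minimal sketch, assuming the xTrain/yTrain frames defined earlier (not part of the original notebook):

```python
from sklearn.model_selection import cross_val_score

# 10-fold accuracy of the tuned forest on the training data
scores = cross_val_score(rfc, xTrain, yTrain, cv=10, scoring='accuracy')
print(scores.mean(), scores.std())
```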
Exporting the results

```python
predictions = rfc.predict(xTest)
output = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions.astype('int64')})
output.to_csv('my_submission.csv', index=False)
```

6. Submitting for Scoring

Official recommended tutorial

Appendix: the complete code. The Jupyter Notebook was exported in Python script format; for the ipynb format, see the GitHub source code.

```python
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'

# %%
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# %% [markdown]
# # Features
# Variable | Definition | Key
# :-:|:-:|:-:
# survival | Survival | 0 = No, 1 = Yes
# pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
# sex | Sex
# Age | Age in years
# sibsp | # of siblings / spouses aboard the Titanic
# parch | # of parents / children aboard the Titanic
# ticket | Ticket number
# fare | Passenger fare
# cabin | Cabin number
# embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
# %%
train = pd.read_csv('./kaggle/input/titanic/train.csv')
test = pd.read_csv('./kaggle/input/titanic/test.csv')
allData = pd.concat([train, test], ignore_index=True)
# dataNum = train.shape[0]
# featureNum = train.shape[1]
train.head()

# %%
# Sex
sns.countplot(x='Sex', hue='Survived', data=train)
plt.show()

# %%
# Pclass
sns.barplot(x='Pclass', y='Survived', data=train)
plt.show()
# Pclass & Age
sns.violinplot(x='Pclass', y='Age', hue='Survived', data=train, split=True)
plt.show()

# %%
# FamilySize = SibSp + Parch + 1
allData['FamilySize'] = allData['SibSp'] + allData['Parch'] + 1
sns.barplot(x='FamilySize', y='Survived', data=allData)
plt.show()

# %%
# Embarked
sns.countplot(x='Embarked', hue='Survived', data=train)
plt.show()

# %%
# Age
sns.stripplot(x='Survived', y='Age', data=train, jitter=True)
plt.show()
facet = sns.FacetGrid(train, hue='Survived', aspect=2)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlabel('Age')
plt.ylabel('density')
plt.show()
# Age & Sex
sns.swarmplot(x='Age', y='Sex', data=train, hue='Survived')
plt.show()

# %%
# Fare
sns.stripplot(x='Survived', y='Fare', data=train, jitter=True)
plt.show()
# %%
# Name
# allData['Title'] = allData['Name'].str.extract('([A-Za-z]+)\.', expand=False)  # not sure what str.extract is doing here
allData['Title'] = allData['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
# pd.crosstab(allData['Title'], allData['Sex'])
TitleClassification = {'Officer': ['Capt', 'Col', 'Major', 'Dr', 'Rev'],
                       'Royalty': ['Don', 'Sir', 'the Countess', 'Dona', 'Lady'],
                       'Mrs': ['Mme', 'Ms', 'Mrs'],
                       'Miss': ['Mlle', 'Miss'],
                       'Mr': ['Mr'],
                       'Master': ['Master', 'Jonkheer']}
TitleMap = {}
for title in TitleClassification.keys():
    TitleMap.update(dict.fromkeys(TitleClassification[title], title))
    # cnt = 0
    # for name in TitleClassification[title]:
    #     cnt += allData.groupby(['Title']).size()[name]
    # print(title, ':', cnt)
allData['Title'] = allData['Title'].map(TitleMap)
sns.barplot(x='Title', y='Survived', data=allData)
plt.show()

# %%
# Ticket
TicketCnt = allData.groupby(['Ticket']).size()
allData['SameTicketNum'] = allData['Ticket'].apply(lambda x: TicketCnt[x])
sns.barplot(x='SameTicketNum', y='Survived', data=allData)
plt.show()
# allData['SameTicketNum']

# %% [markdown]
# # Data cleaning
# - Sex Pclass Embarked -- One-Hot
# - Age Fare -- Standardize
# - FamilySize Name Ticket -- ints -- One-Hot
# %%
# Sex
allData = allData.join(pd.get_dummies(allData['Sex'], prefix='Sex'))
# Pclass
allData = allData.join(pd.get_dummies(allData['Pclass'], prefix='Pclass'))
# Embarked
allData[allData['Embarked'].isnull()]  # inspect the missing values
allData.groupby(by=['Pclass', 'Embarked']).Fare.mean()  # Pclass=1, Embarked=C, median 76
allData['Embarked'] = allData['Embarked'].fillna('C')
allData = allData.join(pd.get_dummies(allData['Embarked'], prefix='Embarked'))

# %%
# Age
allData['Child'] = allData['Age'].apply(lambda x: 1 if x <= 10 else 0)  # child flag
allData['Age'] = (allData['Age'] - allData['Age'].mean()) / allData['Age'].std()  # standardize
allData['Age'].fillna(value=0, inplace=True)  # fill missing values
# Fare
allData['Fare'] = allData['Fare'].fillna(25)  # fill missing values
# cap extreme fares in the training rows (.loc avoids the no-op chained assignment)
allData.loc[allData['Survived'].notnull(), 'Fare'] = allData.loc[allData['Survived'].notnull(), 'Fare'].apply(lambda x: 300.0 if x > 500 else x)
allData['Fare'] = allData['Fare'].apply(lambda x: (x - allData['Fare'].mean()) / allData['Fare'].std())
# %%
# FamilySize
def FamilyLabel(s):
    if s == 4:
        return 4
    elif s == 2 or s == 3:
        return 3
    elif s == 1 or s == 7:
        return 2
    elif s == 5 or s == 6:
        return 1
    elif s < 1 or s > 7:
        return 0

allData['FamilyLabel'] = allData['FamilySize'].apply(FamilyLabel)
allData = allData.join(pd.get_dummies(allData['FamilyLabel'], prefix='Fam'))

# Name
TitleLabelMap = {'Mr': 1.0, 'Mrs': 5.0, 'Miss': 4.5, 'Master': 2.5, 'Royalty': 3.5, 'Officer': 2.0}

def TitleLabel(s):
    return TitleLabelMap[s]

# allData['TitleLabel'] = allData['Title'].apply(TitleLabel)
allData = allData.join(pd.get_dummies(allData['Title'], prefix='Title'))

# Ticket
def TicketLabel(s):
    if s == 3 or s == 4:
        return 3
    elif s == 2 or s == 8:
        return 2
    elif s == 1 or s == 5 or s == 6 or s == 7:
        return 1
    elif s < 1 or s > 8:
        return 0

allData['TicketLabel'] = allData['SameTicketNum'].apply(TicketLabel)
allData = allData.join(pd.get_dummies(allData['TicketLabel'], prefix='TicNum'))

# %%
# Drop unused features
allData.drop(['Cabin', 'PassengerId', 'Ticket', 'Name', 'Title', 'Sex', 'SibSp', 'Parch',
              'FamilySize', 'Embarked', 'Pclass', 'FamilyLabel', 'SameTicketNum', 'TicketLabel'],
             axis=1, inplace=True)

# Re-split the dataset
train_data = allData[allData['Survived'].notnull()]
test_data = allData[allData['Survived'].isnull()]
test_data = test_data.reset_index(drop=True)

xTrain = train_data.drop(['Survived'], axis=1)
yTrain = train_data['Survived']
xTest = test_data.drop(['Survived'], axis=1)
# allData.columns.to_list()
# %%
# Correlation between features
Correlation = pd.DataFrame(allData[allData.columns.to_list()])
colormap = plt.cm.viridis
plt.figure(figsize=(24, 22))
sns.heatmap(Correlation.astype(float).corr(), linewidths=0.1, vmax=1.0, cmap=colormap,
            linecolor='white', annot=True, square=True)
plt.show()

# %% [markdown]
# # Grid search over the random-forest parameters
# - n_estimators
# - max_depth

# %%
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest

# %%
pipe = Pipeline([('select', SelectKBest(k=10)),
                 ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 100, 5)),
              'classify__max_depth': list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)
gsearch.fit(xTrain, yTrain)
print(gsearch.best_params_, gsearch.best_score_)
# %%
rfc = RandomForestClassifier(n_estimators=70, max_depth=6, random_state=10, max_features='sqrt')
rfc.fit(xTrain, yTrain)
predictions = rfc.predict(xTest)

output = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions.astype('int64')})
output.to_csv('my_submission.csv', index=False)
```

Link: GitHub source code