17 May 2025
Revisiting the titanic with PyCaret
! ls /kaggle/input/titanic
gender_submission.csv test.csv train.csv
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
data_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
data_files.append(os.path.join(dirname, filename))
print(data_files)
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
['/kaggle/input/titanic/train.csv', '/kaggle/input/titanic/test.csv', '/kaggle/input/titanic/gender_submission.csv']
Exploratory Data Analysis

Files
df_train - development dataset
df_test - submission dataset
df_gender - sample submission where all female passengers are assumed to have survived; this naive baseline scores about 76% accuracy!
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')
df_test = pd.read_csv('/kaggle/input/titanic/test.csv')
df_gender = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
print(df_train.shape, df_test.shape, df_gender.shape)
(891, 12) (418, 11) (418, 2)
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Observations
- ID column: PassengerId
- Target (DV) column: Survived {0 | 1}
- Non-null IDV columns: Pclass, Name, Sex, SibSp, Parch, Ticket, Fare, Embarked
- IDV columns with some null values: Age, Cabin
df_train.describe()

|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------|------------:|---------:|-------:|----:|------:|------:|-----:|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Columns without any null values
Pclass
Ticket Class
df_train.Pclass.sample(10)
286 3
353 3
693 3
152 3
871 1
183 2
199 2
162 3
453 1
455 3
Name: Pclass, dtype: int64
What is the distribution of the class of tickets?
pd.concat([df_train.Pclass.value_counts(),
df_train.Pclass.value_counts(normalize=True)], axis = 1)
| Pclass | count | proportion |
|--------|------:|-----------:|
| 3 | 491 | 0.551066 |
| 1 | 216 | 0.242424 |
| 2 | 184 | 0.206510 |
class_survivors = df_train.groupby('Pclass').agg({'Survived':['count', 'sum']})
class_survivors
| Pclass | Survived count | Survived sum |
|--------|---------------:|-------------:|
| 1 | 216 | 136 |
| 2 | 184 | 87 |
| 3 | 491 | 119 |
class_survivors[('Survived', 'sum')]/class_survivors[('Survived', 'count')]
Pclass
1 0.629630
2 0.472826
3 0.242363
dtype: float64
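The same per-class survival rate can be read off with a one-liner; a minimal equivalent of the division above, assuming df_train is loaded as before:

# Survived is 0/1, so the group mean is the survival rate
df_train.groupby('Pclass')['Survived'].mean()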
Name
df_train.Name.sample(10)
5 Moran, Mr. James
477 Braund, Mr. Lewis Richard
434 Silvey, Mr. William Baird
835 Compton, Miss. Sara Rebecca
862 Swift, Mrs. Frederick Joel (Margaret Welles Ba...
709 Moubarek, Master. Halim Gonios ("William George")
210 Ali, Mr. Ahmed
409 Lefebre, Miss. Ida
329 Hippach, Miss. Jean Gertrude
599 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")
Name: Name, dtype: object
Note
- Title could be extracted from the name.
- Is there some significance, in the context of survival, to names with a bracketed alternative name?
- Are there important family names that could be extracted to signal importance for survival?
Sex
df_train.Sex.sample(10)
334 female
29 male
623 male
616 male
398 male
0 male
258 female
388 male
811 male
405 male
Name: Sex, dtype: object
What is the distribution of passenger sex?
pd.concat([df_train.Sex.value_counts(),
df_train.Sex.value_counts(normalize=True)], axis = 1)
| Sex | count | proportion |
|--------|------:|-----------:|
| male | 577 | 0.647587 |
| female | 314 | 0.352413 |
df_train[df_train.Age.isna()].sample(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
| 229 | 230 | 0 | 3 | Lefebre, Miss. Mathilde | female | NaN | 3 | 1 | 4133 | 25.4667 | NaN | S |
| 656 | 657 | 0 | 3 | Radeff, Mr. Alexander | male | NaN | 0 | 0 | 349223 | 7.8958 | NaN | S |
| 475 | 476 | 0 | 1 | Clifford, Mr. George Quincy | male | NaN | 0 | 0 | 110465 | 52.0000 | A14 | S |
| 180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
| 760 | 761 | 0 | 3 | Garfirth, Mr. John | male | NaN | 0 | 0 | 358585 | 14.5000 | NaN | S |
| 783 | 784 | 0 | 3 | Johnston, Mr. Andrew G | male | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 87 | 88 | 0 | 3 | Slocovski, Mr. Selman Francis | male | NaN | 0 | 0 | SOTON/OQ 392086 | 8.0500 | NaN | S |
| 55 | 56 | 1 | 1 | Woolner, Mr. Hugh | male | NaN | 0 | 0 | 19947 | 35.5000 | C52 | S |
| 65 | 66 | 1 | 3 | Moubarek, Master. Gerios | male | NaN | 1 | 1 | 2661 | 15.2458 | NaN | C |
SibSp
number of siblings / spouses aboard the Titanic
df_train.SibSp.sample(10)
170 0
410 0
533 0
316 1
7 3
395 0
635 0
855 0
844 0
2 0
Name: SibSp, dtype: int64
pd.concat([df_train.SibSp.value_counts(),
df_train.SibSp.value_counts(normalize=True)], axis = 1)
| SibSp | count | proportion |
|-------|------:|-----------:|
| 0 | 608 | 0.682379 |
| 1 | 209 | 0.234568 |
| 2 | 28 | 0.031425 |
| 4 | 18 | 0.020202 |
| 3 | 16 | 0.017957 |
| 8 | 7 | 0.007856 |
| 5 | 5 | 0.005612 |
Note
- About 70% of the passengers have no siblings or spouses accompanying them.
- If SibSp == 1, is it more likely a spouse than a sibling? If so, would the pair survive together or die together, unlike Jack and Rose :D
Parch
Number of parents / children aboard the Titanic
df_train.Parch.sample(10)
658 0
150 0
530 1
338 0
834 0
308 0
866 0
475 0
45 0
435 2
Name: Parch, dtype: int64
pd.concat([df_train.Parch.value_counts(),
df_train.Parch.value_counts(normalize=True)], axis = 1)
| Parch | count | proportion |
|-------|------:|-----------:|
| 0 | 678 | 0.760943 |
| 1 | 118 | 0.132435 |
| 2 | 80 | 0.089787 |
| 5 | 5 | 0.005612 |
| 3 | 5 | 0.005612 |
| 4 | 4 | 0.004489 |
| 6 | 1 | 0.001122 |
Ticket
Ticket number
df_train.Ticket.sample(10)
527 PC 17483
51 A/4. 39886
485 4133
755 250649
464 A/S 2816
404 315096
776 383121
95 374910
594 SC/AH 29037
157 SOTON/OQ 392090
Name: Ticket, dtype: object
df_train[['Embarked','Pclass', 'Ticket', 'Cabin']].sample(10)
| | Embarked | Pclass | Ticket | Cabin |
|---|---|---|---|---|
| 238 | S | 2 | 28665 | NaN |
| 280 | Q | 3 | 336439 | NaN |
| 533 | C | 3 | 2668 | NaN |
| 493 | C | 1 | PC 17609 | NaN |
| 813 | S | 3 | 347082 | NaN |
| 438 | S | 1 | 19950 | C23 C25 C27 |
| 321 | S | 3 | 349219 | NaN |
| 790 | Q | 3 | 12460 | NaN |
| 822 | S | 1 | 19972 | NaN |
| 672 | S | 2 | C.A. 24580 | NaN |
Note
- Ticket could help fill in the missing Cabin data if the ticket number signals the cabin in some way.
- It could also be dependent on Pclass and Embarked.
Fare
Passenger fare
df_train.Fare.sample(10)
511 8.0500
753 7.8958
460 26.5500
9 30.0708
99 26.0000
347 16.1000
827 37.0042
351 35.0000
177 28.7125
825 6.9500
Name: Fare, dtype: float64
df_train.groupby(['Pclass', 'Embarked']).agg({'Fare':'mean'})
| Pclass | Embarked | Fare |
|--------|----------|-----------:|
| 1 | C | 104.718529 |
| 1 | Q | 90.000000 |
| 1 | S | 70.364862 |
| 2 | C | 25.358335 |
| 2 | Q | 12.350000 |
| 2 | S | 20.327439 |
| 3 | C | 11.214083 |
| 3 | Q | 11.183393 |
| 3 | S | 14.644083 |
Note
- As expected, the class order in terms of fare is 1, 2 and then 3, but the port of embarkation also has an impact on the fare.
- However, it seems no new information relevant to our context can be extracted from this column.
Embarked
Port of Embarkation; C - Cherbourg, Q - Queenstown, S - Southampton
pd.concat([df_train.Embarked.value_counts(),
df_train.Embarked.value_counts(normalize = True)], axis = 1)
| Embarked | count | proportion |
|----------|------:|-----------:|
| S | 644 | 0.724409 |
| C | 168 | 0.188976 |
| Q | 77 | 0.086614 |
Note
- Embarked does not seem directly relevant to survival, but it could be useful for imputing other columns.
Columns with null values
Cabin
Cabin number
df_train.Cabin.sample(10)
57 NaN
72 NaN
348 NaN
631 NaN
422 NaN
689 B5
734 NaN
301 NaN
615 NaN
754 NaN
Name: Cabin, dtype: object
pd.concat([df_train.Cabin.isna().value_counts(),
df_train.Cabin.isna().value_counts(normalize=True)], axis = 1)
| Cabin | count | proportion |
|-------|------:|-----------:|
| True | 687 | 0.771044 |
| False | 204 | 0.228956 |
df_train[df_train.Cabin.notnull()].groupby('Pclass')['Cabin'].agg(' '.join)
Pclass
1 C85 C123 E46 C103 A6 C23 C25 C27 B78 D33 B30 C...
2 D56 F33 E101 F2 F4 F2 D E101 D F2 F33 D F33 F4...
3 G6 F G73 F E69 G6 G6 G6 E10 F G63 F G73 E121 F...
Name: Cabin, dtype: object
Note
- Cabins could hold key information about who survived, by virtue of how accessible the deck was from them.
- However, since a large portion of the information is unavailable (~77%), the column cannot be used unless encoded through another variable.
- Class could determine cabin allocation (the definition of class is separation), but what use is the attribute if class alone can define it? Needs exploration!
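One hedged way to salvage the sparse Cabin column is to keep only the deck letter. A minimal sketch, assuming the first character of Cabin encodes the deck; the Deck column is hypothetical and not used anywhere else in this notebook:

# First character of the cabin string (e.g. 'C85' -> 'C'); NaN stays NaN
df_train['Deck'] = df_train['Cabin'].str[0]
df_train.groupby('Deck')['Survived'].mean()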
Age
df_train.Age.sample(10)
354 NaN
210 24.0
711 NaN
11 58.0
395 22.0
629 NaN
722 34.0
193 3.0
300 NaN
794 25.0
Name: Age, dtype: float64
Data Preprocessing
Feature Engineering
Title
df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip()).value_counts()
Name
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
Countess 1
Capt 1
Ms 1
Sir 1
Lady 1
Mme 1
Don 1
Jonkheer 1
Name: count, dtype: int64
df_train['Title'] = df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())
df_test['Title'] = df_test['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())
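As an aside, the same title can be pulled out with a single regex, which is arguably sturdier than chained splits; a sketch assuming every Name follows the "Surname, Title. Given names" pattern (multi-word honorifics such as "the Countess" come out slightly differently than with the split approach):

# Capture everything between the comma and the first period
df_train['Name'].str.extract(r',\s*([^.]+)\.')[0].str.strip().value_counts()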
df_train.sample(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 637 | 638 | 0 | 2 | Collyer, Mr. Harvey | male | 31.0 | 1 | 1 | C.A. 31921 | 26.2500 | NaN | S | Mr |
| 697 | 698 | 1 | 3 | Mullens, Miss. Katherine "Katie" | female | NaN | 0 | 0 | 35852 | 7.7333 | NaN | Q | Miss |
| 806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0000 | A36 | S | Mr |
| 324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S | Mr |
| 835 | 836 | 1 | 1 | Compton, Miss. Sara Rebecca | female | 39.0 | 1 | 1 | PC 17756 | 83.1583 | E49 | C | Miss |
| 513 | 514 | 1 | 1 | Rothschild, Mrs. Martin (Elizabeth L. Barrett) | female | 54.0 | 1 | 0 | PC 17603 | 59.4000 | NaN | C | Mrs |
| 291 | 292 | 1 | 1 | Bishop, Mrs. Dickinson H (Helen Walton) | female | 19.0 | 1 | 0 | 11967 | 91.0792 | B49 | C | Mrs |
| 767 | 768 | 0 | 3 | Mangan, Miss. Mary | female | 30.5 | 0 | 0 | 364850 | 7.7500 | NaN | Q | Miss |
| 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S | Mr |
| 117 | 118 | 0 | 2 | Turpin, Mr. William John Robert | male | 29.0 | 1 | 0 | 11668 | 21.0000 | NaN | S | Mr |
df_train.groupby("Title").agg({'Age':['count', 'mean']}).sort_values(('Age', 'mean'), ascending = False)
| Title | Age count | Age mean |
|----------|----------:|----------:|
| Capt | 1 | 70.000000 |
| Col | 2 | 58.000000 |
| Sir | 1 | 49.000000 |
| Major | 2 | 48.500000 |
| Lady | 1 | 48.000000 |
| Rev | 6 | 43.166667 |
| Dr | 6 | 42.000000 |
| Don | 1 | 40.000000 |
| Jonkheer | 1 | 38.000000 |
| Mrs | 108 | 35.898148 |
| Countess | 1 | 33.000000 |
| Mr | 398 | 32.368090 |
| Ms | 1 | 28.000000 |
| Mlle | 2 | 24.000000 |
| Mme | 1 | 24.000000 |
| Miss | 146 | 21.773973 |
| Master | 36 | 4.574167 |
df_train[df_train.Age.isna()].Title.value_counts()
Title
Mr 119
Miss 36
Mrs 17
Master 4
Dr 1
Name: count, dtype: int64
df_test[df_test.Age.isna()].Title.value_counts()
Title
Mr 57
Miss 14
Mrs 10
Master 4
Ms 1
Name: count, dtype: int64
Note: Use the average age by Title to impute missing values in the Age column. This is preferred over imputing with the mean of the overall dataset.
Title mean age
Average age of all the passengers with a specific title.
Note: Could be useful for age imputation.
df_title_mean_age = df_train.groupby("Title").agg({'Age':'mean'}).reset_index().rename({'Age': 'Title_Mean_Age'}, axis = 1)
df_title_mean_age.sample(5)
| | Title | Title_Mean_Age |
|---|---|---:|
| 16 | Sir | 49.0 |
| 2 | Countess | 33.0 |
| 7 | Major | 48.5 |
| 0 | Capt | 70.0 |
| 6 | Lady | 48.0 |
df_train = df_train.merge(df_title_mean_age, how = 'left', on = 'Title')
df_test = df_test.merge(df_title_mean_age, how = 'left', on = 'Title')
df_train.sample(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Title_Mean_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 634 | 635 | 0 | 3 | Skoog, Miss. Mabel | female | 9.0 | 3 | 2 | 347088 | 27.9000 | NaN | S | Miss | 21.773973 |
| 232 | 233 | 0 | 2 | Sjostedt, Mr. Ernst Adolf | male | 59.0 | 0 | 0 | 237442 | 13.5000 | NaN | S | Mr | 32.368090 |
| 721 | 722 | 0 | 3 | Jensen, Mr. Svend Lauritz | male | 17.0 | 1 | 0 | 350048 | 7.0542 | NaN | S | Mr | 32.368090 |
| 266 | 267 | 0 | 3 | Panula, Mr. Ernesti Arvid | male | 16.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | Mr | 32.368090 |
| 650 | 651 | 0 | 3 | Mitkoff, Mr. Mito | male | NaN | 0 | 0 | 349221 | 7.8958 | NaN | S | Mr | 32.368090 |
| 524 | 525 | 0 | 3 | Kassem, Mr. Fared | male | NaN | 0 | 0 | 2700 | 7.2292 | NaN | C | Mr | 32.368090 |
| 311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | Miss | 21.773973 |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q | Mr | 32.368090 |
| 865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S | Mrs | 35.898148 |
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S | Mr | 32.368090 |
df_train[df_train.Age.isna()].sample()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Title_Mean_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 776 | 777 | 0 | 3 | Tobin, Mr. Roger | male | NaN | 0 | 0 | 383121 | 7.75 | F38 | Q | Mr | 32.36809 |
df_train['Age'] = df_train['Age'].fillna(df_train['Title_Mean_Age'])
df_test['Age'] = df_test['Age'].fillna(df_test['Title_Mean_Age'])
df_test.shape
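A quick sanity check is worthwhile here: a title that appears in df_test but not in df_train (for example "Dona") gets no Title_Mean_Age from the merge, so its Age could stay NaN. A minimal sketch of the check, with a hedged fallback to the overall training mean:

# Any ages still missing after the title-based imputation?
print(df_train['Age'].isna().sum(), df_test['Age'].isna().sum())
# Fall back to the overall training mean for titles unseen in df_train
df_test['Age'] = df_test['Age'].fillna(df_train['Age'].mean())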
Alone
Flag indicating that no siblings, spouses, children, or parents accompany the passenger. Note: this ignores friends or other companions.
print(f"The number of passengers who are travelling alone is \
{df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]} \
which is about {df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]/df_train.shape[0]*100:.2f}% of the total passengers.")
The number of passengers who are travelling alone is 537 which is about 60.27% of the total passengers.
df_train['Alone'] = (df_train.SibSp==0)&(df_train.Parch==0)
df_test['Alone'] = (df_test.SibSp==0)&(df_test.Parch==0)
df_train.sample(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | Title_Mean_Age | Alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 573 | 574 | 1 | 3 | Kelly, Miss. Mary | female | 21.773973 | 0 | 0 | 14312 | 7.7500 | NaN | Q | Miss | 21.773973 | True |
| 495 | 496 | 0 | 3 | Yousseff, Mr. Gerious | male | 32.368090 | 0 | 0 | 2627 | 14.4583 | NaN | C | Mr | 32.368090 | True |
| 472 | 473 | 1 | 2 | West, Mrs. Edwy Arthur (Ada Mary Worth) | female | 33.000000 | 1 | 2 | C.A. 34651 | 27.7500 | NaN | S | Mrs | 35.898148 | False |
| 741 | 742 | 0 | 1 | Cavendish, Mr. Tyrell William | male | 36.000000 | 1 | 0 | 19877 | 78.8500 | C46 | S | Mr | 32.368090 | False |
| 345 | 346 | 1 | 2 | Brown, Miss. Amelia "Mildred" | female | 24.000000 | 0 | 0 | 248733 | 13.0000 | F33 | S | Miss | 21.773973 | True |
| 566 | 567 | 0 | 3 | Stoytcheff, Mr. Ilia | male | 19.000000 | 0 | 0 | 349205 | 7.8958 | NaN | S | Mr | 32.368090 | True |
| 785 | 786 | 0 | 3 | Harmer, Mr. Abraham (David Lishin) | male | 25.000000 | 0 | 0 | 374887 | 7.2500 | NaN | S | Mr | 32.368090 | True |
| 174 | 175 | 0 | 1 | Smith, Mr. James Clinch | male | 56.000000 | 0 | 0 | 17764 | 30.6958 | A7 | C | Mr | 32.368090 | True |
| 474 | 475 | 0 | 3 | Strandberg, Miss. Ida Sofia | female | 22.000000 | 0 | 0 | 7553 | 9.8375 | NaN | S | Miss | 21.773973 | True |
| 512 | 513 | 1 | 1 | McGough, Mr. James Robert | male | 36.000000 | 0 | 0 | PC 17473 | 26.2875 | E25 | S | Mr | 32.368090 | True |
df_train.Alone.value_counts()
Alone
True 537
False 354
Name: count, dtype: int64
df_train.Survived.value_counts()
Survived
0 549
1 342
Name: count, dtype: int64
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)
| Alone | count |
|-------|------:|
| False | 179 |
| True | 163 |
pd.DataFrame(df_train.Alone.value_counts())
| Alone | count |
|-------|------:|
| True | 537 |
| False | 354 |
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)/pd.DataFrame(df_train.Alone.value_counts())
| Alone | count |
|-------|---------:|
| False | 0.505650 |
| True | 0.303538 |
Note
- This could be an important feature: the survival rate drops by about 20 percentage points (from ~51% to ~30%) for passengers travelling alone!
Age group
Use the Age column after filling null values with Title mean age
df_test.Age.info()
<class 'pandas.core.series.Series'>
RangeIndex: 418 entries, 0 to 417
Series name: Age
Non-Null Count Dtype
-------------- -----
418 non-null float64
dtypes: float64(1)
memory usage: 3.4 KB
# Define bins for the age ranges according to biological markers
bins = [0, 6, 13, 20, 36, 56, 76, float('inf')] # float('inf') for ages above 75
# Labels for the age groups
labels = ['0-5', '6-12', '13-19', '20-35', '36-55', '56-75', '76+']
# Create age categories
df_train['Age_Group'] = pd.cut(df_train['Age'], bins=bins, labels=labels, right=False)
df_test['Age_Group'] = pd.cut(df_test['Age'], bins=bins, labels=labels, right=False)
df_train.Age_Group.value_counts()
Age_Group
20-35 505
36-55 179
13-19 95
0-5 48
56-75 38
6-12 25
76+ 1
Name: count, dtype: int64
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Title', 'Alone', 'Age_Group']
cat_features = ['Sex', 'Title', 'Alone', 'Age_Group']
df_train[features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null object
2 SibSp 891 non-null int64
3 Parch 891 non-null int64
4 Title 891 non-null object
5 Alone 891 non-null bool
6 Age_Group 891 non-null category
dtypes: bool(1), category(1), int64(3), object(2)
memory usage: 37.0+ KB
Family Size
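This heading is left empty in the notebook; a minimal sketch of the feature it presumably intended, assuming family size = SibSp + Parch + the passenger themselves (FamilySize is a hypothetical column and is not used downstream):

# Siblings/spouses + parents/children + the passenger
df_train['FamilySize'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['FamilySize'] = df_test['SibSp'] + df_test['Parch'] + 1
df_train.groupby('FamilySize')['Survived'].mean()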
Encoding
# one hot encoding categorical columns
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoder.fit(df_train[cat_features])
train_encoded = encoder.transform(df_train[cat_features])
test_encoded = encoder.transform(df_test[cat_features])
# Convert encoded data back to DataFrame
df_train_encoded = pd.DataFrame(train_encoded, columns=encoder.get_feature_names_out())
df_test_encoded = pd.DataFrame(test_encoded, columns=encoder.get_feature_names_out())
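Note that only the one-hot-encoded categorical columns make it into df_train_encoded; the numeric members of the features list (Pclass, SibSp, Parch) are not carried over, and all results below were produced without them. A hedged sketch of how they could be re-attached, if desired:

# Re-attach the numeric features alongside the encoded ones
num_features = ['Pclass', 'SibSp', 'Parch']
df_train_encoded = pd.concat([df_train[num_features].reset_index(drop=True), df_train_encoded], axis=1)
df_test_encoded = pd.concat([df_test[num_features].reset_index(drop=True), df_test_encoded], axis=1)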
df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12',
'Age_Group_76+'],
dtype='object')
df_train_encoded['Survived'] = df_train['Survived']
df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12', 'Age_Group_76+',
'Survived'],
dtype='object')
df_train_encoded.sample(10)
| | Sex_female | Sex_male | Title_Capt | Title_Col | Title_Countess | Title_Don | Title_Dr | Title_Jonkheer | Title_Lady | Title_Major | ... | Alone_False | Alone_True | Age_Group_0-5 | Age_Group_13-19 | Age_Group_20-35 | Age_Group_36-55 | Age_Group_56-75 | Age_Group_6-12 | Age_Group_76+ | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 638 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 510 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 393 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 857 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 |
| 887 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 17 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 189 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 444 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 816 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 26 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |

10 rows × 29 columns
Model Building
Auto-ML using PyCaret
from pycaret.classification import setup, compare_models
# Assuming 'data' is your DataFrame and 'target' is the name of the target column
setup_data = setup(data=df_train_encoded, target='Survived', fold = 3, session_id=123)
| | Description | Value |
|---|---|---|
| 0 | Session id | 123 |
| 1 | Target | Survived |
| 2 | Target type | Binary |
| 3 | Original data shape | (891, 29) |
| 4 | Transformed data shape | (891, 29) |
| 5 | Transformed train set shape | (623, 29) |
| 6 | Transformed test set shape | (268, 29) |
| 7 | Numeric features | 28 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |
| 10 | Numeric imputation | mean |
| 11 | Categorical imputation | mode |
| 12 | Fold Generator | StratifiedKFold |
| 13 | Fold Number | 3 |
| 14 | CPU Jobs | -1 |
| 15 | Use GPU | False |
| 16 | Log Experiment | False |
| 17 | Experiment Name | clf-default-name |
| 18 | USI | 2335 |
best_model = compare_models()
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| lr | Logistic Regression | 0.8025 | 0.7862 | 0.7446 | 0.7421 | 0.7433 | 0.5829 | 0.5829 | 1.7533 |
| ridge | Ridge Classifier | 0.8025 | 0.7878 | 0.7363 | 0.7463 | 0.7411 | 0.5815 | 0.5817 | 0.0500 |
| lda | Linear Discriminant Analysis | 0.8009 | 0.7820 | 0.7321 | 0.7450 | 0.7383 | 0.5777 | 0.5779 | 0.0767 |
| ada | Ada Boost Classifier | 0.7945 | 0.7753 | 0.7279 | 0.7343 | 0.7309 | 0.5647 | 0.5649 | 0.1600 |
| lightgbm | Light Gradient Boosting Machine | 0.7881 | 0.7869 | 0.7111 | 0.7297 | 0.7202 | 0.5497 | 0.5499 | 0.5033 |
| gbc | Gradient Boosting Classifier | 0.7865 | 0.7766 | 0.7069 | 0.7283 | 0.7174 | 0.5458 | 0.5461 | 0.3533 |
| rf | Random Forest Classifier | 0.7849 | 0.7790 | 0.6944 | 0.7311 | 0.7123 | 0.5407 | 0.5411 | 0.2467 |
| et | Extra Trees Classifier | 0.7801 | 0.7732 | 0.6777 | 0.7299 | 0.7027 | 0.5286 | 0.5297 | 0.3300 |
| xgboost | Extreme Gradient Boosting | 0.7800 | 0.7814 | 0.6944 | 0.7219 | 0.7079 | 0.5316 | 0.5319 | 0.1833 |
| svm | SVM - Linear Kernel | 0.7785 | 0.7883 | 0.7405 | 0.7027 | 0.7193 | 0.5369 | 0.5393 | 0.0500 |
| dt | Decision Tree Classifier | 0.7769 | 0.7710 | 0.6777 | 0.7235 | 0.6996 | 0.5225 | 0.5234 | 0.0400 |
| knn | K Neighbors Classifier | 0.7672 | 0.7734 | 0.6234 | 0.7317 | 0.6715 | 0.4934 | 0.4984 | 0.0733 |
| qda | Quadratic Discriminant Analysis | 0.7192 | 0.7699 | 0.7361 | 0.6423 | 0.6740 | 0.4358 | 0.4498 | 0.0500 |
| dummy | Dummy Classifier | 0.6164 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0400 |
| nb | Naive Bayes | 0.4866 | 0.7688 | 0.6544 | 0.5382 | 0.4083 | 0.0352 | 0.0885 | 0.0433 |
from pycaret.classification import tune_model
tuned_model = tune_model(best_model, fold=10)
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|---------:|-------:|-------:|------:|-------:|-------:|-------:|
| 0 | 0.7937 | 0.7618 | 0.7083 | 0.7391 | 0.7234 | 0.5590 | 0.5593 |
| 1 | 0.8571 | 0.8109 | 0.7917 | 0.8261 | 0.8085 | 0.6947 | 0.6951 |
| 2 | 0.9048 | 0.9054 | 0.9583 | 0.8214 | 0.8846 | 0.8043 | 0.8113 |
| 3 | 0.7903 | 0.7592 | 0.6522 | 0.7500 | 0.6977 | 0.5384 | 0.5415 |
| 4 | 0.7419 | 0.7577 | 0.5833 | 0.7000 | 0.6364 | 0.4389 | 0.4433 |
| 5 | 0.7419 | 0.7621 | 0.7083 | 0.6538 | 0.6800 | 0.4644 | 0.4654 |
| 6 | 0.8871 | 0.8805 | 0.8333 | 0.8696 | 0.8511 | 0.7602 | 0.7607 |
| 7 | 0.7258 | 0.7484 | 0.7083 | 0.6296 | 0.6667 | 0.4352 | 0.4373 |
| 8 | 0.7097 | 0.6727 | 0.5417 | 0.6500 | 0.5909 | 0.3688 | 0.3725 |
| 9 | 0.8387 | 0.8355 | 0.8750 | 0.7500 | 0.8077 | 0.6702 | 0.6761 |
| Mean | 0.7991 | 0.7894 | 0.7361 | 0.7390 | 0.7347 | 0.5734 | 0.5762 |
| Std | 0.0662 | 0.0656 | 0.1232 | 0.0777 | 0.0929 | 0.1431 | 0.1434 |
Fitting 10 folds for each of 10 candidates, totalling 100 fits
predictions = tuned_model.predict(df_test_encoded)
df_test['Survived'] = predictions
df_submission = df_test[['PassengerId', 'Survived']]
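Calling .predict on the estimator works here because tune_model returns a fitted scikit-learn-style model, but PyCaret's own scoring path is predict_model; a sketch, assuming PyCaret 3.x where the predictions land in a prediction_label column:

from pycaret.classification import predict_model
preds = predict_model(tuned_model, data=df_test_encoded)
df_test['Survived'] = preds['prediction_label'].values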
df_submission.to_csv('/kaggle/working/submission.csv', index=False)
17 May 2025

- Play around with the “gamma” regularisation parameter. Gamma adds a complexity cost to the XGBoost trees: a tree only grows deeper if the gain associated with expanding it exceeds gamma (also known as “pruning”). The higher the gamma, the stronger the regularisation.
- Use the “subsample” parameter to train each XGBoost tree on a subset of the data. If subsample is set to 0.9, each tree is trained on a random 90% of the training data.
- Use the “colsample_bytree” parameter to train each XGBoost tree on a random subset of the features. If you set colsample_bytree = 0.9, you randomly drop 10% of the features when building each tree.
- Use early stopping. As you keep adding estimators (i.e. trees) to better fit the training data, you will start to overfit. early_stopping_rounds stops XGBoost from adding trees once its performance on the validation set has stopped improving for a given number of rounds.
- Play around with the lambda and alpha regularisation parameters, which correspond to L2 and L1 regularisation. (A combined sketch follows this list.)
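Putting the knobs above together, a minimal sketch of a regularised XGBoost classifier. The hyperparameter values are illustrative only, X and y stand in for whatever feature matrix and target you have, and note that early_stopping_rounds moved into the constructor in recent xgboost releases (older versions take it as a fit() argument).

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

# X, y: your feature matrix and binary target (placeholders)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(
    n_estimators=1000,         # generous upper bound; early stopping picks the real number
    max_depth=4,               # shallower trees = simpler model, stronger regularisation
    gamma=1.0,                 # minimum gain required to split further (pruning)
    subsample=0.9,             # each tree sees a random 90% of the rows
    colsample_bytree=0.9,      # each tree sees a random 90% of the features
    reg_lambda=1.0,            # L2 penalty on leaf weights
    reg_alpha=0.0,             # L1 penalty on leaf weights
    early_stopping_rounds=50,  # stop once the validation score stalls for 50 rounds
    eval_metric='logloss',
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)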
Bonus: as with random forests, you can play around with the tree parameters as well:
- Adjust the “max_depth” parameter of the trees; the lower the max depth, the simpler your algorithm and the stronger the regularisation.
- Adjust the “min_samples_leaf” parameter to set a minimum number of samples per leaf; the higher the value, the stronger the regularisation.
- Consider dispensing with XGBoost altogether when the dataset is fairly small and the relationships are quite simple/linear; using it there would, in most cases, be fitting a square peg into a round hole.
- Another non-hyperparameter strategy: bucket variables so the algorithm is less able to slice and dice continuous or ordinal variables and create overly complex relationships.
- And the best algorithm for tabular data is the one you have actually tried and cross-validated to be the best, by whatever metric makes the most sense for your specific dataset and business goal.
- XGBoost is not the best one for tabular data, it is just the well-known one. CatBoost outperforms XGBoost in many ways. On the topic of overfitting, CatBoost hardly overfits, and it works perfectly well on numeric features. Of course, the more complicated an ML model is, the more time someone needs to master it.
- The good news is that, because of the bagging method, overfitting is very hard with random forests.
- I love xgboost, but I find that if you are good at parameter tuning (using a good chunk of the techniques outlined above), lightgbm performs nearly as well for a huge increase in speed. Further, EvoTrees does both better with proper tuning.
- However, I have been resold on simple linear regression recently. If you can modify the features to encode some of the nonlinearities, linear regression (or more properly “geodesic regression”) outperforms everything.