
Revisiting the Titanic with PyCaret


! ls /kaggle/input/titanic
gender_submission.csv  test.csv  train.csv
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
data_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data_files.append(os.path.join(dirname, filename))
print(data_files)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
['/kaggle/input/titanic/train.csv', '/kaggle/input/titanic/test.csv', '/kaggle/input/titanic/gender_submission.csv']


Exploratory Data Analysis


[Figure: route.png (the Titanic's route)]

Files

  1. df_train - development dataset
  2. df_test - submission dataset
  3. df_gender - sample submission in which all female passengers are assumed to have survived; this naive approach scores about 76% accuracy!
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')
df_test = pd.read_csv('/kaggle/input/titanic/test.csv')
df_gender = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

print(df_train.shape, df_test.shape, df_gender.shape)
(891, 12) (418, 11) (418, 2)
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Observations

  • ID column : PassengerId
  • Target (DV) column : Survived {0, 1}
  • Non-null IDV columns (both sets) : Pclass, Name, Sex, SibSp, Parch, Ticket
  • IDV columns with some null values : Age and Cabin (both sets), Embarked (2 nulls in train only), Fare (1 null in test only)
df_train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Columns without any null values

Pclass

Ticket Class

df_train.Pclass.sample(10)
286    3
353    3
693    3
152    3
871    1
183    2
199    2
162    3
453    1
455    3
Name: Pclass, dtype: int64

What is the distribution of the class of tickets?

pd.concat([df_train.Pclass.value_counts(),
           df_train.Pclass.value_counts(normalize=True)], axis = 1)
count proportion
Pclass
3 491 0.551066
1 216 0.242424
2 184 0.206510
class_survivors = df_train.groupby('Pclass').agg({'Survived':['count', 'sum']})

class_survivors
Survived
count sum
Pclass
1 216 136
2 184 87
3 491 119
class_survivors[('Survived', 'sum')]/class_survivors[('Survived', 'count')]
Pclass
1    0.629630
2    0.472826
3    0.242363
dtype: float64

Name

df_train.Name.sample(10)
5                                       Moran, Mr. James
477                            Braund, Mr. Lewis Richard
434                            Silvey, Mr. William Baird
835                          Compton, Miss. Sara Rebecca
862    Swift, Mrs. Frederick Joel (Margaret Welles Ba...
709    Moubarek, Master. Halim Gonios ("William George")
210                                       Ali, Mr. Ahmed
409                                   Lefebre, Miss. Ida
329                         Hippach, Miss. Jean Gertrude
599         Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")
Name: Name, dtype: object

Note

  1. Title could be extracted from the name.
  2. Is there some significance, in the survival context, to names carrying a parenthesised alternative name? (A quick check follows this list.)
  3. Are there important family names that can be extracted to signal importance for survival?
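
A quick check on note 2; a minimal sketch, assuming the alternative name is recorded in parentheses within Name:

# Survival rate for names with vs. without a parenthesised alternative name.
has_alt_name = df_train['Name'].str.contains('(', regex=False)
df_train.groupby(has_alt_name)['Survived'].mean()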

Sex

df_train.Sex.sample(10)
334    female
29       male
623      male
616      male
398      male
0        male
258    female
388      male
811      male
405      male
Name: Sex, dtype: object

What is the distribution of passenger sex?

pd.concat([df_train.Sex.value_counts(),
           df_train.Sex.value_counts(normalize=True)], axis = 1)
count proportion
Sex
male 577 0.647587
female 314 0.352413
df_train[df_train.Age.isna()].sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
229 230 0 3 Lefebre, Miss. Mathilde female NaN 3 1 4133 25.4667 NaN S
656 657 0 3 Radeff, Mr. Alexander male NaN 0 0 349223 7.8958 NaN S
475 476 0 1 Clifford, Mr. George Quincy male NaN 0 0 110465 52.0000 A14 S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.5500 NaN S
760 761 0 3 Garfirth, Mr. John male NaN 0 0 358585 14.5000 NaN S
783 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 NaN S
87 88 0 3 Slocovski, Mr. Selman Francis male NaN 0 0 SOTON/OQ 392086 8.0500 NaN S
55 56 1 1 Woolner, Mr. Hugh male NaN 0 0 19947 35.5000 C52 S
65 66 1 3 Moubarek, Master. Gerios male NaN 1 1 2661 15.2458 NaN C

SibSp

Number of siblings / spouses aboard the Titanic

df_train.SibSp.sample(10)
170    0
410    0
533    0
316    1
7      3
395    0
635    0
855    0
844    0
2      0
Name: SibSp, dtype: int64
pd.concat([df_train.SibSp.value_counts(),
           df_train.SibSp.value_counts(normalize=True)], axis = 1)
count proportion
SibSp
0 608 0.682379
1 209 0.234568
2 28 0.031425
4 18 0.020202
3 16 0.017957
8 7 0.007856
5 5 0.005612

Note

  1. About 68% of the passengers have no siblings or spouses accompanying them.
  2. If SibSp == 1, is it more likely a spouse than a sibling? If so, would they survive together or die together, unlike Jack and Rose :D (Survival rate by SibSp is sketched below.)
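
Survival rate by SibSp can be eyeballed in one line; a minimal sketch using only columns already loaded:

# Mean of the 0/1 Survived flag per SibSp value = survival rate.
df_train.groupby('SibSp')['Survived'].mean()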

Parch

Number of parents / children aboard the Titanic

df_train.Parch.sample(10)
658    0
150    0
530    1
338    0
834    0
308    0
866    0
475    0
45     0
435    2
Name: Parch, dtype: int64
pd.concat([df_train.Parch.value_counts(),
           df_train.Parch.value_counts(normalize=True)], axis = 1)
count proportion
Parch
0 678 0.760943
1 118 0.132435
2 80 0.089787
5 5 0.005612
3 5 0.005612
4 4 0.004489
6 1 0.001122

Ticket

Ticket number

df_train.Ticket.sample(10)
527           PC 17483
51          A/4. 39886
485               4133
755             250649
464           A/S 2816
404             315096
776             383121
95              374910
594        SC/AH 29037
157    SOTON/OQ 392090
Name: Ticket, dtype: object
df_train[['Embarked','Pclass', 'Ticket', 'Cabin']].sample(10)
Embarked Pclass Ticket Cabin
238 S 2 28665 NaN
280 Q 3 336439 NaN
533 C 3 2668 NaN
493 C 1 PC 17609 NaN
813 S 3 347082 NaN
438 S 1 19950 C23 C25 C27
321 S 3 349219 NaN
790 Q 3 12460 NaN
822 S 1 19972 NaN
672 S 2 C.A. 24580 NaN

Note

  1. Passengers sharing a ticket may share cabin information, so Ticket could help fill in missing Cabin values. (A rough check follows this list.)
  2. Ticket could also depend on Pclass and Embarked.
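
A rough check on note 1; a sketch that counts tickets where Cabin is recorded for some holders but missing for others:

# True where a shared ticket has both known and missing Cabin values;
# such tickets hint that Ticket could help impute Cabin.
mixed_cabin = df_train.groupby('Ticket')['Cabin'].agg(
    lambda s: s.notna().any() and s.isna().any())
mixed_cabin.sum()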

Fare

Passenger fare

df_train.Fare.sample(10)
511     8.0500
753     7.8958
460    26.5500
9      30.0708
99     26.0000
347    16.1000
827    37.0042
351    35.0000
177    28.7125
825     6.9500
Name: Fare, dtype: float64
df_train.groupby(['Pclass', 'Embarked']).agg({'Fare':'mean'})
Fare
Pclass Embarked
1 C 104.718529
Q 90.000000
S 70.364862
2 C 25.358335
Q 12.350000
S 20.327439
3 C 11.214083
Q 11.183393
S 14.644083

Note

  1. As expected, the class order in terms of mean fare is 1, 2 and then 3, but where the passenger embarked also has an impact on the fare.
  2. However, it seems no new information in our context can be extracted from this column.

Embarked

Port of Embarkation; C -Cherbourg, Q - Queenstown, S - Southampton

pd.concat([df_train.Embarked.value_counts(),
           df_train.Embarked.value_counts(normalize = True)], axis = 1)
count proportion
Embarked
S 644 0.724409
C 168 0.188976
Q 77 0.086614

Note

  1. Embarked does not look directly relevant to survival, but it could be useful for imputing other columns.

Columns with null values

Cabin

Cabin number

df_train.Cabin.sample(10)
57     NaN
72     NaN
348    NaN
631    NaN
422    NaN
689     B5
734    NaN
301    NaN
615    NaN
754    NaN
Name: Cabin, dtype: object
pd.concat([df_train.Cabin.isna().value_counts(),
           df_train.Cabin.isna().value_counts(normalize=True)], axis = 1)
count proportion
Cabin
True 687 0.771044
False 204 0.228956
df_train[df_train.Cabin.notnull()].groupby('Pclass')['Cabin'].agg(' '.join)
Pclass
1    C85 C123 E46 C103 A6 C23 C25 C27 B78 D33 B30 C...
2    D56 F33 E101 F2 F4 F2 D E101 D F2 F33 D F33 F4...
3    G6 F G73 F E69 G6 G6 G6 E10 F G63 F G73 E121 F...
Name: Cabin, dtype: object

Note

  1. Cabins could hold key information on who survived, by virtue of how accessible they are to the deck.
  2. However, since a large portion of the information is unavailable (~77%), it cannot be used unless encoded through another variable.
  3. Pclass could drive cabin allocation (the definition of class is separation), but the attribute is of little use if Pclass alone can define it. Needs exploration! (A deck-letter sketch follows this list.)
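
A deck-letter sketch for notes 1 and 3; kept in a local variable so the dataframe is untouched:

# The first character of Cabin is the deck (NaN where Cabin is missing).
deck = df_train['Cabin'].str[0]
df_train.groupby(deck)['Survived'].mean()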

Age

df_train.Age.sample(10)
354     NaN
210    24.0
711     NaN
11     58.0
395    22.0
629     NaN
722    34.0
193     3.0
300     NaN
794    25.0
Name: Age, dtype: float64

Data Preprocessing

Feature Engineering

Title

df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip()).value_counts()
Name
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
df_train['Title'] = df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())
df_test['Title'] = df_test['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
637 638 0 2 Collyer, Mr. Harvey male 31.0 1 1 C.A. 31921 26.2500 NaN S Mr
697 698 1 3 Mullens, Miss. Katherine "Katie" female NaN 0 0 35852 7.7333 NaN Q Miss
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0000 A36 S Mr
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.5500 NaN S Mr
835 836 1 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C Miss
513 514 1 1 Rothschild, Mrs. Martin (Elizabeth L. Barrett) female 54.0 1 0 PC 17603 59.4000 NaN C Mrs
291 292 1 1 Bishop, Mrs. Dickinson H (Helen Walton) female 19.0 1 0 11967 91.0792 B49 C Mrs
767 768 0 3 Mangan, Miss. Mary female 30.5 0 0 364850 7.7500 NaN Q Miss
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S Mr
117 118 0 2 Turpin, Mr. William John Robert male 29.0 1 0 11668 21.0000 NaN S Mr
df_train.groupby("Title").agg({'Age':['count', 'mean']}).sort_values(('Age', 'mean'), ascending = False)
Age
count mean
Title
Capt 1 70.000000
Col 2 58.000000
Sir 1 49.000000
Major 2 48.500000
Lady 1 48.000000
Rev 6 43.166667
Dr 6 42.000000
Don 1 40.000000
Jonkheer 1 38.000000
Mrs 108 35.898148
Countess 1 33.000000
Mr 398 32.368090
Ms 1 28.000000
Mlle 2 24.000000
Mme 1 24.000000
Miss 146 21.773973
Master 36 4.574167
df_train[df_train.Age.isna()].Title.value_counts()
Title
Mr        119
Miss       36
Mrs        17
Master      4
Dr          1
Name: count, dtype: int64
df_test[df_test.Age.isna()].Title.value_counts()
Title
Mr        57
Miss      14
Mrs       10
Master     4
Ms         1
Name: count, dtype: int64

Note: use the average age by Title to fill missing values in the Age column. This method is preferred over imputing with the mean of the overall dataset.

Title mean age

Average age of all the passengers with a specific title. Note: could be useful for Age imputation.

df_title_mean_age = df_train.groupby("Title").agg({'Age':'mean'}).reset_index().rename({'Age': 'Title_Mean_Age'}, axis = 1)

df_title_mean_age.sample(5)
Title Title_Mean_Age
16 Sir 49.0
2 Countess 33.0
7 Major 48.5
0 Capt 70.0
6 Lady 48.0
df_train = df_train.merge(df_title_mean_age, how = 'left', on = 'Title')
df_test = df_test.merge(df_title_mean_age, how = 'left', on = 'Title')

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age
634 635 0 3 Skoog, Miss. Mabel female 9.0 3 2 347088 27.9000 NaN S Miss 21.773973
232 233 0 2 Sjostedt, Mr. Ernst Adolf male 59.0 0 0 237442 13.5000 NaN S Mr 32.368090
721 722 0 3 Jensen, Mr. Svend Lauritz male 17.0 1 0 350048 7.0542 NaN S Mr 32.368090
266 267 0 3 Panula, Mr. Ernesti Arvid male 16.0 4 1 3101295 39.6875 NaN S Mr 32.368090
650 651 0 3 Mitkoff, Mr. Mito male NaN 0 0 349221 7.8958 NaN S Mr 32.368090
524 525 0 3 Kassem, Mr. Fared male NaN 0 0 2700 7.2292 NaN C Mr 32.368090
311 312 1 1 Ryerson, Miss. Emily Borie female 18.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C Miss 21.773973
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q Mr 32.368090
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S Mrs 35.898148
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S Mr 32.368090
df_train[df_train.Age.isna()].sample()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age
776 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.75 F38 Q Mr 32.36809
df_train['Age'] = df_train['Age'].fillna(df_train['Title_Mean_Age'])
df_test['Age'] = df_test['Age'].fillna(df_test['Title_Mean_Age'])

df_test.shape
(418, 13)

Alone

Flag set when no siblings, spouses, parents, or children are accompanying the passenger. Note: this is ignorant of friends or other companions.

print(f"The number of passengers who are travelling alone is \
{df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]} \
which is about {df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]/df_train.shape[0]*100:.2f}% of the total passengers.")
The number of passengers who are travelling alone is 537 which is about 60.27% of the total passengers.
df_train['Alone'] = (df_train.SibSp==0)&(df_train.Parch==0)
df_test['Alone'] = (df_test.SibSp==0)&(df_test.Parch==0)

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age Alone
573 574 1 3 Kelly, Miss. Mary female 21.773973 0 0 14312 7.7500 NaN Q Miss 21.773973 True
495 496 0 3 Yousseff, Mr. Gerious male 32.368090 0 0 2627 14.4583 NaN C Mr 32.368090 True
472 473 1 2 West, Mrs. Edwy Arthur (Ada Mary Worth) female 33.000000 1 2 C.A. 34651 27.7500 NaN S Mrs 35.898148 False
741 742 0 1 Cavendish, Mr. Tyrell William male 36.000000 1 0 19877 78.8500 C46 S Mr 32.368090 False
345 346 1 2 Brown, Miss. Amelia "Mildred" female 24.000000 0 0 248733 13.0000 F33 S Miss 21.773973 True
566 567 0 3 Stoytcheff, Mr. Ilia male 19.000000 0 0 349205 7.8958 NaN S Mr 32.368090 True
785 786 0 3 Harmer, Mr. Abraham (David Lishin) male 25.000000 0 0 374887 7.2500 NaN S Mr 32.368090 True
174 175 0 1 Smith, Mr. James Clinch male 56.000000 0 0 17764 30.6958 A7 C Mr 32.368090 True
474 475 0 3 Strandberg, Miss. Ida Sofia female 22.000000 0 0 7553 9.8375 NaN S Miss 21.773973 True
512 513 1 1 McGough, Mr. James Robert male 36.000000 0 0 PC 17473 26.2875 E25 S Mr 32.368090 True
df_train.Alone.value_counts()
Alone
True     537
False    354
Name: count, dtype: int64
df_train.Survived.value_counts()
Survived
0    549
1    342
Name: count, dtype: int64
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)
count
Alone
False 179
True 163
pd.DataFrame(df_train.Alone.value_counts())
count
Alone
True 537
False 354
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)/pd.DataFrame(df_train.Alone.value_counts())
count
Alone
False 0.505650
True 0.303538

Note

  1. This could be an important feature: the survival rate drops from about 51% to about 30% when a passenger is travelling alone! (A one-line version of the computation follows.)
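
The three cells above collapse into one line, since the mean of a 0/1 flag is the survival rate:

df_train.groupby('Alone')['Survived'].mean()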

Age group

Use the Age column after filling null values with Title mean age

df_test.Age.info()
<class 'pandas.core.series.Series'>
RangeIndex: 418 entries, 0 to 417
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
418 non-null    float64
dtypes: float64(1)
memory usage: 3.4 KB
# Define bins for the age ranges according to biological markers
bins = [0, 6, 13, 20, 36, 56, 76, float('inf')]  # float('inf') captures ages 76 and above

# Labels for the age groups
labels = ['0-5', '6-12', '13-19', '20-35', '36-55', '56-75', '76+']

# Create age categories
df_train['Age_Group'] = pd.cut(df_train['Age'], bins=bins, labels=labels, right=False)
df_test['Age_Group'] = pd.cut(df_test['Age'], bins=bins, labels=labels, right=False)
df_train.Age_Group.value_counts()
Age_Group
20-35    505
36-55    179
13-19     95
0-5       48
56-75     38
6-12      25
76+        1
Name: count, dtype: int64
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Title', 'Alone', 'Age_Group']
cat_features = ['Sex', 'Title', 'Alone', 'Age_Group']

df_train[features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   Pclass     891 non-null    int64   
 1   Sex        891 non-null    object  
 2   SibSp      891 non-null    int64   
 3   Parch      891 non-null    int64   
 4   Title      891 non-null    object  
 5   Alone      891 non-null    bool    
 6   Age_Group  891 non-null    category
dtypes: bool(1), category(1), int64(3), object(2)
memory usage: 37.0+ KB

Family Size


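A common construction for this feature; a sketch only, not used downstream in this notebook:

# Family size = the passenger plus siblings/spouses plus parents/children.
family_size = df_train['SibSp'] + df_train['Parch'] + 1
df_train.groupby(family_size)['Survived'].mean()
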
Encoding

# one hot encoding categorical columns

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoder.fit(df_train[cat_features])
train_encoded = encoder.transform(df_train[cat_features])
test_encoded = encoder.transform(df_test[cat_features])
# Convert encoded data back to DataFrame
df_train_encoded = pd.DataFrame(train_encoded, columns=encoder.get_feature_names_out())
df_test_encoded = pd.DataFrame(test_encoded, columns=encoder.get_feature_names_out())
df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
       'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
       'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
       'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
       'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
       'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12',
       'Age_Group_76+'],
      dtype='object')
df_train_encoded['Survived'] = df_train['Survived']

df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
       'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
       'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
       'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
       'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
       'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12', 'Age_Group_76+',
       'Survived'],
      dtype='object')
df_train_encoded.sample(10)
Sex_female Sex_male Title_Capt Title_Col Title_Countess Title_Don Title_Dr Title_Jonkheer Title_Lady Title_Major ... Alone_False Alone_True Age_Group_0-5 Age_Group_13-19 Age_Group_20-35 Age_Group_36-55 Age_Group_56-75 Age_Group_6-12 Age_Group_76+ Survived
638 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0
510 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
393 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
857 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
887 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1
17 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
189 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0
444 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
816 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0
26 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0

10 rows × 29 columns

Model Building

Auto-ML using PyCaret

from pycaret.classification import setup, compare_models

# Assuming 'data' is your DataFrame and 'target' is the name of the target column
setup_data = setup(data=df_train_encoded, target='Survived', fold = 3, session_id=123)

  Description Value
0 Session id 123
1 Target Survived
2 Target type Binary
3 Original data shape (891, 29)
4 Transformed data shape (891, 29)
5 Transformed train set shape (623, 29)
6 Transformed test set shape (268, 29)
7 Numeric features 28
8 Preprocess True
9 Imputation type simple
10 Numeric imputation mean
11 Categorical imputation mode
12 Fold Generator StratifiedKFold
13 Fold Number 3
14 CPU Jobs -1
15 Use GPU False
16 Log Experiment False
17 Experiment Name clf-default-name
18 USI 2335
best_model = compare_models()
  Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
lr Logistic Regression 0.8025 0.7862 0.7446 0.7421 0.7433 0.5829 0.5829 1.7533
ridge Ridge Classifier 0.8025 0.7878 0.7363 0.7463 0.7411 0.5815 0.5817 0.0500
lda Linear Discriminant Analysis 0.8009 0.7820 0.7321 0.7450 0.7383 0.5777 0.5779 0.0767
ada Ada Boost Classifier 0.7945 0.7753 0.7279 0.7343 0.7309 0.5647 0.5649 0.1600
lightgbm Light Gradient Boosting Machine 0.7881 0.7869 0.7111 0.7297 0.7202 0.5497 0.5499 0.5033
gbc Gradient Boosting Classifier 0.7865 0.7766 0.7069 0.7283 0.7174 0.5458 0.5461 0.3533
rf Random Forest Classifier 0.7849 0.7790 0.6944 0.7311 0.7123 0.5407 0.5411 0.2467
et Extra Trees Classifier 0.7801 0.7732 0.6777 0.7299 0.7027 0.5286 0.5297 0.3300
xgboost Extreme Gradient Boosting 0.7800 0.7814 0.6944 0.7219 0.7079 0.5316 0.5319 0.1833
svm SVM - Linear Kernel 0.7785 0.7883 0.7405 0.7027 0.7193 0.5369 0.5393 0.0500
dt Decision Tree Classifier 0.7769 0.7710 0.6777 0.7235 0.6996 0.5225 0.5234 0.0400
knn K Neighbors Classifier 0.7672 0.7734 0.6234 0.7317 0.6715 0.4934 0.4984 0.0733
qda Quadratic Discriminant Analysis 0.7192 0.7699 0.7361 0.6423 0.6740 0.4358 0.4498 0.0500
dummy Dummy Classifier 0.6164 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0400
nb Naive Bayes 0.4866 0.7688 0.6544 0.5382 0.4083 0.0352 0.0885 0.0433
from pycaret.classification import tune_model

tuned_model = tune_model(best_model, fold=10)
  Accuracy AUC Recall Prec. F1 Kappa MCC
Fold              
0 0.7937 0.7618 0.7083 0.7391 0.7234 0.5590 0.5593
1 0.8571 0.8109 0.7917 0.8261 0.8085 0.6947 0.6951
2 0.9048 0.9054 0.9583 0.8214 0.8846 0.8043 0.8113
3 0.7903 0.7592 0.6522 0.7500 0.6977 0.5384 0.5415
4 0.7419 0.7577 0.5833 0.7000 0.6364 0.4389 0.4433
5 0.7419 0.7621 0.7083 0.6538 0.6800 0.4644 0.4654
6 0.8871 0.8805 0.8333 0.8696 0.8511 0.7602 0.7607
7 0.7258 0.7484 0.7083 0.6296 0.6667 0.4352 0.4373
8 0.7097 0.6727 0.5417 0.6500 0.5909 0.3688 0.3725
9 0.8387 0.8355 0.8750 0.7500 0.8077 0.6702 0.6761
Mean 0.7991 0.7894 0.7361 0.7390 0.7347 0.5734 0.5762
Std 0.0662 0.0656 0.1232 0.0777 0.0929 0.1431 0.1434


Fitting 10 folds for each of 10 candidates, totalling 100 fits
predictions = tuned_model.predict(df_test_encoded)
df_test['Survived'] = predictions
df_submission = df_test[['PassengerId', 'Survived']]
df_submission.shape
(418, 2)

df_submission.to_csv('/kaggle/working/submission.csv', index=False)

XGBoost_Study



  1. Play around with the “gamma” regularisation parameter. Gamma adds a complexity cost to the XGBoost trees: a tree only grows deeper if the gain from expanding it is > gamma (also known as “pruning”). The higher the gamma, the stronger the regularisation.
  2. You can use the “subsample” parameter to train each XGBoost tree on a subset of the data. If subsample is set to 0.9, each tree will be trained on 90% of the training data.
  3. Use the “colsample_bytree” parameter where again you would train each XGBoost tree on a random subset of your features. If you set colsample_bytree = 0.9, you would randomly remove 10% of the features when building each tree.
  4. Use early stopping. As you keep adding estimators (i.e. trees) to better fit the training data, you will start to overfit. early_stopping_rounds stops XGBoost from adding trees once performance on the validation set has not improved for that many rounds.
  5. Play around with the lambda and alpha regularisation parameters, which correspond to L2 and L1 regularisation. (A combined sketch follows this list.)
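
A sketch pulling these five knobs together. The values are illustrative, not tuned; X_train, y_train, X_valid, y_valid are placeholder splits, and passing early_stopping_rounds to the constructor assumes xgboost >= 1.6 with the scikit-learn wrapper:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,         # generous cap; early stopping trims it
    gamma=1.0,                 # minimum gain required to split (pruning)
    subsample=0.9,             # each tree sees 90% of the rows
    colsample_bytree=0.9,      # each tree sees 90% of the features
    reg_lambda=1.0,            # L2 penalty on leaf weights
    reg_alpha=0.0,             # L1 penalty on leaf weights
    early_stopping_rounds=50,  # stop once validation stalls for 50 rounds
    eval_metric='logloss',
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)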

Bonus, like for random forest you can play around with the tree parameters as well:

  1. Adjust the “max_depth” parameter of the trees: the lower the max depth, the simpler the model and the stronger the regularisation.

  2. Adjust the min_samples_leaf parameter to set a minimum number of samples per leaf. The higher the value, the stronger the regularisation.

  3. Consider dispensing with XGBoost altogether when the dataset is fairly small and the relationships are quite simple/linear. Using it there would, in most cases, be fitting a square peg into a round hole.

  4. Another non-hyperparameter strategy: bucket variables so the algo cannot slice and dice continuous or ordinal variables too finely and create overly complex relationships.

  5. And the best algo for tabular data is the one you’ve actually tried and cross-validated to be the best by whatever metric makes the most sense for your specific dataset and business goal.

  6. XGBoost is not the best model for tabular data, just the best-known one. CatBoost outperforms XGBoost in many ways. On the topic of overfitting, CatBoost hardly overfits, and it works perfectly well on numeric features. Of course, the more complicated an ML model is, the more time someone needs to master it.

  7. The good news is that, because of the bagging method with random forests, overfitting is very hard?

  8. I love XGBoost, but I find that if you are good at parameter tuning (using a good chunk of the techniques outlined above), LightGBM performs nearly as well for a huge increase in speed. Further, EvoTrees does both better with proper tuning.

  9. However, I have been re-sold on simple linear regression recently. If you can modify the features to encode some of the nonlinearities, linear regression (or, more properly, “geodesic regression”) outperforms everything.