
Revisiting the Titanic with PyCaret


! ls /kaggle/input/titanic
gender_submission.csv  test.csv  train.csv
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
data_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data_files.append(os.path.join(dirname, filename))
print(data_files)

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
['/kaggle/input/titanic/train.csv', '/kaggle/input/titanic/test.csv', '/kaggle/input/titanic/gender_submission.csv']


Exploratory Data Analysis


[Figure: route.png (the Titanic's route)]

Files

  1. df_train - development dataset
  2. df_test - submission dataset
  3. df_gender - sample submission in which all female passengers are assumed to have survived; this naive approach scores about 76% accuracy!
df_train = pd.read_csv('/kaggle/input/titanic/train.csv')
df_test = pd.read_csv('/kaggle/input/titanic/test.csv')
df_gender = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

print(df_train.shape, df_test.shape, df_gender.shape)
(891, 12) (418, 11) (418, 2)
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Observations

  • ID column : PassengerId
  • Target (DV) column : Survived {0, 1}
  • Non-null IDV columns (both sets) : Pclass, Name, Sex, SibSp, Parch, Ticket
  • IDV columns with some null values : Age and Cabin (both sets), Embarked (2 nulls in train only), Fare (1 null in test only)
df_train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Columns without any null values

Pclass

Ticket Class

df_train.Pclass.sample(10)
286    3
353    3
693    3
152    3
871    1
183    2
199    2
162    3
453    1
455    3
Name: Pclass, dtype: int64

What is the distribution of the class of tickets?

pd.concat([df_train.Pclass.value_counts(),
           df_train.Pclass.value_counts(normalize=True)], axis = 1)
count proportion
Pclass
3 491 0.551066
1 216 0.242424
2 184 0.206510
class_survivors = df_train.groupby('Pclass').agg({'Survived':['count', 'sum']})

class_survivors
Survived
count sum
Pclass
1 216 136
2 184 87
3 491 119
class_survivors[('Survived', 'sum')]/class_survivors[('Survived', 'count')]
Pclass
1    0.629630
2    0.472826
3    0.242363
dtype: float64

Name

df_train.Name.sample(10)
5                                       Moran, Mr. James
477                            Braund, Mr. Lewis Richard
434                            Silvey, Mr. William Baird
835                          Compton, Miss. Sara Rebecca
862    Swift, Mrs. Frederick Joel (Margaret Welles Ba...
709    Moubarek, Master. Halim Gonios ("William George")
210                                       Ali, Mr. Ahmed
409                                   Lefebre, Miss. Ida
329                         Hippach, Miss. Jean Gertrude
599         Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan")
Name: Name, dtype: object

Note

  1. Title could be extracted from the name.
  2. Is there some significance, in the survival context, to names carrying a parenthesised alternative name? (A quick check follows this list.)
  3. Are there important family names that can be extracted to signal importance for survival?
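
A quick check on note 2; a minimal sketch, assuming the alternative name is recorded in parentheses within Name:

# Survival rate for names with vs. without a parenthesised alternative name.
has_alt_name = df_train['Name'].str.contains('(', regex=False)
df_train.groupby(has_alt_name)['Survived'].mean()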

Sex

df_train.Sex.sample(10)
334    female
29       male
623      male
616      male
398      male
0        male
258    female
388      male
811      male
405      male
Name: Sex, dtype: object

What is the distribution of passenger sex?

pd.concat([df_train.Sex.value_counts(),
           df_train.Sex.value_counts(normalize=True)], axis = 1)
count proportion
Sex
male 577 0.647587
female 314 0.352413
df_train[df_train.Age.isna()].sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0000 B102 S
229 230 0 3 Lefebre, Miss. Mathilde female NaN 3 1 4133 25.4667 NaN S
656 657 0 3 Radeff, Mr. Alexander male NaN 0 0 349223 7.8958 NaN S
475 476 0 1 Clifford, Mr. George Quincy male NaN 0 0 110465 52.0000 A14 S
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.5500 NaN S
760 761 0 3 Garfirth, Mr. John male NaN 0 0 358585 14.5000 NaN S
783 784 0 3 Johnston, Mr. Andrew G male NaN 1 2 W./C. 6607 23.4500 NaN S
87 88 0 3 Slocovski, Mr. Selman Francis male NaN 0 0 SOTON/OQ 392086 8.0500 NaN S
55 56 1 1 Woolner, Mr. Hugh male NaN 0 0 19947 35.5000 C52 S
65 66 1 3 Moubarek, Master. Gerios male NaN 1 1 2661 15.2458 NaN C

SibSp

Number of siblings / spouses aboard the Titanic

df_train.SibSp.sample(10)
170    0
410    0
533    0
316    1
7      3
395    0
635    0
855    0
844    0
2      0
Name: SibSp, dtype: int64
pd.concat([df_train.SibSp.value_counts(),
           df_train.SibSp.value_counts(normalize=True)], axis = 1)
count proportion
SibSp
0 608 0.682379
1 209 0.234568
2 28 0.031425
4 18 0.020202
3 16 0.017957
8 7 0.007856
5 5 0.005612

Note

  1. About 68% of the passengers have no siblings or spouses accompanying them.
  2. If SibSp == 1, is it more likely a spouse than a sibling? If so, would they survive together or die together, unlike Jack and Rose :D (Survival rate by SibSp is sketched below.)
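
Survival rate by SibSp can be eyeballed in one line; a minimal sketch using only columns already loaded:

# Mean of the 0/1 Survived flag per SibSp value = survival rate.
df_train.groupby('SibSp')['Survived'].mean()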

Parch

Number of parents / children aboard the Titanic

df_train.Parch.sample(10)
658    0
150    0
530    1
338    0
834    0
308    0
866    0
475    0
45     0
435    2
Name: Parch, dtype: int64
pd.concat([df_train.Parch.value_counts(),
           df_train.Parch.value_counts(normalize=True)], axis = 1)
count proportion
Parch
0 678 0.760943
1 118 0.132435
2 80 0.089787
5 5 0.005612
3 5 0.005612
4 4 0.004489
6 1 0.001122

Ticket

Ticket number

df_train.Ticket.sample(10)
527           PC 17483
51          A/4. 39886
485               4133
755             250649
464           A/S 2816
404             315096
776             383121
95              374910
594        SC/AH 29037
157    SOTON/OQ 392090
Name: Ticket, dtype: object
df_train[['Embarked','Pclass', 'Ticket', 'Cabin']].sample(10)
Embarked Pclass Ticket Cabin
238 S 2 28665 NaN
280 Q 3 336439 NaN
533 C 3 2668 NaN
493 C 1 PC 17609 NaN
813 S 3 347082 NaN
438 S 1 19950 C23 C25 C27
321 S 3 349219 NaN
790 Q 3 12460 NaN
822 S 1 19972 NaN
672 S 2 C.A. 24580 NaN

Note

  1. Passengers sharing a ticket may share cabin information, so Ticket could help fill in missing Cabin values. (A rough check follows this list.)
  2. Ticket could also depend on Pclass and Embarked.
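
A rough check on note 1; a sketch that counts tickets where Cabin is recorded for some holders but missing for others:

# True where a shared ticket has both known and missing Cabin values;
# such tickets hint that Ticket could help impute Cabin.
mixed_cabin = df_train.groupby('Ticket')['Cabin'].agg(
    lambda s: s.notna().any() and s.isna().any())
mixed_cabin.sum()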

Fare

Passenger fare

df_train.Fare.sample(10)
511     8.0500
753     7.8958
460    26.5500
9      30.0708
99     26.0000
347    16.1000
827    37.0042
351    35.0000
177    28.7125
825     6.9500
Name: Fare, dtype: float64
df_train.groupby(['Pclass', 'Embarked']).agg({'Fare':'mean'})
Fare
Pclass Embarked
1 C 104.718529
Q 90.000000
S 70.364862
2 C 25.358335
Q 12.350000
S 20.327439
3 C 11.214083
Q 11.183393
S 14.644083

Note

  1. As expected, the class order in terms of mean fare is 1, 2 and then 3, but where the passenger embarked also has an impact on the fare.
  2. However, it seems no new information in our context can be extracted from this column.

Embarked

Port of Embarkation; C -Cherbourg, Q - Queenstown, S - Southampton

pd.concat([df_train.Embarked.value_counts(),
           df_train.Embarked.value_counts(normalize = True)], axis = 1)
count proportion
Embarked
S 644 0.724409
C 168 0.188976
Q 77 0.086614

Note

  1. Embarked does not look directly relevant to survival, but it could be useful for imputing other columns.

Columns with null values

Cabin

Cabin number

df_train.Cabin.sample(10)
57     NaN
72     NaN
348    NaN
631    NaN
422    NaN
689     B5
734    NaN
301    NaN
615    NaN
754    NaN
Name: Cabin, dtype: object
pd.concat([df_train.Cabin.isna().value_counts(),
           df_train.Cabin.isna().value_counts(normalize=True)], axis = 1)
count proportion
Cabin
True 687 0.771044
False 204 0.228956
df_train[df_train.Cabin.notnull()].groupby('Pclass')['Cabin'].agg(' '.join)
Pclass
1    C85 C123 E46 C103 A6 C23 C25 C27 B78 D33 B30 C...
2    D56 F33 E101 F2 F4 F2 D E101 D F2 F33 D F33 F4...
3    G6 F G73 F E69 G6 G6 G6 E10 F G63 F G73 E121 F...
Name: Cabin, dtype: object

Note

  1. Cabins could hold key information on who survived, by virtue of how accessible they are to the deck.
  2. However, since a large portion of the information is unavailable (~77%), it cannot be used unless encoded through another variable.
  3. Pclass could drive cabin allocation (the definition of class is separation), but the attribute is of little use if Pclass alone can define it. Needs exploration! (A deck-letter sketch follows this list.)
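
A deck-letter sketch for notes 1 and 3; kept in a local variable so the dataframe is untouched:

# The first character of Cabin is the deck (NaN where Cabin is missing).
deck = df_train['Cabin'].str[0]
df_train.groupby(deck)['Survived'].mean()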

Age

df_train.Age.sample(10)
354     NaN
210    24.0
711     NaN
11     58.0
395    22.0
629     NaN
722    34.0
193     3.0
300     NaN
794    25.0
Name: Age, dtype: float64

Data Preprocessing

Feature Engineering

Title

df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip()).value_counts()
Name
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64
df_train['Title'] = df_train['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())
df_test['Title'] = df_test['Name'].apply(lambda x : x.split('.')[0].split(' ')[-1].strip())

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
637 638 0 2 Collyer, Mr. Harvey male 31.0 1 1 C.A. 31921 26.2500 NaN S Mr
697 698 1 3 Mullens, Miss. Katherine "Katie" female NaN 0 0 35852 7.7333 NaN Q Miss
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0000 A36 S Mr
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.5500 NaN S Mr
835 836 1 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C Miss
513 514 1 1 Rothschild, Mrs. Martin (Elizabeth L. Barrett) female 54.0 1 0 PC 17603 59.4000 NaN C Mrs
291 292 1 1 Bishop, Mrs. Dickinson H (Helen Walton) female 19.0 1 0 11967 91.0792 B49 C Mrs
767 768 0 3 Mangan, Miss. Mary female 30.5 0 0 364850 7.7500 NaN Q Miss
851 852 0 3 Svensson, Mr. Johan male 74.0 0 0 347060 7.7750 NaN S Mr
117 118 0 2 Turpin, Mr. William John Robert male 29.0 1 0 11668 21.0000 NaN S Mr
df_train.groupby("Title").agg({'Age':['count', 'mean']}).sort_values(('Age', 'mean'), ascending = False)
Age
count mean
Title
Capt 1 70.000000
Col 2 58.000000
Sir 1 49.000000
Major 2 48.500000
Lady 1 48.000000
Rev 6 43.166667
Dr 6 42.000000
Don 1 40.000000
Jonkheer 1 38.000000
Mrs 108 35.898148
Countess 1 33.000000
Mr 398 32.368090
Ms 1 28.000000
Mlle 2 24.000000
Mme 1 24.000000
Miss 146 21.773973
Master 36 4.574167
df_train[df_train.Age.isna()].Title.value_counts()
Title
Mr        119
Miss       36
Mrs        17
Master      4
Dr          1
Name: count, dtype: int64
df_test[df_test.Age.isna()].Title.value_counts()
Title
Mr        57
Miss      14
Mrs       10
Master     4
Ms         1
Name: count, dtype: int64

Note: use the average age by Title to fill missing values in the Age column. This method is preferred over imputing with the mean of the overall dataset.

Title mean age

Average age of all the passengers with a specific title. Note: could be useful for Age imputation.

df_title_mean_age = df_train.groupby("Title").agg({'Age':'mean'}).reset_index().rename({'Age': 'Title_Mean_Age'}, axis = 1)

df_title_mean_age.sample(5)
Title Title_Mean_Age
16 Sir 49.0
2 Countess 33.0
7 Major 48.5
0 Capt 70.0
6 Lady 48.0
df_train = df_train.merge(df_title_mean_age, how = 'left', on = 'Title')
df_test = df_test.merge(df_title_mean_age, how = 'left', on = 'Title')

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age
634 635 0 3 Skoog, Miss. Mabel female 9.0 3 2 347088 27.9000 NaN S Miss 21.773973
232 233 0 2 Sjostedt, Mr. Ernst Adolf male 59.0 0 0 237442 13.5000 NaN S Mr 32.368090
721 722 0 3 Jensen, Mr. Svend Lauritz male 17.0 1 0 350048 7.0542 NaN S Mr 32.368090
266 267 0 3 Panula, Mr. Ernesti Arvid male 16.0 4 1 3101295 39.6875 NaN S Mr 32.368090
650 651 0 3 Mitkoff, Mr. Mito male NaN 0 0 349221 7.8958 NaN S Mr 32.368090
524 525 0 3 Kassem, Mr. Fared male NaN 0 0 2700 7.2292 NaN C Mr 32.368090
311 312 1 1 Ryerson, Miss. Emily Borie female 18.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C Miss 21.773973
116 117 0 3 Connors, Mr. Patrick male 70.5 0 0 370369 7.7500 NaN Q Mr 32.368090
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S Mrs 35.898148
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S Mr 32.368090
df_train[df_train.Age.isna()].sample()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age
776 777 0 3 Tobin, Mr. Roger male NaN 0 0 383121 7.75 F38 Q Mr 32.36809
df_train['Age'] = df_train['Age'].fillna(df_train['Title_Mean_Age'])
df_test['Age'] = df_test['Age'].fillna(df_test['Title_Mean_Age'])

df_test.shape
(418, 13)

Alone

Flag set when no siblings, spouses, parents, or children are accompanying the passenger. Note: this is ignorant of friends or other companions.

print(f"The number of passengers who are travelling alone is \
{df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]} \
which is about {df_train[(df_train.SibSp==0)&(df_train.Parch==0)].shape[0]/df_train.shape[0]*100:.2f}% of the total passengers.")
The number of passengers who are travelling alone is 537 which is about 60.27% of the total passengers.
df_train['Alone'] = (df_train.SibSp==0)&(df_train.Parch==0)
df_test['Alone'] = (df_test.SibSp==0)&(df_test.Parch==0)

df_train.sample(10)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Title_Mean_Age Alone
573 574 1 3 Kelly, Miss. Mary female 21.773973 0 0 14312 7.7500 NaN Q Miss 21.773973 True
495 496 0 3 Yousseff, Mr. Gerious male 32.368090 0 0 2627 14.4583 NaN C Mr 32.368090 True
472 473 1 2 West, Mrs. Edwy Arthur (Ada Mary Worth) female 33.000000 1 2 C.A. 34651 27.7500 NaN S Mrs 35.898148 False
741 742 0 1 Cavendish, Mr. Tyrell William male 36.000000 1 0 19877 78.8500 C46 S Mr 32.368090 False
345 346 1 2 Brown, Miss. Amelia "Mildred" female 24.000000 0 0 248733 13.0000 F33 S Miss 21.773973 True
566 567 0 3 Stoytcheff, Mr. Ilia male 19.000000 0 0 349205 7.8958 NaN S Mr 32.368090 True
785 786 0 3 Harmer, Mr. Abraham (David Lishin) male 25.000000 0 0 374887 7.2500 NaN S Mr 32.368090 True
174 175 0 1 Smith, Mr. James Clinch male 56.000000 0 0 17764 30.6958 A7 C Mr 32.368090 True
474 475 0 3 Strandberg, Miss. Ida Sofia female 22.000000 0 0 7553 9.8375 NaN S Miss 21.773973 True
512 513 1 1 McGough, Mr. James Robert male 36.000000 0 0 PC 17473 26.2875 E25 S Mr 32.368090 True
df_train.Alone.value_counts()
Alone
True     537
False    354
Name: count, dtype: int64
df_train.Survived.value_counts()
Survived
0    549
1    342
Name: count, dtype: int64
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)
count
Alone
False 179
True 163
pd.DataFrame(df_train.Alone.value_counts())
count
Alone
True 537
False 354
df_train.groupby('Alone').agg({'Survived':'sum'}).rename({'Survived':'count'}, axis = 1)/pd.DataFrame(df_train.Alone.value_counts())
count
Alone
False 0.505650
True 0.303538

Note

  1. This could be an important feature: the survival rate drops from about 51% to about 30% when a passenger is travelling alone! (A one-line version of the computation follows.)
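
The three cells above collapse into one line, since the mean of a 0/1 flag is the survival rate:

df_train.groupby('Alone')['Survived'].mean()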

Age group

Use the Age column after filling null values with Title mean age

df_test.Age.info()
<class 'pandas.core.series.Series'>
RangeIndex: 418 entries, 0 to 417
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
418 non-null    float64
dtypes: float64(1)
memory usage: 3.4 KB
# Define bins for the age ranges according to biological markers
bins = [0, 6, 13, 20, 36, 56, 76, float('inf')]  # float('inf') captures ages 76 and above

# Labels for the age groups
labels = ['0-5', '6-12', '13-19', '20-35', '36-55', '56-75', '76+']

# Create age categories
df_train['Age_Group'] = pd.cut(df_train['Age'], bins=bins, labels=labels, right=False)
df_test['Age_Group'] = pd.cut(df_test['Age'], bins=bins, labels=labels, right=False)
df_train.Age_Group.value_counts()
Age_Group
20-35    505
36-55    179
13-19     95
0-5       48
56-75     38
6-12      25
76+        1
Name: count, dtype: int64
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Title', 'Alone', 'Age_Group']
cat_features = ['Sex', 'Title', 'Alone', 'Age_Group']

df_train[features].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   Pclass     891 non-null    int64   
 1   Sex        891 non-null    object  
 2   SibSp      891 non-null    int64   
 3   Parch      891 non-null    int64   
 4   Title      891 non-null    object  
 5   Alone      891 non-null    bool    
 6   Age_Group  891 non-null    category
dtypes: bool(1), category(1), int64(3), object(2)
memory usage: 37.0+ KB

Family Size


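A common construction for this feature; a sketch only, not used downstream in this notebook:

# Family size = the passenger plus siblings/spouses plus parents/children.
family_size = df_train['SibSp'] + df_train['Parch'] + 1
df_train.groupby(family_size)['Survived'].mean()
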
Encoding

# one hot encoding categorical columns

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoder.fit(df_train[cat_features])
train_encoded = encoder.transform(df_train[cat_features])
test_encoded = encoder.transform(df_test[cat_features])
# Convert encoded data back to DataFrame
df_train_encoded = pd.DataFrame(train_encoded, columns=encoder.get_feature_names_out())
df_test_encoded = pd.DataFrame(test_encoded, columns=encoder.get_feature_names_out())
df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
       'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
       'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
       'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
       'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
       'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12',
       'Age_Group_76+'],
      dtype='object')
df_train_encoded['Survived'] = df_train['Survived']

df_train_encoded.columns
Index(['Sex_female', 'Sex_male', 'Title_Capt', 'Title_Col', 'Title_Countess',
       'Title_Don', 'Title_Dr', 'Title_Jonkheer', 'Title_Lady', 'Title_Major',
       'Title_Master', 'Title_Miss', 'Title_Mlle', 'Title_Mme', 'Title_Mr',
       'Title_Mrs', 'Title_Ms', 'Title_Rev', 'Title_Sir', 'Alone_False',
       'Alone_True', 'Age_Group_0-5', 'Age_Group_13-19', 'Age_Group_20-35',
       'Age_Group_36-55', 'Age_Group_56-75', 'Age_Group_6-12', 'Age_Group_76+',
       'Survived'],
      dtype='object')
df_train_encoded.sample(10)
Sex_female Sex_male Title_Capt Title_Col Title_Countess Title_Don Title_Dr Title_Jonkheer Title_Lady Title_Major ... Alone_False Alone_True Age_Group_0-5 Age_Group_13-19 Age_Group_20-35 Age_Group_36-55 Age_Group_56-75 Age_Group_6-12 Age_Group_76+ Survived
638 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0
510 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
393 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
857 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1
887 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1
17 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
189 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0
444 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1
816 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0
26 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0

10 rows × 29 columns

Model Building

Auto-ML using PyCaret

from pycaret.classification import setup, compare_models

# Assuming 'data' is your DataFrame and 'target' is the name of the target column
setup_data = setup(data=df_train_encoded, target='Survived', fold = 3, session_id=123)

  Description Value
0 Session id 123
1 Target Survived
2 Target type Binary
3 Original data shape (891, 29)
4 Transformed data shape (891, 29)
5 Transformed train set shape (623, 29)
6 Transformed test set shape (268, 29)
7 Numeric features 28
8 Preprocess True
9 Imputation type simple
10 Numeric imputation mean
11 Categorical imputation mode
12 Fold Generator StratifiedKFold
13 Fold Number 3
14 CPU Jobs -1
15 Use GPU False
16 Log Experiment False
17 Experiment Name clf-default-name
18 USI 2335
best_model = compare_models()
  Model Accuracy AUC Recall Prec. F1 Kappa MCC TT (Sec)
lr Logistic Regression 0.8025 0.7862 0.7446 0.7421 0.7433 0.5829 0.5829 1.7533
ridge Ridge Classifier 0.8025 0.7878 0.7363 0.7463 0.7411 0.5815 0.5817 0.0500
lda Linear Discriminant Analysis 0.8009 0.7820 0.7321 0.7450 0.7383 0.5777 0.5779 0.0767
ada Ada Boost Classifier 0.7945 0.7753 0.7279 0.7343 0.7309 0.5647 0.5649 0.1600
lightgbm Light Gradient Boosting Machine 0.7881 0.7869 0.7111 0.7297 0.7202 0.5497 0.5499 0.5033
gbc Gradient Boosting Classifier 0.7865 0.7766 0.7069 0.7283 0.7174 0.5458 0.5461 0.3533
rf Random Forest Classifier 0.7849 0.7790 0.6944 0.7311 0.7123 0.5407 0.5411 0.2467
et Extra Trees Classifier 0.7801 0.7732 0.6777 0.7299 0.7027 0.5286 0.5297 0.3300
xgboost Extreme Gradient Boosting 0.7800 0.7814 0.6944 0.7219 0.7079 0.5316 0.5319 0.1833
svm SVM - Linear Kernel 0.7785 0.7883 0.7405 0.7027 0.7193 0.5369 0.5393 0.0500
dt Decision Tree Classifier 0.7769 0.7710 0.6777 0.7235 0.6996 0.5225 0.5234 0.0400
knn K Neighbors Classifier 0.7672 0.7734 0.6234 0.7317 0.6715 0.4934 0.4984 0.0733
qda Quadratic Discriminant Analysis 0.7192 0.7699 0.7361 0.6423 0.6740 0.4358 0.4498 0.0500
dummy Dummy Classifier 0.6164 0.5000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0400
nb Naive Bayes 0.4866 0.7688 0.6544 0.5382 0.4083 0.0352 0.0885 0.0433
from pycaret.classification import tune_model

tuned_model = tune_model(best_model, fold=10)
  Accuracy AUC Recall Prec. F1 Kappa MCC
Fold              
0 0.7937 0.7618 0.7083 0.7391 0.7234 0.5590 0.5593
1 0.8571 0.8109 0.7917 0.8261 0.8085 0.6947 0.6951
2 0.9048 0.9054 0.9583 0.8214 0.8846 0.8043 0.8113
3 0.7903 0.7592 0.6522 0.7500 0.6977 0.5384 0.5415
4 0.7419 0.7577 0.5833 0.7000 0.6364 0.4389 0.4433
5 0.7419 0.7621 0.7083 0.6538 0.6800 0.4644 0.4654
6 0.8871 0.8805 0.8333 0.8696 0.8511 0.7602 0.7607
7 0.7258 0.7484 0.7083 0.6296 0.6667 0.4352 0.4373
8 0.7097 0.6727 0.5417 0.6500 0.5909 0.3688 0.3725
9 0.8387 0.8355 0.8750 0.7500 0.8077 0.6702 0.6761
Mean 0.7991 0.7894 0.7361 0.7390 0.7347 0.5734 0.5762
Std 0.0662 0.0656 0.1232 0.0777 0.0929 0.1431 0.1434


Fitting 10 folds for each of 10 candidates, totalling 100 fits
predictions = tuned_model.predict(df_test_encoded)
df_test['Survived'] = predictions
df_submission = df_test[['PassengerId', 'Survived']]
df_submission.shape
(418, 2)

df_submission.to_csv('/kaggle/working/submission.csv', index=False)

XGBoost_Study



  1. Play around with the “gamma” regularisation parameter. Gamma adds a complexity cost to the XGBoost trees: a tree only grows deeper if the gain from expanding it is > gamma (also known as “pruning”). The higher the gamma, the stronger the regularisation.
  2. You can use the “subsample” parameter to train each XGBoost tree on a subset of the data. If subsample is set to 0.9, each tree will be trained on 90% of the training data.
  3. Use the “colsample_bytree” parameter where again you would train each XGBoost tree on a random subset of your features. If you set colsample_bytree = 0.9, you would randomly remove 10% of the features when building each tree.
  4. Use early stopping. As you keep adding estimators (i.e. trees) to better fit the training data, you will start to overfit. early_stopping_rounds stops XGBoost from adding trees once performance on the validation set has not improved for that many rounds.
  5. Play around with the lambda and alpha regularisation parameters, which correspond to L2 and L1 regularisation. (A combined sketch follows this list.)
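
A sketch pulling these five knobs together. The values are illustrative, not tuned; X_train, y_train, X_valid, y_valid are placeholder splits, and passing early_stopping_rounds to the constructor assumes xgboost >= 1.6 with the scikit-learn wrapper:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,         # generous cap; early stopping trims it
    gamma=1.0,                 # minimum gain required to split (pruning)
    subsample=0.9,             # each tree sees 90% of the rows
    colsample_bytree=0.9,      # each tree sees 90% of the features
    reg_lambda=1.0,            # L2 penalty on leaf weights
    reg_alpha=0.0,             # L1 penalty on leaf weights
    early_stopping_rounds=50,  # stop once validation stalls for 50 rounds
    eval_metric='logloss',
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)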

Bonus, like for random forest you can play around with the tree parameters as well:

  1. Adjust the “max_depth” parameter of the trees: the lower the max depth, the simpler the model and the stronger the regularisation.

  2. Adjust the min_samples_leaf parameter to set a minimum number of samples per leaf. The higher the value, the stronger the regularisation.

  3. Consider dispensing with XGBoost altogether when the dataset is fairly small and the relationships are quite simple/linear. Using it there would, in most cases, be fitting a square peg into a round hole.

  4. Another non-hyperparameter strategy: bucket variables so the algo cannot slice and dice continuous or ordinal variables too finely and create overly complex relationships.

  5. And the best algo for tabular data is the one you’ve actually tried and cross-validated to be the best by whatever metric makes the most sense for your specific dataset and business goal.

  6. XGBoost is not the best model for tabular data, just the best-known one. CatBoost outperforms XGBoost in many ways. On the topic of overfitting, CatBoost hardly overfits, and it works perfectly well on numeric features. Of course, the more complicated an ML model is, the more time someone needs to master it.

  7. The good news is that, because of the bagging method with random forests, overfitting is very hard?

  8. I love XGBoost, but I find that if you are good at parameter tuning (using a good chunk of the techniques outlined above), LightGBM performs nearly as well for a huge increase in speed. Further, EvoTrees does both better with proper tuning.

  9. However, I have been re-sold on simple linear regression recently. If you can modify the features to encode some of the nonlinearities, linear regression (or, more properly, “geodesic regression”) outperforms everything.