Jupyter Notebook

This document was converted from a Jupyter Notebook to a Markdown document. It is advised to open the original .ipynb file, because that way you will be able to see the output of all code and tinker with it.

Data Science

“Data science is the process of analysing data to gain insights or achieve other objectives. While it often involves artificial intelligence and machine learning, it can also be highly valuable without them, using traditional statistical methods and data analysis techniques.”

  • Me & AI Spellchecker 29/05/2025

This article will cover everything from simple analysis to Random Forest algorithms.

Imports

This code block pip installs all the packages we’ll be using.

!pip install jupyter
!pip install notebook
!pip install pandas
!pip install numpy
!pip install scipy
!pip install matplotlib
!pip install seaborn
!pip install fpdf
!pip install scikit-learn
!pip install graphviz
!pip install dtreeviz
!pip install opencv-python
!pip install ultralytics

The block above shows you how to pip install the packages; below you can see how to import them:

import pandas as pd
import seaborn as sns
import scipy.stats as st
from scipy.stats import chi2_contingency

Distributions

There are a lot of distribution types. You can see them here.

If the distribution is not symmetric (one tail is longer than the other), it is called a skewed distribution, according to Wikipedia.

In this notebook / markdown file, we’ll be looking at the default iris dataset that is included with the seaborn Python module.

Below is the code that loads the iris dataset; when run in a notebook, it will also show you the first rows of the dataset.

iris = sns.load_dataset("iris")
iris.head()

Now we are going to look at a column and see which distribution it is.

iris['sepal_width'].plot(kind='hist', bins = 10)

The above graph is a normal distribution; this image shows a clearer picture:

img

Examples of normal distribution in real life
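As a quick numeric check for this, pandas can compute the skewness of a column directly; a value near 0 suggests a roughly symmetric, normal-like distribution. A minimal sketch, reusing the iris dataset loaded above:

# Skewness near 0 indicates a roughly symmetric distribution;
# positive values mean a longer right tail, negative values a longer left tail.
print(iris['sepal_width'].skew())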

Confidence Interval

What is a Confidence Interval? A confidence interval is a range of values that we believe is likely to contain the true population parameter (like the mean or proportion), based on a sample.

Usually, we talk about the mean, so let’s focus on that:

A 95% confidence interval for the mean says: “We are 95% confident that the true population mean lies within this range.”

Source: Avans

Visualisation:

img

Now let’s apply this confidence value to our dataset:

confidence = 0.95
st.t.interval(confidence, len(iris)-1, loc=iris['sepal_width'].mean(), scale=st.sem(iris['sepal_width']))

So that means we can be 95% confident that the real mean is somewhere between 2.98 and 3.12.
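To demystify what st.t.interval is doing, here is a minimal manual sketch of the same calculation (mean ± t-critical × standard error); the numbers should match the interval above:

mean = iris['sepal_width'].mean()
sem = st.sem(iris['sepal_width'])                  # standard error of the mean
t_crit = st.t.ppf((1 + 0.95) / 2, df=len(iris)-1)  # two-sided 95% t critical value
print(mean - t_crit * sem, mean + t_crit * sem)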

Analysis

Univariate Analysis

Univariate analysis is the act of analyzing a single column of data.

This can be done in multiple ways. With numerical columns you could use a boxplot, a histogram, etc. With categorical columns you can use a bar chart, pie chart, etc.

Categorical Univariate Analysis

I am going to show you how to make a bar chart. This is useful because it shows clearly which category is most common. In the following picture you can see that the species are distributed pretty evenly in our dataset.

iris['species'].value_counts().plot(kind='bar')
# the value_counts() counts the amount of times each value is found in the column.
iris['species'].value_counts().plot(kind='pie')

You have now seen 2 examples of what you can do with a categorical column. Here is a long list of other charts.

Numerical Univariate Analysis

With a numerical analysis we do, in theory, the same thing as with categorical data. But using a pie chart to display numerical data is not really useful. You can see this in the following example:

iris['sepal_length'].value_counts().plot(kind='pie')

What we’re getting now is a pie chart with lots of slices, which aren’t very clear. And if you remove .value_counts(), it only gets more confusing.

So now you might wonder: what’s the point of numerical analysis?

You can use a histogram.

#The data is grouped into 10 intervals (bins) and the number of data points in each interval is counted.
iris['sepal_length'].plot(kind='hist', bins = 10)

I’d advise you to use a boxplot to see the median, the quartiles, and outliers:

iris['sepal_length'].plot(kind='box')

Here is an image to describe it more clearly:

img

For a good visual explanation:

|----[  |  ]----| * *  ⟵ a full boxplot, with outliers on the right

*     *              ⟵ Outliers (extreme values)
|----[  |  ]----|    ⟵ Full range (whiskers, box, and median)
     Q1 |  Q3        Q1 = 25th percentile, Q3 = 75th percentile
        |            Median (50th percentile)
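The boxplot’s ingredients can also be computed by hand; a minimal sketch of the quartiles and the usual 1.5 × IQR outlier fences:

q1 = iris['sepal_length'].quantile(0.25)  # 25th percentile (left edge of the box)
q3 = iris['sepal_length'].quantile(0.75)  # 75th percentile (right edge of the box)
iqr = q3 - q1                             # interquartile range (the box itself)
lower_fence = q1 - 1.5 * iqr              # points below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr              # points above this are drawn as outliers
print(q1, q3, lower_fence, upper_fence)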

Bivariate Analysis

The title says it: bi means 2, so it is the analysis of 2 columns. We can do this in multiple ways: Categorical x Categorical, Numerical x Numerical, and Categorical x Numerical.

The reason why this is useful is to check for correlations. Here are some examples:

  • Body length VS Shoe size
  • Age VS Seconds it takes to run 1 kilometre
  • Country Unemployment rate VS Country Happiness Rating
  • Customer Revenue VS Customer Lifetime
  • Game Wishlists VS Game Purchases
  • Student grades VS Student attendance

Numerical vs Numerical

Let’s apply it to our dataset and see whether there is a correlation between 2 columns:

iris.plot(kind='scatter', x='sepal_length', y='petal_length')

A useful statistic for calculating the linear relation between two variables is the Pearson correlation. The Pearson correlation measures the linear relation between two numerical variables. The result is a number between -1 and 1, where:

  • -1 indicates a perfect negative linear relation
  • 0 indicates that there is no linear relation
  • 1 indicates a perfect positive linear relation
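For a single pair of columns, scipy.stats.pearsonr gives you both the coefficient described above and a p-value; a minimal sketch:

from scipy.stats import pearsonr

r, p_value = pearsonr(iris['sepal_length'], iris['petal_length'])
print(f"Pearson r = {r:.2f}, p-value = {p_value:.2e}")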

Here is a nice visualisation:

img

Let’s now see which columns have any correlation.

irisCorrelations = iris.corr(numeric_only=True) # Only include numeric columns
irisCorrelations.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)

Here we can see that some column pairs do have a correlation, and some don’t.

The further away from 0, the stronger the correlation.

Numerical vs Categorical

Examples:

  • Monthly earnings VS Highest obtained degree
  • Body length VS Country of origin
  • Happiness rating VS Country of origin
  • Sales VS Account Manager.
  • Revenue VS Product Category
  • Revenue VS Game genre
  • Retention VS Software version

In univariate analysis for numerical data we’ve looked at confidence intervals. We can use these as well to check if there are any significant differences between categories.

seaborn.barplot() is a function from the Seaborn library that creates a bar chart; it’s great for visualizing comparisons between categories.

Unlike basic bar plots, Seaborn’s barplot():

  • Can automatically calculate averages if you give it multiple values per category
  • Adds error bars (by default, 95% confidence intervals)

If you’re plotting the average dice rolls per person, the bar height shows the average, and the error bar shows how much that average might vary if you repeated the experiment. So if you have a bar at 4, and the error bar goes from 3 to 5, it means:

“We’re fairly confident the true average rolls for you is between 3 and 5.”

sns.barplot(x='species', y='sepal_length', data=iris)

Now we can look at a table to see this more clearly:

iris.groupby('species').mean()

There is clearly no overlap, so we could say that there is a correlation between sepal_length and species.
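To back this up with the confidence intervals from earlier, here is a sketch that computes a 95% interval for sepal_length per species; if the intervals do not overlap, the difference between the groups is likely significant:

for species, group in iris.groupby('species'):
    interval = st.t.interval(0.95, len(group)-1,
                             loc=group['sepal_length'].mean(),
                             scale=st.sem(group['sepal_length']))
    print(species, interval)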

Categorical vs Categorical

For these examples we need to switch to the built-in penguins dataset, since the iris dataset only has 1 categorical column.

penguins = sns.load_dataset("penguins")
penguins.head()
def create_contingency_table(dataset, column1, column2):
    return dataset.groupby([column1, column2]).size().unstack(column1, fill_value=0)

A contingency table (also called a cross-tabulation or crosstab) is a table used to show the frequency distribution of two (or more) categorical variables. Each cell in the table represents the count of records that have a specific combination of values from the two variables.
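For reference, pandas also has a built-in helper that produces the same table, so the helper function above is mostly educational:

pd.crosstab(penguins['island'], penguins['species'])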

What do stack() and unstack() do in pandas? Both are used to reshape multi-level (hierarchical) indexes:

  • unstack(level): pivots a level of the row index into columns, converting a MultiIndex row into a wider DataFrame
  • stack(level): pivots a level of the columns into the row index, converting columns into a deeper row index
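A tiny worked example of the reshape, on made-up data (the names here are purely illustrative):

df_example = pd.DataFrame({'species': ['A', 'A', 'B'], 'island': ['X', 'Y', 'X']})
counts = df_example.groupby(['species', 'island']).size()  # Series with a MultiIndex
wide = counts.unstack('species', fill_value=0)             # 'species' level becomes columns
print(wide)
print(wide.stack())                                        # back to a MultiIndex Series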

penguinsContingencyTable = create_contingency_table(penguins, 'species','island')
penguinsContingencyTable

Here we can see that there seems to be a correlation between species and island, because Chinstrap only lives on Dream Island and Gentoo only lives on Biscoe Island.

We can use the following method to check whether there is a real (statistically significant) correlation, though.

def check_cat_vs_cat_correlation(dataset, column1, column2):
    contingency_table = create_contingency_table(dataset, column1, column2)
    chi2 = chi2_contingency(contingency_table)  # returns (statistic, p-value, dof, expected frequencies)
    print(chi2)
    p_value = chi2[1]  # index 1 is the p-value
    odds_of_correlation = 1 - p_value
    print(f"The odds of a correlation between {column1} and {column2} is {odds_of_correlation * 100}% (Based on a p value of {p_value}).")
    print("This percentage needs to be at least 95% for a significant correlation.")
check_cat_vs_cat_correlation(penguins, 'species','island')

The function chi2_contingency from scipy.stats performs the Chi-squared test of independence, which is a statistical test used to determine whether two categorical variables are independent or associated.

Given a contingency table (i.e., a cross-tab of counts between two categorical variables), chi2_contingency checks if the observed frequencies are significantly different from the frequencies we would expect if the variables were independent.

Chi-squared statistic (statistic=299.55): This is a measure of how different the observed values are from the expected values (assuming no relationship). A large value = bigger difference.

p-value (pvalue=1.35e-63): This is extremely small (almost zero), which means the probability that this difference happened by chance is effectively zero.

Expected frequencies: These are the counts you would expect in each cell if there were no relationship between species and island.
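The four return values can also be unpacked directly, which avoids the magic index used in the function above; a minimal sketch:

chi2_stat, p_value, dof, expected = chi2_contingency(penguinsContingencyTable)
print(chi2_stat)  # Chi-squared statistic
print(p_value)    # probability of seeing a difference this large by chance
print(dof)        # degrees of freedom
print(expected)   # expected counts if species and island were independent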

“The odds of a correlation between species and island is 100.0% (Based on a p-value of 1.35e-63). This percentage needs to be at least 95% for a significant correlation.”

This means:

Your p-value is way below the threshold of 0.05 (which is a 95% confidence level).

Therefore, you reject the null hypothesis.

There is a statistically significant association between species and island.

In plain English: certain species are much more likely to be found on certain islands, and this pattern is not due to random chance.

Important: when the p-value is 3.2359805560820074e-107, it can look like there is no correlation, since 3.23 is bigger than 0.05. But that is not true: a p-value of 3.2359805560820074e-107 means there is a correlation. This is because 3.2359805560820074e-107 is scientific notation. A number like 3.2359805560820074e-107 actually means 3.2359805560820074 × 10^(-107), which in practice looks something like this:

0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000032359805560820074

That’s a 0. followed by 106 zeros, then 32359805560820074.
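Python compares the actual numeric value, not the printed notation, so a direct comparison sidesteps the mistake entirely:

p_value = 3.2359805560820074e-107
print(p_value < 0.05)  # True: this p-value is far below the significance threshold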

I have personally made the mistake of overlooking this before, which is why it is advised to use the percentage indicator instead of the raw 0.05 threshold, and why it is never advised to do analysis and scientific research in a rush. Take your time.

Multivariate analysis

We have seen that we can uncover correlations by using bivariate analysis. This also raises the question: if we extract information from multiple columns (multivariate analysis), could we use these correlations to calculate/predict the value of a column for rows that do not yet have a value?

For example:

  • Can we calculate if a customer will churn (= We lose the customer)?
  • Can we calculate if a customer would use a certain product? (Product recommendation)
  • Can we calculate if a mail is spam or not?
  • Can we calculate if a financial transaction is fraudulent or not?
  • Can we calculate if a customer will be able to pay back their loan or not?
  • Can we calculate the price people are willing to pay for a house?
  • Can we calculate the salary a student will earn in the future?

Just as we used math/statistics to explore the data one column at a time (univariate) and per combination of two columns (bivariate analysis), we will use machine learning algorithms to extract information across multiple columns (multivariate analysis) and to build data products.

Machine learning is:

  • Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. (https://www.sas.com/en_us/insights/analytics/machine-learning.html#:~:text=Machine%20learning%20is%20a%20method,decisions%20with%20minimal%20human%20intervention.)
  • Machine learning (ML) is the study of computer algorithms that improve automatically through experience.[1] It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. (https://en.wikipedia.org/wiki/Machine_learning)

Examples of data products could include:

  • A product recommender
  • A customer churn predictor
  • A mail labeler (Primary, promotion, social, spam, etc.)
  • Sentiment analyser for social media messages

For example, if we want to predict a person’s shoe size based on their body length:

  • The shoe size is the target variable
  • The body length is the feature variable

Extra info: The terms ‘target’ and ‘feature’ are borrowed from the field of Machine Learning. In the field of statistics, we refer to the target variable as the dependent variable and we refer to the feature variables as the independent variables.

Classification

When the target variable is a categorical variable then we refer to this task as a classification task.

Examples of classification tasks:

  • Predict if a customer will churn.
  • Predict if a customer will use a specific product.
  • Label mails as spam.
  • Label financial transactions as fraudulent.
  • Label a social media message as positive, neutral or negative.
  • Predict if a customer of the bank will be able to pay back the loan.

Examples of machine learning algorithms that we could use for classification:

  • Decision trees
  • Random forests
  • Logistic regression
  • Neural networks
  • Naive Bayes

Training the Classification model

We are now going to train a model: a decision tree that detects the species of iris.

from sklearn.tree import DecisionTreeClassifier
features= ['sepal_length', 'petal_length']
dt = DecisionTreeClassifier(max_depth = 5) # Increase max_depth to see effect in the plot
dt.fit(iris[features], iris['species'])

This Python code trains a Decision Tree Classifier on a subset of the famous Iris dataset, using only two features: ‘sepal_length’ and ‘petal_length’. Only these will be used to train the model. It creates a decision tree classifier with a maximum depth of 5, meaning the tree will be allowed to split at most 5 times from the root to a leaf. This limits its complexity and helps prevent overfitting.

It trains (fits) the decision tree on the selected features and the target label ‘species’. It learns to classify the type of Iris flower (setosa, versicolor, virginica) based on ‘sepal_length’ and ‘petal_length’.

from sklearn import tree
import graphviz
 
def plot_tree_classification(model, features, class_names):
    # Generate plot data
    dot_data = tree.export_graphviz(model, out_file=None,
                          feature_names=features,
                          class_names=class_names,
                          filled=True, rounded=True,
                          special_characters=True)
    # Turn into graph using graphviz
    graph = graphviz.Source(dot_data)
    # Write out a pdf
    graph.render("decision_tree")
    # Display in the notebook
    return graph
 
import numpy as np
plot_tree_classification(dt, features, np.sort(iris.species.unique()))

What do you see here?

Let’s take a look at the first node:

  • samples = 150: all samples from the dataset.
  • value = [50, 50, 50]: 50 samples each from setosa, versicolor, and virginica.
  • gini = 0.667: high impurity (which makes sense, since the classes are perfectly balanced).
  • petal_length ≤ 2.45: this is the first split criterion.

class = the predicted class, i.e. the class with the highest number of samples at that node. But at the first node, all classes are equally represented ([50, 50, 50]), so technically the model is just picking one arbitrarily, probably the first in lexicographic order (setosa, versicolor, virginica). So it’s not a real prediction, just a default label when there’s a tie.
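You can verify the root’s gini value by hand: for class proportions p, gini = 1 − Σ p². A quick check for the balanced [50, 50, 50] split:

counts = [50, 50, 50]
total = sum(counts)
gini = 1 - sum((c / total) ** 2 for c in counts)  # 1 - 3 * (1/3)^2
print(gini)  # 0.666..., matching the root node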

Evaluating the Classification model

Now we would like to evaluate our model.

predictions = dt.predict(iris[features])
def calculate_accuracy(predictions, actuals):
    if(len(predictions) != len(actuals)):
        raise Exception("The amount of predictions did not equal the amount of actuals")
 
    return (predictions == actuals).sum() / len(actuals)
calculate_accuracy(predictions, iris.species)

This function calculate_accuracy(predictions, actuals) computes the classification accuracy of your model’s predictions compared to the actual (true) labels.
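For reference, scikit-learn ships an equivalent metric, so the hand-rolled function can be cross-checked against it:

from sklearn.metrics import accuracy_score
print(accuracy_score(iris.species, predictions))  # should match calculate_accuracy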

And we can see that the model is quite accurate. But that is because we tested and trained on the same dataset. So if we want to actually make sure we train a good algorithm, we have to split our dataset into a test and a train dataset.

Test/train split

Why Test/Train split?

Splitting the dataset allows you to train the model on one part (training set) and evaluate it on another (test set). This helps you simulate how the model performs on unseen, real-world data. It prevents overfitting, where the model memorizes the training data but fails on new data.

The stratify option ensures that each class is proportionally represented in both the training and test sets. This is important for balanced model training and fair evaluation. random_state ensures reproducibility, so you get the same split each time you run the code. A 70/30 split (test_size=0.3) is a common and balanced choice for small datasets like Iris.

from sklearn.model_selection import train_test_split
iris_train, iris_test = train_test_split(iris, test_size=0.3, random_state=42, stratify=iris['species'])
print(iris_train.shape, iris_test.shape)

And now we are going to train our model again, this time only on our training dataset.

features= ['sepal_length', 'sepal_width'] # note: different features than in the previous model
dt_classification = DecisionTreeClassifier(max_depth = 5) # Increase max_depth to see effect in the plot
dt_classification.fit(iris_train[features], iris_train['species'])
predictionsOnTrainset = dt_classification.predict(iris_train[features])
predictionsOnTestset = dt_classification.predict(iris_test[features])
 
accuracyTrain = calculate_accuracy(predictionsOnTrainset, iris_train.species)
accuracyTest = calculate_accuracy(predictionsOnTestset, iris_test.species)
 
print("Accuracy on training set " + str(accuracyTrain))
print("Accuracy on test set " + str(accuracyTest))

Now we have tested our accuracy.

Overfitting

📊Results Summary

Training Accuracy: 85.7%
Test Accuracy: 73.3%

🧠What this means

Your model performs reasonably well on the training data. The lower test accuracy suggests the model is not generalizing perfectly to unseen data. The ~12% drop in accuracy is a sign of mild overfitting: the model may have learned some patterns specific to the training data that don’t hold in the test set.
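One quick way to see this effect for yourself is to retrain with a few different max_depth values and watch the train/test gap. A small experiment, reusing the variables defined above:

for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(iris_train[features], iris_train['species'])
    train_acc = calculate_accuracy(model.predict(iris_train[features]), iris_train.species)
    test_acc = calculate_accuracy(model.predict(iris_test[features]), iris_test.species)
    print(f"depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")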

Regression

When the target variable is a numerical variable then we refer to this task as a regression task.

Examples of regression tasks:

  • Predict the price people are willing to pay for a house.
  • Predict the salary a student will earn in the future.

Examples of machine learning algorithms that we could use for regression:

  • Linear regression
  • Decision trees
  • Random forests
  • Neural networks

iris.corr(numeric_only=True).style.background_gradient(cmap='coolwarm', axis=None).format(precision=2) # to find out which features correlate with sepal_length
from sklearn.tree import DecisionTreeRegressor
features= ['petal_length'] # add 'petal_width' ('species' does not work; categorical features are not implemented in scikit-learn's decision trees)
dt_regression = DecisionTreeRegressor(max_depth = 2) # Increase max_depth to see effect in the plot
dt_regression.fit(iris_train[features].values, iris_train['sepal_length'].values)
 
#visualise in a scatterplot how the decision tree regressor makes its decisions
import matplotlib.pyplot as plt
 
X_train=iris_train[features].values                     # Assign feature matrix X
y_train=iris_train['sepal_length'].values               # Assign target vector y
 
sort_idx = X_train.flatten().argsort()                  # Sort X and y by ascending values of X
X_train = X_train[sort_idx]
y_train = y_train[sort_idx]
 
plt.figure(figsize=(16, 8))
plt.scatter(X_train, y_train, c='steelblue',                  # Plot actual target against features
            edgecolor='white', s=70)
plt.plot(X_train, dt_regression.predict(X_train),                      # Plot predicted target against features
         color='red', lw=2)
plt.xlabel('petal_length')
plt.ylabel('sepal_length')
plt.show()
import dtreeviz
viz_rmodel = dtreeviz.model(dt_regression, X_train, y_train,
                            feature_names=features,
                            target_name='sepal_length')
viz_rmodel.rtree_feature_space(features=features)

Let’s now make a visual graph.

from sklearn import tree
import graphviz
 
def plot_tree_regression(model, features):
    # Generate plot data
    dot_data = tree.export_graphviz(model, out_file=None,
                          feature_names=features,
                          filled=True, rounded=True,
                          special_characters=True)
 
    # Turn into graph using graphviz
    graph = graphviz.Source(dot_data)
 
    # Write out a pdf
    graph.render("decision_tree")
 
    # Display in the notebook
    return graph
 
plot_tree_regression(dt_regression, features)

Evaluating the Regression model

img

Let’s calculate the RMSE.

def calculate_rmse(predictions, actuals):
    if(len(predictions) != len(actuals)):
        raise Exception("The amount of predictions did not equal the amount of actuals")
 
    return (((predictions - actuals) ** 2).sum() / len(actuals)) ** (1/2)
 
predictionsOnTrainset = dt_regression.predict(iris_train[features])
predictionsOnTestset = dt_regression.predict(iris_test[features])
 
rmseTrain = calculate_rmse(predictionsOnTrainset, iris_train.sepal_length)
rmseTest = calculate_rmse(predictionsOnTestset, iris_test.sepal_length)
 
print("RMSE on training set " + str(rmseTrain))
print("RMSE on test set " + str(rmseTest))

✅ What is RMSE (Root Mean Squared Error)?

RMSE tells you how far off your model’s predictions are, on average --- using the same units as the target you’re predicting.

In this case, we’re predicting sepal length (in cm). An RMSE of 0.46 cm means your model is, on average, less than half a centimeter off. That’s pretty accurate, since sepal lengths range from about 4.3 to 7.9 cm.

👍 Why is this good?

  • The error is small compared to the total range of sepal lengths.
  • The training and test errors are similar, so the model isn’t overfitting.

📌 How to tell if an RMSE is good?

  • Compare it to the range or standard deviation of your target.
  • Compare it to a simple model (like always predicting the average value).

🧠 Note: If your numbers are much bigger (like in the millions), then a bigger RMSE (e.g., 1000+) might still be totally fine. It depends on the scale of your data.
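As a sanity check of the second point in the list above, here is a sketch of the simplest possible baseline: always predict the training-set mean. The tree’s RMSE should beat this comfortably:

baseline = np.full(len(iris_test), iris_train['sepal_length'].mean())  # always predict the mean
rmse_baseline = calculate_rmse(baseline, iris_test.sepal_length)
print("Baseline RMSE (always predicting the mean):", rmse_baseline)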

Random Forest

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.

img

We will apply a Random Forest classifier to the task of classifying penguin species. To optimize performance and accuracy, we will focus on two key parameters: max_depth and n_estimators.

  • n_estimators: number of trees in the forest
  • max_depth: controls how deep each decision tree can grow

penguins = sns.load_dataset("penguins")
penguins = penguins.fillna(0)  # crude imputation: replace missing values with 0
print(penguins.head())
features = ['bill_length_mm', 'body_mass_g', "flipper_length_mm"] #add features per iteration such as 'body_mass_g'
X = penguins[features]
y = penguins['species'] # the thing we want to predict
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='entropy', n_estimators=5, max_depth=3)
rf.fit(X_train, y_train) #fit the random forest to the training data
from sklearn import tree
import graphviz
from fpdf import FPDF
 
def plot_tree_classification(model, features, class_names, output_file='random_forest'):
    if isinstance(model, RandomForestClassifier):
        pdf = FPDF()
 
        for i, tree_model in enumerate(model.estimators_):
            dot_data = tree.export_graphviz(tree_model, out_file=None,
                                  feature_names=features,
                                  class_names=class_names,
                                  filled=True, rounded=True,
                                  special_characters=True)
 
            # Turn into graph using graphviz
            graph = graphviz.Source(dot_data)
 
            # Save as PNG for embedding in the PDF
            # (graphviz appends the format as an extension, so this produces <name>.png)
            image_file = f"{output_file}_tree_{i+1}"
            graph.render(filename=image_file, format='png')

            # Add each tree image to the PDF
            pdf.add_page()
            pdf.image(image_file + '.png', x=10, y=10, w=180)
 
        # Save the complete PDF
        pdf_output_file = f"{output_file}.pdf"
        pdf.output(pdf_output_file)
 
        print(f"All trees saved in {pdf_output_file}.")
 
    else:
        raise ValueError("The model is not a RandomForestClassifier.")
 
    return graph
 
feature_names = X.columns
class_names = np.sort(np.unique(y)).astype(str)
plot_tree_classification(rf, feature_names, class_names)

Here we have created a PDF for our random forest, and we have a visual graph.

Of course, you can skip the visualisation / PDF generation.

def calculate_accuracy(predictions, actuals):
    if(len(predictions) != len(actuals)):
        raise Exception("The amount of predictions did not equal the amount of actuals")
 
    return (predictions == actuals).sum() / len(actuals)
 
predictionsOnTrainset = rf.predict(X_train)
predictionsOnTestset = rf.predict(X_test)
 
accuracyTrain = calculate_accuracy(predictionsOnTrainset, y_train)
accuracyTest = calculate_accuracy(predictionsOnTestset, y_test)
 
print("Accuracy on training set " + str(accuracyTrain))
print("Accuracy on test set " + str(accuracyTest))

So this means that with our settings:

features = ['bill_length_mm', 'body_mass_g', "flipper_length_mm"]
n_estimators=5
max_depth=3

we have a pretty accurate random forest, where there is only a 5% accuracy difference between the test and training sets.

By tweaking these numbers, you’ll get a different result, better or worse. You can check this manually to try and find the best result, or you can brute-force it like I did:

results = []
amount_of_depths = 25
start_depths = 1
amount_of_estimators = 25
start_estimators = 1
for loop in range(start_depths,amount_of_depths +1):
    for loop2 in range(start_estimators,amount_of_estimators+1):
        print(f"calculating n_estimators: {loop2}, max_depth: {loop}")
        rf = RandomForestClassifier(criterion='entropy', n_estimators=loop2, max_depth=loop)
        rf.fit(X_train, y_train)
        feature_names = X.columns
        class_names = np.sort(np.unique(y)).astype(str)
        # plot_tree_classification(rf, feature_names, class_names)
        predictionsOnTrainset = rf.predict(X_train)
        predictionsOnTestset = rf.predict(X_test)
        accuracyTrain = calculate_accuracy(predictionsOnTrainset, y_train)
        accuracyTest = calculate_accuracy(predictionsOnTestset, y_test)
        print("Accuracy on training set " + str(accuracyTrain))
        print("Accuracy on test set     " + str(accuracyTest))
        print(f"difference at {loop}&{loop2} = {(accuracyTrain - accuracyTest)}")
        results.append({
            "max_depth": loop,
            "n_estimators": loop2,
            "accuracyTrain": accuracyTrain,
            "accuracyTest": accuracyTest,
            "difference": accuracyTrain - accuracyTest
        })
 
 
results.sort(key=lambda x: x["difference"])
df = pd.DataFrame(results)
df

Here I wrote a method that goes over all combinations of settings (max depth and estimators) to see which works best. My conclusions:

At 5 combinations, the accuracy on the test data is higher than the accuracy on the training data, which is interesting.

I also find it interesting that increasing the depth doesn’t consistently make the difference smaller or bigger. It looks a bit random (in my eyes).

I also find it interesting that more trees doesn’t mean a bigger or smaller difference in accuracy between the 2 datasets.

You can filter the results table on specific columns by using PyCharm’s Jupyter Notebook support.
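If you are not using PyCharm, plain pandas filtering and sorting does the same job; for example, to show only the combinations where train and test accuracy are close:

close_fits = df[df['difference'].abs() < 0.02]  # combos with a near-zero train/test gap
close_fits.sort_values('accuracyTest', ascending=False).head(10)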