My Journey into Data Science and Machine Learning

Hey everyone!

My Python knowledge is limited. Since I started coding more than ten years ago, I've relied heavily on PHP, Laravel, MySQL, and VueJS.

But do you know what I've been thinking about for the past six months? I love data science and machine learning! I've been going from one lesson to another, but I still haven't worked out how to learn everything. After six months, I'm still in the same place I started.

But a new year means a fresh start for me, right? I've promised myself that I will only take one course and finish it. What's more? I'm going to write down everything I learn and share it with everyone. I'm going to tell you about the problems I have and the ways I solve them.

So, I took the plunge and invested in a Udemy course called "Complete A.I. & Machine Learning, Data Science Bootcamp" by the awesome duo Andrei Neagoie and Daniel Bourke from zerotomastery.io.

Week 1: Diving into the Machine Learning Basics

In the first week of my learning journey, I covered the fundamentals of Machine Learning through the Introduction module. It explained what Machine Learning is, how it is classified, and how it relates to Artificial Intelligence and Data Science.

  • Some tasks are done very well by humans, like identifying cats, dogs, etc.
  • Some tasks are done very well and quickly by computers, like calculations.

So Machine Learning is a process by which we make computers act more like humans, because the smarter they get, the more they help us accomplish our goals.

AI / Machine Learning / Data Science

AI: Imagine AI as machines trying to be as smart as us humans! Right now, we mostly have Narrow AI. It's like having super-smart machines that are really good at specific jobs. For example, some can spot heart or eye problems in pictures. But here's the catch: these machines are like superheroes who excel in just one special power. They can't do lots of different things like humans. We call that super-smart, do-everything kind of AI General AI. That's the next big goal for the smart machines!

Machine Learning: Machine learning is like a special way these machines learn and become smarter. It's the science of teaching computers to do things without telling them exactly what to do. Stanford University describes it as "the science of getting computers to act without being explicitly programmed".

Deep Learning or Deep Neural Networks: Deep learning is a cool technique for making machines smarter. It's one of the techniques within machine learning. Think of it as a superhero move in the world of smart machines!

Data Science: Ever heard of data science? It's all about looking at information, figuring out patterns, and doing something useful with it—usually to achieve some big goal in business.

If you are a data scientist, you need to know machine learning. If you are a machine learning expert, you need to know data science.

How the world got into Data Science and Machine Learning

Let's take a journey through the evolution of data handling!

Spreadsheets Era: Back in the day, data lived in spreadsheets like Excel or CSV files. Think customer and sales data neatly organized in an Excel file. Businesses used these to make important decisions.

Relational Database Revolution: As companies grew, so did their data. Handling it in Excel became a hassle. Enter relational databases like MySQL, queried with the SQL language. Now, reading and writing data for business decisions became a breeze!

Big Data Explosion: Fast forward to the 2000s, where giants like Facebook, Amazon, Google, and Twitter generated insane amounts of data. Spreadsheets couldn't handle it anymore, and Relational Databases struggled with messy, unstructured data. That's when NoSQL, like MongoDB, stepped in to store and make sense of the unorganized data.

Machine Learning Era: With the explosion of data, humans found it impossible to visualize and make decisions. Now, we use machines to analyze massive data sets more easily, quickly, and precisely than our human brains ever could. It's like having super-smart helpers in the world of data and decisions!

Types of Machine Learning

  1. Supervised Learning: The data we receive already has labels. E.g., a CSV file where every row comes with the known answer (the category or value we want to predict).

    1. Classification: Here we predict a category, like whether a fruit is an apple or an orange.
    2. Regression: Here we predict a number based on the input. E.g., predicting stock prices.
  2. Unsupervised Learning: The data we receive is not labelled. E.g., a CSV file with rows and columns but no labels. We can't tell the machine whether it is right or wrong, the way we can with apples versus pears, since there are no true categories. Instead, we let the machine create the categories for us.

    1. Clustering: We give it a bunch of data points, and then the machine decides: this is Group A, this is Group B, and this is Group C.
    2. Association Rule: Where we associate different things to predict what a customer might buy in the future when groups don't exist.
  3. Reinforcement: It's all about teaching machines through trial and error, through rewards and punishment. It's like a program learning to master a game by playing it over and over.
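
To make the difference between the first two types more concrete, here is a minimal pure-Python sketch. The fruit weights, the labels, and the helper names (`predict_label`, `two_clusters`) are all made up for illustration and do not come from the course. The supervised half copies the label of the closest labelled example, and the unsupervised half groups unlabelled numbers with a tiny two-cluster k-means.

```python
# Supervised: every training example comes with a label (made-up data).
labelled_fruit = [(150, "apple"), (160, "apple"), (300, "orange"), (310, "orange")]

def predict_label(weight, examples):
    """Classify by copying the label of the closest labelled example."""
    closest = min(examples, key=lambda pair: abs(pair[0] - weight))
    return closest[1]

# Unsupervised: the same kind of measurements, but without any labels.
unlabelled_weights = [150, 160, 300, 310, 155, 305]

def two_clusters(values, rounds=10):
    """Split 1-D values into two groups with a tiny k-means (k = 2).

    Assumes the list contains at least two distinct values.
    """
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(rounds):
        group_a = [v for v in values if abs(v - low) <= abs(v - high)]
        group_b = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_a) / len(group_a)    # move each centre to the mean
        high = sum(group_b) / len(group_b)   # of the points assigned to it
    return sorted(group_a), sorted(group_b)

print(predict_label(155, labelled_fruit))
print(two_clusters(unlabelled_weights))
```

Notice that the clustering function never sees the words "apple" or "orange". It only finds that the numbers fall into two groups, and naming those groups is left to us.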

Machine Learning & Data Science Framework

The framework I learned breaks machine learning down into six simple steps. Let's go through them!

  1. Problem Definition: The process begins by understanding the problem we aim to resolve. Can we solve this puzzle using either supervised or unsupervised learning? Are we diving into classification or regression? Establishing the problem lays the foundation for our journey into machine learning.

  2. Data: The question we're trying to answer here in step two is: What kind of data do we have? Depending on the problem, there are different kinds of data: structured data such as rows and columns, or what you'd expect to find in an Excel spreadsheet, or unstructured data such as images or audio. Once we know what kind of data we have, we can start to make decisions on how to use machine learning with it.

  3. Evaluation: Here we will define what success means to us. What does success look like for our project? If we're predicting house prices, maybe we aim for a jaw-dropping 95% accuracy. Defining success guides our machine learning journey.

  4. Feature: One question we answer here is what do we already know about the data? For example, for predicting whether or not someone has heart disease, you might use their body weight as a feature. Since body weight is a number, it's called a numerical feature. And after talking to a doctor, they might tell you if someone's body weight is over a certain number, they're more likely to have heart disease.

  5. Modelling: Once you've learned a little bit about your data, the next step is to model it. The question here is: based on our problem and data, what machine learning model should we use? Unlike other algorithms and sets of instructions that you have to write from scratch, many of the most useful machine learning algorithms have already been coded for you, which is beautiful for us. Some models work better on different problems than others, and in the beginning, your focus will be to figure out the right model for the right kind of problem.

  6. Experimentation: All of the steps we've just been through happen in a cycle. You might start out with one problem definition and find your data isn't suited to it. Then you might build a model and find it doesn't work as well as you outlined in your evaluation metric. So you build another one, and you find out that this one actually works pretty well.

Let's look at each of these six steps in more detail.

1. Problem Definition

Before we start using machine learning, we must know when we shouldn't use it.

If a simple hand-coded, instruction-based system works, then we should favor the simpler system over the machine learning system.

For example, if we knew exactly what to do to make our favorite chicken dish and had all the ingredients, it would probably be better to use a simple method than machine learning to try to figure out the steps.

2. Data

  1. Structured Data: Structured data is what you'd expect to see in an Excel file. For example, rows and columns of different patient medical records that show if they have heart disease or not, or customer purchase transactions. It's called "structured data" because most of the samples, like patient records, are in the same format. For example, one column might have numbers of a certain type, like a patient's normal blood pressure, sex, or weight.

  2. Unstructured Data: Unstructured data are things like images, natural language text (such as transcribed phone calls), videos and audio files.

  3. Static Data: The kind of data that doesn't change over time is called static data. You might have a spreadsheet with patient information in a CSV format. CSV stands for "comma-separated values," which means that all the different types of data are in one file and are separated by commas.

  4. Streaming Data: Data that changes all the time is called streaming data. Let's say you wanted to use news stories to guess how the price of a stock will change. Since news stories are published and updated all the time, you'd be working with streaming data. You'd first want to see how those stories affect stock prices.
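
As a small illustration of static, structured data, here is how a CSV file could be read with Python's built-in `csv` module. This is only a sketch: the column names and values are invented for the example, and `io.StringIO` stands in for a real file on disk.

```python
import csv
import io

# Stand-in for a CSV file on disk; the columns and values are made up.
csv_text = """id,weight_kg,sex,heart_rate
4568,110,M,82
4569,70,F,60
"""

# DictReader uses the first row as column names and yields one dict per record.
records = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values always arrive as strings, so numeric columns need converting.
for record in records:
    record["weight_kg"] = float(record["weight_kg"])
    record["heart_rate"] = int(record["heart_rate"])

print(records[0])
```

With a real file, you would pass the result of `open("patients.csv", newline="")` to `csv.DictReader` instead of the `StringIO` object (the filename here is hypothetical).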

3. Evaluation

Every machine learning problem we come across will have a similar goal: finding insights in data to predict the future in some way. An evaluation metric is a measure of how well a machine learning algorithm predicts the future.

And in this step, the question you want to answer is What defines success for us?

For example, if your problem is to use patient medical records to classify whether someone has heart disease or not, you might start by saying, for this project to be valuable, we need a machine learning model with over 99% accuracy because predicting whether or not a patient has heart disease is an important task.

Different Types of Metrics

Classification | Regression | Recommendation
Accuracy | Mean absolute error (MAE) | Precision at K
Precision | Mean squared error (MSE) |
Recall | Root mean squared error (RMSE) |
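
To get a feel for what the classification and regression metrics measure, here is a minimal pure-Python sketch of each one. The labels and prices are made-up example data, and in practice a library would provide these functions, so treat this as an illustration of the definitions rather than the way the course computes them.

```python
import math

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive="Yes"):
    """Of everything predicted positive, how much really was positive.

    Assumes at least one positive prediction.
    """
    truths = [t for t, p in zip(y_true, y_pred) if p == positive]
    return sum(t == positive for t in truths) / len(truths)

def recall(y_true, y_pred, positive="Yes"):
    """Of everything really positive, how much the model found.

    Assumes at least one truly positive sample.
    """
    preds = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds) / len(preds)

def mae(y_true, y_pred):
    """Mean absolute error: average size of the mistakes."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: like MAE, but large mistakes count much more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the original units."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up classification example: does the patient have heart disease?
true_labels = ["Yes", "No", "No", "Yes"]
pred_labels = ["Yes", "No", "Yes", "No"]
print(accuracy(true_labels, pred_labels),
      precision(true_labels, pred_labels),
      recall(true_labels, pred_labels))

# Made-up regression example: predicted prices versus real prices.
true_prices = [100.0, 200.0, 300.0]
pred_prices = [110.0, 190.0, 330.0]
print(mae(true_prices, pred_prices),
      mse(true_prices, pred_prices),
      rmse(true_prices, pred_prices))
```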

4. Features

What do we already know about the data?

Feature is another word for the different forms of data. We learned about different kinds of data earlier, such as structured and unstructured, but features refer to the different forms of data within structured or unstructured data.

a. Structured Data

For Example: Heart Rate Data

ID | Weight | Sex | Heart Rate | Chest Pain | Heart Disease
4568 | 110 kg | M | 82 | 4 | Yes
4569 | 70 kg | F | 60 | 1 | No
4570 | 78 kg | M | 57 | 0 | No

Feature Variables are: Weight, Sex, Heart Rate, Chest Pain
Target Variable is: Heart Disease

Now, when we talk about Feature Variables, there are different kinds.

Numerical Features: Weight, Heart Rate, Chest Pain
Categorical Features: Sex, Heart Disease

There is another kind of feature called a Derived Feature. This is created based on some other data.

ID | Weight | Sex | Heart Rate | Chest Pain | Heart Disease | Visited Last Year
4568 | 110 kg | M | 82 | 4 | Yes | Yes
4569 | 70 kg | F | 60 | 1 | No | Yes
4570 | 78 kg | M | 57 | 0 | No | No

Derived Feature: Visited Last Year

The process of deriving a feature from other data is called Feature Engineering. Feature Coverage describes how many samples have a value for a given feature. In an ideal data set, all samples have similar information, so every feature has full coverage.
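
Here is a small sketch of both ideas in plain Python. The `last_visit_year` field, the reference year, and the function names are hypothetical, invented only to show how a "Visited Last Year" feature could be derived from raw data and how coverage could be measured.

```python
# Hypothetical patient records; "last_visit_year" is an invented raw field
# used only to show how a derived feature could be built.
patients = [
    {"id": 4568, "weight_kg": 110, "sex": "M", "last_visit_year": 2023},
    {"id": 4569, "weight_kg": 70, "sex": "F", "last_visit_year": 2023},
    {"id": 4570, "weight_kg": 78, "sex": "M", "last_visit_year": None},
]
CURRENT_YEAR = 2024  # assumed reference year for this example

def add_visited_last_year(records, current_year):
    """Feature engineering: derive a yes/no feature from the raw visit year."""
    for record in records:
        record["visited_last_year"] = record["last_visit_year"] == current_year - 1
    return records

def feature_coverage(records, feature):
    """Fraction of samples that have a non-missing value for a feature."""
    present = sum(1 for record in records if record.get(feature) is not None)
    return present / len(records)

add_visited_last_year(patients, CURRENT_YEAR)
print([p["visited_last_year"] for p in patients])
print(feature_coverage(patients, "weight_kg"))
print(feature_coverage(patients, "last_visit_year"))
```

A feature with low coverage (here, the visit year is missing for one patient in three) is harder for a model to learn from than one that is filled in for every sample.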

So the Structured Data Tree looks like the following:

  1. Feature Variables
    1. Numerical Features
    2. Categorical Features
  2. Target Variable
  3. Derived Feature
b. Unstructured Data

Unstructured data has features too, but they are a little less obvious. Take dog images, for example: most dogs will have four legs, two eyes, one tail, etc.

5. Modeling

Based on our problem and data, what model should we use ?

Modeling can be broken down into 3 parts:

  1. Choosing and training a model
  2. Tuning a model
  3. Model comparison

Underlying all three parts is one of the most important concepts in Machine Learning: splitting the data into a training set, a validation set and a test set. This is also called the 3 Sets.

Let us understand why this is important with an example:

We study using the course materials. Before sitting the final exam, we take some practice exams to validate our knowledge. Once we are ready, we go for the final exam.

Course Material (Training Set) => Practice Exam (Validation Set) => Final Exam (Test Set)

When do things go wrong?

Consider what happens if the practice exam questions are exactly the same as the final exam questions. Everyone will have already seen them, and everybody will pass with great marks.

But did they learn anything new? NO

Course Material (Training Set) => Practice Exam (Same As Final Exam) => Final Exam (Already Seen It)

So in machine learning we don't want a memorizing machine that knows only the info we provided. The model should be able to handle and make predictions on future, unseen data. This is why the data is split into these three sets during modelling.

Let us discuss the above 3 parts in detail:
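
As a rough sketch, the three sets can be created by shuffling the samples once and cutting the list into three pieces. The 70/15/15 proportions, the seed, and the function name `three_way_split` are assumptions chosen for the example, not fixed rules from the course.

```python
import random

def three_way_split(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle once, then cut into training, validation and test sets.

    The 70/15/15 proportions are an assumed default, not a rule.
    """
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    train_end = round(len(shuffled) * train_frac)
    val_end = train_end + round(len(shuffled) * val_frac)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

patient_ids = list(range(100))  # stand-in for 100 samples
train_set, val_set, test_set = three_way_split(patient_ids)
print(len(train_set), len(val_set), len(test_set))
```

Because each sample lands in exactly one of the three lists, the "final exam" never contains questions from the course material or the practice exam.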

1. Choosing and training a model:

In this part we will be working with the "Training Data Set" of the data we have.

Why do we need to choose a model? Because, as we learned earlier, there is Structured Data and Unstructured Data. Some models work best with Structured Data and some are best for Unstructured Data.

  1. Structured Data -> Decision-tree-based models such as Random Forest, and gradient boosting algorithms like CatBoost and XGBoost, tend to work best.

  2. Unstructured Data -> Deep Learning, Neural Networks and Transfer Learning tend to work best.

Things to Remember :

  • Some models work better than others on different problems.
  • Don't be afraid to try things.
  • Start small and build up (add complexity) as you need.

2. Tuning a Model:

In this part we will be working with the "Validation Data Set" of the data we have.

Here we make adjustments to the model to get the desired outcome. For example, if we are using Random Forest we can adjust the number of trees, and if we are using Neural Networks we can adjust the number of layers.

Things to Remember :

  • Machine learning models have hyperparameters that we can adjust.
  • A model's first results aren't its last.
  • Tuning can take place on the Training or Validation Data Sets.
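
Here is a minimal sketch of what tuning can look like. Instead of a Random Forest, it uses a tiny k-nearest-neighbours classifier written in plain Python, because its single hyperparameter `k` (how many neighbours get a vote) is easy to follow. The weights, labels and function names are made up for illustration.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_accuracy(train, data, k):
    """Share of samples in `data` that the model labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Made-up (weight_kg, heart_disease) pairs, purely for illustration.
train_set = [(60, "No"), (65, "No"), (72, "No"), (88, "No"),
             (95, "Yes"), (100, "Yes"), (110, "Yes")]
val_set = [(68, "No"), (90, "Yes"), (105, "Yes"), (75, "No")]

# k is the hyperparameter: try a few values and keep the one that
# scores best on the validation set (ties go to the smaller k).
scores = {k: knn_accuracy(train_set, val_set, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The model only ever learns from `train_set`, while `val_set` is used purely to compare hyperparameter values. The test set stays out of this loop entirely.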

3. Model comparison

"How will our model perform in the real world ?"

In this part we will be working with the "Test Data Set" of the data we have.

After we do all the hard work in the previous steps, it's time to see how our model performs on the test set. The test set is like the final exam for machine learning models. Since our model has not seen the test set yet, evaluating the model on it is a good way to see how it generalizes.

Generalizing means how well the model adapts to data it has not seen before. Example: if a new patient arrives whose data our model has not seen, will it be able to predict their heart problems accurately?

The performance on the test set will usually differ from the performance on the Training and Validation data sets. How big the difference is matters:

Acceptable:

Data Set Performance
Training 98 %
Test 96 %

Here, the slight decline is acceptable.

Not Acceptable:

Data Set Performance
Training 64 %
Test 47 %

Here, the model scores poorly even on the training data and worse still on the test data. This is called potential Underfitting.

Not Acceptable:

Data Set Performance
Training 93 %
Test 99 %

Here, the test performance is higher than the training performance. This is called potential Overfitting.

Reasons for Underfitting & Overfitting

  1. Data Leakage: It happens when some of the Test Data leaks into the Training Data. This results in Overfitting.
  2. Data Mismatch: It happens when the data we are testing on is different from the data we trained on, e.g. the two sets of data have different features. This results in Underfitting.
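
Both problems can be checked for before any training happens. The sketch below assumes each sample is a dictionary with an `id` field. The records and the helper names are hypothetical and only meant to show the idea.

```python
def find_leaked_ids(train_rows, test_rows, key="id"):
    """Data leakage check: IDs that appear in both the training and test set."""
    train_ids = {row[key] for row in train_rows}
    test_ids = {row[key] for row in test_rows}
    return sorted(train_ids & test_ids)

def find_feature_mismatch(train_rows, test_rows):
    """Data mismatch check: feature names present in only one of the sets."""
    train_features = set().union(*(row.keys() for row in train_rows))
    test_features = set().union(*(row.keys() for row in test_rows))
    return sorted(train_features ^ test_features)

# Made-up records: patient 4569 is in both sets, and the test set
# is missing the heart_rate feature.
train_rows = [
    {"id": 4568, "weight_kg": 110, "heart_rate": 82},
    {"id": 4569, "weight_kg": 70, "heart_rate": 60},
]
test_rows = [
    {"id": 4569, "weight_kg": 70},
    {"id": 4570, "weight_kg": 78},
]

print(find_leaked_ids(train_rows, test_rows))
print(find_feature_mismatch(train_rows, test_rows))
```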

Fixes for Underfitting & Overfitting

  1. Underfitting:

    • Try a more advanced model
    • Increase model hyperparameters
    • Reduce amount of features
    • Train longer
  2. Overfitting:

    • Collect more data
    • Try a less advanced model

Things to remember:

  • We want to avoid Overfitting and Underfitting. Head towards generality.
  • Keep the test set separate at all costs.
  • Compare apples to apples. E.g., if Model 1 is trained on Data Set 1, then train Model 2 on Data Set 1 as well.
  • The best score on one performance metric does not make a model the best model.

6. Experimentation

How could we improve, and what can we try next?

In this step we have to make a decision. Are the results we are getting good enough for us? Or can we make some changes, like updating the data sets or switching to another model, and look for better results?

Once we decide to make changes, we start again from the "Problem Definition".

Tools we will be using.

Next week we will be learning many things. Here is an overview of what we are going to learn.

1. Setting up machine.

  1. Install Anaconda
  2. Install other machine learning tools like:
    1. Jupyter
    2. NumPy
    3. Pandas
    4. Matplotlib
    5. Scikit-learn
    6. TensorFlow
    7. CatBoost
    8. XGBoost
    9. PyTorch
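
Once everything is installed, a quick way to confirm that Python can find each tool is to look up its import name. This is only a sketch: the `check_installed` helper is my own, and the names in `TOOLS` are the usual import names for these libraries (for example, Scikit-learn is imported as `sklearn` and PyTorch as `torch`).

```python
import importlib.util

# Import names for the tools listed above.
TOOLS = ["jupyter", "numpy", "pandas", "matplotlib", "sklearn",
         "tensorflow", "catboost", "xgboost", "torch"]

def check_installed(module_names):
    """Map each module name to True if Python can find it (without importing it)."""
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}

for name, found in check_installed(TOOLS).items():
    print(f"{name:<12} {'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed into the Anaconda environment before the next lesson.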