Learn Machine/Statistical Learning in 5 min

Chia-Hui (Alice) Liu
5 min read · Oct 15, 2018
Figure 1. PC: outsourceworkers.com.au

I bet y'all have heard (a lot) about machine/statistical learning (abbreviated as ML/SL) in your daily life for several years. So, how would you define ML/SL? If you needed to briefly introduce ML/SL to people who are complete newbies, what would you talk about?

Let's start with the term "statistics". Statistics are numerical summaries used to describe sample data. Statistical learning, literally speaking, means learning from data in statistical ways. From my point of view, ML/SL is a vast set of (programming/statistical) tools for us to understand data.

Then, how does ML/SL really learn from data? Based on whether or not we have an output variable as our target, ML/SL can be divided into two categories: supervised and unsupervised learning. Here's a comparison between the two.

Figure 2. Comparison between supervised and unsupervised learning

Now, you may wonder: if there is an output variable, there may also be input variables. You're absolutely correct! In ML/SL, variables normally consist of input variables and output variables. Input variables (denoted as X), also known as predictors/features or independent variables, are the ones we use to estimate the target. The target, in supervised learning, is the output variable (denoted as Y), also known as the outcome/response or dependent variable.

In other words, in ML/SL we use a set of input variables, denoted X, to estimate the target, denoted Y.

So, how do we estimate our target? Here's the process of learning.

Figure 3. Learning Process (PC: Learning From Data, 2012.)

To help you understand the process, I'm going to use a very simple example: predicting house prices in Austin. Imagine there is a dataset containing 1,000 data points of house prices in Austin (denoted as D). In this case, the house prices in Austin are our target (also called the dependent variable/outcome/response, denoted as Y). Then, let's come up with three possible factors that may affect house prices.

  • Size of the house
  • Postal code: the location of a house is important
  • Number of schools nearby: schools nearby may be important when considering buying a house

So, the above three factors are our input variables (denoted collectively as X) that we use to estimate/predict house prices in Austin (Y). Here are the steps.

Step#1. Come up with a hypothesis (H)

Here, we can make a hypothesis that the three factors (size of the house, postal code, number of schools) affect house prices in Austin. That is, house prices in Austin are determined by these three features.

Now that we have the hypothesis, imagine there is a true unknown function (denoted as f) that perfectly estimates the average house price in Austin from the three input variables. What else do we have? Yes! We have data containing a thousand data points, each with its own house size, postal code, number of schools, and house price.
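To make this concrete, here is a minimal sketch in Python of what such a dataset D could look like. Everything in it is invented for illustration (the postal codes, the ranges, and especially the price formula, which is just a made-up stand-in for the true unknown function f); it is not real Austin data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # the 1,000 data points in D

# Input variables X: the three factors we hypothesized
size = rng.uniform(800, 4000, n)                            # size of the house (sq ft)
postal_code = rng.choice([78701, 78704, 78745, 78759], n)   # location proxy
num_of_schools = rng.integers(0, 6, n)                      # schools nearby

# An invented stand-in for the true unknown function f, plus noise.
# In reality f is unknown; this formula is purely illustrative.
price = 0.15 * size + 20 * num_of_schools + rng.normal(0, 25, n)  # price in $1000s

D = pd.DataFrame({
    "size": size,
    "postal_code": postal_code,
    "num_of_schools": num_of_schools,
    "price": price,  # output variable Y
})
print(D.head())
```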

Step#2. Use ML/SL and the data to approximate the true unknown function.

As the true function is unknown, we need to use the data and learning algorithms (e.g. linear regression) to obtain a function (denoted as g in Figure 3) that approximates the true unknown function.

In practice, we try different learning algorithms (such as support vector machines, linear regression, and neural networks) to approximate the true unknown function.
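Here is a rough sketch of what "trying different learning algorithms" might look like in code, using scikit-learn. The synthetic features and target below are my own assumptions, mirroring the house-price example; each fitted model is its own approximation g of the unknown f.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 1_000
# Synthetic inputs X (size, num_of_schools) and target Y (price in $1000s) -- illustrative only.
X = np.column_stack([rng.uniform(800, 4000, n), rng.integers(0, 6, n)])
y = 0.15 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, n)

# Each algorithm yields a different candidate approximation g of f.
models = {
    "linear regression": LinearRegression(),
    "support vector machine": make_pipeline(StandardScaler(), SVR()),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: R^2 on the training data = {model.score(X, y):.3f}")
```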

By now, you may wonder why we want to estimate f. Roughly speaking, there are two purposes.

  1. Inference
    When making inferences is the purpose, the direction/magnitude of the effect of the predictors (X) on the outcome (Y) is of interest.
  2. Prediction
    When making predictions is the purpose, only the predicted values of the outcome variable (Y) are of interest. (A small sketch of the difference follows this list.)
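Here's a quick sketch of that difference in code, again with invented data: the fitted numbers are whatever the model produces on the synthetic sample, not real estimates of Austin prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1_000
X = np.column_stack([rng.uniform(800, 4000, n), rng.integers(0, 6, n)])  # size, num_of_schools
y = 0.15 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 25, n)                 # price in $1000s

g = LinearRegression().fit(X, y)

# Inference: the direction/magnitude of each predictor's effect on Y is of interest.
print("effect of size (per sq ft, in $1000s):", g.coef_[0])
print("effect of num_of_schools (in $1000s):", g.coef_[1])

# Prediction: only the predicted value of Y for a new house is of interest.
new_house = np.array([[2_000, 3]])  # 2,000 sq ft, 3 schools nearby
print("predicted price (in $1000s):", g.predict(new_house)[0])
```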

There are many different learning algorithms, and it is okay to try any of them to find the most suitable one. However, there is an interpretability-flexibility trade-off among algorithms.

Figure 4. Flexibility-Interpretability Tradeoff (PC: Book Statistical Learning Figure2.7)

As you can see in Figure 4, when you increase flexibility by choosing a more complicated learning algorithm, interpretability decreases.

Last but not least, since you are free to choose any learning algorithm, model selection becomes critical. Here are three concerns you should keep in mind.

Figure 5. Model concerns (PC: Daniel Saunders)
  1. Overfitting:
    Overfitting occurs when we obtain an estimated function that fits the training data too well but fits the test data poorly.
  2. Training and testing errors:
    Training error refers to the unexplained variance on the training dataset; conversely, test error is the unexplained variance on the test dataset. With more complex models, you are able to reduce the training error; however, you may get a higher test error since you might start running into overfitting. (See the sketch after this list.)
    P.S. The training dataset refers to the data we use to estimate the target function, and the test dataset is used to test how the estimated function performs on data it has not seen.
Figure 6. Train and test errors (source)
  3. Bias-Variance Tradeoff:
    If we obtain an estimated function flexible enough to explain most of the variance in the dataset (low bias), it is unlikely to also have low variance. On the other hand, a high-bias model is unlikely to have high variance.
Figure 7. Bias-Variance Tradeoff (From Understanding the Bias-Variance Tradeoff, by Scott Fortmann-Roe.)
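To see overfitting and the training/test error gap concretely, here is a minimal sketch on a one-dimensional synthetic dataset of my own invention (not the house data). It fits polynomials of increasing flexibility and compares training error with test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)  # true signal + noise

# Hold out a test set to measure how g performs on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):
    g = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, g.predict(X_train))
    test_err = mean_squared_error(y_test, g.predict(X_test))
    print(f"degree {degree:>2}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```

Typically the most flexible fit drives the training MSE down while the test MSE stops improving or grows again, which is exactly the overfitting pattern in Figure 6.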

That's a brief introduction to machine/statistical learning. Next time, I'll talk about different ways of estimating the function and measuring the quality of fit (MSE and the Bias-Variance Tradeoff).

Thank you for reading, and please give me any advice if you think there's room to improve! Or give me a clap if this article helped you! 😊
