Machine Learning Foundation (part 2 of 2)

Following the previous blogpost, I will continue to explain some foundational ideas in ML:

Prototyping Algorithm

Define the problem

To oversimplify, the goal of an ML algorithm is:

Given a set of data points $x_1, x_2, \ldots, x_n \in X$, where each data point has $m$ features, $x_i \in \mathbb{R}^m$,
and an algorithm $A$ that takes the data points,
and a cost function $f(A, X)$,
we define an ML process that optimizes the outcome of $f(A, X)$.
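The framing above can be sketched in code. This is a hypothetical toy example (not from the course): the data points are rows of `X`, the algorithm is gradient descent on a linear model, and the cost function `f(A, X)` is mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))             # n data points, each x_i in R^m
true_w = np.array([2.0, -1.0, 0.5])     # illustrative "ground truth"
y = X @ true_w + rng.normal(scale=0.1, size=n)

def cost(w, X, y):
    """Mean squared error: the f(A, X) that the ML process optimizes."""
    return np.mean((X @ w - y) ** 2)

# The "ML process": gradient descent on the cost function.
w = np.zeros(m)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / n    # gradient of the cost w.r.t. w
    w -= lr * grad

print(cost(w, X, y))  # small residual error, near the noise floor
```

The fitted parameters `w` end up close to `true_w`; the same skeleton (data, hypothesis parameters, cost, optimizer) recurs across the algorithms discussed below.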

When prototyping an ML algorithm, we need answers to the following:

Note that in this blogpost, an algorithm refers to the abstract mathematical framework to form a hypothesis, whereas a hypothesis is the concrete model with parameters calculated using the algorithm and the given dataset.

Mainstream algorithms

Depending on whether the problem is a supervised or an unsupervised learning problem, the Coursera class introduces several mainstream algorithms:

Although each family of algorithms follows a different process, I find the general framework of these algorithms rather similar (input, output, hypothesis, cost function). The technical details of each algorithm are outside the scope of this blogpost.
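That shared framework can be made concrete with a small sketch. The class and names below are hypothetical, chosen only to show how input, hypothesis, cost function, and output fit together for one algorithm (least-squares linear regression):

```python
import numpy as np

class LinearRegression:
    """One concrete instance of the general framework:
    input -> hypothesis -> cost function -> output."""

    def fit(self, X, y):
        # Input: data points X and labels y.
        # Hypothesis: the parameter vector w, computed by least squares.
        self.w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self

    def cost(self, X, y):
        # Cost function f(A, X): mean squared error of the hypothesis.
        return np.mean((self.predict(X) - y) ** 2)

    def predict(self, X):
        # Output: predictions made by the hypothesis.
        return X @ self.w

# Tiny worked example: points on the line y = 1 + x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[1.0, 4.0]])))  # -> [5.]
```

A different algorithm would swap out the internals of `fit`, `cost`, and `predict`, but expose the same four-part shape.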

Evaluating Hypothesis

When different hypotheses are formed from a given dataset, we want a process that helps determine which hypothesis makes better predictions than the others. In general, a hypothesis comes to fruition in two steps: a training step and a testing step, each requiring its own dataset.

We split a whole dataset into three parts: training set, validation set, and test set. To quote from the Coursera class, the rationale of splitting the dataset three ways is:

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than any other data set.

As a rule of thumb, 60% of the whole dataset should be the training set, 20% the validation set, and the remaining 20% the test set, and the three datasets can be applied as below:
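The 60/20/20 split can be sketched as below. The helper function is hypothetical (not from the course material), but it shows the mechanics: shuffle once, then slice.

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Shuffle the data once, then split 60% / 20% / 20% into
    training, validation, and test sets."""
    n = len(X)
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

X = np.arange(20).reshape(10, 2)   # 10 toy data points, 2 features each
y = np.arange(10)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # -> 6 2 2
```

The training set fits the parameters, the validation set picks between competing hypotheses, and the test set is touched only once, for the final error estimate.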

Underfitting vs. Overfitting

When a hypothesis makes bad predictions, we need to understand what contributes to them. There are two main sources of error that can cause a bad prediction:


To oversimplify, the steps for troubleshooting an underperforming hypothesis are:

  1. Understand the cause of the underperformance (bias vs. variance).
  2. Apply techniques suited to that type of underperformance.
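Step 1 can be sketched as a simple comparison of training and validation error. The thresholds and names below are purely illustrative, a common rule of thumb rather than anything prescribed by the course:

```python
def diagnose(train_error, val_error, acceptable_error=0.1):
    """Rough heuristic: high bias shows up as a high training error;
    high variance shows up as a large gap between training and
    validation error."""
    if train_error > acceptable_error:
        return "high bias (underfitting)"
    if val_error - train_error > acceptable_error:
        return "high variance (overfitting)"
    return "acceptable"

print(diagnose(0.30, 0.32))  # -> high bias (underfitting)
print(diagnose(0.02, 0.25))  # -> high variance (overfitting)
print(diagnose(0.03, 0.05))  # -> acceptable
```

The diagnosis then guides step 2: underfitting suggests a richer model or more features, while overfitting suggests more data, fewer features, or regularization.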

The details of how to diagnose and improve the performance of a hypothesis are beyond the scope of this blogpost.

Large ML systems

Two major challenges are present when building a large ML system.

Several techniques were introduced in the class, including Stochastic Gradient Descent, Online Learning, and MapReduce. The goal of these techniques is to speed up the computing process, and the big ideas can be summarized below:
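As one example, Stochastic Gradient Descent can be sketched as follows: instead of computing the gradient over the whole dataset on every step (as batch gradient descent does), it updates the parameters from one example at a time. The toy linear-regression setup is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))          # 1000 data points, 2 features
w_true = np.array([1.5, -0.7])          # illustrative "ground truth"
y = X @ w_true + rng.normal(scale=0.05, size=1000)

w = np.zeros(2)
lr = 0.01
for epoch in range(5):
    # Visit the examples in a fresh random order each epoch.
    for i in rng.permutation(len(X)):
        # Gradient of the squared error on a SINGLE example i —
        # far cheaper per step than a full pass over the dataset.
        grad = 2 * (X[i] @ w - y[i]) * X[i]
        w -= lr * grad

print(w)  # approximately w_true
```

Each update is noisier than a full-batch gradient step, but on large datasets the parameters start improving after seeing only a handful of examples, which is the source of the speed-up.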
