One example of machine learning in use is Man Group’s AHL Dimension, a billion-dollar hedge fund program that is partly driven by artificial intelligence. After the AI went live in 2016, machine learning algorithms accounted for more than half of the fund’s returns, despite managing only a small fraction of its assets.
In this article, we will look at the most useful machine learning algorithms that are actively used in the trading community and can serve as the basis for your own algorithm.
Linear Regression
Linear regression is perhaps one of the most well-known and understood algorithms in statistics and machine learning.
Predictive modeling is primarily about minimizing model error, or in other words, making predictions as accurate as possible. We will borrow algorithms from various fields, including statistics, and use them for these purposes.
Linear regression can be represented as an equation describing the straight line that best fits the relationship between the input variables X and the output variable Y. To build this equation, you need to find a coefficient B for each input variable.
Linear regression has been around for more than 200 years, and during that time it has been carefully studied. So here are a couple of rules of thumb: remove highly correlated variables and clean noise from the data where possible. Linear regression is a fast and simple algorithm that is well suited as a first algorithm to learn.
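To make the idea of "finding the coefficients" concrete, here is a minimal sketch for the simplest case of one input variable, fitted by ordinary least squares. The function names and the toy data are assumptions made for illustration only; they do not come from any trading system.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize squared error for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict_linear(b0, b1, x):
    """Evaluate the fitted line at a new input value."""
    return b0 + b1 * x


# Hypothetical toy data lying on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
forecast = predict_linear(b0, b1, 5.0)
```

With several input variables the same principle applies, but the coefficients are usually found with matrix methods or gradient descent rather than this closed form.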
Logistic Regression
Logistic regression is another algorithm that came to machine learning straight from statistics. It works well for binary classification problems, that is, problems in which the output is one of two classes.
Logistic regression is similar to linear regression in that it also requires you to find the values of the coefficients for the input variables. The difference is that the output value is transformed using a non-linear function called the logistic function.
The logistic function looks like a capital S and maps any value to a number between 0 and 1. This is useful because we can apply a rule to the output of the logistic function to snap it to 0 or 1 (for example, if the result of the function is less than 0.5, then the output is 0, otherwise it is 1) and so predict a class.
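The thresholding rule can be sketched in a few lines. This is an illustrative example that assumes the coefficients have already been found; the names `logistic` and `predict_class` and the 0.5 default threshold are choices made for this sketch.

```python
import math


def logistic(z):
    """Map any real number to the interval (0, 1) along an S-shaped curve."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(coefficients, intercept, features, threshold=0.5):
    """Return (class, probability) for one sample."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    probability = logistic(z)
    # Below the threshold the sample goes to class 0, otherwise to class 1.
    label = 1 if probability >= threshold else 0
    return label, probability


# Hypothetical coefficients for a model with a single input variable.
label_up, p_up = predict_class([1.0], 0.0, [2.0])
label_down, p_down = predict_class([1.0], 0.0, [-2.0])
```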
Linear Discriminant Analysis (LDA)
Logistic regression is used when a sample needs to be assigned to one of two classes. If there are more than two classes, then it is better to use the LDA (Linear discriminant analysis) algorithm.
The representation of LDA is quite simple. It consists of the statistical properties of the data calculated for each class. For each input variable, this includes:
- The mean value for each class;
- The variance calculated across all classes.
Predictions are made by calculating the discriminant value for each class and selecting the class with the largest value. It is assumed that the data has a normal distribution, so it is recommended to remove anomalous values (outliers) from the data before starting work. LDA is a simple and efficient algorithm for classification problems.
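As a rough sketch of how these statistics turn into a prediction, the example below handles a single input variable. It assumes the usual textbook form of the discriminant with a shared (pooled) variance and class priors estimated from class frequencies. The class labels and numbers are invented for illustration.

```python
import math
from collections import defaultdict


def fit_lda(xs, labels):
    """Estimate per-class means, class priors and one shared variance."""
    groups = defaultdict(list)
    for x, label in zip(xs, labels):
        groups[label].append(x)
    n = len(xs)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    priors = {k: len(v) / n for k, v in groups.items()}
    # Pooled variance: deviations from each sample's own class mean,
    # divided by (number of samples - number of classes).
    squared = sum((x - means[label]) ** 2 for x, label in zip(xs, labels))
    variance = squared / (n - len(groups))
    return means, priors, variance


def predict_lda(model, x):
    """Return the class with the largest discriminant value."""
    means, priors, variance = model
    scores = {
        k: x * means[k] / variance
        - means[k] ** 2 / (2 * variance)
        + math.log(priors[k])
        for k in means
    }
    return max(scores, key=scores.get)


# Hypothetical three-class data with one input variable.
xs = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 5.2, 4.8]
labels = ["down"] * 3 + ["flat"] * 3 + ["up"] * 3
lda_model = fit_lda(xs, labels)
```

Calling `predict_lda(lda_model, 3.2)` then picks the class whose discriminant is largest for that value.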
Decision Trees
A decision tree can be represented as a binary tree, familiar to many from algorithms and data structures. Each internal node represents an input variable and a split point for that variable (assuming the variable is numeric).
Leaf nodes hold the value of the output variable that is used for prediction. Predictions are made by traversing the tree to a leaf node and outputting the class value at that node.
Trees are fast to train and fast to make predictions. In addition, they are often accurate across a wide range of problems and do not require special data preparation.
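The traversal step can be illustrated with a small hand-built tree. This sketch only shows prediction, not how split points are learned. The features (a daily return and an oscillator value), the thresholds, and the class labels are all hypothetical.

```python
class Node:
    """Either a split (feature index + threshold) or a leaf (prediction)."""

    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.prediction = prediction  # set only on leaf nodes


def predict_tree(node, features):
    """Walk from the root to a leaf and return the class stored there."""
    while node.prediction is None:
        if features[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical tree: feature 0 is a daily return, feature 1 an oscillator.
tree = Node(
    feature=0, threshold=0.0,
    left=Node(prediction="sell"),
    right=Node(
        feature=1, threshold=50.0,
        left=Node(prediction="hold"),
        right=Node(prediction="buy"),
    ),
)

decision = predict_tree(tree, [1.0, 70.0])
```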
Naive Bayes Classifier
Naive Bayes is a simple yet remarkably effective algorithm.
The model consists of two kinds of probabilities that are calculated directly from the training data: the probability of each class, and the conditional probability of each input value given each class. Once this probabilistic model has been computed, it can be used to make predictions with new data using Bayes’ theorem. If your data is real-valued, then, assuming a normal distribution, calculating these probabilities is not particularly difficult.
Naive Bayes is called naive because the algorithm assumes that each input variable is independent of the others. This is a strong assumption that rarely holds for real data. Nevertheless, this algorithm is very effective for a number of complex tasks such as spam classification or handwritten digit recognition.
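For real-valued inputs, the normal-distribution variant can be sketched as follows. This is an illustrative implementation with invented data. It works with log-probabilities to avoid numerical underflow and adds a small variance floor, both of which are choices made for this sketch.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Per class: prior probability plus (mean, variance) of every input."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_nb(model, row):
    """Pick the class with the highest log-prior plus summed log-likelihood."""
    scores = {}
    for label, (prior, stats) in model.items():
        # The "naive" step: per-variable likelihoods are simply added in
        # log space, as if the inputs were independent.
        likelihood = sum(log_gaussian(x, m, v) for x, (m, v) in zip(row, stats))
        scores[label] = math.log(prior) + likelihood
    return max(scores, key=scores.get)


# Hypothetical two-class data with two input variables.
rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
labels = ["a", "a", "a", "b", "b", "b"]
nb_model = fit_gaussian_nb(rows, labels)
```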
K-nearest neighbors (KNN)
K-nearest neighbors is a very simple and very effective algorithm. The KNN (K-nearest neighbors) model is represented by the entire training dataset. Pretty simple, right?
Prediction for a new point is made by finding the K nearest neighbors in the data set and summarizing the output variable for those K instances: the mean for regression, or the most common class for classification.
The only question is how to determine the similarity between data instances. If all the features have the same scale (for example, centimeters), then the easiest way is to use the Euclidean distance, a number that can be calculated from the differences between the two instances on each input variable.
KNN may require a lot of memory to store all the data, but it does no work up front: all computation happens only when a prediction is needed. The training data can also be updated to keep the predictions accurate over time.
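A brute-force version of this procedure fits in a few lines. The sketch below assumes all features share the same scale and uses invented data. It covers both cases mentioned above: a majority vote for classification and a mean for regression.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_targets, query, k=3, regression=False):
    """Summarize the targets of the k training rows closest to the query."""
    ranked = sorted(
        zip(train_rows, train_targets),
        key=lambda pair: euclidean(pair[0], query),
    )
    targets = [target for _, target in ranked[:k]]
    if regression:
        return sum(targets) / len(targets)  # mean of the neighbors
    return Counter(targets).most_common(1)[0][0]  # most common class


# Hypothetical data: two well-separated groups of points.
train_rows = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_classes = ["low", "low", "low", "high", "high", "high"]
train_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_predict(train_rows, train_classes, [1.5, 1.5], k=3)
predicted_value = knn_predict(train_rows, train_values, [1.5, 1.5], k=3,
                              regression=True)
```

Sorting the whole training set for every query is the simplest approach, and it makes the memory and compute cost of KNN easy to see.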
Learning Vector Quantization (LVQ)
The disadvantage of KNN is that you need to store the entire training dataset. If KNN performs well on your data, then it makes sense to try the LVQ (Learning vector quantization) algorithm, which does not have this drawback.
LVQ is a set of code vectors. They are chosen randomly at the beginning and, over a certain number of iterations, are adapted so that they best summarize the entire data set. After training, these vectors can be used for prediction in the same way as in KNN. The algorithm searches for the nearest neighbor (the best-fit code vector) by calculating the distance between each code vector and the new data instance. The class (or number in the case of regression) of the best-fit vector is then returned as the prediction. The best result can be achieved if all the data is in the same range, for example from 0 to 1.
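The adaptation step can be sketched with the basic LVQ1 update rule: the best-fit code vector is pulled toward a training instance of the same class and pushed away from one of a different class. Initializing each code vector from a random training instance of its class, the linearly decaying learning rate, and all parameter values are assumptions of this sketch.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_match(codebook, row):
    """Return the code vector entry closest to the given row."""
    return min(codebook, key=lambda e: squared_distance(e["vector"], row))


def train_lvq(rows, labels, vectors_per_class=1, epochs=20,
              learning_rate=0.3, seed=0):
    rng = random.Random(seed)
    codebook = []
    # Assumed initialization: random training instances of each class.
    for label in sorted(set(labels)):
        candidates = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(vectors_per_class):
            codebook.append({"vector": list(rng.choice(candidates)),
                             "label": label})
    for epoch in range(epochs):
        rate = learning_rate * (1.0 - epoch / epochs)  # decays toward zero
        for row, label in zip(rows, labels):
            bmu = best_match(codebook, row)
            # Attract on a class match, repel on a mismatch.
            sign = 1.0 if bmu["label"] == label else -1.0
            bmu["vector"] = [v + sign * rate * (x - v)
                             for v, x in zip(bmu["vector"], row)]
    return codebook


def predict_lvq(codebook, row):
    return best_match(codebook, row)["label"]


# Hypothetical data already scaled to the range 0..1.
rows = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
        [0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]
labels = [0, 0, 0, 1, 1, 1]
codebook = train_lvq(rows, labels)
```

Only the code vectors need to be kept after training, which is why LVQ needs far less memory than KNN.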
When beginners see all the variety of algorithms, they ask the standard question: “Which one should I use?” The answer to this question depends on many factors:
- Size, quality, and nature of data;
- Available computing time;
- The urgency of the task;
- What you want to do with the data.
Even an experienced data scientist won’t tell you which algorithm will work best before trying several options. There are many other machine learning algorithms, but the ones above are the most popular. If you are just getting started with machine learning, then they are a good starting point.