A Gentle, Minimalist Introduction to Machine Learning
Hello everybody! Recently, I’ve been spending non-trivial amounts of time on the fascinating subject of artificial intelligence. It’s come a long way! With the release of Midjourney and ChatGPT, among other products, 2023 looks to be extremely promising, even revolutionary.
I’d like to recommend the following tutorial: https://realpython.com/python-ai-neural-network/
It is simple, sufficiently detailed, does not use TensorFlow, and produces a picture at the end!
The complete code to run the example is reproduced below:
import matplotlib.pyplot as plt
import numpy as np

class NeuralNetwork:
    def __init__(self, learning_rate):
        self.weights = np.array([np.random.randn(), np.random.randn()])
        self.bias = np.random.randn()
        self.learning_rate = learning_rate

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def _sigmoid_deriv(self, x):
        return self._sigmoid(x) * (1 - self._sigmoid(x))

    def predict(self, input_vector):
        layer_1 = np.dot(input_vector, self.weights) + self.bias
        layer_2 = self._sigmoid(layer_1)
        prediction = layer_2
        return prediction

    def _compute_gradients(self, input_vector, target):
        layer_1 = np.dot(input_vector, self.weights) + self.bias
        layer_2 = self._sigmoid(layer_1)
        prediction = layer_2

        derror_dprediction = 2 * (prediction - target)
        dprediction_dlayer1 = self._sigmoid_deriv(layer_1)
        dlayer1_dbias = 1
        dlayer1_dweights = (0 * self.weights) + (1 * input_vector)

        derror_dbias = (
            derror_dprediction * dprediction_dlayer1 * dlayer1_dbias
        )
        derror_dweights = (
            derror_dprediction * dprediction_dlayer1 * dlayer1_dweights
        )

        return derror_dbias, derror_dweights

    def _update_parameters(self, derror_dbias, derror_dweights):
        self.bias = self.bias - (derror_dbias * self.learning_rate)
        self.weights = self.weights - (
            derror_dweights * self.learning_rate
        )

    def train(self, input_vectors, targets, iterations):
        cumulative_errors = []
        for current_iteration in range(iterations):
            # Pick a data instance at random
            random_data_index = np.random.randint(len(input_vectors))

            input_vector = input_vectors[random_data_index]
            target = targets[random_data_index]

            # Compute the gradients and update the weights
            derror_dbias, derror_dweights = self._compute_gradients(
                input_vector, target
            )

            self._update_parameters(derror_dbias, derror_dweights)

            # Measure the cumulative error for all the instances
            if current_iteration % 100 == 0:
                cumulative_error = 0
                # Loop through all the instances to measure the error
                for data_instance_index in range(len(input_vectors)):
                    data_point = input_vectors[data_instance_index]
                    target = targets[data_instance_index]

                    prediction = self.predict(data_point)
                    error = np.square(prediction - target)

                    cumulative_error = cumulative_error + error
                cumulative_errors.append(cumulative_error)

        return cumulative_errors


input_vectors = np.array(
    [
        [3, 1.5],
        [2, 1],
        [4, 1.5],
        [3, 4],
        [3.5, 0.5],
        [2, 0.5],
        [5.5, 1],
        [1, 1],
    ]
)

targets = np.array([0, 1, 0, 1, 0, 1, 1, 0])

learning_rate = 0.01

neural_network = NeuralNetwork(learning_rate)

training_error = neural_network.train(input_vectors, targets, 1000)

plt.plot(training_error)
plt.xlabel("Iterations")
plt.ylabel("Error for all training instances")
plt.savefig("cumulative_error.png")
And I would like to add some commentary of my own to this great tutorial.
First, the author writes that the resulting error after training doesn’t decrease because the dataset is tiny, only 8 data points:
But of course, an astute student would note that by decreasing the learning rate and increasing the number of training iterations, we can slightly reduce the error, or, if not reduce the error itself, at least reduce its variance. The following is the plot of the error after decreasing the learning rate 10-fold and increasing the iterations 3-fold:
If we zoom in, the original error looks like this, where smaller is better:
So you can see the effect of reducing the learning rate on the error.
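For reference, here is a minimal sketch of that re-run, reusing the NeuralNetwork class and data from the code above. Only the two hyperparameters change (the output filename is my own choice):

learning_rate = 0.001  # 10-fold decrease from the original 0.01
neural_network = NeuralNetwork(learning_rate)

# 3-fold increase from the original 1000 iterations
training_error = neural_network.train(input_vectors, targets, 3000)

plt.plot(training_error)
plt.xlabel("Iterations")
plt.ylabel("Error for all training instances")
plt.savefig("cumulative_error_low_lr.png")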
My second comment is: what does all of this mean? Let’s plot the input data:
Red arrows are the vectors that should be categorized as “0”, green ones are those categorized as “1”. The blue arrow represents the learned weights of the network (there are only two, so I plot them as x, y).
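Here is roughly how such a plot can be produced with matplotlib’s quiver, assuming the input_vectors, targets, and trained neural_network objects from the code above (the exact styling of the figure may differ):

# Draw each input vector as an arrow from the origin,
# red for class "0" and green for class "1"
colors = ["red" if t == 0 else "green" for t in targets]
origin = np.zeros(len(input_vectors))
plt.quiver(origin, origin, input_vectors[:, 0], input_vectors[:, 1],
           color=colors, angles="xy", scale_units="xy", scale=1)

# Draw the two learned weights as a single blue arrow (x, y)
w = neural_network.weights
plt.quiver(0, 0, w[0], w[1], color="blue",
           angles="xy", scale_units="xy", scale=1)

plt.xlim(-1, 6)
plt.ylim(-1, 5)
plt.savefig("input_vectors.png")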
Humans are great at pattern recognition. Just looking at the plot, you can see that the best (if overfit) predictor for this data would be a vector pointing at the average of the red arrows, combined with an activation function that thresholds on the radius around that average point, thereby defining the red cluster.
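To make that intuition concrete, here is a rough sketch of such a hand-crafted centroid-plus-radius rule, again assuming the input_vectors and targets arrays from above. The radius chosen here (the largest distance from the centroid to a red point) is just one illustrative choice, not a tuned threshold:

# Centroid of the "red" class (target 0)
red_points = input_vectors[targets == 0]
center = red_points.mean(axis=0)

# One possible radius: just far enough to cover every red point
radius = np.linalg.norm(red_points - center, axis=1).max()

def centroid_predict(point):
    # Inside the red cluster -> class 0, otherwise -> class 1
    return 0 if np.linalg.norm(point - center) <= radius else 1

print(centroid_predict(np.array([3, 1.5])))  # a point from the red cluster -> 0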
Of course, the advantage of a neural network is that it is capable of classifying (and performing other operations on) much more complex data, where plotting the inputs would perhaps be impossible. Nevertheless, for an introductory tutorial, I believe that plotting inputs and outputs, whenever possible, is a nice way of developing intuition about mathematical concepts.
In a future article, we’ll go into more detail about various types of ANNs and implement some further concepts.