
Andrew Ng: Six core algorithms of machine learning



This article is reproduced from Lei Feng.com. If you need to reprint, please go to the official website of Lei Feng.com to apply for authorization.

Recently, Andrew Ng published a post on The Batch, the weekly artificial intelligence newsletter he founded, summarizing the historical origins of several fundamental algorithms in machine learning.

At the start of the article, Ng recalled a decision he made early in his research career: on one project, when choosing an algorithm, he had to pick between a neural network and a decision tree learning algorithm. Considering the computational budget, he chose the neural network and set boosted decision trees aside for a long time. It was the wrong decision. "Fortunately, my team quickly revised my choice, and the project succeeded," Ng said.

He noted that continuously learning and refreshing fundamental knowledge is essential. Like other technical fields, machine learning keeps evolving as the number of researchers grows and research results accumulate. Yet the contributions of some basic algorithms and core ideas stand the test of time:

  • Algorithms: linear and logistic regression, decision trees, etc.
  • Concepts: regularization, loss-function optimization, bias/variance, etc.

In Ng's view, these algorithms and concepts are the core ideas behind many machine learning models, from house-price predictors to text-to-image generators such as DALL·E. In this latest article, Ng and his team surveyed the origins, uses, and evolution of six basic algorithms and offered a more detailed explanation of each. The six are: linear regression, logistic regression, gradient descent, neural networks, decision trees, and the k-means clustering algorithm.

1 Linear Regression: Straight & Narrow

Linear regression is a key statistical method in machine learning, but it did not claim that place without a fight. Two outstanding mathematicians both claimed credit for it, and 200 years later the question of who got there first remains unresolved. The long-standing dispute demonstrates not only the algorithm's remarkable utility but also its essential simplicity.

So whose algorithm is linear regression? In 1805, French mathematician Adrien-Marie Legendre published a method for fitting a line to a set of points while trying to predict the positions of comets (celestial navigation being the science most valuable to global commerce at the time, much as artificial intelligence is today).


Caption: Sketch portrait of Adrien-Marie Legendre

Four years later, the 24-year-old German prodigy Carl Friedrich Gauss insisted that he had been using the method since 1795 but had considered it too trivial to write about. Gauss's claim prompted Legendre to publish an anonymous piece noting that "a very famous geometer adopted this method without hesitation."


Illustration: Carl Friedrich Gauss

Slope and bias: Linear regression works when the relationship between an outcome and the variables that influence it follows a straight line. For example, a car's fuel consumption is linearly related to its weight.

  • The relationship between a car's fuel consumption y and its weight x depends on the slope w of the line (how steeply fuel consumption rises with weight) and the offset term b (fuel consumption at zero weight): y = w*x + b.
  • During training, the algorithm predicts the expected fuel consumption given a car's weight, compares the expected and actual consumption, and then minimizes the sum of the squared differences, typically via ordinary least squares, honing the values of w and b (a minimal numerical sketch follows this list).
  • Taking the car's drag into account as well produces more accurate predictions. Additional variables extend the line to a plane; in this way, linear regression can accommodate any number of variables/dimensions.
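
To make the slope-and-offset picture concrete, here is a minimal sketch in Python (using NumPy; the weight and fuel-consumption numbers are invented purely for illustration) that fits w and b by ordinary least squares:

```python
import numpy as np

# Hypothetical data: car weight (tons) and fuel consumption (liters per 100 km).
x = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.4])
y = np.array([5.1, 5.9, 7.2, 8.4, 9.0, 10.8])

# Ordinary least squares for y = w*x + b via a least-squares solve.
X = np.column_stack([x, np.ones_like(x)])    # add a column of ones for the offset b
w, b = np.linalg.lstsq(X, y, rcond=None)[0]  # minimizes the sum of squared differences

print(f"slope w = {w:.2f}, offset b = {b:.2f}")
print(f"predicted consumption at 1.6 tons: {w * 1.6 + b:.2f}")
```

Adding more columns to X (drag, engine size, and so on) extends the same solve to any number of variables.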

Two steps to popularization: The algorithm immediately helped navigators track the stars, and later biologists (notably Francis Galton, Charles Darwin's cousin) used it to identify heritable traits in plants and animals. Those two developments unlocked linear regression's broad potential. In 1922, British statisticians Ronald Fisher and Karl Pearson showed how linear regression fit into the general statistical framework of correlations and distributions, making it useful throughout the sciences. And, nearly a century later, the advent of computers provided the data and processing power to exploit it far more widely.

Coping with ambiguity: Of course, data are never measured perfectly, and some variables matter more than others. These facts of life inspire more sophisticated variants. For example, linear regression with regularization (also known as ridge regression) encourages a model not to depend too much on any one variable, or rather to rely evenly on the most useful variables. Another form of regularization (L1 instead of L2) produces the lasso, which encourages as many coefficients as possible to be exactly zero; in other words, it learns to select the variables with high predictive power and ignore the rest. Elastic net combines both types of regularization and is useful when the data are sparse or features appear to be correlated.
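
As a rough illustration of how the penalties behave, the sketch below (assuming scikit-learn is installed; the toy data is synthetic) fits ridge, lasso, and elastic net models to the same inputs and prints their coefficients:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                         # five candidate features
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only two of them matter

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # L2 (ridge) shrinks coefficients evenly; L1 (lasso) drives unhelpful ones to exactly zero.
    print(type(model).__name__, np.round(model.coef_, 2))
```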

In every neuron: The simple version is still enormously useful. The most common type of neuron in a neural network is a linear regression model followed by a nonlinear activation function, making linear regression a fundamental component of deep learning.

2 Logistic Regression: Following the Curve

There was a time when logistic regression was used to classify just one thing: if you drank a bottle of poison, should you be labeled "alive" or "dead"? Times have changed. Today, not only does calling emergency services give a better answer to that question, but logistic regression is also at the core of deep learning.

Poison control: The logistic function dates back to the 1830s, when Belgian statistician P.F. Verhulst invented it to describe population dynamics: the initial explosion of exponential growth flattens out as the population consumes the available resources, producing the characteristic logistic curve. More than a century later, American statistician E. B. Wilson and his student Jane Worcester devised logistic regression to calculate how much of a given hazardous substance would be lethal.


Caption: P.F. Verhulst

Fitting the function: Logistic regression fits a logistic function to a data set in order to predict the probability of a particular outcome (say, premature death) for a given event (say, ingesting strychnine).

  • Training adjusts the horizontal position of the curve's center and the vertical position of its middle to minimize the error between the function's output and the data.
  • Shifting the center to the right or left means it takes more or less poison to kill the average person. A steep slope implies certainty: before the halfway point most people survive, and beyond it, it's time to say goodbye. A gentle slope is more forgiving: below the middle of the curve more than half survive; above it, fewer than half do.
  • Set a threshold, say 0.5, between one outcome and the other, and the curve becomes a classifier: enter a dose into the model and you know whether to plan a party or a funeral (a minimal sketch follows this list).
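
A minimal sketch of that thresholding step, assuming scikit-learn is available (the dose/outcome numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical doses (mg) and outcomes: 0 = survived, 1 = died.
dose = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
died = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(dose, died)

new_dose = np.array([[4.2]])
p = model.predict_proba(new_dose)[0, 1]   # position on the fitted logistic curve
label = int(p >= 0.5)                     # a 0.5 threshold turns the curve into a classifier
print(f"P(death) = {p:.2f} -> class {label}")
```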

More outcomes: Verhulst's work found the probabilities of binary outcomes, ignoring further possibilities such as which side of the afterlife a poisoning victim might end up on. His successors extended the algorithm:

  • In the late 1960s, British statistician David Cox and Dutch statistician Henri Theil, working independently, extended logistic regression to situations with more than two possible outcomes.
  • Further work produced ordered logistic regression, in which the outcomes are ordered values.
  • To handle sparse or high-dimensional data, logistic regression can leverage the same regularization techniques as linear regression.


Illustration: David Cox

A versatile curve: The logistic function describes a wide range of phenomena with considerable accuracy, so logistic regression provides a useful baseline prediction in many situations. In medicine, it estimates mortality and disease risk. In political science, it predicts election winners and losers. In economics, it forecasts business prospects. More importantly, it drives a subset of neurons in a wide variety of neural networks, where the nonlinearity is the sigmoid function.

3 Gradient Descent: Everything Goes Downhill

Imagine hiking in the mountains after dusk and realizing you can't see anything below you. Your phone battery is dead, so you can't use a GPS app to find your way home. You might find the quickest path down via gradient descent. Just be careful not to walk off a cliff.

Sun and carpet: Gradient descent is good for more than descending steep terrain. In 1847, French mathematician Augustin-Louis Cauchy invented the algorithm to approximate stellar orbits. Sixty years later, his compatriot Jacques Hadamard independently developed it to describe the deformation of thin, flexible objects such as carpets, which might make a knees-first descent easier. In machine learning, however, its most common use is to find the lowest point of a learning algorithm's loss function.


Caption: Augustin-Louis Cauchy

Climbing down: A trained neural network provides a function that computes a desired output given an input. One way to train the network is to minimize the loss, or error, in its output by iteratively computing the difference between the actual and desired outputs and then changing the network's parameter values to reduce that difference.

Gradient descent shrinks that difference by minimizing the function that computes the loss. The network's parameter values are equivalent to a position on the terrain, and the loss is the current altitude. As you descend, you improve the network's ability to compute something close to the desired output. Visibility is limited because, in a typical supervised learning setting, the algorithm relies only on the network's parameter values and the gradient, or slope, of the loss function - that is, where you are on the hill and the slope just below you.

  • The basic method is to move in the direction where the terrain drops most steeply. The trick is to calibrate your stride. Too small a step and it takes ages to make progress; too large a step and you leap into the unknown, possibly heading uphill instead of down.
  • Given the current position, the algorithm estimates the direction of steepest descent by computing the gradient of the loss function. The gradient points uphill, so the algorithm steps in the opposite direction by subtracting a small fraction of the gradient. The fraction α, called the learning rate, determines the step size before the gradient is measured again (see the sketch after this list).
  • Repeat these steps and hope you can reach a valley. Congratulations!
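
The loop described above fits in a few lines. This sketch uses plain NumPy and a made-up convex "terrain" (a quadratic bowl), so it illustrates the mechanics rather than any particular production optimizer:

```python
import numpy as np

def loss(params):
    # A toy convex terrain whose lowest point is at (3, -2).
    return (params[0] - 3) ** 2 + (params[1] + 2) ** 2

def gradient(params):
    # Analytic gradient of the toy loss; it points uphill.
    return np.array([2 * (params[0] - 3), 2 * (params[1] + 2)])

params = np.array([10.0, 10.0])   # starting position on the terrain
alpha = 0.1                       # learning rate: the stride length

for step in range(100):
    params -= alpha * gradient(params)   # step a small fraction opposite the gradient

print("final position:", params.round(3), "loss:", round(loss(params), 6))
```

With alpha too small the loop crawls; too large and it can overshoot the valley, which is exactly the stride-length trade-off described above.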

Stuck in a valley: Too bad your phone is out of battery, because the algorithm may not have brought you to the bottom of the mountain. You can get stuck in a non-convex landscape of multiple valleys (local minima), peaks (local maxima), saddle points, and plateaus. In fact, tasks such as image recognition, text generation, and speech recognition are non-convex, and many variations of gradient descent have emerged to handle the situation.

For example, the algorithm may have momentum, which helps it ride over small rises and dips, making it more likely to reach the bottom. Researchers have devised so many variants that it can seem as if there are as many optimizers as there are local minima. Fortunately, local minima and the global minimum tend to be roughly equivalent.

The optimal optimizer: Gradient descent is the clear choice for finding the minimum of any function. In cases where an exact solution can be computed directly - for example, a linear regression task with a huge number of variables - it can approximate one, often faster and more cheaply. But it really comes into its own in complex, nonlinear tasks. Armed with gradient descent and a sense of adventure, you might just make it out of the mountains in time for dinner.

4 Neural Networks: Finding Functions

Let's get this out of the way first: the brain is not a bank of graphics processing units, and if it were, the software it runs would be far more complex than a typical artificial neural network. Neural networks are, however, inspired by the structure of the brain: layers of interconnected neurons, each computing its own output based on the states of its neighbors. The resulting chain of activity forms an idea, or the recognition of a picture of a cat.

From biological to artificial: The idea that the brain learns through interactions among neurons dates back to 1873, but it was not until 1943 that American neuroscientist Warren McCulloch and Walter Pitts modeled biological neural networks using simple mathematical rules. In 1958, American psychologist Frank Rosenblatt developed the perceptron, a single-layer visual network implemented on punch cards, with the goal of building a hardware version for the U.S. Navy.


Caption: Frank Rosenblatt

The bigger the better: Rosenblatt's invention could only recognize classes that a single line could separate. Later, Ukrainian mathematicians Alexey Ivakhnenko and Valentin Lapa overcame this limitation by stacking networks of neurons in any number of layers.

In 1985, working independently, French computer scientist Yann LeCun, David Parker, and American psychologist David Rumelhart and his colleagues described the use of backpropagation to train such networks effectively.

In the first decade of the new millennium, researchers including Kumar Chellapilla, Dave Steinkraus, and Rajat Raina (in collaboration with Andrew Ng) pushed neural networks further by running them on graphics processing units, enabling ever-larger networks to learn from the massive amounts of data generated by the internet.

Suitable for every task: The principle behind neural networks is simple: for any task, there is a function that performs it. A neural network forms a trainable function by combining multiple simple functions, each performed by a single neuron. The function of a neuron is determined by adjustable parameters called "weights".

Given random initial values for these weights, along with example inputs and their desired outputs, you can iteratively adjust the weights until the trainable function performs the task at hand.

  • A neuron takes various inputs (e.g., numbers representing pixels or words, or the outputs of the previous layer), multiplies them by its weights, adds the products, and feeds the sum through a nonlinear function, or activation function, chosen by the developer. Think of it as linear regression plus an activation function (a minimal sketch follows this list).
  • Training changes the weights. For each example input, the network computes an output and compares it with the expected output. Backpropagation uses gradient descent to change the weights so as to reduce the difference between the actual and expected outputs. Repeat this process enough times with enough (good) examples, and the network learns to perform the task.
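
A single artificial neuron, as described in the first bullet above, is just linear regression followed by an activation function. Here is a minimal sketch (NumPy, with arbitrary example numbers; training these weights would be done by backpropagation and gradient descent, which is not shown):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs, e.g. pixel values or the outputs of a previous layer.
x = np.array([0.5, -1.2, 3.0])

# Adjustable parameters (weights and a bias), set arbitrarily here.
w = np.array([0.8, 0.1, -0.4])
b = 0.2

z = np.dot(w, x) + b   # the linear-regression part: weighted sum plus offset
a = sigmoid(z)         # the nonlinear activation chosen by the developer
print("neuron output:", round(a, 4))
```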

Black box: With any luck, a well-trained network does its job, but what you end up with is a function that is often so complex (containing thousands of variables and nested activation functions) that it is difficult to explain how the network succeeds at its task. Furthermore, a well-trained network is only as good as the data it learned from.

For example, if the data set is biased, the network's output will be biased too. If it contains only high-resolution images of cats, there is no telling how it will respond to low-resolution ones.

A bit of common sense: The New York Times blazed the trail for artificial intelligence hype when it reported on Rosenblatt's perceptron in 1958, describing it as the embryo of an electronic computer that the U.S. Navy expected would be able to walk, talk, see, write, replicate itself, and be aware of its own existence.

Although the perceptron of the day fell short of that ambition, it gave rise to many impressive models: convolutional neural networks for images, recurrent neural networks for text, and transformers for speech, video, protein structures, and more.

They already do amazing things, such as surpassing human-level performance at the game of Go and approaching it in practical tasks such as diagnosing X-ray images. However, they still struggle with common sense and logical reasoning.

5 Decision Tree: From Root to Leaf

What kind of "beast" was Aristotle? Porphyry, a follower of the philosopher who lived in Syria during the third century, came up with a logical way to answer the question.

He arranged the "categories of being" that Aristotle had proposed from general to specific and placed Aristotle in each category in turn: Aristotle's existence was material rather than conceptual or spiritual; his body was animate rather than inanimate; his mind was rational rather than irrational.

His classification, therefore, was human. Medieval logic teachers drew this sequence as a vertical flowchart: an early decision tree.

A difference in numbers: Fast-forward to 1963, when University of Michigan sociologist John Sonquist and economist James Morgan, grouping survey respondents, implemented decision trees in a computer for the first time. Such work became common with the advent of software that trains algorithms automatically, and decision trees are now used by various machine learning libraries, including scikit-learn. The code was developed over a ten-year period by four statisticians at Stanford University and the University of California, Berkeley. Today, writing a decision tree from scratch is a Machine Learning 101 homework assignment.

Roots in the air: A decision tree can perform classification or regression. It grows downward, from root to canopy, sorting the input examples into two (or more) groups through a hierarchy of decisions. Consider the subject of German physician and anthropologist Johann Blumenbach, who around 1776 first distinguished monkeys from apes (setting humans aside); before that, monkeys and apes had been lumped together.

The classification depends on various criteria, such as the presence of a tail, a narrow or broad chest, upright or crouched posture, and degree of intelligence. A trained decision tree labels such animals by considering each criterion in turn, ultimately separating the two groups.

  • The tree starts from a root node that can be thought of as containing the whole biological database: chimpanzees, gorillas, and orangutans, as well as capuchin monkeys, baboons, and marmosets. The root offers a choice between two child nodes based on whether the examples exhibit a particular characteristic, so one child node contains the examples with that characteristic and the other contains those without it. Continuing in this way, the process ends with any number of leaf nodes, each containing mostly or entirely examples from a single category.
  • To grow, the tree must find the root decision. To make the choice, it considers all the features and their values - hind appendages, barrel chest, and so on - and chooses the one that maximizes the purity of the split. "Optimal purity" is defined as 100 percent of the instances of one category going to a particular child node and none going to another. Splits are rarely 100 percent pure after just one decision, and they may never be. As the process continues, level after level of child nodes is produced, until purity no longer increases much by considering further features. At that point, the tree is fully trained.
  • At inference time, a new example passes through the decision tree from top to bottom, evaluating a different decision at each level. It receives the label of the leaf node where it lands (a minimal sketch follows this list).
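
As a concrete sketch of training and inference, the snippet below uses scikit-learn's DecisionTreeClassifier on an invented two-feature dataset (the has_tail/broad_chest features and ape-vs-monkey labels are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features [has_tail, broad_chest]; labels: 0 = ape, 1 = monkey.
X = [[0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [1, 0]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)   # splits chosen to maximize purity
print(export_text(tree, feature_names=["has_tail", "broad_chest"]))

# Inference: a new example flows from the root to a leaf and takes that leaf's label.
print("prediction for [has_tail=1, broad_chest=0]:", tree.predict([[1, 0]])[0])
```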

Entering the top 10: Given Blumenbach's conclusion (later overturned by Charles Darwin) that humans are distinguished from apes by broad pelvises, hands, and clenched teeth, what if we wanted to extend the decision tree to classify not just apes and monkeys but humans too? Australian computer scientist John Ross Quinlan made this possible in 1986 with ID3, which extended decision trees to support non-binary outcomes. In 2008, an extended refinement named C4.5 ranked at the top of a list of the top 10 data mining algorithms curated by the IEEE International Conference on Data Mining.

In a world where innovation runs rampant, that is staying power.

Peeling back the leaves: Decision trees do have some drawbacks. They can easily overfit the data by growing so many levels that a leaf node ends up containing a single example. Worse, they are prone to the butterfly effect: change one example, and the tree that grows can look completely different.

Into the forest: American statistician Leo Breiman and New Zealand statistician Adele Cutler turned this trait into an advantage with the random forest, developed in 2001: a collection of decision trees, each of which processes a different, overlapping selection of examples and votes on the final result. Random forest and its cousin XGBoost are less prone to overfitting, which helps make them among the most popular machine learning algorithms. It is like having Aristotle, Porphyry, Blumenbach, Darwin, Jane Goodall, Dian Fossey, and 1,000 other zoologists in the room together, all making sure your classification is the best it can be.
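
A minimal sketch of the ensemble idea, assuming scikit-learn and a synthetic stand-in dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for real measurements.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each of the 100 trees sees a different bootstrap sample of the examples,
# and the forest votes on the final label.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", forest.score(X, y))
```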

6 K-Means Clustering: Groupthink

If you stand close to others at a party, chances are you have something in common. That is the idea behind using k-means clustering to group data points. Whether the groups formed through human agency or some other force, this algorithm will find them.

From explosions to dial tones: American physicist Stuart Lloyd, an alumnus of Bell Labs' iconic innovation factory and of the Manhattan Project that invented the atomic bomb, first proposed k-means clustering in 1957 to distribute information within digital signals, though the work was not published until 1982:


Paper address: https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf

Meanwhile, American statistician Edward Forgy described a similar method in 1965, leading to its alternative name, the "Lloyd-Forgy algorithm."

Finding centers: Consider dividing a gathering into like-minded working groups. Given the positions of the participants in the room and the number of groups to form, k-means clustering can divide the participants into roughly equal-sized groups, each gathered around a central point, or centroid.

  • During training, the algorithm initially designates k centroids by randomly selecting k people. (K must be chosen by hand, and finding an optimal value is sometimes very important.) It then grows k clusters by associating each person with the nearest centroid (see the sketch after this list).
  • For each cluster, it computes the average position of all the people assigned to that group and designates that average position as the new centroid. A new centroid may not be occupied by an actual person, but so what? People tend to gather around chocolate and fondue.
  • After computing the new centroids, the algorithm reassigns each individual to the centroid nearest them. It then computes new centroids, adjusts the clusters, and so on until the centroids (and the groups around them) stop moving. After that, assigning new members to the right cluster is easy: place them in the room and look for the nearest centroid.
  • Fair warning: given the initial random assignment of centroids, you may not end up in the same group as the lovable data-centric AI experts you were hoping to hang out with. The algorithm does a good job, but it is not guaranteed to find the best solution.
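
A from-scratch sketch of the loop described above, in plain NumPy (the partygoer coordinates are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(30, 2))   # positions of 30 partygoers in a room
k = 3

# Step 1: pick k people at random as the initial centroids.
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(100):
    # Step 2: assign each person to the nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 3: move each centroid to the average position of its group
    # (keeping the old centroid if a group happens to end up empty).
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

    # Stop once the centroids (and the groups around them) no longer move.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:\n", np.round(centroids, 2))
```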

Different distances: Of course, the distance between clustered objects does not have to be spatial; any metric between two vectors will do. For example, instead of grouping partygoers by physical distance, k-means clustering can divide them by their clothing, occupation, or other attributes. Online stores use it to segment customers by preference or behavior, and astronomers can group stars of the same type.

The power of data points: The idea has produced some notable variations:

  • K-medoids uses actual data points as centroids rather than the mean position within a given cluster. The medoid is the point that minimizes the total distance to all other points in its cluster. This variation is easier to interpret because a centroid is always one of the data points.
  • Fuzzy c-means clustering allows data points to participate in multiple clusters to varying degrees. It replaces hard cluster assignments with degrees of membership based on distance from each centroid (a small sketch follows this list).
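
As a rough illustration of those soft assignments, here is a small sketch (NumPy, fuzziness parameter m = 2, with invented points and fixed centroids; the full algorithm would also update the centroids iteratively):

```python
import numpy as np

# Invented 2-D data points and two fixed centroids.
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.5, 0.5], [5.0, 5.0]])
m = 2.0  # fuzziness parameter; larger m gives softer memberships

# Distance of every point to every centroid, shape (n_points, n_clusters).
d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
d = np.maximum(d, 1e-12)  # avoid dividing by zero for a point sitting on a centroid

# Membership u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1)); each row sums to 1.
ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
u = 1.0 / ratios.sum(axis=2)

print(np.round(u, 3))  # each row: the point's degrees of membership in the two clusters
```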

An n-dimensional carnival: Nonetheless, the algorithm in its original form remains widely useful, especially because, as an unsupervised algorithm, it does not require gathering expensive labeled data. It also keeps getting faster to use. For example, machine learning libraries including scikit-learn benefited from the addition of kd-trees in 2002, which partition high-dimensional data very quickly.
