French mathematician Adrien-Marie Legendre was obsessed with predicting the future positions of comets. Given a comet's previous positions, he was determined to devise a method for calculating its trajectory.
After trying several approaches, he finally made progress.
Legendre began by guessing a comet's future position, then analyzed the recorded data and refined his guess so as to reduce the sum of squared errors.
This is the seed of linear regression.
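To make the idea concrete, here is a minimal sketch of fitting a line by minimizing the sum of squared errors. The observations and the use of NumPy's least-squares solver are illustrative assumptions, not Legendre's actual data or method.

```python
import numpy as np

# Hypothetical observations: positions of an object recorded at successive times.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Build a design matrix with an intercept column; least squares picks the
# intercept and slope that minimize the sum of squared prediction errors.
X = np.column_stack([np.ones_like(t), t])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * t")
```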
Two steps to popularization: The algorithm immediately helped navigators track the stars, and later helped biologists (notably Charles Darwin's cousin Francis Galton) identify heritable traits in plants and animals. Those two developments unlocked linear regression's broad potential. In 1922, British statisticians Ronald Fisher and Karl Pearson showed how linear regression fit into the general statistical framework of correlations and distributions, making it useful throughout the sciences. And, nearly a century later, the advent of computers provided the data and processing power to exploit it to a far greater extent.
Dealing with ambiguity: Of course, data is never measured perfectly, and some variables are more important than others. These facts of life have inspired more sophisticated variants. For example, linear regression with L2 regularization (also known as ridge regression) encourages a model not to depend too heavily on any one variable, or rather to rely evenly on the most important variables. Another form of regularization (L1 instead of L2) produces the lasso, which encourages as many coefficients as possible to be exactly zero. In other words, it learns to select the variables with high predictive power and ignore the rest. The elastic net combines both types of regularization; it is useful when the data is sparse or when features appear to be correlated.
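As a rough sketch of how these variants behave, the snippet below fits ridge, lasso, and elastic net models from scikit-learn to synthetic data in which only two of five features matter; the data and the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features carry signal; the other three are noise.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = [
    Ridge(alpha=1.0),                      # L2: shrinks all coefficients evenly
    Lasso(alpha=0.1),                      # L1: pushes unhelpful coefficients to zero
    ElasticNet(alpha=0.1, l1_ratio=0.5),   # a blend of both penalties
]
for model in models:
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```

With data like this, the lasso typically zeroes out the noise features outright, while ridge merely shrinks them.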
In each neuron: Even so, the simple version is still enormously useful. The most common type of neuron in a neural network is a linear regression model followed by a nonlinear activation function, which makes linear regression a fundamental building block of deep learning.
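A minimal sketch of that idea, with made-up weights and inputs: a single neuron computes a linear-regression-style weighted sum and then applies a nonlinearity (ReLU here, as one common choice).

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b        # the linear regression part: weighted sum plus bias
    return np.maximum(0.0, z)   # the nonlinear activation (ReLU)

x = np.array([0.5, -1.2, 3.0])  # outputs of neighboring neurons (arbitrary)
w = np.array([0.8, 0.1, -0.4])  # learned weights (arbitrary here)
print(neuron(x, w, b=0.2))
```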
There was a time when logistic regression was used to classify just one thing: if you drank a bottle of poison, would the label you end up with be "alive" or "dead"?
Today, not only does a call to an emergency center provide a better answer to that question, but logistic regression is also at the very heart of deep learning.
This function dates back to the 1830s, when Belgian statistician P.F. Verhulst invented it to describe population dynamics: over time, an initial explosion of exponential growth levels off as available resources are consumed, producing the characteristic logistic curve.
More than a century later, American statistician E. B. Wilson and his student Jane Worcester devised logistic regression to calculate how much of a given hazardous substance was lethal.
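To show the shape of the idea, here is a hedged sketch of the logistic (sigmoid) function and a dose-response model built on it; the coefficients and doses are invented for illustration, not Wilson and Worcester's figures.

```python
import numpy as np

def sigmoid(z):
    # The logistic curve: rises steeply at first, then levels off between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: p(lethal) = sigmoid(b0 + b1 * dose)
b0, b1 = -4.0, 0.8
doses = np.array([1.0, 3.0, 5.0, 7.0])
print(np.round(sigmoid(b0 + b1 * doses), 3))  # estimated probability of death per dose
```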
Imagine hiking in the mountains after dusk and realizing that you can’t see anything except your feet.
Your phone is out of battery, so you can't use GPS to find your way home.
Perhaps gradient descent can help you find the fastest path down - just be careful not to walk off a cliff.
In 1847, French mathematician Augustin-Louis Cauchy invented an algorithm for approximating stellar orbits.
Sixty years later, his compatriot Jacques Hadamard independently developed it to describe the deformation of thin, flexible objects such as carpets - the kind of thing that might make a downhill hike easier on the knees.
However, in machine learning, its most common use is to find the lowest point of the loss function of the learning algorithm.
It's a shame your phone is out of battery, because the algorithm may not have deposited you at the bottom of the mountain.
You can get stuck in a non-convex landscape made up of multiple valleys (local minima), peaks (local maxima), saddle points, and plateaus.
In fact, tasks such as image recognition, text generation, and speech recognition are all non-convex, and many variations of gradient descent have emerged to handle this situation.
For example, the algorithm may gain momentum that helps it roll over small rises and dips, making it more likely to reach the bottom.
The researchers devised so many variations that it appears there are as many optimizers as there are local minima.
Fortunately, local minima and global minima tend to be approximately equal.
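Here is a minimal sketch of gradient descent with a momentum term on a simple non-convex, one-dimensional function; the function, starting point, and hyperparameters are arbitrary choices, and with different settings the search might settle in a different valley.

```python
def grad(x):
    # Gradient of a simple non-convex function f(x) = x**4 - 3*x**2 + x,
    # which has two valleys of different depths.
    return 4 * x**3 - 6 * x + 1

x, velocity = 2.0, 0.0
learning_rate, momentum = 0.01, 0.9

for _ in range(300):
    velocity = momentum * velocity - learning_rate * grad(x)  # accumulate momentum
    x += velocity                                             # step downhill
print(f"stopped near x = {x:.3f}")
```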
Gradient descent is the clear choice for finding the minimum of a function. Even in cases where an exact solution can be calculated directly - such as a linear regression task with a large number of variables - it can approximate one, often faster and more cheaply.
And it truly comes into its own in complex, nonlinear tasks.
With gradient descent and a sense of adventure, you might be able to make it out of the mountains in time for dinner.
Let's get one thing clear first: the brain is not a bank of graphics processing units, and if it were, the software it runs would be far more complex than a typical artificial neural network.
However, neural networks are inspired by the structure of the brain: layers of interconnected neurons, each of which computes its own output based on the states of its neighbors. The resulting chain of activity leads to an idea - or the recognition of a picture of a cat.
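A tiny sketch of that chain of activity: each layer computes its neurons' outputs from the previous layer's outputs. The weights here are random, so the output is meaningless; the point is only the structure.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(inputs, weights, biases):
    # Every neuron takes a weighted sum of its neighbors' outputs,
    # then applies a nonlinearity.
    return np.tanh(weights @ inputs + biases)

x = rng.normal(size=4)                               # e.g. four pixel intensities
h = layer(x, rng.normal(size=(8, 4)), np.zeros(8))   # hidden layer of 8 neurons
y = layer(h, rng.normal(size=(2, 8)), np.zeros(2))   # output layer, e.g. "cat" vs "not cat"
print(y)
```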
From biological to artificial: The idea that the brain learns through interactions between neurons dates back to 1873, but it was not until 1943 that American neuroscientists Warren McCulloch and Walter Pitts used simple mathematical rules to model biological neural networks.
In 1958, American psychologist Frank Rosenblatt developed the perceptron, a single-layer visual network implemented on punch-card machines, with the goal of building a hardware version for the U.S. Navy.
Rosenblatt's invention could recognize only classes that can be separated by a single line.
Later, Ukrainian mathematicians Alexey Ivakhnenko and Valentin Lapa overcame this limitation by stacking networks of neurons in any number of layers.
In 1985, working independently, French computer scientist Yann LeCun, David Parker, and American psychologist David Rumelhart and his colleagues described the use of backpropagation to train such networks effectively.
In the first decade of the new millennium, researchers including Kumar Chellapilla, Dave Steinkraus, and Rajat Raina (in collaboration with Andrew Ng) pushed neural networks further by using graphics processing units, which enabled ever-larger neural networks to learn from the massive amounts of data generated by the Internet.
The New York Times blazed a trail for artificial intelligence hype when it reported on Rosenblatt's perceptron in 1958, noting that "the U.S. Navy wants a machine that can walk, talk, see, write, reproduce itself, and be aware of its own existence."
Although the perceptron of the day did not live up to that billing, the approach went on to produce many impressive models: convolutional neural networks for images; recurrent neural networks for text; and transformers for images, text, speech, video, protein structures, and more.
They are already doing amazing things, like surpassing human-level performance at the game of Go and approaching human-level performance in practical tasks like diagnosing X-ray images.
However, they still struggle with common sense and logical reasoning.
What kind of "beast" is Aristotle? Porphyry, a follower of the philosopher who lived in Syria during the third century, came up with a logical way to answer this question.
He worked through the "categories of existence" proposed by Aristotle, from general to specific, and assigned Aristotle to each category in turn:
Aristotle's existence was material rather than conceptual or spiritual; his body was animate rather than inanimate; his mind was rational rather than irrational.
His classification, therefore, was human.
Medieval logic teachers drew this sequence as a vertical flow diagram: an early decision tree.
Fast forward to 1963, when University of Michigan sociologist John Sonquist and economist James Morgan first implemented decision trees in computers when grouping survey respondents.
With the emergence of software that trains the algorithm automatically, this kind of work became commonplace, and today decision trees appear in machine learning libraries including scikit-learn.
The code was developed by four statisticians at Stanford University and the University of California, Berkeley, over a period of 10 years. Today, writing decision trees from scratch has become a Machine Learning 101 homework assignment.
Decision trees can perform classification or regression. A tree grows downward, from the root to the crown, sorting input examples at each level of a decision hierarchy into two (or more) groups.
Consider an example from German medical scientist and anthropologist Johann Blumenbach: around 1776, he became the first to distinguish monkeys from apes (setting humans aside); before that, monkeys and apes had been lumped together.
The classification depends on various criteria, such as whether the animal has a tail, whether its chest is narrow or broad, whether it stands upright or crouches, and how intelligent it is. A trained decision tree can label such animals by considering each criterion in turn, ultimately separating the two groups.
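As a hedged sketch of that process, the snippet below trains scikit-learn's decision tree on a handful of made-up animals described by 0/1 features loosely inspired by those criteria; the data and the feature encoding are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented examples: [has_tail, broad_chest, upright_posture] as 0/1 features.
X = [
    [1, 0, 0], [1, 0, 0], [1, 1, 0],   # monkeys
    [0, 1, 1], [0, 1, 0], [0, 1, 1],   # apes
]
y = ["monkey", "monkey", "monkey", "ape", "ape", "ape"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["has_tail", "broad_chest", "upright"]))
print(tree.predict([[0, 1, 1]]))  # classify a new animal by walking down the tree
```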
Given Blumenbach's conclusion (later overturned by Charles Darwin) that humans are distinguished from apes by their wide pelvises, hands, and clenched teeth, what happens if we want to extend the decision tree to classify not just apes and monkeys but humans as well?
Australian computer scientist John Ross Quinlan made this possible in 1986 with ID3, which extended decision trees to support non-binary outcomes.
In 2008, an extended refinement named C4.5 topped the list of the top ten data mining algorithms curated by the IEEE International Conference on Data Mining.
Left to grow on their own, decision trees tend to overfit the data. American statistician Leo Breiman and New Zealand statistician Adele Cutler turned this weakness into an advantage, developing the random forest in 2001: a collection of decision trees, each of which processes a different, overlapping selection of the examples and votes on the final result.
Random forests and their cousin XGBoost are less prone to overfitting, which helps make them among the most popular machine learning algorithms.
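A quick, hedged sketch of the voting idea using scikit-learn's random forest on synthetic data; the dataset and parameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data. Each of the 100 trees sees a different bootstrap
# sample of the examples and random subsets of the features, then they vote.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.2f}")
```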
It's like having Aristotle, Porphyry, Blumenbach, Darwin, Jane Goodall, Dian Fossey and 1000 other zoologists in the room to make sure your classification is the best it can be.
If you are standing close to other people at a party, chances are you have something in common. That is the idea behind using k-means clustering to group data points.
Whether groups are formed through human agency or other forces, this algorithm will find them.
From explosion to dial tone: American physicist Stuart Lloyd, an alumnus of Bell Labs' iconic innovation factory and of the Manhattan Project that invented the atomic bomb, first proposed k-means clustering in 1957 to distribute information within digital signals, but the work was not published until 1982.
Meanwhile, American statistician Edward Forgy described a similar method in 1965, leading to its alternative name of "Lloyd-Forgy algorithm".
Finding hubs: Consider splitting the partygoers into like-minded working groups. Given the locations of the participants in the room and the number of groups to form, k-means clustering can divide participants into roughly equal-sized groups, each clustered around a central point, or centroid.
During training, the algorithm initially specifies k centroids by randomly selecting k people. (K must be chosen manually, and finding an optimal value is sometimes very important.) It then grows k clusters by associating each person with the nearest centroid.
For each cluster, it then calculates the average position of all the people assigned to that group and designates that average position as the new centroid. Each new centroid may not be occupied by an actual person, but so what? People tend to gather around the chocolate and the fondue.
After calculating the new centroids, the algorithm reassigns each individual to the nearest centroid. It then calculates new centroids again, adjusts the clusters, and so on, until the centroids (and the groups around them) stop moving. From there, assigning a new member to the right cluster is easy: have them take their place in the room and look for the nearest centroid.
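The loop just described fits in a few lines. Below is a minimal sketch on invented two-dimensional "positions in the room"; a real implementation (such as scikit-learn's) also handles details like empty clusters and repeated restarts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three loose groups of 2-D points standing in for partygoers' positions.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(size=(70, 2)) for c in centers])

k = 3
centroids = points[rng.choice(len(points), size=k, replace=False)]  # random initial centroids

for _ in range(100):
    # Assign each point to its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Recompute each centroid as the mean position of its assigned points.
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):  # centroids stopped moving
        break
    centroids = new_centroids

print(np.round(centroids, 2))
```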
Be forewarned: Given the initial random centroid assignment, you may not end up in the same group as the lovable data-centric AI experts you wish to hang out with. The algorithm does a good job, but it's not guaranteed to find the best solution.
Different distances: Of course, the distance between clustered objects does not have to be spatial. Any metric between two vectors will do. For example, instead of grouping partygoers by physical distance, k-means clustering can divide them by their clothing, occupation, or other attributes. Online stores use it to segment customers based on their preferences or behavior, and astronomers can group stars of the same type together.
The power of data points: This idea has produced some notable variants:
K-medoids uses actual data points as centroids, rather than the average position of the points in a given cluster. The medoid is the point that minimizes the total distance to all other points in its cluster. This variant is easier to interpret because the centroid is always a real data point.
Fuzzy C-means clustering lets data points participate in multiple clusters to varying degrees. It replaces hard cluster assignments with degrees of membership based on distance to each centroid.
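As a small sketch of that fuzzy idea, the function below computes degrees of membership from distances to fixed centroids using the standard fuzzy c-means membership formula with fuzzifier m; the points and centroids are invented, and a full implementation would also update the centroids iteratively.

```python
import numpy as np

def fuzzy_memberships(points, centroids, m=2.0):
    """Degree of membership of each point in each cluster (fuzzy c-means style)."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                            # avoid division by zero
    ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)                     # each row sums to 1

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
print(np.round(fuzzy_memberships(points, centroids), 2))
```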
n Dimensional Carnival: Even so, the algorithm in its original form remains widely useful - especially because, as an unsupervised algorithm, it does not require collecting expensive labeled data. It is also being made ever faster to use. For example, machine learning libraries including scikit-learn benefited from the addition of kd-trees in 2002, which can partition high-dimensional data extremely quickly.