Over the past decade, AI has produced one breakthrough after another in computer vision, speech recognition, machine translation, robotics, medicine, computational biology, protein structure prediction, and more, and deep learning lies behind all of them. So when and where did deep learning originate, and how did it become the most prominent approach in AI?
Recently, UC Berkeley professor and deep learning expert Pieter Abbeel interviewed Geoffrey Hinton on his podcast "Robot Brains".
Hinton, who won the Turing Award in 2018, is known as one of the "Three Giants of Deep Learning" and one of the most important scholars in the history of artificial intelligence. His papers have been cited more than 500,000 times, meaning that more than 500,000 research papers build on his work.
He has been working on deep learning for roughly half a century and was relatively unknown for most of that time. Things changed in 2012: that year he won the ImageNet competition with AlexNet, proving that deep learning had a decisive advantage over other computer vision methods for image recognition. This became known as the "ImageNet moment"; it changed the entire field of AI and set off the deep learning wave.
In this conversation, Hinton talks about his path from academia to Google Brain, his years studying psychology and working as a carpenter, and the history behind the t-SNE visualization algorithm, and offers his views on a number of issues, including the birth of AlexNet, non-immortal computers that acquire their knowledge cheaply through learning, the extent to which large language models understand language, and Boltzmann machines, knowledge distillation, and dimensionality reduction.
The interview below has been edited and organized by AI Technology Review without changing the original meaning:
Abbeel: What is a neural network? Why should we care about it?
Hinton: Our brain works like this:
It contains a great many neurons. Every once in a while a neuron goes "ping," and the reason it pings is that it has heard the pings of other neurons. Each time it hears a "ping" from another neuron, it adds a weighted amount to a little store of input it keeps, and when that stored input reaches a certain threshold, it pings too.
So if you want to know how the brain works, you only need to know how neurons decide to adjust those weights. There is some procedure for adjusting the weights, and if we can figure it out, we will understand how the brain works. I think we'll crack that procedure within the next five years.
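To make this picture concrete, here is a minimal sketch in Python of the kind of unit Hinton describes: it accumulates weighted "pings" from other neurons and emits its own ping once the accumulated input crosses a threshold. The weights and threshold are invented numbers, purely for illustration.

```python
# A toy "ping" neuron: weighted inputs accumulate, and the unit fires
# once the weighted sum crosses a threshold. All numbers are made up.
weights = {"neuron_a": 0.9, "neuron_b": 0.4, "neuron_c": 0.7}
threshold = 1.0

def receives_pings(pings):
    """pings: the set of neuron names that fired recently."""
    total = sum(weights[name] for name in pings if name in weights)
    return total >= threshold   # the unit "pings" if the weighted input is large enough

print(receives_pings({"neuron_a"}))               # False: 0.9 < 1.0
print(receives_pings({"neuron_a", "neuron_b"}))   # True:  1.3 >= 1.0
```

Learning, on this picture, is simply the procedure that changes the numbers in `weights`.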
I believe all existing artificial intelligence is built on something that, at a high level, is completely different from what the brain does. In one respect it must be the same: when you have many parameters, say billions of weights between neurons, and you adjust those parameters based on a large number of training examples, wonderful things happen. That is true of the brain, and it is true of deep learning. The question is how you get the gradient of the parameters you want to adjust: you have to decide what measure you want to improve, and then adjust the parameters so that they improve it.
But my current belief is that backpropagation, the way deep learning currently works, is completely different from what the brain does; the brain gets its gradients in a different way.
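As a concrete reminder of what "getting the gradient of the parameters" means in current deep learning, here is a minimal backpropagation sketch for a tiny two-layer network in plain numpy. The architecture and data are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                  # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy targets

W1 = rng.normal(scale=0.5, size=(3, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
lr = 0.1

for step in range(200):
    # forward pass
    h = np.tanh(X @ W1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))       # sigmoid output
    loss = np.mean((p - y) ** 2)

    # backward pass: the chain rule applied layer by layer
    dp = 2 * (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2 = h.T @ dz2
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)
    dW1 = X.T @ dz1

    # gradient step on every weight
    W1 -= lr * dW1
    W2 -= lr * dW2

print(f"final loss: {loss:.4f}")
```

The error signal travels back through every layer; Hinton's point is that the brain very likely does not obtain its gradients this way.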
Abbeel: You wrote the paper on backpropagation for training neural networks that became the impetus for everything we do today, and now you're saying it's time to figure out whether we should change it, whether it should try to be more like the brain. Do you think backpropagation can be better than what the brain is doing?
Hinton: Rumelhart, Williams and I did write the paper on backpropagation (below), and it is the most cited one.
Paper address: http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf
Backpropagation was already known. What we really did was show that it could learn interesting representations; it's not that we invented backpropagation, but that we showed what it could do. We proposed that it could learn interesting representations, such as word embeddings. Backpropagation may well be much more efficient than what the brain has: it squeezes a lot of information into a handful of connections, where "a handful" is only a few billion. The brain's situation is the reverse: connections are very cheap, there are trillions of them, but experience is very expensive, so the brain tends to throw a lot of parameters at very little experience.
The neural networks we use are basically the opposite: they have a great deal of experience and try to cram the relationship between inputs and outputs into the parameters. I think backpropagation is more efficient than whatever the brain uses, but it is not very good at abstracting a lot of structure from not much data.
Abbeel: Do you have any hypotheses on how to possibly get better performance in this regard?
Hinton: For a long time I have believed that we need unsupervised objective functions, mainly for perceptual learning. If you can learn a model of the world by observing it, you can then act on the basis of that model rather than on the raw data, and you will do much better.
I believe the brain uses many small local objective functions; it is not an end-to-end chain trained by optimizing a single objective function.
For example, if you look at a small patch of an image and try to extract some representation of it, you can then compare the representation extracted from that patch with a prediction obtained from the context of nearby patches: a prediction of what should be in that part of the image.
Once you are familiar with a domain, these contextual predictions and the locally extracted features will usually agree, and even when they disagree, you can learn a great deal from the disagreement.
I think the brain can learn a lot from such local disagreements. A large image contains many small local patches, so there is a lot of feedback: the agreement between what is extracted locally and what the context predicts, at every location. We can get much richer feedback from these agreements with contextual predictions. It's hard to do, but I think things are now moving along those lines.
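A rough sketch of the kind of local objective Hinton is gesturing at might look like the following: a small encoder produces a representation of one patch, a context network predicts that representation from the surrounding patches, and the loss simply rewards agreement between the two. This is an illustrative construction of my own, not code from Hinton; the network shapes and names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative networks: a local encoder for one patch and a context
# network that predicts the local representation from neighbouring patches.
local_encoder = nn.Sequential(nn.Flatten(), nn.Linear(8 * 8, 64), nn.ReLU(), nn.Linear(64, 32))
context_net   = nn.Sequential(nn.Flatten(), nn.Linear(4 * 8 * 8, 64), nn.ReLU(), nn.Linear(64, 32))

def agreement_loss(patch, neighbours):
    """Reward agreement between the locally extracted representation
    and the prediction made from the surrounding context."""
    z_local   = F.normalize(local_encoder(patch), dim=-1)
    z_context = F.normalize(context_net(neighbours), dim=-1)
    return 1.0 - (z_local * z_context).sum(dim=-1).mean()   # cosine disagreement

patch      = torch.rand(16, 1, 8, 8)      # a batch of 8x8 image patches
neighbours = torch.rand(16, 4, 8, 8)      # four surrounding patches for each
loss = agreement_loss(patch, neighbours)
loss.backward()
```

In practice an objective like this also needs negative examples, or some other trick, to stop both networks from collapsing to a constant output, which connects to the discussion of contrastive learning later in the interview.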
Abbeel: What are your thoughts on work like SimCLR, and how does it differ from learning more generally? What do you think of the recent MAE (masked autoencoders)? How do they relate to what you just described?
Hinton: The evidence I have seen suggests that this kind of objective function works well.
I wasn't an author on that paper, but a long time ago I wrote a paper with Sue Becker about getting consistent representations from two different patches of an image. I think that was the origin of the idea of doing self-supervised learning by getting the representations of two patches of the same image to agree.
Abbeel: Let's come back to the contrast between backpropagation-driven, end-to-end learning and what you're proposing. You're saying that learning in a way closer to the brain, learning from less data and extracting more from it, will be key to progress. Today many people work on learning efficiently from unlabeled data because it requires less human effort, but they still use backpropagation as the underlying mechanism.
Hinton: What I don't like about MAE is that you take some input patches, go up through many layers of representation, and then come back down to reconstruct the missing input patches at the output of the network.
I think the brain has these levels of representation, but each level is trying to reconstruct what is at the level below. It isn't that you go up through many layers and then come all the way back; rather, each of the many layers tries to reconstruct something at the layer below it. That looks much more like the brain to me, and the question is: can you do this without using backpropagation?
Obviously, if you reconstruct missing pieces at the output after going through many layers, the information has to come back through all of those layers, and that backpropagation is built into all the simulators, but the brain does not work that way.
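As a hedged illustration of "each layer trying to reconstruct the layer below" without one long backward pass, here is a sketch in which every layer is trained only with its own local reconstruction loss, detaching the gradient so no error signal crosses a layer boundary. It illustrates the idea, not the brain's algorithm or any published model; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# A stack of layers, each with its own tiny decoder and its own local loss.
sizes = [784, 256, 64, 16]
encoders = nn.ModuleList(nn.Linear(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
decoders = nn.ModuleList(nn.Linear(b, a) for a, b in zip(sizes[:-1], sizes[1:]))
opt = torch.optim.SGD(list(encoders.parameters()) + list(decoders.parameters()), lr=0.01)

def local_training_step(x):
    total = 0.0
    h = x
    for enc, dec in zip(encoders, decoders):
        target = h.detach()                 # gradients never cross a layer boundary
        code = torch.relu(enc(target))
        recon = dec(code)
        total = total + nn.functional.mse_loss(recon, target)
        h = code                            # the next layer sees this layer's code
    opt.zero_grad()
    total.backward()                        # each loss only reaches its own layer's weights
    opt.step()
    return float(total)

x = torch.rand(32, 784)                     # e.g. a batch of flattened images
print(local_training_step(x))
```

Each layer improves itself using purely local information, which is the flavour of learning Hinton contrasts with end-to-end backpropagation.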
Abbeel: Imagine the brain faces three choices in handling these local objectives. First, what are the local objectives we want to optimize? Second, what algorithm do we use to optimize them? Third, what is the architecture, how do we connect neurons together so they can learn? On all three questions we don't seem to be doing very well yet. What do you think?
Hinton: For perceptual learning, at least, the answer is fairly clear.
You want retinotopic maps, a hierarchy of retinotopic maps that are locally connected in their architecture. That way you avoid a lot of the credit assignment problem, because you can assume that what happens at a particular location in one retinotopic map is determined by the corresponding location in the map below it, rather than by something far away; you use local interactions and work out locally what the pixels are doing.
Currently, what neural networks do is assume the same function is used at every location. That is true of convolutional neural networks and of transformers. The brain can't necessarily do that, because it would mean sharing weights and doing exactly the same computation everywhere. But there is a way to achieve the goal of weight sharing, of what convolution achieves, that I think could work in the brain.
If you want local extraction to agree with contextual prediction, imagine a bunch of columns doing local predictions and looking at nearby columns to get contextual predictions. You can think of the context as a teacher for the local predictions, and vice versa; the knowledge in the context gets distilled into the local extractors. What you get is mutual distillation: each provides a teaching signal to the other, which means knowledge about what you should extract in one location is being transferred to other locations.
When they try to reach agreement, when you want things at different locations to agree, for example when you want the nose and the mouth to agree that they are part of the same face, they should all produce the same representation. And when you try to get the same representation at different places, you allow knowledge to be distilled from one place to another, which has real advantages over actual weight sharing.
Obviously, from a biological perspective, one advantage is that the detailed architecture does not need to be the same in different locations, and another advantage is that the front-end processing does not need to be the same.
Take the retina as an example. Different parts of the retina have receptive fields of different sizes. Convolutional networks ignore this: they work at several different resolutions and perform convolutions at each one, but they cannot do different front-end processing at different places. If instead you distil from one location to another, what you want is a representation of the same function of the optic array at different locations, and then you can apply different preprocessing to the optic array at different places. Even though the front-end processing differs, the knowledge about the function itself can still be distilled across locations.
So, although distillation is less efficient than actually sharing weights, it is more flexible and far more neurally plausible. This was an important point I made about a year ago: you need tricks like weight sharing to get efficiency, but if what you want is for adjacent things to agree, local distillation works.
Abbeel: Since the brain works differently, should we keep thinking about weight sharing in some other way, or should we give up on weight sharing?
Hinton: I think we should keep doing convolutions in convolutional networks and keep sharing weights in transformers, sharing knowledge by sharing weights. Just remember that the brain shares knowledge not by sharing weights, but by sharing the function from input to output, using distillation to transfer the knowledge.
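Transferring knowledge by distillation rather than by copying weights is the same idea as Hinton's well-known knowledge-distillation technique: a student is trained to match the teacher's softened output distribution instead of its parameters. A minimal sketch, with toy models and a made-up temperature:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(20, 10)   # stands in for a trained model
student = nn.Linear(20, 10)   # a different model that never sees the teacher's weights
opt = torch.optim.SGD(student.parameters(), lr=0.1)
T = 2.0                       # temperature: softens the teacher's probabilities

x = torch.rand(64, 20)        # unlabeled (transfer-set) inputs
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / T, dim=-1)

# The student matches the teacher's input-output function, not its parameters.
student_log_probs = F.log_softmax(student(x) / T, dim=-1)
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
opt.zero_grad()
loss.backward()
opt.step()
```

The student's architecture, and even its "front-end processing," can be completely different from the teacher's, which is exactly the flexibility Hinton says weight sharing lacks.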
Abbeel: There is another widely discussed point on which the brain differs from current neural networks: neurons communicate with spikes, which is very different from the artificial neurons running on our GPUs. I'm curious what you think about this. Is it just an engineering difference, or is it something we need to understand better?
Hinton: It's not just an engineering difference. The retina, for example, does not use spiking neurons; it does a great deal of its processing with non-spiking neurons. But the cerebral cortex does use spiking neurons, and once we understand why, I think we will see that it is the biologically right thing to do. The crux is how a learning algorithm can get gradients for a network of spiking neurons, and at the moment nobody really knows.
With spiking neurons there are two quite different decisions: when to spike, and whether to spike at all. These are discrete decisions, and people have devised various alternative (surrogate) functions to try to optimize such systems.
In 2000, Andy Brown and I wrote a paper on trying to learn spiking Boltzmann machines. It would be great to have a learning algorithm that works for spiking networks, and I think the lack of one is the main thing holding back spiking-neuron hardware. Many people have realized that much more energy-efficient hardware can be built this way, and huge systems have been built, but what is really missing is a great learning procedure. So I think that until we get a good learning algorithm, we won't be able to do really impressive things with spiking neurons.
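The "alternative functions" people devise for optimizing spiking systems are often surrogate gradients: the forward pass keeps the hard, discrete spike, while the backward pass substitutes a smooth derivative so that gradient-based learning is at least possible. A hedged sketch of that trick (the surrogate shape and all constants are arbitrary, and this is not Hinton's method):

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, smooth surrogate in the backward pass."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()              # spike if the membrane potential crosses threshold

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # derivative of a "fast sigmoid" standing in for the non-existent Heaviside gradient
        surrogate = 1.0 / (1.0 + 10.0 * v.abs()) ** 2
        return grad_output * surrogate

spike = SurrogateSpike.apply

# One integrate-and-fire step: weighted inputs are accumulated, then thresholded.
w = torch.randn(5, requires_grad=True)
x = torch.rand(5)
membrane = (w * x).sum() - 1.0              # 1.0 plays the role of the firing threshold
out = spike(membrane)
out.backward()                              # gradients flow through the surrogate
print(out.item(), w.grad)
```

This makes training possible in simulation, but it sidesteps rather than answers Hinton's question of how a brain-like system could get such gradients.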
Take a standard artificial neuron and ask: can it tell whether two input values are the same? It cannot. But with spiking neurons it is easy to set up a system in which a cell fires if two spikes arrive at the same time and does not fire if they arrive at different times. So spike timing looks like a very good way to measure agreement. Biological systems do exactly this: the way you tell which direction a sound comes from is the difference in the time the signal takes to reach your two ears. Light travels about a foot in a nanosecond; sound travels about a foot in a millisecond. If I move a sound source a few inches to one side of you, the difference in path length to your two ears changes by only a fraction of an inch, so the difference in arrival time changes by only a tiny fraction of a millisecond. To get stereo from sound, we are sensitive to differences of something like 30 microseconds. The way we do it is with two axons carrying spikes in opposite directions, one from each ear, and cells that fire when spikes arrive at them at the same time.
Since spike times can be used to do such sensitive things, it would be surprising if the brain did not use precise timing elsewhere. For a long time I have thought that spike timing could be used to detect agreement, for example in self-supervised learning: I extract information about your mouth and your nose and predict your whole face from each, and if your mouth and nose really do form a face, those predictions will agree. It would be lovely to use spike times to check whether such predictions agree, but it is hard to do, because we do not have a good algorithm for training networks of spiking neurons.
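The sound-localization mechanism Hinton describes amounts to a bank of coincidence detectors at different delays (the classic Jeffress-style picture). A toy sketch, with invented spike times, that recovers an interaural delay by looking for coincident spikes:

```python
import numpy as np

# Toy spike times (in milliseconds) for the two ears; the right ear hears
# the same pattern delayed by 0.3 ms. Numbers are invented for illustration.
left = np.array([1.0, 4.2, 7.5, 9.1])
true_delay = 0.3
right = left + true_delay

def coincidences(left, right, delay, window=0.05):
    """Count spike pairs that arrive 'at the same time' once the left
    train is delayed by `delay` (one delay line in the bank)."""
    shifted = left + delay
    return sum(np.any(np.abs(right - t) < window) for t in shifted)

candidate_delays = np.arange(0.0, 1.0, 0.05)
counts = [coincidences(left, right, d) for d in candidate_delays]
best = candidate_delays[int(np.argmax(counts))]
print(f"estimated interaural delay: {best:.2f} ms")   # ~0.30
```

The delay line whose coincidence count is highest identifies the time difference, which is the kind of agreement-by-timing computation Hinton wishes could also be exploited for checking whether two predictions match.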
Abbeel: Did you just say that the retina doesn't use spiking neurons everywhere? Does the brain have two kinds of neurons, some more like our artificial neurons and some spiking?

Hinton: I'm not sure the retina's neurons are like artificial neurons, but the neocortex certainly uses spiking neurons as its main mode of communication, sending spikes from one cell to another.

I have an argument I like. The brain has an enormous number of parameters and, compared with the typical neural networks we use, not much data. In that regime it can overfit badly unless it uses strong regularization. A good regularization technique is dropout, where each time you use the network you ignore a whole bunch of the units, and perhaps that is what spiking is doing. What a neuron really communicates is an underlying Poisson rate: the process is stochastic, it emits spikes at random times, and the rate is determined by the input to the neuron. You might think it would be better to send the real-valued rate from one neuron to another, but if you want a lot of regularization you can add noise to that real-valued rate, and one way to add noise is to use spikes, which adds a great deal of noise. That is essentially the motivation for dropout: in any given time window most neurons are not involved in anything, and you can think of the spikes as representing an underlying rate. It sounds terrible because it is so noisy, but once you understand regularization, it is a very good idea.

So I am still fond of the idea that we are not really using spike timing at all, that spikes are just a very noisy representation of an underlying rate and that this noise is a good regularizer, and I flip back and forth between the two views. Some years I think neural networks should be deterministic, and a few years later I think randomness is essential and changes everything; I seem to be on about a five-year cycle. Boltzmann machines are intrinsically stochastic, and that is essential to them. The important thing is not to commit entirely to either view but to stay open to both, and to keep thinking about spiking neurons and how to train networks of them effectively.

Abbeel: If we set aside the training part for now, since inference seems more tractable, wouldn't people want pure inference chips: do efficient training elsewhere and then compile the result onto a spiking-neuron chip so it can do very low-power inference?

Hinton: A lot of people have thought of that, and it is very sensible. Using spiking networks for inference may well be the right thing; it has been shown to be more energy-efficient, and several companies have built big spiking systems. Once you are doing that, you become more and more interested in learning methods that make better use of the available hardware. You can imagine a system in which the learning is done on auxiliary hardware rather than on the low-energy hardware itself, and the result is then transferred onto the low-energy hardware.

Abbeel: What is AlexNet, and how did it come about? What was your path from studying restricted Boltzmann machines and trying to understand how the brain works to AlexNet?
Hinton: I would say that what happened is that we suddenly proved that the more traditional neural-network approach really works. Around 2005 I became fascinated by the idea of using a stack of restricted Boltzmann machines to pre-train feature detectors, which made it much easier to get backprop to work. Later it turned out that with enough data you didn't need the pre-training, and because Fei-Fei Li and her team had assembled ImageNet, there was enough data. (Pre-training has since come back: GPT-3 uses pre-training, and it is still a good idea.) Once we found that pre-training made backprop work much better, it helped a lot with speech, as George Dahl and Abdel-rahman Mohamed showed in 2009. After that, Alex Krizhevsky, a graduate student in my group, started applying the same ideas to vision, and we quickly discovered that with ImageNet-scale data you didn't need pre-training.

I remember Ilya coming into the lab one day and saying, "Look, now that this stuff works for speech recognition, it really works; we have to do ImageNet before anyone else." He tried to get the students and postdocs interested, and they said, "Sure, but I'm busy with other things"; he couldn't get anyone to work on it. Eventually he convinced Alex to do it by preprocessing the data into exactly the form Alex needed. That's the background. Many researchers know this, but maybe not everyone: Alex's result roughly halved the error rate of the previous approaches in the ImageNet recognition competition. That is why everyone in computer vision moved away from hand-engineered methods and switched to neural networks.

Abbeel: What moment marked a major change in your career? What impact did moving from academia to Google have on you, and why did that shift happen?

Hinton: I have a son with a disability, and I needed to provide for him, so I needed a lot of money. One way was teaching online. In 2012 I taught a course on neural networks, but the platform's software wasn't very good, so it was very fiddly: one chapter a week, and I had to provide an instructional video that a ton of people wanted to watch. Sometimes Yoshua Bengio would ask the next day, "Why did you say that in the lecture?"

My original understanding with the University of Toronto was that if any money was made, the university would share it with the professors. They never specified exactly how it would be split, but people assumed something like 50%, and I had no problem with that. After I had started teaching, though, the provost made a unilateral decision, without consulting me or anyone else, that if money came in from the course platform, the university would keep all of it and the professors would get nothing. When I asked the school to help me make the videos, the people who made them asked me, "Do you know how expensive it is to make a video?" That made me really angry with the university, so I started thinking about careers other than professor. Right at that time various companies suddenly became interested in us and wanted to recruit us, either with large grants or by funding startups. Normally I would have said no, I didn't want to try to make extra money from my research, but the experience of the university scamming me out of that money made me want to find some other way to earn it.

Abbeel: What was the auction like?
Hinton: It happened at the NIPS conference, which Terry had organized in a casino; in a smoky room in the hotel basement, with people gambling upstairs, it felt like being in a movie. We had absolutely no idea how much we were worth. I consulted a lawyer, who said I could either hire a professional negotiator or run an auction. As far as I know, it was the first time a small group like ours had been auctioned. We ran the auction over Gmail: people had to email me their bids, with the email timestamp as the record. The price kept going up, first $500,000, then $1 million. It was exciting, and we discovered we were worth far more than we had thought. In retrospect we could probably have gotten more, but we had already reached a figure that seemed astronomical to us. We both wanted to work for Google anyway, so we stopped the auction and joined Google.

Abbeel: As I understand it, you are still at Google today.

Hinton: I'm still at Google; it has been nine years. The main reason I like it is that the core team is really good. I get along very well with Jeff Dean; he is very smart and I am very straightforward. He wanted me to do exactly what I wanted to do, which is basic research. He thought what I should do was try to come up with completely new algorithms, and that is what I wanted, so it was a good fit. I am not good at managing a large team to improve speech recognition by one percent; what I would love to do is revolutionize the field again.

Abbeel: You were a computer science professor at the University of Toronto, but you never got a degree in computer science; you got a degree in psychology and worked as a carpenter. How did you go from studying psychology, to carpentry, to artificial intelligence?

Hinton: My last year at Cambridge was very difficult. After my exams I dropped out and became a carpenter, which I enjoyed more than anything else. In high school I had sat through all my classes and then gone home at night to do woodworking, so I became a carpenter and worked at it for about six months. But I couldn't make a living from carpentry alone, so I worked as a carpenter and decorator, earning money from the decorating while really enjoying the carpentry. It wasn't until I met a real carpenter that I realized I was completely hopeless at it: he could cut a piece of wood perfectly square with a hand saw. Compared with him I was so hopeless that I decided I might as well go back to artificial intelligence.

Abbeel: As far as I know, you did your PhD at the University of Edinburgh.

Hinton: Yes, I went there to do a PhD on neural networks with the famous professor Christopher Longuet-Higgins, who was really brilliant; he nearly won a Nobel Prize in his thirties for work on the structure of boron hydrides. He had been very interested in neural networks and their relationship to holograms, but around the day I arrived in Edinburgh he lost interest in them: after reading Winograd's thesis he changed his mind completely and decided neural networks were the wrong way to think. Even though he totally disagreed with what I was doing, he didn't stop me from doing it.

Abbeel: In the early 1970s, when everyone else was saying that neural networks, after Minsky and Papert, were nonsense, why did you keep going?

Hinton: In fact, the first talk I gave to that group was about how to do true recursion with neural networks. That was a talk I gave in 1973, 49 years ago.
One project I had was this: suppose you want a neural network that can draw a shape, and the shape is divided into parts. It may be that a part of the shape is drawn by the same neural hardware that draws whole shapes, while a separate centre stores the whole shape; so the network has to remember where it is in the whole shape, and what the position and orientation of the whole shape are, while the same neurons are reused to draw the part. Once the subroutine for the part is finished, it has to jump back to where it was. The question is how the neural network remembers all that, because obviously you can't just copy the neurons. I managed to get a system working in which the network remembered it using fast weights, weights that adapt quickly and decay quickly. So in 1973 I had a neural network doing true recursion, reusing the same neurons and weights for the recursive call as for the top-level call. I don't think people understood the talk, because I wasn't good at giving talks, but they also asked why you would want recursion in a neural network at all. What they didn't understand is that unless we can get neural networks to do things like recursion, we will never be able to explain a whole range of abilities. This has now become an interesting problem again, so I'm going to wait another year, until the idea is a proper antique, and write it up as a research report when it turns 50.

Abbeel: When you were a doctoral student, or had just finished your PhD, almost everyone was telling you that what you were doing was a waste of time, yet you were convinced it wasn't. Where did that conviction come from?

Hinton: I think a big part of it was my schooling. My father sent me to an expensive private school with a good science education, and I went there from the age of seven. It was a Christian school, and all the other kids believed in God, but at home I was taught that this was all nonsense, and it seemed to me that it was. So I got used to everyone else being wrong, and I think that was important. You have to have faith in science and be willing to work on things that are obviously right even if everyone else says they are nonsense. And in fact not everyone did think so. In the early 1970s almost everyone working on AI thought what I was doing was nonsense, but if you go a little further back, in the 1950s both von Neumann and Turing believed in neural networks, and Turing especially believed in training neural networks with reinforcement. I still believe that if they had not died so young, the whole history of artificial intelligence might have been very different, because they were minds powerful enough to dominate the field, and they were very interested in how the brain works.

Abbeel: Today's deep learning is very effective. Is it all we need, or do we need something else? You once said, perhaps I am misquoting you, that deep learning can do everything.

Hinton: What I really meant is that deep learning, in the sense of adjusting a huge number of parameters using stochastic gradients, can do everything; but the way you get the gradients may not be backpropagation, and the objective they improve may not be some final performance measure but rather many local objective functions. I think that is basically how the brain works, and I think it explains everything.
Existing computers run either a conventional program or a neural network with many weights (which is a different kind of program). If your hardware dies, you can run the same program on another piece of hardware. That makes the knowledge immortal: it does not depend on any particular piece of hardware for its survival. But the cost of that immortality is enormous, because it means different pieces of hardware have to do exactly the same thing (after all the error correction, obviously). They have to compute exactly the same function, which means being digital, or basically digital; they do things like multiplying numbers together, and making the operations that discrete takes a great deal of energy, which is not what the hardware would naturally do. Once you demand that your program or neural network be immortal, you are committing to very expensive computation and very expensive manufacturing.
If we are willing to give up immortality, the reward is very low-energy computation and very cheap manufacturing. So what we should perhaps be doing is not manufacturing computers but growing them, letting them evolve. As an analogy, if you pull a potted plant out of its pot you get a ball of roots in exactly the shape of the pot, so all potted plants have root balls of the same overall shape, yet the details of every root system are different. They all do the same job, extracting nutrients from the soil; their function is the same. In the fairly near future I think we will see non-immortal computers that are cheap to make, have to learn all of their knowledge, and require very little energy. When such a computer dies, its knowledge dies with it; looking at its weights is useless, because those weights only mean something on that particular hardware. What you have to do instead is distil the knowledge into other computers.

Abbeel: The neural networks dominating the headlines today are very large; in some sense the scale of large language models is beginning to approach that of the brain, which is very impressive. What do you make of these models, and what limitations do you see? On the other hand, ants obviously have far smaller brains than humans, and to be fair our artificial visuomotor systems are not yet at the level of an ant or a bee. So what do you think of the recent big advances in language models?

Hinton: A bee's brain may seem small to you, but I think a bee has about a million neurons, which puts it closer to GPT-3 than you might expect; a bee's brain really is a large neural network. My point is that if you have a system with a huge number of parameters, and those parameters are tuned by gradient descent on some sensible objective function, you get remarkably good performance, like GPT-3 and similar models from Google. That does not settle the question of whether they do things the way we do. I think we do things, such as recursion, that we somehow carry out inside neural networks, and I tried to address some of these questions in last year's paper on GLOM, about how to represent part-whole hierarchies in a neural network. You have to have structure, and if by "symbolic computation" you mean having part-whole structure, then yes, what we do is symbolic computation. But that is not what people usually mean by hard symbolic computation, where you operate on symbols and the rules depend only on the form of the symbol strings, where the only properties a symbol has are that it is identical to or different from other symbols, and perhaps that it can be used as a pointer to something.

Paper address: https://arxiv.org/pdf/2102.12627.pdf

Neural networks are very different from that, so I don't think we do hard symbolic processing, but we certainly do part-whole hierarchies, and we do them in giant neural networks. What I am not quite sure about is to what extent GPT-3 actually understands what it says. It seems fairly clear that it is not like the old chatbot Eliza, which just rearranged strings of symbols with no idea what it was saying. The reason to believe this is that, for example, if you say in English "give me a picture of a hamster wearing a red hat," the model will draw a hamster wearing a red hat, even though it has never seen that pairing before, so it must understand the relationship between the English words and the picture before it can produce the picture.
If you asked the neural-network skeptics, the neural-network deniers, "how would you prove that it understands?", I think they would have accepted this test: ask it to draw a picture, and if it draws the right picture, it understands.

Abbeel: Google's PaLM model recently demonstrated that it can explain why jokes are funny, which seems to require a fairly deep understanding of language.

Hinton: Some people say it is just rearranging words it saw in training, and I don't agree with that: if it didn't understand what the joke was about, how could it generate those explanations? Still, I remain open-minded, because the framework is backpropagation, and it may arrive at an understanding that is completely different from ours. Adversarial images show that a network can identify objects by their texture, and it may be right in the sense that this generalizes to other instances of the object, yet it is a completely different approach from ours. I like to use insects and flowers as an example. Insects can see ultraviolet light. Two flowers that look identical to us may look completely different to an insect. Can we say the insect is mistaken just because the flowers look the same to us? The flowers co-evolved with the insects, and the ultraviolet pattern signals to the insect which flower it is. Clearly the insect is right; we simply cannot see the difference. That is another way to think about adversarial examples. The question is, who is right? Just because two flowers look the same to us does not mean they really are the same; in that case the insect may be right that the two flowers are very different.

Abbeel: With our current image-recognition networks, though, one might object that since we built them to do a job for us, we don't really want to just concede, "Well, they were right and we were wrong." They need to recognize cars and pedestrians.

Hinton: Yes. I just want to point out that who is right and who is wrong is not as simple as it seems. Part of the point of my GLOM paper was to try to build perceptual systems that are more human-like, so that they make the same kinds of mistakes we make rather than bizarre ones. If a self-driving car makes a mistake that any normal human driver would also make, that seems far more acceptable than making a really stupid mistake.

Abbeel: As far as I know, sleep is also something you have been thinking about. Can you say more about that?

Hinton: If you deprive people of sleep, they become really strange. Keep someone awake for three days and they start to hallucinate; keep it up for a week and they go insane and never recover. The question is why. What is the computational function of sleep? If depriving you of it drives you completely insane, it is probably doing something very important. The standard theory is that sleep is for consolidating memories, downloading them from the hippocampus into the cortex, which always struck me as a slightly odd story.
A long time ago, in the early 1980s, Terrence Sejnowski and I had a theory called Boltzmann machines, based in part on an insight of Francis Crick's when he was thinking about Hopfield nets. He had a paper on sleep, and his point was that you could give the network random things and have it unlearn them. In a Hopfield net, you give the network something you want it to remember and it changes its weights so that the energy of that vector is lower. It works better if you also give it a random vector and raise that vector's energy, and that leads to the Boltzmann machine: instead of something random, you raise the energy of things generated by the model's own Markov chain. Lowering the energy of the data and raising the energy of the model's own samples is actually a form of maximum likelihood learning. We thought that is what sleep is: sleep is the negative phase of learning.

In contrastive learning, for two patches from the same image you try to make the representations similar, and for two patches from different images you try to make them very different. Once they are different enough, you don't push them further apart; you just stop them from being too similar. That is the principle of contrastive learning. With Boltzmann machines you cannot separate the positive and negative phases: you have to interleave positive examples with negative examples or the whole thing goes wrong. I tried hard to avoid interleaving them, doing a long run of positive examples and then a long run of negative examples, and it is difficult. In contrastive learning you can separate the two phases, doing many positive pairs and then many negative pairs. If you can separate the positive and negative phases in time, doing a bunch of positive updates and then a bunch of negative updates, contrastive learning becomes much more plausible, although even standard contrastive learning needs a lot of momentum and similar tricks to work that way.

So I think the most likely function of sleep is to do forgetting, or to do the negative examples, and that is why you don't remember your dreams. When you wake up, you remember the contents of your fast weights, because fast weights are a temporary store, so you remember only the last minute or so of the dream you were having when you woke. I think this is a more believable theory of sleep than any other. If you removed the negative phase, it would explain what happens: the whole system collapses, you make catastrophic mistakes, you start hallucinating and doing all kinds of strange things.

Let me say a little more about why negative examples are needed. Suppose a neural network is trying to optimize some internal objective function about its representations, say the agreement between contextual predictions and local predictions. It wants that agreement to be a property of the real data, but the problem inside a neural network is that the input to a neuron can contain all sorts of correlations that have nothing to do with the real data; they are produced by the wiring of the network. If two neurons are both looking at the same pixel, their activities will be correlated.
But that correlation tells you nothing about the data, so the question is how to learn to extract structure that is about the real data and not about the network's wiring. The way to do it is to feed the network positive examples and have it find structure that is present in the positive examples but not in the negative examples, because the negative examples go through exactly the same wiring. If some structure is present in the positive examples and absent in the negative ones, then it is about the difference between them, not about how the network is wired. People don't think about this much, but if you have a powerful learning algorithm, you really don't want it to spend its capacity learning the network's own wiring.

Abbeel: You also mentioned the idea of "the student model beating the teacher model." What does that mean?

Hinton: MNIST is the standard digits dataset. You can replace the training labels with wrong ones so that you have a training set in which the labels are correct only 20% of the time and wrong 80% of the time. The question is: can you learn from such a dataset, and how well? The answer is that you can learn to be 95% accurate. So the teacher is 80% wrong and the student is 95% right: the student is far better than the teacher. The cases are corrupted one at a time, so you cannot average the errors across different training cases that happen to have similar images.

If you ask how many training cases you need when the cases are corrupted, it gets interesting when the dataset is small. Would you rather have a million mislabeled images or ten thousand correctly labeled ones? My hypothesis is that what matters is the mutual information between the label and the truth. If the labels are randomly corrupted 90% of the time, there is no mutual information between the label and the truth; if they are corrupted 80% of the time, there is a small amount of mutual information.

I noticed you have done some sim-to-real work along these lines: a neural network labels real data imperfectly, a student learns from those labels, and the student ends up better than the teacher. People are puzzled about how a student can be better than its teacher, but with neural networks this is easy to achieve: given enough training data, the student will be better than the teacher even when the teacher is unreliable. I wrote a paper about this a few years ago with Melody Y. Guan using some medical data; the first part of the paper covers this, and our rule of thumb was that the mutual information between the assigned label and the truth is what matters.

Paper address: https://www.cs.toronto.edu/~hinton/absps/WhoSaidWhat.pdf

Abbeel: In the work you just mentioned, and in some recent popular work, the teacher provides noisy labels, but not all of those noisy labels are used; there is an argument that you should keep only the ones the teacher is confident about.

Hinton: Yes, that is a sensible approach; it can help to use only the labels you have some reason to believe. For MNIST, Melody drew a graph showing that as long as 20% of the labels are correct, the student gets to 95% correct.
But when the fraction of correct labels drops to about 15%, you get a sudden phase transition, and below it the best you can do is guess. For the student to beat the teacher, it has to be able to work out, to some extent, which of the teacher's labels are right and which are wrong: once the student sees the relationship between input and label, a mislabeled example becomes recognizably wrong. So random labeling errors don't matter much, but there is a phase transition, and the labels have to stay good enough for the student to get a grip. This also explains how students can end up smarter than their professors.

Abbeel: When training neural networks, and especially when building a model, you want to understand what the model is learning, and people often try to visualize what is happening during learning. You invented one of the most popular visualization techniques, t-SNE (t-distributed Stochastic Neighbor Embedding). What is the story behind it?

Paper address: http://www.cs.toronto.edu/~hinton/absps/tsne.pdf

Hinton: If you have some high-dimensional data and you want to draw a two- or three-dimensional map of it, you can take the first two principal components and plot just those. But principal components care about getting the large distances right: if two things are very different, PCA is very concerned that they come out very different in the two-dimensional plot. It doesn't care at all about small differences, because it works with the squares of the large differences, so it does not preserve high-dimensional similarity structure well. What we are usually interested in is the opposite: what is very similar to what. We don't care if the large distances are a little wrong, as long as the small distances are right.

I had an idea a long time ago: what if we convert distances into pairwise probabilities? There are various versions. Suppose pairs at a small distance get a high probability and pairs at a large distance get a low probability: you put a Gaussian around a data point, compute the density of every other data point under that Gaussian as an unnormalized probability, and then normalize. Then you place the points in two dimensions so as to preserve those probabilities. The method does not care much about pairs of points that are far apart, because their pairwise probability is very low, so their relative positions hardly matter. This produces very nice maps, and it is called stochastic neighbor embedding: we thought of it as putting a Gaussian around each point and picking a neighbor at random according to the density under that Gaussian, and it has a very simple derivative.

The maps we got tended to crowd everything together, which is a basic problem when converting high-dimensional data into low-dimensional data. It comes from the nature of high-dimensional versus low-dimensional spaces: in a high-dimensional space, one data point can be close to many other points, whereas in a low-dimensional space those points cannot all be close to it without also being close to one another. So you run into trouble when embedding from high dimensions into low dimensions.
While I was working on this I had the idea of making a mixture model using probability as an "intermediate currency." We got the idea working, but never very well, and I was disappointed that people didn't take it up. Then I made a simpler version that was a mixture of a Gaussian and a uniform distribution, which worked better. The idea is this: in one map, all pairs are equally likely, which supplies a small background probability for the large distances; in the other map, the probability a pair contributes falls off with their squared distance in that map. That means points can be very far apart in that map if they want to be, because the uniform map supplies the background probability that has to be accounted for. This version is really a special case of a mixture of Gaussians in which one Gaussian is so wide it is effectively uniform. So what if you use a whole hierarchy of Gaussians of many different widths? Such a mixture is a t-distribution, and that gives you t-SNE. A nice property of t-SNE is that it gives you multi-scale information: as the distances get larger you still get structure at many different levels, both the coarse overall structure and the fine structure.

The objective function used in all this work, a Gaussian relative density, came out of earlier work I did with Alberto Paccanaro. That work was hard to publish at the time; I got a rejection from a conference that said, "Hinton has been working on this idea for seven years, and no one is interested." I think what comments like that actually tell you is that you are doing something genuinely innovative. That line of work involves a function like NCE, which uses a contrastive method, and t-SNE is really a version of that function, but aimed at making maps. I then tried to build the mixture version, which didn't work at first, until I finally hit on the right ingredient, which was the t-distribution.
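For readers who want to see the mechanics Hinton describes, here is a small sketch of the two ingredients: Gaussian-based pairwise probabilities in the high-dimensional space, and heavy-tailed Student-t probabilities in the low-dimensional map. The data and the single bandwidth are arbitrary; a real implementation (such as scikit-learn's TSNE) also tunes the bandwidth per point via the perplexity and minimizes the KL divergence by gradient descent on the map positions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # toy high-dimensional data
Y = rng.normal(size=(100, 2))           # a candidate 2-D map (normally optimized)

def pairwise_sq_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

# High-dimensional similarities: Gaussian kernel, normalized into probabilities.
sigma = 1.0                              # a single bandwidth, for simplicity
P = np.exp(-pairwise_sq_dists(X) / (2 * sigma ** 2))
np.fill_diagonal(P, 0.0)
P /= P.sum()

# Low-dimensional similarities: Student-t (heavy-tailed) kernel, as in t-SNE,
# which lets dissimilar points sit far apart without a big penalty.
Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# The objective t-SNE minimizes (by moving the points in Y) is this KL divergence.
kl = np.sum(P * np.log((P + 1e-12) / (Q + 1e-12)))
print(f"KL(P || Q) = {kl:.3f}")
```

The heavy tails of the t-distribution in the low-dimensional map are exactly what relieves the crowding problem described above.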