Why do Taobao and Tencent need so many top experts to build websites that don't look very complex? I'll take Taobao as an example, as a bit of popular science for newcomers.
#Let's start with the most important things on the pages you see:
[Search for products] If you have a few thousand items, select * from tableXX where title like '%XX%' will do the job. But when you have 10,000,000,000 (10 billion) products, no single database can hold them, so how do you search? This calls for a distributed data storage solution. In addition, the search cannot fetch data directly from the database; a search engine must be used (put simply, search engines are faster). Okay, now that the product can be found, are we done, and can you go ahead and buy? Not so fast: whose products appear on the first page? That requires a hugely complex ranking algorithm. And if we also want to make personalized recommendations based on your purchasing behavior, that alone is enough to keep a bunch of awesome algorithm engineers busy for a lifetime.
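To see why a search engine is faster than a like '%XX%' scan, here is a minimal sketch of an inverted index, the basic data structure behind most search engines. The sample titles, the function names `build_index` and `search`, and the whitespace tokenization are all assumptions made for illustration; this is not a description of Taobao's actual search system.

```python
from collections import defaultdict


def build_index(titles):
    """Map each lowercase token to the set of product ids whose title contains it."""
    index = defaultdict(set)
    for product_id, title in titles.items():
        for token in title.lower().split():
            index[token].add(product_id)
    return index


def search(index, query):
    """Return the sorted ids of products whose titles contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical sample data.
titles = {1: "red wool sweater", 2: "blue cotton shirt", 3: "wool scarf red"}
index = build_index(titles)
print(search(index, "red wool"))  # [1, 3]
```

A lookup touches only the posting sets of the query tokens instead of scanning every title, which is what makes it practical to split the index across many machines.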
[Product details] Once the search is done and you see something you like, you click through to the product page. This page contains product attributes, a detailed description, reviews, seller information, and so on. It gets more than 3 billion views a day. Again, if you build a website that 10 people visit each day, you will not feel any pressure on the server at all. At 3 billion, there are many problems to solve. First of all, these requests cannot be allowed to hit the database directly. Any database, single-machine or distributed, would collapse miserably under 3 billion requests a day. The technology needed here is a large-scale distributed cache: seller information, review information, and product descriptions are all served from the cache. What about something as extreme as a product's view count, which has to be refreshed every time the page is opened? Do you think that can be served from a cache too? Taobao has done it; the entire product detail page comes from the cache.
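The idea of keeping requests away from the database can be sketched with the common cache-aside pattern. This is only an illustration under stated assumptions: the `ProductCache` class, the in-memory dict standing in for a distributed cache, and the `fake_db` data are all hypothetical.

```python
class ProductCache:
    """Cache-aside: read from the cache first, fall back to the database on a miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical slow database lookup
        self._cache = {}                   # stands in for a distributed cache
        self.db_hits = 0                   # how many reads reached the database

    def get(self, product_id):
        if product_id in self._cache:
            return self._cache[product_id]
        self.db_hits += 1
        value = self._load_from_db(product_id)
        self._cache[product_id] = value
        return value


# Hypothetical data: one product viewed 1000 times.
fake_db = {42: {"title": "wool sweater", "seller": "shop-a"}}
cache = ProductCache(fake_db.__getitem__)
for _ in range(1000):
    detail = cache.get(42)
print(cache.db_hits)  # 1
```

Only the first view reaches the database; the other 999 are answered from memory. A real deployment also has to handle expiry and invalidation, which this sketch leaves out.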
[Product pictures] A product has 5 pictures, and there are more in the product description. Guess how many pictures Taobao has to store? More than 10 billion. With that many pictures on your hard drive, how would you find one of them? If a classmate wanted to copy your pictures, how many hard drives would he need to bring? How much bandwidth would you need? Could your network card handle it? How long would the copy take? Unfortunately, there is no commercial solution on the market at this scale, so in the end we had to build our own storage system. If you have heard of Google's GFS, ours is similar and is called TFS. By the way, Tencent also has such a system, and it is also called TFS.
[Advertising System] There are many advertisements on Taobao. What, you didn't know? That shows our advertising is done rather well: many people don't even perceive it as advertising. How do sellers bid for advertising slots on Taobao? How are ads displayed? How is ad performance measured? This is yet another system full of sophisticated algorithms.
[BOSS System] How do Taobao staff manage such a huge system? For example, suppose it is suddenly announced that all of a certain writer's works must disappear from Taobao. Making all the relevant data vanish within a few minutes, from the database to the search engine to the advertising system, requires an excellent back-end support system.
[Operation and Maintenance System] How many servers do you think it takes to support such a huge website? A few thousand? That's only a fraction. With this many servers, which operating system is deployed on them, and can its kernel be tuned? Can the Java virtual machine be tuned? Is there any performance left to squeeze out of the communication module? How is software deployed? How do you roll back when something goes wrong? Have you ever installed an operating system yourself and then tuned it? Ever been tripped up by 360, or had the machine crash on you? There is a lot of know-how here.
I won't write any more. Beyond what is mentioned above, there are many, many more technologies that have to be built. Of course, that does not mean these things are so complex as to be unattainable; anything complex and huge is built up from something small. It takes unbelievably awesome experts, and it also takes rookies full of curiosity. As for that last sentence, feel free to assume I have ulterior motives.
#forecho :
I just read a very interesting article that explains this very clearly: "You Just Bought Something on Taobao"
You realize that Chinese New Year is coming, so you decide to buy a sweater for your girlfriend and open http://www.taobao.com. Your browser first queries a DNS server to convert the domain name in http://www.taobao.com into an IP address. You will find, though, that the resolved IP address is likely to differ depending on your region and network (China Telecom, China Unicom, China Mobile). This is the first step of load balancing: when resolving the domain name, DNS directs your visit to one of several entry points and tries to make sure that the one you get is the fastest of them all for you. (This is different from the CDN mentioned later.)
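As a rough illustration of DNS-level load balancing, the sketch below returns a different entry IP depending on the client's network. Everything in it is an assumption: the `resolve` function, the ISP labels, and the addresses (taken from the ranges reserved for documentation) are hypothetical, and real resolvers decide based on the source address of the query rather than a label.

```python
# Hypothetical entry points; the addresses come from reserved documentation ranges.
ENTRY_POINTS = {
    "telecom": "192.0.2.10",
    "unicom": "198.51.100.10",
    "mobile": "203.0.113.10",
}
DEFAULT_ENTRY = "192.0.2.10"


def resolve(domain, client_isp):
    """Return the entry IP for a client, chosen by the client's network."""
    if domain != "www.taobao.com":
        raise KeyError(domain)
    return ENTRY_POINTS.get(client_isp, DEFAULT_ENTRY)


print(resolve("www.taobao.com", "unicom"))   # 198.51.100.10
print(resolve("www.taobao.com", "telecom"))  # 192.0.2.10
```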
Through this entry point you reach the actual IP address serving http://www.taobao.com. At this moment you generate one PV (Page View). A website's total daily PV is an important indicator of its size. On ordinary days (outside promotion periods) the PV of the whole Taobao site is between 1.6 and 2.5 billion. Meanwhile, as one distinct user, all the pages you visit on Taobao during this visit count as a single UV (Unique Visitor). The recently infamous http://12306.cn peaks at around 1 billion PV a day, yet its UV count is far smaller, well under a tenth of Taobao's. I believe everyone knows why.
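The difference between PV and UV is easy to show in a few lines. This is a minimal sketch; the log format of (user id, URL) pairs and the function name `count_pv_uv` are assumptions for illustration.

```python
def count_pv_uv(access_log):
    """access_log is a list of (user_id, url) page requests for one day."""
    pv = len(access_log)                                # every page request counts
    uv = len({user_id for user_id, _ in access_log})    # each user counts once
    return pv, uv


# Hypothetical log: alice views three pages, bob views one.
log = [
    ("alice", "/"),
    ("alice", "/search?q=sweater"),
    ("bob", "/"),
    ("alice", "/item/1"),
]
print(count_pv_uv(log))  # (4, 2)
```

A site whose users keep refreshing the same page racks up PV without gaining any UV.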
Because so many people visit http://www.taobao.com at the same time, even the Taobao home page alone cannot be generated by a single server. There may be hundreds or even thousands of servers used only to generate the home page of http://www.taobao.com, so the task of generating the page for your visit is assigned to one of them. This assignment must be fair and even: each of these hundreds or thousands of servers should serve roughly the same number of users. This very complex process is carried out by several systems, the most critical of which is LVS (Linux Virtual Server), one of the most popular load balancing systems in the world, developed by Dr. Zhang Wensong, who currently works at Taobao.
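The "fair and even" requirement can be illustrated with round-robin scheduling, one of the simplest load balancing strategies. The sketch below is not LVS code; the `RoundRobinBalancer` class and the server names are hypothetical.

```python
from collections import Counter
from itertools import cycle


class RoundRobinBalancer:
    """Hand each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self):
        return next(self._servers)


servers = ["web-1", "web-2", "web-3"]  # hypothetical home page servers
balancer = RoundRobinBalancer(servers)
load = Counter(balancer.pick() for _ in range(9000))
print(load["web-1"], load["web-2"], load["web-3"])  # 3000 3000 3000
```

Production balancers usually also weigh servers by capacity and current connection count, which plain rotation ignores.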
After a series of complex logical operations and data processing, the HTML of the Taobao home page for this visit has been generated. Anyone with a little web front-end knowledge knows that the browser will next load the CSS, JS, images, and other resource files used by the page. Fewer people may know that a browser limits how many resources it loads concurrently from the same domain name: for example, IE6-7 allow two, IE8 allows six, and Chrome varies by version, usually 4-6. I just checked, and visiting Taobao's home page requires loading 126 resources, so with so few concurrent connections loading naturally takes a long time. Front-end developers therefore often spread these resource files across multiple domain names, which works around the browser's limit and also prepares the ground for the CDN described below.
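A back-of-the-envelope sketch shows what spreading resources over several domains buys. It assumes, unrealistically, that every resource takes the same time to download, so loading proceeds in "rounds"; the function names and the example.com domains are hypothetical.

```python
import math
import zlib


def load_rounds(num_resources, per_domain_limit, num_domains=1):
    """Rounds of parallel downloads needed if every resource takes equally long."""
    parallel = per_domain_limit * num_domains
    return math.ceil(num_resources / parallel)


def shard_domain(resource_path, domains):
    """Pick a stable domain per resource so repeat visits reuse the same URL."""
    return domains[zlib.crc32(resource_path.encode("utf-8")) % len(domains)]


print(load_rounds(126, 6))     # 21 rounds on a single domain
print(load_rounds(126, 6, 4))  # 6 rounds across four domains

# Hypothetical asset domains.
domains = ["img1.example.com", "img2.example.com", "img3.example.com"]
print(shard_domain("/js/app.js", domains))
```

Hashing the path, rather than picking a domain at random, keeps each resource on one URL so browser and CDN caches stay effective.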
According to unconfirmed reports, at the peak of Double Eleven, Taobao's traffic reached 871GB/s. That is the combined capacity of about 1.78 million 4Mb home broadband lines, fully capable of overwhelming the entire Internet bandwidth of a small or medium-sized city. Obviously, this traffic cannot all be concentrated in one place. Everyone also knows that access across different networks (China Telecom, China Unicom, etc.) and across regions is very slow, yet you rarely find Taobao slow. This is the work of the CDN (Content Delivery Network). Taobao has set up dozens or even hundreds of CDN nodes across the country and uses various means to ensure that what you access (here mainly JS, CSS, images, and the like) comes from the CDN node closest to you, which both spreads the heavy traffic across many places and speeds up access.
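The 1.78 million figure can be checked with simple arithmetic. The sketch assumes 1GB = 1024MB and 8 bits per byte, which is the convention that reproduces the number quoted in the article.

```python
peak_gb_per_s = 871   # peak traffic quoted in the article, in gigabytes per second
mb_per_gb = 1024      # assumption: binary units
bits_per_byte = 8
home_line_mbit = 4    # one 4Mb home broadband line, in megabits per second

peak_mbit_per_s = peak_gb_per_s * mb_per_gb * bits_per_byte
lines_needed = peak_mbit_per_s / home_line_mbit
print(round(lines_needed))  # 1783808, i.e. about 1.78 million lines
```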
This raises a problem: when a seller publishes a new item and uploads several pictures of it, how does Taobao make sure these pictures are synchronized to CDN nodes across the country for users to load? This involves a large body of content distribution and synchronization technology. Taobao developed the distributed file system TFS (Taobao File System) to handle this kind of problem.
Okay, the Taobao home page has finally loaded, so out of habit you type 'sweater' into the search box on the home page and hit Enter. You have now generated another PV, and Taobao's main search system starts working for you. It first segments your input into words using a word segmentation dictionary. As we all know, English is written in words separated by spaces, whereas Chinese is written in characters, and all the characters in a sentence run together to express a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the characters "学" and "生" together form one word. Splitting a sequence of Chinese characters into meaningful words is Chinese word segmentation, which some people also call word cutting. For "我是一个学生", the segmentation result is: 我 / 是 / 一个 / 学生.
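One classic dictionary-based approach to Chinese word segmentation is forward maximum matching: at each position, take the longest dictionary word that starts there. The sketch below applies it to the sentence 我是一个学生 ("I am a student"). The tiny dictionary and the `segment` function are assumptions for illustration; the article does not say which algorithm Taobao's search actually uses.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: greedily take the longest dictionary word."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                pos += length
                break
    return words


# Hypothetical miniature dictionary.
dictionary = {"我", "是", "一个", "学生"}
print(segment("我是一个学生", dictionary))  # ['我', '是', '一个', '学生']
```

Greedy matching can mis-split ambiguous sentences, which is one reason production segmenters rely on much larger dictionaries and statistical models.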
After word segmentation, the system also needs to analyze your shopping intent from the search terms you entered. Users typically have the following kinds of intent when searching:
(1) Browsing: no clear shopping target or intent; the user shops while looking around and is more casual and impulsive. Example queries: "Top 10 perfumes of 2010", "Popular sweaters in 2010", "How many types of zippo are there?"
(2) Query: there is some shopping intent, expressed as requirements on attributes. Example queries: "Mobile phone suitable for the elderly", "500 yuan watch".
(3) Comparison: the shopping intent has narrowed to a few specific products. Example queries: "Nokia E71 E63", "akg k450 px200".
(4) Confirmation: the decision is basically made, and the user is examining one particular item. Example queries: "Nokia N97", "IBM T60".
Depending on the shopping intent it infers, the main search shows completely different results.
After these steps, the main search system lists results based on the conditions above and other, more complex ones, all of which is done by more than a thousand search servers. You then start clicking through the results one by one and viewing product detail pages. Frequent online shoppers will have noticed that after you buy a product, even if the merchant has since modified the detail page many times, you can still view a snapshot of the item as it was at the time of purchase. This prevents merchants from going back on what they promised on the detail page. Obviously, saving the detail snapshots of tens of billions of transactions every year, and retrieving them quickly, is no simple matter. It also involves several systems working together, the most important of which is Tair, a distributed KV storage solution developed by Taobao.
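The snapshot idea maps naturally onto a key-value store: the order id is the key and a frozen copy of the detail page is the value. The sketch below is only an illustration; the `SnapshotStore` class, the in-memory dict standing in for a distributed KV store, and the sample listing are hypothetical and say nothing about how Tair works internally.

```python
import copy


class SnapshotStore:
    """Keep an immutable copy of the product detail as it was at purchase time."""

    def __init__(self):
        self._snapshots = {}  # stands in for a distributed KV store

    def save(self, order_id, product_detail):
        # Deep copy so later edits by the seller cannot change the snapshot.
        self._snapshots[order_id] = copy.deepcopy(product_detail)

    def load(self, order_id):
        return self._snapshots[order_id]


# Hypothetical listing and order.
listing = {"title": "wool sweater", "promise": "free returns within 7 days"}
store = SnapshotStore()
store.save("order-1001", listing)

listing["promise"] = "no returns"  # the seller edits the page afterwards
print(store.load("order-1001")["promise"])  # free returns within 7 days
```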
Whether or not you actually complete a transaction, your visits are faithfully recorded by the system for later business logic and data analysis. Access logs are among the most important of these records. As we saw earlier, however, these visits are spread across many different servers in different regions, and with so many users the logs are enormous; reaching the TB level is perfectly normal. To transmit and synchronize this log data quickly and promptly, Taobao developed TimeTunnel, which transfers data in real time and hands it to back-end systems that compute reports and perform other operations.
Your browsing data, transaction data and many other data records will be retained.
The historical data stored by Taobao easily reaches ten or more PB (1PB=1024TB=1048576GB). This huge volume of data is compressed at an extreme ratio of 1:120 and stored in Taobao's data warehouse, where it is continuously analyzed and mined by a very large-scale data system called Yunlai, made up of more than 2,000 servers.
From this data, Taobao can learn who you are, what you like, how old your child is, whether you are in a relationship, what drinks World of Warcraft players prefer, and so on, as well as a vast amount of information such as retail conditions in each industry and the rise and fall of various commodities.
Having said all this, I have in fact described only a few of the thousands of systems running at Taobao. Even a single visit to Taobao's home page involves technology and systems on a scale that is hard to imagine. They are the product of the hard work of more than 2,000 top Taobao engineers, among them Yangtze River Scholars, winners of the National Supreme Science and Technology Award, and many other great names. Likewise, the business systems of Baidu, Tencent, and others are by no means simpler than Taobao's. What you should know is that the Internet products you use every day may look simple and easy to use, but behind them lies an unimaginable amount of ingenuity and labor.