There have been articles before on the architecture evolution of large websites, such as LiveJournal and eBay, which are well worth reading for reference, but I feel they dwell more on the result of each evolution than on why that evolution was needed. In addition, quite a few people have recently told me they find it hard to understand why a website needs such complex technology, so I had the idea of writing this article. In it I will walk through a fairly typical architecture evolution process and the body of knowledge that has to be mastered as a website grows into a large one. I hope it can give some initial concepts to students who want to work in the Internet industry :). Please point out any errors in the article and give me more suggestions, so that it can really serve as a starting point.
The first step of architecture evolution: physically separating the webserver and the database
At the beginning, you build a website out of some idea; at this stage you may even be running on a rented host. Since this article only focuses on the evolution of the architecture, let us assume you already have a dedicated machine with a certain amount of bandwidth. Because the website has some distinctive features, it attracts a number of visitors, and gradually you find that the pressure on the system keeps rising and the response time keeps getting slower. At this point it becomes fairly obvious that the database and the application are affecting each other: when the application has a problem, the database easily runs into trouble as well, and when the database has a problem, the application is prone to trouble too. So you enter the first stage of evolution: physically separating the application and the database onto two machines. There is nothing technically new required here, but you find that it really works: the system regains its former response speed, supports higher traffic, and the database and the application no longer affect each other.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
This step of the evolution places essentially no demands on your technical knowledge.
The second step of architecture evolution: add page caching
The good times do not last long. As more and more people visit, you find the response time starting to slow down again. Looking for the reason, you find that there are too many operations hitting the database, leading to fierce competition for database connections, and so responses slow down. You cannot simply open more database connections, however, or the pressure on the database machine becomes very high, so you consider a caching mechanism to reduce the competition for database connections and the read pressure on the database. At this point you may first choose Squid or a similar mechanism to cache the relatively static pages in the system (for example, pages that are only updated every day or two); you could also go with a solution that generates static pages. Either way the program needs little or no modification, yet the pressure on the webserver and the contention for database connections drop significantly. OK, so you start using Squid to cache the relatively static pages.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Front-end page caching technology, such as Squid. To use it well you need a deep understanding of Squid's implementation and its cache invalidation algorithms.
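To make this concrete, here is a minimal sketch of the application side of such a setup: pages that change rarely are marked with cache headers so that a reverse proxy like Squid can serve them without hitting the application or the database. The WSGI handler, the URL prefixes and the one-day lifetime are assumptions for illustration only; in practice you would also tune Squid's own configuration.

from wsgiref.simple_server import make_server

# Hypothetical WSGI application: pages under a few "relatively static"
# prefixes are marked cacheable so a reverse proxy such as Squid can
# serve them from its cache instead of calling the application.
CACHEABLE_PREFIXES = ("/help", "/about", "/articles")  # assumed to change every day or two

def render_page(path):
    # Placeholder for the real page rendering (which may query the database).
    return "<html><body>page for %s</body></html>" % path

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    body = render_page(path).encode("utf-8")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(CACHEABLE_PREFIXES):
        # Allow the proxy to cache this response for one day.
        headers.append(("Cache-Control", "public, max-age=86400"))
    else:
        headers.append(("Cache-Control", "no-cache"))
    start_response("200 OK", headers)
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()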
The third step of architecture evolution: add page fragment caching
After adding Squid for caching, the overall speed of the system has indeed improved and the pressure on the webserver starts to come down. But as the number of visits keeps increasing, you find that the system has become a bit slow again. Having tasted the benefits of caching with Squid, you start to wonder whether the relatively static parts of the currently dynamic pages can be cached as well, so you consider a page fragment caching strategy such as ESI. OK, so you start using ESI to cache the relatively static fragments of dynamic pages.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Page fragment caching technology, such as ESI. To use it well you also need to master how ESI is implemented;
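As an illustration of the fragment idea, the sketch below shows a page that is dynamic as a whole (it greets the logged-in user) but pulls its rarely changing part in through an ESI include tag, so an ESI-capable proxy can cache that fragment on its own lifetime. The URLs and the choice of proxy (for example Squid with ESI support, or Varnish) are assumptions; the exact tag handling depends on the ESI processor you use.

# Illustrative sketch of ESI-style page fragment caching.
def render_item_page(item_id, user_name):
    # The outer page is personalised, so it cannot be cached as a whole...
    return """<html><body>
      <p>Hello, {user}!</p>
      <!-- ...but the item description changes rarely, so it is referenced as
           a fragment that the ESI-capable proxy fetches and caches separately. -->
      <esi:include src="/fragments/item/{item}" />
    </body></html>""".format(user=user_name, item=item_id)

def render_item_fragment(item_id):
    # Served at a URL like /fragments/item/<id> with a long Cache-Control
    # lifetime, so the proxy only asks the application to regenerate it rarely.
    return "<div class='item'>description of item %s</div>" % item_id

print(render_item_page(42, "alice"))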
The fourth step of architecture evolution: data caching
After using technologies such as ESI to improve the system's caching once more, the pressure on the system does drop further, but as visits keep increasing the system still starts to slow down. Investigating, you may find that there are places in the system that repeatedly fetch the same data, such as user information. So you start to consider whether this data can be cached too, and you cache it in local memory. After the change is done, the result fully meets expectations: the system's response speed is restored and the pressure on the database drops considerably again.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Caching technology, including Map data structures, cache eviction algorithms, and the implementation mechanism of whichever caching framework you choose.
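As a sketch of what "cache the data in local memory" can look like, here is a tiny in-process cache: a Map plus a least-recently-used eviction policy, placed in front of a user lookup. The capacity, the user loader and the data shape are made up for illustration; a real system would normally use a mature caching framework rather than hand-rolled code.

from collections import OrderedDict

class LRUCache:
    # A Map plus an LRU eviction policy: the two ingredients of a simple
    # in-process data cache.
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

user_cache = LRUCache(capacity=50000)

def load_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "user-%s" % user_id}

def get_user(user_id):
    user = user_cache.get(user_id)
    if user is None:                         # cache miss: hit the database once
        user = load_user_from_db(user_id)
        user_cache.put(user_id, user)
    return user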
The fifth step of architecture evolution: add a webserver
The good times did not last long. You find that as the number of visits keeps increasing, the pressure on the single webserver machine climbs quite high at peak times. At this point you start to consider adding another webserver, which also addresses availability: it avoids the whole site becoming unusable if the single webserver goes down. After weighing these considerations, you decide to add a webserver. In doing so you will run into some problems; the typical ones are:
1. How to distribute requests across the two machines. The solutions usually considered at this point are the load balancing that ships with Apache, or a software load balancing solution such as LVS;
2. How to keep state information in sync, such as user sessions. The solutions considered at this point include writing sessions to the database, writing them to shared storage, using cookies, or synchronizing session information between machines (a sketch of the shared-storage approach follows at the end of this step);
3. How to keep cached data in sync, such as the previously cached user data. The mechanisms usually considered here are cache synchronization or a distributed cache;
4. How to keep features such as file upload working normally. The mechanism usually considered here is a shared file system or shared storage;
After solving these problems, the webserver count is finally increased to two, and the system regains its former speed.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Load balancing technology (including but not limited to hardware load balancing, software load balancing, load balancing algorithms, Linux forwarding protocols, and the implementation details of whichever technology is chosen), active/standby techniques (including but not limited to ARP spoofing, Linux heartbeat, etc.), state and cache synchronization technology (including but not limited to cookies, UDP, state broadcasting, and the implementation details of the chosen cache synchronization technology), shared file technology (including but not limited to NFS, etc.), and storage technology (including but not limited to storage devices, etc.).
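To illustrate one of the session options above, namely writing session state to shared storage so that either webserver can handle any request, here is a minimal sketch. The SQLite file only stands in for whatever shared database or storage is actually used, and the table layout is invented for the example.

import json
import sqlite3
import uuid

# Shared session store: both webservers read and write the same storage,
# so a user's session is visible no matter which machine serves the request.
conn = sqlite3.connect("shared_sessions.db")
conn.execute("CREATE TABLE IF NOT EXISTS sessions (sid TEXT PRIMARY KEY, data TEXT)")

def create_session(data):
    sid = uuid.uuid4().hex                   # handed back to the browser as a cookie
    conn.execute("INSERT INTO sessions (sid, data) VALUES (?, ?)",
                 (sid, json.dumps(data)))
    conn.commit()
    return sid

def load_session(sid):
    row = conn.execute("SELECT data FROM sessions WHERE sid = ?", (sid,)).fetchone()
    return json.loads(row[0]) if row else None

sid = create_session({"user_id": 42})
print(load_session(sid))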
The sixth step of architecture evolution: splitting the database
After enjoying the happiness of rapidly growing traffic for a while, you find that the system is starting to slow down again. What is it this time? Investigating, you find that competition for database connections is very fierce in some of the write and update operations, which slows the whole system down. What now? The options at this point include database clustering and splitting the database by business. Some databases do not support clustering very well, so splitting the database becomes the more common strategy. Splitting the database means the existing program has to be modified, and after one round of modification the split is in place. Yes, the goal is achieved, and the system is back to being even faster than before.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
What this step requires is a sensible division along business lines to carry out the database split; it places no other demands on specific technical details;
At the same time, as the data volume grows and the database is split, database design, tuning and maintenance have to be done better and better, so the requirements on these skills remain high.
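As a small sketch of what the business-level split can look like in code, the router below picks a database connection by business area; the application then has to know which area each query belongs to. The business names, SQLite files and table layout are purely illustrative stand-ins for the real databases.

import sqlite3

# One database per business area; this mapping stands in for real
# connection pools pointing at separate database servers.
DATABASES = {
    "user":  sqlite3.connect("user_db.sqlite"),
    "item":  sqlite3.connect("item_db.sqlite"),
    "order": sqlite3.connect("order_db.sqlite"),
}
DATABASES["order"].execute(
    "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)")

def db_for(business):
    # The application code now has to say which business area it is touching.
    return DATABASES[business]

def save_order(order_id, user_id, amount):
    db = db_for("order")
    db.execute("INSERT OR REPLACE INTO orders (id, user_id, amount) VALUES (?, ?, ?)",
               (order_id, user_id, amount))
    db.commit()

save_order(1, 42, 99.5)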
The seventh step of architecture evolution: table sharding, DAL and distributed cache
As the system keeps running, the data volume starts to grow substantially, and you find that some queries are still slow even after the database split, so you start splitting tables along the same lines. This of course requires some modifications to the program, and you may find that the application itself now has to care about the database- and table-sharding rules, which is still somewhat complex. So you wonder whether to add a general-purpose framework that handles data access across the split databases and tables; this corresponds to the DAL in eBay's architecture. This evolution takes a relatively long time, and the general framework may well only be started once the table split is finished. At the same stage, you may discover that the earlier cache synchronization scheme has become a problem: the amount of data is now too large to keep the cache in local memory and synchronize it between machines, so a distributed cache is needed. So, after another round of investigation and painful tinkering, the bulk of the cached data is finally moved onto a distributed cache.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Table sharding, like the database split, is mostly a matter of dividing along business lines; the technologies involved include dynamic hashing algorithms, consistent hashing (a small sketch follows below), etc.;
DAL involves more complex technologies, such as database connection management (timeouts, exceptions), database operation control (timeouts, exceptions), and encapsulation of the database- and table-sharding rules, etc.;
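Since consistent hashing is named above as one of the relevant techniques (it keeps most keys on the same cache node when nodes are added or removed), here is a minimal, self-contained sketch of a hash ring. The node names and the number of virtual replicas are arbitrary; real distributed caches ship their own, far more careful implementations.

import bisect
import hashlib

class ConsistentHashRing:
    # Each node is placed on the ring many times ("virtual nodes") so that
    # keys spread evenly and only a small share of keys move when the node
    # set changes.
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            self._ring.append((self._hash("%s#%d" % (node, i)), node))
        self._ring.sort()

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):           # wrap around the ring
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1:11211", "cache-2:11211", "cache-3:11211"])
print(ring.node_for("user:42"))              # which cache node holds this key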
The eighth step of architecture evolution: add more webservers
After the database and table splits are done, the pressure on the database has dropped to a fairly low level, and you go back to happily watching the visit counts climb every day. Then suddenly one day you find that access to the system is starting to slow down again. You check the database first and find its pressure is normal; then you check the webservers and find that Apache is blocking a large number of requests, while the application server handles each individual request quickly. It seems the number of requests is simply too high, so requests have to queue up and responses slow down. This is easy to handle: generally you have some money by now, so you add more webservers. In the process of adding webservers, a few challenges may arise:
1. Apache's software load balancing, or LVS, can no longer handle the huge volume of web access (number of connections, network traffic, etc.). If funds allow, the solution is to buy hardware load balancers such as F5, NetScaler or Alteon; if not, the solution is to classify the applications logically and distribute them across different software load balancing clusters (a sketch of this follows at the end of this step);
2. Some of the earlier solutions for state synchronization, file sharing and so on may hit bottlenecks and need to be improved; perhaps at this point a distributed file system that fits the website's business will be written in house;
After all this work, you enter an era of seemingly perfect, unlimited scalability: whenever website traffic increases, the solution is simply to keep adding webservers.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
At this point, as the number of machines and the volume of data keep growing and the availability requirements on the system keep rising, you need a deeper understanding of the technologies in use and need to build more customized things based on the website's needs.
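To illustrate the "classify the applications logically, then spread them over different software load balancing clusters" idea mentioned above, here is a toy routing sketch: requests are grouped by URL prefix and each group is sent round-robin to its own backend cluster. The prefixes and addresses are invented; in practice this mapping lives in the load balancer configuration rather than in application code.

# Hypothetical mapping from application area (by URL prefix) to its own
# soft-load cluster of webservers.
CLUSTERS = {
    "/search": ["10.0.1.10", "10.0.1.11"],                 # search cluster
    "/trade":  ["10.0.2.10", "10.0.2.11"],                 # trading cluster
    "/":       ["10.0.3.10", "10.0.3.11", "10.0.3.12"],    # default web cluster
}

_rr_state = {}   # per-cluster round-robin position

def pick_backend(path):
    # Longest matching prefix decides the cluster; simple round robin inside it.
    prefix = max((p for p in CLUSTERS if path.startswith(p)), key=len)
    backends = CLUSTERS[prefix]
    i = _rr_state.get(prefix, 0)
    _rr_state[prefix] = (i + 1) % len(backends)
    return backends[i]

print(pick_backend("/search?q=shoes"))
print(pick_backend("/trade/order/1"))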
The ninth step of architecture evolution: splitting reads from writes and cheap storage solutions
Suddenly one day you find that this perfect era is coming to an end: the nightmare of the database appears again. Because so many webservers have been added, there are still not enough database connection resources, even though the database and tables have already been split. Analyzing the database pressure, you may find that the read/write ratio is very high. At this point the usual idea is to split reads from writes; of course, this is not easy to implement. You may also find that some data does not really need to live in the database, or takes up too many database resources. So the architecture evolution that may take shape at this stage is to split reads from writes and, at the same time, build some cheaper storage solutions, for example something along the lines of BigTable.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
Splitting reads from writes requires deep mastery and understanding of database replication, standby and similar strategies, and may also require the ability to implement parts of it yourself (a routing sketch follows below);
A cheap storage solution requires deep mastery and understanding of OS-level file storage, and also a deep understanding of how the chosen language implements file handling.
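A bare-bones sketch of what read/write splitting looks like from the application side: writes go to the primary, reads are spread over the replicas, and the database's own replication (which, as noted above, you must understand deeply) keeps the replicas up to date. The SQLite connections here only stand in for real primary and replica servers.

import random
import sqlite3

# Stand-ins for a real primary database and its read replicas.
primary  = sqlite3.connect("primary.db")
replicas = [sqlite3.connect("replica1.db"), sqlite3.connect("replica2.db")]

def execute_write(sql, params=()):
    # All writes go to the primary; replication propagates them to replicas.
    cur = primary.execute(sql, params)
    primary.commit()
    return cur

def execute_read(sql, params=()):
    # Reads are spread across replicas. Note that replicas lag slightly, so a
    # read that must see the caller's own just-committed write may still need
    # to go to the primary.
    return random.choice(replicas).execute(sql, params)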
The tenth step of architecture evolution: entering the era of large-scale distributed applications and the dream era of cheap server farms
Having gone through the long and painful process above, you finally reach the perfect era again: continuously adding webservers lets you support higher and higher traffic. For a large website there is no doubt that popularity matters, and as popularity grows, the demand for all kinds of features also starts to explode. At this point you suddenly realize that the web application originally deployed on the webservers has become very large. With multiple teams making changes to it, working on it has become quite inconvenient, and reusability is quite poor: basically every team has done some amount of duplicated work, and deployment and maintenance are also troublesome, because copying the huge application package to N machines and starting it up takes a long time, and it is hard to investigate when something goes wrong. An even worse situation is that a bug in one part of the application can take down the entire site. There are other factors too, such as the difficulty of tuning (because the application deployed on each machine has to do everything, targeted tuning is impossible). Based on this analysis, you make up your mind to split the system by responsibility, and so a large-scale distributed application is born. This step usually takes quite a long time, because it brings many challenges:
1. Once the system is split into distributed applications, a high-performance, stable communication framework needs to be provided, supporting a variety of communication and remote calling styles (a minimal sketch of one such remote call appears at the end of this step);
2. Splitting a huge application takes a long time and requires organizing the business and controlling the dependencies between systems;
3. How to operate and maintain (dependency management, health management, error tracking, tuning, monitoring and alerting, etc.) this huge distributed application.
After this step, the system architecture enters a relatively stable stage, and you can start using large numbers of cheap machines to support the enormous traffic and data volume. Combining this architecture with the experience gained from all the previous evolutions, you can keep adopting other approaches to support ever-increasing traffic.
Look at the diagram of the system after this step is completed:
This step involves these knowledge systems:
This step involves a great deal of knowledge: it requires deep understanding and mastery of communication, remote calling, messaging mechanisms and so on, with a clear grasp of both the theory and the hardware-level, operating-system-level and language-level implementation details.
Operations also involves a lot of knowledge; in most cases you need to master distributed parallel computing, reporting, monitoring technology, rules and policies, and so on.
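Finally, to give a feel for what a remote call looks like once the monolith is split, here is a minimal client-side sketch that calls a hypothetical user service over HTTP and JSON. The service address, URL layout and timeout are all assumptions; a real communication framework would add connection pooling, retries, serialization choices, service discovery, tracing and monitoring on top of this.

import json
import urllib.request

USER_SERVICE = "http://user-service.internal:8080"     # assumed service address

def get_user(user_id, timeout=0.5):
    # The web application no longer reads user data through its own code;
    # it asks the (hypothetical) user service, which owns that responsibility.
    url = "%s/users/%s" % (USER_SERVICE, user_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except Exception as exc:
        # Error tracking and alerting around calls like this is a large part
        # of operating a distributed application.
        raise RuntimeError("user service call failed: %s" % exc)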
It really is not that hard when written down like this: the classic evolution of a whole website architecture is roughly as described above. Of course, the solution adopted at each step and the order of the steps may differ, and because every website's business is different there will be different specialized technical requirements. This post explains the evolution mainly from the architecture point of view; there are of course many technologies not covered here, such as database clustering, data mining and search, while in a real evolution you would also use approaches such as upgrading hardware, improving the network environment, tuning the operating system and using CDNs and mirrors to support greater traffic, so a real evolution will differ in many ways. And a large website has to do far more than the above: security, operations and maintenance, running the business, services, storage and more. Doing a large website well is really not easy. I wrote this article mostly in the hope of prompting more introductions to the evolution of large-scale website architecture :).