The large-scale website architectures discussed here are limited to sites with heavy interaction and heavy data exchange. For reasons everyone understands, we will not cover news portals and other architectures that can be handled simply by generating static HTML. Instead we take as examples web 2.0 sites with heavy data exchange and high data mobility, such as Kaixin.com and similar domestic services. We are also not discussing PHP, JSP, or .NET environments here: we look at the problem from the architectural level, and the implementation language is not the issue. A language's strength lies in how it is used, not in some inherent superiority; whichever language you choose, the architectural questions below must be faced.
Below are the issues that a large website must pay attention to and plan for.
1. Processing of massive data
As we all know, for relatively small sites the data volume is modest: plain SELECT and UPDATE statements solve most of the problems we face, the load itself is light, and adding a few indexes usually suffices. For a large website, however, the data volume may grow by millions of rows per day. A poorly designed many-to-many relationship causes no trouble early on, but as the user base grows the data volume increases geometrically, and at that point the cost of a SELECT or UPDATE on a single huge table (let alone a join across several tables) becomes very high.
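One common way to keep a huge many-to-many table queryable, in the spirit described above, is to split it horizontally into shards so no single table grows without bound. The sketch below is illustrative only: the table name `friendship` and the shard count are assumptions, not something from the original text.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count; real systems tune this


def shard_for(user_id: int) -> str:
    """Route one user's rows to one of several physical tables.

    Splitting a hot many-to-many table (e.g. friendships) horizontally
    keeps each individual table small enough to SELECT/UPDATE cheaply.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return f"friendship_{int(digest, 16) % NUM_SHARDS}"
```

Because the shard is derived from the user id, all rows for one user land in the same shard, so single-user queries never need a cross-shard join.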
2. Data concurrency processing
In some cases caching is the 2.0 CTO's trump card, but caching itself becomes a serious problem under high concurrency and heavy processing. The cache is shared globally across the application, and when two or more requests try to update the same cache entry at the same time, the application can fail outright. What is needed at that point is a sound data-concurrency strategy and a sound caching strategy.
Beyond that there is the problem of database deadlock. We may never notice it under normal load, but under high concurrency the probability of deadlock becomes very high, and disk caching becomes a big problem as well.
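The classic way to make deadlock impossible, sketched here with in-process locks standing in for database row locks, is to always acquire locks in one fixed global order. This is an illustration of the principle, not a claim about any particular database; the `transfer` scenario is hypothetical.

```python
import threading

# One lock per "row"; a real database takes these locks for you.
account_locks = {1: threading.Lock(), 2: threading.Lock()}


def transfer(balances, src, dst, amount):
    """Acquire row locks in a fixed global order (ascending id).

    Two concurrent transfers A->B and B->A then request locks in the
    same order, so neither can hold one lock while waiting on the
    other: the circular wait that defines deadlock cannot form.
    """
    first, second = sorted((src, dst))
    with account_locks[first]:
        with account_locks[second]:
            balances[src] -= amount
            balances[dst] += amount
```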
3. File storage issues
For 2.0 sites that support file upload, even as we celebrate ever-larger hard disks, we should think harder about how files are stored and effectively indexed. A common solution is to store files by date and type, but once the file volume is massive — say a single disk holding 500 GB of small scattered files — disk IO becomes a huge problem during maintenance and normal use. Even if your bandwidth is sufficient, the disk may not keep up, and if uploads hit the same disk at the same time, it is easily overwhelmed.
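A variant of the by-date-and-type scheme that also spreads IO is to derive the directory from the file's content hash, so half a terabyte of small files fans out over thousands of directories. The storage root below is an assumed mount point, purely for illustration.

```python
import hashlib
import os

STORAGE_ROOT = "/data/uploads"  # hypothetical mount point


def storage_path(file_bytes: bytes, ext: str) -> str:
    """Derive a two-level directory path from the file's content hash.

    Fanning files out over 256*256 directories keeps every single
    directory listing small, which matters for disk IO during
    maintenance (backup, rsync, fsck) as well as normal serving.
    Content addressing also deduplicates identical uploads for free.
    """
    digest = hashlib.sha1(file_bytes).hexdigest()
    return os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], digest + ext)
```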
Perhaps RAID and dedicated storage servers can solve the immediate problem, but another remains: access from different regions. Our server may be in Beijing while the user is in Yunnan or Xinjiang — how do we solve access speed? And if storage is distributed, how do we plan our file index and architecture?
So we have to admit that file storage is a genuinely hard problem.
4. Processing of data relationships
We can easily design a database that satisfies third normal form, full of many-to-many relationships, and even use GUIDs in place of IDENTITY columns. But in the 2.0 era, where many-to-many relationships abound, third normal form is the first thing that should be abandoned: multi-table joins must be reduced to an absolute minimum.
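What abandoning strict normalization looks like in practice is keeping a redundant, denormalized copy of the data on the hot read path. The sketch below is illustrative (the `posts`/`post_tags` names are assumptions): the normalized rows are kept as the source of truth, but reads never join against them.

```python
# Denormalization sketch: alongside the normalized many-to-many rows
# (post_tags), keep a redundant tag_count and tag list on each post
# so the hot read path needs one lookup and no join.

posts = {1: {"title": "hello", "tag_count": 0, "tag_cache": []}}
post_tags = []  # normalized (post_id, tag) rows, kept for rebuilds


def add_tag(post_id, tag):
    """Write path: update the normalized row AND the redundant copies.

    The write gets slightly more expensive; in exchange, every read
    avoids a join across a table with millions of rows.
    """
    post_tags.append((post_id, tag))
    post = posts[post_id]
    post["tag_count"] += 1
    post["tag_cache"].append(tag)


def tags_for(post_id):
    """Hot read path: one dictionary lookup, no join."""
    return posts[post_id]["tag_cache"]
```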
5. Problems with data index
As we all know, an index is the cheapest and easiest way to improve database query efficiency. But under heavy UPDATE load, the cost of maintaining indexes during updates and deletes rises unimaginably high. The author once encountered an update involving a clustered index that took ten minutes to complete, which for a live site is simply unbearable.
Indexing and updating are natural enemies. Problems 1, 4, and 5 are the ones we cannot avoid when designing the architecture, and they may well be the ones that consume the most time.
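The tension can be seen concretely with SQLite (chosen only because it ships with Python; the point applies to any RDBMS): each secondary index speeds one class of query, but every INSERT/UPDATE must then maintain every index's B-tree, which is exactly the write cost described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author INT, ts INT)")
# Two secondary indexes: each one makes reads cheaper and writes dearer.
conn.execute("CREATE INDEX idx_author ON posts(author)")
conn.execute("CREATE INDEX idx_ts ON posts(ts)")

# Every one of these 1000 inserts also has to maintain both B-trees.
conn.executemany(
    "INSERT INTO posts (author, ts) VALUES (?, ?)",
    [(i % 10, i) for i in range(1000)],
)

# The planner should use idx_author for this equality lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE author = 3"
).fetchone()
```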
6. Distributed processing
Because 2.0 websites are highly interactive, a CDN achieves essentially nothing for them: content is updated in real time and handled dynamically. To guarantee access speed from every region, we therefore face a huge problem: how to achieve effective data synchronization and updating. Real-time communication between servers in different regions is an issue that must be considered.
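One simple model for the cross-region synchronization mentioned above is asynchronous replication with last-write-wins convergence. This is a toy sketch under stated assumptions: versions are a plain counter here, where a real system would need logical clocks or a consensus protocol.

```python
class Replica:
    """One region's copy of the data, converging via last-write-wins."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Keep the entry only if it is newer than what we hold."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def sync_from(self, other):
        """Pull every entry from another replica; newest version wins.

        Running sync in both directions makes the replicas converge
        to the same state regardless of where each write originated.
        """
        for key, (version, value) in other.data.items():
            self.write(key, value, version)
```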
7. Analysis of pros and cons of Ajax
Success comes from AJAX, and so does failure. AJAX has become the mainstream, and suddenly GET and POST via XMLHttpRequest seem so easy: the client sends or requests data, the server receives the request and responds. That is a normal AJAX round trip. But with a packet-capture tool, the request and response of any AJAX call are laid bare at a glance. For computationally expensive AJAX requests, an attacker can build a packet-replay machine and easily take down a web server.
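One server-side defense against such replay flooding is per-client rate limiting, sketched here as a token bucket. The class and parameters are illustrative, not a prescription: legitimate AJAX polling fits within the refill budget, while a packet-sending machine exhausts it and gets rejected cheaply.

```python
import time


class TokenBucket:
    """Per-client token bucket rate limiter.

    Each request consumes one token; tokens refill at `rate` per
    second up to `burst`. A flood of replayed AJAX requests drains
    the bucket and is then refused before any expensive work runs.
    """

    def __init__(self, rate: float, burst: int):
        self.rate = rate           # tokens refilled per second
        self.burst = burst         # maximum bucket size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In practice the server keeps one bucket per client key (IP, session, or user id) and checks `allow()` before dispatching the request.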
8. Analysis of data security
Under the HTTP protocol, data packets travel in clear text. You may say we can encrypt, but application-layer encryption can itself be analyzed and reproduced (QQ is the familiar example: its scheme can be judged from captured traffic, and an identical encrypt/decrypt routine written against it). While your site's traffic is small nobody will bother you, but as traffic grows, so-called plug-ins and mass-messaging tools follow one after another (the early mass-message spam on QQ showed the pattern). Perhaps we can reassure ourselves that stronger server-side checks or even HTTPS will solve it, but note that these measures carry massive database, IO, and CPU costs, and against some mass-send attacks they are essentially useless. The author has managed to mass-message Baidu Space and QQ Space; if you care to try, it is actually not difficult.
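A middle ground between clear-text HTTP and the CPU cost of full HTTPS is to sign each request with a shared secret, so a forged or tampered mass-send request fails verification even though the traffic is visible. This is a sketch of the general HMAC technique, not of any scheme from the text; the secret and parameter names are illustrative.

```python
import hashlib
import hmac

SECRET = b"per-user secret, never sent over the wire"  # illustrative


def sign(params: dict) -> str:
    """Sign the request parameters with HMAC-SHA256.

    A mass-sender who captured the clear-text traffic can replay
    identical packets, but cannot forge *new* requests (different
    recipient, different message) without the secret. Verifying an
    HMAC is far cheaper than terminating TLS for every request.
    """
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hmac.new(SECRET, canonical.encode(), hashlib.sha256).hexdigest()


def verify(params: dict, signature: str) -> bool:
    # compare_digest avoids leaking information via timing.
    return hmac.compare_digest(sign(params), signature)
```

Pairing the signature with a timestamp or nonce parameter also blocks straight replays.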
9. Data synchronization and cluster processing issues
When one database server can no longer bear the load, we need database-level load balancing and clustering, and this may be the most vexing problem of all. Data travels over the network, and depending on the database design, replication delay is a terrible and unavoidable problem. In that situation we need other means to guarantee effective interaction within that delay of a few seconds or longer: data hashing, partitioning, content-aware processing, and so on.
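One concrete way to keep the application coherent within that replication delay is "read-your-writes" routing: a user who just wrote is pinned back to the master for a short window, while everyone else reads from a (possibly stale) replica. The lag constant and server names below are assumptions for illustration.

```python
import time

REPLICATION_LAG = 3.0  # assumed worst-case replica delay, in seconds

_last_write = {}  # user_id -> monotonic time of that user's last write


def record_write(user_id):
    """Call this whenever a user writes through the master."""
    _last_write[user_id] = time.monotonic()


def choose_server(user_id):
    """Route a recent writer back to the master.

    Within the lag window the replicas may not yet have this user's
    write, so serving them from a replica would make their own update
    seem to vanish. Everyone else can safely hit a replica.
    """
    last = _last_write.get(user_id)
    if last is not None and time.monotonic() - last < REPLICATION_LAG:
        return "master"
    return "replica"
```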
10. Data sharing channels and OPENAPI trends
OpenAPI has become an inevitable trend. From Google, Facebook, and MySpace to domestic campus networks, everyone is considering this question. An open API retains users more effectively, stimulates more of their interest, and enlists more outside developers to do the most effective development for you. An effective data-sharing platform and open-data platform thus become indispensable, and guaranteeing data security and performance behind open interfaces is yet another issue we must take seriously.