10-minute quick solution | Large-scale distributed e-commerce system architecture

This article is a technical summary of what I learned about large-scale distributed website architecture. It briefly describes the architecture of a high-performance, highly available, scalable, and extensible distributed website and offers an architectural reference. Part of the article is reading notes and part is a summary of personal experience; it should be a useful reference for large-scale distributed website architecture.

Large-scale distributed website architecture technology

1. Characteristics of large websites

Many users, widely distributed
Heavy traffic, high concurrency
Massive data, high service availability
Hostile security environment, vulnerable to network attacks
Many functions, fast changes, frequent releases
Grows from small to large, developed progressively
User-centered
Free service, paid experience

2. Large website architecture goals

High performance: provide a fast access experience.
High availability: the website service can always be accessed normally.
Scalability: increase or decrease processing capacity by adding or removing hardware.
Security: provide strategies for secure website access, data encryption, and secure storage.
Extensibility: functions and modules can be added or removed easily.
Agility: respond on demand and quickly.

3. Large website architecture patterns

Layering: generally divided into the application layer, service layer, data layer, management layer, and analysis layer.
Separation: generally divided by business, module, or functional characteristics; for example, the application layer is split into the homepage and the user center.
Distributed: applications are deployed separately (for example on multiple physical machines) and work together through remote calls.

Cluster: multiple copies of an application, module, or function are deployed (for example on multiple physical machines) and jointly serve external requests through load balancing.

Caching: place data as close as possible to the application or the user to speed up access.

Asynchronous: turn synchronous operations into asynchronous ones. The client sends a request without waiting for the server's response; after the server finishes processing, it informs the requester through notification or polling. This generally refers to the request-response-notification pattern.

Redundancy: add copies to improve availability, security, and performance.

Security: have effective solutions to known problems, and establish discovery and defense mechanisms for unknown or potential problems.

Automation: use tools to let machines complete repetitive tasks without human intervention.

Agility: actively accept changes in requirements and respond quickly to business development needs.

4. High-performance architecture

A high-performance architecture is user-centered and provides a fast web access experience. The main indicators are short response time, high concurrent processing capacity, high throughput, and stable performance.

It can be divided into front-end optimization, application layer optimization, code layer optimization, and storage layer optimization.

Front-end optimization: the part before the website's business logic.
Browser optimization: reduce the number of HTTP requests, use the browser cache, enable compression, place CSS and JS appropriately, load JS asynchronously, and reduce cookie transmission. CDN acceleration and reverse proxies also belong here.
Application layer optimization: the servers that handle the website's business logic. Use caching, asynchronous processing, and clustering.
Code optimization: sound architecture, multi-threading, resource reuse (object pools, thread pools, etc.), good data structures, JVM tuning, singletons, caching, etc.
Storage optimization: caching, solid-state drives, fiber-optic transmission, optimized reads and writes, disk redundancy, distributed storage (HDFS), NoSQL, etc.

5. High-availability architecture

Large websites should be accessible at all times and provide normal service. Because large websites are complex and distributed and run on cheap servers, open-source databases, and open-source operating systems, high availability is hard to guarantee; in other words, failures are inevitable.

How to improve availability is therefore an urgent problem. It must first be considered at the architectural level, during planning. The industry usually expresses availability as a number of nines; for example, four nines (99.99%) allows about 53 minutes of unavailability per year.
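The "number of nines" figure can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes/year")
```

For 99.99% this gives roughly 52.6 minutes, which the article rounds to 53.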

Different strategies are used at different levels. Redundant backup and failover are generally used to solve high availability problems.

Application layer: generally designed to be stateless, so that it does not matter which server handles a given request. High availability is usually achieved with load balancing (which requires solving the Session synchronization problem).
Service layer: load balancing, tiered management, fail-fast behavior (timeout settings), asynchronous calls, service degradation, idempotent design, etc.
Data layer: redundant backup (cold standby, hot standby [synchronous or asynchronous], warm standby) and failover (confirmation, transfer, recovery). The well-known theoretical basis for data high availability is the CAP theorem (consistency, availability, partition tolerance); consistency itself ranges from strong consistency through user consistency to eventual consistency.
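Two of the service-layer techniques listed above, fail-fast timeouts and service degradation, can be sketched together. This is a minimal illustration rather than a production pattern: `call_with_fallback` and the pool size are hypothetical, and the standard-library thread pool stands in for whatever RPC client a real service would use.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool; the size of 4 is an arbitrary assumption for the sketch.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(func, timeout_seconds: float, fallback):
    """Run func; if it errors or exceeds the timeout, return the degraded fallback."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return fallback
    except Exception:
        return fallback
```

A caller might wrap a slow recommendation service this way and fall back to a cached or empty result, so that a non-core dependency cannot stall a core page.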

6. Scalable architecture

Scalability means increasing or reducing the system's processing capacity by adding or removing hardware (servers) without changing the original architecture design.

Application layer: split the application vertically or horizontally, then load-balance each individual function (DNS, HTTP [reverse proxy], IP, or link-layer load balancing).
Service layer: similar to the application layer.
Data layer: database sharding, table sharding, NoSQL, etc.; commonly used algorithms are hashing and consistent hashing.
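Consistent hashing, mentioned above for the data layer, can be illustrated with a small hash ring. This is a sketch under assumptions: the class name, the node names, and the choice of 100 virtual nodes per server are all hypothetical, and SHA-256 is used simply as a convenient stable hash.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to nodes; adding or removing a node only remaps nearby keys."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._ring = []           # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's hash, wrapping around.
        index = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

With plain modulo hashing, adding a fourth server would move most keys; with the ring, only the keys that now fall to the new server move, which is why it suits caches and shards that grow over time.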

7. Extensible architecture

Functional modules can be added or removed easily, with good extensibility at the code and module level.

Modularization and componentization: high cohesion and low coupling improve reusability and extensibility.
Stable interfaces: define stable interfaces; as long as the interface stays unchanged, the internal structure can change at will.
Design patterns: apply object-oriented ideas and principles, and use design patterns at the code level.
Message queues: modules interact through message queues, which decouples the dependencies between them.
Distributed services: common modules are turned into services for other systems to use, improving reusability and extensibility.

8. Security architecture

Have effective solutions to known problems, and establish discovery and defense mechanisms for unknown or potential problems. Security starts with awareness: establish effective security mechanisms and back them at the policy and organizational level. For example, server passwords must not be leaked, passwords are changed monthly and may not repeat any of the previous three, and security scans run weekly. Strengthen the security system through such institutional rules. At the same time, every security-related aspect needs attention, including infrastructure security, application system security, and data confidentiality and security.

Infrastructure security: hardware procurement, operating system, and network environment security. Generally, buy quality products through official channels, choose a secure operating system, patch vulnerabilities promptly, and install anti-virus software and firewalls to guard against viruses and backdoors. Set firewall policies, build DDoS defenses, use attack detection systems, and isolate subnets.
Application system security: during development, use correct methods to solve known common problems at the code level. Prevent cross-site scripting (XSS), injection attacks, and cross-site request forgery (CSRF), and guard against leaks through error messages and HTML comments, unsafe file uploads, path traversal, etc. A web application firewall (such as ModSecurity) and security vulnerability scanning can further strengthen application-level security.
Data confidentiality and security: storage security (data is stored on reliable equipment, with real-time and scheduled backups), safekeeping (important information is stored encrypted, and appropriate personnel are chosen to handle and audit it), and transmission security (preventing data theft and tampering).

Commonly used algorithms include one-way hashes (MD5, SHA), symmetric encryption (DES, 3DES, RC), and asymmetric encryption (RSA).
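As one concrete instance of one-way hashing for stored secrets, the sketch below salts and hashes a password with PBKDF2 from Python's standard library. It deliberately substitutes SHA-256 for the MD5 named in the list; the function names and the iteration count are assumptions for illustration, not recommendations from the article.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using salted PBKDF2-HMAC-SHA256."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

Only the salt and digest are stored; the original password cannot be read back from them, which is the point of a one-way hash.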

9. Agility

The architectural design and the operation and maintenance management of the website must adapt to change and provide high scalability and extensibility, so that rapid business growth, sudden traffic spikes, and similar demands can be handled easily.

In addition to the architectural elements introduced above, the ideas of agile management and agile development also need to be introduced: unify business, product, technology, and operations so that they adapt to requirements and respond quickly.

10. Example of large-scale architecture

The example uses a seven-layer logical architecture: the first layer is the customer layer, the second the front-end optimization layer, the third the application layer, the fourth the service layer, the fifth the data storage layer, the sixth the big data storage layer, and the seventh the big data processing layer.

Customer layer: supports PC browsers and mobile apps. The difference is that a mobile app can reach the reverse proxy server directly by IP address.
Front-end layer: uses DNS load balancing, CDN local acceleration, and reverse proxy services.
Application layer: the website application cluster, split vertically by business, for example into the product application, the member center, etc.
Service layer: provides common services, such as user services, order services, payment services, etc.
Data layer: supports a relational database cluster (with read-write separation), a NoSQL cluster, a distributed file system cluster, and a distributed cache.
Big data storage layer: collects log data from the application and service layers, and structured and semi-structured data from the relational and NoSQL databases.
Big data processing layer: offline analysis with MapReduce or real-time analysis with Storm, with the processed data stored in a relational database. (In practice, offline and real-time data are classified and processed according to business requirements and stored in different databases for use by the application or service layer.)

The evolution process of the system architecture of large-scale e-commerce websites

The system architecture of a mature large-scale website (Taobao, Tmall, Tencent, etc.) is not designed from the start with complete high performance, high availability, and high scalability. It evolves and improves gradually as the number of users grows and business functions expand. In this process the development model, technical architecture, and design ideas also change greatly, and even the technical staff grows from a few people to a department or a whole product line.

A mature system architecture is therefore improved step by step as the business expands, not achieved overnight. Systems with different business characteristics have different focuses: Taobao must handle searching, ordering, and paying for a massive number of products; Tencent must deliver real-time messages for hundreds of millions of users; Baidu must handle massive numbers of search requests.

Each has its own business characteristics, and their system architectures differ accordingly. Even so, common technologies can be found across these different backgrounds, and these technologies and methods are widely used in large-scale website architectures. The following walks through the evolution of a large-scale website system to introduce them.

1. The initial website architecture

In the initial architecture, applications, databases, and files are all deployed on one server, as shown in the figure:

2. Separation of applications, data and files

As the business expands, one server can no longer meet the performance requirements, so applications, databases, and files are deployed on separate servers, and each server's hardware is configured according to its purpose to get the best performance.

3. Use caching to improve website performance

While improving performance through hardware, we also optimize performance through software. Most website systems use caching to improve performance. Caching works mainly because of hot data: most website traffic follows the 80/20 rule (80% of requests land on 20% of the data), so we can cache the hot data, shorten the access path to it, and improve the user experience.

The common ways to implement caching are local caches and distributed caches; there are also CDNs, reverse proxies, and so on, which are discussed later. A local cache, as the name suggests, caches data locally on the application server, either in memory or in files; OSCache is a commonly used local cache component. A local cache is fast, but because local space is limited, the amount of data it can hold is limited too. A distributed cache can hold massive amounts of data and is very easy to expand. It is often used in portal websites, but it is not as fast as a local cache. Commonly used distributed caches are Memcached and Redis.

4. Use clusters to improve application server performance

The application server is the entrance to the website and bears a large number of requests, so application servers are usually clustered to share the load. A load balancing server is deployed in front of the application servers to schedule user requests and distribute them to the application server nodes according to a distribution policy.

Commonly used load balancing hardware includes F5, which is relatively expensive; software options include LVS, Nginx, and HAProxy. LVS is layer-4 load balancing: it selects an internal server based on the target address and port. Nginx and HAProxy are layer-7 load balancers that can select an internal server based on the message content. Because LVS forwards at a lower layer, its forwarding path is shorter than that of Nginx and HAProxy and its performance is higher. Nginx and HAProxy are more configurable and can, for example, separate dynamic and static content (choosing a static resource server or an application server based on the characteristics of the request).
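The difference between the two layers can be sketched in a few lines. This is a toy model, not how LVS or Nginx are implemented: the class names, the suffix list, and the backend names are hypothetical, and round-robin is just one possible distribution policy.

```python
import itertools


class Layer4Balancer:
    """Chooses a backend without inspecting the request (round-robin here)."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self) -> str:
        return next(self._cycle)


class Layer7Balancer:
    """Inspects the request path first, then round-robins within the chosen pool."""

    STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".html")

    def __init__(self, static_backends, app_backends):
        self._static = Layer4Balancer(static_backends)
        self._app = Layer4Balancer(app_backends)

    def pick(self, path: str) -> str:
        pool = self._static if path.endswith(self.STATIC_SUFFIXES) else self._app
        return pool.pick()
```

The layer-7 balancer does more work per request because it has to look at the content, which mirrors the trade-off described above: more flexibility in exchange for a longer processing path.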

5. Database read-write separation and database and table sharding

As the number of users grows, the database becomes the biggest bottleneck. The common ways to improve database performance are read-write separation and database and table sharding. Read-write separation, as the name suggests, divides the database into a read database and a write database and keeps them in sync through primary-standby replication. Sharding is either horizontal or vertical. Horizontal sharding splits a single very large table, such as the user table. Vertical sharding splits by business; for example, tables belonging to the user business and the product business are placed in different databases.
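Read-write separation is usually implemented in a data access layer or middleware that decides where each statement goes. The sketch below shows only that routing decision; the class name and server names are hypothetical, and a real router would also have to handle transactions and statements that this simple verb check does not cover.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to the primary and spreads reads across replicas."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, primary: str, replicas):
        self.primary = primary
        replicas = list(replicas)
        # Fall back to the primary for reads when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def route(self, sql: str) -> str:
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb in self.WRITE_VERBS:
            return self.primary
        return next(self._replicas)
```

When replication is asynchronous, a read issued right after a write may reach a replica that has not caught up yet, so reads that must see the latest data are often sent to the primary as well.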

6. Use CDN and reverse proxy to improve website performance

If our servers are deployed in a data center in Chengdu, access is fast for users in Sichuan but slower for users in Beijing. This is because Sichuan and Beijing belong to the regions where China Telecom and China Unicom, respectively, are stronger; requests from Beijing users must cross carriers and travel a long router path to reach the server in Chengdu, and the return path is the same, so data transmission takes relatively long. A CDN is commonly used to solve this. The CDN caches content in the carriers' data centers; users first fetch data from the nearest carrier node, which greatly shortens the network path. Well-known CDN providers include ChinaCache (Lanxun) and Wangsu.

A reverse proxy is deployed in the website's own data center. A user request first reaches the reverse proxy server, which returns cached data to the user; if the data is not cached, the request continues to the application server, so the cost of fetching data is reduced. Reverse proxies include Squid and Nginx.

7. Use distributed file system

As the number of users and the business volume keep growing, more and more files are generated, and a single file server can no longer meet the demand. At this point a distributed file system is needed. Commonly used distributed file systems include GFS, HDFS, and TFS.

8. Use NoSQL and search engines

For querying and analyzing massive data, NoSQL databases combined with search engines achieve better performance. Not all data needs to live in a relational database. Commonly used NoSQL databases include MongoDB, HBase, and Redis, and search engines include Lucene, Solr, and Elasticsearch.

9. Split the application server business

As the business expands further, the application becomes very bloated, and we need to split it by business; for example, Baidu is split into news, web search, images, and other businesses. Each business application is responsible for relatively independent operations. Businesses communicate through messages or a shared database.

10. Build distributed services

At this point we find that every business application uses some basic business services, such as user services, order services, payment services, and security services. These are the basic elements that support the business applications. We extract them and use a distributed service framework to build distributed services. Alibaba's Dubbo is a good choice.

A picture explains the e-commerce architecture

Large e-commerce website architecture case

1. Reasons for the e-commerce case

Distributed large-scale websites currently have several main categories:

Large portals, such as NetEase, Sina, etc.;
SNS websites, such as Xiaonei, Kaixin.com, etc.;
E-commerce websites, such as Alibaba, JD.com, Gome Online, Autohome, etc.

Large portals mostly serve news and information and can be optimized with CDNs, static pages, and similar methods. Sites such as Kaixin.com are more interactive and may introduce more NoSQL, distributed caching, and high-performance communication frameworks. E-commerce websites have the characteristics of both: product details can be served statically via CDN, while highly interactive features need NoSQL and similar technologies. We therefore use an e-commerce website as the case for analysis.

2. E-commerce website needs

Customer needs:

Build a full-category e-commerce website (B2C) where users can purchase goods online and either pay online or pay on delivery;
Users can communicate with customer service online while purchasing;
After receiving the goods, users can rate and review them;
A mature purchase, sale, and inventory system already exists and needs to be connected to the website;
The site should be able to support business development for the next 3 to 5 years;
The number of users is expected to reach 10 million in 3 to 5 years;
Promotions such as Double 11, Double 12, and March 8th Men's Day are held regularly;
For other functions, refer to websites such as JD.com or Gome Online.

Customers are customers: they will not tell you exactly what they want, only roughly what they want, so we often need to guide them and dig out their real needs. Fortunately, clear reference websites are provided here. The next step is therefore extensive analysis, combining industry knowledge and the reference websites to propose a solution.

Requirements function matrix

The traditional approach to requirements management is to describe requirements with use case diagrams or module diagrams (requirements lists). Doing so often overlooks a very important kind of requirement, non-functional requirements, so it is recommended to describe requirements with a requirements function matrix. The requirements matrix of this e-commerce website is as follows:

3. Initial website architecture

For a typical website, the initial approach is to use three servers: one for the application, one for the database, and one for an NFS file system.

This was the conventional approach a few years ago. I have seen a website with more than 100,000 members, a vertical clothing-design portal with a very large number of pictures, that used a single server for the application, the database, and image storage. It had many performance problems. As shown below:

Today's mainstream website architecture, however, has changed dramatically. Clusters are generally used for high-availability design. At a minimum it looks like this:

Use a cluster of redundant application servers to achieve high availability (the load balancer can be deployed together with the application);
Use the database primary-standby mode to achieve data backup and high availability;

4. System capacity estimation

Estimation steps:

Registered users → average daily UV → daily PV → daily concurrency;
Peak estimate: 2 to 3 times the normal volume;
Calculate the system capacity from the concurrency (concurrent requests, number of transactions) and the storage capacity.

According to the customer's requirements, the number of registered users reaches 10 million in 3 to 5 years. The number of visits per second can then be estimated:

Daily UV: 2 million (80/20 rule, that is, 20% of the 10 million registered users);
Page views per user per day: 30;
Daily PV: 2 million × 30 = 60 million;
Concentrated visits (80/20 rule): 80% of the traffic, 60 million × 0.8 = 48 million, arrives within 20% of the day, 24 × 0.2 = 4.8 hours;
Visits per minute: 4.8 × 60 = 288 minutes, 48 million / 288 ≈ 167,000 visits per minute;
Visits per second: 167,000 / 60 ≈ 2,780;
Assumption: if the peak is three times the normal value, the number of visits per second can reach 8,340;
That is roughly 8.3 visits per millisecond at peak.
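The estimation steps above can be reproduced as a short calculation. The sketch below only restates the article's arithmetic; the function and parameter names are hypothetical, and the figure of 300 requests per second per server is the one the article uses for its Tomcat-based server estimate.

```python
import math


def estimate_capacity(registered_users=10_000_000, views_per_user=30,
                      peak_factor=3, per_server_rps=300):
    """Apply the 80/20 rule twice: 20% of users are active, 80% of PV falls in 20% of the day."""
    daily_uv = registered_users * 0.2
    daily_pv = daily_uv * views_per_user
    busy_seconds = 24 * 0.2 * 3600          # 4.8 busy hours, in seconds
    per_second = daily_pv * 0.8 / busy_seconds
    peak_per_second = per_second * peak_factor
    return {
        "per_second": per_second,
        "peak_per_second": peak_per_second,
        "servers_normal": math.ceil(per_second / per_server_rps),
        "servers_peak": math.ceil(peak_per_second / per_server_rps),
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,.0f}")
```

Without intermediate rounding the result is slightly lower than the article's figures (about 2,778 and 8,333 visits per second instead of 2,780 and 8,340), and the peak works out to 28 servers, which the article rounds up to 30.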

(Regretting not having studied math harder? I cannot promise the calculation above is free of mistakes, haha.)

Server estimate (taking a Tomcat server as an example): assuming one web server supports 300 concurrent requests per second, about 10 servers are needed normally and about 30 during peak periods. [The default Tomcat configuration is 150.]

Capacity estimate (the 70/90 principle): system CPU usage is generally kept at about 70% and reaches 90% during peak periods. This wastes no resources and stays relatively stable. Memory and IO are similar.

The above estimates are for reference only, because server configuration, business logic complexity, and so on all have an impact. CPU, disk, network, etc. are not evaluated further here.

5. Website architecture analysis

Based on the above estimate, there are several problems:

A large number of servers must be deployed. During peak periods, 30 web servers may be needed, yet these 30 servers are fully used only during flash sales and promotions, which is very wasteful.
All applications are deployed on the same server, and the coupling between applications is severe. Vertical and horizontal splitting is required.
A large amount of redundant code exists across applications.
Server Session synchronization consumes a lot of memory and network bandwidth.
Data requires frequent database access, so the pressure on the database is huge.

Large websites generally need the following architecture optimizations. (Optimization is something to consider when designing the architecture and is usually solved at the architecture or code level. Tuning is mainly the adjustment of simple parameters, such as JVM tuning; if "tuning" involves a lot of code modification, it is not tuning but refactoring.)

Business split
Application cluster deployment (distributed deployment, cluster deployment and load balancing)
Multi-level cache
Single sign-on (distributed Session)
Database cluster (read-write separation, sub-database and sub-table)
Service-based
Message queue
Other technologies

6. Website structure optimization

6.1 Business split

Split vertically by business attributes into a product subsystem, shopping subsystem, payment subsystem, review subsystem, customer service subsystem, and interface subsystem (which connects to external systems such as purchase-sale-inventory, SMS, etc.). Based on the level assigned to each business subsystem, they can be divided into core and non-core systems. Core systems: product subsystem, shopping subsystem, payment subsystem. Non-core systems: review subsystem, customer service subsystem, interface subsystem.

The role of business splitting: once promoted to subsystems, each can be handled by a dedicated team or department, so that specialists do specialized work, and coupling and extensibility problems between modules are resolved. Each subsystem is deployed separately, which avoids the situation where centralized deployment lets one failing application take all applications down.
The role of level definition: it protects key applications during traffic bursts and enables graceful degradation, so that key applications are not affected.

Architecture diagram after splitting:

Reference deployment plan 2

As shown above, each application is deployed separately, and the core system and non-core system are deployed in combination

6.2 Application cluster deployment (distributed, clustered, load balancing)

Distributed deployment: after the business split, applications are deployed individually and communicate with each other directly through RPC;
Cluster deployment: to meet the high-availability requirements of an e-commerce website, each application must be deployed as a cluster of at least two servers;
Load balancing: required for any high-availability system. General applications achieve high availability through load balancing, distributed services through their built-in load balancing, and relational databases through primary-standby setups.

Architecture diagram after cluster deployment:

6.3 Multi-level cache

Caches can generally be divided by storage location into local caches and distributed caches. This case uses a two-level cache design: the first level is a local cache and the second level is a distributed cache. (There are also page caches, fragment caches, and so on, which are finer-grained divisions.) The first-level cache holds data dictionaries, frequently used hotspot data, and other information that is essentially immutable or changes on a regular schedule; the second-level cache holds all data that needs caching. When the first-level cache has expired or is unavailable, the second-level cache is accessed; if the data is not in the second-level cache either, the database is accessed. As a rule of thumb, caching is worth considering once the ratio of writes to reads reaches about 1:4 (in theory 1:2 is enough).

The following cache expiration strategies can be used according to business characteristics:

Automatic expiration: the cache entry expires on its own;
Triggered expiration: the cache entry is expired by an explicit trigger;
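The two-level lookup and both expiration strategies can be sketched together. This is a simplified model: the class name is hypothetical, a plain dict stands in for the distributed cache (Redis or Memcached in practice), and `load_from_db` is whatever function reads the source of truth.

```python
import time


class TwoLevelCache:
    """L1 is a small in-process dict; L2 stands in for a distributed cache."""

    def __init__(self, l2: dict, load_from_db, l1_ttl: float = 60.0):
        self._l1 = {}                    # key -> (value, expires_at)
        self._l2 = l2                    # hypothetical stand-in for Redis/Memcached
        self._load_from_db = load_from_db
        self._l1_ttl = l1_ttl

    def get(self, key):
        entry = self._l1.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                      # L1 hit
        if key in self._l2:
            value = self._l2[key]                # L2 hit
        else:
            value = self._load_from_db(key)      # miss in both levels
            self._l2[key] = value
        # Automatic expiration: the L1 copy is only valid for l1_ttl seconds.
        self._l1[key] = (value, time.monotonic() + self._l1_ttl)
        return value

    def invalidate(self, key):
        """Triggered expiration: call this when the underlying data changes."""
        self._l1.pop(key, None)
        self._l2.pop(key, None)
```

In a multi-server deployment, `invalidate` would also have to reach the local caches on the other application servers, for example through a message; the sketch leaves that out.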

6.4 Single sign-on (distributed Session)

Once the system is split into multiple independently deployed subsystems, session management problems are inevitable. Session synchronization, cookies, or a distributed Session can be used. E-commerce websites generally use a distributed Session, and a complete single sign-on or account management system can then be built on top of it.

Flow description

When the user logs in for the first time, the session information (user ID and user information) is written into the distributed Session, for example with the user ID as the key;
On a later visit, look up the distributed Session for session information; if there is none, redirect to the login page;
It is generally implemented with cache middleware; Redis is recommended because it has persistence, which makes it easy to reload session information from persistent storage after the distributed Session store goes down;
When saving the session, a retention time can be set, such as 15 minutes, after which the session times out automatically;

Combined with cache middleware, a distributed Session implemented this way can simulate an ordinary Session very well.
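The flow above can be sketched with an in-memory store that plays the role of the shared cache. Everything here is illustrative: the class and method names are hypothetical, the dict stands in for Redis, and the sketch keys each session by a random token rather than by the user ID, a substitution made so that session IDs are not guessable.

```python
import time
import uuid


class SessionStore:
    """In-memory stand-in for a shared store such as Redis (an assumption for this sketch)."""

    def __init__(self, ttl_seconds: float = 15 * 60):
        self._ttl = ttl_seconds
        self._data = {}  # session_id -> (record, expires_at)

    def create(self, user_id: str, user_info: dict) -> str:
        """Called after a successful login; returns the ID to hand back to the client."""
        session_id = uuid.uuid4().hex
        record = {"user_id": user_id, **user_info}
        self._data[session_id] = (record, time.monotonic() + self._ttl)
        return session_id

    def get(self, session_id: str):
        """Return the session record, or None if it is missing or has timed out."""
        entry = self._data.get(session_id)
        if entry is None:
            return None
        record, expires_at = entry
        if expires_at <= time.monotonic():
            del self._data[session_id]  # timed out: caller redirects to login
            return None
        return record
```

Because every subsystem consults the same store, a user who logged in through one subsystem is recognized by the others, which is the basis for single sign-on.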

6.5 Database cluster (separation of reading and writing, sub-database and sub-table)

Large websites need to store massive amounts of data. To achieve massive storage, high availability, and high performance, the system is generally designed with redundancy, usually through read-write separation and database and table sharding. Read-write separation generally addresses scenarios where reads far outnumber writes; one primary with one standby, one primary with multiple standbys, or multiple primaries with multiple standbys can be used. This case builds on the business split and combines database sharding, table sharding, and read-write separation. As shown below:

After the business split, each subsystem needs its own database;
If a single database is still too large, it can be split again by business characteristics, for example into a product category database and a product database;
After the database split, tables that hold a large amount of data are split as well, generally by ID, time, etc. (advanced usage is consistent hashing);
On top of database and table sharding, reads and writes are separated;
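Splitting by ID usually comes down to a small routing function. The sketch below uses plain modulo; the function name, the `user_db_N` / `user_N` naming scheme, and the counts of 4 databases and 8 tables per database are all hypothetical.

```python
def locate(user_id: int, db_count: int = 4, tables_per_db: int = 8) -> tuple:
    """Map a numeric ID to (database, table) by modulo; counts are illustrative."""
    slot = user_id % (db_count * tables_per_db)
    db_index = slot // tables_per_db
    table_index = slot % tables_per_db
    return f"user_db_{db_index}", f"user_{table_index}"


if __name__ == "__main__":
    for uid in (0, 13, 45):
        print(uid, locate(uid))
```

Modulo routing is simple, but changing the number of databases or tables remaps most rows, which is one reason consistent hashing is listed as the advanced option.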

For related middleware, see Cobar (Alibaba, no longer maintained), TDDL (Alibaba), Atlas (Qihoo 360), and MyCat. The problems that sequences, JOINs, and transactions raise after sharding will be covered in a separate session on database and table sharding.

6.6 Servitization

Extract functions/modules common to multiple subsystems and use them as public services. For example, the membership subsystem in this case can be extracted as a public service.

6.7 Message Queue

A message queue removes the coupling between subsystems and modules and enables an asynchronous, highly available, high-performance system. It is standard equipment in distributed systems. In this case, the message queue is used mainly in the shopping and delivery steps.

After the user places an order, the order is written to the message queue and the response is returned to the client immediately;
Inventory subsystem: reads the message from the queue and reduces the inventory;
Distribution subsystem: reads the message from the queue and arranges delivery;

The most commonly used message queues include ActiveMQ, RabbitMQ, ZeroMQ, and MSMQ. Choose according to the specific business scenario; RabbitMQ is worth studying.
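The order flow above can be imitated in-process with Python's standard `queue` module. This is only a model of the idea, not a client for any of the products named: `Broker`, `place_order`, and `drain` are hypothetical names, and a real broker would add persistence, acknowledgements, and delivery across machines.

```python
import queue


class Broker:
    """Tiny in-process publish/subscribe broker, standing in for a real MQ."""

    def __init__(self):
        self._subscribers = {}  # subscriber name -> queue.Queue

    def subscribe(self, name: str) -> queue.Queue:
        self._subscribers[name] = queue.Queue()
        return self._subscribers[name]

    def publish(self, message: dict) -> None:
        for q in self._subscribers.values():
            q.put(dict(message))  # each subscriber gets its own copy


def place_order(broker: Broker, order_id: str, sku: str, qty: int) -> dict:
    """Enqueue the order and return immediately without waiting for consumers."""
    broker.publish({"order_id": order_id, "sku": sku, "qty": qty})
    return {"order_id": order_id, "status": "accepted"}


def drain(q: queue.Queue, handler) -> None:
    """Process every message currently waiting in one subscriber's queue."""
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            return
        handler(message)
```

The shopping code never calls the inventory or distribution subsystems directly; each consumes its own copy of the order message at its own pace, so a slow or failed consumer does not block the user's order response.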

6.8 Other architectures (technologies)

In addition to the business split, application clusters, multi-level cache, single sign-on, database clusters, servitization, and message queues introduced above, there are also CDNs, reverse proxies, distributed file systems, big data processing, and other systems. They are not covered in detail here; you can look them up on Baidu or Google, and I may share more on them when there is an opportunity.

Architecture Summary

The architecture of a large website is improved continuously according to business needs, with specific designs and trade-offs made for different business characteristics. This article only describes some of the technologies and methods involved in a conventional large website, and I hope it can inspire everyone.
