Data assets have become a key tool in product and service design, but centralized collection of user data puts personal privacy at risk and, in turn, exposes organizations to legal risk. Starting around 2016, researchers began to explore how to make use of data while leaving it with its owners at its point of origin and protecting user privacy, which made federated learning and federated analytics a focus of attention. As the scope of research continues to expand, federated learning is now being applied to broader fields such as the Internet of Things.
So, what is federated learning?
Federated learning is a machine learning setting in which multiple entities collaborate to solve a machine learning problem under the coordination of a central server or service provider. Raw data is stored locally on each client and is neither exchanged nor transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objective.
Similarly, generating analytical insights from the combined information in decentralized datasets is called federated analytics. The considerations that arise in federated learning apply to federated analytics as well.
This article gives a brief introduction to the key concepts of federated learning and analytics, focusing on how privacy technologies can be combined with real-world systems and how these technologies can deliver social benefit from aggregated statistics in new domains while minimizing risks to individuals and to data custodians.
Privacy is inherently a multi-faceted concept with three key components: transparency and user consent; data minimization; and data anonymization.
Transparency and user consent are the foundation of privacy protection: they are how users understand and approve the use of their data. Privacy-preserving technologies cannot replace transparency and consent, but they make it easier to reason about which types of data can be used and which are excluded by design, making privacy statements easier to understand, verify, and enforce. The main data uses considered here are generating federated learning models and computing metrics or other aggregate statistics over user data (federated analytics).
Data minimization, applied to aggregation, means collecting only the data required for a specific computation, limiting access to that data at all stages, processing personal data as early as possible, and retaining it only as briefly as possible. In other words, data minimization restricts access to all data to the smallest possible set of parties, usually through security mechanisms such as encryption, access control, secure multi-party computation, and trusted execution environments.
Data anonymization means that the final output of a computation reveals nothing unique to any individual. When aggregation is anonymous, the data any single user contributes to the computation has little influence on the final aggregate output. For example, when aggregate statistics, including model parameters, are released publicly, they should not differ significantly depending on whether any specific user's data was included in the aggregate.
In other words, data minimization concerns how computations are executed and how data is handled, while data anonymization concerns what is computed and released.
Federated learning embodies data minimization structurally. Crucially, data collection and aggregation are inseparable in a federated approach: client data is transformed on the device into focused updates that are collected only for immediate aggregation, and analysts never have access to any individual client's messages. Federated learning and federated analytics are both instances of a general federated computation pattern that embodies data minimization practices. Contrast this with the traditional centralized approach, which replaces on-device preprocessing and aggregation with data collection, so that data minimization happens on the server during processing of the logged data.
The goals of federated learning and federated analytics also align with the goal of anonymous aggregation. With machine learning, the aim is to train a model that predicts accurately for all users without overfitting to any one of them. Likewise, for statistical queries, the aim is to estimate statistics that are not greatly affected by any single user's data.
Federated learning combined with privacy-preserving techniques such as differential privacy can ensure that released aggregates are sufficiently anonymous. In many cases data anonymization may not apply, and direct access by the service provider to an individual's sensitive data is unavoidable; but even in these interactions the service provider should use the data only for its intended purpose.
The defining characteristics of federated learning are that the raw data stays decentralized and that learning happens through aggregation. Locally generated data is heterogeneous in both distribution and quantity, which distinguishes federated learning from traditional data-center distributed learning, where data can be arbitrarily distributed and shuffled and any node in the computation can access any of the data. In practice, a coordinating server plays a significant and often necessary role, for example because mobile devices lack fixed IP addresses and must communicate through a central server.
Two federated scenarios have received special attention:
Cross-device federated learning, where the clients are a very large number of mobile or IoT devices.
Cross-silo (cross-organization) federated learning, where the clients are typically a smaller number of organizations, institutions, or other data silos.
Table 1, adapted from Kairouz et al.,10 summarizes the key characteristics of federated learning settings and highlights some of the key differences between the cross-device and cross-silo settings, along with a comparison to distributed learning in the data center.
Cross-device federated learning has been used on Android and iOS phones for many applications such as keyboard prediction. Cross-silo federated learning is used for problems such as health research. Another application on the rise is finance, with investments from WeBank, Credit Suisse, Intel, and others.
The characteristics of typical federated learning scenarios are compared in the following table:
| Characteristic | Data center distributed learning | Cross-silo federated learning | Cross-device federated learning |
| --- | --- | --- | --- |
| Setting | Models are trained on a large, flat dataset; clients are nodes in a single cluster or data center. | Models are trained across data silos; clients are data centers belonging to different organizations or regions. | Clients are a very large number of mobile or IoT devices. |
| Data distribution | Data is stored centrally and can be cleaned and balanced across clients; any client can read any part of the dataset. | Data is generated and stored locally and remains decentralized; clients cannot read each other's data, which is neither independently nor identically distributed. | Data is generated and stored locally and remains decentralized; clients cannot read each other's data, which is neither independently nor identically distributed. |
| Orchestration | Centrally orchestrated. | A central orchestration server coordinates training but never sees raw data. | A central orchestration server coordinates training but never sees raw data. |
| Distribution scale | 1–1,000 clients | 2–100 clients | Up to hundreds of millions of clients |
| Client properties | Clients are reliable, always participate in computation, and maintain state across rounds. | Clients are reliable, always participate in computation, and maintain state across rounds. | Only a fraction of clients are available at any time, typically sampled at random; most clients participate in a computation only once. |
Machine learning, and deep learning in particular, is data hungry and computationally intensive, so the feasibility of jointly training high-quality models is far from a foregone conclusion. Federated learning algorithms build on classic stochastic gradient descent (SGD), which is widely used to train machine learning models in traditional settings. The model is a function from training examples to predictions, parameterized by a vector of model weights, together with a loss function that measures the error between predictions and the true outputs. SGD samples a batch of training examples (typically tens to thousands), computes the average gradient of the loss with respect to the model weights, and then adjusts the weights in the direction opposite to the gradient. By appropriately tuning the step size of each iteration, satisfactory convergence can be obtained even for non-convex functions.
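As a concrete illustration of this update rule, here is a minimal NumPy sketch of one minibatch SGD step for a linear model with squared loss; the model, loss, and step size are illustrative choices, not taken from the original text.

```python
import numpy as np

def sgd_step(weights, X_batch, y_batch, learning_rate=0.1):
    """One minibatch SGD step for a linear model with squared loss (illustrative)."""
    predictions = X_batch @ weights
    errors = predictions - y_batch
    # Average gradient of 0.5 * (x.w - y)^2 over the batch.
    gradient = X_batch.T @ errors / len(y_batch)
    # Step in the direction opposite to the gradient.
    return weights - learning_rate * gradient
```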
The natural extension to federated learning is to broadcast the current model weights to a random set of clients, have each compute the loss gradient on its local data, average these gradients from the clients on the server, and then update the global model weights. However, many iterations are usually needed to produce a high-accuracy model. A rough calculation shows that in a federated setting, a single iteration can take several minutes, which means federated training could take anywhere from a month to a year, beyond the range of practicality.
The key idea of federated averaging is intuitive: reduce communication and startup costs by performing multiple local steps of stochastic gradient descent on each device and then averaging the resulting models less frequently. If the models are averaged after every local step, training may be too slow; if they are averaged too infrequently, the local models may diverge and averaging may produce a worse combined model.
Model training can thus be reduced to repeated application of a federated aggregation primitive: averaging model gradients or updates.
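The sketch below illustrates this idea under simplifying assumptions: each client runs a few local SGD steps on a linear model and the server takes an example-weighted average of the returned weights. The function names and weighting scheme are illustrative, not a reference implementation of any particular system.

```python
import numpy as np

def local_update(weights, X, y, steps=5, learning_rate=0.1):
    """Run several local SGD steps on one client's data (linear model, squared loss)."""
    w = weights.copy()
    for _ in range(steps):
        gradient = X.T @ (X @ w - y) / len(y)
        w -= learning_rate * gradient
    return w, len(y)

def federated_averaging_round(global_weights, clients):
    """One round: broadcast weights, train locally, and average the returned models."""
    updates, counts = [], []
    for X, y in clients:  # each client holds its own local dataset (X, y)
        w, n = local_update(global_weights, X, y)
        updates.append(w)
        counts.append(n)
    # Example-weighted average of the client models becomes the new global model.
    return np.average(np.stack(updates), axis=0, weights=np.array(counts, dtype=float))
```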
Having a viable federated algorithm is a necessary starting point, but if cross-device federated learning is to be an effective method for product teams, more is needed. For cross-device federated learning, a typical workflow usually looks like the following:
(1) Identify the problem
This usually means a model of moderate size (1-50 MB) that can run on the device; the potential training data available on devices is richer or more representative than the data available in the data center; there are privacy or other reasons to prefer not to centralize the data; and the feedback signals needed to train the model are readily available on the device.
(2) Model development and evaluation
As with any machine learning task, choosing the right model architecture and hyperparameters (learning rate, batch size, regularization) is critical to success. In federated learning the challenge can be greater, because it introduces many new hyperparameters, such as how many clients participate in each round and how many local steps each performs. A common starting point is simulated federated learning on proxy data available in the data center, used for coarse model selection and tuning. Final tuning and evaluation must be carried out with federated training on real devices. Evaluation must also be done in a federated manner: independently of the training process, candidate global models are sent to devices so that accuracy metrics can be computed on those devices' local datasets and aggregated by the server; both a simple average of per-client performance and histograms over clients are important.

These demands create two key infrastructure requirements: (1) a high-performance federated learning simulation infrastructure that allows a smooth transition to running on real devices, and (2) a cross-device infrastructure that makes it easy to manage multiple simultaneous training and evaluation tasks.
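The following sketch shows what federated evaluation might look like in a data-center simulation, under the assumption that `candidate_model` is a callable returning predictions; in a production system the per-client metric would be computed on the device and only the aggregates would reach the server.

```python
import numpy as np

def evaluate_federated(candidate_model, client_datasets, num_bins=10):
    """Aggregate per-client accuracy into a mean and a histogram (simulation only)."""
    per_client_accuracy = []
    for X, y in client_datasets:
        # In a real deployment this part runs on the device; only the scalar is uploaded.
        predictions = candidate_model(X)
        per_client_accuracy.append(float(np.mean(predictions == y)))
    mean_accuracy = float(np.mean(per_client_accuracy))
    histogram, bin_edges = np.histogram(per_client_accuracy, bins=num_bins, range=(0.0, 1.0))
    return mean_accuracy, histogram, bin_edges
```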
(3) Deployment
Once a high-quality candidate model has been selected in step 2, its deployment usually follows the same procedure as for a data-center-trained model, including additional validation and testing (possibly including manual quality assurance), live A/B testing against the previous production model, and a staged rollout to the entire device fleet (which may include several orders of magnitude more devices than actually participated in model training).
It is worth noting that none of the work in step 2 affects the user experience on the devices involved in training and evaluation; models trained with federated learning do not surface predictions to users until they pass through the deployment step. Ensuring that this processing does not negatively affect the devices is a key infrastructure challenge; for example, intensive computation may be performed only when the device and its network connection are idle.
These workflows represent a significant challenge for building scalable infrastructure and APIs.
Federated learning provides a variety of privacy advantages out of the box. Following the principle of data minimization, raw data remains on the device, and updates sent to the server are focused on a specific purpose and aggregated as soon as possible. In particular, no non-aggregated data is persisted on the server, end-to-end encryption protects data in transit, and both decryption keys and decrypted values are held only temporarily in RAM. Machine learning engineers and analysts who interact with the system have access only to aggregated data. Aggregation plays a fundamental role in federated approaches, making it natural to limit the influence of any single client on the output; but if the goal is to provide more formal guarantees, such as differential privacy, the algorithms must be carefully designed.
Although basic federated learning methods have been shown to be feasible and are now widely adopted, they are still far from being used by default: inherent tensions between fairness, accuracy, development velocity, and computational cost may stand in the way of data minimization and anonymization approaches. This is why composable privacy-enhancing technologies are needed. Ultimately, decisions about deploying privacy technologies are made by product or service teams in consultation with privacy, policy, and legal experts in the relevant domain. Federated learning systems give products the opportunity to offer stronger privacy protections and, perhaps more importantly, help policy experts strengthen privacy definitions and requirements over time.
When considering the privacy properties of a federated system, it is useful to think in terms of access points and threat models. Does a participant have physical access to a device, or access to the network? Root or physical access to the servers running the federated learning service? Access to the models and metrics released to machine learning engineers? Access to the final deployed model? The number of potentially malicious parties varies greatly as information flows through this system. Therefore, privacy claims must be evaluated for the complete end-to-end system: if appropriate security measures are not in place to protect raw data on the device or intermediate computation state in transit, then guarantees about whether the ultimately deployed model memorizes user data may not matter.
Data minimization addresses potential threats from devices, networks, and servers by strengthening security and minimizing the retention of data and intermediate results. When models and metrics are released to model engineers or deployed to production, anonymous aggregation protects individuals' data from parties with access to these released outputs.
At several points in a federated computation, participants expect one another to take the appropriate actions, and only those actions. For example, the server expects clients to execute their preprocessing steps faithfully; clients expect the server to keep their individual updates private until they have been aggregated; and both clients and the server expect that neither data analysts nor users of the deployed machine learning model will be able to extract personal data; and so on.
Privacy-preserving technologies support the structural enforcement of these expectations, preventing participants from deviating. Indeed, the federated system itself can be viewed as a privacy-preserving technology: it structurally prevents the server from accessing anything about a client's data that is not included in the update the client submits.
Take the aggregation stage as an example. An idealized system would have a fully trusted third party aggregate the clients' updates and reveal only the final aggregate to the server. In practice no such mutually trusted third party usually exists to play this role, but various technologies allow a federated learning system to simulate one under a range of conditions.
For example, the server can run the aggregation procedure inside a secure enclave, a specially constructed piece of hardware that can prove to clients what code it is running and ensure that no one can observe or tamper with the execution of that code. However, the availability of secure enclaves, whether in the cloud or on consumer devices, is currently limited, and those that are available may implement only some of the desired properties. Furthermore, even when available and fully featured, secure enclaves can impose additional constraints, including severely limited memory or speed; vulnerability to data exposure through side channels (for example, cache-timing attacks); difficulty in verifying correctness; and dependence on attestation services provided by the manufacturer (for example, for key confidentiality).
Distributed cryptographic protocols for secure multi-party computation allow the participants to collaboratively simulate a trusted third party without special hardware, as long as enough of the participants behave honestly. While secure multi-party computation of arbitrary functions remains computationally expensive in most cases, specialized secure aggregation algorithms for vector summation in federated settings have been developed that preserve privacy even against an adversary that observes the server and controls a majority of the clients, while remaining robust to clients dropping out of the computation (a toy sketch of the masking idea appears after the list below). These protocols achieve:
Communication efficiency - O(log n + l) communication per client, where n is the number of users and l is the vector length; across a wide range of applications the constants are small, producing less than twice the traffic of aggregating the data in the clear.
Computational efficiency - O(log² n + l log n) computation per client.
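The toy sketch below illustrates only the pairwise-masking intuition behind such protocols: correlated masks cancel when all contributions are summed, so the server learns the total without seeing any individual update. A real protocol derives the masks from pairwise key agreement, works over a finite field, and uses secret sharing to tolerate dropouts; none of that is shown here.

```python
import numpy as np

def masked_client_updates(client_vectors, seed=0):
    """Add pairwise random masks to each client's vector; masks cancel in the sum."""
    rng = np.random.default_rng(seed)
    n, length = len(client_vectors), len(client_vectors[0])
    masked = [v.astype(float).copy() for v in client_vectors]
    for i in range(n):
        for j in range(i + 1, n):
            # In a real protocol this mask is derived from a key shared by clients i and j.
            mask = rng.normal(size=length)
            masked[i] += mask  # client i adds the shared mask
            masked[j] -= mask  # client j subtracts it, so it cancels in the total
    return masked

clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = masked_client_updates(clients)
# The server only sees masked vectors, yet their sum equals the true total [9, 12].
print(np.sum(masked, axis=0))
```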
Cryptographically secure aggregation protocols have been deployed at scale in commercial federated computing systems. Beyond private aggregation, privacy-preserving technologies can protect other parts of a federated system. For example, secure enclaves or cryptographic techniques (such as zero-knowledge proofs) can assure the server that clients performed their preprocessing faithfully. Even the model-broadcast stage can benefit: for many learning tasks, an individual client may hold data relevant to only a small part of the model, in which case the client can privately retrieve just that part of the model for training, again using secure enclaves or cryptographic techniques to ensure that the server learns nothing about which part of the model the client has training data for.
While secure enclaves and private aggregation techniques can strengthen data minimization, they are not on their own designed to produce anonymous aggregates, for example by limiting a user's influence on the trained model. Indeed, learned models can in some cases leak sensitive information.
The standard approach to data anonymization is differential privacy. For a generic procedure that aggregates records in a database, differential privacy requires bounding the contribution of any single record to the aggregate and then adding appropriately scaled random noise. For example, in the differentially private stochastic gradient descent (DP-SGD) algorithm, the norm of each gradient is clipped, the clipped gradients are aggregated, and Gaussian noise is added in each training iteration.
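A minimal sketch of this clip-and-noise step, assuming per-example gradients are already available as NumPy arrays; the clipping norm and noise multiplier are illustrative, and the privacy accounting needed to turn them into an (ε, δ) statement over many iterations is omitted.

```python
import numpy as np

def dp_average_gradient(per_example_gradients, clip_norm=1.0, noise_multiplier=1.0, seed=None):
    """Clip each gradient to a maximum L2 norm, sum, add Gaussian noise, then average."""
    rng = np.random.default_rng(seed)
    clipped = []
    for g in per_example_gradients:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    # Noise scale is proportional to the clipping norm, i.e., the sensitivity of the sum.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_gradients)
```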
Differentially private algorithms are necessarily randomized, so one can consider the distribution over models that the algorithm produces for a given dataset. Intuitively, this distribution over models is similar when the algorithm is run on two input datasets that differ in a single record. Formally, differential privacy is quantified by privacy-loss parameters (ε, δ), where smaller (ε, δ) corresponds to stronger privacy. This goes beyond simply limiting the sensitivity of the model to each record: by adding noise proportional to the maximum influence any record can have, it ensures enough randomness to mask any one record's contribution to the output.
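For reference, the standard formal statement (stated here as context rather than taken from the original text): a randomized mechanism M satisfies (ε, δ)-differential privacy if, for every pair of datasets D and D′ differing in a single record and every set of outputs S,

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.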
In cross-device federated learning, a record is defined as all the training examples of a single user/client, which yields user-level differential privacy. Even in centralized settings, federated learning algorithms are well suited to training with user-level privacy guarantees because they compute a single model update from all of a user's data, making it easier to bound each user's total influence on the model update.
Providing formal (ε, δ) guarantees in cross-device federated learning systems can be particularly challenging because the set of all eligible users is dynamic and not known in advance, and participating users may drop out at any point in training; building a complete end-to-end protocol suitable for production federated learning systems remains an important open problem.
In cross-silo federated learning, the unit of privacy can take on a different meaning. For example, a record can be defined as all the examples in a data silo if the participating institutions want to ensure that someone with access to the model iterates or the final model cannot determine whether a particular institution's dataset was used to train that model. User-level differential privacy still makes sense in the cross-silo setting, but enforcing it can be more challenging if multiple institutions hold records from the same user.
In the past, differentially private data analysis has mainly been applied with a central or trusted aggregator, where raw data is collected by a trusted service provider that implements the differential privacy algorithm. Local differential privacy avoids the need for a fully trusted aggregator, but leads to a drastic drop in accuracy.
To recover the utility of central differential privacy without relying on a fully trusted central server, several emerging approaches, often called distributed differential privacy, can be used. The goal is to make the output differentially private before the server can see it in the clear. Under distributed differential privacy, clients first compute minimal, application-specific reports, perturb these reports slightly with random noise, and then execute a private aggregation protocol. The server then has access only to the output of that protocol. The noise added by any individual client is typically not enough to provide a meaningful local differential privacy guarantee on its own; however, after private aggregation, the output of the protocol provides a stronger DP guarantee based on the sum of the noise contributed by all clients. This holds even for someone with access to the server, subject to the security assumptions required by the private aggregation protocol.
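A toy sketch of the client-side step under distributed differential privacy, assuming real-valued vectors and Gaussian noise; deployed protocols additionally discretize values and work in modular arithmetic so that they compose with secure aggregation, which is not shown here.

```python
import numpy as np

def client_contribution(value, clip_norm=1.0, per_client_noise_std=0.1, rng=None):
    """Clip a client's vector and add a small local share of Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng()
    norm = np.linalg.norm(value)
    clipped = value * min(1.0, clip_norm / (norm + 1e-12))
    return clipped + rng.normal(scale=per_client_noise_std, size=value.shape)

# Conceptually, the server sees only the securely aggregated sum; the combined noise
# (standard deviation per_client_noise_std * sqrt(num_clients)) provides the DP guarantee.
rng = np.random.default_rng(0)
client_vectors = [rng.normal(size=4) for _ in range(100)]
noisy_sum = np.sum([client_contribution(v, rng=rng) for v in client_vectors], axis=0)
```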
For an algorithm to provide a formal user-level privacy guarantee, the sensitivity of the model to each user's data must be bounded and noise proportional to that sensitivity must be added. Sufficient random noise is needed to make the ε in the differential privacy definition small enough for a strong guarantee, yet bounding sensitivity even with only a small amount of noise can significantly reduce memorization, because differential privacy assumes a worst-case adversary with unlimited computational power and access to arbitrary side information, assumptions that are often unrealistic in practice. There are therefore substantial advantages to training with differentially private algorithms that bound each user's influence. Nevertheless, designing practical federated learning and federated analytics algorithms that achieve small ε guarantees remains an important area of research.
Model auditing techniques can be used to further quantify the benefits of training with differential privacy. They include quantifying the degree to which a model memorizes unique or rare training examples and quantifying how well it can be inferred whether a given user participated in training. These auditing techniques are useful even when ε is large, because they quantify the gap between the worst-case adversary assumed by differential privacy and realistic adversaries with limited computational power and side information. They also serve as a complementary form of stress testing: unlike the formal mathematical statements of differential privacy, these audits apply to the complete end-to-end system and can potentially catch software bugs or mistaken parameter choices.
Beyond learning machine learning models, data analysts are often interested in applying data science methods to raw data held on users' local devices. For example, analysts may be interested in aggregate model metrics, popular trends and activities, or geospatial heat maps. All of these can be accomplished with federated analytics. Like federated learning, federated analytics works by running local computations over each device's data and making only the aggregated results available. Unlike federated learning, however, federated analytics aims to support basic data science needs such as counts, averages, histograms, quantiles, and other SQL-like queries.
Consider an application in which an analyst wants to use federated analytics to learn the ten most-played songs in a music library shared by many users. This task can be performed with the federation and privacy techniques discussed above. For example, clients could encode the songs they have listened to as a binary vector whose length equals the size of the library, and distributed differential privacy could be used to ensure that the server sees only a differentially private histogram of how many users played each song.
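A minimal sketch of this pipeline follows; for clarity the secure aggregation step is stood in for by a plain sum and the noise is added centrally, whereas a deployed system would add distributed noise and aggregate privately. All names and parameters here are illustrative.

```python
import numpy as np

def client_plays_vector(played_song_ids, library_size):
    """Encode the songs a client has played as a 0/1 vector over the whole library."""
    v = np.zeros(library_size)
    v[list(played_song_ids)] = 1.0
    return v

def dp_top_songs(client_vectors, noise_std=5.0, k=10, seed=0):
    """Sum the client vectors, add noise to the histogram, and return the top-k song ids."""
    rng = np.random.default_rng(seed)
    counts = np.sum(client_vectors, axis=0)  # stand-in for secure aggregation
    noisy_counts = counts + rng.normal(scale=noise_std, size=counts.shape)
    return np.argsort(noisy_counts)[::-1][:k]
```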
However, federated analytics tasks differ from federated learning tasks in several respects:
Federated analytics algorithms are typically non-interactive and involve a large number of clients. In other words, unlike federated learning applications, there are no diminishing returns from including more clients in a round. As a result, applying differential privacy is less challenging in federated analytics, since each round can include a large number of clients and fewer rounds are needed.
The same clients do not need to participate again in subsequent rounds; in fact, clients that participate repeatedly can bias the results. Federated analytics tasks are therefore best served by infrastructure that limits the number of times any individual can participate.
Federated analytics tasks are often sparse, making efficient private sparse aggregation a particularly important research topic.
It is worth noting that although limited client participation and sparse aggregation are particularly relevant to federated analytics, they can also be applied to federated learning problems.
Federated learning is being applied to more types of data and more problem domains, and is even considered a key approach to privacy computing, that is, a privacy-preserving approach to AI. Due to space limitations, this article does not cover the challenges of personalization, robustness, fairness, or systems implementation in federated learning. For putting federated learning into practice, TensorFlow Federated may be a good starting point.