Federated Learning Record
level: author: Anit Kumar Sahu (Bosch Center for AI) date: 2019
Paper: Federated Learning
Summary
- discusses the unique characteristics and challenges of federated learning, provides a broad overview of current approaches, and outlines several directions of future work relevant to a wide range of research communities.
Research Objective
- Application Area: learning sentiment, semantic location, or activities of mobile phone users; adapting to pedestrian behavior in autonomous vehicles; predicting health events like heart-attack risk from wearable devices;
- Smartphones: next-word prediction and voice recognition, while protecting personal privacy and saving the phone's limited bandwidth/battery power;
- Organizations: reduce strain on the network and enable private learning between devices/organizations;
- Internet of things: Modern IoT networks, such as wearable devices, autonomous vehicles, or smart homes containing numerous sensors.
Problem Statement
- aim to learn the model under the constraint that device-generated data is stored and processed locally, with only intermediate updates communicated periodically to a central server.
- $m$: the total number of devices;
- $p_k > 0$ and $\sum_k p_k = 1$; $F_k$ is the local objective function of the k-th device;
- $p_k$: the relative impact of each device, e.g., $p_k = 1/m$ or $p_k = n_k/n$;
- $n = \sum_k n_k$: the total number of samples;
$$ \min_w F(w), \quad \text{where } F(w) := \sum_{k=1}^{m} p_k F_k(w) \quad \text{and} \quad F_k(w) = \frac{1}{n_k} \sum_{j_k=1}^{n_k} f_{j_k}(w; x_{j_k}, y_{j_k}) $$
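To make the notation concrete, here is a minimal NumPy sketch that evaluates this global objective with $p_k = n_k/n$; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def global_objective(w, client_data, loss_fn):
    """Evaluate F(w) = sum_k p_k * F_k(w) with p_k = n_k / n.

    client_data: list of (X_k, y_k) arrays, one pair per device;
    loss_fn(w, x, y): per-example loss f_{j_k}(w; x_{j_k}, y_{j_k}).
    """
    n = sum(len(y) for _, y in client_data)   # total sample count
    F = 0.0
    for X_k, y_k in client_data:
        n_k = len(y_k)
        # F_k(w): average per-example loss on device k
        F_k = np.mean([loss_fn(w, x, y) for x, y in zip(X_k, y_k)])
        F += (n_k / n) * F_k                  # weight by p_k = n_k / n
    return F
```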
Core Problems:
- Expensive Communication: develop communication-efficient methods that iteratively send small messages or model updates as part of the training process; two ways of reducing communication:
- reducing the total number of communication rounds;
- reducing the size of the transmitted message at each round;
- Systems Heterogeneity: the storage, computational and communication capabilities of each device in FL may differ due to variability in hardware, network connectivity, and power.
- anticipate a low amount of participation;
- tolerate heterogeneous hardware;
- be robust to dropped devices in the network;
- Statistical Heterogeneity: devices frequently generate and collect data in a non-identically distributed manner across the network;
- multi-task and meta-learning perspectives enable personalized or device-specific modeling;
- Privacy Concerns: tools like secure multiparty computation or differential privacy provide privacy at the cost of reduced model performance or system efficiency, a trade-off that must be understood and balanced;
Previous work:
【Communication-efficiency】
- Local Updating methods
- mini-batch optimization methods that process multiple data points at a time;
- distributed approaches like ADMM [4] in real-world data-center environments;
- a variable number of local updates applied on each machine in parallel at each communication round;
- Compression schemes
- model compression such as sparsification, subsampling, and quantization; see [119], [135] for details;
- the low participation of devices, non-identically distributed local data, and local updating schemes pose novel challenges for these model compression approaches;
- errors accumulated locally may become stale if the devices are not frequently sampled;
- forcing the updating models to be sparse and low-rank;
- performing quantization with structured random rotations;
- using lossy compression and dropout to reduce server-device communication;
- applying Golomb lossless encoding;
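As an illustration of the sparsification-plus-error-accumulation idea above, a minimal top-k compression sketch with a local error buffer; this is an assumed simplification, not any specific paper's scheme:

```python
import numpy as np

def topk_sparsify(update, k, error_buffer):
    """Keep only the k largest-magnitude entries of a model update.

    update / error_buffer: flat 1-D arrays of model deltas.
    The dropped mass is accumulated in the error buffer and added
    back next round -- the local error-feedback trick noted above.
    """
    compensated = update + error_buffer
    idx = np.argsort(np.abs(compensated))[-k:]   # indices of top-k entries
    sparse = np.zeros_like(compensated)
    sparse[idx] = compensated[idx]
    new_error = compensated - sparse             # what we failed to send
    return sparse, new_error
```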
- Decentralized training
【Systems Heterogeneity】
- Asynchronous Communication:
bounded-delay assumptions can be unrealistic in federated settings, where the delay may be on the order of hours to days, or completely unbounded.
- Active Sampling
- actively selecting participating devices at each round;
- based on system resources, with the aim that the server aggregates as many device updates as possible within a pre-defined time window;
- based on data quality;
- how to extend these approaches to handle real-time, device-specific fluctuations in computation and communication delays;
- how to actively sample a small but sufficiently representative set of devices based on the underlying statistical structure.
- Fault Tolerance
- ignoring such device failures may introduce bias;
- coded computation introduces algorithmic redundancy:
- using gradient coding and its variants
【Statistical Heterogeneity】
the data is not identically distributed across devices, which raises challenges both in modeling the data and in analyzing the convergence behavior of the associated training procedures.
- Modeling Heterogeneous Data (statistical heterogeneity)
using meta-learning and multi-task learning.
MOCHA [106], an optimization framework designed for the federated setting, allows for personalization by learning separate but related models for each device while leveraging a shared representation via multi-task learning;
[26] models the star topology and performs variational inference;
- Convergence Guarantees for Non-IID Data
FedProx makes a small modification to the FedAvg method to help ensure convergence, both theoretically and in practice. FedProx can also be interpreted as a generalized, reparameterized version of FedAvg that has practical ramifications in the context of accounting for systems heterogeneity across devices.
【Privacy】
sharing other information such as model updates can also leak sensitive user information;
- Privacy in Machine Learning
- differential privacy to communicate noisy data sketches;
- homomorphic encryption to operate on encrypted data;
- secure function evaluation or multi-party computation;
- Privacy in FL
develop methods that are computationally cheap, communication-efficient, and tolerant to dropped devices, all without overly compromising accuracy.
- SMC is a lossless method that retains the original accuracy with a very high privacy guarantee, but incurs significant extra communication cost;
- differential privacy: involves hyperparameters that affect communication and accuracy, which must be carefully chosen.
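A minimal sketch of the clip-and-noise recipe behind differentially private updates; the hyperparameter names and values are illustrative, and a real DP-FedAvg deployment also needs client sampling and a privacy accountant:

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_mult=1.1, rng=None):
    """Clip a client update to a bounded L2 norm, then add Gaussian noise.

    clip_norm and noise_mult jointly control the privacy/accuracy
    trade-off mentioned above.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound sensitivity
    noise = rng.normal(0.0, noise_mult * clip_norm, size=update.shape)
    return clipped + noise
```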
Future Directions
- Extreme communication schemes:
- Communication reduction and the Pareto frontier
- Novel models of asynchrony
- Heterogeneity diagnostics
- Granular privacy constraints
- Beyond supervised learning
- Productionizing federated learning
- concept drift (the underlying data-generation model changes over time)
- diurnal variations (devices exhibit different behavior at different times of the day or week)
- cold-start problems (new devices enter the network)
- Benchmarks
Notes (to study further)
- non-convex problems
- [114]
- differential privacy[32, 33, 34]
- A firm foundation for private data analysis
- The algorithmic foundations of differential privacy
- Calibrating noise to sensitivity in private data analysis
level: cited by 750 author: H. Brendan McMahan (Google Inc) date: 2017-02-28 keyword:
- federated learning, distributed optimization and estimation, communication efficiency
Paper: Communication-Efficient Learning of Deep Networks from Decentralized Data
Summary
- introduce the FederatedAveraging algorithm, which combines local SGD on each client with a server that performs model averaging.
- learned the basic concepts of federated learning; studied the code but did not run it successfully.
Research Objective
- Application Area: modern mobile devices (phones, tablets) have access to a wealth of data suitable for learning models, which in turn can improve the user experience on the devices.
- Language models: improve speech recognition and text entry on touch-screen keyboards by improving decoding, next-word prediction, and even whole replies.
- Image models: automatically select good photos.
- Purpose: advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates.
Problem Statement
Opportunities:
- the devices we carry contain many sensors and create a large amount of private data;
- this data drives many intelligent applications that improve usability;
- storing this data centrally is costly, e.g., in storage and communication;
- decoupling model training from the need for direct access to the raw data reduces privacy and security risks by limiting the attack surface to only the device.
Previous work:
- Federated Learning:
- training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center.
- data is privacy sensitive or large in size
- for supervised tasks, labels on the data can be inferred naturally from user interaction.
- Privacy: the information transmitted for federated learning is the minimal update necessary to improve a particular model. (There is a book on data privacy and various privacy-preserving algorithms.)
- Federated Optimization: several key properties that differentiate it from typical distributed optimization problem.
- Non-IID: data based on the usage of a particular user will not be representative of the population distribution;
- Unbalanced: device usage differs, leading to different amounts of local data;
- Massively distributed: the number of clients participating in an optimization is much larger than the average number of examples per client;
- Limited communication: mobile devices are frequently offline or on slow or expensive connections;
- previous work does not consider unbalanced and non-IID data, and its empirical evaluation is limited; this paper focuses on the non-IID and unbalanced properties of the optimization, plus communication constraints.
Methods
- system overview:
【Challenge 1】client availability and unbalanced, non-IID data
- assume a synchronous update scheme that proceeds in rounds of communication;
- propose the Federated Averaging (FedAvg) algorithm (a minimal sketch follows the challenge list below);
- Federated SGD (FedSGD) vs. Federated Averaging (FedAvg).
【Challenge 2】in data-center optimization, communication costs are relatively small and computation costs dominate (GPUs can reduce them); in federated optimization, communication costs dominate.
- the clients will only volunteer to participate in the optimization when they are charged, plugged-in, and on an unmetered wifi connection.
- use additional computation to decrease the number of rounds of communication
- increase parallelism
- increase computation on each client: client perform more complex computation between each communication round.
【Challenge 3】for non-convex objectives, averaging models trained from different initializations can give arbitrarily bad results; FedAvg works because in each round every client starts from the same parameters (the previous round's global model).
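A minimal sketch of one FedAvg round under simplifying assumptions: full-batch local gradients via a user-supplied `grad_fn`, whereas the actual algorithm runs E epochs of mini-batches of size B on each client:

```python
import numpy as np

def local_update(w, data, epochs, lr, grad_fn):
    """Client side: start from the shared global weights w and take E
    full-batch gradient steps -- more local compute, fewer rounds."""
    w_k = w.copy()
    for _ in range(epochs):
        w_k -= lr * grad_fn(w_k, data)
    return w_k

def fedavg_round(w, client_datasets, epochs, lr, grad_fn):
    """Server side: weighted average of the returned client models.
    With epochs=1 and a single gradient step this reduces to FedSGD."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    local_models = [local_update(w, d, epochs, lr, grad_fn)
                    for d in client_datasets]
    return sum(p * wk for p, wk in zip(sizes / sizes.sum(), local_models))
```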
Evaluation
- Accuracy && Communication rounds
- local epoch test
- CIFAR experiments
- Large-scale LSTM experiments
Conclusion
- the identification of the problem of training on decentralized data from mobile devices as an important research direction
- the selection of a straightforward and practical algorithm that can be applied to this setting
- an extensive empirical evaluation of the proposed approach.
Notes
- the IID assumption of traditional distributed optimization problems
level: SysML Conference author: Keith Bonawitz date: 2019-05-22 keyword:
- Federated Learning, System Design
Paper: FL System Design
Summary
- Based on TensorFlow, Google built the first production-grade, scalable large-scale mobile federated learning system, already running on tens of millions of phones. These phones collaboratively learn a shared model while all training data stays on-device, keeping personal data safe and letting on-device intelligent applications update faster with lower energy cost.
- Federated Learning progress:
- Server elements:
level: author: Andrew Hard (Google LLC) date: 2019-02-28 keyword:
- Federated Learning, language modeling, CIFG, NLP, next word prediction
Paper: Mobile Key Prediction
Summary
- Gboard provides auto-correction, word completion, and next-word prediction features.
Methods
- system overview:
- clients process their local data and share model updates with the server; weights from a large population of clients are aggregated by the server and combined to create an improved global model.
- for every client: with learning rate $\epsilon$, the local client update is $w_{t+1}^k = w_t - \epsilon g_k$, where $g_k$ is the average gradient computed on client k's local data.
- for the server: $w_{t+1} = \sum_{k=1}^K (n_k/N)\, w_{t+1}^k$
- updates are processed in memory and are immediately discarded after accumulation in a weight vector.
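A minimal sketch of the server-side aggregation formula above, written as a streaming accumulator so each client's weights can be discarded right after accumulation, as the last bullet describes (names are illustrative):

```python
def aggregate_stream(update_stream):
    """Server step w_{t+1} = sum_k (n_k / N) * w_{t+1}^k, computed as a
    running weighted sum so each update is dropped after accumulation.

    update_stream yields (n_k, w_k) pairs, w_k being a NumPy array.
    """
    acc, total = None, 0.0
    for n_k, w_k in update_stream:
        acc = n_k * w_k if acc is None else acc + n_k * w_k
        total += n_k
    return acc / total
```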
Notes (to study further)
- An FSA is a kind of FSM (finite state machine) with the following properties:
  - Deterministic: from any given state, at most one transition can be taken for a given input.
  - Acyclic: no state can be traversed twice.
  - Acceptor: the machine only "accepts" specific input sequences and terminates in a final state.
  - The time complexity of looking up whether a key is in the set depends on the length of the key, not on the size of the set.
  - Construction: a trie shares only prefixes, while an FSA shares both prefixes and suffixes.
- An FST is also a finite state machine (FSM), with the following properties:
  - Deterministic: from any given state, at most one transition can be taken for a given input.
  - Acyclic: no state can be traversed twice.
  - Transducer: it accepts specific sequences, terminates in a final state, and additionally emits an output value.
  - It shares both prefixes and suffixes; it can not only decide whether a key exists but also return the corresponding output.
  - Construction: like an FSA, but with outputs placed and shared on the transitions.
- paper 2, "Finite-state transducers in language and speech processing"
- paper 3, "Mobile keyboard input decoding with finite-state transducers"
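To make the lookup property concrete, here is a minimal trie-shaped deterministic acceptor in Python; a real FSA/FST additionally shares suffixes via minimization and attaches outputs to transitions, both omitted here:

```python
def build_acceptor(keys):
    """Build a trie-shaped deterministic acceptor from a set of keys
    (prefix sharing only; an FSA would also share suffixes)."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})  # one transition per character
        node["<final>"] = True              # mark a final state
    return root

def accepts(fsa, key):
    """Lookup cost is O(len(key)), independent of the set size."""
    node = fsa
    for ch in key:
        if ch not in node:
            return False
        node = node[ch]
    return node.get("<final>", False)

# usage: accepts(build_acceptor(["cat", "car"]), "car") -> True
```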
- CIFG for next-word prediction
- Character-aware neural language models
level: CCF_A AAAI author: Yang Liu, Anbu Huang, Yun Luo, He Huang, Youzhi Liu, Yuanyuan Chen date: 2020-01-17 keyword:
- Federated Learning, Object Detection
Paper: FedVision
Research Objective
- Application Area: federated learning in Computer Vision
- Purpose: use federated learning to solve problems in computer vision.
Problem Statement
- Computer vision: privacy concerns and the high cost of transmitting video data;
- the traditional workflow of centralized training for an object detector;
- difficult to share data across organizations due to liability concerns;
- the whole process takes a long time and depends on when the next round of offline training occurs; new data must wait for the next round of training;
- communication cost of transmitting data.
Methods
- system overview: HFL (horizontal federated learning)
【Module 1】 Crowdsourced Image Annotation
- annotate each image with {label, x, y, w, h}
【Module 2】Federated Model Training
- Configuration: configure training information, e.g., the number of iterations, reconnections, the server URL for uploading… (a hypothetical config sketch follows this component list)
- Task Scheduler: global dispatch scheduling, coordinate communications between the server and clients to balance the utilization of local computational resources.
- Task Manager: coordinates the concurrent federated model training processes.
- Explorer: monitors the resource utilization situation on the client side, eg: CPU usage, memory usage, network load, etc. to inform the task scheduler on its load-balancing decisions.
- FL_SERVER: responsible for model parameter uploading, model aggregation, and model dispatch
- FL_CLIENT: hosts the task manager and explorer components and performs local model training.
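The paper does not give the Configuration component's exact schema; a hypothetical sketch of the kind of training information it might hold, where every field name is an assumption:

```python
# Hypothetical training configuration consumed by FL_SERVER / FL_CLIENT;
# every field name below is illustrative, not the platform's real schema.
training_config = {
    "num_iterations": 50,      # federated training rounds
    "max_reconnections": 3,    # retries when a client drops
    "upload_url": "https://fl-server.example.com/upload",
    "local_epochs": 2,         # client-side passes per round
    "batch_size": 16,
    "model": "FedYOLOv3",
}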
【Module 3】Federated Model Update
- the number of model parameter files, and thus the required storage, grows with the number of training rounds;
- use Cloud Object Storage (COS) to store practically limitless amounts of data easily and at an affordable cost.
【Module 4】Computer Vision Object Detection: FedYOLOv3
- in general, two-stage approaches (e.g., R-CNN) produce more accurate object detection results, while one-stage approaches such as YOLOv3 are more efficient;
- process of YOLOv3:
- Predicting the positions of B bounding boxes<x,y,w,h>
- Estimating the confidence score of the B predicted bounding boxes: whether a box contains the target object ($p(obj) = 1$ if it contains an object, else 0) and how precise the box boundary is, via $\theta = p(obj) \cdot IOU$, where IOU is the intersection-over-union (see the sketch after this list);
- computing the class conditional probability $p(c_i|obj) \in [0,1]$ for each of the C classes.
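A minimal sketch of the confidence score $\theta = p(obj) \cdot IOU$, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner; YOLO itself predicts box centers, so this is a simplification:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes,
    (x, y) being the top-left corner."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = ix * iy
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def confidence(p_obj, box_pred, box_true):
    """theta = p(obj) * IOU, the box confidence score described above."""
    return p_obj * iou(box_pred, box_true)
```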
- The loss function:
【Module 5】Model Compression
- $M^{i,k}$: the model of the i-th user after the k-th iteration of FL training; $M_j^{i,k}$: the j-th layer of $M^{i,k}$; $|\sum M_j^{i,k}|$: the sum of the absolute values of the parameters in the j-th layer;
- the contribution of the j-th layer to overall model performance is measured by $v(j) = |\sum M_j^{i,k} - \sum M_j^{i,k-1}|$: the larger $v(j)$, the greater the impact of layer j on the model;
- FL_CLIENT ranks the $v(j)$ values of all layers in descending order and uploads only the parameters of the first n layers for aggregation.
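A minimal sketch of this layer-selection rule, interpreting $|\sum M_j|$ as the sum of absolute parameter values per the text; the function name is illustrative:

```python
import numpy as np

def select_layers(model_prev, model_curr, n):
    """Rank layers by v(j) = | sum|M_j^k| - sum|M_j^{k-1}| | and return
    the indices of the n most-changed layers to upload.

    model_prev / model_curr: lists of per-layer weight arrays from
    consecutive FL rounds of the same client.
    """
    v = [abs(np.abs(curr).sum() - np.abs(prev).sum())
         for prev, curr in zip(model_prev, model_curr)]
    order = np.argsort(v)[::-1]          # descending by impact v(j)
    return sorted(order[:n].tolist())    # layer indices to upload
```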
Evaluation
- Companies in the experiments:
- CRC has business interests in consumer products, health care, energy services, urban construction and operation, technology, and finance; it detects multiple types of safety hazards via cameras in more than 100 factories.
- GRG Banking has more than 300,000 machines deployed in over 80 countries; it monitors suspicious transaction behaviours via cameras on the machines.
- SPIC is the world's largest photovoltaic power generation company; it monitors the safety of more than 10,000 square meters of photovoltaic panels.
- FedVision achieves good performance in efficiency, data privacy, and cost.
Conclusion
- reports FedVision, a machine learning engineering platform to support the development of federated-learning-powered computer vision applications;
- the first industrial application platform for computer-vision-based tasks, developed by WeBank and Extreme Vision to help customers build computer-vision-based safety monitoring solutions in smart city applications;
- the platform helps improve operational efficiency and reduce costs, while eliminating the need to transmit sensitive data around.
level: ICLR 2020 author: Hongyi Wang (Wisconsin-Madison), Mikhail Yurochkin (MIT-IBM Watson AI Lab), Yuekai Sun (University of Michigan) date: 2020 keyword:
- data privacy, federated learning
Paper: FedMatchedAvg
Summary
- FedMA constructs the shared global model in a layer-wise manner by matching and averaging hidden elements (i.e., channels for convolutional layers, hidden states for LSTMs, neurons for fully connected layers) with similar feature extraction signatures.
- demonstrates how PFNM can be applied to CNNs and LSTMs, but finds it only gives very minor improvements over weight averaging.
- proposes FedMA (Federated Matched Averaging), a new layer-wise federated learning algorithm for modern CNNs and LSTMs that appeals to Bayesian nonparametric methods to adapt to heterogeneity in the data.
Problem Statement
- coordinate-wise averaging of weights may have drastic detrimental effects on the performance of the averaged model and adds significantly to the communication burden (because of the permutation invariance of NN parameters).
Previous work:
- FedAvg (McMahan et al., 2017): the parameters of local models are averaged element-wise with weights proportional to the sizes of the client datasets.
- FedProx(Sahu et al.) adds a proximal term to the client cost functions, limiting the impact of local updates by keeping them close to the global model.
- Agnostic Federated Learning(Mohri et al.): optimizes a centralized distribution that is a mixture of the client distributions.
- Probabilistic Federated Neural Matching (Yurochkin et al., 2019): matches the neurons of client NNs before averaging them, using Bayesian nonparametric methods to adapt to global model size and to heterogeneity in the data; it only works with simple architectures.
Methods
- Problem Formulation: (not fully understood yet; recorded for later)
- system overview:
First, the data center gathers only the weights of the first layers from the clients and performs the one-layer matching described previously to obtain the first-layer weights of the federated model. The data center then broadcasts these weights to the clients, which proceed to train all consecutive layers on their datasets, keeping the matched federated layers frozen. This is repeated up to the last layer, for which a weighted averaging is conducted based on the class proportions of data points per client.
FedMA with communication: local clients receive the matched global model at the beginning of each new round and reconstruct their local models, with size equal to the original local models, based on the matching results of the previous round.
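A simplified sketch of the layer-wise matching idea, aligning neurons with a Hungarian assignment over squared weight distances; FedMA's actual matching is Bayesian nonparametric (BBP-MAP) and can adapt the global layer size, which this sketch does not do:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_average(layers):
    """Permute each client's neurons to best align with the first
    client's, then average -- a stand-in for FedMA's matched averaging.

    layers: list of (neurons x features) weight matrices, one per client.
    """
    ref = layers[0]
    aligned = [ref]
    for W in layers[1:]:
        # cost[i, j] = ||ref_i - W_j||^2, the mismatch between neurons
        cost = ((ref[:, None, :] - W[None, :, :]) ** 2).sum(-1)
        _, perm = linear_sum_assignment(cost)
        aligned.append(W[perm])          # reorder rows to match ref
    return np.mean(aligned, axis=0)
```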
Evaluation
- Environment:
- Dataset:
Conclusion
- present Federated Matched Averaging(FedMA), a layer-wise federated learning algorithm designed for modern CNNs and LSTMs architectures that accounts for permutation invariance of the neurons and permits global model size adaptation.
Notes (to study further)
level: author: Wei Chen (Carnegie Mellon University), Radu Marculescu date: 2020 keyword:
- Federated Learning; activation-divergence issue; Maximum Entropy; Non-IID;
Chen, Wei, Kartikeya Bhardwaj, and Radu Marculescu. “FedMAX: Mitigating Activation Divergence for Accurate and Communication-Efficient Federated Learning.” arXiv preprint arXiv:2004.03657 (2020).
Paper: FedMax
Summary
- introduces a prior based on the principle of maximum entropy, which assumes minimal information about the per-device activation vectors and aims at making the activation vectors of the same classes as similar as possible across multiple devices;
- makes activation vectors across multiple devices more similar (for the same classes), improving accuracy;
- significantly reduces the number of total communication rounds needed to save energy when training on edge devices.
Problem Statement
- the activation vectors in FL can diverge, even if subsets of users share a few common classes with data residing on different devices.
Previous work:
- FedAvg: not designed to handle statistical heterogeneity in federated settings, where data is not independent and identically distributed across different devices.
- Data-sharing strategy: distributes global data across the local devices, but obtaining this common global data is problematic in practice.
- FedProx[4]: targets the weight-divergence problem, the local-weights diverge from the global model due to non-IID data at local devices; by introducing a new loss function which constrains the local models to stay close to the global model.
Methods
- Problem Formulation:
- $g_k(w_k)$: local objective which is typically the loss function of the prediction made with model parameters w;
- m: the number of devices selected at any given communication round; $m = C \cdot M$;
- C: the proportion of selected devices;
- $\sum_{k=1}^M p_k = 1$, with $p_k = n_k/n$;
- $n_k$: is the number of samples available at the device k;
- $n=\sum_{k=1}^Mn_k$: total number of samples;
$$ \min_w g(w) = \sum_{k=1}^{m} p_k\, g_k(w_k) $$
- system overview:
- $L^2$ norm regularization: reduce the activation divergence across different devices by preventing the activation vectors from taking large values;
  - $F_k(w_k)$: the cross-entropy loss on the local data;
  - $a_i^k$: the activation vector at the input of the last fully-connected layer.

$$ \min_w g_k(w_k) = F_k(w_k) + \beta\, \|a_i^k\|_2 $$
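A minimal sketch of this regularized local objective, averaging the per-sample penalty over a mini-batch (the averaging step is an assumption):

```python
import numpy as np

def l2_regularized_loss(ce_loss, activations, beta):
    """g_k = F_k + beta * mean_i ||a_i^k||_2, with activations the
    (N, d) matrix of last-FC-input vectors for a mini-batch."""
    penalty = np.linalg.norm(activations, axis=1).mean()
    return ce_loss + beta * penalty
```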
- Maximum entropy regularization: since we don't know which users have data from which classes, we have no prior information about how the activation vectors at different users should be distributed.
  - N: the mini-batch size of the local training data;
  - $H$: the entropy of an activation vector;
  - U: the uniform distribution over the activation vectors;
  - KL(·||·): the KL divergence.

$$ \min_w g_k(w_k) = F_k(w_k) - \frac{\beta}{N} \sum_{i=1}^N H(a_i^k) $$

Using the Kullback-Leibler divergence, this is equivalent to:

$$ \min_w g_k(w_k) = F_k(w_k) + \frac{\beta}{N} \sum_{i=1}^N \mathrm{KL}\big(a_i^k \,\|\, U\big) $$
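A minimal sketch of the KL-to-uniform form, assuming the activation vectors are first normalized into distributions with a softmax (that normalization step is an assumption, not stated in the notes):

```python
import numpy as np

def maxent_regularized_loss(ce_loss, activations, beta):
    """g_k = F_k + (beta/N) * sum_i KL(a_i || U), using the identity
    KL(p || U) = log(d) - H(p): minimizing the KL maximizes entropy."""
    # normalize each activation vector into a distribution (softmax)
    z = activations - activations.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)   # H(a_i^k)
    kl = np.log(activations.shape[1]) - entropy      # KL(a_i^k || U)
    return ce_loss + beta * kl.mean()
```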
- $w^0$: the initial model and weights, generated on a remote server;
- $E$: the number of local epochs;
- $B$: the local training batch size.
Evaluation
- Effects of maximum entropy regularization with different distribution of synthetic data
- Comparison of $L^2$-norm against Maximum Entropy
- Different Dataset Comparison
- Medical data experiment
- Mitigating Activation Divergence
Notes (to study further)
- differential privacy [15]
- KL divergence
level: ICLR 2020, CCF_A author: Sean Augenstein (Google Inc), H. Brendan McMahan (Google Inc) date: 2020 keyword:
- federated learning
Paper: Effective ML on Private, Decentralized Datasets
Summary
- demonstrates that generative models, trained using federated methods and with formal differential privacy guarantees, can be used effectively to debug many commonly occurring data issues even when the data cannot be directly inspected;
- identifies key challenges in implementing end-to-end workflows with non-inspectable data, e.g., for debugging a "primary" ML model used in a mobile application;
- proposes a methodology that allows auxiliary generative models to resolve these challenges;
- demonstrates how privacy-preserving federated generative models (RNNs for text and GANs for images) can be trained to high enough fidelity to discover introduced data errors matching those encountered in real-world scenarios.
Problem Statement
- challenges in model development and debugging.
Previous work:
- manual data inspection is problematic for privacy-sensitive datasets, yet it is normally required for:
- identifying and fixing problems in the data
- generating new modeling hypotheses
- assigning or refining human-provided labels
Methods
- (not yet understood)