MEALv2
author: Zhiqiang Shen, Marios Savvides (Carnegie Mellon University) date: 2020-09-17 keyword:
- knowledge distillation; discriminators;
Paper: MEAL V2 (Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks)
Summary
- MEAL V2 simplifies MEAL in two ways:
- adopting the similarity loss and discriminator only on the final outputs;
- using the average of softmax probabilities from all teachers in the ensemble as stronger supervision for distillation;
- the first method to push a vanilla ResNet-50 past 80% Top-1 accuracy on ImageNet without architecture modification or additional training data;
- relies only on the teacher-student paradigm, with:
- no architecture modification;
- no outside training data;
- no cosine learning rate;
- no extra data augmentation;
- no label smoothing;
Methods
- system overview:
【Module One】Teachers Ensemble: adopts the average of softmax probabilities from multiple pre-trained teachers as the ensemble supervision;
- $p_t^{T_\theta}$: the t-th teacher’s softmax prediction;
- $x$: the input image;
- K: the number of total teachers;
$$ p_e^{T_\theta}(x)=\frac{1}{K}\sum_{t=1}^{K}p_t^{T_\theta}(x) $$
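A minimal PyTorch sketch of this averaging (helper names such as `ensemble_probs` and the `teachers` list are assumptions, not the paper's released code):

```python
import torch

# Sketch (not the authors' code): form the ensemble supervision p_e by
# averaging the softmax outputs of K frozen, pre-trained teachers.
# `teachers` is a hypothetical list of nn.Module classifiers.
@torch.no_grad()
def ensemble_probs(teachers, x):
    probs = [t(x).softmax(dim=1) for t in teachers]  # K tensors of shape (batch, classes)
    return torch.stack(probs).mean(dim=0)            # p_e = (1/K) * sum_t p_t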
【Module Two】KL-divergence: a metric of how one probability distribution differs from a reference distribution; here it aligns the student distribution with the teacher-ensemble distribution (the loss term is reconstructed after this list):
- $p^{s_\theta}(x_i)$: the student output probability;
- N: the number of samples;
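The note omits the loss formula; reconstructed from the symbols above, the distillation term averages, over the N samples, the KL divergence from the ensemble distribution to the student distribution:
$$ L_{KL}=\frac{1}{N}\sum_{i=1}^{N}\mathrm{KL}\left(p_e^{T_\theta}(x_i)\,\big\|\,p^{s_\theta}(x_i)\right) $$
A sketch in PyTorch; `F.kl_div` expects the student's log-probabilities as input and the teacher probabilities as target:

```python
import torch.nn.functional as F

# Sketch: KL divergence between the ensembled teacher probabilities
# (e.g. from ensemble_probs above) and the student's raw logits.
def kl_loss(student_logits, teacher_probs):
    return F.kl_div(student_logits.log_softmax(dim=1),
                    teacher_probs, reduction="batchmean")
```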
【Module Three】Discriminator: a binary classifier that distinguishes whether the input features come from the teacher ensemble or the student network; it consists of a sigmoid function followed by a binary cross-entropy loss:
- $x_t,x_s$: the teacher and student input features;
- $f_\theta$: a three-fc-layer subnetwork;
- $\sigma(x)=1/(1+\exp(-x))$: the logistic function;
- $y\in\{0,1\}$: the label;
$$ p^D(x;\theta)=\sigma(f_\theta(\{x_t,x_s\})) $$
$$ L_D=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log p_i^D+(1-y_i)\log(1-p_i^D)\right] $$
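A hedged PyTorch sketch of such a discriminator; the hidden width and ReLU activations are assumptions, not the paper's exact configuration. `BCEWithLogitsLoss` fuses the sigmoid $\sigma$ and the binary cross-entropy $L_D$ in one numerically stable op:

```python
import torch
import torch.nn as nn

# Sketch of the discriminator: a three-fc-layer subnetwork f_theta that
# scores whether a probability vector came from the teacher ensemble
# (label 1) or the student (label 0). Layer widths are assumptions.
class Discriminator(nn.Module):
    def __init__(self, num_classes=1000, hidden=128):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):            # x: teacher or student probabilities
        return self.f(x).squeeze(1)  # raw logit; sigmoid lives in the loss

bce_loss = nn.BCEWithLogitsLoss()    # sigma + binary cross-entropy (L_D)
# usage: loss = bce_loss(disc(torch.cat([x_t, x_s])),
#                        torch.cat([torch.ones(len(x_t)), torch.zeros(len(x_s))]))
```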
Evaluation
Notes (for further study)
- https://github.com/szq0214/MEAL-V2
- study the code; study the knowledge-distillation network structure