Model Evaluation
0. Sklearn Metrics
.1. Classification metrics
See the Classification metrics section of the user guide for further details.
| Function | Description |
|---|---|
| metrics.accuracy_score(y_true, y_pred, *[, …]) | Accuracy classification score. |
| metrics.auc(x, y) | Compute Area Under the Curve (AUC) using the trapezoidal rule. |
| metrics.average_precision_score(y_true, …) | Compute average precision (AP) from prediction scores. |
| metrics.balanced_accuracy_score(y_true, …) | Compute the balanced accuracy. |
| metrics.brier_score_loss(y_true, y_prob, *) | Compute the Brier score loss. |
| metrics.classification_report(y_true, y_pred, *) | Build a text report showing the main classification metrics. |
| metrics.cohen_kappa_score(y1, y2, *[, …]) | Cohen’s kappa: a statistic that measures inter-annotator agreement. |
| metrics.confusion_matrix(y_true, y_pred, *) | Compute confusion matrix to evaluate the accuracy of a classification. |
| metrics.dcg_score(y_true, y_score, *[, k, …]) | Compute Discounted Cumulative Gain. |
| metrics.det_curve(y_true, y_score[, …]) | Compute error rates for different probability thresholds. |
| metrics.f1_score(y_true, y_pred, *[, …]) | Compute the F1 score, also known as balanced F-score or F-measure. |
| metrics.fbeta_score(y_true, y_pred, *, beta) | Compute the F-beta score. |
| metrics.hamming_loss(y_true, y_pred, *[, …]) | Compute the average Hamming loss. |
| metrics.hinge_loss(y_true, pred_decision, *) | Average hinge loss (non-regularized). |
| metrics.jaccard_score(y_true, y_pred, *[, …]) | Jaccard similarity coefficient score. |
| metrics.log_loss(y_true, y_pred, *[, eps, …]) | Log loss, aka logistic loss or cross-entropy loss. |
| metrics.matthews_corrcoef(y_true, y_pred, *) | Compute the Matthews correlation coefficient (MCC). |
| metrics.multilabel_confusion_matrix(y_true, …) | Compute a confusion matrix for each class or sample. |
| metrics.ndcg_score(y_true, y_score, *[, k, …]) | Compute Normalized Discounted Cumulative Gain. |
| metrics.precision_recall_curve(y_true, …) | Compute precision-recall pairs for different probability thresholds. |
| metrics.precision_recall_fscore_support(…) | Compute precision, recall, F-measure and support for each class. |
| metrics.precision_score(y_true, y_pred, *[, …]) | Compute the precision. |
| metrics.recall_score(y_true, y_pred, *[, …]) | Compute the recall. |
| metrics.roc_auc_score(y_true, y_score, *[, …]) | Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores. |
| metrics.roc_curve(y_true, y_score, *[, …]) | Compute Receiver operating characteristic (ROC). |
| metrics.top_k_accuracy_score(y_true, y_score, *) | Top-k Accuracy classification score. |
| metrics.zero_one_loss(y_true, y_pred, *[, …]) | Zero-one classification loss. |
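A minimal usage sketch for a few of the metrics above; the toy labels below are made up purely for illustration:

```python
from sklearn import metrics

# toy multiclass labels and predictions (illustrative values only)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(metrics.accuracy_score(y_true, y_pred))             # fraction of correct predictions
print(metrics.f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print(metrics.classification_report(y_true, y_pred))      # per-class precision / recall / F1 / support
```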
.2. Regression metrics
See the Regression metrics section of the user guide for further details.
| Function | Description |
|---|---|
| metrics.explained_variance_score(y_true, …) | Explained variance regression score function. |
| metrics.max_error(y_true, y_pred) | max_error metric calculates the maximum residual error. |
| metrics.mean_absolute_error(y_true, y_pred, *) | Mean absolute error regression loss. |
| metrics.mean_squared_error(y_true, y_pred, *) | Mean squared error regression loss. |
| metrics.mean_squared_log_error(y_true, y_pred, *) | Mean squared logarithmic error regression loss. |
| metrics.median_absolute_error(y_true, y_pred, *) | Median absolute error regression loss. |
| metrics.mean_absolute_percentage_error(…) | Mean absolute percentage error regression loss. |
| metrics.r2_score(y_true, y_pred, *[, …]) | R² (coefficient of determination) regression score function. |
| metrics.mean_poisson_deviance(y_true, y_pred, *) | Mean Poisson deviance regression loss. |
| metrics.mean_gamma_deviance(y_true, y_pred, *) | Mean Gamma deviance regression loss. |
| metrics.mean_tweedie_deviance(y_true, y_pred, *) | Mean Tweedie deviance regression loss. |
.3. Multilabel ranking metrics
See the Multilabel ranking metrics section of the user guide for further details.
| Function | Description |
|---|---|
| metrics.coverage_error(y_true, y_score, *[, …]) | Coverage error measure. |
| metrics.label_ranking_average_precision_score(…) | Compute ranking-based average precision. |
| metrics.label_ranking_loss(y_true, y_score, *) | Compute Ranking loss measure. |
.4. Clustering metrics
The sklearn.metrics.cluster submodule contains evaluation metrics for cluster analysis results. There are two forms of evaluation:
- supervised, which uses ground-truth class values for each sample.
- unsupervised, which does not and measures the ‘quality’ of the model itself.
| Function | Description |
|---|---|
| metrics.adjusted_mutual_info_score(…[, …]) | Adjusted Mutual Information between two clusterings. |
| metrics.adjusted_rand_score(labels_true, …) | Rand index adjusted for chance. |
| metrics.calinski_harabasz_score(X, labels) | Compute the Calinski and Harabasz score. |
| metrics.davies_bouldin_score(X, labels) | Compute the Davies-Bouldin score. |
| metrics.completeness_score(labels_true, …) | Completeness metric of a cluster labeling given a ground truth. |
| metrics.cluster.contingency_matrix(…[, …]) | Build a contingency matrix describing the relationship between labels. |
| metrics.cluster.pair_confusion_matrix(…) | Pair confusion matrix arising from two clusterings. |
| metrics.fowlkes_mallows_score(labels_true, …) | Measure the similarity of two clusterings of a set of points. |
| metrics.homogeneity_completeness_v_measure(…) | Compute the homogeneity, completeness, and V-measure scores at once. |
| metrics.homogeneity_score(labels_true, …) | Homogeneity metric of a cluster labeling given a ground truth. |
| metrics.mutual_info_score(labels_true, …) | Mutual Information between two clusterings. |
| metrics.normalized_mutual_info_score(…[, …]) | Normalized Mutual Information between two clusterings. |
| metrics.rand_score(labels_true, labels_pred) | Rand index. |
| metrics.silhouette_score(X, labels, *[, …]) | Compute the mean Silhouette Coefficient of all samples. |
| metrics.silhouette_samples(X, labels, *[, …]) | Compute the Silhouette Coefficient for each sample. |
| metrics.v_measure_score(labels_true, …[, beta]) | V-measure cluster labeling given a ground truth. |
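A brief sketch of both forms of evaluation, assuming k-means labels on a toy blob dataset (data, cluster count, and random_state are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

# toy data with known ground-truth labels (illustrative only)
X, labels_true = make_blobs(n_samples=200, centers=3, random_state=0)
labels_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# supervised: compare predicted clusters against the ground-truth classes
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.normalized_mutual_info_score(labels_true, labels_pred))

# unsupervised: judge cluster quality from the data alone
print(metrics.silhouette_score(X, labels_pred))
print(metrics.davies_bouldin_score(X, labels_pred))
```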
.5. Pairwise metrics
| Function | Description |
|---|---|
| metrics.pairwise.additive_chi2_kernel(X[, Y]) | Computes the additive chi-squared kernel between observations in X and Y. |
| metrics.pairwise.chi2_kernel(X[, Y, gamma]) | Computes the exponential chi-squared kernel between X and Y. |
| metrics.pairwise.cosine_similarity(X[, Y, …]) | Compute cosine similarity between samples in X and Y. |
| metrics.pairwise.cosine_distances(X[, Y]) | Compute cosine distance between samples in X and Y. |
| metrics.pairwise.distance_metrics() | Valid metrics for pairwise_distances. |
| metrics.pairwise.euclidean_distances(X[, Y, …]) | Considering the rows of X (and Y=X) as vectors, compute the distance matrix between each pair of vectors. |
| metrics.pairwise.haversine_distances(X[, Y]) | Compute the Haversine distance between samples in X and Y. |
| metrics.pairwise.kernel_metrics() | Valid metrics for pairwise_kernels. |
| metrics.pairwise.laplacian_kernel(X[, Y, gamma]) | Compute the laplacian kernel between X and Y. |
| metrics.pairwise.linear_kernel(X[, Y, …]) | Compute the linear kernel between X and Y. |
| metrics.pairwise.manhattan_distances(X[, Y, …]) | Compute the L1 distances between the vectors in X and Y. |
| metrics.pairwise.nan_euclidean_distances(X) | Calculate the euclidean distances in the presence of missing values. |
| metrics.pairwise.pairwise_kernels(X[, Y, …]) | Compute the kernel between arrays X and optional array Y. |
| metrics.pairwise.polynomial_kernel(X[, Y, …]) | Compute the polynomial kernel between X and Y. |
| metrics.pairwise.rbf_kernel(X[, Y, gamma]) | Compute the rbf (gaussian) kernel between X and Y. |
| metrics.pairwise.sigmoid_kernel(X[, Y, …]) | Compute the sigmoid kernel between X and Y. |
| metrics.pairwise.paired_euclidean_distances(X, Y) | Computes the paired euclidean distances between X and Y. |
| metrics.pairwise.paired_manhattan_distances(X, Y) | Compute the L1 distances between the vectors in X and Y. |
| metrics.pairwise.paired_cosine_distances(X, Y) | Computes the paired cosine distances between X and Y. |
| metrics.pairwise.paired_distances(X, Y, *[, …]) | Computes the paired distances between X and Y. |
| metrics.pairwise_distances(X[, Y, metric, …]) | Compute the distance matrix from a vector array X and optional Y. |
| metrics.pairwise_distances_argmin(X, Y, *[, …]) | Compute minimum distances between one point and a set of points. |
| metrics.pairwise_distances_argmin_min(X, Y, *) | Compute minimum distances between one point and a set of points. |
| metrics.pairwise_distances_chunked(X[, Y, …]) | Generate a distance matrix chunk by chunk with optional reduction. |
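A short sketch of the pairwise helpers, using small made-up arrays:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity, rbf_kernel

# rows are samples, columns are features (made-up values)
X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
Y = np.array([[1.0, 0.0], [0.0, 0.0]])

print(pairwise_distances(X, Y, metric="euclidean"))  # shape (3, 2): distance from every row of X to every row of Y
print(cosine_similarity(X, Y))                       # cosine similarity matrix, shape (3, 2)
print(rbf_kernel(X, Y, gamma=0.5))                   # Gaussian (RBF) kernel matrix
```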
.6. Plotting
| Function | Description |
|---|---|
| metrics.plot_confusion_matrix(estimator, X, …) | Plot Confusion Matrix. |
| metrics.plot_det_curve(estimator, X, y, *[, …]) | Plot detection error tradeoff (DET) curve. |
| metrics.plot_precision_recall_curve(…[, …]) | Plot Precision Recall Curve for binary classifiers. |
| metrics.plot_roc_curve(estimator, X, y, *[, …]) | Plot Receiver operating characteristic (ROC) curve. |

| Class | Description |
|---|---|
| metrics.ConfusionMatrixDisplay(…[, …]) | Confusion Matrix visualization. |
| metrics.DetCurveDisplay(*, fpr, fnr[, …]) | DET curve visualization. |
| metrics.PrecisionRecallDisplay(precision, …) | Precision Recall visualization. |
| metrics.RocCurveDisplay(*, fpr, tpr[, …]) | ROC Curve visualization. |

1. IOU
The ratio of the intersection to the union of the predicted box and the ground-truth (annotated) box; the larger the value, the better the detector performs.
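A minimal sketch of this computation for two axis-aligned boxes in (x1, y1, x2, y2) format; the `iou` helper and the example boxes are purely illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```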

2. Precision
Precision is the proportion of all samples the system labels as "positive" that are truly positive.

3. Accuracy
Accuracy is computed over all samples: the proportion of all predictions, positive and negative, that are correct.

4. Recall
Recall is the proportion of all truly positive samples that the system labels as "positive".
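A minimal sketch tying the three definitions above to scikit-learn calls, on a made-up binary example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# made-up binary ground truth and predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))   # correct predictions over all samples
print(precision_score(y_true, y_pred))  # of the samples predicted positive, the fraction that are truly positive
print(recall_score(y_true, y_pred))     # of the truly positive samples, the fraction predicted positive
```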
5. Precision-Recall (PR) Curve
A plot with precision on the Y axis and recall on the X axis. It is an aggregate measure of overall performance, so whichever class (positive or negative) has more samples carries more weight. When comparing learners, if one learner's PR curve is completely enclosed by another learner's curve, the enclosing learner can be said to perform better.
One can also compare the area under the PR curve, which to some extent reflects how well a learner achieves relatively high precision and high recall at the same time. Because this area is hard to estimate, the break-even point (BEP) is used instead: the value at which precision equals recall. The larger the BEP, the better the classifier.

The F1-score is a single metric that balances precision and recall; it is more commonly used than the BEP.
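A sketch of the PR-curve and F1 computations with scikit-learn; the labels, scores, and the 0.5 threshold are made up for illustration:

```python
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score

# made-up binary labels and classifier scores
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision, recall)                         # the points of the PR curve
print(average_precision_score(y_true, y_score))  # area-like summary of the PR curve (AP)

# F1 needs hard predictions, e.g. by thresholding the scores at 0.5
y_pred = [int(s >= 0.5) for s in y_score]
print(f1_score(y_true, y_pred))
```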

6. ROC, AUC & K-S Curve
ROC stands for Receiver Operating Characteristic. The ROC curve plots the true positive rate (TPR) on the Y axis against the false positive rate (FPR) on the X axis; the diagonal corresponds to a "random guess" model, while the point (0, 1) corresponds to the "ideal" model. It applies to binary classification.

If one learner's ROC curve is completely enclosed by another's, the enclosing learner performs better; if the two ROC curves cross, no general claim can be made about which learner is better. In that case one can compare the area under the ROC curve, i.e. the AUC: the classifier with the larger area performs better.
AUC (Area Under Curve) is the area under the ROC curve. A perfect classifier has an AUC of 1. In practice the AUC usually lies between 0.5 and 1, and the higher it is, the better the model separates the two classes: roughly 0.85–0.95 is very good, and 0.95–1 is excellent but rarely achievable.
- The larger the KS value, the better the model separates positive from negative samples.
To build the KS curve, samples are first sorted by the model's output score. For each threshold one computes the proportion of positive (negative) samples below (or above) that threshold relative to all positive (negative) samples, i.e. the TPR and FPR, the same quantities used by the ROC curve, only plotted against a different X axis. Sweeping the threshold from small to large yields the cumulative curves of positive and negative samples; subtracting one from the other gives the KS curve. The highest point of the KS curve is the KS value, and the threshold at that point is the cut-off where the model separates the two classes best.
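A sketch computing the ROC curve, AUC, and the KS statistic; here KS is taken from the ROC points as max(TPR - FPR), which matches the construction described above (labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# made-up binary labels and classifier scores
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve

# KS statistic: the largest gap between the cumulative positive and negative curves,
# i.e. the maximum of (TPR - FPR) over all thresholds
ks = np.max(tpr - fpr)
best_threshold = thresholds[np.argmax(tpr - fpr)]
print(ks, best_threshold)
```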

7. Confusion Matrix
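scikit-learn's confusion_matrix tabulates true labels against predicted labels (rows are true classes, columns are predicted classes). A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# made-up multiclass labels
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(cm)

# optional visualization (requires matplotlib):
# ConfusionMatrixDisplay(cm).plot()
```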

8. Generalization Ability
Generalization ability refers to how well a trained model predicts on previously unseen data.
- Loss function: a function that measures how severe a prediction error is.
- Training error: the average loss over the training set. It is meaningful, but not what ultimately matters.
- Test error: the average loss over the test set; it reflects the model's ability to predict unseen data (a small sketch follows below).
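A small sketch making the two error notions concrete, using mean squared error as the loss; the data, model, and split are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# illustrative regression data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# training error: average loss on the data the model was fit to
train_error = mean_squared_error(y_train, model.predict(X_train))
# test error: average loss on held-out data, estimating performance on unseen data
test_error = mean_squared_error(y_test, model.predict(X_test))
print(train_error, test_error)
```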
9. Overfitting & Underfitting
When a model fits the training set too well, the training error is very small while the generalization error is large; this is called overfitting. Typical causes:
- The model memorizes noise in the data. The model is misled by the noise, so the fitted function deviates substantially from the true underlying distribution. Noise here can be mislabeled samples or a small number of samples that clearly deviate from the overall distribution (outliers). Cleaning the data or handling outliers helps mitigate this.
- There is too little training data, so the training set cannot represent the overall data distribution and nothing else will help much. More data is needed, possibly including synthetically generated samples.
- The model is too complex, so it over-learns the training data.
When a model does not fit the data well enough, the training error itself is large; this is called underfitting:
- The model is too simple, i.e. its functional form is too limited to capture the structure in the data and fit it well.



10. 偏差和方差
**偏差:**the difference between your model’s expected predictions and the true values.
衡量了模型期望输出与真实值之间的差别,刻画了模型本身的拟合能力。**方差:**refers to your algorithm’s sensitivity to specific sets of training data. High variance algorithms will produce drastically different models depending on the training set.
度量了训练集的变动所导致的学习性能的变化,刻画了模型输出结果由于训练集的不同造成的波动。**噪音:**度量了在当前任务上任何学习算法所能达到的期望泛化误差的下界,刻画了学习问题本身的难度。

11. Regression Metrics
.1. Mean Absolute Error (MAE)
A drawback is that this error has no second derivative, so some optimization methods cannot be applied to it.

.2. Root Mean Squared Error (RMSE)
RMSE penalizes samples with large errors more heavily and is therefore more sensitive to outliers.

.3. Root Mean Squared Logarithmic Error (RMSLE)
When the true values span a wide range (e.g. annual income, which can run from 0 to very large numbers), errors such as MAE, MSE, and RMSE make the model focus on the samples with large target values. RMSLE instead measures the relative (proportional) prediction error, so samples with small target values matter just as much, and when the data contain outliers with very large target values, RMSLE reduces their influence.
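A sketch of the three errors with scikit-learn, on made-up non-negative targets (RMSLE is only defined for non-negative values):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

# made-up true and predicted targets
y_true = [3.0, 50.0, 200.0, 7.0]
y_pred = [2.5, 60.0, 180.0, 8.0]

mae   = mean_absolute_error(y_true, y_pred)
rmse  = mean_squared_error(y_true, y_pred) ** 0.5         # square root of the MSE
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))   # square root of the mean squared log error
print(mae, rmse, rmsle)
```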
12. PSI (Model Stability)
The Population Stability Index (PSI) measures the difference between the score distribution on a test (monitoring) sample and the score distribution on the model-development sample; it is the most common indicator of model stability. In practice the scores are first bucketed into bins, and the PSI then compares different samples, or samples from different time periods, bin by bin.
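The usual formula is PSI = Σ_i (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the fractions of the development (expected) and test (actual) scores that fall into bin i. A minimal sketch, assuming equal-width bins derived from the development scores; the binning strategy, the epsilon guard for empty bins, and the toy score samples are choices of this example:

```python
import numpy as np

def psi(expected_scores, actual_scores, n_bins=10, eps=1e-6):
    """Population Stability Index between two score samples."""
    expected_scores = np.asarray(expected_scores, dtype=float)
    actual_scores = np.asarray(actual_scores, dtype=float)

    # equal-width bin edges taken from the development (expected) sample;
    # clip the new scores into that range so every value falls in some bin
    edges = np.linspace(expected_scores.min(), expected_scores.max(), n_bins + 1)
    actual_clipped = np.clip(actual_scores, edges[0], edges[-1])

    expected_frac = np.histogram(expected_scores, bins=edges)[0] / len(expected_scores)
    actual_frac = np.histogram(actual_clipped, bins=edges)[0] / len(actual_scores)

    # guard against empty bins before taking the log
    expected_frac = np.clip(expected_frac, eps, None)
    actual_frac = np.clip(actual_frac, eps, None)

    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

# made-up scores: development sample vs. a later monitoring sample
rng = np.random.default_rng(0)
print(psi(rng.normal(0.5, 0.10, 5000), rng.normal(0.55, 0.12, 5000)))
```

A small PSI indicates a stable score distribution, while larger values signal population shift; the exact cut-offs vary between teams, so none are hard-coded here.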


13. Validation and Test Sets
- The test set is used to evaluate the model's predictive ability; it provides an unbiased estimate of that ability. If you do not need an unbiased estimate of predictive performance, you do not need a test set.
- The validation set is used for hyperparameter selection. The model depends on its hyperparameters and the hyperparameters depend on the validation set, so the validation set participates in building the model, which means the model has already absorbed information from the validation set. A separate test set is therefore needed to estimate the model's generalization ability.
- Without a validation set, split the data 70/30: 70% for training and 30% for testing.
- With a validation set, split the data 60/20/20: 60% for training, 20% for validation, and 20% for testing (see the sketch after this list).
- The validation set and the test set must follow the same distribution, and both should faithfully represent the data distribution of the real application.
- If the training and validation sets follow the same distribution, a large gap between training error and validation error indicates a serious variance problem.
- If the training and validation sets follow different distributions, a large gap between training error and validation error can have two causes:
  - First: the model has only seen the training data and never the validation data, i.e. a data-mismatch problem.
  - Second: the model genuinely has high variance.
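A sketch of the 60/20/20 split mentioned above, done with two calls to train_test_split; the dataset and random_state are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# illustrative data
X, y = make_classification(n_samples=1000, random_state=0)

# first hold out 20% as the test set, then carve the validation set out of the rest:
# 0.25 of the remaining 80% yields the 60/20/20 split described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```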