A quick comparison and summary of LogisticRegression's regularization methods, regularization strengths, solvers, and multi-class strategies
1. Data processing
1.1 Loading the data
```python
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
# With return_X_y=True, load_iris returns only (data, target)
print('type(data): ', type(data), '\ndir(data): ', dir(data))
# data behaves like a dict: data.data holds the iris features,
# data.target the class labels, data.target_names the class names
X = data.data
y = data.target
print('X.shape: ', X.shape, '\ny.shape: ', y.shape)
print('X[:5]: ', X[:5])
```
```
type(data): <class 'sklearn.utils.Bunch'>
dir(data): ['DESCR', 'data', 'feature_names', 'target', 'target_names']
X.shape: (150, 4)
y.shape: (150,)
X[:5]: [[ 5.1 3.5 1.4 0.2]
 [ 4.9 3.  1.4 0.2]
 [ 4.7 3.2 1.3 0.2]
 [ 4.6 3.1 1.5 0.2]
 [ 5.  3.6 1.4 0.2]]
```
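As the comment above notes, passing return_X_y=True skips the Bunch object entirely; a minimal equivalent of the two assignments above:

```python
from sklearn.datasets import load_iris

# Shortcut: fetch the feature matrix and labels without the Bunch wrapper
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```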
1.2 Splitting the dataset
```python
from sklearn.model_selection import train_test_split

# help(train_test_split)
# The default test_size is 0.25, so 150 samples split into 112 train / 38 test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print('X_train.shape: ', X_train.shape, '\nX_test.shape: ', X_test.shape)
```
```
X_train.shape: (112, 4)
X_test.shape: (38, 4)
```
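An optional variant: the iris classes are perfectly balanced, so plain shuffling is usually fine, but passing stratify=y (a standard train_test_split parameter) guarantees both splits keep the class proportions. Note that it would produce different splits from the ones used in the rest of this post:

```python
# Stratified variant: each split keeps the one-third-per-class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y)
```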
1.3 Polynomial features
```python
from sklearn.preprocessing import PolynomialFeatures

# degree=2 expands the 4 features into a bias column, the 4 original
# features, and all 10 degree-2 monomials (squares and pairwise products)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
print('X_train_poly.shape: ', X_train_poly.shape)
```
```
X_train_poly.shape: (112, 15)
```
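The 15 columns are exactly the number of monomials of total degree at most 2 in 4 variables, $\binom{4+2}{2} = 15$; a one-line sanity check (math.comb needs Python 3.8+):

```python
from math import comb

# Monomials of total degree <= 2 in 4 variables: C(4 + 2, 2)
print(comb(4 + 2, 2))  # 15
```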
1.4 Standardization
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_poly)
X_test_std = scaler.transform(X_test_poly)
print('X_train_std[:2]: ', X_train_std[:2])
```
```
X_train_std[:2]: [[ 0.          0.80347326 -0.53260596  0.45425177  0.3800208   0.76434725
   0.19304301  0.4909547   0.41205426 -0.56916951  0.32124649  0.27321335
   0.28336979  0.21071914  0.10785009]
 [ 0.          1.04344393 -1.25712258  1.1318482   0.7693104   1.03118022
  -0.25151435  1.17193559  0.84830534 -1.19111041  0.62917065  0.39964178
   1.26094208  0.95381193  0.64458339]]
```
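Note the leading 0. in both rows: the first polynomial column is the constant bias term, and StandardScaler maps a zero-variance column to all zeros rather than dividing by a zero standard deviation. A quick check:

```python
import numpy as np

# Column 0 is the constant bias feature added by PolynomialFeatures; its
# variance is 0, so StandardScaler leaves it at 0 for every sample
print(np.allclose(X_train_poly[:, 0], 1.0))  # True
print(np.allclose(X_train_std[:, 0], 0.0))   # True
```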
2. Importing the model
```python
from sklearn.linear_model import LogisticRegression
import time
import numpy as np
import pandas as pd

# help(LogisticRegression)
# multi_class='multinomial' only works with the solvers 'lbfgs', 'sag',
# 'newton-cg' (plus 'saga' since scikit-learn 0.19).
# 'newton-cg', 'sag' and 'lbfgs' support only L2 regularization.
# See the combination table at the end for details.
models = (
    ('ovr', 'l1', 'liblinear'),
    ('ovr', 'l1', 'saga'),
    ('ovr', 'l2', 'newton-cg'),
    ('ovr', 'l2', 'sag'),
    ('multinomial', 'l2', 'newton-cg'),
    ('multinomial', 'l2', 'sag')
)
train_time = []
test_score = []
for model in models:
    for c in (0.01, 1, 100):
        # print('multi_class= {0[0]},\tpenalty= {0[1]},\tsolver= {0[2]},\tc={1}'.format(model, c))
        logreg = LogisticRegression(
            multi_class=model[0],
            penalty=model[1],
            solver=model[2],
            random_state=1,
            C=c
        )
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        t1 = time.perf_counter()
        logreg.fit(X_train_std, y_train)
        # print('train_time: ', time.perf_counter() - t1)
        # print('Training set score: {:.5f}'.format(logreg.score(X_train_std, y_train)))
        # print('Test set score: {:.5f}\n\n'.format(logreg.score(X_test_std, y_test)))
        train_time.append(time.perf_counter() - t1)
        test_score.append(logreg.score(X_test_std, y_test))

# Collect the timings and scores into a pandas DataFrame
train_time_array = np.array(train_time).reshape((6, 3))
test_score_array = np.array(test_score).reshape((6, 3))
df_value = np.hstack((train_time_array, test_score_array))
df1 = pd.DataFrame({'multi_class': ['ovr']*4 + ['multinomial']*2,
                    'penalty': ['l1']*2 + ['l2']*4,
                    'solver': ['liblinear', 'saga'] + ['newton-cg', 'sag']*2,
                    })
df2 = pd.DataFrame(df_value,
                   columns=['train_time(c=0.01)', 'train_time(c=1)', 'train_time(c=100)',
                            'test_score(c=0.01)', 'test_score(c=1)', 'test_score(c=100)']
                   )
df = pd.concat([df1, df2], axis=1)
df
```
```
d:\anaconda3\lib\site-packages\sklearn\linear_model\sag.py:326: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
```
| | multi_class | penalty | solver | train_time(c=0.01) | train_time(c=1) | train_time(c=100) | test_score(c=0.01) | test_score(c=1) | test_score(c=100) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ovr | l1 | liblinear | 0.001772 | 0.021796 | 0.047019 | 0.342105 | 0.973684 | 0.973684 |
| 1 | ovr | l1 | saga | 0.002186 | 0.054482 | 0.061748 | 0.236842 | 0.973684 | 1.000000 |
| 2 | ovr | l2 | newton-cg | 0.027565 | 0.039025 | 0.084678 | 0.789474 | 0.973684 | 1.000000 |
| 3 | ovr | l2 | sag | 0.010913 | 0.062106 | 0.044849 | 0.789474 | 0.973684 | 1.000000 |
| 4 | multinomial | l2 | newton-cg | 0.027572 | 0.018871 | 0.045772 | 0.868421 | 0.973684 | 0.973684 |
| 5 | multinomial | l2 | sag | 0.003055 | 0.014107 | 0.015142 | 0.868421 | 0.973684 | 1.000000 |
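The ConvergenceWarning means sag exhausted the default iteration budget (max_iter=100) before coef_ converged, so the sag timings above are for a capped run; a hedged fix is simply to raise max_iter, for example:

```python
# Give sag enough iterations to converge on this data (default max_iter=100)
logreg = LogisticRegression(multi_class='multinomial', penalty='l2',
                            solver='sag', C=100, max_iter=1000,
                            random_state=1)
logreg.fit(X_train_std, y_train)
```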
From the table above we can see:
1. The larger C is, the weaker the regularization, and LogisticRegression fits the training set as closely as it can; the smaller C is, the stronger the regularization, pushing the coefficient vector (w) toward 0 and making the decision boundary closer to a straight line (see the sketch after this list).
2. Of the two solvers that support L1 regularization, liblinear is faster than saga here, but when the dataset is very large saga becomes faster than liblinear.
3. On this small dataset, sag is faster than newton-cg for L2 regularization.
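A minimal sketch of the shrinkage described in point 1, reusing the preprocessed data from above: fit the L1 model at each C and count the coefficients that the penalty has driven to exactly zero.

```python
# Smaller C (stronger L1 penalty) drives more coefficients to exactly zero
for c in (0.01, 1, 100):
    clf = LogisticRegression(multi_class='ovr', penalty='l1',
                             solver='liblinear', C=c, random_state=1)
    clf.fit(X_train_std, y_train)
    print('C={:>6}: {} of {} coefficients are zero'.format(
        c, int((clf.coef_ == 0).sum()), clf.coef_.size))
```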
3. Summary of regularization methods, solvers, and multi-class strategies
3.1 Regularization
L1 regularization:
the original loss function plus the mean of the absolute values of the weights, scaled by λ
$$
C = C_0 + \frac{\lambda}{n}\sum_{w}|w|
$$
L2 regularization:
the original loss function plus the mean of the squared weights, scaled by λ
$$
C = C_0 + \frac{\lambda}{2n}\sum_{w}w^{2}
$$
Here $C_0$ is the original loss function and $C$ is the regularized loss; the $\frac{1}{2}$ in the L2 term cancels against the exponent when differentiating $w^2$. $\lambda$ is the adjustable regularization strength: too large and the model underfits, too small and it overfits. (Note that the $C$ hyperparameter of scikit-learn's LogisticRegression is the inverse of this strength, roughly $1/\lambda$, which is why larger C means weaker regularization in the table above.)
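A toy numeric sketch of the two penalty terms; the loss value c0, strength lam, sample count n, and weight vector w below are all made-up illustrative values:

```python
import numpy as np

c0, lam, n = 1.0, 0.5, 100           # toy original loss, strength, training-set size
w = np.array([0.5, -2.0, 0.0, 1.5])  # toy weight vector

l1 = c0 + lam / n * np.abs(w).sum()        # C = C0 + (lambda/n) * sum |w|
l2 = c0 + lam / (2 * n) * (w ** 2).sum()   # C = C0 + (lambda/2n) * sum w^2
print(l1, l2)
```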
3.2 Solvers (optimization algorithms)
- liblinear: built on the open-source LIBLINEAR library; uses coordinate descent to iteratively optimize the loss function.
- lbfgs: a quasi-Newton method; iteratively optimizes the loss using (an approximation of) its matrix of second derivatives, the Hessian.
- newton-cg: another member of the Newton family; likewise relies on the Hessian of the loss to iterate.
- sag: stochastic average gradient descent (see the Machine Learning - Gradient Descent post); each iteration computes the gradient from only a part of the samples, which suits large sample sizes.
- saga: a variant of sag that also handles non-smooth objectives.
Because the middle three solvers (lbfgs, newton-cg, sag) need first or second derivatives, and the loss built with L1 regularization is not continuously differentiable, L1 regularization cannot be used with them.
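scikit-learn enforces this restriction when fitting; a small sketch of the error raised when L1 is paired with one of these solvers:

```python
from sklearn.linear_model import LogisticRegression

try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X_train_std, y_train)
except ValueError as e:
    print(e)  # the message says lbfgs supports only l2 penalties
```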
Valid combinations of regularization method, multi-class strategy, and solver:

| case | multi_class | solver |
|---|---|---|
| L1 | ovr | liblinear, saga |
| L1[note] | multinomial | saga |
| L2 | ovr | liblinear, lbfgs, newton-cg, sag, saga |
| L2 | multinomial | lbfgs, newton-cg, sag, saga |

Note: scikit-learn 0.19 added support for the L1 + multinomial + saga combination.
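As a closing sketch, the combination from the note, run on the same preprocessed data (requires scikit-learn >= 0.19; max_iter is raised because saga can need many iterations here):

```python
# L1 + multinomial + saga, allowed since scikit-learn 0.19
logreg = LogisticRegression(multi_class='multinomial', penalty='l1',
                            solver='saga', C=1, max_iter=5000,
                            random_state=1)
logreg.fit(X_train_std, y_train)
print('Test set score: {:.5f}'.format(logreg.score(X_test_std, y_test)))
```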