
Training a Machine Learning Model in Pure SQL

2024-11-07 12:17:35

select type, count(*) from iris group by type;
+-----------------+----------+
| type            | count(*) |
+-----------------+----------+
| Iris-versicolor |       50 |
| Iris-setosa     |       50 |
| Iris-virginica  |       50 |
+-----------------+----------+
3 rows in set (0.00 sec)

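Softmax Logistic Regression

I chose a simple machine learning model: softmax logistic regression for multi-class classification. The model assigns every sample a probability for each class, uses the cross-entropy between those probabilities and the true labels as its cost function, and updates its weights by gradient descent on the gradient of that cost.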
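Written out, the model can be summarized by four formulas. This is a reconstruction from the standard definition of softmax regression, chosen to agree with the pseudocode and SQL later in this article. The symbols $N$ (number of samples), $K = 3$ (number of classes), $\alpha$ (learning rate), and the indices $i$, $j$, $k$ are notation assumed here, not taken from the article.

```latex
% Probability that sample x_i belongs to class k (softmax over the K weighted sums)
P(y_i = k \mid x_i; w) = \frac{\exp(w_k^{\top} x_i)}{\sum_{j=0}^{K-1} \exp(w_j^{\top} x_i)}

% Cost function: average cross-entropy, where y_{ik} is the one-hot label
J(w) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=0}^{K-1} y_{ik} \log P(y_i = k \mid x_i; w)

% Gradient with respect to the weight vector of class k
\nabla_{w_k} J(w) = -\frac{1}{N} \sum_{i=1}^{N} \bigl( y_{ik} - P(y_i = k \mid x_i; w) \bigr)\, x_i

% Gradient-descent update with learning rate \alpha
w_k \leftarrow w_k - \alpha \nabla_{w_k} J(w)
    = w_k + \frac{\alpha}{N} \sum_{i=1}^{N} \bigl( y_{ik} - P(y_i = k \mid x_i; w) \bigr)\, x_i
```

The last line is the form the training code uses: each weight grows by the learning rate times the average of (label minus predicted probability) times the input value.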

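Model Inference

I wrote an SQL statement to implement inference. Based on the model and data defined above, each input x has five dimensions (SL, SW, PL, PW, and a constant 1.0), and the output uses one-hot encoding. The SQL that builds the data table is as follows: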

create table data(
    x0 decimal(35, 30), x1 decimal(35, 30), x2 decimal(35, 30), x3 decimal(35, 30), x4 decimal(35, 30),
    y0 decimal(35, 30), y1 decimal(35, 30), y2 decimal(35, 30)
);

insert into data
select
    sl, sw, pl, pw, 1.0,
    case when type = 'Iris-setosa' then 1 else 0 end,
    case when type = 'Iris-versicolor' then 1 else 0 end,
    case when type = 'Iris-virginica' then 1 else 0 end
from iris;
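The same row transformation can be sketched outside the database. The Python below is only an illustration of the encoding, not part of the SQL workflow; the function name `encode_row` and the sample values are assumptions made for the example.

```python
# Class order mirrors the y0, y1, y2 columns of the data table.
CLASSES = ("Iris-setosa", "Iris-versicolor", "Iris-virginica")


def encode_row(sl, sw, pl, pw, iris_type):
    """Return (x, y): five inputs ending in the constant 1.0, and a one-hot label."""
    x = (sl, sw, pl, pw, 1.0)
    y = tuple(1 if iris_type == name else 0 for name in CLASSES)
    return x, y


# One illustrative row (values chosen for the example).
x, y = encode_row(5.1, 3.5, 1.4, 0.2, "Iris-setosa")
print(x)  # (5.1, 3.5, 1.4, 0.2, 1.0)
print(y)  # (1, 0, 0)
```

The constant 1.0 in the fifth position lets the fifth weight of each class act as a bias term.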

There are 15 parameters in total (3 classes × 5 dimensions). The SQL that creates the weight table is as follows:

create table weight(
    w00 decimal(35, 30), w01 decimal(35, 30), w02 decimal(35, 30), w03 decimal(35, 30), w04 decimal(35, 30),
    w10 decimal(35, 30), w11 decimal(35, 30), w12 decimal(35, 30), w13 decimal(35, 30), w14 decimal(35, 30),
    w20 decimal(35, 30), w21 decimal(35, 30), w22 decimal(35, 30), w23 decimal(35, 30), w24 decimal(35, 30)
);

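I initialized the weights to 0.1, 0.2, and 0.3, one value per class. I used different numbers only to make the demonstration easier to follow; initializing them all to 0.1 would also work. The SQL is as follows: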

insert into weight values (
    0.1, 0.1, 0.1, 0.1, 0.1,
    0.2, 0.2, 0.2, 0.2, 0.2,
    0.3, 0.3, 0.3, 0.3, 0.3);

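Next, I wrote an SQL statement to calculate the accuracy of the model's inference results. To make it easier to understand, I first describe the process in pseudocode: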

weight = (
    w00, w01, w02, w03, w04,
    w10, w11, w12, w13, w14,
    w20, w21, w22, w23, w24
)

for data(x0, x1, x2, x3, x4, y0, y1, y2) in all Data:
    // exponential of each class's weighted sum
    exp0 = exp(x0 * w00 + x1 * w01 + x2 * w02 + x3 * w03 + x4 * w04)
    exp1 = exp(x0 * w10 + x1 * w11 + x2 * w12 + x3 * w13 + x4 * w14)
    exp2 = exp(x0 * w20 + x1 * w21 + x2 * w22 + x3 * w23 + x4 * w24)
    sum_exp = exp0 + exp1 + exp2

    // softmax
    p0 = exp0 / sum_exp
    p1 = exp1 / sum_exp
    p2 = exp2 / sum_exp

    // inference result
    r0 = p0 > p1 and p0 > p2
    r1 = p1 > p0 and p1 > p2
    r2 = p2 > p0 and p2 > p1

    data.correct = (y0 == r0 and y1 == r1 and y2 == r2)

return sum(Data.correct) / count(Data)

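The pseudocode above processes each row of data in turn. To infer the class of a sample, I first compute the exponential of each class's weighted sum, then compute the softmax values, and finally mark the largest of p0, p1, and p2 as 1 and set the others to 0.

If the inferred result for a sample matches its actual class, the prediction is correct. I then count the correct predictions over all samples and divide by the number of samples to get the final accuracy.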
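As a cross-check of the steps just described, here is a minimal Python sketch of the same computation. It is an illustration only: the helper names are assumptions, and the three sample rows are illustrative stand-ins for the 150-row table, one per class.

```python
import math

# Initial weights from the article: one row of five weights per class.
WEIGHTS = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3, 0.3, 0.3],
]

# Three illustrative rows as (x, y): x ends with the constant 1.0, y is one-hot.
SAMPLES = [
    ((5.1, 3.5, 1.4, 0.2, 1.0), (1, 0, 0)),
    ((7.0, 3.2, 4.7, 1.4, 1.0), (0, 1, 0)),
    ((6.3, 3.3, 6.0, 2.5, 1.0), (0, 0, 1)),
]


def softmax_probs(weights, x):
    """Exponentiate each class's weighted sum, then normalize."""
    exps = [math.exp(sum(w * v for w, v in zip(row, x))) for row in weights]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Return a one-hot tuple marking the strictly largest probability."""
    p = softmax_probs(weights, x)
    return tuple(
        int(all(p[k] > p[j] for j in range(len(p)) if j != k))
        for k in range(len(p))
    )


def accuracy(weights, samples):
    """Fraction of samples whose predicted one-hot vector equals the label."""
    correct = sum(1 for x, y in samples if predict(weights, x) == y)
    return correct / len(samples)


print(accuracy(WEIGHTS, SAMPLES))  # 0.3333333333333333
```

With these initial weights every input value is positive and the third class has the largest weights, so every sample is assigned to the third class. With the classes equally represented, only one third of the predictions can be correct.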

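The following code shows the SQL implementation. I join each row of data with the weight table (which has only one row), compute the inference result for each row, and count the correctly predicted samples: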

select sum(y0 = r0 and y1 = r1 and y2 = r2) / count(*)
from
    (select
        y0, y1, y2,
        p0 > p1 and p0 > p2 as r0, p1 > p0 and p1 > p2 as r1, p2 > p0 and p2 > p1 as r2
    from
        (select
            y0, y1, y2,
            e0/(e0+e1+e2) as p0, e1/(e0+e1+e2) as p1, e2/(e0+e1+e2) as p2
        from
            (select
                y0, y1, y2,
                exp(
                    w00 * x0 + w01 * x1 + w02 * x2 + w03 * x3 + w04 * x4
                ) as e0,
                exp(
                    w10 * x0 + w11 * x1 + w12 * x2 + w13 * x3 + w14 * x4
                ) as e1,
                exp(
                    w20 * x0 + w21 * x1 + w22 * x2 + w23 * x3 + w24 * x4
                ) as e2
            from data, weight) t1
        ) t2
    ) t3;

The SQL statement above implements the pseudocode almost step by step. I got the following result:

+-----------------------------------------------+
| sum(y0 = r0 and y1 = r1 and y2 = r2)/count(*) |
+-----------------------------------------------+
|                                        0.3333 |
+-----------------------------------------------+
1 row in set (0.01 sec)

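Next, I started to learn the model parameters.

Model Training

Note: To keep the problem simple, I did not split the data into a training set and a validation set. I used all the data for training.

I wrote the pseudocode first and then wrote an SQL statement based on it: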

weight = (
    w00, w01, w02, w03, w04,
    w10, w11, w12, w13, w14,
    w20, w21, w22, w23, w24
)

for iter in iterations:
    sum00 = 0
    sum01 = 0
    ...
    sum23 = 0
    sum24 = 0

    for data(x0, x1, x2, x3, x4, y0, y1, y2) in all Data:
        exp0 = exp(x0 * w00 + x1 * w01 + x2 * w02 + x3 * w03 + x4 * w04)
        exp1 = exp(x0 * w10 + x1 * w11 + x2 * w12 + x3 * w13 + x4 * w14)
        exp2 = exp(x0 * w20 + x1 * w21 + x2 * w22 + x3 * w23 + x4 * w24)
        sum_exp = exp0 + exp1 + exp2

        // label minus softmax probability
        p0 = y0 - exp0 / sum_exp
        p1 = y1 - exp1 / sum_exp
        p2 = y2 - exp2 / sum_exp

        sum00 += p0 * x0
        sum01 += p0 * x1
        sum02 += p0 * x2
        ...
        sum23 += p2 * x3
        sum24 += p2 * x4

    w00 = w00 + learning_rate * sum00 / Data.size
    w01 = w01 + learning_rate * sum01 / Data.size
    ...
    w23 = w23 + learning_rate * sum23 / Data.size
    w24 = w24 + learning_rate * sum24 / Data.size

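Because I manually expanded the sum and w vectors, this code looks a bit verbose. Then I started writing the SQL training code. First, I wrote an SQL statement that runs a single iteration.

I set the learning rate and the number of samples as follows: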

set @lr = 0.1;
Query OK, 0 rows affected (0.00 sec)

set @dsize = 150;
Query OK, 0 rows affected (0.00 sec)

The following statement runs one iteration:

select
    w00 + @lr * sum(d00) / @dsize as w00, w01 + @lr * sum(d01) / @dsize as w01, w02 + @lr * sum(d02) / @dsize as w02, w03 + @lr * sum(d03) / @dsize as w03, w04 + @lr * sum(d04) / @dsize as w04,
    w10 + @lr * sum(d10) / @dsize as w10, w11 + @lr * sum(d11) / @dsize as w11, w12 + @lr * sum(d12) / @dsize as w12, w13 + @lr * sum(d13) / @dsize as w13, w14 + @lr * sum(d14) / @dsize as w14,
    w20 + @lr * sum(d20) / @dsize as w20, w21 + @lr * sum(d21) / @dsize as w21, w22 + @lr * sum(d22) / @dsize as w22, w23 + @lr * sum(d23) / @dsize as w23, w24 + @lr * sum(d24) / @dsize as w24
from
    (select
        w00, w01, w02, w03, w04,
        w10, w11, w12, w13, w14,
        w20, w21, w22, w23, w24,
        p0 * x0 as d00, p0 * x1 as d01, p0 * x2 as d02, p0 * x3 as d03, p0 * x4 as d04,
        p1 * x0 as d10, p1 * x1 as d11, p1 * x2 as d12, p1 * x3 as d13, p1 * x4 as d14,
        p2 * x0 as d20, p2 * x1 as d21, p2 * x2 as d22, p2 * x3 as d23, p2 * x4 as d24
    from
        (select
            w00, w01, w02, w03, w04,
            w10, w11, w12, w13, w14,
            w20, w21, w22, w23, w24,
            x0, x1, x2, x3, x4,
            y0 - e0/(e0+e1+e2) as p0, y1 - e1/(e0+e1+e2) as p1, y2 - e2/(e0+e1+e2) as p2
        from
            (select
                w00, w01, w02, w03, w04,
                w10, w11, w12, w13, w14,
                w20, w21, w22, w23, w24,
                x0, x1, x2, x3, x4, y0, y1, y2,
                exp(
                    w00 * x0 + w01 * x1 + w02 * x2 + w03 * x3 + w04 * x4
                ) as e0,
                exp(
                    w10 * x0 + w11 * x1 + w12 * x2 + w13 * x3 + w14 * x4
                ) as e1,
                exp(
                    w20 * x0 + w21 * x1 + w22 * x2 + w23 * x3 + w24 * x4
                ) as e2
            from data, weight) t1
        ) t2
    ) t3;

After one iteration, the result is the updated set of model parameters.
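For comparison, a single update of this kind can be sketched in Python with loops instead of manually expanded columns. This is a minimal sketch under stated assumptions: the function name `gradient_step` is hypothetical, and the three sample rows are illustrative stand-ins for the full table.

```python
import math


def gradient_step(weights, samples, lr):
    """Return new weights after one batch gradient-descent update.

    weights: list of K rows, each with D weights.
    samples: list of (x, y) pairs, x of length D, y one-hot of length K.
    """
    k_classes, dims = len(weights), len(weights[0])
    sums = [[0.0] * dims for _ in range(k_classes)]  # plays the role of sum00..sum24
    for x, y in samples:
        exps = [math.exp(sum(w * v for w, v in zip(row, x))) for row in weights]
        total = sum(exps)
        for k in range(k_classes):
            err = y[k] - exps[k] / total  # the p0, p1, p2 terms of the training pseudocode
            for d in range(dims):
                sums[k][d] += err * x[d]
    n = len(samples)
    return [
        [weights[k][d] + lr * sums[k][d] / n for d in range(dims)]
        for k in range(k_classes)
    ]


weights = [[0.1] * 5, [0.2] * 5, [0.3] * 5]
samples = [
    ((5.1, 3.5, 1.4, 0.2, 1.0), (1, 0, 0)),
    ((7.0, 3.2, 4.7, 1.4, 1.0), (0, 1, 0)),
    ((6.3, 3.3, 6.0, 2.5, 1.0), (0, 0, 1)),
]
new_weights = gradient_step(weights, samples, lr=0.1)
for row in new_weights:
    print(row)
```

Because the three error terms of a sample always sum to zero, the updates to the weights of one dimension cancel across the three classes. That gives a quick sanity check: for each dimension, the sum of the three class weights is the same before and after the step.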

The next part is the core: I use a recursive CTE to run the training iterations:

set @num_iterations = 1000;
Query OK, 0 rows affected (0.00 sec)

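The core idea is that the input of each iteration is the result of the previous iteration. In addition, I added an incrementing iteration counter to control the number of iterations. The overall structure is: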

with recursive cte(iter, weight) as
(
    select 1, init_weight
    union all
    select iter + 1, new_weight
    from cte
    where iter < @num_iterations
)
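The control flow of this outline can be mimicked in a few lines of Python. The sketch below is only an analogy for how the seed row and the recursive step interact; `iterate_like_recursive_cte` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for a real training step.

```python
def iterate_like_recursive_cte(init_weight, step, num_iterations):
    """Yield (iter, weight) rows the way the CTE outline produces them.

    The seed row is (1, init_weight); each further row is built from the
    previous one while `iter < num_iterations` holds.
    """
    iter_no, weight = 1, init_weight
    yield iter_no, weight
    while iter_no < num_iterations:
        iter_no, weight = iter_no + 1, step(weight)
        yield iter_no, weight


# Hypothetical stand-in for one training step: halve the distance to 1.0.
def toy_step(w):
    return w + 0.5 * (1.0 - w)


rows = list(iterate_like_recursive_cte(0.0, toy_step, num_iterations=4))
print(rows)      # [(1, 0.0), (2, 0.5), (3, 0.75), (4, 0.875)]
print(rows[-1])  # the row that filtering on iter = num_iterations would keep
```

Note that the seed row counts as iteration 1, so a run with `num_iterations` set to n applies the step n - 1 times.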

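Next, I combined the single-iteration SQL statement with this iteration structure. To preserve calculation precision, I added type conversions to the intermediate results: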

with recursive weight(iter,
        w00, w01, w02, w03, w04,
        w10, w11, w12, w13, w14,
        w20, w21, w22, w23, w24) as
(
select 1,
    cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)),
    cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)),
    cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30)), cast(0.1 as DECIMAL(35, 30))
union all
select
    iter + 1,
    w00 + @lr * cast(sum(d00) as DECIMAL(35, 30)) / @dsize as w00, w01 + @lr * cast(sum(d01) as DECIMAL(35, 30)) / @dsize as w01, w02 + @lr * cast(sum(d02) as DECIMAL(35, 30)) / @dsize as w02, w03 + @lr * cast(sum(d03) as DECIMAL(35, 30)) / @dsize as w03, w04 + @lr * cast(sum(d04) as DECIMAL(35, 30)) / @dsize as w04,
    w10 + @lr * cast(sum(d10) as DECIMAL(35, 30)) / @dsize as w10, w11 + @lr * cast(sum(d11) as DECIMAL(35, 30)) / @dsize as w11, w12 + @lr * cast(sum(d12) as DECIMAL(35, 30)) / @dsize as w12, w13 + @lr * cast(sum(d13) as DECIMAL(35, 30)) / @dsize as w13, w14 + @lr * cast(sum(d14) as DECIMAL(35, 30)) / @dsize as w14,
    w20 + @lr * cast(sum(d20) as DECIMAL(35, 30)) / @dsize as w20, w21 + @lr * cast(sum(d21) as DECIMAL(35, 30)) / @dsize as w21, w22 + @lr * cast(sum(d22) as DECIMAL(35, 30)) / @dsize as w22, w23 + @lr * cast(sum(d23) as DECIMAL(35, 30)) / @dsize as w23, w24 + @lr * cast(sum(d24) as DECIMAL(35, 30)) / @dsize as w24
from
    (select
        iter, w00, w01, w02, w03, w04,
        w10, w11, w12, w13, w14,
        w20, w21, w22, w23, w24,
        p0 * x0 as d00, p0 * x1 as d01, p0 * x2 as d02, p0 * x3 as d03, p0 * x4 as d04,
        p1 * x0 as d10, p1 * x1 as d11, p1 * x2 as d12, p1 * x3 as d13, p1 * x4 as d14,
        p2 * x0 as d20, p2 * x1 as d21, p2 * x2 as d22, p2 * x3 as d23, p2 * x4 as d24
    from
        (select
            iter, w00, w01, w02, w03, w04,
            w10, w11, w12, w13, w14,
            w20, w21, w22, w23, w24,
            x0, x1, x2, x3, x4,
            y0 - e0/(e0+e1+e2) as p0, y1 - e1/(e0+e1+e2) as p1, y2 - e2/(e0+e1+e2) as p2
        from
            (select
                iter, w00, w01, w02, w03, w04,
                w10, w11, w12, w13, w14,
                w20, w21, w22, w23, w24,
                x0, x1, x2, x3, x4, y0, y1, y2,
                exp(
                    w00 * x0 + w01 * x1 + w02 * x2 + w03 * x3 + w04 * x4
                ) as e0,
                exp(
                    w10 * x0 + w11 * x1 + w12 * x2 + w13 * x3 + w14 * x4
                ) as e1,
                exp(
                    w20 * x0 + w21 * x1 + w22 * x2 + w23 * x3 + w24 * x4
                ) as e2
            from data, weight where iter < @num_iterations) t1
        ) t2
    ) t3
having count(*) > 0
)
select * from weight where iter = @num_iterations;

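This code block differs from the single-iteration block in two ways:

After joining data with weight, I added where iter < @num_iterations to control the number of iterations, and I added the iter + 1 column to the output so that each row carries its iteration number.

I added having count(*) > 0 so that the aggregation does not output a row once there is no more input data. Otherwise, the iteration may fail to terminate.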

Running the code above produced this result:

ERROR 3577 (HY000): In recursive query block of Recursive Common Table Expression 'weight', the recursive table must be referenced only once, and not in any subquery

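This shows that a recursive CTE does not allow subqueries in its recursive part. That alone is not a blocker, because all the subqueries above can be merged into one query. But even after I merged them manually, I got another error: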

ERROR 3575 (HY000): Recursive Common Table Expression 'cte' can contain neither aggregation nor window functions in recursive query block

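This shows that aggregate functions are not allowed either. So I decided to change TiDB's implementation.

According to the description in the proposal, the implementation of recursive CTEs follows TiDB's basic execution model. After consulting Wenjun Huang, an R&D engineer at PingCAP, I learned that there are two reasons why subqueries and aggregate functions are not allowed:

MySQL does not allow them.

If they were allowed, there would be many complicated corner cases to handle.

But I only wanted to test these features. So I temporarily removed the checks for subqueries and aggregate functions in a diff.

Finally, I ran the rewritten code again.

It worked! After 1,000 iterations, I got the parameters.

Next, I used the new parameters to recompute the accuracy: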

+-------------------------------------------------+
| sum(y0 = r0 and y1 = r1 and y2 = r2) / count(*) |
+-------------------------------------------------+
|                                          0.9867 |
+-------------------------------------------------+
1 row in set (0.02 sec)

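This time, the accuracy reached 98%.

Conclusion

By using recursive CTEs in TiDB 5.1, I successfully trained a softmax logistic regression model on TiDB in pure SQL.

During testing, I found that TiDB's recursive CTEs do not allow subqueries or aggregate functions, so I modified TiDB's code to bypass these restrictions. In the end, I successfully trained a model and reached 98% accuracy on the Iris dataset.

Finally, as a supplement, here are a few more thoughts from this work:

After some tests, I found that neither PostgreSQL nor MySQL supports aggregate functions in recursive CTEs, probably because there are tricky corner cases that are hard to handle.

In this test, I manually expanded all the dimensions of the vectors. In fact, I also wrote an implementation that does not need to expand them. For example, the schema of the data table is (idx, dim, value). In that implementation, however, the weight table must be joined twice, which means it is accessed twice inside the CTE, and that requires modifying the implementation of the TiDB executor as well. For this reason, I do not discuss that implementation in this article. It is more general, though, and could handle models with many more dimensions, such as one for the MNIST dataset.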
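To give a rough idea of what the (idx, dim, value) layout looks like, here is a minimal Python sketch of the weighted-sum step only. It does not reproduce the unpublished implementation: the (cls, dim, value) layout for the weights, the function name `weighted_sums`, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

# Long-format rows; the columns follow the (idx, dim, value) schema from the text.
data_rows = [  # (idx, dim, value)
    (0, 0, 5.1), (0, 1, 3.5), (0, 2, 1.4), (0, 3, 0.2), (0, 4, 1.0),
    (1, 0, 7.0), (1, 1, 3.2), (1, 2, 4.7), (1, 3, 1.4), (1, 4, 1.0),
]
# Hypothetical weight layout (cls, dim, value); the article does not give one.
weight_rows = [(c, d, 0.1 * (c + 1)) for c in range(3) for d in range(5)]


def weighted_sums(data_rows, weight_rows):
    """Join data and weights on dim, then sum value products per (idx, cls)."""
    by_dim = defaultdict(list)
    for cls, dim, w in weight_rows:
        by_dim[dim].append((cls, w))
    sums = defaultdict(float)
    for idx, dim, value in data_rows:
        for cls, w in by_dim[dim]:
            sums[(idx, cls)] += value * w
    return dict(sums)


for key, total in sorted(weighted_sums(data_rows, weight_rows).items()):
    print(key, round(total, 4))
```

In this layout the number of dimensions is a property of the rows rather than of the table definition, which is why the same code would work unchanged for inputs with hundreds of dimensions.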

Original title: I Trained a Machine Learning Model in Pure SQL. Author: Mingcong Han
