Bluedeck Library 圖書館卡
Bluedeck Library 圖書館卡
存檔UTC時間 2022年6月24日 04:26 存檔編者 Kcx36 當前版本號 72322439
{{vfd|沒有足夠的可靠資料來源能夠讓這個條目符合Wikipedia:關注度 中的標準|date=2022/06/24}}
{{Notability|time=2022-05-25T01:09:01+00:00}}
帶有經驗回放的演員-評論家算法 ({{lang-en|Actor-Critic with Experience Replay}}),簡稱ACER 。是2017年由DeepMind 團隊在提出的算法。其論文發表在{{le|ICLR|International Conference on Learning Representations}}上。該文提出了一種基於深度強化學習Actor-Critic下帶有經驗回放 的算法,能夠在變化的環境中取得不錯的效果,其中包括了57個Atari遊戲以及一些需要持續控制的問題。[ 1]
特色
在強化學習 中,環境交互需要付出極大的代價;這與普通的分類、回歸問題不同,會消耗大量的時間和系統資源。有效採樣({{lang-en|Sample Efficiency}})的方法可以使得算法在與環境交互較少的情況下獲得較好的結果。其中,為了提高有效採樣,使用經驗回放是一個很好的方法。而在強化學習中,如果採樣時所選取的策略{{lang-en|policy}}與選取動作時所用到的策略不同,我們將這種情況稱之為離軌策略({{lang-en|off-policy}})控制。
ACER就是一種離軌策略下的演員評論家算法({{lang-en|off-policy actor-critic}})。
理論依據
對於離軌策略而言,我們採樣所得到的軌跡是與同軌策略({{lang-en|on-policy}})不同的。這裏同軌是指採樣時所用的策略與選取動作時的策略相同。所以需要用到重要性採樣來對其進行調整。加上重要性採樣的權重後策略梯度可以被寫作
g
^
=
(
∏
t
=
0
k
ρ
t
)
∑
t
=
0
k
(
∑
i
=
0
k
−
t
γ
i
r
t
+
i
)
∇
θ
log
π
θ
(
a
t
|
x
t
)
{\displaystyle {\hat {g}}=(\prod _{t=0}^{k}\rho _{t})\sum _{t=0}^{k}(\sum _{i=0}^{k-t}\gamma ^{i}r_{t+i})\nabla _{\theta }\log \pi _{\theta }(a_{t}|x_{t})}
據Off-Policy Actor-Critic稱,離線策略的策略梯度可以拆解為
g
=
E
β
[
ρ
t
∇
θ
(
a
t
|
x
t
)
Q
π
(
x
t
,
a
t
)
]
{\displaystyle g=\mathbb {E_{\beta }} [\rho _{t}\nabla _{\theta }(a_{t}|x_{t})Q^{\pi }(x_{t},a_{t})]}
[ 2]
過程
由於重要性採樣的參數
ρ
t
=
π
(
a
t
|
x
t
)
μ
(
a
t
|
x
t
)
{\displaystyle \rho _{t}={\pi (a_{t}|x_{t}) \over \mu (a_{t}|x_{t})}}
是一個比值,有可能非常大或者非常小,嚴重影響算法的穩定性,所以使用了帶有偏差矯正的重要性權重截斷技術,使得
E
μ
[
ρ
t
⋅
⋅
⋅
]
=
E
μ
[
ρ
¯
t
⋅
⋅
⋅
]
+
E
a
∼
π
[
[
ρ
t
(
a
)
−
c
ρ
t
]
+
⋅
⋅
⋅
]
{\displaystyle \mathbb {E} _{\mu }[\rho _{t}\cdot \cdot \cdot ]=\mathbb {E} _{\mu }[{{\bar {\rho }}_{t}\cdot \cdot \cdot }]+\mathbb {E} _{a\sim \pi }[[{\rho _{t}(a)-c \over \rho _{t}}]+\cdot \cdot \cdot ]}
,其中
ρ
¯
t
=
m
i
n
(
c
,
ρ
t
)
{\displaystyle {\bar {\rho }}_{t}=min(c,\rho _{t})}
,這樣的變換既不會產生額外的偏差,而且產生的兩項各自都是有界的,第一項
E
μ
[
ρ
¯
t
⋅
⋅
⋅
]
<
c
{\displaystyle \mathbb {E} _{\mu }[{{\bar {\rho }}_{t}\cdot \cdot \cdot }]<c}
,第二項
E
a
∼
π
[
[
ρ
t
(
a
)
−
c
ρ
t
]
+
⋅
⋅
⋅
]
<
1
{\displaystyle \mathbb {E} _{a\sim \pi }[[{\rho _{t}(a)-c \over \rho _{t}}]+\cdot \cdot \cdot ]<1}
動作值函數
Q
π
(
x
t
,
a
t
)
{\displaystyle Q^{\pi }(x_{t},a_{t})}
的估計使用了回溯技術。
Q
r
e
t
(
x
t
,
a
t
)
=
r
t
+
γ
ρ
t
+
1
¯
[
Q
r
e
t
(
x
t
+
1
,
a
t
+
1
)
−
Q
(
x
t
+
1
,
a
t
+
1
)
]
+
γ
V
(
x
t
+
1
)
{\displaystyle Q^{ret}(x_{t},a_{t})=r_{t}+\gamma {\bar {\rho _{t+1}}}[Q^{ret}(x_{t+1},a_{t+1})-Q(x_{t+1},a_{t+1})]+\gamma V(x_{t+1})}
以上的Q函數和V函數的估計使用了dueling network的結構。使用採樣的方法計算
Q
~
θ
v
(
x
t
,
a
t
)
∼
V
θ
v
(
x
t
)
+
A
θ
v
(
x
t
,
a
t
)
−
1
n
∑
i
=
1
n
A
θ
v
(
x
t
,
u
i
)
,
a
n
d
u
i
∼
π
θ
(
⋅
|
x
t
)
{\displaystyle {\tilde {Q}}_{\theta _{v}}(x_{t},a_{t})\sim V_{\theta _{v}}(x_{t})+A_{\theta _{v}}(x_{t},a_{t})-{1 \over n}\sum _{i=1}^{n}A_{\theta _{v}}(x_{t},u_{i}),and\ \ u_{i}\sim \pi _{\theta }(\cdot |x_{t})}
這樣輸出的網絡為
Q
θ
v
{\displaystyle Q_{\theta _{v}}}
和
A
θ
v
{\displaystyle A_{\theta _{v}}}
綜合前三項,最終得到了ACER的離線策略梯度解析失败 (语法错误): {\displaystyle \widehat{g_t}^{acer} = \bar{\rho_t}\nabla _{\phi _\theta(x_t)}\log f(a_t|\phi_\theta(x))[Q^ret(x_t,a_t)- V_{\theta_v}(x_t)]+\mathbb{E}_{a\sim\pi}([{{((}}\rho_t(a)-c}\over{\rho_t(a)}}]_+\nabla_{\phi_\theta(x_t)} \log f(a_t|\phi_\theta(x))[Q_{\theta_v}(x_t,a)-V_{\theta_v}(x_t)]}
通過寫出信賴域最優化問題
m
i
n
i
m
i
z
e
z
1
2
‖
g
t
^
a
c
e
r
−
z
‖
2
2
{\displaystyle minimize_{z}\ \ {1 \over 2}\left\Vert \ {\hat {g_{t}}}^{acer}-z\right\Vert _{2}^{2}}
s
u
b
j
e
c
t
t
o
∇
ϕ
θ
(
x
t
)
D
K
L
[
f
(
⋅
|
ϕ
θ
a
(
x
t
)
)
|
|
f
(
⋅
|
ϕ
θ
(
x
t
)
)
]
T
z
≤
δ
{\displaystyle subject\ \ to\ \ \nabla _{\phi _{\theta }(x_{t})D_{KL}}[f(\cdot |\phi _{\theta _{a}}(x_{t}))||f(\cdot |\phi _{\theta }(x_{t}))]^{T}z\leq \delta }
直接解析求得最優解解析失败 (语法错误): {\displaystyle z^* = \hat{g_t}^{acer}-max\{ 0,{{((}}k^T \hat{g_t}^{acer}-\delta}\over {||k||^2_2}} \}k}
得到參數更新公式解析失败 (语法错误): {\displaystyle \theta\leftarrow \theta +{{((}}\partial \phi_\theta(x)}\over{\partial\theta}}z^*}
算法流程
算法1:對於離散動作情況下ACER的主算法
初始化全局共享參數向量
θ
a
n
d
θ
v
{\displaystyle \theta \ \ and\ \ \theta _{v}}
設置回放率
r
{\displaystyle r}
在達到最大迭代次數或者時間用完前:
調用算法2中的在線策略ACER
n
←
P
o
s
s
i
o
n
(
r
)
{\displaystyle n\leftarrow \ \ Possion(r)}
對於
i
∈
{
1
,
⋅
⋅
⋅
,
n
}
{\displaystyle i\in \{1,\cdot \cdot \cdot ,n\}}
執行:
調用算法2中的離線策略ACER
算法2:離散動作下的ACER
重置梯度
d
θ
←
0
a
n
d
d
θ
v
←
0
{\displaystyle d\theta \leftarrow 0\ \ and\ \ d\theta _{v}\leftarrow 0}
初始化參數
θ
′
←
θ
a
n
d
θ
v
′
←
θ
v
{\displaystyle \theta '\leftarrow \theta \ \ and\ \ \theta '_{v}\leftarrow \theta _{v}}
如果不是在線策略:
從經驗回放中採樣軌跡
{
x
0
,
a
0
,
r
0
,
μ
(
⋅
|
x
)
,
⋅
⋅
⋅
,
x
k
,
a
k
,
r
k
,
μ
(
⋅
|
x
k
)
}
{\displaystyle \{x_{0},a_{0},r_{0},\mu (\cdot |x),\cdot \cdot \cdot ,x_{k},a_{k},r_{k},\mu (\cdot |x_{k})\}}
否則,獲取狀態
x
0
{\displaystyle x_{0}}
對於
i
∈
{
0
,
⋅
⋅
⋅
,
k
}
{\displaystyle i\in \{0,\cdot \cdot \cdot ,k\}}
執行:
計算
f
(
⋅
|
ϕ
θ
′
(
x
i
)
)
,
Q
θ
v
′
(
x
i
,
⋅
)
{\displaystyle f(\cdot |\phi _{\theta '}(x_{i})),Q_{\theta '_{v}}(x_{i},\cdot )}
和
f
(
⋅
|
ϕ
θ
a
(
x
i
)
)
{\displaystyle f(\cdot |\phi _{\theta _{a}}(x_{i}))}
如果是在線策略則
依據
f
(
⋅
|
ϕ
′
(
x
i
)
)
{\displaystyle f(\cdot |\phi '(x_{i}))}
執行動作
a
i
{\displaystyle a_{i}}
得到回報
r
i
{\displaystyle r_{i}}
和新的狀態
x
i
+
1
{\displaystyle x_{i+1}}
μ
(
⋅
|
x
i
)
←
f
(
⋅
|
ϕ
θ
′
(
x
i
)
)
{\displaystyle \mu (\cdot |x_{i})\leftarrow f(\cdot |\phi _{\theta '}(x_{i}))}
解析失败 (语法错误): {\displaystyle \bar{\rho_i}\leftarrow min\{1,{{((}}f(a_i|\phi_{\theta'}(xi))}\over{\mu(a_i|x_i)}} \}}
Q
r
e
t
←
{
0
f
o
r
t
e
r
m
i
n
i
a
l
x
k
∑
a
Q
θ
v
′
(
x
k
,
a
)
f
(
a
|
ϕ
θ
′
(
x
k
)
)
o
t
h
e
r
w
i
s
e
{\displaystyle Q^{ret}\leftarrow {\begin{cases}0\ \ for\ \ terminial\ \ x_{k}\\\sum _{a}Q_{\theta '_{v}}(x_{k},a)f(a|\phi _{\theta '}(x_{k}))\ \ otherwise\end{cases}}}
對於
i
∈
{
k
−
1
,
⋅
⋅
⋅
,
0
}
{\displaystyle i\in \{k-1,\cdot \cdot \cdot ,0\}}
執行
Q
r
e
t
←
r
i
+
γ
Q
r
e
t
{\displaystyle Q^{ret}\leftarrow r_{i}+\gamma Q^{ret}}
V
i
←
∑
a
Q
θ
v
′
(
x
i
,
a
)
f
(
a
|
ϕ
θ
′
(
x
i
)
)
{\displaystyle V_{i}\leftarrow \sum _{a}Q_{\theta '_{v}}(x_{i},a)f(a|\phi _{\theta '}(x_{i}))}
計算信賴域更新所需的:
解析失败 (语法错误): {\displaystyle g \leftarrow min \{ c,\rho_i(a_i)\} \nabla_{\phi_'(x_i)}\log f(a_i|\phi_{\theta'}(x_i))(Q^{ret}-V_i)+ \sum_a[1-{{((}}c}\over{\rho_i(a)}}]_+ f(a|\phi_{\theta'}(x_i))\nabla_{\phi_{\theta'}(x_i)}\log f(a|\phi_{\theta'}(x_i))(Q_{\theta'_v}(x_i,a_i)-V_i)}
k
←
∇
ϕ
θ
′
(
x
i
)
D
K
L
[
f
(
⋅
|
ϕ
θ
a
(
x
i
)
)
|
|
f
(
⋅
|
ϕ
θ
′
(
x
i
)
]
{\displaystyle k\leftarrow \nabla _{\phi _{\theta '}(x_{i})}D_{KL}[f(\cdot |\phi _{\theta _{a}}(x_{i}))||f(\cdot |\phi _{\theta '}(x_{i})]}
累積梯度解析失败 (语法错误): {\displaystyle \theta':d\theta'\leftarrow +{{((}}\partial \phi_{\theta'}(x_i)}\over{\partial\theta'}}(g-max\{ 0,{{((}}k^Tg-\delta}\over{||k||^2_2}}k \})}
累積梯度
θ
v
′
:
d
θ
v
+
∇
θ
v
′
(
Q
r
e
t
−
Q
θ
v
′
(
x
i
,
a
i
)
)
+
V
i
{\displaystyle \theta '_{v}:d\theta _{v}+\nabla _{\theta '_{v}}(Q^{ret}-Q_{\theta '_{v}}(x_{i},a_{i}))+V_{i}}
用
d
θ
,
d
θ
v
{\displaystyle d\theta ,d\theta _{v}}
分別異步更新
θ
,
θ
v
{\displaystyle \theta ,\theta _{v}}
更新平均策略網絡:
θ
a
←
α
θ
a
+
(
1
−
α
)
θ
{\displaystyle \theta _{a}\leftarrow \alpha \theta _{a}+(1-\alpha )\theta }
參考文獻
延伸閱讀