
Doyle Stand v3 (SAC)

夏洛爾 | 2022-11-17 00:51:54


So far, all of the training has been done with PPO.
A friend was curious why I haven't used SAC, especially since in academic research there seem to be problems where SAC far outperforms PPO.
So I took exactly the same program as Doyle Stand v3 and retrained it with SAC.

// Apart from the SAC part, everything is identical to Doyle Stand v3
Doyle Stand v3 (SAC)
Experiment Goals:
1. Right after the moment of standing up, the character may still be unstable, so it has to settle into a standing-still state
2. Right after standing up, it may not be facing the target, so it has to turn toward the target
3. Train with SAC

Experiment Design:
1. Touching the ground with any weak point counts as failure (the tail and the sword are not weak points)
2.
if(weaknessOnGround)
{
    if(inferenceMode)
    {
        brainMode = DoyleMode.GetUp;
        SetModel("DoyleGetUp", getUpBrain);
        behaviorParameters.BehaviorType = BehaviorType.InferenceOnly;
    }
    else
    {
        AddReward(-1f);
        judge.outLife++;
        judge.Reset();
        return;
        // brainMode = DoyleMode.GetUp;
        // SetModel("DoyleGetUp", getUpBrain);
        // behaviorParameters.BehaviorType = BehaviorType.InferenceOnly;
    }
}
else if(doyleRoot.localPosition.y < -10f)
{
    if(inferenceMode)
    {
        brainMode = DoyleMode.GetUp;
        SetModel("DoyleGetUp", getUpBrain);
        behaviorParameters.BehaviorType = BehaviorType.InferenceOnly;
    }
    else
    {
        AddReward(-1f);
        judge.outY++;
        judge.Reset();
        return;
        // brainMode = DoyleMode.GetUp;
        // SetModel("DoyleGetUp", getUpBrain);
        // behaviorParameters.BehaviorType = BehaviorType.InferenceOnly;
    }
}
else
{
    targetSmoothPosition = targetPositionBuffer.GetSmoothVal();
    headDir = targetSmoothPosition - stageBase.InverseTransformPoint(doyleHeadRb.position);
    rootDir = targetSmoothPosition - stageBase.InverseTransformPoint(doyleRootRb.position);
    flatTargetVelocity = rootDir;
    flatTargetVelocity.y = 0f;
    targetDistance = flatTargetVelocity.magnitude;
    Vector3 flatLeftDir = Vector3.Cross(flatTargetVelocity, Vector3.up);
    lookAngle = Mathf.InverseLerp(180f, 0f, Vector3.Angle(doyleHead.up, headDir));
    upAngle = Mathf.InverseLerp(180f, 0f, Vector3.Angle(doyleHead.right * -1f, Vector3.up));

    //Lean
    Vector3 leanDir = rootAimRot * flatTargetVelocity;
    spineLookAngle = Mathf.InverseLerp(180f, 0f, Vector3.Angle(doyleSpine.forward, flatLeftDir));
    spineUpAngle = Mathf.InverseLerp(180f, 30f, Vector3.Angle(doyleSpine.right * -1f, leanDir));
    rootLookAngle = Mathf.InverseLerp(180f, 0f, Vector3.Angle(doyleRoot.right * -1f, flatLeftDir));
    rootUpAngle = Mathf.InverseLerp(180f, 20f, Vector3.Angle(doyleRoot.up, leanDir));

    // float velocityReward = Mathf.InverseLerp(0f, 10f, doyleRootRb.velocity.magnitude) * 0.5f + Mathf.InverseLerp(0f, 10f, doyleSpineRb.velocity.magnitude) * 0.3f + Mathf.InverseLerp(0f, 10f, doyleHeadRb.velocity.magnitude) * 0.2f;
    // float angularReward = Mathf.InverseLerp(0f, 6.28f, doyleRootRb.angularVelocity.magnitude) * 0.2f + Mathf.InverseLerp(0f, 6.28f, doyleSpineRb.angularVelocity.magnitude) * 0.3f + Mathf.InverseLerp(0f, 6.28f, doyleHeadRb.angularVelocity.magnitude) * 0.5f;
    float velocityReward = GetVelocityReward(8f);
    float angularReward = GetAngularVelocityReward(10f);
    float standReward = (doyleLeftFeetBody.isStand? 0.5f : 0f) + (doyleRightFeetBody.isStand? 0.5f : 0f);

    lastReward = (1f - velocityReward) * 0.015f + (1f - angularReward) * 0.015f
        + (lookAngle + upAngle + spineLookAngle + spineUpAngle + rootLookAngle + rootUpAngle) * 0.008f
        + standReward * 0.01f
        + (1f - exertionRatio) * 0.002f;
    totalReward += lastReward;
    AddReward( lastReward );

    if(Time.fixedTime - landingMoment > landingBufferTime)
    {
        bool outVelocity = velocityReward > Mathf.Lerp(1f, 0.3f, (Time.fixedTime - landingMoment - landingBufferTime)/3f);
        bool outAngularVelocity = angularReward > Mathf.Lerp(1f, 0.5f, (Time.fixedTime - landingMoment - landingBufferTime)/3f);
        bool outSpeed = outVelocity || outAngularVelocity;
        float aimLimit = Mathf.Lerp(0f, 0.7f, (Time.fixedTime - landingMoment - landingBufferTime)/3f);
        float aimLimit2 = Mathf.Lerp(0f, 0.9f, (Time.fixedTime - landingMoment - landingBufferTime)/3f);
        bool outDirection = lookAngle < aimLimit2 || upAngle < aimLimit2 || spineLookAngle < aimLimit2 || rootLookAngle < aimLimit2;
        bool outMotion = spineUpAngle < aimLimit || rootUpAngle < aimLimit;
        if( outSpeed || outDirection || outMotion)
        {
            AddReward(-1f);
            if(outSpeed){ judge.outSpeed++; }
            if(outDirection){ judge.outDirection++; }
            if(outMotion){ judge.outMotion++; }
            judge.Reset();
            return;
        }
    }

    if(lookAngle > 0.9f && upAngle > 0.9f && spineLookAngle > 0.9f && rootLookAngle > 0.9f && velocityReward < 0.3f && angularReward < 0.5f && standReward > 0.9f)
    {
        Debug.Log("Stand");
        totalReward += 0.01f;
        AddReward( 0.01f );
    }
}
3.
for(int i=0; i<doyleBodies.Length; i++)
{
    if(doyleBodies[i].isGrounded)
    {
        if(doyleBodies[i].isWeakness)
        {
            weaknessOnGround = true;
        }
        else
        {
            //===Train Stand===
            if(doyleBodies[i].damageCoef > 0f)
            {
                AddReward(-0.1f * doyleBodies[i].damageCoef);
            }
        }
        ConfirmLanding();
        // ConfirmArrived();
    }
}
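For readers following the contact loop above, here is a minimal sketch of the body-part component it iterates over, inferred purely from how it is used in this post. The class name, the tag check, and the reset timing are my assumptions; only the four fields actually appear in the code above.

// Hypothetical body-part component; only the members referenced in the
// contact loop and the stand reward are known from this post.
using UnityEngine;

public class DoyleBody : MonoBehaviour
{
    public bool isWeakness;    // weak point: touching the ground ends the episode
    public float damageCoef;   // per-part penalty scale for non-weak contacts
    public bool isGrounded;    // ground-contact flag read by the agent
    public bool isStand;       // used by the feet for the stand reward (set elsewhere, not shown here)

    void FixedUpdate()
    {
        // Assumption: cleared each physics step before contacts are re-evaluated.
        isGrounded = false;
    }

    void OnCollisionStay(Collision collision)
    {
        // Assumption: ground detection by tag; the real project may use layers instead.
        if (collision.collider.CompareTag("Ground"))
        {
            isGrounded = true;
        }
    }
}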

// Roughly speaking:
--1. Reward keeping velocity and angular velocity low, with ForceSharping (the velocity/angular helpers are sketched after this list)
--2. Reward the look angle toward the target, with ForceSharping
--3. Reward a forward-leaning Spine and Root angle, with ForceSharping
--4. Reward both feet being in contact with the ground
--5. Extra reward when the standing-still criteria are met
--6. The tail and the sword may touch the ground, but doing so is penalized
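The GetVelocityReward / GetAngularVelocityReward helpers called in point 2's code are not shown in this post. A plausible reconstruction, based on the commented-out per-body version they replaced: the 0.5/0.3/0.2 weights and the Rigidbody fields come from that comment, while the parameterised max speed is my reading of the 8f/10f arguments.

// Assumed reconstruction, not the project's actual helpers.
// Members of the same agent class; doyleRootRb/doyleSpineRb/doyleHeadRb are the
// Rigidbody fields already used above. Output approaches 1 as the body moves
// faster, which is why the reward term uses (1f - velocityReward).
float GetVelocityReward(float maxSpeed)
{
    return Mathf.InverseLerp(0f, maxSpeed, doyleRootRb.velocity.magnitude) * 0.5f
         + Mathf.InverseLerp(0f, maxSpeed, doyleSpineRb.velocity.magnitude) * 0.3f
         + Mathf.InverseLerp(0f, maxSpeed, doyleHeadRb.velocity.magnitude) * 0.2f;
}

float GetAngularVelocityReward(float maxAngularSpeed)
{
    return Mathf.InverseLerp(0f, maxAngularSpeed, doyleRootRb.angularVelocity.magnitude) * 0.2f
         + Mathf.InverseLerp(0f, maxAngularSpeed, doyleSpineRb.angularVelocity.magnitude) * 0.3f
         + Mathf.InverseLerp(0f, maxAngularSpeed, doyleHeadRb.angularVelocity.magnitude) * 0.5f;
}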

4.SAC Config
DoyleTrain:
    learning_rate: 1e-4
    tau: 0.005
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 512
    buffer_size: 1000000
    max_steps: 5e7
    summary_freq: 30000
    num_layers: 4 (3?)
    steps_per_update: 30
    hidden_units: 512
    reward_signals:
        extrinsic:
            strength: 1.0
            gamma: 0.995

Experiment Duration:
Step: 5e7
Time Elapsed: (forgot to record it)

Experiment Results:
The training does have some effect, but the result is far from practical.

In short, across the several runs so far, under SAC the physics-based character usually ends up like this:

It's not that there is no effect, but the motions it produces are extremely strange.
Most of the time it doesn't even train a character that keeps moving; the character mostly freezes in some pose
and only twitches at certain moments, so it looks neither lifelike nor cool, and instead lands squarely in the uncanny valley.

That said, when I watched the recording closely,
ugly as it is, it actually does achieve the goals.

Also, because I was rushing the Sword Art Online tribute video at the time, I've now forgotten what num_layers actually was during the experiment.
My memory says 3, but the file I open now says 4, and I can't be sure whether I changed it afterwards.
So, as always, when I don't write the experiment log immediately, it's future me who ends up regretting it, hahaha.

In any case, my summary and hypothesis are as follows:
With PPO on a physics-based character, the policy may get stuck in a gait, but its behavior is built around motions; for example, under the same conditions it tries to make the same decision.
With SAC on a physics-based character, it folds in every behavior that can score, but that behavior is not a motion; it is the current observation and decision values, i.e. a single pose. So it strongly tends to freeze in some pose that scores, until a change of pose becomes necessary.

Overall, because SAC allows horizontal breadth across decisions, it conversely cannot build the vertical chain of consecutive decisions that makes up a motion.

Frankly, this is troubling.
I want a physics-based character that can trade blows with the player, and ideally it should have a range of decision-making ability in response to the player's moves.
But as it stands, one PPO Brain can only develop one motion, while SAC cannot develop motions at all.

One rather convoluted idea right now is to train a large number of PPO motions and let SAC decide when to switch between them (a rough sketch of such a switching layer follows below).
But that really would be too complex; the main problem is that it tends to hand the time pressure back to the developer, undoing the key advantage of ML, which is saving the developer a huge amount of time.
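Just to make the idea concrete, here is a very rough sketch of what that switching layer could look like, reusing the SetModel pattern already used for DoyleGetUp above. The class name, the second brain, and the hand-written selection rule are all placeholders; in the actual idea the selection itself would be a learned (e.g. SAC) policy rather than an if statement.

// Hypothetical high-level selector that swaps between pre-trained PPO motion brains.
using Unity.Barracuda;
using Unity.MLAgents;
using Unity.MLAgents.Policies;
using UnityEngine;

public class MotionSwitcher : MonoBehaviour
{
    public Agent doyleAgent;     // the existing Doyle agent
    public NNModel getUpBrain;   // e.g. the "DoyleGetUp" model
    public NNModel standBrain;   // e.g. a "DoyleStand" model (placeholder name)

    // Placeholder decision rule; the real idea would let a learned policy choose.
    public void Switch(bool weaknessOnGround)
    {
        if (weaknessOnGround)
        {
            doyleAgent.SetModel("DoyleGetUp", getUpBrain);
        }
        else
        {
            doyleAgent.SetModel("DoyleStand", standBrain);
        }
        doyleAgent.GetComponent<BehaviorParameters>().BehaviorType = BehaviorType.InferenceOnly;
    }
}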

In any case, I've decided to take this opportunity to upgrade the ML-Agents version across the board, from Verified Package 1.0.8 to Release 19.
I hope that during the upgrade I'll hit on some creative and clever breakthrough; ideally, after the upgrade PPO and SAC will suddenly behave exactly as I want.
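For the record, my guess at how the same SAC settings would be laid out in the Release 19 trainer config format; the nesting follows the standard ML-Agents 2.x schema as far as I know, num_epoch is omitted because it only applies to PPO there, and the num_layers ambiguity carries over.

behaviors:
  DoyleTrain:
    trainer_type: sac
    hyperparameters:
      learning_rate: 1.0e-4
      batch_size: 512
      buffer_size: 1000000
      tau: 0.005
      steps_per_update: 30
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 4      # or 3, see above
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.995
    max_steps: 50000000  # 5e7
    time_horizon: 1000
    summary_freq: 30000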

...Though it could just as easily turn out the complete opposite!
