Reinforcement Learning for Short Video Recommender Systems
Qingpeng Cai

Outline
1. Reinforcement Learning for Short Video Recommender Systems
2. RL for Multi-objectives (Two-Stage Constrained Actor-Critic for Short Video Recommendation, WWW 2023)
3. RL for Large Action Spaces (Exploration and Regularization of the Latent Action Space in Recommendation, WWW 2023)
4. RL for Delayed Feedback (Reinforcing User Retention in a Billion Scale Short Video Recommender System, WWW 2023)
5. Summary

Reinforcement Learning for Short Video RS
Differences between short video RS and other RS:
- Users interact with the short video RS by scrolling up and down, watching multiple videos per session.
- Multi-objectives: the watch time of multiple videos is the main objective (a dense response); share, download, and comment are sparse responses that serve as constraints.
- Delayed feedback: session depth and user retention.

Motivation of RL in Short Video RS
- Problems of supervised learning methods: they predict the value of an item or a list of items, lack exploration, and cannot optimize long-term value.
- Hyper-parameter tuning in the Kuaishou RS: many hyper-parameters exist in the ranking formula, a weighted sum of predicted objectives of the form w1*p1 + w2*p2 + .... How can we learn the optimal parameters to maximize different objectives?
- Objectives: watch time, interactions, session depth.
- Non-gradient methods (CEM, Bayesian optimization) are used in Kuaishou, but they are unable to optimize long-term metrics and lack personalization.
- RL provides exploration and aims to maximize long-term performance.

RL for Hyper-parameter Tuning: MDP
- State: (user information, user history), where the user history contains the states, actions, and rewards of previous steps.
- Action: the parameters of several ranking functions, a continuous vector.
- Reward: the sum of watch time and interaction rewards.
- Episode: the requests from opening the app to leaving the app.

RL for Hyper-parameter Tuning: Algorithms
- Objective: maximize the expected cumulative reward.
- Policy: a DNN that takes the state as input and outputs mu and sigma; the action is sampled from the Gaussian distribution N(mu, sigma).
- Algorithm selection:
  - REINFORCE: slow convergence, and only works for a single objective.
  - PPO: on-policy, so it does not fit the off-policy setting of Kuaishou.
  - A3C: faster convergence, but sensitive to different reward coefficients.

RL for Hyper-parameter Tuning: Live Results
- Loss functions: actor loss and critic loss.
- Live experiments (baseline: CEM): average app time +0.15%, watch time +0.33%.
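The Gaussian policy described above (a DNN mapping the state to mu and sigma, with the continuous tuning action sampled from the resulting distribution) can be sketched as follows. The layer sizes, state layout, and plain-numpy implementation are illustrative assumptions, not the production model:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianPolicy:
    """Toy policy: state -> (mu, sigma) -> sampled continuous action."""

    def __init__(self, state_dim, action_dim, hidden=32):
        self.w1 = rng.normal(0, 0.1, (state_dim, hidden))
        self.w_mu = rng.normal(0, 0.1, (hidden, action_dim))
        self.w_log_sigma = rng.normal(0, 0.1, (hidden, action_dim))

    def forward(self, state):
        h = np.tanh(state @ self.w1)
        mu = h @ self.w_mu
        sigma = np.exp(h @ self.w_log_sigma)  # keep the std-dev positive
        return mu, sigma

    def act(self, state):
        mu, sigma = self.forward(state)
        return rng.normal(mu, sigma)  # exploration comes from sampling

    def log_prob(self, state, action):
        # Log-density of the diagonal Gaussian, used by the actor loss.
        mu, sigma = self.forward(state)
        return (-0.5 * ((action - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum()

policy = GaussianPolicy(state_dim=16, action_dim=8)
state = rng.normal(size=16)   # (user information, user history) features
action = policy.act(state)    # parameters of the ranking functions
```

The sampled vector would then be plugged into the ranking formula as its weights for that request.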
- Fully launched.
- Comparison with contextual bandits: gamma = 0 corresponds to contextual bandits; gamma = 0.95 versus gamma = 0 gives app time +0.089% and VV +0.37%. RL performs better than bandits.

Challenges of RL for Short Video RS
- Unstable environment: each user is an environment rather than a fixed game, and the system fluctuates across days and hours.
- Multi-objectives: different reward signals in short videos: dwell time, like, follow, forward, comment, visiting depth.
- Safe and efficient exploration: the action space is large, and random exploration hurts user experience.
- Delayed feedback and credit assignment: the long-term engagement signal is delayed and noisy, and it is hard to allocate credit to immediate actions.

Qingpeng Cai, Zhenghai Xue, Chi Zhang, Wanqi Xue, Shuchang Liu, Ruohan Zhan, Xueliang Wang, Tianyou Zuo, Wentao Xie, Dong Zheng, Peng Jiang and Kun Gai
Two-Stage Constrained Actor-Critic for Short Video Recommendation (Ranking)
Short-video Recommendation
- Users interact with the RS by scrolling up and down, watching multiple videos.
- Several signals: the watch time of multiple videos is the main objective of the algorithm; dense responses can be effectively optimized by RL; share, download, and comment are sparse responses that serve as constraints.

Constrained Markov Decision Process (CMDP)
- Environment: the user; agent: the RS.
- Step: each request.
- Action: a video.
- Immediate rewards: watch time and interactions.
- Optimization program: maximize the cumulative main reward subject to constraints on the cumulative auxiliary rewards.

Challenges
- A direct method is to learn a policy that optimizes the Lagrangian. Problems:
  - The value estimation is not accurate for sparse signals; the dense signal, such as watch time, dominates the estimation.
  - It is hard to maximize the Lagrangian: the search space is larger due to the multiple constraints, and it is time-costly.

Multi-Critic Policy Estimation
- Each critic estimates the value of one objective.
- Comparing joint and separate learning: joint learning trains a single critic V0 on watch time + interaction; separate learning trains V1 on watch time and V2 on interaction.
- Evaluated by MAE, separate learning outperforms joint learning by 0.191% and 0.143% in terms of watch time and interaction, respectively.
(Figure: watch time reward, like reward, follow reward.)
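The separate-critic scheme above can be illustrated with a toy sketch: one TD(0) critic per objective, each trained on its own reward signal, instead of a single critic on the summed reward. The linear value functions, feature dimension, and synthetic transitions are assumptions for illustration only:

```python
import numpy as np

GAMMA = 0.95   # discount factor
ALPHA = 0.01   # learning rate

def td_update(w, s, r, s_next):
    """One TD(0) step for a linear value function V(s) = w . s."""
    td_error = r + GAMMA * w @ s_next - w @ s
    return w + ALPHA * td_error * s

rng = np.random.default_rng(1)
dim = 8
w_watch, w_inter = np.zeros(dim), np.zeros(dim)

for _ in range(1000):
    s, s_next = rng.normal(size=dim), rng.normal(size=dim)
    r_watch = s[0] + 0.1 * rng.normal()   # dense response (watch time)
    r_inter = float(rng.random() < 0.05)  # sparse response (interaction)
    # Each critic only ever sees its own objective's reward.
    w_watch = td_update(w_watch, s, r_watch, s_next)
    w_inter = td_update(w_inter, s, r_inter, s_next)
```

Keeping the critics separate prevents the dense watch-time signal from dominating the estimate of the sparse interaction value, which is the failure mode the slide describes for joint learning.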
Two-Stage Constrained Actor-Critic
- Stage One: for each auxiliary response, learn a policy that optimizes its cumulative reward.
- Stage Two: for the main response, learn a policy that optimizes its cumulative reward, while softly regularizing the policy to stay close to the auxiliary policies.
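The stage-two idea can be sketched as a policy-gradient term for the main response plus a soft penalty pulling the policy toward each stage-one auxiliary policy. The toy softmax policies, the KL form of the penalty, and the coefficient value are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def stage_two_loss(theta_main, aux_thetas, action, main_advantage, lam=0.1):
    """Main-response surrogate loss with soft regularization (toy form)."""
    pi = softmax(theta_main)
    # Policy-gradient surrogate for the main (watch-time) response.
    pg_term = -main_advantage * np.log(pi[action])
    # Soft regularization: stay close to each auxiliary policy (KL divergence).
    reg = sum(np.sum(softmax(t) * (np.log(softmax(t)) - np.log(pi)))
              for t in aux_thetas)
    return pg_term + lam * reg  # smaller lam => weaker constraint

theta_main = np.zeros(5)                       # toy 5-action policy logits
aux = [np.array([0.5, 0.0, 0.0, 0.0, 0.0]),    # stage-one policy, objective 1
       np.array([0.0, 0.5, 0.0, 0.0, 0.0])]    # stage-one policy, objective 2
loss = stage_two_loss(theta_main, aux, action=0, main_advantage=1.0)
```

Setting `lam` to zero recovers an unconstrained main-objective update, which matches the slide's note that a smaller coefficient means a weaker constraint.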
- A smaller regularization coefficient gives a weaker constraint; the coefficient is the same for all objectives.

Offline Experiments
Live Experiments

Exploration and Regularization of the Latent Action Space in Recommendation (Ranking)
Shuchang Liu, Qingpeng Cai, Bowen Sun, Yuhao Wang, Dong Zheng, Peng Jiang, Kun Gai, Ji Jiang, Xiangyu Zhao and Yongfeng Zhang

RL-based Rec Sys
- Optimizes the user's long-term rewards.
- A user session consists of multiple steps of interaction.
- In each step, the policy recommends a list of items as the action, and the user generates immediate feedback and a reward.
- Goal: optimize the overall reward of the entire session.
- Key challenge: a large, dynamic, and discrete action space.
- Solution: represent the list with a latent vector, i.e. hyper-action generation (the latent "hyper-action") plus deterministic scoring and ranking (the recommended list, the "effect-action").
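The latent-action pipeline described above (hyper-action generation, deterministic dot-product scoring and top-k ranking into an effect-action, and an inverse module that averages item encodings) can be sketched as follows; the dimensions, candidate pool, and random encodings are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, dim, k = 100, 16, 6
item_encodings = rng.normal(size=(n_items, dim))  # item encoding kernel

def effect_action(hyper_action):
    """Deterministic scoring + ranking: latent vector -> top-k item list."""
    scores = item_encodings @ hyper_action  # dot-product scoring function
    return np.argsort(-scores)[:k]

def inverse_module(item_list):
    """Infer the hyper-action back as the average of the item encodings."""
    return item_encodings[item_list].mean(axis=0)

hyper = rng.normal(size=dim)       # produced by the actor network
items = effect_action(hyper)       # the list actually shown to the user
hyper_hat = inverse_module(items)  # representation fed to the critic
```

The gap between `hyper` and `hyper_hat` is precisely the hyper-action/effect-action inconsistency that the consistency regularizer is meant to close.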
- New challenge: inconsistency between the hyper-action and the effect-action.
Hyper-Actor Critic Learning
- The policy only generates the hyper-action.
- A dot product with the item encoding kernel serves as the deterministic scoring function.
- An inverse module infers the hyper-action back from the averaged item encodings.
- Exploration may be applied to both the hyper-action and the effect-action.
- The critic of the effect-action learns from accurate action representations: the inverse module plus the critic for the hyper-action.
- The actor efficiently learns from the critic of hyper-actions.
- A regularizer ensures consistency between the two action spaces.
- Supervision boosts sample efficiency in the early training stage, with RL taking over later on.

Evaluation on Online Simulation
- A pretrained user response model based on observed data logs; click signals serve as rewards.
- Effectiveness and ablation results (figures).

Qingpeng Cai*, Shuchang Liu*, Xueliang Wang, Tianyou Zuo, Wentao Xie, Bin Yang, Dong Zheng, Peng Jiang and Kun Gai
Reinforcing User Retention in a Billion Scale Short Video Recommender System (Hyper-parameter Tuning)

User Retention in Short-video Recommendation
- User retention directly affects DAU; it is long-term feedback received after multiple requests.
- It is hard to decompose, similar to Go; point-wise and list-wise methods cannot optimize it.
- Solution: RL optimizes user retention directly by minimizing the cumulative sum of returning times, which is equal to improving user visits.
- This is the first work to directly optimize user retention; previous works focus on cumulative immediate feedback.

Infinite Horizon Request-based Markov Decision Process
- State: user profile, user history, and candidate video features.
- Action: a vector that ensembles the ranking functions.
- Immediate rewards: the sum of watch time and interactions, r(s, a).
- Returning time: the time gap between the last step of session i and the first step of session i+1.
- Objective: minimize the discounted sum of returning times, min E[sum_{i>=1} gamma^(i-1) d_i], where d_i is the returning time after session i.

Challenges of Retention
- Uncertainty: retention is not fully decided by the recommendation; it is also affected by social events.
- Bias: retention is biased with time and user activity; highly active users have higher retention and more samples.
- Long delay: the retention reward returns in hours to days, which causes instability in online RL.

Reinforcement Learning for User Retention: Algorithm
Learning the Retention and Tackling the Uncertainty Challenge
- A normalization technique reduces the variance: learn a session-level classification model that predicts the probability p that the returning time is shorter than a threshold T; by Markov's inequality, T * (1 - p) lower-bounds the expected returning time.
- Use the ratio of the true returning time to the estimated returning time as the retention reward.
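The Markov-inequality bound works as follows: since E[X] >= a * P(X >= a), a classifier's probability p of the user returning within T hours yields the lower bound T * (1 - p) on the expected returning time. A minimal sketch, where the threshold T and the sample values are assumptions:

```python
T = 24.0  # threshold in hours (illustrative)

def estimated_returning_time(p_return_within_T):
    # Markov inequality: E[X] >= T * P(X >= T) = T * (1 - p).
    return T * (1.0 - p_return_within_T)

def retention_reward(true_returning_time, p_return_within_T):
    # Normalized reward: true returning time / estimated returning time.
    return true_returning_time / estimated_returning_time(p_return_within_T)

# A user predicted to return quickly (p = 0.9) who actually took 12 hours:
r = retention_reward(12.0, 0.9)  # 12 / (24 * 0.1) = 5.0
```

Dividing by the per-user estimate normalizes away much of the variance between fast-returning and slow-returning users, which is the stated purpose of the technique.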
Enhancing Learning by Heuristic Rewards
- Enhance policy learning with intrinsic rewards and immediate feedback.
- Intrinsic rewards via Random Network Distillation.
- Solving the bias: different policies for users of different activity levels.
- The actor learns from both the retention critic and the immediate-response critic.

Tackling the Unstable Training and Bias Problem
- Problem of previous regularization methods: a loss of the form L(theta) + beta * KL(N(mu_old, sigma_old) || N(mu_theta, sigma_theta)) learns either too slowly or too quickly.
- Soft regularized actor loss: samples with a larger policy shift get smaller weights, and a coefficient controls the regularization degree.

Offline and Live Experiments
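The soft regularization idea can be sketched by down-weighting each sample's policy-gradient term according to how far the policy has shifted on that sample; the exponential weight form and the beta value are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def soft_weight(logp_new, logp_old, beta=1.0):
    # Larger shift between old and new policies -> smaller sample weight.
    shift = abs(logp_new - logp_old)
    return np.exp(-beta * shift)

def soft_regularized_actor_loss(logps_new, logps_old, advantages, beta=1.0):
    """Per-sample weighted policy-gradient loss (toy form).

    beta controls the regularization degree: beta = 0 removes the
    regularization entirely, large beta suppresses shifted samples.
    """
    weights = np.array([soft_weight(n, o, beta)
                        for n, o in zip(logps_new, logps_old)])
    return -(weights * logps_new * advantages).mean()

logps_new = np.array([-1.0, -2.0, -1.5])
logps_old = np.array([-1.1, -3.0, -1.5])
adv = np.array([0.5, 1.0, -0.2])
loss = soft_regularized_actor_loss(logps_new, logps_old, adv)
```

Unlike a single global KL penalty, which the slide notes learns either too slowly or too quickly, the per-sample weight softens the constraint only where the policy has actually moved.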
State:
- user profile: age, gender, and location
- behavior history: user statistics, video ids, and the user's feedback in the previous 3 requests
- candidate video features
Action: an 8-dimensional continuous vector ranging in [0, 4]
Immediate reward: the sum of watch time and interactions of 6 videos

Summary
- RL for short video RS: hyper-parameter tuning and ranking.
- Multi-objectives, large action spaces, and delayed feedback.
- Code implementations of our RL-based works: https:/
- KuaiSim: A Comprehensive Simulator for Recommender Systems, NeurIPS 2023: https:/