蔡庆芃_短视频推荐强化学习算法_watermark.pdf


Reinforcement Learning for Short Video Recommender System
Qingpeng Cai

Outline
1. Reinforcement Learning for Short Video Recommender Systems
2. RL for Multi-objectives (Two-Stage Constrained Actor-Critic for Short Video Recommendation, WWW 2023)
3. RL for Large Action Space (Exploration and Regularization of the Latent Action Space in Recommendation, WWW 2023)
4. RL for Delayed Feedback (Reinforcing User Retention in a Billion Scale Short Video Recommender System, WWW 2023)
5. Summary

Reinforcement Learning for Short Video RS
Differences between short video RS and other RS:
- Users interact with the short video RS by scrolling up and down and watching multiple videos.
- Multi-objectives: watch time of multiple videos is the main objective (dense responses); share, download, and comment are sparse responses that serve as constraints.
- Delayed feedback: session depth and user retention.

Motivation of RL in Short Video RS
- Problems of supervised learning methods: they predict the value of an item or a list of items, lack exploration, and cannot optimize the long-term value.
- Hyper-parameter tuning in Kuaishou RS: many hyper-parameters exist, since the final ranking score is a weighted combination of predicted responses, e.g. α₁·s₁ + α₂·s₂ + … (sketched below). How to learn optimal parameters to maximize different objectives (watch time, interactions, session depth)?
- Non-gradient methods (CEM, Bayesian optimization) are used in Kuaishou, but they cannot optimize long-term metrics and lack personalization.
- RL provides exploration and aims to maximize long-term performance.
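To make the hyper-parameter tuning problem concrete, here is a minimal sketch of a weighted ranking ensemble; the response names, weights, and list size are illustrative assumptions, not Kuaishou's actual formula.

```python
import numpy as np

def ensemble_rank(predicted_responses: dict, weights: dict, top_k: int) -> np.ndarray:
    """Rank candidate videos by a weighted sum of predicted responses.

    predicted_responses: per-response model scores for each candidate,
        e.g. {"watch_time": ..., "like": ...} (illustrative names).
    weights: the hyper-parameters alpha_i that the RL agent tunes.
    """
    score = sum(w * predicted_responses[name] for name, w in weights.items())
    return np.argsort(-score)[:top_k]  # indices of the top-k videos

# Example: the RL action is the weight vector applied to this request.
scores = {"watch_time": np.random.rand(100), "like": np.random.rand(100)}
action = {"watch_time": 1.0, "like": 0.4}   # produced by the policy
print(ensemble_rank(scores, action, top_k=6))
```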

RL for Hyper-parameter Tuning: MDP
- State: (user information, user history). The user history contains the states, actions, and rewards of previous steps.
- Action: the parameters of several ranking functions, represented as a continuous vector.
- Reward: watch time plus interactions for the request.
- Episode: the requests from opening the app to leaving the app.

RL for Hyper-parameter Tuning: Algorithms
- Objective: maximize the expected cumulative reward over an episode.
- Policy DNN: takes the state as input, outputs μ and σ, and samples the action from the Gaussian distribution N(μ, σ) (see the sketch after this list).
- Algorithm selection:
  - REINFORCE: slow convergence and only works for a single objective.
  - PPO: on-policy, so it does not fit the off-policy setting of Kuaishou.
  - A3C: faster convergence, but sensitive to the reward coefficients.
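As a concrete illustration of the policy described above, the sketch below shows a Gaussian policy network that maps the state to μ and σ and samples a continuous hyper-parameter vector. The layer sizes and dimensions are assumptions; the talk only specifies that the action is a continuous vector.

```python
import torch
import torch.nn as nn

class GaussianHyperParamPolicy(nn.Module):
    """Policy DNN: state in, (mu, sigma) out; action sampled from N(mu, sigma)."""
    def __init__(self, state_dim: int = 64, action_dim: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, action_dim)
        self.log_sigma_head = nn.Linear(128, action_dim)

    def forward(self, state: torch.Tensor):
        h = self.backbone(state)
        mu = self.mu_head(h)
        sigma = self.log_sigma_head(h).exp()          # keep sigma positive
        dist = torch.distributions.Normal(mu, sigma)
        action = dist.sample()                        # exploration via sampling
        return action, dist.log_prob(action).sum(-1)  # log-prob for the actor loss

policy = GaussianHyperParamPolicy()
action, log_prob = policy(torch.randn(1, 64))
```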

RL for Hyper-parameter Tuning: Live Results
- Loss functions: actor loss and critic loss.
- Live experiments (baseline: CEM): average app time +0.15%, watch time +0.33%. Fully launched.
- Comparison with contextual bandits: γ = 0 corresponds to contextual bandits; γ = 0.95 compared with γ = 0 gives app time +0.089% and VV +0.37%. RL performs better than bandits.

Challenges of RL for Short Video RS
- Unstable environment: each user is an environment rather than a fixed game, and the system fluctuates across days and hours.
- Multi-objectives: different reward signals in short videos, such as dwell time, like, follow, forward, comment, and visiting depth.
- Safe and efficient exploration: the action space is large, and random exploration hurts user experience.
- Delayed feedback and credit assignment: the long-term engagement signal is delayed and noisy, and it is hard to allocate credit to immediate actions.

Two-Stage Constrained Actor-Critic for Short Video Recommendation (Ranking)
Qingpeng Cai, Zhenghai Xue, Chi Zhang, Wanqi Xue, Shuchang Liu, Ruohan Zhan, Xueliang Wang, Tianyou Zuo, Wentao Xie, Dong Zheng, Peng Jiang and Kun Gai

Short-video Recommendation
- Users interact with the RS by scrolling up and down and watching multiple videos.
- Several signals: watch time of multiple videos is the main objective of the algorithm (dense responses, can be effectively optimized by RL); share, download, and comment are sparse responses that serve as constraints.

Constrained Markov Decision Process (CMDP)
- Environment: the user. Agent: the RS. Step: each request. Action: a video. Immediate rewards: watch time and interactions.
- The optimization program: maximize the cumulative main response subject to constraints on the cumulative auxiliary responses.

Challenges
- A direct method is to learn a policy that optimizes the Lagrangian of the constrained problem.
- Problem: the value estimation is not accurate for sparse signals; the dense signal, such as watch time, dominates the estimation.
- It is also hard to maximize the Lagrangian: the search space is larger due to multiple constraints, and the optimization is time-costly.

Multi-Critic Policy Estimation
- Each critic estimates the value of one objective (see the sketch below).
- Comparing joint and separate learning: in joint learning, a single critic Q0 learns watch time + interaction; in separate learning, Q1 learns watch time and Q2 learns interaction.
- Measured by MAE, separate learning outperforms joint learning by 0.191% and 0.143% on watch time and interaction, respectively. (The accompanying figure plots the watch-time, like, and follow rewards.)
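A minimal sketch of the multi-critic idea above: one critic per response, each trained only on its own reward, rather than a single critic on the summed reward. The network shapes and response names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiCritic(nn.Module):
    """Separate value estimation: one critic per response (e.g. watch time, like).

    Each critic sees only its own reward signal, so sparse responses are not
    drowned out by the dense watch-time signal.
    """
    def __init__(self, state_dim: int = 64, responses=("watch_time", "like", "follow")):
        super().__init__()
        self.critics = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            for name in responses
        })

    def forward(self, state: torch.Tensor) -> dict:
        return {name: critic(state).squeeze(-1) for name, critic in self.critics.items()}

def critic_loss(values: dict, rewards: dict, next_values: dict, gamma: float = 0.99):
    # One TD loss per critic; each target uses only that critic's own reward.
    return sum(
        ((rewards[k] + gamma * next_values[k].detach() - values[k]) ** 2).mean()
        for k in values
    )
```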

Two-Stage Constrained Actor-Critic
- Stage one: for each auxiliary response, learn a policy that optimizes its cumulative reward.
- Stage two: for the main response, learn a policy that optimizes its cumulative reward, while softly regularizing the policy to stay close to the auxiliary policies (a schematic sketch follows this list).
- A smaller regularization coefficient λ means a weaker constraint; the same λ is used for all objectives.
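The sketch below illustrates, under assumed notation, how a stage-two actor loss could combine the main-response advantage with a soft penalty that keeps the policy close to each auxiliary policy. It is a schematic of the idea described above, not the paper's exact loss.

```python
import torch

def stage_two_actor_loss(log_prob_main: torch.Tensor,
                         advantage_main: torch.Tensor,
                         log_prob_aux: list,
                         lam: float = 0.1) -> torch.Tensor:
    """Stage two (schematic): maximize the main-response objective while softly
    pulling the policy toward the auxiliary policies learned in stage one.

    log_prob_main: log pi_theta(a|s) under the main policy being trained.
    log_prob_aux:  log pi_k(a|s) under each frozen auxiliary policy.
    lam:           regularization strength (smaller lam = weaker constraint).
    """
    policy_gradient_term = -(advantage_main.detach() * log_prob_main).mean()
    # Soft regularization: penalize actions the auxiliary policies consider unlikely.
    regularization = -lam * sum(lp.detach().exp() * log_prob_main for lp in log_prob_aux).mean()
    return policy_gradient_term + regularization
```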

Offline Experiments and Live Experiments (result figures omitted)

Exploration and Regularization of the Latent Action Space in Recommendation (Ranking)
Shuchang Liu, Qingpeng Cai, Bowen Sun, Yuhao Wang, Dong Zheng, Peng Jiang, Kun Gai, Ji Jiang, Xiangyu Zhao and Yongfeng Zhang

RL-based Rec Sys
- Optimizes the user's long-term rewards.
- A user session consists of multiple steps of interaction. In each step, the policy recommends a list of items as the action, and the user generates immediate feedback and a reward.
- Goal: optimize the overall reward of the entire session.
- Key challenge: a large, dynamic, and discrete action space.
- Solution: represent the list with a latent vector, i.e. hyper-action generation plus deterministic scoring and ranking (see the sketch below). The latent vector is the hyper-action; the recommended list is the effect-action.
- New challenge: inconsistency between the hyper-action and the effect-action.
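To make the hyper-action / effect-action distinction concrete, here is a minimal sketch, with assumed dimensions and an assumed item-embedding table, of how a latent hyper-action could be turned into a recommended list by deterministic scoring and top-k ranking.

```python
import torch

def effect_action_from_hyper_action(hyper_action: torch.Tensor,
                                    item_embeddings: torch.Tensor,
                                    k: int = 6) -> torch.Tensor:
    """Deterministic scoring and ranking: dot product between the latent
    hyper-action and every candidate item's encoding, then take the top-k.

    hyper_action:    (d,) latent vector produced by the policy.
    item_embeddings: (num_items, d) item encoding kernel (assumed here to be
                     a learned embedding table).
    Returns the indices of the recommended list (the effect-action).
    """
    scores = item_embeddings @ hyper_action          # (num_items,)
    return torch.topk(scores, k).indices

# Example with made-up sizes: 10k candidate items, 32-dim latent space.
items = torch.randn(10_000, 32)
z = torch.randn(32)                                  # hyper-action from the actor
print(effect_action_from_hyper_action(z, items))
```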

Hyper-Actor Critic Learning
- The policy only generates the hyper-action.
- A dot product with the item encoding kernel serves as the deterministic scoring function.
- The hyper-action is inferred back from the effect-action by averaging the item encodings (sketched below).
- Exploration may be applied on both the hyper-action and the effect-action.
- The critic of the effect-action learns from accurate action representations: an inverse module plus a critic for the hyper-action.
- The actor efficiently learns from the critic of hyper-actions.
- Consistency between the two action spaces is enforced.
- Supervision boosts sample efficiency at the early training stage, and RL takes over later on.
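A rough sketch of the inverse module mentioned above: it maps a recommended list back to a latent vector by averaging the item encodings, and a consistency loss ties that reconstruction to the hyper-action the actor produced. Function names and the MSE choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def inverse_hyper_action(effect_action: torch.Tensor,
                         item_embeddings: torch.Tensor) -> torch.Tensor:
    """Inverse module: recover a hyper-action from the recommended list
    by averaging the encodings of the items in the list."""
    return item_embeddings[effect_action].mean(dim=0)

def consistency_loss(hyper_action: torch.Tensor,
                     effect_action: torch.Tensor,
                     item_embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize the gap between the actor's hyper-action and the hyper-action
    reconstructed from the effect-action, keeping the two action spaces aligned."""
    reconstructed = inverse_hyper_action(effect_action, item_embeddings)
    return F.mse_loss(hyper_action, reconstructed)
```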

Evaluation on Online Simulation
- Effectiveness and ablation results (figures omitted).
- The simulator is a user response model pretrained on observed data logs; click signals are used as rewards.

Reinforcing User Retention in a Billion Scale Short Video Recommender System (Hyper-parameter Tuning)
Qingpeng Cai*, Shuchang Liu*, Xueliang Wang, Tianyou Zuo, Wentao Xie, Bin Yang, Dong Zheng, Peng Jiang and Kun Gai

User Retention in Short-video Recommendation
- User retention directly affects DAU. It is long-term feedback given after multiple requests, hard to decompose (similar to Go), and cannot be optimized by point-wise or list-wise methods.
- Solution: RL optimizes user retention directly by minimizing the cumulative sum of returning time, which is equivalent to increasing user visits.
- This is the first work to directly optimize user retention; previous works focus on cumulative immediate feedback.

Infinite Horizon Request-based Markov Decision Process
- State: user profile, user history, and candidate video features.
- Action: a vector that ensembles the ranking functions.
- Immediate rewards: the sum of watch time and interactions, r(s, a).
- Returning time: the time gap between the last step of session s_i and the first step of session s_{i+1}.
- Objective: minimize the expected discounted sum of returning times over sessions, Σ_{i≥1} γ^{i-1}·G_i, where G_i is the returning time after session i.

Challenges of Retention
- Uncertainty: retention is not fully decided by the recommendation and is also affected by social events.
- Bias: retention is biased with time and user activity; highly active users have higher retention and contribute more samples.
- Long delay: the retention reward arrives hours to days later, which causes instability in online RL.

Reinforcement Learning for User Retention Algorithm

Learning the Retention and Tackling the Uncertainty Challenge
- A normalization technique reduces the variance of the retention reward (see the sketch below):
  - Learn a session-level classification model T that predicts whether the returning time is shorter than a threshold.
  - Estimate a lower bound of the expected returning time via Markov's inequality.
  - Use the ratio of the true returning time to the estimated returning time as the retention reward.
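A minimal sketch of that normalization, under assumed variable names and an assumed threshold: Markov's inequality gives E[X] ≥ d·P(X ≥ d) for a nonnegative returning time X, so a classifier that predicts P(X < d) yields a per-user lower bound that normalizes the observed returning time.

```python
def retention_reward(true_returning_time: float,
                     p_return_within_d: float,
                     d: float = 24.0) -> float:
    """Normalized retention reward (sketch; variable names and d are assumptions).

    p_return_within_d: output of the session-level classifier T, i.e. the
        predicted probability that the returning time is shorter than d hours.
    Markov's inequality for a nonnegative returning time X gives
        E[X] >= d * P(X >= d) = d * (1 - p_return_within_d),
    which serves as the per-user estimate used for normalization.
    """
    estimated_lower_bound = d * (1.0 - p_return_within_d)
    return true_returning_time / max(estimated_lower_bound, 1e-6)

# Example: a user who usually returns quickly (p=0.9) but took 30 hours this time
# gets a larger (worse, since returning time is minimized) reward than a slow returner.
print(retention_reward(30.0, 0.9), retention_reward(30.0, 0.2))
```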

Enhancing Policy Learning with Intrinsic Rewards and Immediate Feedback
- Learning is enhanced by heuristic rewards.
- Intrinsic rewards are provided by Random Network Distillation (sketched below).
- Solving the bias: different policies are used for users of different activity levels.
- The actor learns from both the retention critic and the immediate response critic.
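For reference, a minimal sketch of Random Network Distillation as an intrinsic exploration reward: a predictor network is trained to match a fixed random target network, and the prediction error on a state is used as a novelty bonus. The architecture is an assumption; the talk only names the technique.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: prediction error against a frozen random
    target network serves as an intrinsic (novelty) reward."""
    def __init__(self, state_dim: int = 64, feat_dim: int = 32):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():           # the target network stays fixed
            p.requires_grad_(False)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        error = (self.predictor(state) - self.target(state)).pow(2).mean(dim=-1)
        return error  # high for rarely visited states, low once the predictor catches up
```

The predictor is trained to minimize the same error, so frequently visited states gradually yield a smaller bonus.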

Tackling the Unstable Training and Bias Problem
- Problem of previous regularization methods: adding a fixed divergence penalty to the actor loss, L(θ) + λ·D(N(μ_β, σ_β), N(μ_θ, σ_θ)), makes learning either too slow or too fast.
- Soft regularized actor loss: samples with a larger policy shift get smaller weights, and λ controls the regularization degree (sketched below).
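A schematic, under assumed notation, of the soft regularization idea described above: instead of a fixed penalty, each sample's contribution to the actor loss is down-weighted by how far the current policy has shifted from the behavior policy on that sample, with λ setting how sharply the weight decays. This is a sketch of the idea, not the paper's exact loss.

```python
import torch

def soft_regularized_actor_loss(log_prob: torch.Tensor,
                                behavior_log_prob: torch.Tensor,
                                advantage: torch.Tensor,
                                lam: float = 1.0) -> torch.Tensor:
    """Soft regularization sketch: samples where the policy has shifted further
    from the behavior policy receive exponentially smaller weights.

    log_prob:          log pi_theta(a|s) for the logged actions.
    behavior_log_prob: log pi_beta(a|s) under the behavior (logging) policy.
    lam:               controls the regularization degree (larger = stronger).
    """
    shift = (log_prob - behavior_log_prob).abs()      # per-sample policy shift
    weight = torch.exp(-lam * shift).detach()         # larger shift -> smaller weight
    return -(weight * advantage.detach() * log_prob).mean()
```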

Offline and Live Experiments
- State: the user profile (age, gender, and location), the behavior history (user statistics, video ids, and the user's feedback in the previous 3 requests), and the candidate video features.
- Action: an 8-dimensional continuous vector ranging in [0, 4].
- Immediate reward: the sum of watch time and interactions of 6 videos.

Summary
- RL for short video RS: hyper-parameter tuning and ranking.
- Key challenges: multi-objectives, large action spaces, and delayed feedback.
- Code implementations of our RL-based works: https:/
- KuaiSim: A Comprehensive Simulator for Recommender Systems (NeurIPS 2023): https:/
