The Practice and Optimization of Flink on K8s in JD
Fu Haitao (付海涛), Senior Technical Expert

Agenda
#1 The Introduction
#2 The Practice
#3 The Optimizations
#4 Future Plan

#1 The Introduction

Why Kubernetes
- Easy to set up, deploy, and operate
- Better resource utilization
- Better resource isolation and security

The Evolution of Containerization
- June 2018: 20% of jobs running on K8s
- February 2019: all computing units running on K8s
- 2021: continuous optimization and practice

Benefits
- Efficient resource sharing: services are co-located and resource sharing improves, saving 30% of machine resources.
- DevOps efficiency improvement: development, test, and production share a consistent environment; deployment and operations are more automated; management and operations costs drop by 50%.
- Business stability (full isolation and resilience): resource isolation with fine-grained access control; elastic self-healing keeps services stable.

Flink on K8s in JD (architecture diagram)
- JRC (JD real-time computing platform): configuration, debugging, deployment, monitoring, logs; SQL and JAR jobs
- JDOS
- Source: JDQ, Kafka, HDFS/OSS, MySQL, JimDB
- Sink: JDQ, Kafka, HDFS/OSS, MySQL, ES
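- Infrastructure: physical machines and cloud hosts

#2 The Practice

Containerization, static mode (The Standalone K8s)
Diagram components:
- K8s client on the JRC platform
- K8s Master: API server, controller, scheduler
- JobManager: K8s Deployment, K8s Pod
- TaskManager: K8s Deployment, K8s Pod
- Docker Registry
- ZK cluster (HA)
- HDFS or OSS (state storage)
Flow: the platform creates the cluster from jobmanager-deployment.yaml and taskmanager-deployment.yaml, then submits the job.
Limitation: resources must be allocated in advance, so they cannot follow the resource requirements of each job or keep up with flexible, changing business needs.

Containerization, elastic mode (The Native K8s)
- The platform creates and frees resources.
- Preallocation is supported, to stay compatible with static mode.
- The task slot distribution strategy of standalone K8s mode is preserved.
Diagram components:
- K8s client on the JRC platform
- K8s Master: API server, controller, scheduler
- Docker Registry
- ZK cluster (HA)
- HDFS or OSS (state storage)
- Flink JobManager (K8s Deployment): JobMaster, Dispatcher, JDResMngr, Rest Server
- Flink TaskManager (K8s Pod), one Pod per TaskManager

Logs and Metrics
- On each node and Pod: JM or TM with a metric reporter, plus a log agent and a metric agent
- Baize (白泽) system: JVM metrics, Flink metrics
- Origin system: physical machine metrics, container metrics
- Logbook system: Flink cluster and job logs
- JRC real-time computing platform: monitoring views and alerting; log query over real-time and historical logs

Network Performance
Issue: container networking inevitably introduces some network performance loss.
Solution:
- Choose an appropriate network mode according to the data center environment: host network in old data centers, container network in new data centers (with the high-performance network plugin Tianlian (天链)).
- Adjust cluster network parameters to improve fault tolerance, for example:
  - akka.ask.timeout: from 10 s to 60 s
  - taskmanager.network.request-backoff.max: from 10000 to 60000
- Network loss has a large effect on checkpoint speed.

Disk Performance
Issue: OverlayFS brings additional disk I/O overhead.
Solution:
- Use mounted volumes to bypass OverlayFS and improve I/O performance.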
- Use local storage volumes (hostPath or LVM data volumes) for:
  - logs
  - the RocksDB state backend
  - batch job shuffle
- Tune disk I/O access parameters to improve performance, for example RocksDB parameters.

Auto Scaling
Issue: traffic has peaks and troughs. How to reduce manual intervention and improve resource utilization under unpredictable load variation?
Diagram components:
- Flink cluster: JobManager, TaskManager x N, reporting metrics
- Metric System
- Scaling Service, driven by metrics and the job scaling configuration
- JRC backend: adjusts the cluster and the job, and returns the job adjustment result
- K8s Master
- Off-peak co-location of streaming and batch jobs

Intelligent Diagnosis
- Monitoring sources: machine, container, cluster, and job monitoring
- Diagnostic data: monitoring metrics, job topology, Pod logs, job logs
- Output: diagnosis result, diagnosed symptoms, diagnosis details, optimization suggestions

#3 The Optimizations

Slot Spread Out Across TaskManagers
Issue: how to spread out the slots evenly across TaskManagers in native K8s mode, so that scheduling is relatively balanced?
Solution: preallocate resources before scheduling the job, so that the scheduling option cluster.evenly-spread-out-slots works.
Workflow (Dispatcher, ResourceManager, JobMaster, JDElasticRMDriver):
- Submit the job.
- Pre-request the resources the job needs.
- Allocate the resources the job needs.
- Notify that resource allocation is complete.
- Schedule the tasks.
Throughput comparison (chart): sequential scheduling 57 million; balanced scheduling 89.47 million, which is 157% of the sequential figure.

Resource Isolation Between Jobs In One Cluster
Issue: many businesses run multiple jobs in one cluster. How to keep the jobs from affecting each other?
Solution (diagram labels):
- JobManager: Scheduler, SlotPool (Request, Assign)
- ResourceManager: SlotManager (Request, Assign)
- TaskManager: Slot (Offer)
- Before isolation: Pod_A, Pod_B, Pod_C; after isolation (job-level isolation): Pod_A, Pod_B, Pod_C

Fast Failure Recovery
Issue: the container environment is complex and changeable, and Pods are evicted or restarted from time to time. How to reduce the impact on the business? The goal is to speed up job recovery.

Optimization 1: detect an evicted or restarted Pod faster.
Diagram labels: Kubelet; Pod_A, Pod_B; Map(1/3), Sink(1/3); main process, child process; TM, JM.
- The TERM signal from Kubelet is received.
- The JM is notified that the Pod hosting the TM exited abnormally.
- The TM restart is detected.

Optimization 2: where a small amount of data loss is tolerable, use a single-task recovery strategy.
Diagram labels: JobManager; Source(1/1), Map(1/1), Sink(1/2), Sink(2/2); Pod_A, Pod_B, Pod_C, Pod_D.
- A single task or Pod fails.
- The failed task is redeployed.
- The upstream task is notified that the task is ready.

#4 Future Plan
- Smart operations: smart diagnosis and self-regulation
- Service co-location: mixed workloads to get better resource utilization
- Scheduling optimization
- Flink AI support: exploring Flink for AI
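
The difference between sequential and balanced scheduling described under "Slot Spread Out Across TaskManagers" can be illustrated with a small Python sketch. It is not JD's or Flink's implementation. The function names, the TaskManager count, and the slot count are hypothetical, and it assumes the TaskManagers have already been preallocated, which is the precondition the slides give for cluster.evenly-spread-out-slots to take effect.

```python
from collections import Counter


def assign_sequential(tasks, taskmanagers, slots_per_tm):
    """Fill each TaskManager completely before using the next one."""
    if len(tasks) > len(taskmanagers) * slots_per_tm:
        raise ValueError("not enough slots for all tasks")
    return {task: taskmanagers[i // slots_per_tm] for i, task in enumerate(tasks)}


def assign_spread_out(tasks, taskmanagers, slots_per_tm):
    """Always place the next task on the least-loaded TaskManager."""
    if len(tasks) > len(taskmanagers) * slots_per_tm:
        raise ValueError("not enough slots for all tasks")
    load = {tm: 0 for tm in taskmanagers}
    assignment = {}
    for task in tasks:
        # min() returns the first minimum, so ties go to the earliest TaskManager.
        tm = min(taskmanagers, key=load.__getitem__)
        load[tm] += 1
        assignment[task] = tm
    return assignment


def slots_used(assignment):
    """Count how many tasks landed on each TaskManager."""
    return Counter(assignment.values())


# Hypothetical setup: 4 preallocated TaskManagers with 4 slots each, 8 tasks.
tms = [f"tm-{i}" for i in range(4)]
tasks = [f"task-{i}" for i in range(8)]
print(slots_used(assign_sequential(tasks, tms, 4)))
print(slots_used(assign_spread_out(tasks, tms, 4)))
```

With these assumed numbers, the sequential strategy packs all eight tasks onto two TaskManagers, while the spread-out strategy places two tasks on each of the four. If TaskManagers were requested on demand instead, only two would exist when the tasks are placed, which is why preallocation comes first.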