Ray on Spark
Ben Wilson, Databricks
Jiajun Yao, Anyscale
Databricks 2023

Who we are
- Ben Wilson: works with ML open source software at Databricks; MLflow maintainer.
- Jiajun Yao: software engineer at Anyscale; Ray committer.

Agenda
- What is Ray
- What is Ray-on-Spark
- Why Ray-on-Spark
- How to use Ray-on-Spark
- Demos
- How does Ray-on-Spark work
- Future work

What is Ray
- An open-source unified distributed framework that makes it easy to scale AI and Python applications.
- An ecosystem of Python libraries (for scaling ML and more).
- Makes distributed computing easy and accessible to everyone.
- Runs anywhere: laptop, public cloud, Kubernetes, on-premise.
- A general-purpose framework for distributed computing, with a library and application ecosystem built on Ray core.

Plain Python function and class:

    import numpy as np

    def read_array(file):
        # Read ndarray "a" from "file".
        return a

    def add(a, b):
        return np.add(a, b)

    a = read_array(file1)
    b = read_array(file2)
    sum = add(a, b)

    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter()
    c.inc()
    c.inc()

The same code distributed with Ray — a function becomes a task, a class becomes an actor:

    import ray

    @ray.remote
    def read_array(file):
        # Read ndarray "a" from "file".
        return a

    @ray.remote
    def add(a, b):
        return np.add(a, b)

    a_ref = read_array.remote(file1)
    b_ref = read_array.remote(file2)
    sum_ref = add.remote(a_ref, b_ref)
    sum = ray.get(sum_ref)

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def inc(self):
            self.value += 1
            return self.value

    c = Counter.remote()
    c.inc.remote()
    c.inc.remote()

- High-level libraries that make scaling easy for both data scientists and ML engineers.
- Community: 1,000+ organizations using Ray, 25,000+ GitHub stars, 5,000+ repositories depending on Ray, 820+ community contributors.

What is Ray-on-Spark
A library to deploy Ray clusters on Spark and run Ray applications, built on Ray core.

Why Ray-on-Spark
- User asks: Spark users want to use both Spark MLlib and Ray ML libraries (e.g. RLlib).
- Cost: share the same physical cluster between Ray and Spark applications.
- Easy to manage: no need to manage two separate physical clusters.

How to use Ray-on-Spark

Install Ray:

    %pip install ray[all]==2.3.0

Start a Ray cluster:

    import ray
    ray.util.spark.setup_ray_cluster(num_worker_nodes=5)

Run Ray applications:

    ray.init()  # Connect to the previously created Ray cluster.
    # Your Ray application code.
    print(ray.nodes())

Stop the Ray cluster:

    ray.util.spark.shutdown_ray_cluster()

Demos
- Getting started
- The Ray dashboard
- Validation
- Parallel processing
- Distributed hyperparameter tuning

How does Ray-on-Spark work

Anatomy of a Ray cluster: the head node runs the driver process, the Global Control Service (GCS), and a Raylet (scheduler plus object store) hosting worker processes; each of the worker nodes #1 through #N runs its own Raylet (scheduler plus object store) hosting worker processes.

Ray-on-Spark maps this onto a Spark cluster:
- The Ray head node runs on the Spark driver node, inside the Spark driver program.
- Ray worker nodes are started by a long-running Spark job on the Spark worker nodes.
- Each long-running Spark task starts one Ray worker node and allocates to that node the full set of resources available to the task.

Future work
- Autoscaling support
- Delta data source support in Ray Data

Conclusion
- Ray-on-Spark is in Public Preview for Ray >= 2.3 with Spark >= 3.3 or Databricks Runtime >= 12.0.
- Try out Ray-on-Spark on Databricks clusters.
- Try out Ray-on-Spark on Spark standalone clusters.
- Learn more about Ray.
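The task pattern shown in the deck (`.remote()` returns an object reference immediately; `ray.get` blocks until the result is ready) is a futures API. As a rough standard-library analogy only — this is `concurrent.futures` running local threads, not Ray or a distributed cluster, and the file names and list payloads are made up for illustration — the same dataflow can be sketched like this:

```python
# Stdlib analogy for Ray's task model:
#   executor.submit(f, ...) ~ f.remote(...)  (returns a future immediately)
#   future.result()         ~ ray.get(ref)   (blocks until the value is ready)
# Runs in local threads, not on a distributed cluster.
from concurrent.futures import ThreadPoolExecutor

def read_array(file):
    # Stand-in for loading an ndarray from a file; faked with lists here.
    return [1, 2, 3] if file == "file1" else [10, 20, 30]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

with ThreadPoolExecutor() as executor:
    a_ref = executor.submit(read_array, "file1")  # like read_array.remote(file1)
    b_ref = executor.submit(read_array, "file2")  # like read_array.remote(file2)
    # Unlike Ray, futures must be resolved before being passed on; Ray
    # accepts object refs as task arguments and resolves them itself.
    sum_ref = executor.submit(add, a_ref.result(), b_ref.result())
    total = sum_ref.result()                      # like ray.get(sum_ref)

print(total)  # [11, 22, 33]
```

The key difference the comment notes is real: Ray lets you chain tasks by passing object references directly (`add.remote(a_ref, b_ref)`), building a dataflow graph the scheduler can execute across machines, which plain futures cannot do.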
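To make the "each Spark task starts one Ray worker node" mapping concrete, here is a toy calculation — not a real Ray or Spark API, just arithmetic over a hypothetical cluster shape — of how Spark task slots become Ray worker nodes under the described design:

```python
# Toy model (not a real API) of the Ray-on-Spark mapping described above:
# the long-running Spark job runs one task per requested Ray worker node,
# and each task hands its full resource share to the node it starts.

def plan_ray_cluster(num_worker_nodes, cores_per_spark_worker, spark_task_cpus):
    """Return (Spark tasks needed, CPUs per Ray worker node,
    Spark workers occupied), assuming one Ray worker node per Spark task."""
    # Concurrent task slots each Spark worker exposes.
    slots_per_spark_worker = cores_per_spark_worker // spark_task_cpus
    # Ceiling division: how many Spark workers the long-running job spans.
    spark_workers_needed = -(-num_worker_nodes // slots_per_spark_worker)
    return num_worker_nodes, spark_task_cpus, spark_workers_needed

# e.g. setup_ray_cluster(num_worker_nodes=5) on hypothetical Spark workers
# with 16 cores each and spark.task.cpus=4:
tasks, cpus_per_node, workers = plan_ray_cluster(5, 16, 4)
print(tasks, cpus_per_node, workers)  # 5 4 2
```

Under these assumed numbers, five Spark tasks each become a 4-CPU Ray worker node, packed onto two 16-core Spark workers; the real placement is whatever the Spark scheduler decides.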