The Future of Data Orchestration: Asset-Based Orchestration (PDF, 39 pages)
Asset-Based Data Orchestration
Sandy Ryza (s_ryz), Lead Engineer, Dagster Project - Elementl

Data practitioners build and maintain data pipelines

What's a data pipeline?
[Diagram: data assets connected by computations. Examples of data assets: table, file, ML model.]

Data pipelines span entire organizations
[Diagram: App DB, CRM, Marketing Analytics, Core Entities, Recommender System, Third Party, Product Analytics]

Automatically updating data assets

Why update a data asset?
- Inputs have changed

Changing inputs
- Changing constantly
- New partition every day

Why update a data asset?
- Inputs have changed
- Code has changed
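The two triggers listed so far, changed inputs and changed code, can be made concrete with a small sketch. It is plain Python, not Dagster's API: the `Asset` record, its `code_version` and `data_version` fields, and the `materialize` and `is_stale` helpers are hypothetical names chosen for illustration. The asset names are borrowed from the diagrams later in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    """Hypothetical record describing one data asset (not a Dagster class)."""
    name: str
    code_version: str                     # version of the code that computes the asset
    data_version: int = 0                 # bumped each time the asset is rebuilt
    upstream: List["Asset"] = field(default_factory=list)
    # Snapshot taken at the last rebuild: code version and upstream data versions.
    built_with_code: Optional[str] = None
    built_with_inputs: Dict[str, int] = field(default_factory=dict)


def materialize(asset: Asset) -> None:
    """Pretend to recompute the asset and record what it was built from."""
    asset.data_version += 1
    asset.built_with_code = asset.code_version
    asset.built_with_inputs = {up.name: up.data_version for up in asset.upstream}


def is_stale(asset: Asset) -> List[str]:
    """Return the reasons (possibly none) why the asset should be updated."""
    if asset.built_with_code is None:
        return ["never materialized"]
    reasons = []
    if asset.built_with_code != asset.code_version:
        reasons.append("code has changed")
    current_inputs = {up.name: up.data_version for up in asset.upstream}
    if asset.built_with_inputs != current_inputs:
        reasons.append("inputs have changed")
    return reasons


logins = Asset("logins_table", code_version="v1")
model = Asset("fraudulent_logins_model", code_version="v1", upstream=[logins])
materialize(logins)
materialize(model)
print(is_stale(model))       # []
materialize(logins)          # new upstream data arrives
model.code_version = "v2"    # business logic is updated
print(is_stale(model))       # ['code has changed', 'inputs have changed']
```

Recording what an asset was built from, rather than when a task last ran, is what lets the update decision be stated in terms of the asset itself.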
Code changes
- Updated business logic

Why update a data asset?
- Inputs have changed
- Code has changed
- Fresh data is needed

Fresh data is needed
- By 9 am daily, for exec meeting
- As soon as new data arrives

Automatically updating data assets: how?

The status quo: workflow engines
- DAG of tasks
- Run the DAG every hour/day/whatever

Workflow engines: not actually the best way to schedule data pipelines?
- Forces running in lockstep
  - Caught between doing redundant work and stale data
- Code management
  - What DAG should this new data asset be a part of?
  - Monolithic DAG objects
- Alerts when tasks fail vs. when data is late

A different way: Asset-based orchestration

Goals of asset-based orchestration
- Outcomes
  - Make data ready on time
  - Avoid redundant work
- Express scheduling in terms of the data assets
  - When does source data change?
  - How fresh do data assets need to be?
- Understand scheduling decisions

Building a pipeline
aka defining some data assets

Asset-based orchestration in Dagster

Auto-materialize policies

The root of the graph? Source assets
[Diagram: source asset]

What about code changes?

Lazy auto-materialization
[Diagram: upstream asset, downstream asset]

Freshness policies
[Diagram: timelines marked with midnight and new source data, showing runs of events_table, logins_table, and fraudulent_logins_model]

Common scenario: different freshness requirements, same upstream data
[Diagram: logins_table, logins_dashboard, fraud_model; labels: Hourly, Daily]
[Diagram: fact_table, aggregation_table_1, aggregation_table_2; labels: Hourly, Daily]

Asset-based orchestration: observability

To sum up
- Data pipeline = graph of data assets connected by computations
- Workflows are not an adequate scheduling abstraction
- Asset-based orchestration
  - Express intentions more clearly
  - Avoid redundant computations
  - Debug scheduling decisions

Sandy Ryza (s_ryz)
Thank you
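The common scenario from the deck (different freshness requirements, same upstream data) can be sketched in a few lines of plain Python. This is a simplified illustration, not Dagster's implementation or API: the `Asset` record, `max_lag_minutes`, `data_time`, `plan`, and `run` are hypothetical names, times are minutes since an arbitrary start, and the assignment of the hourly requirement to `logins_dashboard` and the daily one to `fraud_model` is an assumption, since the extracted slide text does not say which label belongs to which asset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    """Hypothetical asset record (not a Dagster class)."""
    name: str
    upstream: List[str] = field(default_factory=list)
    max_lag_minutes: Optional[int] = None   # freshness requirement, if any
    data_time: Optional[int] = None         # time of the source data it reflects


def plan(assets: Dict[str, Asset], now: int) -> List[str]:
    """Return the assets to rebuild at minute `now`, upstream before downstream."""
    to_run: List[str] = []

    def ensure_fresh(name: str, max_lag: int) -> None:
        asset = assets[name]
        fresh = asset.data_time is not None and now - asset.data_time <= max_lag
        if fresh or name in to_run:
            return
        for up in asset.upstream:   # refresh inputs only if they are too old
            ensure_fresh(up, max_lag)
        to_run.append(name)

    # Handle the strictest requirements first so shared upstream assets
    # are refreshed for the tightest lag that depends on them.
    with_policy = [a for a in assets.values() if a.max_lag_minutes is not None]
    for asset in sorted(with_policy, key=lambda a: a.max_lag_minutes):
        ensure_fresh(asset.name, asset.max_lag_minutes)
    return to_run


def run(assets: Dict[str, Asset], names: List[str], now: int) -> None:
    """Pretend to rebuild the planned assets, in the order returned by plan()."""
    for name in names:
        asset = assets[name]
        times = [assets[up].data_time for up in asset.upstream]
        # A root asset reflects data as of `now`; others inherit their oldest input.
        asset.data_time = min(times) if times else now


assets = {
    "logins_table": Asset("logins_table"),
    "logins_dashboard": Asset("logins_dashboard", ["logins_table"], max_lag_minutes=60),
    "fraud_model": Asset("fraud_model", ["logins_table"], max_lag_minutes=24 * 60),
}

first = plan(assets, now=0)
run(assets, first, now=0)
print(first)                  # ['logins_table', 'logins_dashboard', 'fraud_model']
print(plan(assets, now=30))   # []
print(plan(assets, now=90))   # ['logins_table', 'logins_dashboard']
```

At minute 90 only the dashboard and its input are rebuilt; the daily model is left alone, which is the "avoid redundant work" goal expressed through per-asset freshness requirements instead of a single DAG schedule.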