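Bringing Software Development Best Practices to the Data World with dbt
Chenyu Li, Sr Software Engineer, dbt Labs

dbt is still a relatively new product, and its community is concentrated mainly in the US and Europe, so much of the material has no Chinese translation. I will explain in Chinese as much as I can; please bear with me where I fall short.

Agenda
- Process problems in traditional data analytics
- Opportunities brought by cloud-native data warehouses
- The solution dbt wants to provide

Process problems in traditional data analytics

Legacy E-T-L: data warehouses used to be very expensive
- Data transformation happens outside the data storage layer
- Managing infrastructure is a full-time job
- Data analytics tasks are scattered among engineers, data analysts, and stakeholders

1. Engineers become the bottleneck for every change
2. Creating from scratch is easier than finding existing code
3. An untraceable change process breaks data pipelines and erodes everyone's trust in the data

Opportunities brought by cloud-native data warehouses

Modern E-L-T: cloud-native data warehouses lower cost and are easier to use
- Elastic storage and compute make in-warehouse transformation feasible
- Cloud-native architecture reduces infrastructure management
- Engineers and analysts can focus on high-return tasks, such as optimizing and building data transformation workflows

These changes made innovation in data workflows possible, and dbt took this opportunity to propose its own solution.

The solution dbt wants to provide

- Modular
- Testable
- Continuously integrated
- Documented
- Faster and more stable updates to data analytics workflows

The dbt viewpoint: Build data like developers build applications

How dbt wants data teams to work together
1. Enable anyone who knows SQL to quickly build and test data
2. Use version control to update once and deploy everywhere
3. Provide a documentation tool and auto-refreshing lineage

stg_orders -> orders

orders.sql:
    select * from {{ ref('stg_orders') }}
    where is_deleted = false

Runs in the warehouse:
    create table analytics.dev.orders as (
        select * from analytics.dev.stg_orders
        where is_deleted = false
    );

"This isn't anything new, it's how every high-quality software project is run. You expect there to be tests. You expect there to be documentation. You expect the PR process to be collaborative. You're building software together. We're just applying this to analytics code as well."

A centralized environment for collaborative development
- Develop: IDE or CLI; Modular SQL; No DDL/DML; Pre-built packages
- Document: Dependency management; Auto-generate DAG; Auto-updated docs
- Test: Schema tests; Data value testing; Pre-packaged tests for complex logic
- Deploy: Job scheduling; CI/CD; Version control; Logging & alerting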
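The orders.sql example above relies on two ideas: ref() is replaced with an environment-specific table name, and the refs themselves define the run order. The Python below is a minimal sketch of those two ideas only, not how dbt is implemented. The names `MODELS`, `REF_PATTERN`, `dependencies`, `run_order`, and `compile_model`, and the `raw.orders` source table, are hypothetical and exist only for illustration.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model files: model name -> SQL that may call ref().
# raw.orders is an assumed source table, not something named in the talk.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "orders": "select * from {{ ref('stg_orders') }} where is_deleted = false",
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def dependencies(sql: str) -> set[str]:
    """Return the model names that a SQL string references through ref()."""
    return set(REF_PATTERN.findall(sql))


def run_order(models: dict[str, str]) -> list[str]:
    """Order models so that each one runs after every model it refs."""
    graph = {name: dependencies(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


def compile_model(name: str, models: dict[str, str], database: str, schema: str) -> str:
    """Swap each ref() for a qualified table name and wrap the SELECT in DDL."""
    select = REF_PATTERN.sub(
        lambda match: f"{database}.{schema}.{match.group(1)}", models[name]
    )
    return f"create table {database}.{schema}.{name} as (\n  {select}\n);"


if __name__ == "__main__":
    print(run_order(MODELS))
    print(compile_model("orders", MODELS, "analytics", "dev"))
    print(compile_model("orders", MODELS, "analytics", "prod"))
```

Calling `compile_model` with `"dev"` and then `"prod"` produces the two `create table` statements shown on the slides from the same model source, which is the point of writing `ref()` instead of a hard-coded schema.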
Develop

Develop faster with SELECT statements (declarative)
- Express business logic in SQL
- Includes several materializations: Table, View, Incremental, Snapshot

orders.sql:
    select * from analytics.dev.stg_orders
    where is_deleted = false

Runs in the warehouse:
    create table analytics.dev.orders as (
        select * from analytics.dev.stg_orders
        where is_deleted = false
    );

Develop faster without having to think about run order
- Run the same code in dev, test, and prod; the correct schema is resolved for you
- Dependencies are built automatically so you can focus on modeling, not run order

stg_orders -> orders

orders.sql:
    select * from {{ ref('stg_orders') }}
    where is_deleted = false

Runs in the warehouse (dev):
    create table analytics.dev.orders as (
        select * from analytics.dev.stg_orders
        where is_deleted = false
    );

Runs in the warehouse (prod):
    create table analytics.prod.orders as (
        select * from analytics.prod.stg_orders
        where is_deleted = false
    );

Macros: a sandbox environment to execute user logic
- Abstract snippets of SQL into reusable macros; these are analogous to functions in most programming languages
- Use control structures (e.g. if statements and for loops) in SQL
- Use environment variables in your dbt project for production deployments
- Operate on the results of one query to generate another query

Use packages to skip boilerplate code
- Apply industry-standard code to your project
- Check out the dbt Packages Hub
- Akin to Python libraries
- Focus on unique business logic rather than implementing something people have already solved
- Types of packages:
  - Transforming data from a structured SaaS dataset
  - dbt macros that answer "How do I do this in SQL?" (e.g. dbt_utils.equal_rowcount, date conversion)
  - Auditing & testing

Document

Maintain shared understanding with auto-updating lineage

Test

Preserve quality by testing in-line
- Test assumptions about data, and the validity of transformations
- Custom and out-of-the-box tests, including:
  - Uniqueness
  - Null values
  - Certain values
  - Is a valid foreign key to another table

Deploy

Deploy seamlessly with version control and CI/CD
- Version control: integrate with the git provider of your choice
- Continuous integration / continuous deployment: minimize wasteful runs by testing only changes
- Logging & alerting: job scheduling and alerting

Thank you! Questions?
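As a follow-up to the testing slide: dbt expresses a test as a query that returns failing rows, and the test passes when nothing comes back. The Python below is a rough sketch of the four out-of-the-box checks listed there (uniqueness, null values, certain values, valid foreign key), following the same "return the failures" convention. It is an illustration under assumptions, not dbt's implementation; the function names and the sample `orders` and `customers` rows are hypothetical.

```python
def unique(rows, column):
    """Return non-null values that appear more than once in the column."""
    seen, duplicates = set(), []
    for row in rows:
        value = row[column]
        if value is None:
            continue
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def not_null(rows, column):
    """Return the rows whose value in the column is missing."""
    return [row for row in rows if row[column] is None]


def accepted_values(rows, column, values):
    """Return non-null values that fall outside the allowed set."""
    return [
        row[column]
        for row in rows
        if row[column] is not None and row[column] not in values
    ]


def relationships(rows, column, parent_rows, parent_column):
    """Return non-null child keys that have no matching parent key."""
    parent_keys = {parent[parent_column] for parent in parent_rows}
    return [
        row[column]
        for row in rows
        if row[column] is not None and row[column] not in parent_keys
    ]


# Hypothetical sample data with one deliberate failure per check.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 1, "status": "placed", "customer_id": 1},
    {"order_id": 2, "status": "shipped", "customer_id": 2},
    {"order_id": 2, "status": "lost", "customer_id": 3},
    {"order_id": None, "status": "placed", "customer_id": 1},
]

if __name__ == "__main__":
    print(unique(orders, "order_id"))
    print(not_null(orders, "order_id"))
    print(accepted_values(orders, "status", {"placed", "shipped"}))
    print(relationships(orders, "customer_id", customers, "customer_id"))
```

Each function returns an empty list when the assumption holds, so a caller can treat any non-empty result as a failed test.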