Multi-cloud data governance on the Databricks Lakehouse
Ioannis Papadopoulos, Databricks
Volker Tjaden, Databricks
Databricks 2023

Data governance as answers to questions
The questions that data governance is attempting to address:
- Who has access to what data?
- How do we ensure we can trust the data?
- How can we prove the validity of the insights generated by the data?

Data governance defined as questions
The Privacy and Security Dimension
- Who has access to what data? When? How?
Diagram labels: Principal, Privilege, Securable, Entitlement

Data governance in the real world
Constraints that do matter in implementing a data governance system:
- Multiple cloud providers
- Multiple geo locations
- Multiple data types
- Multiple technology stacks

Data governance features of the Databricks Lakehouse

Centralised governance with Unity Catalog
Diagram: Cloud Storage (S3, ADLS, GCS) containers/buckets; Unity Catalog with Audit Log, Account Level User Mgmt, Credentials, Metastore, Lineage Explorer, ACL Store, Data Explorer and Access Control; Identity Provider; Databricks Workspace; Delta Sharing Server and Delta Sharing Client; users; short-lived token; access denied (X)

Where do identities and entitlements live?
Diagram: two Databricks Workspaces, each with Clusters and SQL Warehouses; Unity Catalog with User Management, Metastore and Access Control

Databricks and cloud provider hierarchies
A Databricks account per cloud provider
Diagram, GCP: Organization, Project (x2), Databricks Workspace (x2), Databricks Account, Account Console
Diagram, Azure: AAD Tenant, Azure Subscription (x2), Databricks Workspace (x2), Databricks Account, Account Console
Diagram, AWS: Organizational Unit, AWS Account (x2), Databricks Workspace (x3), Databricks Account, Account Console

The Databricks account console
Diagram labels: Cloud 1, Cloud 2, Cloud 3, User/group
sync, SCIM API, Identity federation

Provisioning identities in Accounts and Workspaces
Diagram: Identity Provider (Group 1, Group 2, Group 3, Group 4); Unity Catalog Account Console with Account Level User Mgmt (Group 1, Group 2, Group 4); Databricks Workspaces 1, 2 and 3 (workspace group labels: Group 1, Group 1, Group 4, Group 2, Group 4)
- Admin grants permissions for users to access the workspaces

Single sign on in a multi-cloud context
Azure (AAD) AWS example
Diagram, AWS: Account Console, Organizational Unit, AWS Account (x2), Databricks Workspace (x3), Databricks Account
Diagram, Azure: AAD Tenant, Azure Subscription (x2), Databricks Workspace (x2), Databricks Account, Account Console
Diagram labels: the Databricks account console (shown twice)
- Provision users with SCIM
- Set up SSO with OIDC or SAML
- Assign users to workspaces
- Set up SSO with SAML

Unity Catalog Metastores
Diagram: Databricks Account; (Unity) Metastore (x2); Databricks Workspaces assigned to a metastore; Catalog (x2); Schema (Database) (x2); Managed Table, External Table, View

Unity Catalog
Diagram labels: External table, Volume, Model

    SELECT * FROM catalog1.database1.table1;
    /Volumes/volumeName
    GRANT USE SCHEMA ON SCHEMA sales TO `account users`;
    GRANT SELECT ON TABLE sales.orders TO marketing_team;

Row level security and column level masking
Provide differential fine grained access to datasets

- Assign reusable filter to table
- Specify filter predicates
- Test for group membership
- Only show specific rows

    CREATE FUNCTION us_filter(region STRING)
      RETURN IF(IS_MEMBER('admin'), true, region = 'US');
    ALTER TABLE sales SET ROW FILTER us_filter ON (region);

General form: CREATE FUNCTION (...) RETURN filter clause whose output must be a boolean

- Assign reusable mask to column
- Specify mask or function to mask
- Test for group membership
- Mask or redact sensitive columns

    CREATE FUNCTION ssn_mask(ssn STRING)
      RETURN IF(IS_MEMBER('admin'), ssn, '*');
    ALTER TABLE users ALTER COLUMN table_ssn SET MASK ssn_mask;

General form: CREATE FUNCTION (...) RETURN expression with the same type as the first parameter

Delta Sharing
Crossing the boundaries between regions and clouds
Diagram: DATA PROVIDER (Delta Lake, Delta files in cloud storage, Delta Sharing Server); DATA CONSUMER (Delta Sharing Connectors, activation link)
- Request table
- Pre-signed short-lived URLs
- Temporary direct access to files (Delta format) in the object store: AWS S3, GCP, ADLS

Databricks-to-Databricks Delta Sharing
Crossing the boundaries between regions and clouds
Diagram labels: Cloud 1/Region 1, Cloud 2/Region 2, Unity Catalog, Metastore 1, Metastore 2, Provider, Recipient, Share, Catalog (r/o), schemas, tables, managed data container, external data container
Numbered steps (bracketed names stand for the slide's Recipient, Share and Provider icons):
1. Create [recipient] (Metastore 2 UUID)
2. Create [share]
3. Add tables to [share]
4. Grant "select" on [share] to [recipient]
5. Enable network access for consumers in cloud region 2
6. List [provider] and [share] (automatically available after recipient creation)
7. Create catalog for [provider] and [share]
8. Grant access to "Metastore 2" users on schemas/tables

Automating with Terraform
Modules related to Data Governance

Multi-cloud data governance in practice

Implementing multi-cloud data governance
A stylized example
- Multinational company
- ML is becoming central to every aspect of their business
- Regions may operate on different clouds
- Data is managed centrally, but owned locally

Organizational setup
Cross-regional collaboration on data, models and code
- Cloud Platform Team: provisions infrastructure
- ERP Data Team: provisions data
- BU 1 Data Team: owns and enriches data
- BU 2 Data Team: owns and enriches data
- ML Team: develops and deploys models
Diagram labels: USA, EU, Central IT, BU 1, BU 2, Tools, Data

Architecture overview
Diagram: Unity Catalog spanning AWS/US-East-1 and Azure/West Europe; ERP; Metastore 1 with ERP Catalog and BU 1 Catalog; Metastore 2 with ERP Catalog and BU 2 Catalog; in each region Tables, Features, Models and other sources (IoT, media); feature engineering, model training & CICD; model serving/monitoring

Cloud platform team
- Manages AWS organization and Azure tenant
- Utilizes
Terraform templates to provision Databricks workspaces for subsidiaries
- Serves as the metastore administrator
Uses Terraform for efficient standard-compliant resource provisioning

Data sharing within region & cloud
ERP Data Team
- Ingests ERP data into Unity Catalog
- Defines row-level access controls to separate READ permissions between the different BUs in ERP catalog
BU 1 Data Team
- Ingests data from additional sources (IoT, video, ...) into BU 1 catalog
- Builds and deploys reports based on enriched ERP data
- Builds and owns enriched input data for ML feature tables
Data governance on the Unity Catalog metastore

Data sharing across cloud & region
ERP Data Team
- Creates share in UC metastore 1
- Creates recipient for UC metastore 2 and adds it to share
- Adds data from ERP catalog to share
- Creates catalog for shared ERP data in UC metastore 2
- Defines row-level permissions in ERP catalog in UC metastore 2
BU 2 Data Team
- Ingests
data from additional sources (IoT, video, ...) into BU 2 catalog
- Builds and deploys reports based on enriched ERP data
- Builds and owns enriched input data for ML feature tables
Databricks-to-Databricks sharing connects metastores

ML team
- Develops models and moves them to production
- Treats models as code
- Employs a branching strategy to separate dev/staging/prod
- Runs CI tests in one cloud
- In each cloud/region:
  - Refreshes feature tables
  - Trains models
  - Runs CD tests
  - Deploys and monitors models
Implements MLOps best practices across regions and clouds

MLOps multi-cloud architecture
Diagram, AWS/US-East-1: Model registry (Stage: None, Stage: Staging, Stage: Production); Dev Environment (Tables, Features, data exploration, inference & serving, feature table refresh, model training; dev branch); Staging Environment (Tables, Features, unit tests (CI), integration tests (CD); dev branch); Production Environment (Features, feature table refresh, model training, Continuous Deployment (CD), inference & serving; release branch)
Diagram, Azure/West Europe: Model registry (Stage: Staging, Stage: Production); Production Environment (Tables, Features, feature table refresh, model training, Continuous Deployment (CD), inference & serving; release branch)
Diagram, branches: dev, staging (main), release; merge request; pull release branch to production

Key takeaways
Summary
- Identities: SCIM and OIDC are your friends
- Entitlements: need to be synced alongside shared data
- Data: Delta Sharing forms the foundation in a multi-cloud context
- ML: treat models as code and sync input data across environments

Related presentations at DAIS
More Data Governance, Lakehouse Architecture & Data Sharing (Wednesday, Thursday)
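
The Databricks-to-Databricks sharing flow described in the deck (create a share, create a recipient for the second metastore, add tables, grant access, then create a catalog on the receiving side) could be sketched in Databricks SQL roughly as follows. This is a minimal sketch under stated assumptions, not the presenters' code: the names erp_share, bu2_metastore, erp_catalog.finance.orders, erp_provider, erp_shared and bu2_data_team are all hypothetical, and the sharing identifier is left as a placeholder to be filled in from the recipient metastore.

```sql
-- Provider side: run in a workspace attached to Unity Catalog metastore 1.
-- All object names below are hypothetical and chosen for illustration only.

-- Create the share and add a table from the ERP catalog to it.
CREATE SHARE IF NOT EXISTS erp_share COMMENT 'ERP data shared with BU 2';
ALTER SHARE erp_share ADD TABLE erp_catalog.finance.orders;

-- Create a Databricks-to-Databricks recipient from the sharing identifier
-- of metastore 2 (placeholder, to be replaced with the real value).
CREATE RECIPIENT IF NOT EXISTS bu2_metastore
  USING ID '<sharing-identifier-of-metastore-2>';

-- Allow the recipient to read everything in the share.
GRANT SELECT ON SHARE erp_share TO RECIPIENT bu2_metastore;

-- Recipient side: run in a workspace attached to Unity Catalog metastore 2.

-- Inspect the provider and the shares it exposes.
SHOW PROVIDERS;
SHOW SHARES IN PROVIDER erp_provider;

-- Mount the share as a read-only catalog, then grant local users access.
CREATE CATALOG IF NOT EXISTS erp_shared USING SHARE erp_provider.erp_share;
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG erp_shared TO `bu2_data_team`;
```

Enabling network access for consumers in the second cloud region is done outside SQL and is not shown here. The final grant applies only within metastore 2, which is in line with the deck's point that entitlements have to be synced alongside shared data.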