The 7 Requirements of a Semantic Layer

David P. Mariani, Chief Technology Officer, Founder at AtScale

Dave Mariani is the co-founder and CTO of AtScale. He is a hands-on technology executive with over 25 years of experience in delivering Big Data, consumer Internet, Internet advertising and hosted services platforms, creating nearly $800M in company exits.

White Paper

The Semantics of a Semantic Layer

Introduction

I co-founded AtScale to focus on the challenges of supporting a large number of data analysts working on disparate sets of data managed in a massive lake. We borrowed the term “semantic layer” from the folks at Business Objects, who originally coined it in the 1990s. The term was actually over 20 years old when we adopted it.

So what is a semantic layer exactly? If you Google the term, the following definition will pop up, which is a pretty darn good definition in my opinion (Google's highlighted
words, not mine):

Wikipedia defines a semantic layer as a business representation of data that allows end users to access data autonomously. Everyone can agree that a business-friendly view of data that provides users with self-service access to analytics is desirable: true data democratization. It's easy to see why it is fundamental to scaling data and analytics. The challenge is actually implementing a semantic layer in a way that just works.

We began building the AtScale semantic layer after working on big data from the trenches. We had to deal with the basic challenges of data scalability, query performance, metrics sprawl, complicated data pipelines and shadow business intelligence (BI). While the challenges seemed obvious to us, most of the industry was preoccupied with shifting data gravity to the cloud. With cloud data re-platforming in full swing, we are finally seeing attention turning to the
last mile of enterprise analytics, with the semantic layer topic surging in popularity.

    “A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.”
    Semantic layer - Wikipedia, https://en.wikipedia.org/wiki/Semantic_layer

Cloud giants like Google and Snowflake, unicorns like dbt Labs and a host of venture-backed startups are now talking about this critical new layer in the data and analytics stack. Some call it a “metrics layer”, a “metrics hub” or “headless BI”, but most call it a “semantic layer”. I can't tell you how happy it makes me that the industry is finally recognizing the importance of the semantic layer in a modern, cloud-first analytics stack. I couldn't agree more that a logical, business-friendly view of data is what's needed to make analytics accessible to everyone, not just data engineers and SQL jockeys.

While it might be just a matter of semantics, I prefer “semantic layer” over “metrics layer”, “metrics hub”, or “headless BI”. I think that the term “semantic layer” best describes this business-friendly data interface because it covers all types of data and use cases.

For example, the terms “metric store” and “metric layer” ignore the concept of “dimensions” altogether. Take a look at just about every BI tool on the market (e.g. Tableau, Looker, Power BI) and they all include measures (or metrics) and dimensions in their interfaces. Metrics measure something, while dimensions (e.g. “product”, “time” and “location”) categorize data so that metrics can be grouped or aggregated. So, terms using “metric” are confusing and don't map to how these layers will be consumed.

The term “headless BI” is also problematic because it only covers business intelligence use cases. A universal data layer is useful to more than just business analysts and BI. Data scientists need to access a consistent, business-friendly interface to data for building and training their models. Furthermore, application developers who are building data-driven applications also need interfaces to data. As such, the term “headless BI” is inadequate because it only covers a single use case: business intelligence.

More Than Just Metrics

There's a reason why independent semantic layers have taken time to come to market: building a semantic layer is hard. Yes, a semantic layer serves as a common metrics store or single source of truth, but there's much more to it than that. For a semantic layer to be viable, it needs to:

• Support any query tool, interface or protocol with a live connection to data
• Express the most complex business logic (serve as a digital twin) using a semantic model
• Deliver queries in under 2 seconds
• Govern access to data for every query
• Connect to any backend data store

The following diagram illustrates the core capabilities of a semantic layer:

The 7 Requirements of a Semantic Layer
[Diagram: the seven core capabilities of a semantic layer]
• Consumption Integration: connect consumption tools to data live
• Data Integration: connect to data platforms, abstract location and format of data
• Semantic Modeling: create new data products
• Multi-dimensional Calculation Engine: execute sophisticated cell-based expressions
• Data Prep Virtualization: push down data transformations
• Performance Optimization: optimize performance and cost
• Analytics Governance: enforce access control and data policies

Chapter 1: Consumption Integration

For a semantic layer to be truly universal, it needs to support “live” query connections for all user personas and for all popular query tools and programming interfaces.

More Than Just BI

A universal data layer is useful to more than just the business analyst and business intelligence personas. A semantic layer must also serve the needs of the data scientist and application developer.

Let's start with the data science persona. Like business analysts, data scientists also need access to consistent, business-friendly data for building and training their machine learning models. In addition to the ability to read (or consume) the semantic layer, data scientists also need to write their predictions and features back to the semantic layer. By supporting both reading and writing, the semantic layer and underlying semantic model become the bridge that spans the traditional business analytics and data science silos. The image below illustrates how the semantic layer unifies the workflows of the business analyst and data science personas:

[Figure labels: raw data, Semantic Model, managed features, AI/ML Tools, AI/ML, Business Intelligence, Data Product]
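The read-and-write-back workflow described above can be sketched in a few lines of Python. Everything here is hypothetical and in-memory: the `SemanticModel` class, its `read`, `write_feature` and `read_feature` methods and the sample rows are illustrative stand-ins, not AtScale's API. The point is only that both personas go through the same model, so a prediction written by a data scientist becomes visible to a business analyst.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticModel:
    """Toy in-memory stand-in for a shared semantic model (hypothetical)."""
    rows: list
    features: dict = field(default_factory=dict)

    def read(self, metric, by):
        """Sum a metric grouped by one dimension, as a BI tool would request it."""
        totals = {}
        for row in self.rows:
            totals[row[by]] = totals.get(row[by], 0) + row[metric]
        return totals

    def write_feature(self, name, values):
        """Store a data-science output (e.g. a prediction) keyed by dimension member."""
        self.features[name] = dict(values)

    def read_feature(self, name):
        """Return a previously written feature."""
        return self.features[name]


model = SemanticModel(rows=[
    {"state": "CA", "sales": 100},
    {"state": "CA", "sales": 50},
    {"state": "TX", "sales": 80},
])

# Data scientist: read governed data, then write a deliberately naive
# forecast (10% growth, integer arithmetic) back to the model.
sales_by_state = model.read("sales", by="state")
forecast = {state: total + total // 10 for state, total in sales_by_state.items()}
model.write_feature("sales_forecast", forecast)

# Business analyst: the forecast is now available next to the original metric.
print(sales_by_state)
print(model.read_feature("sales_forecast"))
```

A real semantic layer would persist such features in the underlying data platform and apply governance rules on both the read and the write path; the sketch leaves both out.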
Key Takeaway: A semantic layer must support multiple consumer personas, including business analysts, data scientists and application developers, to deliver the full spectrum of data access and analysis.

More Than Just SQL

SQL was a godsend to database programmers because it became a standard for structured data access for a variety of data platforms. Since a semantic layer exists to provide data access to everyone, not just programmers and data engineers, SQL-only access limits users to tools that speak SQL or those
who can write SQL. While most tools do speak SQL, some tools like Excel (the most popular BI tool on the planet) and Power BI don't play nice with SQL. Rather, these tools prefer to speak in their native, dimensional dialects using MDX (Excel) and DAX (Power BI). Data scientists prefer to speak to their data using Python and data frames, while application developers may prefer using REST, JDBC or ODBC interfaces.

For example, the following queries all answer the same question, “How many water bottles did I sell by state in the US?”:

SQL (from Tableau)

    SELECT "Internet Sales"."CountryCity" AS countrycity,
           "Internet Sales"."Product Name" AS product_name,
           "Internet Sales"."State" AS state,
           SUM("Internet Sales"."orderquantity1") AS sum_orderquantity1_ok
    FROM "sales insights-snowflake"."internet sales" "Internet Sales"
    WHERE ("Internet Sales"."CountryCity" = 'United States')
      AND ("Internet Sales"."Product Name" = 'Water Bottle - 30 oz.')
    GROUP BY 1, 2, 3

MDX (from Excel)

    SELECT NON EMPTY Hierarchize(DrilldownMember(
        {{DrilldownLevel({[Geography Dimension].[Geography City].[All]},,,INCLUDE_CALC_MEMBERS)}},
        {[Geography Dimension].[Geography City].[CountryCity].&[United States]},,,INCLUDE_CALC_MEMBERS))
      DIMENSION PROPERTIES PARENT_UNIQUE_NAME, HIERARCHY_UNIQUE_NAME ON COLUMNS
    FROM (SELECT ({[Geography Dimension].[Geography City].[CountryCity].&[United States]}) ON COLUMNS
          FROM [Internet Sales])
    WHERE ([Product Dimension].[Product Dimension].[Product Line].&[S].&[S&28].&[477],
           [Measures].[orderquantity1])
    CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS

DAX (from Power BI)

    EVALUATE
    TOPN(
      1001,
      CALCULATETABLE(
        ADDCOLUMNS(
          KEEPFILTERS(
            ADDCOLUMNS(
              KEEPFILTERS(
                FILTER(
                  KEEPFILTERS(
                    SUMMARIZE(
                      VALUES('Geography Dimension'),
                      'Geography Dimension'[City.Key0],
                      'Geography Dimension'[City.Key1],
                      'Geography Dimension'[City]
                    )
                  ),
                  NOT(ISBLANK('CubeMeasures'[orderquantity1]))
                )
              ),
              "orderquantity1_City_Key0", 'CubeMeasures'[orderquantity1]
            )
          ),
          "orderquantity1", [orderquantity1_City_Key0]
        ),
        KEEPFILTERS(
          FILTER(
            KEEPFILTERS(VALUES('Geography Dimension'[CountryCity.Key0])),
            'Geography Dimension'[CountryCity.Key0] = "United States"
          )
        )
      ),
      [orderquantity1_City_Key0], 0,
      'Geography Dimension'[City], 1,
      'Geography Dimension'[City.Key0], 1,
      'Geography Dimension'[City.Key1], 1
    )
    ORDER BY
      [orderquantity1_City_Key0] DESC,
      'Geography Dimension'[City],
      'Geography Dimension'[City.Key0],
      'Geography Dimension'[City.Key1]

As you can see, even though the questions and answers are the same, these tools produce wildly different queries in their native dialects. A semantic layer should handle all these dialects (and more), deliver the same sub-second query performance, apply the same governance filters and, of course, return the same results.

In addition to business analysts and data scientists, application developers need simple interfaces into data to build data-driven applications. By addressing all three personas, a semantic layer can deliver all four flavors of analytics, from descriptive and diagnostic (business analysts) to predictive (data scientist) to prescriptive (data scientist, application developer), becoming the unifying thread underpinning a full range of analysis and personas.

For a semantic layer
to be universal, it must bring data to its consumers, and that means speaking the native language of each end user's tool of choice, whether that user is a business analyst, data scientist or application developer.

Key Takeaway: A semantic layer must support multiple inbound languages to support a wide range of
data consumers using their preferred protocols. Semantic layer solutions that only support SQL or JavaScript are unsuitable to serve as endpoints for a variety of popular consumption tools.

Zero Footprint

A semantic layer can't deliver on its full potential if it's not accessible and usable by everyone. In order to reach the largest number of users, a semantic layer shouldn't require additional client-side software to make it work. This is harder than it seems because custom drivers or plug-ins are usually required to make query tools and applications work with most data platforms. A well-designed semantic layer will leverage each query tool's built-in connectivity for accessing the semantic layer. For example, AtScale uses the built-in SQL Server Analysis Services (SSAS) drivers in Excel and Power BI to connect to the AtScale semantic layer. This means that anyone with Excel can connect “live” to the AtScale semantic layer without any added software requirements.

Key Takeaway: A semantic layer should not require the IT team to install additional client software on query consumers' machines.

Data for Everyone

Besides serving as a metrics hub, a semantic layer provides a business-friendly interface to data for all consumer personas, tools and dialects. A semantic layer truly democratizes data access by turning everyone into a data-driven decision-maker. In my next post, part three of seven, we'll dive into the semantic layer's semantic data model for mapping your digital assets to the business.

Chapter 2: Semantic Modeling

For a semantic layer to function, it must map the physical data objects to the logical business constructs, creating a digital twin of the business while serving as the graph-based query planner and optimizer.

Write Once, Re-Use Many

The days of centrally managed, monolithic data pipelines are over. Data moves and changes too fast for a single team to keep up with the demands of the business. At the other end of the spectrum, business users creating their own data pipelines also proved problematic. After all, business users aren't data engineers, but they are business domain experts. New approaches, like the data mesh or a hub and spoke model, seek to create a modern, distributed architecture for analytical data management that alleviates the traditional bottlenecks while putting business definitions in the hands of business domain experts. The illustrations below show an example of how domain-oriented data ownership works using a data mesh approach to analytics data management.

[Figure: domain-oriented data ownership in a data mesh. Legend: O = an operational capability or operational data access provided and owned by the domain; D = analytical data provided and owned by the domain; Domain Bounded Context (teams and systems).]
Source: Zhamak Dehghani, Data Mesh Principles and Logical Architecture

The semantic data model is a key component for delivering this decentralized strategy for analytics data management because it gives the domain experts the environment to create their data
products and share them across the organization. With a semantic data modeling platform that promotes object-oriented model definitions, reusability and sharing, business domain experts or data engineers can free themselves of manual data engineering tasks and combine models and components to create new analytics products. To support sharing and reuse, a semantic data platform must support role-based security and the sharing of model components for creating data products from multiple data domains.

[Figure: example domains (Users, Artists, Podcasts, Media streams) and the operational capabilities (O) and analytical data (D) each one owns, e.g. Register user (O), User profiles (D), Onboard artist (O), Daily streams by artists (D), Release podcast episode (O), Top podcasts daily (D).]

Key Takeaway: A semantic data model must support a hub and spoke style for creating data products using an object-oriented data modeling language that allows subject matter experts to own and share data model components across teams.

Server-side Data Blending

Data was already siloed before the cloud revolution, but with the proliferation of cloud data lakes, cloud data warehouses and SaaS applications, data lives in more places than ever before. A web of proprietary and incompatible data APIs makes matters even worse for users looking to create data products. A semantic layer platform can break down these data silos and abstract away the location and format of the data. At the same time, modelers can create mashups of the data that span multiple locales and platforms to create new, composite data products that can serve as the digital twin of the business. By creating these logical views server-side instead of in the consumption tools, these blended data views can be shared across multiple user personas, whether they are business analysts, data scientists, or application developers.

Key Takeaway: A semantic data model must break down data silos by blending data sources on the server side to create rich, composite views across multiple business domains.

Graphical & CI/CD Friendly

The people who understand their business domain and the data that feeds it are usually the best authors of the semantic data model. By leveraging their knowledge of the physical (how the data is stored and structured) and the logical (business terms and calculations), they can create data products that are consumable by a wide range of downstream users. The modeler persona is not always a business analyst, though. Sometimes, a data engineer is the best person to own a particular data domain and thereby own the semantic data model. In order to support both types of model authors, business users and data engineers, a data modeling platform must support both visual and code-based model definitions. In order to support a decentralized analytics data management style, CI/CD is critical to supporting a scalable workflow that enables multiple data domain owners and model contributors.

Key Takeaway: A semantic layer platform should support both code-based and graphical data modeling to allow engineers and non-engineers alike to build and collaborate on data models and data modeling components.

The Power of the Semantic Model

Besides serving as a metrics hub, a semantic layer powered by a semantic data model creates a digital twin of your business. By harnessing the power of the subject matter expert, the semantic data model serves as the Rosetta Stone for the enterprise, providing a business-friendly access layer to data for everyone.

Chapter 3: Data Prep Virtualization

Data transformations are a necessary requirement for a semantic layer. A semantic layer platform should support virtualized calculations for expressing business logic and be capable of generating multi-pass SQL queries to handle calculations that require different levels of granularity, like ratios and weighted averages.

Openness

Data platforms, especially cloud data platforms, have come a long way from their origins as relational databases. Modern cloud data platforms allow customers to use custom SQL to store and access nested, semi-structured data and extend their platform's functionality with user-defined and custom aggregation functions. With a near constant flow of new, powerful functionality, it's critically important that a semantic layer platform supports native data platform dialects for defining data transformation rules, calculations and expressions. By passing these expressions down to the underlying data platforms for execution, customers can enjoy the data platform's full range of capabilities and scale their data transformations by keeping them close to the data, server-side.

For example, the native Snowflake PARSE_JSON function can be used to find the “Sales Person” from a column called “SALES_INFO” that contains JSON data about a sale.

By taking
advantage of the data platform's native functionality, data modelers can leverage the syntax of their underlying data platform without needing to learn another language.

Key Takeaway: A semantic data model must support data transformation expressions using the native platform's SQL dialect.

Multipass

Sometimes a single-pass formula is not expressive enough to support calculations that combine data at different levels of granularity. For example, a weighted average calculation like “Average Interest Rate” or a ratio calculation like “Sales per Order” requires data to be combined separately for a numerator and a denominator. In order to calculate these types of expressions, the semantic layer platform needs the ability to aggregate data first and then perform the final calculation on the aggregated results, requiring ordered operations and multi-pass queries.

The Multidimensional Expressions language, or MDX, is ideally suited for supporting these types of expressions. In addition to supporting calculations at different levels of granularity, MDX is perfect for creating time-relative and cell-based expressions, such as a formula that calculates a “30 day moving average of Sales” using a Retail 445 calendar.

Key Takeaway: A semantic data model must support the ability to perform pre-query and post-query calculations for handling calculations that summarize data at different levels of granularity.

Virtualized

Data wrangling tools and ETL/ELT platforms are familiar to most data engineers. These tools are meant to move data from point A to point B while transforming data for the purposes of cleansing, correcting, combining or just calculating new values. For most use cases, creating new tables or files with transformed data is overkill and adds complexity to data pipelines. Data virtualization can automate most data transformation tasks without data movement and with the added benefit of leveraging the power of the data platform for performing these transformations. By pushing calculations down to the underlying data platform without physical data movement, subject matter experts can
create flexible, documented data transformations without writing complex code.

For example, the “sales_reasons” field can be “cleaned” by creating a new virtual column that replaces its NULL values with an “Unknown” string.

Key Takeaway: A semantic layer platform should support inline data
transformations using direct queries without data movement or creating copies of data.

The Power of Data Prep Virtualization

Besides serving as a metrics hub, a semantic layer is a central repository for business logic and calculations. By supporting complex, multipass data transformation expressions, the semantic data model can take the place of physical ETL data pipelines. By harnessing the power of the subject matter expert, the semantic layer platform can become the digital twin of the business while avoiding dependencies on data engineers and SQL experts.

Chapter 4: Multi-dimensional Calculation Engine

For a semantic layer to function, it must translate the inbound, logical queries coming from the data consumers into physical SQL queries in the dialects of the underlying data platforms.

Tabular and Multidimensional

The semantic layer data model is ideally suited to define the digital version of the business, but it needs to be capable of expressing a wide variety of business concepts in a variety of contexts. There are two types of semantic layers, or models, to consider: a tabular semantic layer and a multidimensional semantic layer.

The tabular or relational model was popularized by modeling gurus like E. F. Codd and Ralph Kimball in the 70s and 80s. These modeling techniques rely on concepts like fact and dimension tables and are meant to make a relational database or data warehouse easier to query. A tabular view of the data is useful since it presents data as a simple, flattened view in rows and columns. This format is ideally suited for SQL-based use cases and tooling using common protocols like JDBC and ODBC.

The multidimensional data model goes one step further, though. By defining relationships and aggregation rules, the multidimensional semantic model adds a business-friendly context and makes hand-writing SQL either unnecessary or substantially simpler. For the widest range of uses and consumption styles, a multidimensional semantic layer offers more power in an easier-to-use package because it combines business-friendly metadata and data in one interface.

Cell-Based, Multi-dimensional

The relational database, with its row-oriented architecture, is well-suited to store transactional data, but relational databases are not well-suited to express business logic, since the business thinks in multiple dimensions or cells instead of rows. It's not an accident that a spreadsheet interface like Excel became the dominant tool for modeling business: it is designed for multidimensional analysis. The spreadsheet's cell-based calculation architecture provides a full range of expressiveness that a row-based engine lacks. As such, the semantic layer must be capable of modeling data in cells, not rows.

Dimensional Expressions

The multidimensional expression language, or MDX, has hundreds of functions and operators to express a wide variety of cell-based calculations for expressing complex business logic. Let's take a simple example of defining a metric for a 30 day moving average of sales. While this seems like a simple metric, it gets complicated when you wish to use the metric in queries with a variety of time periods (e.g. 30 day moving average by week, month, quarter, year, etc.). Using a semantic layer with support for MDX, we can define this metric once using the following expression:

    WITH MEMBER [30 Day Moving Avg] AS
      AVG([Order Date Dimension].[Order Retail445].CurrentMember.Lag(30) :
          [Order Date Dimension].[Order Retail445].CurrentMember,
          [Measures].[Sales])

The above expression packs a lot of power in a compact package. In this example, we are using an AVG expression to create an average of the Sales metric, a LAG function to specify the lookback period (30 days), and a range expression using a “:” operator and the CURRENTMEMBER expression to dynamically calculate a range of values given a query context. We can then use this common expression in a variety of queries without regard to its context. In the following example, we computed a 30 day moving average of sales by quarter for All-Purpose Bike Stands:

    SELECT "Order Quarter", AVG("30 Day Moving Avg")
    FROM "Internet Sales Model"
    WHERE ("Product Name" = 'All-Purpose Bike Stand')
    GROUP BY 1

By simply changing Order Quarter to Order Week in the above query, the MDX expression would render the correct moving average automatically. Furthermore, changing the filter (All-Purpose Bike Stands) also returns the correct answer without further model modification. Without a cell-based computation language, a SQL-based semantic layer would require hand coding each query variation and a separate query definition for each unit of time. Doing so would require loads of custom SQL code that lacks reusability and is tough to maintain, making a SQL-based semantic layer engine a poor choice for expressing even simple business constructs.

Hierarchies

The inventors of OLAP introduced the world to hierarchies in the 80s, and they have been a core feature for data visualization ever since. Hierarchies allow users to drill into data along a defined path, moving from less to more detail intuitively. For example, a product hierarchy allows Excel Pivot Table users to choose their level of product detail and drill down for more granularity with a simple mouse click.

Hierarchies provide more value than just making data visualization more intuitive. Hierarchies also provide easy-to-use alternative representations of the same base data. For example, the date dimension in the model below (as viewed through Tableau) allows users to group order data by Year-Quarter-Month-Day, by Year-Quarter-Week-Day or by reporting period using a 4-4-5 calendar popular with the retail and manufacturing industries. Without the ability to model hierarchies, end users lose critical
functionality when using tools like Tableau, Power BI and Excel, which support hierarchical visualizations.

[Figure: date dimension hierarchies as shown in Tableau: Order Retail445 (Order Reporting Year, Half Year, Quarter, Month, Week, Day), Order Date Month Hierarchy and Order Date Week Hierarchy.]

Key Takeaway: A semantic layer must be backed by a multidimensional, cell-based engine to express complex business logic. Semantic layer solutions that use SQL-based calculation engines cannot express business constructs in a variety of contexts. A semantic layer must also support hierarchies to allow for intuitive drill paths and level relationships. Semantic layer solutions that only support dimensions and metrics do not provide an intuitive data navigation experience for end users.

“Anything by Anything”

Trying to guess what questions end users may ask of data is an exercise in futility. The needs of the business often outpace the ability of data teams to respond. It's imperative, therefore, that a semantic layer be as flexible as possible in answering queries.

Recently, at the Coalesce 2021 Conference, Drew Banin, co-founder of dbt Labs, delivered an excellent presentation called “The Metric System”. Drew did a great job explaining the value of the semantic layer, and he laid out the dbt Labs approach to delivering one. In his presentation, Drew introduced the following example of a metric model:

    # models/marts/product/schema.yml
    version: 2

    models:
      - name: dim_customers
        ...

    metrics:
      - name: new_customers
        label: New Customers
        model: dim_customers
        description: "The number of paid customers who are using the product"

        type: count
        sql: user_id # superfluous here, but shown as an example

        timestamp: signup_date
        time_grains: [day, week, month]

        dimensions:
          - plan
          - country

        filters:
          - field: is_paying
            value: true

        meta: {}

I know this is meant to be a simple, illustrative example, but it (and similar metric layer solutions) demonstrates some serious shortcomings. In the metric model example above, you will notice the following “filters” directive:

    filters:
      - field: is_paying
        value: true

This directive is problematic because this type of filter belongs in the user's dashboard or report query, not in the definition of the model. As a result, this filter drastically reduces the value of this model, since this directive means only “paying” customers can ever be analyzed. The following construct is also problematic:

    dimensions:
      - plan
      - country

This directive limits the model to analysis by plan and country. While this is an illustrative example and more dimensions can be added to the model, ultimately this markup language is too simplistic. A better approach is to use a multidimensional model that allows for “anything by anything” queries by supporting relationships between several entities (e.g. customer by location) with many-to-many, many-to-one and semi-additive relationships for expressing business rules. For example, a multidimensional model based on an excerpt of the TPC-DS benchmark model allows users to report sales and returns by source, store, product, date, warehouse, customer and customer demographics in any combination desired.

The relationships between these business entities are crucial, and SQL alone is too cumbersome for expressing these relationships.

Key Takeaway:
A semantic layer must support a wide range of query patterns that are not constrained to a single view of data or subset of dimensions or metrics. Semantic layer solutions that define slices of data are not flexible enough to support a wide range of use cases and will force end users to wait for model owners to introduce new model views.

A Multidimensional Calculation Engine Is Critical

Besides serving as a metrics hub, a semantic layer powered by a multi-dimensional calculation engine creates a digital twin of your business. In order to express the complexities of business processes, the semantic layer platform must support dimensional expressions, hierarchies and entity relationships defined in a semantic model and backed by a graph-based query planner.

Chapter 5: Performance Optimization

For a semantic layer to function, it's critical to deliver a live, interactive query experience to discourage users from bypassing the semantic layer by moving data into external, ungoverned caching layers.

Autonomous & Adaptive

For a semantic layer to be useful, it must be the source of truth for all queries, which means data needs to be queried “live” regardless of where the data lives. Creating copies of data to improve performance introduces data latency and inconsistency and thereby undermines the core values of a semantic layer. It follows, then, that a semantic layer needs to deliver extract-level performance against the cloud data platforms natively. As such, the semantic layer must include automated performance management to deliver queries at “speed of thought” with a live connection to cloud data platforms. Since data is always growing and evolving and user query behavior is far from predictable, attempts to manually tune queries are futile.

[Figure labels: Human Signals, Machine Learning, Semantic Layer, Data Modeler, Existing Data Context & Relationships, Data Consumers, Query Patterns & Frequency, AI/ML Tools, New Features, Predictions, Stats, Augmented Model, Augmented Model & Aggregates]

As illustrated above, the semantic layer platform must be capable of rewriting queries and creating aggregates on-demand using the data model, end user query patterns, data statistics and machine learning to automatically manage performance for every query.

Key Takeaway: A semantic layer must autonomously tune query performance to support interactive, live connections to data platforms. Semantic layer solutions that do not automatically manage query performance are unsuitable for supporting direct (live) queries.

In Situ: No Data Movement Required

When working with cloud data platforms, most data analysts using BI tools like Tableau and Power BI create data extracts (e.g. Hyper) or import data into their BI tools before creating their dashboards and reports. It's not their desire to add another data management task to their workflow, but end users must create these data copies in order to get the performance and interactivity they need when querying cloud data platforms. Besides the extra work involved, creating data copies outside of the data platform adds costs and introduces data inconsistency and security risks. To avoid these pitfalls, a semantic layer should improve performance without moving data outside of the native cloud data platform and avoid creating and managing a separate query acceleration infrastructure.

As illustrated above, ideally, the semantic layer should create aggregate tables, or materialized views, on-demand using a machine learning model that is informed by user query patterns and data statistics. By rewriting queries to target smaller aggregates instead of raw data, an optimized semantic layer
can also substantially reduce your cloud data platform's operating costs. To see just how much a semantic layer can accelerate queries and reduce costs, check out AtScale's TPC-DS 10TB Benchmark Reports.

[Figure: architecture diagram. Labels: Source Systems; ETL Tools; Data Lake (S3); Data Warehouse; Spark/Hive; Spectrum; Data Tables; Aggregates; Platform; Semantic Layer; interfaces XMLA, JDBC, ODBC and REST; query languages MDX, DAX, SQL and Python.]

Key Takeaway: A semantic layer must deliver query performance at “speed of thought” with a live connection to data platforms, without the need to create tool-specific extracts or imports or to move data to separate caching subsystems.

Powerful Autonomous Performance Management

Besides serving as a metrics hub, a semantic layer powered by autonomous performance management provides a live, interactive connection to your data without data movement. By avoiding external caching layers, data stays where it landed and queries scale with your cloud data platform. By avoiding redundant data scans, an optimized semantic layer can also substantially reduce your data platform's operating costs.

Chapter 6: Analytics Governance

While there are platforms and tools focused on just data governance, it's impossible to truly secure data without integrating governance rules with the semantic layer's logical view of data. While the semantic layer must respect and integrate with the underlying data platform's native, physical security policies, it must extend those policies to the derived business calculations and constructs defined in the semantic layer data model.

Who's Who in the Zoo?

Governance and security must start with the user's identity. Without knowing who is running a query, it's not feasible to apply data access policies to users or groups. No governance controls can work without user identity management. As a result, table stakes for any semantic layer include integration with enterprise directory services like Active Directory (AD), LDAP, Okta, and more. With enterprise directory integration, the semantic layer can identify a user and then run queries “as that user” against the native data platform. By doing so, the semantic layer will respect the underlying data platform's physical security policies, something that is not possible if the semantic layer uses a proxy or service account to run queries instead. The semantic layer must also synchronize its users and groups with the directory service's users and groups to avoid a duplicative, shadow governance infrastructure when applying data access policies. This requires a semantic layer to support deep integration with IT services and data platforms; nothing else will do.

Key Takeaway: A semantic layer must support deep integration with IT identity management services and respect underlying data platform security policies by running queries with the user's account.

Realtime Enforcement

Enterprises are in constant motion, adding new employees, new business groups and new data sets all the time. For a semantic layer to serve as the governance control plane for the enterprise, it must react to these changes in real time. Starting with user and group management, a semantic layer must seamlessly integrate and sync with enterprise identity management services. As new employees join, depart or change groups, the semantic layer's governance policies will instantly reflect those changes. With policies defined in the semantic layer and user identity up to date, the semantic layer platform can then intercept consumer queries and rewrite them to enforce governance policies in real time. Since the semantic layer can connect to any consumer, data platform or user persona, it can deliver comprehensive and consistent coverage with confidence.

Key Takeaway: A semantic layer must enforce data governance in real time for every query in order to provide comprehensive coverage and respond to frequently changing policies.

Who Sees What

With the ability to identify end users by name, the semantic layer can
apply the following governance functions that are critical to securing data access in a semantic layer:

- Row-level security
- Column-level security
- Object-level security

The first critical governance feature for a semantic layer is row-level security, or the ability to apply a filter (or WHERE clause) to each outbound query to select only the rows of data that users should see. For example, row-level security can be used to automatically restrict data for a sales team in the West so they only see their data and not the data from other regions. As long as the salesperson is mapped to the Western Regions group, the semantic layer can automatically generate the proper WHERE clause to restrict data access to just the West region's data for that user.

The second critical governance feature for a semantic layer is column-level security, or the ability to hide sensitive data columns or mask their contents. For example, a column-level security rule can hide or mask personally identifiable information (PII) fields for the marketing team while making them visible for the HR team. By dynamically adjusting the view of the semantic layer based on user access rules, the semantic layer can be defined for everyone, but appear customized to users and groups.

Column                  Rule                                      What HR Sees      What Marketing Sees
Social security number  Visible to HR, masked for everyone else   123-45-6789       XXX-XX-6789
Customer Address        Visible to HR, hidden for everyone else   123 Anywhere St.  Not Visible

The final key governance function is object-level security. This layer of governance allows users and groups to own and share modeling components (e.g., conformed dimensions, hierarchies, calculations, models and much more). This functionality is critical to supporting the concept of a data mesh, a popular topic of discussion in the data and analytics community. The key principles of a data mesh architecture are domain ownership of data objects and a decentralized system of data stewardship. Without modeling object-level security, backed by role-based access control (RBAC), achieving the vision of federating the creation and management of data products just isn't feasible.

Key Takeaway: A semantic layer must apply query governance with dynamic filtering, column-level security and object-level security based on the query user's identity. Semantic layer solutions that lack row, column and modeling object controls are not suitable for use cases where data access restrictions are required.

Data and Analytics Governance Together

Besides serving as a metrics hub, a semantic layer must apply data access controls and governance policies to every query using the query user's identity. By avoiding duplicative governance tools and shadow user management, enterprises can apply both physical and logical data access policies to ensure that data is only visible to those who have authorized access. With the trust that data is secure, organizations can confidently share data more broadly, both internally and externally.

Chapter 7: Data Integration

By hiding the format, location and complexity of data, a semantic layer provides a business-friendly view of data for everyone, not just data engineers and SQL jockeys. Delivering a logical view of data on a variety of data platforms has its challenges, though. In this chapter, we'll drill down on some important considerations when evaluating a semantic layer's ability to integrate disparate data sources.

Speaking Their Language

In addition to having various levels of support for SQL and SQL extensions, different data platforms have different performance characteristics and optimization controls. In order to avoid data movement and all its adverse effects, it's imperative that the semantic layer generate platform-optimized queries and push down those queries to the underlying data platform. Lowest common denominator approaches that generate and execute simple, vanilla SQL and perform aggregations and calculations locally just can't scale and don't allow customers to take advantage of platform-specific features. A scalable semantic layer platform must integrate seamlessly with data platforms and must:

- Generate and execute multi-pass SQL and push down queries to the underlying data platform
- Leverage platform-specific optimizations, including partitioning, clustering and DDL hints
- Avoid a separate compute infrastructure to process query results
- Allow native SQL in calculations to leverage platform-specific and user-defined functions

With tight, platform-specific integrations, a semantic layer will generate optimized queries to deliver consistent performance and lower costs.

Key Takeaway: A semantic layer must work with a variety of data platforms equally well by supporting native platform dialects and optimizations.

Breaking Down Silos

It seems like just about every five years we see a new data platform technology or trend become all the rage. If your
organization has been around long enough, you probably have one of everything and a proliferation of cloud-based applications with your precious data locked behind their proprietary APIs. A semantic layer with data virtualization future-proofs your data platform technology choices by creating an abstraction layer between your data and the tools that interact with it. Besides hiding the complexity of each data platform and preventing vendor lock-in, a semantic layer minimizes or eliminates the cost of migration to new data platforms in the future by using data virtualization as its core mechanism for querying the underlying data.

A semantic layer with query federation goes even further than virtualizing data access to break down data silos. As I discussed in my earlier blog (“The Semantic Layer. Back to the Future Part 3: The Semantic Model”), a semantic layer's data model can blend data from multiple sources to create new, composite views of data for modeling complex business processes.

The illustration below shows how a semantic layer can blend data from multiple sources, including SaaS applications, third party data from exchanges and first party data:

[Figure: the semantic layer (Business Models, Governance Rules, Data Virtualization) sits between data sources and consumers. Labels: CRM SaaS Apps; Financial SaaS Apps; Other SaaS Apps; SaaS Applications; Data Marketplaces; Suppliers, Partners; First Party Data; Second Party Data; Third Party Data; Business Intelligence Apps; AI/ML Platforms; stages Access, Enhance, Model & Blend, Share & Consume.]

Key Takeaway: A semantic layer must support data blending across multiple data platforms and data sources and minimize data movement for federated queries by leveraging localized aggregates and query push down.

More Than Rows and Columns

In today's cloud data platform world, a database table is no longer just made up of simple rows and columns. Modern data platforms now have advanced support for non-scalar data types like JSON and XML. Google BigQuery goes even further by allowing for nested and repeating fields, which effectively allows the embedding of tables within tables. For example, a customer table may take the form of:

- id
- first_name
- last_name
- dob (date of birth)
- addresses (a nested and repeated field)
- addresses.status (current or previous)
- addresses.address
- addresses.city
- addresses.state
- addresses.zip
- addresses.numberOfYears (years at the address)

It's easy to see how powerful these constructs can be for compressing information and reducing table joins, something cloud data platforms are not good at. However, while these new constructs make data loading easier and querying faster, they add complexity for the user when writing queries. Besides having to understand these more complex table constructs, users must contend with each data platform's own proprietary syntax for unnesting these data types. A semantic layer should hide this additional complexity from the end user and abstract away the dialect differences of each platform. In this way, data warehouse architects can take advantage of these powerful new “schema on read” design principles without creating an additional burden on their end users.

Key Takeaway: A semantic layer must support modern data platform features and constructs to support analytics on unstructured and semi-structured data.

Data Integration for Data Democratization

Besides serving as a metrics hub, a semantic layer must hide the complexity of the data stored in various data platforms. A well-designed semantic layer will scale with your data growth (and your data platforms) by pushing down queries and leveraging each platform's dialect and optimization features. By eliminating data copies and data movement, a semantic layer can make data instantly accessible to everyone.

Dave is the founder of AtScale and is the Chief Strategy Officer. Prior to AtScale, he was VP of Engineering at Klout & at Yahoo! where he built the world's largest multi-dimensional cube for BI on Hadoop. Mariani is a Big Data visionary & serial entrepreneur.

Final Thoughts

As you can see, building a semantic layer platform is not simply a matter of defining metrics with a cool new markup language. For a semantic layer to be practical and usable, it needs to:

- Be capable of expressing your most complex business constructs
- Be able to perform better than your underlying data platform
- Be able to connect live to all your data platforms
- Be able to connect to all your data consumption tools
- Be able to govern every query at the user level
- Be able to scale to everyone in your business

If any of these requirements is missing, a semantic layer is unusable. In other words, it's binary: it either works 100% or it doesn't work at all. Therein lies the challenge for anyone building a universal semantic layer from scratch. It's not good enough to deliver an MVP that sort of works and can be enhanced as you go. The MVP is not an MVP; it just needs to work completely on day one or no one will bother using it. The team at AtScale has spent more than a decade working to deliver the vision of a universal semantic layer and making it work for real, demanding customers. A universal semantic layer has become a critical component in the modern data and analytics stack. We could not be more pleased to see our industry partners (and competitors) agree.
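To make the requirement to govern every query at the user level a little more concrete, here is a minimal sketch of the row-level and column-level rules described in Chapter 6. It is an illustration only, not AtScale's implementation: the `Policy` class, the `govern_query` function, the `sales` table and the masking expression are all hypothetical names chosen for this example, and an in-memory SQLite database stands in for a cloud data platform. A real semantic layer would parse and rewrite the inbound query rather than assemble SQL from strings.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical governance policy (illustrative, not a real product API)."""
    row_filters: dict     # group name -> SQL predicate limiting visible rows
    masked_columns: dict  # column -> (groups that see raw values, masking SQL expression)


def govern_query(columns, table, user_groups, policy):
    """Build a SELECT with column masking and row filters for the user's groups."""
    groups = set(user_groups)
    select_items = []
    for col in columns:
        rule = policy.masked_columns.get(col)
        if rule is None or groups & rule[0]:
            select_items.append(col)                    # no rule, or user is in an exempt group
        else:
            select_items.append(f"{rule[1]} AS {col}")  # masked for everyone else
    # Only groups that have a row filter contribute a predicate;
    # in this sketch a user with no filtered group sees all rows.
    predicates = [policy.row_filters[g] for g in user_groups if g in policy.row_filters]
    sql = f"SELECT {', '.join(select_items)} FROM {table}"
    if predicates:
        sql += " WHERE " + " OR ".join(f"({p})" for p in predicates)
    return sql


# Assumed policy: the Western Regions group sees only West rows; only HR sees raw SSNs.
policy = Policy(
    row_filters={"Western Regions": "region = 'West'"},
    masked_columns={"ssn": (frozenset({"HR"}), "'XXX-XX-' || SUBSTR(ssn, 8, 4)")},
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, ssn TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Ann", "123-45-6789", "West", 100.0), ("Bob", "987-65-4321", "East", 250.0)],
)

west_sql = govern_query(["rep", "ssn", "amount"], "sales", ["Western Regions"], policy)
hr_sql = govern_query(["rep", "ssn", "amount"], "sales", ["HR"], policy)
west_rows = conn.execute(west_sql).fetchall()
hr_rows = conn.execute(hr_sql).fetchall()
print(west_sql)
print(west_rows)
```

The same model and the same inbound request yield different SQL depending on who is asking, which is the sense in which the semantic layer can be defined for everyone but appear customized to users and groups. The predicates and masking expressions here come from the policy, never from end-user input; interpolating untrusted text into SQL this way would be unsafe.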