When Working with Big Data...

You've Probably Encountered This:

    ERROR StorageException: Status code 503, error: code: ServerBusy,
    message: Operations per second is over the account limit.

Or This:

    [Chart: Kafka lag]

Or This:

    [Screenshot]

This is the Talk for You!

Taking Your Cloud Vendor To The Next Level: Solving Complex Challenges With Azure Databricks

Introduction

Itai Yaffe (@ItaiYaffe)
- Senior Big Data Architect @ Akamai
- Prev. Sr. Solutions Architect @ Databricks
- Dealing with Big Data challenges since 2012
Tomer Patel (@tomer_patel)
- Engineering Manager @ Akamai
- Prev. Team Lead @ Clarizen

What Will You Learn?
- Understanding the main challenges of a cloud-based massive-scale data infrastructure
- How to iteratively architect such an infrastructure to mitigate those challenges
- Tips for optimizing a massive-scale data infrastructure

About Akamai: Power and Protect Life Online
- Over 20 years ago, we set out to solve the toughest challenge of the early internet

Akamai's 3 Pillars
- Security: Outsmart the most sophisticated threats. Protect your data, workforce, systems, and digital experiences everywhere your business meets the world.
- Cloud Computing: Boost performance, speed innovation. Build, run, and secure applications and workloads everywhere your business connects online.
- CDN: Make digital magic. Flawlessly deliver apps and experiences closer to your customers, wherever they connect.
Akamai in Numbers
- Handles ~30% of the internet's traffic
- Employees: 10,000
- Data processed using Databricks: 50 exabytes

What is WSA?
- Web Security Analytics: a unified and efficient platform that enables Akamai's customers to assess a wide range of streaming security events, and to perform analysis of those events, so they can take informed actions in real time
- In short: a massive-scale data infrastructure

What Does "Massive-Scale Data Infrastructure" Mean?
- Generally speaking, it's about efficiently handling massive amounts of data at scale

Main Challenges of a Cloud-Based Massive-Scale Data Infrastructure
3 Main Challenges
- Processing
- Storing
- Analyzing

WSA Main Challenges
- Processing:
  - Volume: 10-14 Gbps (and increasing)
  - SLA: 5 minutes from our Edge servers to our Data Lake
- Storing:
  - Storage capacity: over 6 PB
  - Retention period: 31 days
- Analyzing:
  - Number of queries: 100s of queries/minute
  - SLA: 10s for 99% of the queries
  - Each query can scan 100s of TBs
  - 60+ dimensions, with an (almost) infinite number of filter combinations

Architecting and Re-Architecting to Mitigate the Main Challenges
CSI High-level Architecture
[Diagram: Receiving Layer (components 1-3), Avro files, and Query Layer (components 4-8)]

Receiving Layer - Raw Data
- A queue with "pointers" to the blob storage, plus metadata. For example:
    Path: /2023-05-16/avro.deflate, size: 7526435, recordsCount: 9686
- The actual Avro files
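As a rough illustration of this "pointers" pattern, a producer might publish a small JSON message per file. This is a minimal sketch assuming the kafka-python package and a hypothetical topic name; the field names are taken from the example above, but the actual message schema isn't shown in the deck:

    import json
    from kafka import KafkaProducer

    # Hypothetical "pointer" message: the queue carries only the blob path
    # plus metadata (size, record count) - never the Avro payload itself.
    pointer = {
        "path": "/2023-05-16/avro.deflate",  # path as shown in the slides
        "size": 7526435,                     # file size in bytes
        "recordsCount": 9686,                # records in the Avro file
    }

    producer = KafkaProducer(bootstrap_servers="broker:9092")
    producer.send("raw-data-pointers", json.dumps(pointer).encode("utf-8"))
    producer.flush()

Consumers then resolve each pointer and fetch the corresponding Avro file from blob storage, which keeps Kafka traffic tiny relative to the 10-14 Gbps of raw data.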
Receiving Layer - Storage Types
3 types of storage in use:
1. Azure Standard Blob Storage
   a. Relatively cheap and write-performant
2. Azure Premium Blob Storage
   a. Used where we need minimal write latency
   b. More expensive than Standard (~10x)
3. Azure Data Lake Storage Gen2
   a. Provides additional capabilities, such as:
      i. Hadoop-compatible access
      ii. A hierarchical directory structure for high-performance data access
CSI High-level Architecture
[Diagram, revisited]

WSA Architecture
[Architecture diagram]

WSA Tables
- Huge tables: over 6 PB in total
- Table format is Delta Lake:
  - 1 of the 3 leading Open Table Formats, alongside Apache Hudi and Apache Iceberg
  - Brings reliability to data lakes (e.g. ACID transactions)
  - Uses versioned Parquet files to store the data
  - Also stores a transaction log, to keep track of all the commits made to the table or blob store directory
  - Has a large ecosystem
- Storage type is Azure Data Lake Storage Gen2
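Since the transaction log records every commit, you can inspect it (and read older snapshots) directly. A minimal PySpark sketch, with a hypothetical table name and path and assuming an existing spark session:

    # Inspect the commits recorded in the Delta transaction log.
    history = spark.sql("DESCRIBE HISTORY my_table")
    history.select("version", "operation", "operationMetrics").show(truncate=False)

    # Versioned Parquet files also enable "time travel" to an older snapshot.
    snapshot = (spark.read.format("delta")
                .option("versionAsOf", 3)
                .load("abfss://container@account.dfs.core.windows.net/tables/my_table"))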
Storage Limits
Facts:
- Cloud storage has capacity limits: ingress, egress, TPS

Problem:
- We started seeing a lot of throttling & "server busy" errors from Azure storage (~300K/day)
- That had a negative impact on ingest, optimize, and query times
Solution #1 - Regional Storage
- A preview (hidden) feature: multi-cluster storage

    Account name                   | Current capacity | Ingress (Gbps) | Egress (Gbps) | TPS
    Input storage (6-7 clusters)   | 242.76 TiB       | 430            | 860           | 50k
    Output storage (9-10 clusters) | 5.93 PiB         | 540            | 1080          | 50k

Solution #2 - Sharding
- So what did we have until now?
  [Diagrams: the single-account ingest pipeline, then the pipeline split (sharded) across multiple storage accounts - sketched below]
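The deck shows sharding only as a diagram. As a minimal sketch of the idea (the account names and shard count are assumptions), each file key is deterministically mapped to one of several storage accounts, so the per-account ingress/egress/TPS limits apply per shard rather than to a single account:

    import hashlib

    # Hypothetical shard map: N independent storage accounts, each staying
    # below its own ingress/egress/TPS limits.
    STORAGE_ACCOUNTS = [f"wsaingest{i:02d}" for i in range(8)]

    def pick_account(key: str) -> str:
        """Deterministically route a file/partition key to one storage account."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return STORAGE_ACCOUNTS[int(digest, 16) % len(STORAGE_ACCOUNTS)]

    account = pick_account("/2023-05-16/avro.deflate")
    path = f"abfss://raw@{account}.dfs.core.windows.net/2023-05-16/"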
Storage APIs
Facts:
- Excessive number of invocations of the GetPathStatus storage API

Problem:
- Negative impact on performance: query and ingest delays
  [Chart: GetPathStatus invocations, on the order of 6M-12M]

Solution and Recommendations:
- Databricks updated the Azure Storage APIs to newer ones in DBR
- Upgrade your DBR version to 11.2+

WSA Architecture
[Architecture diagram, revisited]

Strict Query SLA
Facts:
- WSA needs to execute queries with a strict SLA
- For different use cases (e.g. aggregated vs. raw data)
- On up to the last 31 days of data
Problem:
- Most queries were taking 10s of seconds, or even minutes - even after OPTIMIZE

Solution and Recommendations:
- Combining Regional Storage and Sharding allowed us to support significantly more egress and TPS
- Using Databricks Photon
- Building an in-house load balancer on top of All-Purpose/SQL Warehouse clusters (see the sketch below)
- Sampling
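The in-house load balancer isn't detailed in the talk. A minimal sketch of the idea, assuming a fixed pool of SQL Warehouse HTTP paths and simple round-robin rotation (the real balancer presumably also considers warehouse health and queue depth):

    import itertools

    # Hypothetical pool of Databricks SQL Warehouse endpoints.
    WAREHOUSES = [
        "/sql/1.0/warehouses/aaaa1111",
        "/sql/1.0/warehouses/bbbb2222",
        "/sql/1.0/warehouses/cccc3333",
    ]
    _rotation = itertools.cycle(WAREHOUSES)

    def next_warehouse() -> str:
        """Round-robin selection, spreading 100s of queries/minute across clusters."""
        return next(_rotation)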
Sampling
Main goals:
- Better query response time
- Cost reduction, by reducing the number and size of query clusters
- Reduce issues in storage

Sampling - Creating a Sampled Dataset (1%/5%/10% of the Data)
- Redirection to the fast-query dataset is based on a decision tree (specific APIs, specific filters, etc.) - see the sketch below
- By default, the user will query the fast-query dataset
- Users will still be able to query the full dataset
- Currently based on a statistical model; in the future, based on an ML model
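A sketch of the redirection idea under stated assumptions: the table names, the eligibility rule, and the thresholds below are all hypothetical stand-ins for the actual decision tree:

    # Route eligible queries to a sampled Delta table and rescale additive
    # aggregates (e.g. counts) by the inverse sampling fraction.
    SAMPLED_TABLES = {0.01: "wsa_events_s1", 0.05: "wsa_events_s5", 0.10: "wsa_events_s10"}
    FULL_TABLE = "wsa_events"

    def route(api: str, filters: dict) -> tuple[str, float]:
        if api in {"raw_events"}:                       # stand-in: APIs needing exact results
            return FULL_TABLE, 1.0
        fraction = 0.10 if len(filters) <= 2 else 0.01  # stand-in for the decision tree
        return SAMPLED_TABLES[fraction], 1.0 / fraction

    table, scale = route("top_attack_vectors", {"country": "US"})
    counts = spark.table(table).groupBy("attack_vector").count()
    # Multiplying each count by `scale` approximates the full-dataset numbers.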
Strict Query SLA
Results:
- Significantly improved query response times: from 10s of seconds (or even minutes) to less than 7 seconds for 85% of the queries

Tips for Optimizing a Massive-Scale Data Infrastructure
- Data Retention
- Compression Formats

Data Retention - Deleting Old Records
- Deleting data older than a defined threshold (a.k.a. TTL) is very common
- Delta Lake does not support:
  - ALTER TABLE table_name DROP PARTITION
  - A TRUNCATE TABLE table_name PARTITION clause
- Instead, it provides a DELETE FROM statement, e.g.:
    DELETE FROM table_name WHERE event_time < (now() - INTERVAL 31 DAY)
  where event_time is a TIMESTAMP
Data Retention - Potential Impact of DELETE FROM
- But... this can actually create new files in your Delta table!

    DESCRIBE HISTORY table_name

    operation | operationParameters                                                    | operationMetrics
    DELETE    | predicate: (my_table.event_time < TIMESTAMP '2023-04-03 12:45:34.813') | executionTimeMs: 2354, ..., numAddedFiles: 3, numCopiedRows: 1321, numDeletedRows: 60654, ..., numRemovedFiles: 155, rewriteTimeMs: 1438, scanTimeMs: 916

Data Retention - Why?
- Parquet files are immutable
- Hence, Delta Lake has to:
  - Read the existing Parquet file(s)
  - Filter out the records to be deleted
  - Write new Parquet file(s) with the remaining records
Data Retention - How to Avoid Creating New Files in This Use Case?
- E.g. table_name includes:
  - event_time - TIMESTAMP
  - event_day - DATE (the partition column)
- Rewrite your DELETE FROM statement to match the partition columns:
  - Instead of:
      DELETE FROM table_name WHERE event_time < (now() - INTERVAL 31 DAY)
  - Use:
      DELETE FROM table_name WHERE event_day < to_date(now() - INTERVAL 31 DAY)
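For completeness, a minimal sketch of the table layout this relies on, with the column set reduced to the two columns named above: event_day must be a partition column so the rewritten predicate lets Delta drop whole files instead of rewriting them.

    # Partitioning by event_day (derived from event_time) is what makes the
    # partition-aligned DELETE below a cheap, file-drop-only operation.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS table_name (
            event_time TIMESTAMP,
            event_day  DATE
        ) USING DELTA
        PARTITIONED BY (event_day)
    """)

    spark.sql("DELETE FROM table_name WHERE event_day < to_date(now() - INTERVAL 31 DAY)")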
Data Retention - Rewrite Impact
- Let's check the table history now:

    DESCRIBE HISTORY table_name

    operation | operationParameters                                 | operationMetrics
    DELETE    | predicate: (my_table.event_day < DATE '2023-04-03') | executionTimeMs: 26, ..., numAddedFiles: 0, numCopiedRows: 0, numDeletedRows: 8745, ..., numRemovedFiles: 78, rewriteTimeMs: 0, scanTimeMs: 25

Data Retention - Rewrite Impact, Full Scale
- For our petabyte-scale Delta Lake tables, partitioned by a date column, the DELETE job is executed on a daily basis
- Achieved:
  - Execution time (per job): from 4-5 hours down to 20 minutes
  - Costs (in total): from $500/day (max.) down to $10/day
  - Significantly less IOPS on storage

Tips for Optimizing a Massive-Scale Data Infrastructure
- Data Retention
- Compression Formats

Compression Formats
Facts:
- WSA writes and reads TBs of data to/from Delta Lake tables stored in ADLS
- Snappy is the default compression format for Parquet files written by Spark
- Spark supports other compression formats, e.g. LZ4, zstd, gzip, etc.

Compression Formats - Main Goal
- Reduce the amount of data written to/read from ADLS
- This reduces IOPS and costs

Compression Formats - ZSTD
- A modern compression format developed by Meta
- Has a promising compression ratio
- Supports 22 compression levels; the default is 3
- Databricks Photon has built-in support for optimized execution with zstd

Compression Formats - ZSTD Setup
- Setup is easy:
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
  OR
    spark.sql.parquet.compression.codec zstd
  in the Spark config of the Databricks cluster, OR
    df.write.mode("overwrite").format("delta").option("compression", "zstd").saveAsTable("my_table")
- Controlling the specific zstd level requires adding
    parquet.compression.codec.zstd.level 19
  in the Spark config of the Databricks cluster
- It's very important to set this up on all jobs that manipulate the data
  - Remember: even Delta Lake's DELETE FROM statement can potentially create new files!
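Putting the setup together, a minimal PySpark sketch: the session-level key is the standard Spark one, while the zstd level is a Parquet-Hadoop property (an assumption here) that belongs in the cluster's Spark config rather than being set at runtime:

    # Write Parquet/Delta files with zstd instead of the Snappy default.
    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

    # Cluster Spark config (not settable per-session at runtime), to raise
    # the zstd level from its default of 3:
    #   spark.hadoop.parquet.compression.codec.zstd.level 19

    # df: any DataFrame about to be persisted (assumed defined upstream).
    df.write.mode("overwrite").format("delta").saveAsTable("my_table")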
Compression Formats - ZSTD Benchmark
- Benchmarked Snappy vs 3 levels of zstd in pre-production. Results:

    Comparison aspect (vs Snappy)                  | zstd (level 3, default) | zstd (level 11)  | zstd (level 19)
    Used storage                                   | 50%                     | 50%              | 50%
    Ingest performance (micro-batch mean duration) | 1.1x                    | 1.3x             | 2x
    Query performance                              | Roughly the same        | Roughly the same | N/A

Compression Formats - ZSTD Actual Results
- Snappy vs zstd (default level) in production:

    Comparison aspect (vs Snappy)                  | zstd (level 3)
    Used storage                                   | 35%
    Ingest performance (micro-batch mean duration) | Roughly the same
    Query performance                              | Roughly the same
One Last Re-Architecture (For Now)
- Akamai recently announced its new offering, Akamai Connected Cloud (formerly Linode)
- As part of our ongoing efforts to optimize efficiency, and the "drinking your own champagne" mindset, we're in the process of moving some workloads to Akamai's cloud
- We are applying the lessons learned from our Azure Databricks journey, e.g.:
  - Using the zstd compression format where applicable
  - Sharding our ingest pipelines to avoid throttling
Summary
- Processing:
  - Using Kafka to store only "pointers" to raw data files
  - Splitting the ingest pipeline to overcome storage limitations (a.k.a. Sharding)
  - Avoiding excessive Storage API invocations where possible
- Storing:
  - Choosing the right storage type for each workload
  - Using an Open Table Format (e.g. Delta Lake)
  - Leveraging advanced preview features such as Regional Storage
  - Properly deleting old data
  - Using the appropriate compression format
- Analyzing:
  - Sampling can improve query performance with little impact on result accuracy

Want To Know More?

Women in Big Data
- A worldwide program that aims to inspire, connect, grow, and champion the success of women in all data domains
- 50+ chapters and 20,000+ members worldwide
- Everyone can join (regardless of gender), so find a chapter near you
- Women in Data+AI panel and luncheon (Thursday, 11:30 AM)

More talks tomorrow:
- "From Snowflake to Enterprise-Scale Apache Spark" by Nic Jansma & Amir Skovronik (12:30 PM)
- "...the Power of Interactive Analytics at Scale with Databricks & Delta Lake" by Tomer & myself (1:30 PM)
- "...Analytics: Migrating a Mission Critical Product to the Cloud" by Yaniv Kunda (2:30 PM)

Thank You!
- Your feedback is important to us - feel free to reach out:
  Tomer Patel (@tomer_patel)
  Itai Yaffe (@ItaiYaffe)