The Hitchhiker's Guide to Delta Lake Streaming.pdf


The Hitchhiker's Guide to Delta Lake Streaming
2023

Tristen Wentling, Sr. Solutions Architect, Delta OSS Contributor
Scott Haines, Distinguished Software Engineer, Spark/Delta OSS Contributor
https://bit.ly/dais2023_hgdls

Introductions: Scott Haines
- Apache Spark journey began in 2016
- Delta Lake journey began in 2019
- Nominated to the Databricks Beacons
- Published first big book on Apache Spark in 2022
- Working on first big book on Delta Lake
- Loves learning, teaching, and mentorship
- Spends his days working at one of the world's largest apparel and shoe companies
- Enjoys growing, making, and consuming hot sauce
- Loves his kick-ass wife and two dogs

Introductions: Tristen Wentling
- Apache Spark journey began in 2017
- Published blogs on streaming with Spark
- Working on first big book on Delta Lake
- Reformed data scientist
- Currently works at helping customers create solutions in the retail industry
- Spends too much time online gaming
- Somewhat obsessed with palms and other tropical trees

Preparing for our Journey
What we will probably get through today:
- Part 1: A Gentle Introduction to Stream Processing with Delta Lake
- Part 2: The Hitchhiker's Guide to Delta Lake Streaming
- Part 3: And One More Thing
https://bit.ly/hitchhikers-guide-to-dls

https://bit.ly/dldgv2
"Slide Deck Budget Cuts Begin Now!"

Part 1: Delta Lake Streaming 101

First Steps: Delta Lake
What is the Difference between Batch and Stream Processing?
Batch processing can be considered as taking the incremental workloads and handling them in larger groups. The boundaries are mostly semantic, and the methods differ primarily in terms of latency.
- Batch: periodic processes. Scheduled (every day/week/month/...) and expecting some finite set of data (common for ETL where files arrive on a scheduled interval).
- "Hybrid": incrementalized batch processes. Periodic processes that take advantage of state management or checkpointing.
- Streaming: fully incremental processes (often always on). Expects an unbounded input source and uses trigger intervals and checkpointing.

    # Streaming read of a Delta table
    streamingDeltaDf = (
        spark.readStream
        .format("delta")
        .option("ignoreDeletes", "true")
        .option("startingVersion", "1")
        .option("maxFilesPerTrigger", 4)
        .load("/files/delta/source")
    )

    # Batch read of the same table
    batchDeltaDf = (
        spark.read
        .format("delta")
        .load("/files/delta/source")
    )

(The difference comes down to the checkpoints, throttling, etc.)

First Steps: Delta Lake
Delta Lake: Open Table Format for an Interoperable World
You can also feed Delta Lake from other streaming systems besides Spark!
Delta.io Integrations
[Integration logos. Images are trademarks of the Apache Software Foundation and the Linux Foundation.]

First Steps: Delta Lake
Adopting the Streaming Mindset
- Streaming is just a different method of getting to the same goal.
- If your application will always be running, then you'll make different choices in how you architect your Delta Lake apps.
- Preparing for zero-to-low downtime requires breaking things.
- Stress test applications.

Operating Streaming Applications in Production Can Feel Like This
Don't Panic

Guide: Moving Fast and Moving Slow
Understanding Scale and Bounds of Unbounded Data
- Streaming tables are "unbounded". They can grow nearly infinitely.
- So the only bounds end up being CPU, RAM, disk, and network IO.
- Respect the bounds, unless...
- The dreaded OOM attacks (or destroys your home to create a bypass).
WHEN OOMs ATTACK

Understanding the "current" table bounds establishes a baseline to make approximate decisions now, and to help predict the future.
But how can we predict the future? No one can accurately predict the future. However, the past can help light the way, even if only for a short while.

Streaming is Essentially a Goldilocks Problem
- Too small (OOM)
- Too big! ($ to dumpster)
- Just right? (maybe future OOM)
[Image via snooper books]

Probing the Delta Metadata and Finding "Approximate" Bounds

Delta Table Metadata 101: What a Table Can Teach Us about Itself
Calculate Table Freshness
To answer the universal question of "Hey, how fresh is it?", which matters if the table "should" be growing.
Also: if a table is "done" or in a terminal fixed state, then it may not be stale, just complete. Freshness checks only matter if "we expect" the table to be growing.
Good. Still fresh.
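The freshness check described above can be sketched in plain Python. This is a minimal illustration, not the speakers' code: it assumes the table's last-modified timestamp has already been fetched from the table metadata (for example, the lastModified column returned by DESCRIBE DETAIL). The helper names table_freshness and is_stale, the sample timestamps, and the one-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def table_freshness(last_modified: datetime, now: datetime) -> timedelta:
    """Return how long ago the table last changed."""
    return now - last_modified


def is_stale(last_modified, now, max_age, expect_growth=True):
    """A table is stale only if we expect it to grow and it has not
    changed within max_age. A finished table is complete, not stale."""
    if not expect_growth:
        return False
    return table_freshness(last_modified, now) > max_age


# Illustrative values only; in practice last_modified would come from
# the table's metadata rather than being hard-coded.
now = datetime(2023, 6, 28, 12, 0, tzinfo=timezone.utc)
last_modified = datetime(2023, 6, 28, 11, 45, tzinfo=timezone.utc)

age = table_freshness(last_modified, now)
print(f"Last write was {age.total_seconds() / 60:.0f} minutes ago")
if is_stale(last_modified, now, max_age=timedelta(hours=1)):
    print("Stale.")
else:
    print("Good. Still fresh.")
```

The expect_growth flag carries the point made on the slide: a table in a terminal state should never trip the check, however old its last write is.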

How Fast is the Table Growing?
Size does matter. For example, if we have two tables:
1. tableA is 100 GB and has a createdAt date of one year ago
2. tableB is also 100 GB and was created yesterday
Which table is the more probable scalability monster?
Using micro hacks similar to the freshness technique, we can calculate the number of days a table has existed, and calculate the average bytes per day using sizeInBytes, to ensure we avoid attacks by the nefarious OOMs.

It is fairly simple to calculate the bounds, and I mean this in terms of daily approximates (bad math implied).
Step 1: Calculate the totals (even convert from bytes to something more legible; the world is your oyster).
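The tableA versus tableB comparison can be worked through with a few lines of Python. This is a rough sketch under stated assumptions: createdAt and sizeInBytes are taken to be values already read from the table metadata, the dates are invented to match "one year ago" and "yesterday", and the helper names days_existed and avg_bytes_per_day are hypothetical.

```python
from datetime import datetime, timezone

GB = 10 ** 9  # decimal gigabyte, used only to make the output legible


def days_existed(created_at: datetime, now: datetime) -> float:
    """Days since the table was created, floored at one day so that a
    brand-new table does not cause a division by (almost) zero."""
    return max((now - created_at).total_seconds() / 86_400, 1.0)


def avg_bytes_per_day(size_in_bytes, created_at, now):
    """Average growth per day over the table's whole lifetime."""
    return size_in_bytes / days_existed(created_at, now)


# Made-up metadata for the two tables in the example.
now = datetime(2023, 6, 28, tzinfo=timezone.utc)
table_a = {"sizeInBytes": 100 * GB, "createdAt": datetime(2022, 6, 28, tzinfo=timezone.utc)}
table_b = {"sizeInBytes": 100 * GB, "createdAt": datetime(2023, 6, 27, tzinfo=timezone.utc)}

for name, table in (("tableA", table_a), ("tableB", table_b)):
    rate = avg_bytes_per_day(table["sizeInBytes"], table["createdAt"], now)
    print(f"{name}: {rate / GB:.2f} GB/day")
```

With these sample dates, tableB grows about 365 times faster per day than tableA, which makes it the likelier scalability monster even though both tables are the same size today.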

Step 2: Calculate the approximate overhead, in terms of rows per day, average row size, and rows per file.
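Step 2 is simple division once the totals are known. The sketch below assumes a non-empty table and that the byte size, row count, and file count have already been collected (for example, sizeInBytes and numFiles from the table metadata plus a row count). The function name approximate_overhead and every number in the example are made up for illustration.

```python
def approximate_overhead(size_in_bytes, num_rows, num_files, days_existed):
    """Rough per-day and per-file estimates ("bad math implied").
    Assumes num_rows, num_files and days_existed are all greater than zero."""
    return {
        "rows_per_day": num_rows / days_existed,
        "avg_row_bytes": size_in_bytes / num_rows,
        "rows_per_file": num_rows / num_files,
    }


# Made-up table statistics for illustration.
stats = approximate_overhead(
    size_in_bytes=100 * 10 ** 9,  # 100 GB
    num_rows=500_000_000,
    num_files=1_000,
    days_existed=365,
)
for key, value in stats.items():
    print(f"{key}: {value:,.1f}")
```

These are lifetime averages, so they smooth over bursts. They serve as a baseline for choosing rate limits, not as a guarantee.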

Summary: Look Before You Leap
Remember, before beginning a new adventure, to plan ahead. What works easily for some use cases won't for others.
- What does your application do? Does it read data and dump it into a table? Or is it more complicated?
- Are there specific SLAs or other expectations dictating speed?
- Can you optimize for cost?

Moving Fast is Cool. Remember, OOMs are not. $ is still $.
This leads us to the next logical step: "ramping up", learning to rate limit based on volume and frequency.

When Making Tradeoffs Between Speed and Cost
Rate Limiting
You can limit the volume of data by using maxFilesPerTrigger or maxBytesPerTrigger. Start small. You can always ramp up (increase) the volume of the data you process in stream.
You can limit the microbatch frequency by adding a trigger. Then build up using triggers.
- trigger(once=True): runs once and ignores the config for rate limiting the volume of data (maxFilesPerTrigger).
- trigger(processingTime="42 seconds"): takes an arbitrary interval (like 42 seconds) and kicks off a microbatch every 42 seconds. If a microbatch takes longer than the interval, the next one starts as soon as it finishes.
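The two rate-limiting knobs described above, volume and frequency, can be combined as in the sketch below. It is an illustration under assumptions, not the speakers' code: spark is an existing SparkSession passed in by the caller, the paths are placeholders, and the helper names rate_limit_options and start_throttled_stream are hypothetical. Only the option names maxFilesPerTrigger and maxBytesPerTrigger and the processingTime trigger come from the slides.

```python
def rate_limit_options(max_files=None, max_bytes=None):
    """Build Delta source options that throttle each microbatch.
    Only the limits that are actually set are included."""
    options = {}
    if max_files is not None:
        options["maxFilesPerTrigger"] = str(max_files)
    if max_bytes is not None:
        options["maxBytesPerTrigger"] = str(max_bytes)
    return options


def start_throttled_stream(spark, source_path, sink_path, checkpoint_path,
                           options, interval="42 seconds"):
    """Read a Delta table as a stream with the given rate limits and write
    it back out on a fixed processing-time trigger."""
    reader = spark.readStream.format("delta")
    for key, value in options.items():
        reader = reader.option(key, value)
    return (
        reader.load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)
        .start(sink_path)
    )


# Start small; raise the limits once the job is stable.
small = rate_limit_options(max_files=4)
bigger = rate_limit_options(max_files=100, max_bytes=512 * 1024 * 1024)
print(small)
print(bigger)
```

Keeping the limits in a plain dictionary makes the "start small, then ramp up" step a configuration change: the stream is restarted with the larger options and the same checkpoint location.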

As of Delta Lake 2.1.0:
- trigger(availableNow=True): runs once but honors rate limiting configs, as it runs in microbatches.

Whoa. Fast Processing Rates, Huh.
Viewing the microbatch statistics lets you appreciate what is possible "locally"; just use the open-source guide.
Use the StreamingQuery object to observe the application behavior:
- streamingQuery.status: to see if the app is running
- streamingQuery.stop(): to stop running the query
- streamingQuery.lastProgress: to see what's going on
https://bit.ly/hitchhikers-guide-to-dls
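To make lastProgress easier to read at a glance, the headline numbers can be pulled into a small summary. This is a minimal sketch: it assumes lastProgress is available as a dictionary (as it has been in PySpark) and is None before the first microbatch completes. The helper name summarize_progress is hypothetical, and the sample dictionary is hand-written with made-up numbers rather than captured from a real run.

```python
def summarize_progress(progress):
    """Pull the headline numbers out of a StreamingQuery.lastProgress dict.
    Returns None when no microbatch has completed yet."""
    if not progress:
        return None
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "rows_per_second": progress.get("processedRowsPerSecond", 0.0),
        "batch_millis": progress.get("durationMs", {}).get("triggerExecution"),
    }


# Hand-written sample shaped like lastProgress; the numbers are made up.
sample = {
    "batchId": 7,
    "numInputRows": 12000,
    "processedRowsPerSecond": 4000.0,
    "durationMs": {"triggerExecution": 3000},
}
print(summarize_progress(sample))
```

Against a running query this would be called as summarize_progress(streamingQuery.lastProgress). Watching rows_per_second and batch_millis while raising the rate limits shows when the job stops keeping up.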

Guide: Schema Evolution at Lightspeed
Moving Fast is Awesome. Crashing at Lightspeed is not.
Flexibility is great. Sneaking is not.
What we sometimes do in batch or as a quick hack can sneak into our data and pollute it.
Don't be sneaky. Be intentional.
Also, turn off overwriteSchema and mergeSchema to keep yourself honest.

Sneaky Local Process / Distributed Schema Enforcement
Sneaky Local Process: NO!

- Schema for your state: can be corrupted (we can rebuild with time).
- Schema for a table: can be corrupted (and repaired in most cases*). Using mergeSchema, overwriteSchema: with caution.
- Schema for an upstream table: can be corrupted (in your eyes). Was the change necessary? Are we mad now?

Guide: Communication is Key
Quickly: table properties are a great place for storing things for 3am brain.

One More Thing
First Steps: Delta Lake 101
- Delta Lake tables contain metadata as well as physical Parquet files.
- Parquet files love being compressed. Up to 66% better compression than Snappy.

...and one more thing
So Long and Thanks for All the Fish
This is the end of the road.
- Read the Early Release
- Contribute to the Hitchhiker's Guide to Delta Lake Streaming
https://bit.ly/hitchhikers-guide-to-dls
https://bit.ly/dldgv2

Guide: Didn't Make the Cut

Guide: Delta Lake Table Properties
Exploiting Table Metadata for Good and Profit. See the Delta Lake Definitive Guide (chapter 6) for more details.

Guide: Optimization Patterns
For Squeaky Clean and Fast Streaming Tables
- Introduction to streaming patterns (can be one) showcasing upsert conditionally
- How upserts can affect downstream streaming readers
- How upserts (and other DML) can affect Z-ORDER and bin-packing OPTIMIZE
See chapter 6 and chapter 9 for more details.

Guide: When Cleaning up Actually Creates More of a Mess
Using Vacuum, Drop, and Delete Commands Effectively
- Expectations for Delete and reality of (ooops I did it again), but ahh, we can "undo" and restore to an earlier point in time
- How DROP should be removed from training material; use with caution with OSS Delta Lake
- Tying up loose ends: using table properties to set "gdpr" and other policies for automating delete functionality
