Why Delta Lake is the best storage format for pandas
Matthew Powers, CFA
Databricks - Delta Lake

Developer Advocate at Databricks
Worked in finance for 5 years before programming
Ruby/Rails web dev => data engineering
Long time Spark blogger
Now blogging at delta.io/blog
Created multiple popular Spark open source projects (GitHub: MrPowers)
Written 2 Spark books

5 Reasons Delta Lake is Awesome for pandas
1. File skipping allows for faster queries
2. Time travel/versioned data
3. Schema enforcement
4. Better partition management
5. Small file compaction & Z Ordering

Reason 1: File skipping makes queries run faster
Reason 2: Time travel/versioned data
Reason 3: Schema enforcement prevents bad appends
Reason 4: Better partition management (adding & deleting partitions)
Reason 5: Small file compaction & vacuum

The Lakehouse architecture is great for pandas too
Problems with data lakes
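
The slides only name file skipping (Reason 1) without showing how it works, so here is a minimal sketch of the general idea in plain Python. It is a toy model, not Delta Lake's implementation and not the `deltalake` package API: the `FileEntry` class, the `build_files` and `query_id_range` helpers, and the `id`/`value` columns are all hypothetical names made up for illustration. The assumption it demonstrates is that a table format which records per-file min/max statistics can rule out whole files from metadata alone, before any data is read.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for one data file plus the min/max stats kept for it."""
    path: str
    min_id: int
    max_id: int
    rows: list


def build_files(rows, rows_per_file):
    """Sort rows by id, split them into files, and record each file's id range."""
    ordered = sorted(rows, key=lambda r: r["id"])
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        files.append(FileEntry(
            path=f"part-{len(files):05d}.parquet",  # illustrative name only
            min_id=chunk[0]["id"],
            max_id=chunk[-1]["id"],
            rows=chunk,
        ))
    return files


def query_id_range(files, low, high):
    """Return rows with low <= id <= high and the number of files actually read."""
    files_read = 0
    result = []
    for f in files:
        # Skip the file if its recorded id range cannot overlap the query range.
        if f.max_id < low or f.min_id > high:
            continue
        files_read += 1
        result.extend(r for r in f.rows if low <= r["id"] <= high)
    return result, files_read


rows = [{"id": i, "value": i * 10} for i in range(100)]
files = build_files(rows, rows_per_file=10)
matches, files_read = query_id_range(files, low=42, high=47)
print(len(files), files_read, len(matches))
```

In this toy setup the rows are sorted before being split, so each file covers a narrow id range and most files can be skipped. When values are scattered across files, the ranges overlap and fewer files can be ruled out, which is presumably why the deck pairs compaction with Z Ordering in Reason 5.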