Airbnb Data Platform in Practice, 2017 (PDF)

PDF, 64 pages, 5.23 MB

Airbnb
HONGBO ZENG

Agenda
- Data Platform at Airbnb
- Cluster Evolution
- Incremental Data Replication - ReAir
- Unified Streaming and Batch Processing - AirStream


Data Platform at Airbnb

Scale of Data Infrastructure at Airbnb
- 13B: #Events Collected
- 35PB: Warehouse Size
- 1400+: Machines (Hadoop + Presto + Spark)
- 5x YoY Data Growth

Data Platform
[Diagram components: Event Logs, MySQL Dumps, Kafka, Sqoop, Gold Cluster, Silver Cluster, HDFS, Hive, Yarn, Spark Cluster (Spark), ReAir, Airflow Scheduling, S3, Presto Cluster, AirPal, SuperSet, Tableau, AirStream]


Cluster Evolution

Original Cluster

Setup
- Single HDFS, MR and Hive installation
- c3.8xlarge (32 cores / 60G mem / 640GB disk) + 3TB of EBS volume
- 800 nodes
- Tested DN on different AZs
- All data managed by Hive

Challenges
- Limited isolation between production/adhoc
- Adhoc
  - Difficult to meet SLAs
  - Harder for capacity plan
- Disaster recovery
- Difficult roll outs

Two Clusters
- Two independent HDFS, MR, Hive metastores
- d2.8xlarge w/ 48TB local
- 250 instances in final setup
- Replication of common/critical data
  - Silver is superset of Gold
- For disaster recovery, separate AZs
[Diagram: Gold Cluster and Silver Cluster, each with HDFS, Hive and Yarn, linked by Replication]

Multi-Cluster Trade-Offs

Advantages
- Failure isolation with user jobs
- Easy capacity planning
- Guarantee SLAs
- Able to test new versions
- Disaster Recovery

Disadvantages
- Data synchronization
- User confusion
- Operational overhead


Incremental Data Replication - ReAir

Warehouse Replication Approaches

Batch
- Scan HDFS, metastore
- Copy relevant entries
- Simple, no state
- High latency

Incremental
- Record changes in source
- Copy/re-run operations on destination
- More complex, more state
- Low latency (seconds)

Incremental Replication
- Record Changes on Source
- Convert Changes to Replication Primitives
- Run Primitives on the Destination

Record Changes on Source
- Hive provides hooks API to fire at specific points
  - Pre-execute
  - Post-execute
  - Failure
- Use post-execute to log objects that are created into an audit log
- In critical path for queries

Example Audit Log Entry
[Figure: example audit log entry]

Convert Changes to Primitive Operations
- 3 types of objects: DB, table, partition
- 3 types of operations: Copy, rename, drop
- 9 different primitive operations
- Idempotent

Primitive Example
- CREATE TABLE srcpart (key STRING) PARTITIONED BY (ds STRING)
  -> Copy Table
- INSERT OVERWRITE TABLE srcpart PARTITION (ds=1) SELECT key FROM src
  -> Copy Partition
- ALTER TABLE srcpart SET FILEFORMAT TEXTFILE
  -> Copy Table
- ALTER TABLE srcpart RENAME to srcpart_old
  -> Rename table

Copy Table Flow
[Flowchart nodes, in order: source exists? -> dest exists and the same? -> copy to temp location -> verify the copy -> tmp -> dest -> add metadata -> done. The two decision nodes branch on Y/N.]


Unified Streaming and Batch Processing - AirStream

Batch Infrastructure
[Diagram: same components as the Data Platform diagram above]

Streaming at Airbnb - AirStream
- Source -> Process -> Sink
[Diagram labels: Cluster, Spark Streaming, Airflow Scheduling, HBase, HDFS. Sources: Kafka, S3, HDFS. Sinks: Datadog, Kafka, DynamoDB, ElasticSearch]

Lambda Architecture
[Diagram labels: Batch, AirStream, Hive, Spark SQL, Streaming, Kafka, Spark Streaming, State Storage]

Sources

Streaming
  source:
    name: source_example,
    type: kafka,
    config:
      topic: example_topic,

Batch
  source:
    name: source_example,
    type: hive,
    sql: select * from db.table where ds=2017-06-05;

Computation

Streaming/Batch
  process:
    name=process_example,
    type=sql,
    sql=SELECT listing_id, checkin_date, context.source as source
        FROM source_example
        WHERE user_id IS NOT NULL

Sinks

Streaming
  sink:
    name=sink_example
    input=process_example
    type=hbase_update
    hbase_table_name=test_table
    bulk_upload=false

Batch
  sink:
    name=sink_example
    input=process_example
    type=hbase_update
    hbase_table_name=test_table
    bulk_upload=true

Computation Flow
[Diagram: the Streaming flow and the Batch flow show the same nodes: Source, Process_A, Process_B, Process_A1, Sink_A2, Sink_B2]

Unified API through AirStream
- Declarative job configuration
- Streaming source vs static source
- Computation operator or sink can be shared by streaming and batch job.
- Computation flow is shared by streaming and batch
- Single driver executes in both streaming and batch mode job

Shared State Storage

AirStream Shared Global State Store
[Diagram: HBase Tables shared by several Spark Streaming jobs and several Spark Batch jobs]

Why HBase
- Well integrated with Hadoop eco system
- Efficient API for streaming writes and bulk uploads
- Rich API for sequential scan and point-lookups
- Merged view based on version

Unified Write API
[Diagram labels: DataFrame, Re-partition, Puts, HFile BulkLoad, HBase (Region 1, Region 2, Region N)]

Rich Read API
[Diagram labels: HBase Tables, Spark Streaming/Batch Jobs]
- Multi-Gets
- Prefix Scan
- Time Range Scan

Merged Views

  Row Key   Value   Timestamp   Written by
  R1        V200    TS200       Streaming Writes
  R1        V150    TS150       Streaming Writes
  R1        V01     TS01        Streaming Writes

Merged Views (with batch upload)

  Row Key   Value   Timestamp   Written by
  R1        V200    TS200       Streaming Writes
  R1        V150    TS150       Streaming Writes
  R1        V100    TS100       Batch Bulk Upload
  R1        V01     TS01        Streaming Writes

Our Foundations
- Unify streaming and batch process
- Shared global state store


MySQL DB Snapshot Using Binlog Replay

Database Snapshot
- Large amount of data: Multiple large mysql DBs
- Realtime-ness: minutes delay/hours delay
- Transaction: Need to keep transaction across different tables
- Schema change: Table schema evolves

Move Elephant
[Diagram labels: Binlog Replay on Spark, 20+hr, 4+hr, AirStream Job, 5 mins, 15, 1 hr, spinal tap, seed]

Binlog Replay on Spark
- Streaming and Batch shares Logic: Binlog file reader, DDL processor, transaction processor, DML processor.
- Merged by binlog position
- Idempotent: Log can be replayed multiple times.
- Schema changes: Full schema change history.
[Diagram labels: Mysql Instance, Binlog (realtime/history), Log Parser, Transaction Processor, Change Processor, Schema Processor, DML, DDL, XVID, HBASE, Lambda Architecture]


Streaming Ingestion & Realtime Interactive Query

Realtime Ingestion and Interactive Query
[Diagram labels: HBase, AirStream, Spark Streaming, Kafka, Query Engine, DataPortal, Spark SQL, Hive SQL, Presto SQL]

Interactive Query in SqlLab

Thanks


Realtime OLAP with Druid

Realtime Ingestion for Druid
[Diagram labels: Druid, AirStream, Spark Streaming, Kafka, Dimension, Metrics, Druid Beam]

Superset Powered by Druid


Realtime Indexing

Realtime Indexing
[Diagram labels: Hive, Elastic Search, es_version=mutation id, AirStream, Spark Streaming, Spark Batch, Table A, Table B, Table C, Event, Kafka]


Backup Slides

Tips

Moving Window Computation

Long Window Computation
- What if window is weeks, months, or even years?

Distinct in a Large Window
- I don't want approximation. What should I do?

Distinct Count

  Row Key                 Timestamp
  Listing 1 Visitor 01    TS100
  Listing 1 Visitor 02    TS100
  Listing 1 Visitor 04    TS98
  Listing 1 Visitor 03    TS99

- Prefix Scan with TimeRange

Moving Average

  Row Key     Total Review Cnt   Timestamp
  Listing 1   100                TS100
  Listing 1   98                 TS99
  Listing 1   50                 TS50
  Listing 1   01                 TS01

- Count Difference / Time Elapsed, computed per window (Window 1, Window 2)

Schema Enforcement
- Streaming Events

Thrift -> DataFrame
[Diagram labels: ThriftEvent, Class, Thrift Object, FieldMetaData, Struct Type, FieldValue, Row, DataFrame]

Summary
- Unify Batch and Streaming Computation
- Global State Store Using HBase

Run Primitives on Destination

Serial execution
- Easy to reason about operations
- Very slow

Parallel execution
- Fast and scalable
- Ordering is important: e.g. create table before copying a partition
- DAG of primitive operations
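
The "DAG of primitive operations" point on the Run Primitives on Destination slide can be illustrated with a minimal Python sketch. It is not ReAir's implementation: the `Primitive` class, its fields, and the dependency rule (primitives touching the same table keep their audit-log order, while primitives on different tables are independent) are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass(frozen=True)
class Primitive:
    # Hypothetical record; the slides do not show ReAir's actual data model.
    seq: int     # position in the audit log
    op: str      # "copy", "rename" or "drop"
    kind: str    # "db", "table" or "partition"
    table: str   # table the operation touches


def build_dag(primitives):
    """Assumed rule: each primitive depends on the previous one for the same table."""
    graph = {}
    last_for_table = {}
    for p in sorted(primitives, key=lambda p: p.seq):
        prev = last_for_table.get(p.table)
        graph[p] = {prev} if prev is not None else set()
        last_for_table[p.table] = p
    return graph


def run_parallel(primitives, apply, workers=4):
    """Run primitives wave by wave; a wave contains only mutually independent ones."""
    sorter = TopologicalSorter(build_dag(primitives))
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready(), key=lambda p: p.seq)
            list(pool.map(apply, ready))  # block until the whole wave has finished
            sorter.done(*ready)
            waves.append([p.seq for p in ready])
    return waves


log = [
    Primitive(1, "copy", "table", "srcpart"),      # create the table first
    Primitive(2, "copy", "partition", "srcpart"),  # then copy its partition
    Primitive(3, "copy", "table", "other"),        # unrelated table
]
applied = []
waves = run_parallel(log, applied.append)
print(waves)  # [[1, 3], [2]]
```

The table copy for `srcpart` and the unrelated copy for `other` run in the same wave, while the partition copy waits for its table. Waiting for a full wave is a simplification; a scheduler could start each primitive as soon as its own dependencies finish.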
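
The Merged Views slides (row R1 with streaming writes at TS01, TS150 and TS200, plus a batch bulk upload at TS100) can be mimicked with a small in-memory sketch. `VersionedStore` is a hypothetical stand-in written for this example and does not use the HBase client API. It only shows the assumed idea that every write carries a timestamp and a read returns the newest version, whichever path wrote it.

```python
from collections import defaultdict


class VersionedStore:
    """In-memory stand-in for a versioned key-value table (not the HBase API)."""

    def __init__(self):
        self._cells = defaultdict(dict)  # row key -> {timestamp: value}

    def put(self, row_key, value, ts):
        # Writing the same (row_key, ts) again overwrites, so replays are harmless.
        self._cells[row_key][ts] = value

    def bulk_load(self, rows):
        # Batch path: same cells, just written in bulk.
        for row_key, value, ts in rows:
            self.put(row_key, value, ts)

    def get(self, row_key, max_ts=None):
        """Return the value with the highest timestamp <= max_ts, or None."""
        versions = self._cells.get(row_key, {})
        eligible = [ts for ts in versions if max_ts is None or ts <= max_ts]
        return versions[max(eligible)] if eligible else None


store = VersionedStore()
for value, ts in [("V01", 1), ("V150", 150), ("V200", 200)]:
    store.put("R1", value, ts)           # streaming writes
store.bulk_load([("R1", "V100", 100)])   # batch upload arriving later

print(store.get("R1"))               # V200
print(store.get("R1", max_ts=120))   # V100
```

Because the batch upload carries an older timestamp, it fills in history without overriding the newer streaming values, which is the behaviour the two Merged Views tables depict.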
