Deconstructing Stream Storage
Flavio Junqueira, Senior Director, Senior Distinguished Engineer
滕昱 (Teng Yu), Director, Software Engineering
Dell Technologies

Storage Abstractions
- Block abstraction: first disk storage units in the 1950s (650 RAMAC)
- DEC's VAXcluster: pool of block-level storage in 1983
- Storage Area Networks (SANs) are a more recent development (late 1990s)
[Timeline: Block; File, in the 1960s (source: CTSS Programmer's Guide); Object, late 1990s ("A Cost-Effective High-Bandwidth Storage Architecture", Gibson et al., ASPLOS 1998)]

Stream as a storage abstraction
"How hard is it to append to a file?" (a skeptical colleague)

Appending to a file
- I/O write size matters for throughput
- Batching is consequently desirable, but needs to be balanced against latency
- Preallocation is desirable to avoid a performance penalty due to block allocation
- Durability is critical
[Diagram: Append, Read]

Replicating for dependability
- Starts getting into distributed systems and replication problems
- Lack of a correct protocol can lead to unsafe systems
[Diagram: Append, Read]

Shared-nothing is limiting
- Storage capacity of a single stream is limited by the capacity of individual servers
[Diagram: Append, Read, Server 1, Server 2, Server 3]

Scale-out storage
- Removes an obstacle to unbounded streams
[Diagram: Append, Read, Scale-out Storage (file or object)]

Plurality of data sets
- Co-exist with other non-stream data sets, e.g., Apache Parquet files, Apache Iceberg tables
- Enable use cases that join streaming data and historical/static data
- Primary storage for all data
- Streaming data lands in the data lake without data movement
[Diagram: Query engines, Structured and unstructured data at rest, Stream, Scale-out Storage (file or object)]

Scale-out storage
- Small writes per stream lead to poor performance
[Diagram: Append, Read, Scale-out Storage (file or object)]

Scale-out storage + Log
- Durability while guaranteeing high throughput and low latency
[Diagram: Append, Read, Logs, Scale-out Storage (file or object)]
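
The trade-offs listed under "Appending to a file" (write size, batching versus latency, durability) can be illustrated with a small sketch. This is not Pravega code; the class name `BatchingAppender` and the parameters `max_batch_bytes` and `max_delay_s` are hypothetical, chosen only for illustration. It buffers small appends, writes them as one larger write once a size or age threshold is reached, and calls `fsync` so that a flushed batch is durable.

```python
import os
import time


class BatchingAppender:
    """Hypothetical sketch: buffer small appends, write them in batches."""

    def __init__(self, path, max_batch_bytes=4096, max_delay_s=0.010):
        # O_APPEND makes every write land at the current end of the file.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._max_batch_bytes = max_batch_bytes
        self._max_delay_s = max_delay_s
        self._buffer = bytearray()
        self._oldest = None  # arrival time of the oldest buffered event
        self.flushes = 0     # number of batches written so far

    def append(self, event: bytes) -> None:
        if not self._buffer:
            self._oldest = time.monotonic()
        self._buffer += event
        too_big = len(self._buffer) >= self._max_batch_bytes
        # The age check only runs when a new event arrives; a real system
        # would also use a timer so an idle buffer is still flushed.
        too_old = time.monotonic() - self._oldest >= self._max_delay_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        data = bytes(self._buffer)
        written = 0
        while written < len(data):  # os.write may write fewer bytes than asked
            written += os.write(self._fd, data[written:])
        os.fsync(self._fd)  # durability: force the batch to stable storage
        self._buffer.clear()
        self._oldest = None
        self.flushes += 1

    def close(self) -> None:
        self.flush()
        os.close(self._fd)
```

A larger `max_batch_bytes` means fewer, bigger writes and fewer `fsync` calls, which favors throughput; a smaller `max_delay_s` bounds how long an event waits in the buffer, which favors latency.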
Parallelism
- Durability while guaranteeing high throughput and low latency
- Single stream with degree of parallelism three, or three independent streams
[Diagram: Append, Read, Append-only sequence, Logs, Scale-out Storage (file or object)]

Parallelism & Traffic changes
- Dynamically change the degree of parallelism
[Diagram: Append, Read, sequences numbered 1 to 3, Logs, Pool of storage servers, Scale-out Storage (file or object)]

Caching for recent traffic
- Tail reads served from cache
- Writes to storage served from cache
[Diagram: Append, Read, Cache, Logs, Scale-out Storage (file or object)]

Consistent writes and reads
- API calls and properties to enable safety
- No duplicates or misses
- Atomicity for multiple events
- Checkpoints: consistent positions to roll back to
[Diagram: Append, Read, Cache, Logs, Pool of storage servers, Scale-out Storage (file or object)]

[Summary diagram: Append, Read, Cache, Logs, Scale-out Storage (file or object)]
- No duplicates or misses
- Atomicity for multiple events
- Checkpoints: consistent positions to roll back to
- Durable streaming data
- Tiered storage: object or file
- Transactions: enable end-to-end exactly-once semantics
- Auto-scale streams
- Performance

Architecture comparison: storage dependencies
Kafka (clients publish to and consume from brokers)
- Broker storage: main storage; serves consume requests; not flushed by default
- Offloading to tiered storage is optional (not yet available in Apache Kafka)
Pulsar (clients publish to and consume from brokers; brokers append to and read from the log)
- Managed Ledger (Apache BookKeeper): main storage; serves consume requests; flushed by default
- Offloading to tiered storage is optional
Pravega (clients write to and read from the Segment Store; the Segment Store appends to the log and reads it to recover)
- Durable Log (Apache BookKeeper): temporary storage; read only upon recovery; flushed by default
- Storage: main storage

Architecture comparison: batching
[Diagram, Pravega: clients form a first-level batch; the Segment Store forms a second-level batch; bookies]
[Diagram, Pulsar: clients batch; bookies]
[Diagram, Kafka: clients batch; leader broker; follower brokers]

Batching example
- 5,000 sensors
- 10 samples/s
- 100 bytes per sample
- 1,000 bytes per stream per second
- 500K bytes per log per second expected
[Diagram: Stream 1 through Stream 5,000, each batched, feeding the Logs]
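
The figures in the batching example can be checked with a short calculation. The sketch below reproduces the per-stream and per-log rates from the slide. Two values are assumptions rather than figures from the slides: the number of logs (10, inferred from 5,000 streams at 1,000 bytes/s each yielding 500K bytes/s per log) and the 4 KB batch size used to compare fill times.

```python
SENSORS = 5_000            # one stream per sensor, as in the slide
SAMPLES_PER_SECOND = 10
BYTES_PER_SAMPLE = 100
ASSUMED_LOGS = 10          # assumption: inferred from the slide's figures
ASSUMED_BATCH_BYTES = 4096 # assumption: hypothetical batch size


def per_stream_rate(samples_per_second, bytes_per_sample):
    """Bytes per second produced by a single stream."""
    return samples_per_second * bytes_per_sample


def per_log_rate(streams, stream_rate, logs):
    """Bytes per second arriving at each log if streams spread evenly."""
    return streams * stream_rate / logs


def time_to_fill(batch_bytes, rate_bytes_per_s):
    """Seconds needed to accumulate one batch at the given rate."""
    return batch_bytes / rate_bytes_per_s


stream_rate = per_stream_rate(SAMPLES_PER_SECOND, BYTES_PER_SAMPLE)
total_rate = SENSORS * stream_rate
log_rate = per_log_rate(SENSORS, stream_rate, ASSUMED_LOGS)

print(f"per stream: {stream_rate} B/s")
print(f"aggregate:  {total_rate} B/s")
print(f"per log:    {log_rate:.0f} B/s")
print(f"fill 4 KB per stream: {time_to_fill(ASSUMED_BATCH_BYTES, stream_rate):.3f} s")
print(f"fill 4 KB per log:    {time_to_fill(ASSUMED_BATCH_BYTES, log_rate) * 1000:.3f} ms")
```

Under these assumptions a single stream would need about four seconds to fill a 4 KB batch, while a log that aggregates many streams fills one in a few milliseconds, which is consistent with the deck's point that small writes per stream perform poorly and that batching across streams helps.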
Batching example (continued)
- Accumulate and write asynchronously
[Diagram: Stream 1 through Stream 5,000, LTS]

Data Ingestion and Parallelism
Source: "When Speed meets Parallelism: Pravega performance under parallel streaming workloads"
[Charts, two panels: fixed target throughput of 250 MBps, 1 KB events, random keys; x-axis: Segments/Partitions; y-axis: Write Throughput (MBps). Series: Kafka (flush, producers=10 and 100); Kafka (producers=10, 50, 100); Pravega (producers=10, 50, 100); Pulsar (producers=10, 50, 100); Pulsar (no keys, ackQ=3, producers=10 and 100)]

Consuming stream storage data
- Source and Sink Connectors
- Exactly-once semantics end to end
Source: Flink connectors repository
[Diagram: Source, using Pravega Checkpoints; Sink, using Pravega Transactions]

From edge to core
Source: "Data Flow from Sensors to the Edge and the Cloud using Pravega"
[Diagram: Sensors feed a Pravega Sensor Collector at each Micro-Edge; each Edge runs Apache Flink, a Pravega Segment Store, and Long Term Storage; the Cloud/Data Center runs Apache Flink, a Pravega Segment Store, and Long Term Storage]

Stream Parallelism
- Automatically distributes new segments to source tasks
- What about auto-scaling end-to-end?
[Diagram: Stream, Source, Pravega Checkpoints]

Auto-scaling end-to-end
1. Reacts to ingest traffic changes
2. Reacts to job resource utilization (e.g., CPU)
3. Job reacts to stream signals? Stream backlog, number of segments

Auto-scaling end-to-end
Not entirely a new idea: "Scaling Streaming Data Pipelines", Flink Forward, 2019, Junqueira and Rohrmann
- Ingest data to Pravega stream
- Changes to number of segments
- Signal to Horizontal Pod Autoscaler
- Adjust number of task managers accordingly
- Read data from Pravega stream and process it
[Diagram: Data Source, Pravega, Flink, Metrics Reporter, Kubernetes HPA]

Show time
Demo video, due to Brian Zhou

The Road Ahead
Source: Star History https:/star- history
[Chart: GitHub stars for pravega/pravega over time; x-axis: Date; y-axis ticks: 0.5K, 1.0K, 1.5K]
- Pravega is open source: https:/cf.io/pravega-community/
- Community driven
- Streaming storage technology
- Cloud native
- Stream-DB convergence
- Computational Storage
- Data at rest
- Data in motion
- Process data early

THANKS