SkyWalking BanyanDB: Query Engine and Stream Computing in a Time-Series Database

Lu Jiajing (陆家靖)
- Head of the frameworks and tools team at Shouqianba (收钱吧)
- PhD in nuclear physics, Fudan University
- Works on observability platforms, API gateways, and service-governance platforms
- Apache SkyWalking PMC member

01 Observability and time-series databases

The three pillars of observability: metrics, traces, logs
[Figure: Venn diagram of Metrics (aggregatable), Tracing (request-scoped), and Logging (events). Overlaps: request-scoped metrics, request-scoped events, aggregatable events (e.g. rollups), and request-scoped, aggregatable events. Scale: low volume to high volume.]
* The three pillars of observability. Image source: Metrics, tracing, and logging, P. Bourgon. https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
- Tracing & Logging: workflow-centric, distributed, causal
- Metrics: statistical/aggregatable (rollups); temporal: fixed interval, compression

Data structure of time-series data: Tags & Fields
* How ClickHouse inspired us to build a high performance time series database. Aliaksandr Valialkin (valyala). https:/

Index
- SeriesID (UInt64)
- hash(all Tags): InfluxDB, VictoriaMetrics, etc.
- hash(partial TagValues): BanyanDB
* Frame of Reference and Roaring Bitmaps, A. Grand. https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps

Data structure of time-series data: the high-cardinality problem
- How Quickly Does Cardinality Grow?
* What is High Cardinality, R. Skillington. https://chronosphere.io/learn/what-is-high-cardinality/

Data structure of time-series data: the high-cardinality problem
[1] https:/ [2] https:/ [3] https:/ [4] https:/ [5] https:/
- … via Prometheus Recording Rules [1]
- TimescaleDB: Tiered B-Tree, Chunks [2]
- VictoriaMetrics/VictoriaLogs: MergeSet [3], high-cardinality TSDB benchmarks [4]
- InfluxDB IOx: columnar
  built on Apache Arrow and Parquet [5]
- BanyanDB (tailored for SkyWalking): partial tags for seriesID; compact seriesID (xxhash)

Data structure of time-series data: read/write patterns
- Vertical write
- Horizontal read
- Insertions … lookups
- Old data is less likely to be …

Storage of time-series data: the RUM Conjecture
"We cannot design an access method for a storage system that is optimal in all the following three aspects - Reads, Updates, and Memory."
* Designing Access Methods: The RUM Conjecture. M. Athanassoulis et al. Proc. 19th International Conference on Extending Database Technology (EDBT), March 15-18, …

Storage of time-series data: LSM-tree + WiscKey (Badger)
- Log, Memtable, Sorted String Tables (SSTables)
- Periodic compaction: write amplification 3x to 10x, read amplification 10x to 300x
- WiscKey: key-value separation
  - LSM-tree: key, pointer
  - value-log
  - online, lightweight garbage collection
* WiscKey: Separating Keys from Values in SSD-conscious Storage, L. Lu et al. FAST '16

02 Introduction to BanyanDB

BanyanDB data model
[Figure: data model. Labels: Group, IndexRule A to D, Measure, Stream, TopNAggregation, Property, IndexRuleBinding, Schemaless S…]

BanyanDB storage structure
[Figure: storage structure. Labels: Group 1 to Group 6, super dataset, Raft-based Metadata (ETCD), Storage Node, Series Metadata, Shard, Segment (2023-05-29, 2023-05-30), Global Index (e.g. Trace ID), Block (2023-05-29 21:00, 2023-05-29 22:00), Index block, Data block, 1:N.]

BanyanDB data compression
- Facebook Gorilla: (TS, Value)
- Timestamp: fixed interval = derived timestamp
- Value: XOR
- Compress big chunk (1M)
* Gorilla: a
  fast, scalable, in-memory time series database, T. Pelkonen et al. Proceedings of the VLDB Endowment, Volume 8, Issue 12, pp. 1816-1827

BanyanDB data compression
[Figure: bar chart "Compression ratio of observability data", encoded vs. raw size (axis 0 to 16M) for trace, log, and metric. Annotations: 30% (16 Bytes to 5 Bytes per DataPoint), 13%, 10%.]
- Facebook Gorilla: (TS, Value)
- Timestamp: fixed interval = derived timestamp
- Value: XOR
- Compress big chunk (1M)

BanyanDB query subsystem workflow
[Figure: query flow, steps 1 to 4. Labels: gRPC endpoint, liaison, Queue, Measure/Stream Query, Metadata, Query Analyzer, SeriesID (Partition) index, Storage, KV E…]

BanyanDB Iterator model
Query plan, top to bottom:
Limit 10
Offset 5
OrderBy latency DESC
Projection(service_id, service_instance_id, latency)
IndexScan(Shard 1) SortBy Timestamp
IndexScan(Shard 2) SortBy Timestamp
IndexScan(Shard 3) SortBy Timestamp
Query: SELECT service_id, service_instance_id, latency FROM service_instance_cpm WHERE zone = "Shanghai" ORDER BY latency DESC OFFSET 5 LIMIT 10
Iterator (interface, modular): Open(), Next() -> Item, Close()

BanyanDB query optimizer
Plan before optimization, top to bottom:
Limit 10
Offset 5
OrderBy latency DESC
Projection(service_id, service_instance_id, latency)
IndexScan(Shard 1) SortBy Timestamp
IndexScan(Shard 2) SortBy Timestamp
IndexScan(Shard 3) SortBy
Timestamp
Optimizing; plan after optimization, top to bottom:
Limit 10
Offset 5
Projection(service_id, service_instance_id, latency)
Merge Sort
IndexScan(Shard 1) SortBy Index
IndexScan(Shard 2) SortBy Index
IndexScan(Shard 3) SortBy I…

03 TopN stream computing

TopN in SkyWalking
- SkyWalking OAP
- Endpoint CPM
- Success Rate
- L…

TopN in SkyWalking: the ElasticSearch implementation
* https://www.elastic.co/blog/found-elasticsearch-

Common TopN algorithms: Space-Saving
- Given an error rate ε, keep … counters
- Suppose N incoming elements are processed from the stream S
- Guarantee: all elements with frequency … are guaranteed to be reported
- Applications: Apache Kylin, citusdata/postgresql-topn
* Efficient Computation of Frequent and Top-k Elements in Data Streams, A. Metwally et al. Lecture Notes in Computer Science (LNCS), volume 3363

Common TopN algorithms: Count-Min
Sketch
* https:/
- Given an error ε and a probability δ, set … and …, where b is a constant*
- Hash collisions: d pairwise-independent hash functions
- Guarantee: with a probability of …, the error is at most …, where … is the sum of all counts

Stream processing: which time?
- Event time
- Processing time
- Processing time = Event time
- BanyanDB: use EventTime of the source measure
* https://jet-start.sh/docs/4.3.1/concepts/event-time
* https:/

- Use a tumbling window with the same interval as the source measure, e.g. 1 min, 1 hr
- Keep N TopN entries for each group in a window
- Keep M windows in …
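
The windowing rules above (tumbling windows keyed by event time, N entries per group, M retained windows) can be sketched roughly as follows. This is a minimal illustration, not BanyanDB's implementation: the class `TumblingTopN`, its parameters, and the policy of rejecting data older than every retained window are all assumptions. It also assumes each entity reports once per window, which holds when the window interval equals the source measure interval.

```python
import heapq
from collections import defaultdict


class TumblingTopN:
    """Hypothetical sketch: keep the N largest entries per group in each tumbling window."""

    def __init__(self, interval_s, n, max_windows):
        self.interval_s = interval_s    # window length, same as the source measure interval
        self.n = n                      # N: entries kept per group and window
        self.max_windows = max_windows  # M: windows retained in memory
        self.windows = {}               # window start -> {group: min-heap of (value, entity)}

    def add(self, event_time_s, group, entity, value):
        # Assign the data point to a window by its event time, not its arrival time.
        start = event_time_s - event_time_s % self.interval_s
        if start not in self.windows:
            if len(self.windows) == self.max_windows:
                oldest = min(self.windows)
                if start < oldest:
                    return False          # older than every retained window: rejected
                del self.windows[oldest]  # a real system would flush this window first
            self.windows[start] = defaultdict(list)
        heap = self.windows[start][group]
        if len(heap) < self.n:
            heapq.heappush(heap, (value, entity))
        elif (value, entity) > heap[0]:
            heapq.heapreplace(heap, (value, entity))  # evict the current smallest entry
        return True

    def top(self, window_start, group):
        """Return the retained (value, entity) pairs, largest first."""
        heap = self.windows.get(window_start, {}).get(group, [])
        return sorted(heap, reverse=True)
```

Because late points are matched to a window by event time, a point that arrives out of order still lands in the right bucket as long as that window is among the M retained ones.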

Stream processing: out-of-order events and watermarks
* https://jet-start.sh/docs/4.3.1/concepts/event-time
* https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/time/
- Use the Measure timestamp as the watermark: strictly monotonic
- Keep M windows in memory to accept late measures within the allowed lateness
- Flush at 40% of the …

Stream processing: final design
[Figure: TopN pipeline, steps 1 and 2. Labels: gRPC endpoint, liaison, queue, Storage Node, Measure, Data Block, Index, Filter, Mapper, GroupBy, TumblingWindow, TopN Op., TopN P…]

Stream processing: performance comparison
[Figure: chart "TopN performance comparison (query time in seconds)", FullScan vs. PreAggregation, at cardinalities 5K, 10K, 25K, 50K, 100K.]
- Memorize 1,000 TopN entries per bucket
- Write measures with cardinality (5K, 10K, 25K, 50K, 100K) per minute
- Query Top 10 with both FullScan and PreA…

BanyanDB Query Subsystem Roadmap
- Merge Query Plan/Executor for TopN and Measure query (OSPP)
- Add Self-Observability
- Cluster-mode S…

THANKS
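
Sketch for the Iterator model and query optimizer slides: the optimized plan drops the global OrderBy and instead merges per-shard streams that are already sorted by the index. The following is a rough illustration under the assumption that each shard's index scan yields rows sorted by latency in descending order. `ListScan`, `drain`, and `merge_sort_plan` are hypothetical names, not BanyanDB APIs; only the Open/Next/Close shape comes from the slide.

```python
import heapq
from itertools import islice


class ListScan:
    """Hypothetical stand-in for IndexScan: returns rows already sorted by the index."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = None

    def open(self):
        self._pos = 0

    def next(self):
        """Return the next row, or None when the scan is exhausted or not open."""
        if self._pos is None or self._pos >= len(self._rows):
            return None
        row = self._rows[self._pos]
        self._pos += 1
        return row

    def close(self):
        self._pos = None


def drain(scan):
    """Adapt an Open/Next/Close iterator to a Python generator."""
    scan.open()
    try:
        while (row := scan.next()) is not None:
            yield row
    finally:
        scan.close()


def merge_sort_plan(scans, offset, limit):
    """Apply Offset/Limit on top of a k-way merge of streams sorted by latency DESC."""
    merged = heapq.merge(*(drain(s) for s in scans),
                         key=lambda row: row["latency"], reverse=True)
    # Only offset + limit rows are pulled from the merge; the rest is never read.
    return list(islice(merged, offset, offset + limit))
```

With a merge, the executor reads only as many rows as Offset plus Limit require, whereas a global OrderBy has to consume every shard's full output before emitting anything.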
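
Sketch for the Space-Saving slide: the algorithm, as described by Metwally et al., keeps a fixed number of counters and, when a new item arrives with no free counter, replaces the item with the smallest count and inherits that count as a possible overestimate. The version below is a simplified illustration; the class and attribute names are assumptions, and it finds the minimum with a linear scan rather than the paper's Stream-Summary structure.

```python
class SpaceSaving:
    """Simplified Space-Saving counter set (linear-scan eviction, for illustration)."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of counters kept
        self.counts = {}          # item -> estimated count (never below the true count)
        self.errors = {}          # item -> maximum possible overestimation

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
            self.errors[item] = 0
        else:
            # Replace the item with the smallest count; the newcomer inherits
            # that count plus one, and the inherited part is its error bound.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[item] = floor + 1
            self.errors[item] = floor

    def top(self, k):
        """Return the k items with the largest estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

The counters always sum to the number of processed elements, so with m counters no estimate exceeds the true count by more than N/m, and any item occurring more than N/m times is guaranteed to hold a counter.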
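
Sketch for the Count-Min Sketch slide: the structure is a d-by-w table of counters with one hash function per row, and an item's estimate is the minimum of its d cells. The sizing below uses the standard choice from Cormode and Muthukrishnan, w = ceil(e/ε) and d = ceil(ln(1/δ)), which may differ from the formulas on the slide. Salted BLAKE2b is used here as a convenient stand-in for a pairwise-independent hash family, which is an assumption of this sketch.

```python
import hashlib
import math


class CountMinSketch:
    """Count-Min Sketch over string items (illustrative, not tuned for speed)."""

    def __init__(self, epsilon, delta):
        self.width = math.ceil(math.e / epsilon)     # w: counters per row
        self.depth = math.ceil(math.log(1 / delta))  # d: number of rows / hash functions
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item):
        # One salted hash per row; the salt makes the rows behave as different functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=row.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest estimate
        # and it never falls below the true count.
        return min(self.table[row][col] for row, col in self._cells(item))
```

Unlike Space-Saving, the sketch does not remember which items it has seen, so a TopN built on it needs a separate candidate set, for example a small heap of the current heaviest items.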