1、2016年云栖大会PostgreSQL专场技术交流群PostgreSQL 9.6New advances inFull Text SearchOleg BartunovPostgres Professional3PostgreSQL CORELocale supportPostgreSQL extendability:GiST(KNN),GIN,SP-GiSTFull Text Search(FTS)NoSQL(hstore,jsonb)Indexed regexp searchCustom AM&Generic WALPluggable table engines(WIP)Extension
2、s:Intarray,Hstore,LtreeMajor contributors to PostgreSQLCo-founders of Postgres ProfessionalAlexander Korotkov,Teodor Sigaev,Oleg Bartunov4AgendaInitial design of Postgres and innovationsHistory of some particular innovative features of PostgresFull Text Search in 9.6New RUM index5Original design of
3、PostgresThe main design goals of the new system are to:1)provide better support for complex objects,2)provide user extendibility for data types,operators and access methods,3)provide facilities for active databases(i.e.,alerters and triggers)and inferencing including forward-and backward-chaining,4)
4、simplify the DBMS code for crash recovery,5)produce a design that can take advantage of optical disks,workstations composed of multiple tightly-coupled processors,and custom designed VLSI chips,and6)make as few changes as possible(preferably none)to the relational model.*Stonebraker M.,Rowe L.A.The
5、design of Postgres.ACM,1986.15.2.340-355.6Original design of PostgresThe main design goals of the new system are to:1)provide better support for complex objects,2)provide user extendibility for data types,operators and access methods,3)provide facilities for active databases(i.e.,alerters and trigge
6、rs)and inferencing including forward-and backward-chaining,4)simplify the DBMS code for crash recovery,5)produce a design that can take advantage of optical disks,workstations composed of multiple tightly-coupled processors,and custom designed VLSI chips,and6)make as few changes as possible(preferab
7、ly none)to the relational model.*Stonebraker M.,Rowe L.A.The design of Postgres.ACM,1986.15.2.340-355.7Extendability.Is like a SHOPPING MALL8Rent a place in the mall(vs.having your own shop)ProUse all common facilities of mallUse existing buyers base of the mallConcentrate on your own contentConsHav
8、e to pay the rent9Writing extension to DBMS(vs.writing your own specific DBMS)ProUse all common features of DBMS:concurrency,recovery,transactions etc.Use existing users base of the DBMSConcentrate on your domain specific logicConsHave to pay some overhead10Extendability need APIs11What can we exten
9、d in the DBMS?Data typesHow we can operate with this data types?(functions,operators,aggregates etc.)How we can search this data types?(indexes)What could be the source of data?(FDW)How could we store the data?(table engines)(not yet delivered to Postgres)12New types of indexesare especially hard im
10、plement because we need to deal with:concurrency(low-level locking etc.),packing data into pages,WAL-logging,This is a very hard task.Only DBMS core developer could solve it.Application developer cant.13The solution:add nested API14The solution:add nested APIIndex access method is the template which
11、 could be applied to particular data type using operator class(opclass).btree is template for different linear orderingsGiST is template forbalanced treesSP-GIST is template for non-balanced treesGIN is template for inverted indexes of composite objects BRIN is template for bounding aggregates per b
12、lock ranges 15Propagation of improvementsIf you upgrade your camera to another compatible which have higher resolution,this improvement will apply to all the compatible lenses.In PostgreSQL 9.4 GIN got 2 major improvements:posting list compression and fast scan.Opclasses received these improvements
13、automatically.16ExtendabilityProvides fast feature developingHstore(first version)several hoursFTS(tsearch2)1 week(NY holidays)KNN-GiST 1 weekjsonb_path_ops several hours in restaurantJsonb(prototype)2-3 monthsJsquery 2-3 monthsQuadtree 360 locPostgreSQL 9.4+Open-sourceRelational databaseExtendable
14、databaseStrong support of NoSQLFuture of JSONBDictionary compression for jsonbDuplicate keys storage in jsonb is the problem.Pluggable compression mechanism(extendability!).Could be applied to any data type.Each jsonb column have own dictionary of keys.Conversion on the fly.Will be released soon by
15、Postgres Professional.Dictionary compression for jsonbCustomers reviews datasetcustomer_reviews_jsonb 307 MBcustomer_reviews_jsonbc 123 MBcustomer_reviews_array 139 MBLess space than array version because of numerics compression!Subscription for jsonbNew query syntax:UPDATE js SET jskey=value WHERE
16、jsid=1;Generic mecanism,extendable for any data type instead of hack for arrays we currently have.On commitfest by Postgres Professionalhttps:/commitfest.postgresql.org/11/793/K-Nearest Neighbors SearchK-Nearest Neighbors Search Traditional search algorithms are not effective Index doesnt helps,sinc
17、e there is no predicate Full table scan-sort-limit Ad-hoc solutions are not effective Postgres innovation Use special index scan strategy to get k-tuples in right order Several orders of magnitude speedup!Use ORDER BY distance to express KNN in SQL KNN-GiST,KNN-Btree,KNN-SPGiSTK-Nearest Neighbors Se
18、arch1,000,000 randomly distributed pointsFind K-closest points to the point(0,0)Scan&SortSELECT*FROM qq ORDER BY point_distance(p,(0,0)ASC LIMIT 10;Limit(actual time=291.524.291.526 rows=10 loops=1)-Sort(actual time=291.523.291.523 rows=10 loops=1)Sort Key:(point_distance(p,(0,0):point)Sort Method:t
19、op-N heapsort Memory:26kB-Seq Scan on qq(actual time=0.011.166.091 rows=1000000 loops=1)Planning time:0.048 msExecution time:291.542 ms(7 rows)K-Nearest Neighbors Search1,000,000 randomly distributed pointsFind K-closest points to the point(0,0)KNN-GiST(GiST index for points)SELECT*FROM qq ORDER BY(
20、p (0,0)ASC LIMIT 10;Limit(actual time=0.046.0.058 rows=10 loops=1)-Index Scan using qq_p_s_idx on qq(actual time=0.046.0.058 rows=10 loops=1)Order By:(p (0,0):point)Planning time:0.052 msExecution time:0.081 ms(5 rows)KNN is 3500 times faster!K-Nearest Neighbors SearchKNN-BtreeFind 10 closest events
21、 to the Sputnik launch Union of two selects(btree index on date)select *,date 1957-10-04:date as dt from(select*from(select id,date,event from events where date=1957-10-04:date order by date asc limit 10)t2)t3 order by dt asc limit 10;Execution time:0.146 msK-Nearest Neighbors SearchKNN-BtreeFind 10
22、 closest events to the Sputnik launch Parallel Btree iindex-scans in two directionsselect id,date,event from events order by date 1957-10-04:date asclimit 10;Limit(actual time=0.030.0.039 rows=10 loops=1)-Index Scan using btree_date_idx on events(actual time=0.030.0.036 rows=10 loops=1)Order By:(dat
23、e 1957-10-04:date)Planning time:0.101 msExecution time:0.070 ms(5 rows)KNN is 2 times faster!Full Text SearchWhat is a Full Text Search?Full text search Find documents,which match a querySort them in some order(optionally)Typical SearchFind documents with all words from queryReturn them sorted by re
24、levanceWhy FTS in Databases?Feed database content to external search enginesThey are fast!BUTThey cant index all documents-could be totally virtualThey dont have access to attributes-no complex queriesThey have to be maintained headache for DBASometimes they need to be certifiedThey dont provide ins
25、tant search(need time to download new data and reindex)They dont provide consistency search results can be already deleted from databaseFTS in DatabasesFTS requirementsFull integration with database engineTransactionsConcurrent accessRecoveryOnline indexConfigurability(parser,dictionary.)Scalability
26、Traditional text search operators(TEXT op TEXT,op-,*,LIKE,ILIKE)No linguisticsupportWhat is a word?What to index?Word normalization?Stop-words (noise-words)No ranking-all documents are equally similar to querySlow,documents should be seq.scanned9.3+index support of*(pg_trgm)select*from man_lines whe
27、re man_line*(?:(?:p(?:ostgres(?:ql)?|g?sql)|sql)(?:(?:(?:mak|us)e|do|is);One of(postgresql,sql,postgres,pgsql,psql)space One of(do,is,use,make)FTS in PostgreSQLOpenFTS 2000,Pg as a storageGiST index 2000,thanks RamblerTsearch 2001,contrib:no rankingTsearch2 2003,contrib:configGIN 2006,thanks,JFG Net
28、worksFTS 2006,in-core,thanks,EnterpriseDBFTS(ms)2012,some patches committedRUM 2016,Postgres ProfessionalFTS in PostgreSQLtsvector data type for document optimized for search Sorted array of lexemsPositional informationStructural information(importance)tsquery textual data type for query with boolea
29、n operators&|!()Full text search operator:tsvector tsqueryOperators ,Sort(actual time=476.104.476.104 rows=3 loops=1)Sort Key:(ts_rank(text_vector,titl:tsquery)DESCSort Method:top-N heapsort Memory:25kBBuffers:shared hit=149804 read=87416-Bitmap Heap Scan on ti2(actual time=6.894.469.215 rows=47855
30、loops=1)Recheck Cond:(text_vector titl:tsquery)Heap Blocks:exact=4913Buffers:shared hit=149804 read=87416-Bitmap Index Scan on ti2_index(actual time=6.117.6.117 rows=47855 loops=1)Index Cond:(text_vector titl:tsquery)Buffers:shared hit=1 read=12Planning time:0.255 msExecution time:476.171 ms(15 rows
31、)HEAP IS SLOW470 ms!Some FTS problems:#2No phrase search“A&B”is equivalent to“B&AThere are only 92 posts with person Tom Good,but FTS finds 34039 postsCombination of FTS+regular expression works,but slow and can be used only for simple queries.Some FTS problems:#3Combine FTS with ordering by timesta
32、mp SELECT sent,subject from pglist WHERE fts to_tsquery(english,tom&lane)ORDER BY abs(sent 2000-01-01:timestamp)ASC LIMIT 5;Limit(actual time=545.560.545.560 rows=5 loops=1)-Sort(actual time=545.559.545.559 rows=5 loops=1)Sort Key:(CASE WHEN(sent-2000-01-01 00:00:00:timestamp without time zone)Bitma
33、p Heap Scan on pglist(actual time=87.545.507.897 rows=222813 loops=1)Recheck Cond:(fts tom&lane:tsquery)Heap Blocks:exact=105992-Bitmap Index Scan on pglist_gin_idx(actual time=57.932.57.932 rows=222813 loops=1)Index Cond:(fts tom&lane:tsquery)Planning time:0.376 msExecution time:545.744 mssent|subj
34、ect-+-1999-12-31 13:52:55|Re:HACKERS LIKE fixed(?)for non-ASCII collation orders2000-01-01 11:33:10|Re:HACKERS dubious improvement in new psql1999-12-31 10:42:53|Re:HACKERS LIKE fixed(?)for non-ASCII collation orders2000-01-01 13:49:11|Re:HACKERS dubious improvement in new psql1999-12-31 09:58:53|Re
35、:HACKERS LIKE fixed(?)for non-ASCII collation orders(5 rows)Time:568.357 msInverted Index in PostgreSQLENTRYTREEPosting listPosting treeNo positions in index!Inproving GINImprove GIN indexStore additional information in posting tree,for example,lexemes positions or timestampsUse this information to
36、order resultsImproving GIN9.6 opens Pandora boxCreate access methods as extension!Lets call it RUMCREATE INDEX.USING RUMUse positions to calculate rank and order resultsIntroduce distance operator tsvector tsquery CREATE INDEX ti2_rum_fts_idx ON ti2 USING rum(text_vector rum_tsvector_ops);SELECT doc
37、id,ts_rank(text_vector,to_tsquery(english,title)AS rankFROM ti2WHERE text_vector to_tsquery(english,title)ORDER BYtext_vector plainto_tsquery(english,title)LIMIT 3;QUERY PLAN-L Limit(actual time=54.676.54.735 rows=3 loops=1)Buffers:shared hit=355-Index Scan using ti2_rum_fts_idx on ti2(actual time=5
38、4.675.54.733 rows=3 loops=1)Index Cond:(text_vector titl:tsquery)Order By:(text_vector titl:tsquery)Buffers:shared hit=355Planning time:0.225 msExecution time:54.775 ms VS476 ms!(8 rows)CREATE INDEX.USING RUMTop-10(out of 222813)postings with Tom LaneGIN index 1374.772 ms SELECT subject,ts_rank(fts,
39、plainto_tsquery(english,tom lane)AS rank FROM pglist WHERE fts plainto_tsquery(english,tom lane)ORDER BY rank DESC LIMIT 10;QUERY PLAN-Limit(actual time=1374.277.1374.278 rows=10 loops=1)-Sort(actual time=1374.276.1374.276 rows=10 loops=1)Sort Key:(ts_rank(fts,tom&lane:tsquery)DESCSort Method:top-N
40、heapsort Memory:25kB-Bitmap Heap Scan on pglist(actual time=98.413.1330.994 rows=222813 loops=1)Recheck Cond:(fts tom&lane:tsquery)Heap Blocks:exact=105992-Bitmap Index Scan on pglist_gin_idx(actual time=65.712.65.712 rows=222813 loops=1)Index Cond:(fts tom&lane:tsquery)Planning time:0.287 msExecuti
41、on time:1374.772 ms(11 rows)CREATE INDEX.USING RUMTop-10(out of 222813)postings with Tom LaneRUM index 216 ms vs 1374 ms!create index pglist_rum_fts_idx on pglist using rum(fts rum_tsvector_ops);SELECT subject FROM pglist WHERE fts plainto_tsquery(tom lane)ORDER BY fts plainto_tsquery(tom lane)LIMIT
42、 10;QUERY PLAN-Limit(actual time=215.115.215.185 rows=10 loops=1)-Index Scan using pglist_rum_fts_idx on pglist(actual time=215.113.215.183 rows=10 loops=1)Index Cond:(fts plainto_tsquery(tom lane:text)Order By:(fts plainto_tsquery(tom lane:text)Planning time:0.264 msExecution time:215.833 ms(6 rows
43、)Phrase Search(8 years old!)Queries A&B:tsquery and B&A:tsquery produce the same result Phrase search-preserve order of words in a queryResults for queries A&B and B&A should be different!Introduce new FOLLOWED BY ()operator:Guarantee an order of operands Distance between operandsa b=a&b&(i,j:pos(b)
44、i pos(a)j=n)Phrase search-definitionFOLLOWED BY operator returns:falsetrue and array of positions of the right operand,which satisfy distance condition FOLLOWED BY operator requires positions A B=AB A B matches the word with two different forms(infinitives)TSQUERY phraseto_tsquery(CFG,TEXT)Stop word
45、s are taken into account.select phraseto_tsquery(PostgreSQL can be extended by the user in many ways);phraseto_tsquery-postgresql extend user mani way(1 row)Phrase search-propertiesPrecendence of tsquery operators-!&|Use parenthesis to control nesting in tsqueryselect a&b c:tsquery;tsquery-a&b csele
46、ct b c&a:tsquery;tsquery-b c&aselect b (c&a):tsquery;tsquery-b c&b aPhrase search-Examples 1.1 mln postings(postgres mailing lists)There is overhead of phrase operatortomlane tom&laneSeqScan :2.6s 2.2 s GIN :1.2s 0.48 s need recheckRUM:0.5s 0.48 s use positions to filter Phrase search with RUM index
47、 has negligible overhead!select count(*)from pglist where fts to_tsquery(english,tom lane);count-222777(1 row)Some FTS problems:#3Combine FTS with ordering by timestampStore timestamps in additional information in timestamp order!create index pglist_fts_ts_order_rum_idx on pglist using rum(fts rum_t
48、svector_timestamp_ops,sent)WITH(attach=sent,to=fts,order_by_attach=t);select sent,subject from pglistwhere fts to_tsquery(tom&lane)order by sent 2000-01-01:timestamp limit 5;-L Limit(actual time=84.866.84.870 rows=5 loops=1)-Index Scan using pglist_fts_ts_order_rum_idx on pglist(actual time=84.865.8
49、4.869 rows=5 loops=1)Index Cond:(fts to_tsquery(tom&lane:text)Order By:(sent 2000-01-01 00:00:00:timestamp without time zone)Planning time:0.162 msExecution time:85.602 ms vs 645 ms!(6 rows)Some FTS problems:#3Combine FTS with ordering by timestampStore timestamps in additional information in timest
50、amp order!select sent,subject from pglistwhere fts to_tsquery(tom&lane)and sent 2000-01-01:timestamp order by sent desc limit 5;explain analyze select sent,subject from pglistwhere fts to_tsquery(tom&lane)order by sent=|2000-01-01:timestamp limit 5;Speedup 1x,since tom lane is popular filter-select
51、sent,subject from pglistwhere fts to_tsquery(server&crashed)and sent 2000-01-01:timestamp order by sent desc limit 5;select sent,subject from pglistwhere fts to_tsquery(server&crashed)order by sent=|2000-01-01:timestamp limit 5;Speedup 10 xRUM TodoAllow multiple additional infoadd opclasses for arra
52、y(similarity and as additional info)and int/floatimprove ranking function to support TF/IDFImprove insert time(pending list?)Improve GENERIC WAL to support shiftAvailability:9.6+only:https:/ details about new FTS featureshttp:/www.sai.msu.su/megera/postgres/talks/pgopen-2016-rum.pdfTHANKS FOR YOUR ATTENTION感谢大家!