Oracle Real Application Clusters (RAC) Cache Fusion Performance Optimizations on Exadata
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Technic l Brief Oracle Real Application Clusters (RAC) Cache Fusion Performance Optimizations on Exadata M y, 2021 Copyright © 2021, Or cle nd/or its fli tes Public 1 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
D scla mer This document in ny form, softw re or printed m ter, cont ins propriet ry inform tion th t is the exclusive property of Or cle. Your ccess to nd use of this confdenti l m teri l is subject to the terms nd conditions of your Or cle softw re license nd service greement, which h s been executed nd with which you gree to comply. This document nd inform tion cont ined herein m y not be disclosed, copied, reproduced or distributed to nyone outside Or cle without prior writen consent of Or cle. This document is not p rt of your license greement nor c n it be incorpor ted into ny contr ctu l greement with Or cle or its subsidi ries or fli tes. This document is for inform tion l purposes only nd is intended solely to ssist you in pl nning for the implement tion nd upgr de of the product fe tures described. It is not commitment to deliver ny m teri l, code, or function lity, nd should not be relied upon in m king purch sing decisions. The development, rele se, nd timing of ny fe tures or function lity described in this document rem ins t the sole discretion of Or cle. Due to the n ture of the product rchitecture, it m y not be possible to s fely include ll fe tures described in this document without risking signifc nt dest biliz tion of the code. 2 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
Table of contents D scla mer 2 Execut ve Summary 4 L st of Performance Opt m zat ons 5 Ex fusion 5 Zero Copy Block Sends 6 Sm rt Fusion Block Tr nsfer 6 Undo Block RDMA Re ds 7 In-Memory Commit C che 7 F st Index Split 7 Persistent Memory Commit Acceler tor 8 Sh red D t Block nd Undo He der RDMA Re ds 8 Bro dc st-on-Commit over RDMA 9 Conclus on 10 References 10 3 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
Execut ve Summary Or cle Re l Applic tion Clusters commonly referred to s Or cle RAC is n option to the Or cle D t b se th t provides line r horizont l sc l bility nd high v il bility. Or cle RAC C che Fusion is component of Or cle RAC responsible for synchronizing the c ches mong Or cle RAC inst nces m king it possible for pplic tions to se mlessly utilize the computing resources of ll the Or cle RAC inst nces without m king ny ch nges. C che Fusion utilizes dedic ted priv te network for c che synchroniz tion. Applic tion sc l bility therefore relies on the l tency nd b ndwidth provided by the underlying priv te network. Ex d t , with its doption of advanced network ng components l ke RDMA over Converged Ethernet (RoCE) or Infn Band, enables Oracle to further mprove performance and scalab l ty. In ddition to benefting from the improved wire speed of the underlying network, we re-engineered signifc nt portions of Or cle RAC C che Fusion l yer to lever ge the dv nced protocols nd RDMA c p bilities v il ble on Ex d t . For ex mple, on Ex d t , Or cle RAC inst nces directly tr nsfers bufers to the wire nd byp sses the Oper ting System (OS) kernel. This results in block tr nsfers th t h ve ultr -low l tency nd th t incur dr m tic lly lower CPU cost. Or cle RAC on Ex d t lso uses new protocols th t elimin te w its in the perform nce critic l p rts of tr ns ction commits. This p per will expl in these Ex d t -specifc optimiz tions th t h ve been implemented since Or cle 12c. 4 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
L st of Performance Opt m zat ons Exafus on Tr dition lly, Or cle RAC mess ging w s implemented using the commonly used networking model using network sockets. In this model, ll communic tions (sends nd receives) would go through the OS kernel, thus requiring context switches nd memory copies between user sp ce nd OS kernel for every RAC mess ge being exch nged. Ex fusion is the next gener tion networking protocol v il ble on Ex d t since 12c (on both RoCE nd InfniB nd), which llows for d rect-to-w re messag ng from user sp ce, completely bypass ng the OS kernel. By elimin ting the context switches nd OS kernel overhe d, Ex fusion en bles Or cle to process round trip mess ges in less th n 50 µs (micro-seconds), wh ch s 3x faster than a trad t onal socket- based mplementat on, and a further 33% mprovement compared to the frst generat on of Exadata which used the RDS protocol for mess ging. Addition lly, the CPU cost ssoci ted with sending nd receiving mess ges is lower with Ex fusion, llowing for higher block tr nsfer throughput nd incre sed he droom in LMS processes before they could become s tur ted. Faster messag ng not only benefts runt me appl cat on performance, t also makes every Oracle RAC operat on faster - this includes dyn mic lock rem stering (DRM), Or cle RAC reconfgur tion ( ssoci ted with inst nce or PDB membership ch nges), nd inst nce recovery. C che Fusion Tr nsfer L tency Comp rison 150 µs 75 µs 50 µs Non-Ex d t (UDP) Ex d t 11g (RDS) Ex d t (Ex fusion) The adopt on of Exafus on s the foundat on of subsequent performance opt m zat ons for RAC on Exadata, nclud ng zero copy transfers and adopt on of RDMA. Ex fusion nd the subsequent optimiz tions described in this document do not require extr OS resources to oper te. When Ex fusion is en bled, one m y notice th t the IPC0 b ckground process uses high RSS memory us ge in “ps”, however this is due to the f ct th t Or cle inst nce registers (pins) ll IPC bufers with the Host Ch nnel Ad ptor (HCA) on beh lf of ll processes running in the 5 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
inst nce, nd does not nd cate excess ve memory usage or memory leaks. Further det ils c n be found in MOS note 2407743.1. Zero Copy Block Sends RoCE nd InfniB nd network d pters support Zero Copy mess ging. User sp ce bufers re registered with the HCA nd the HCA directly pl ces the contents of user sp ce bufers on the wire, unlike tr dition l mess ging protocols where the OS kernel frst m kes copy of the user sp ce bufer nd then pl ces them on the wire. Since Or cle RAC 12c, we use this fe ture on Ex d t for inter-inst nce communic tions. Elimin tion of the CPU cycles required for copying bufers mproved the transfer latenc es by up to 5% compared to Exafus on w thout Zero Copy sends. Smart Fus on Block Transfer Tr dition lly, Or cle RAC inst nce would h ve to w it for redo log fush to complete before sending dirty block to nother inst nce. This is common ccess p tern in OLTP systems with frequent DML’s. The redo fush is done to ensure d t b se consistency in the event of n inst nce f ilure. This me ns th t inter-inst nce tr nsfer l tency for frequently modifed blocks which h ve redo pending w s lw ys dependent on redo fush I/O l tency, nd w s subject to outliers c used by intermitent spikes in I/O perform nce. Or cle RAC 12c utilizes Sm rt Fusion Block Tr nsfer optimiz tion, which llows n Or cle RAC inst nce to send the block once the redo I/O s n-f ght to the Ex d t stor ge server. Or cle RAC LMS process is permited to initi te block tr nsfer before receiving I/O completion cknowledgment, llowing sessions on the requestor inst nce to st rt ccessing th t block while the redo I/O m y still be pending. The requestor inst nce checks for I/O completion before it commits further ch nges to the s me block. The commiting process is required to w it for the “remote og force - commit” w it event if the I/O is yet to complete. This is r re occurrence, which is only seen when there re extreme I/O outliers. Such I/O outliers re mostly elimin ted on Ex d t with the Sm rt Fl sh Logging fe ture. Sm rt Fusion Block Tr nsfer optimiz tion llows for improved concurrency cross Or cle RAC inst nces to improve over ll pplic tion perform nce. This optimiz tion results in reducing the “gc curre t block busy” wa t t mes by 3x t mes for worklo ds th t upd tes hot blocks concurrently. 2. Tr nsfer block 2. Tr nsfer block 1. Disp tch log write I/O 3. Check for I/O 1. Disp tch nd w it for completion completion before log write I/O commiting Or g nal Protocol Smart Fus on Block Transfer 6 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
Undo Block RDMA Reads Undo blocks need to be fetched from other Or cle RAC inst nces when there re tr ns ction rollb cks etc. In Or cle RAC 18c, undo block tr nsfers h ve been optimized to use RDMA-b sed tr nsfer protocol, repl cing the tr dition l mess ging-b sed protocol. By leverag ng RDMA, foreground processes are able to d rectly read the undo blocks from the remote nstance’s SGA. The undo block re ds no longer nvoke processes on the remote inst nce, removing the server-side CPU nd context switch overhe ds which were lw ys p rt of tr dition l Or cle RAC communic tions. Addition lly, the tr nsfer l tencies re no longer fected by OS process or over ll system CPU lo d on the remote inst nce, wh ch helps susta n determ n st c read latenc es even n the case of a load sp ke on the remote inst nce. RDMA re d of remote block would typic lly complete in less than 10 µs, which is 5x mprovement over the best l tencies we would get with the tr dition l mess ge-b sed protocol using Ex fusion. In-Memory Comm t Cache Applic tions th t h ve long running b tch jobs nd concurrent queries m y exhibit high volumes of “undo he der” CR block tr nsfers. In Or cle 18c, an n- memory comm t cache has been added on Exadata. E ch inst nce would m int in c che of loc l tr ns ctions nd their respective st tes (commited or not) in the SGA, nd the c che c n be looked up remotely. This is f ster th n tr nsferring the undo he der blocks, e ch sized 8kb, to the remote inst nce. The st te of multiple tr ns ction ID’s (XID’s) c n be looked up in single mess ge, which helps reduce the number of roundtrip mess ges in Or cle RAC, nd lso the CPU overhe d in LMS processes which is responsible for responding to remote lookup requests. With the in-memory commit c che, we are able to batch up to 30 XID lookups n a s ngle roundtr p message which would h ve been 30x 8k block tr nsfers prior to this optimiz tion. With the commit c che optimiz tion, we c n expect lot of the “gc cr b ock 2- NOTE: The “gc transaction tab e way” w its corresponding to “undo he der” tr nsfers to be repl ced with 2-way” w it is used in rele ses sm ller number of “gc transaction tab e 2-way” w its. A single “gc transaction st rting with Or cle 21c. E rlier rele ses (Or cle 18c nd tab e 2-way” w it represents remote lookup of multiple XID’s in one roundtrip. 19c) would use the “gc transaction tab e” w it event inste d. Fast Index Spl t When there is B-tree index le f block split (frequently seen in OLTP worklo ds with right-growing indices), pplic tions ccessing the spliting le f & br nch blocks on ll Or cle RAC inst nces would need to w it for the split oper tion to complete. This m y c use intermitent hiccups (periods of lmost zero ctivity) in pplic tion perform nce. Tr dition lly, these w its were implemented under TX enqueue (“enq: TX-index contention” w its). These split w its h ve been optimized on Ex d t in Or cle 19c, to use less expensive C che Fusion b sed mech nism in lieu of glob l enqueues. The fast ndex spl t wa ts w ll be under the new “gc i dex operatio ” wa t event (“i dex split completio ” n 21c onwards), wh ch replaces the trad t onal TX enqueue wa ts. 7 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
Pers stent Memory Comm t Accelerator Ex d t X8M introduces the Persistent Memory Commit Acceler tor, which implements redo log I/O w th RDMA wr tes to pers stent memory on the storage servers. This optimiz tion signifc ntly improves redo fush I/O perform nce, which would further improve inter-inst nce concurrency on systems experiencing high volumes of dirty bufer sh ring (see Sm rt Fusion Block Tr nsfer). Shared Data Block and Undo Header RDMA Reads In Or cle 21c, RDMA support for Cache Fus on has been extended to support reads for data blocks, space blocks and undo header blocks. Simil r to the Undo Block RDMA re d optimiz tion in 18c, this will contribute to f ster re ds of d t c ched in remote inst nces, nd further reduction in LMS CPU since LMS will not be invoked when d t is re d vi RDMA. Tr dition lly, foreground process would send request to re d block to the m ster inst nce, then the m ster inst nce would forw rd the request to the holder inst nce, nd the request is fulflled by 3-w y C che Fusion tr nsfer (“gc current b ock 3-way”). This is common ccess p tern in re d intensive OLTP worklo ds running on l rge clusters of 3+ nodes. In l rge clusters, the size of e ch inst nce is typic lly sm ll, which me ns th t it is less likely th t d t is c ched on the loc l inst nce, but ch nces re higher th t it is c ched on nother inst nce. With d t & sp ce block RDMA, the m ster inst nce will respond to the requestor with lock gr nt (permission to re d the d t ), long with inform tion bout the holder inst nce for the block requested. The requesting client c n then RDMA-re d the block directly from the holder inst nce. This will remove the m ster-holder mess ging, which will help improve re d l tency nd reduce LMS CPU on the holder inst nce (who tr dition lly h d to send b ck the block to the requestor). Inst nce 1 Inst nce 1 (Requestor) (Requestor) FG FG 3. Block Tr nsfer 1. Request 3. Direct Re d 1. Request 2. Gr nt LMS LMS LMS 2. Forw rd Inst nce 3 Inst nce 2 Inst nce 3 Inst nce 2 (Resource Holder) (Resource M ster) (Resource Holder) (Resource M ster) Or g nal Protocol RDMA-based Protocol In this c se, the foreground will see the following sequence of w it events inste d of the tr dition l “gc current b ock 3-way” w it: “gc current grant 2-way” w it, followed by, A short “gc current b ock direct read” w it event The “gc current b ock direct read” w its re typ cally less than 10us, nd the combined w it time for the gr nt & re d is usu lly shorter th n the tr dition l 3- w y tr nsfer l tency. 8 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
If the requestor is lso the m ster inst nce, the “gc current grant 2-way” in the ex mple bove c n be elimin ted, bec use the inst nce c n gr nt itself permission to re d d t without involving ny mess ging. In this c se, the request c n be quickly fulflled by single “gc current b ock direct read”. This would repl ce some “gc current b ock 2-way” w its th t were tr dition lly seen in Or cle RAC, including 2 node clusters. Addition lly, if remote m ster inst nce is lso the holder inst nce, LMS would respond with gr nt mess ge, then the requestor will RDMA-re d the d t from the holder (who is lso the m ster). This is simil r to the 3-w y scen rio described bove, except th t the m ster nd holder inst nces re the s me. In this c se, the tr dition l “gc current b ock 2-way" w its re repl ced by “gc current grant 2-way” nd “gc current b ock direct read”. While the re d l tencies won’t improve much in this c se, the cost for LMS to grant a lock s cheaper compared to send ng back a data block, so the RDMA opt m zat on w ll help reduce LMS CPU usage. Broadcast-on-Comm t over RDMA Before commiting tr ns ction, the Bro dc st-on-Commit protocol ensures th t the system ch nge number (SCN) on ll the inst nces in cluster is t le st s high s the commit SCN. This is required to ensure the Consistent Re d (CR) property of Or cle tr ns ctions. Tr dition lly, the Bro dc st-on-Commit protocol used mess ges to bro dc st the SCN to ll the inst nces in cluster. The LGWR process sends the SCN in mess ge to the LMS process on ll inst nces. LMS process, upon receiving n SCN mess ge, upd tes its inst nce’s SCN nd sends b ck n SCN ACK mess ge to the LMS process on the initi ting inst nce. Once the redo I/O completes, LGWR checks whether the redo SCN h s been cknowledged by ll inst nces. If so, LGWR notifes the foreground processes w iting for the tr ns ction th t the commit oper tion h s completed. If the redo SCN w s not cknowledged by the time the redo I/O completes, then the commit won’t complete until ll SCN ACKs h ve been received. Clients will see high “ og f e sync” w it times in this c se. In Or cle 21c, Bro dc st-on-Commit h s been optimized to use RDMA for the following re sons: 1. RDMA l tency is lower th n mess ging: As I/O l tency improves on Ex d t , bro dc sting SCN using mess ging could potenti lly become botleneck. 2. Reducing lo d on LMS processes: Running OLTP pplic tions, we see th t SCN mess ges ccount for me sur ble portion of mess ging tr fc, especi lly on clusters with l rge number of inst nces. Although these mess ges re r rely in the critic l p th l tency-wise (bec use the ctu l IO would typic lly t ke longer), reducing these mess ges will h ve beneft of reducing LMS lo d, giving us more he droom so th t the system c n beter toler te lo d spikes. 9 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
For ex mple, running l rge CRM (OLTP) worklo d on 3 inst nce cluster, we s w th t 12% of over ll RAC mess ges were for SCN bro dc sts. With RDMA, these mess ges will no longer invoke the LMS process. SCN Msgs (9%) SCN ACK All SCN mess ging Mess ges repl ced with RDMA (3%) OtherC che OtherC che Fusion Fusion Mess ges Mess ges(88%) Message Trafc D str but on Message Trafc D str but on n Earl er Releases n Oracle 21c In the Broadcast-on-Comm t over RDMA mode, the LGWR process d rectly updates the SCN on each remote nstance n the cluster us ng remote atom c operat ons. This m kes the commit protocol f ster s it is not fected by the remote LMS process’s context switch l tency or the CPU lo d on the remote inst nces. Conclus on These re some ex mples of how Or cle RAC lever ges the dv ncements in h rdw re on Ex d t to further optimize Or cle RAC C che Fusion perform nce resulting in dr m tic pplic tion sc l bility improvements without requiring ny pplic tion ch nges. Or cle continues to invest in further innov tions, by engineering the softw re to t ke dv nt ge of the l test h rdw re technologies v il ble in the m rket. References Or cle Re l Applic tion Clusters (RAC) White P per Or cle RAC Intern ls – The C che Fusion Edition Or cle RAC 12c Pr ctic l Perform nce M n gement nd Tuning Or cle RAC fe tures on Ex d t Or cle RAC 12c Rele se 2 – New Av il bility Fe tures 10 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
Connect w th us C ll +1.800.ORACLE1 or visit oracle.com. Outside North Americ , fnd your loc l ofce t: oracle.com/contact. blogs.or cle.com f cebook.com/or cle twiter.com/or cle Copyright © 2021, Or cle nd/or its fli tes. All rights reserved. This document is Or cle nd J v re registered tr dem rks of Or cle nd/or its fli tes. Other n mes m y be provided for inform tion purposes only, nd the contents hereof re subject to tr dem rks of their respective owners. ch nge without notice. This document is not w rr nted to be error-free, nor subject to Intel nd Intel Xeon re tr dem rks or registered tr dem rks of Intel Corpor tion. All SPARC ny other w rr nties or conditions, whether expressed or lly or implied in l w, tr dem rks re used under license nd re tr dem rks or registered tr dem rks of SPARC including implied w rr nties nd conditions of merch nt bility or ftness for Intern tion l, Inc. AMD, Opteron, the AMD logo, nd the AMD Opteron logo re tr dem rks or p rticul r purpose. We specifc lly discl im ny li bility with respect to this document, registered tr dem rks of Adv nced Micro Devices. UNIX is registered tr dem rk of The nd no contr ctu l oblig tions re formed either directly or indirectly by this Open Group. 0120 document. This document m y not be reproduced or tr nsmited in ny form or by ny me ns, electronic or mech nic l, for ny purpose, without our prior writen permission. Authors: Atsushi Morimur , N mr t J mp ni, Anil N ir Contributing Authors: Neil M cn ughton, Avneesh P nt, Mich el Zoll 11 Techn cal Br ef / Or cle Re l Applic tion Clusters (RAC) C che Fusion Perform nce Optimiz tions on Ex d t Copyright © 2021, Or cle nd/or its fli tes / Public
You can also read