Algorithms for VLSI physical design automation 046918
046918 Algorithms for VLSI physical design automation
Partly adapted from: Eduardo Maayan, Rajeev Murgai, Andrew B. Kahng, Jens Lienig, Igor L. Markov, Jin Hu, D. Pan, Shiyan Hu, Rupesh Shelar
VLSI design flow overview
Introduction to algorithms and optimization
Backend CAD optimization problems:
◦ Design partitioning
◦ Technology mapping
◦ Floorplanning
◦ Placement
◦ Clock tree synthesis
◦ Routing
Introduction to layout
Layout optimization and verification
◦ Layout analysis
◦ DRC and LVS checks
◦ Finding objects in the layout
Intro to clock networks
◦ Basic definitions
◦ Clock topologies
  Tree
  Mesh
  Mixed topologies
◦ Design considerations
Zero-skew clock network generation
◦ Means and medians
◦ Recursive geometric matching
Low-power clock tree synthesis
[Figure: VLSI design flow. System Specification → Architectural Design → Functional Design and Logic Design → Circuit Design → Physical Design → Physical Verification and Signoff → Fabrication → Packaging and Testing → Chip. Physical Design steps: Partitioning, Floor Planning, Placement, Clock Tree Synthesis, Signal Routing, Timing Closure. Physical Verification and Signoff checks: DRC, LVS, ERC.]
Unlike other signal nets, clock and power nets are special routing problems
◦ For clock nets, clock skew must be considered as well as delay.
◦ For power nets, current density (IR drop) must be considered.
=> Specialized routers are used for these nets.
Routed by automatic tools for ASICs
Often manually routed and optimized for microprocessors, with help from automatic tools
For synchronous designs, data transfer between functional elements is synchronized by clock signals
Clock signals are generated externally (e.g., by a PLL)
Two types of synchronizing elements:
◦ FLIP-FLOPS
  Edge sensitive: captures data on the switching clock edge and is closed the rest of the time
◦ LATCHES
  Level sensitive: passes data while the clock is at its active level (transparent phase) and holds it while the clock is at the other level (closed phase)
Clock skew is the maximum difference in the arrival time of a clock signal at two different components. Clock skew forces designers to use a large time period between clock pulses. This makes the system slower. So, in addition to other objectives, clock skew should be minimized during clock routing.
What are the main concerns for clock design?
Skew
◦ No. 1 concern for clock networks
◦ For increased clock frequency, skew may contribute over 10% of the system cycle time
Power
◦ Very important, as the clock is a major power consumer
◦ It switches at every clock cycle
Noise
◦ Clock is often a very strong aggressor
◦ May need shielding
Delay
◦ Not really important
◦ But slew rate is important (sharp transition)
Given a source and n sinks, with $t_i$ the clock arrival time at sink i.
Connect all sinks to the source by an interconnect network (tree or non-tree) so as to minimize:
◦ Clock skew $= \max_{i,j} |t_i - t_j|$
◦ Delay $= \max_i t_i$
◦ Total wirelength
◦ Noise and coupling effect
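The skew and delay objectives above can be evaluated directly from a list of sink arrival times. The following is a minimal sketch, assuming arrival times are already known (for example from a timing analysis); the function names and the sample values in picoseconds are illustrative and not part of the course material. It also shows that the pairwise definition of skew equals the latest arrival minus the earliest arrival.

```python
from itertools import combinations

def clock_skew(arrival):
    """Skew as latest arrival minus earliest arrival."""
    return max(arrival) - min(arrival)

def clock_skew_pairwise(arrival):
    """Skew straight from the definition: max over all pairs of |t_i - t_j|."""
    return max((abs(a - b) for a, b in combinations(arrival, 2)), default=0)

def clock_delay(arrival):
    """Delay objective: the latest arrival time over all sinks."""
    return max(arrival)

# Hypothetical arrival times at three sinks, in picoseconds.
arrival_ps = [100, 130, 90]
print(clock_skew(arrival_ps), clock_skew_pairwise(arrival_ps), clock_delay(arrival_ps))
```

The pairwise form costs O(n^2) comparisons, while the max-minus-min form needs a single pass, which is why the latter is the natural choice inside an optimization loop.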
The clock signal is global in nature, so clock nets are usually very big
◦ Significant interconnect capacitance and resistance
So what are the techniques?
◦ Routing
  Clock tree versus clock mesh (non-tree or grid)
  Balance skew and total wire length
◦ Buffer insertion
  Clock buffers to reduce clock skew, delay, and distortion in waveform
◦ Wire sizing
  To further tune the clock tree/mesh
Clock tree: a path from the clock source to each clock sink (flip-flop, FF)
[Figures: a clock tree and an H-tree, each connecting the clock source to the flip-flops]
Non-tree clock networks driving clock sinks or local sub-networks
◦ Spines [Su et al., ICCAD'01]
◦ Applied in Pentium processor [Kurd et al., JSSC'01]
◦ Applied in IBM microprocessor; very effective, huge wire [Restle et al., JSSC'01]
Non-tree = tree + links [Rajaram et al., DAC'04]
How to select link pairs is the key problem
Link = link_capacitors + link_resistor
Key issue: find the links that reduce skew variation the most
[Figure: a link between tree nodes u and w, modeled as a resistor Rl between u and w with a capacitor C/2 at each end]
n x n uniform mesh
Distributed array of k x k buffers drives the mesh.
Buffers driven by global H-tree.
Flip-flops directly connected to the nearest mesh segment
[Figure: clock source, mesh, and flip-flops]
Advantages
◦ Excellent for low skew
◦ Robust to variations
Disadvantages
◦ Higher wiring area, capacitance, power
◦ Difficult to analyze: loops and redundancy
Hybrid Structured Clock Network Construction [Su & Sapatnekar, ICCAD'01]
◦ Hybrid clock topology
  simple top-level global mesh
  zero-skew local trees at bottom
◦ Presents wire sizing scheme to achieve latency and skew reduction
  iterative LP to minimize wire width (area) of top-level mesh, given delay bound
  uses Elmore delay $\tau = G^{-1}C$
  sensitivity-based post-layout clock tree tuning to reduce skew
Best architecture depends on the application
Mesh
-- excellent for low skew, jitter
-- high power, area, capacitance
-- difficult to analyze
-- clock gating not easy
Tree
-- low cost (wiring, power, cap)
-- higher skew, jitter than mesh
-- widely used in ASIC designs
-- clock gating easy to incorporate
Hybrid: tree + cross-links
-- low cost (wiring, power, cap)
-- smaller skew, jitter than tree
-- difficult to analyze
Hybrid: mesh + local trees
-- suitable for coarse mesh
[Figures: each architecture drawn from the clock source down to the flip-flops]
H-tree
Exact zero skew due to the symmetry of the H-tree
Used for top-level clock distribution, not for the entire clock tree
◦ Blockages can spoil the symmetry of an H-tree
◦ Non-uniform sink locations and varying sink capacitances also complicate the design of H-trees
© 2011 Springer Verlag
Method of Means and Medians (MMM)
Can deal with arbitrary locations of clock sinks
Basic idea:
◦ Recursively partition the set of terminals into two subsets of equal size (median)
◦ Connect the center of gravity (COG) of the set to the centers of gravity of the two subsets (the mean)
Method of Means and Medians (MMM)
[Figure, in four panels:]
1. Find the center of gravity
2. Partition S by the median
3. Find the center of gravity for the left and right subsets of S, and connect the center of gravity of S with the centers of gravity of the left and right subsets
4. Final result after recursively performing MMM on each subset
Method of Means and Medians (MMM)
Input: set of sinks S, empty tree T
Output: clock tree T
BASIC_MMM(S,T)
  if (|S| ≤ 1) return
  (x0,y0) = (xc(S),yc(S))      // center of mass for S
  (SA,SB) = PARTITION(S)       // median to determine SA and SB
  (xA,yA) = (xc(SA),yc(SA))    // center of mass for SA
  (xB,yB) = (xc(SB),yc(SB))    // center of mass for SB
  ROUTE(T,x0,y0,xA,yA)         // connect center of mass of S to
  ROUTE(T,x0,y0,xB,yB)         // center of mass of SA and SB
  BASIC_MMM(SA,T)              // recursively route SA
  BASIC_MMM(SB,T)              // recursively route SB
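The MMM pseudocode can be sketched in Python as follows. This is an illustrative sketch, not the course's reference implementation: sinks are assumed to be (x, y) tuples, ROUTE is reduced to recording a segment between two points, and the cut direction alternates between x and y at each level, which is an assumption since the pseudocode only says that the median determines SA and SB.

```python
def center_of_mass(points):
    """Mean x and mean y of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def basic_mmm(sinks, tree=None, cut_x=True):
    """Return the list of routed segments ((x0, y0), (x1, y1))."""
    if tree is None:
        tree = []
    if len(sinks) <= 1:
        return tree
    root = center_of_mass(sinks)
    # PARTITION: sort along the current cut direction and split at the median.
    if cut_x:
        ordered = sorted(sinks, key=lambda p: (p[0], p[1]))
    else:
        ordered = sorted(sinks, key=lambda p: (p[1], p[0]))
    half = len(ordered) // 2
    subset_a, subset_b = ordered[:half], ordered[half:]
    # ROUTE: connect the center of mass of S to those of SA and SB.
    tree.append((root, center_of_mass(subset_a)))
    tree.append((root, center_of_mass(subset_b)))
    # Recurse on both subsets with the other cut direction (assumption).
    basic_mmm(subset_a, tree, not cut_x)
    basic_mmm(subset_b, tree, not cut_x)
    return tree

segments = basic_mmm([(0, 0), (0, 2), (4, 0), (4, 2)])
for seg in segments:
    print(seg)
```

For a subset holding a single sink, the center of mass is the sink itself, so the last segments of each branch end exactly at the sinks. Note that the sketch balances sink counts only; it does not by itself guarantee zero skew.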
Recursive Geometric Matching (RGM)
RGM proceeds in a bottom-up fashion
◦ Compared to MMM, which is a top-down algorithm
Basic idea:
◦ Recursively determine a minimum-cost geometric matching of n sinks
◦ Find a set of n / 2 line segments that match n endpoints and minimize total length (subject to the matching constraint)
◦ After each matching step, a balance or tapping point is found on each matching segment to preserve zero skew to the associated sinks
◦ The set of n / 2 tapping points then forms the input to the next matching step
Recursive Geometric Matching (RGM)
[Figure, in five panels:]
1. Set of n sinks S
2. Min-cost geometric matching
3. Find balance or tapping points (the point that achieves zero skew in the subtree, not always the midpoint)
4. Min-cost geometric matching of the tapping points
5. Final result after recursively performing RGM on each subset
Recursive Geometric Matching (RGM)
Input: set of sinks S, empty tree T
Output: clock tree T
RGM(S,T)
  if (|S| ≤ 1) return
  M = min-cost geometric matching over S
  S' = Ø
  foreach (<Pi,Pj> ∈ M)
    TPi = subtree of T rooted at Pi
    TPj = subtree of T rooted at Pj
    tp = tapping point on (Pi,Pj)   // point that minimizes the skew of
                                    // the tree Ttp = TPi ∪ TPj ∪ (Pi,Pj)
    ADD(S',tp)                      // add tp to S'
    ADD(T,(Pi,Pj))                  // add matching segment (Pi,Pj) to T
  if (|S| % 2 == 1)                 // if |S| is odd, add unmatched node
    ADD(S', unmatched node)
  RGM(S',T)                         // recursively call RGM
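The bottom-up structure of RGM can be sketched in Python under several simplifying assumptions that are not part of the slides: the min-cost geometric matching is replaced by a greedy closest-pair matching (simpler, but not guaranteed minimum cost), wires are straight segments with delay equal to their Euclidean length (a linear delay model), and the recursion is written as a loop. All function names are illustrative.

```python
import math
from itertools import combinations

def greedy_matching(nodes):
    """Greedy stand-in for min-cost matching: take closest free pairs first."""
    free = set(range(len(nodes)))
    candidates = sorted(
        combinations(range(len(nodes)), 2),
        key=lambda ij: math.dist(nodes[ij[0]][0], nodes[ij[1]][0]),
    )
    pairs = []
    for i, j in candidates:
        if i in free and j in free:
            pairs.append((i, j))
            free -= {i, j}
    return pairs, free

def tapping_point(pi, di, pj, dj):
    """Point on segment (pi, pj) balancing linear delays di and dj."""
    length = math.dist(pi, pj)
    a = (length + dj - di) / 2          # distance from pi to the tap
    a = min(max(a, 0.0), length)        # clamp; no wire elongation here
    z = a / length if length > 0 else 0.0
    tp = (pi[0] + z * (pj[0] - pi[0]), pi[1] + z * (pj[1] - pi[1]))
    return tp, max(a + di, length - a + dj)

def rgm(sinks):
    """Return ((root_point, root_delay), segments); needs at least one sink."""
    nodes = [(p, 0.0) for p in sinks]   # (location, delay to its sinks)
    segments = []
    while len(nodes) > 1:
        pairs, free = greedy_matching(nodes)
        next_nodes = []
        for i, j in pairs:
            (pi, di), (pj, dj) = nodes[i], nodes[j]
            segments.append((pi, pj))   # add matching segment to the tree
            next_nodes.append(tapping_point(pi, di, pj, dj))
        next_nodes.extend(nodes[k] for k in free)  # unmatched node, if any
        nodes = next_nodes
    return nodes[0], segments

root, segments = rgm([(0, 0), (2, 0), (0, 10), (2, 10)])
print(root, len(segments))
```

When the clamp in `tapping_point` is active the two sides cannot be balanced on the segment itself, which is the situation the exact zero-skew method addresses by wire elongation.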
Exact Zero Skew
Adopts a bottom-up process of matching subtree roots and merging the corresponding subtrees, similar to RGM
Two important improvements:
◦ Finds exact zero-skew tapping points with respect to the Elmore delay model rather than the linear delay model
◦ Maintains exact delay balance even when two subtrees with very different source-sink delays are matched (by wire elongation)
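Exact Zero Skew
[Figure: subtrees Ts1 and Ts2, rooted at s1 and s2, are joined by a segment that the tapping point tp splits into wire w1 (fraction z, toward s1) and wire w2 (fraction 1 – z, toward s2). Each wire is modeled by its resistance R(w) with half of its capacitance, C(w)/2, at each end. The subtrees contribute load capacitances C(s1), C(s2) and internal delays t(Ts1), t(Ts2). The tapping point tp is where the Elmore delay to the sinks is equalized.]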
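Equating the Elmore delays on both sides of the tapping point gives a closed form for z. The sketch below assumes a uniform wire of length L with resistance α and capacitance β per unit length, so that R(w1) = αzL, C(w1) = βzL, R(w2) = α(1 – z)L, and C(w2) = β(1 – z)L; these symbols and the function names are introduced here for illustration. Setting R(w1)(C(w1)/2 + C(s1)) + t(Ts1) equal to R(w2)(C(w2)/2 + C(s2)) + t(Ts2) and solving for z yields z = (t(Ts2) – t(Ts1) + αL(C(s2) + βL/2)) / (αL(βL + C(s1) + C(s2))).

```python
def zero_skew_tap(t1, t2, c1, c2, length, r_unit, c_unit):
    """Fraction z of the segment, measured from s1, that equalizes Elmore delay.

    t1, t2: delays inside subtrees Ts1, Ts2; c1, c2: their load capacitances.
    r_unit, c_unit: assumed wire resistance and capacitance per unit length.
    """
    numerator = (t2 - t1) + r_unit * length * (c2 + c_unit * length / 2)
    denominator = r_unit * length * (c_unit * length + c1 + c2)
    return numerator / denominator

def elmore_from_tap(wire_len, t_sub, c_sub, r_unit, c_unit):
    """Elmore delay from the tap through a wire of wire_len into a subtree."""
    return r_unit * wire_len * (c_unit * wire_len / 2 + c_sub) + t_sub

# Hypothetical unit values, for illustration only.
z = zero_skew_tap(t1=0.0, t2=1.0, c1=1.0, c2=2.0, length=10.0,
                  r_unit=0.1, c_unit=0.2)
left = elmore_from_tap(z * 10.0, 0.0, 1.0, 0.1, 0.2)
right = elmore_from_tap((1 - z) * 10.0, 1.0, 2.0, 0.1, 0.2)
print(z, left, right)
```

A result outside the range 0 ≤ z ≤ 1 means no point on the segment balances the two subtrees; that is the case in which the wire to the faster subtree has to be elongated.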
Local Clock Capacitance Distribution in a Microprocessor
• Interconnects contribute a major portion of total capacitance
• Clocks are the most active nets in the design
• Minimizing interconnect capacitance in clocks leads to reduction in dynamic power
[Pie chart: Interconnect, Sequentials (42%, 48%); Buffers 10%. Distribution generated from several blocks in a microprocessor.]
Local Clock Network: CTS Solution Space
• Clock network in a processor: distributed as a grid followed by tree
[Figure: global clock distribution from the PLL using multiple spines, feeding tunable grid buffers and the clock grid; regional clock buffers (RCBs) drive local clock buffers (LCBs), which drive the state elements]
CTS (Simplified version)
• Performed after the placement/sizing of sequentials
• Converts logical clock tree into physical one
• Flow employed in several microprocessor designs
[Figure: design flow RTL → Logic Synthesis → Physical Synthesis → CTS → Routing. CTS takes the logical clock tree and the sequentials' (x,y) locations and sizes as input; its steps are Clock Buffer Duplication, Routing Clock Nets, and Sizing Clock Buffers.]
Duplication
• Given a clock buffer, duplicate it to meet delay, slope, RC, skew constraints
  Decides which receivers are driven by the same driver, and hence the clock tree topology
• Applied recursively in reverse topological order
• Driven by clustering or partitioning
  Often intractable when capacity constraints are specified
  Many heuristics available
[Figure: duplication of K-stage buffers driving K-stage receivers]
Effect of Clustering on Capacitance
[Figure: 4 placed sequentials clustered three ways (Solution 1, Solution 2, Solution 3)]
• A cluster implies a clock buffer
• Interconnect capacitance varies significantly for different solutions even with the same number of clusters
Clustering Targeting Power
• Find the clusters such that total local clock power is minimum
– Power in local clock: $P_{\text{Local Clock}} = P_{\text{Dynamic}} + P_{\text{Leakage}}$
– $P_{\text{Dynamic}} = P_{\text{Sequential Cap}} + P_{\text{Buffer Cap}} + P_{\text{Routing Cap}}$
– $P_{\text{Leakage}}$ and $P_{\text{Buffer Cap}}$ can be shown to be proportional to total cap
– $P_{\text{Sequential Cap}}$ is fixed for CTS purposes
– Reducing $P_{\text{Local Clock}}$ is equivalent to minimizing interconnect cap
• Find the clusters such that total interconnect capacitance is minimum
Routing-aware Clustering: Chicken-and-Egg Problem
Routing cap is unknown till the clustering is performed
Clustering cannot be performed till routing cap is known
• Let's assume minimum spanning tree (MST) routing estimates
  Other candidates: metrics we saw in placement lecture
  Correlated with actual clock tree wirelength
  MST possesses submodularity property suitable for greedy optimization
• Can the problem be solved optimally, i.e., can we perform clustering such that the routing cap./overall power is minimum?
• Yes, it can be (if capacity constraints are dropped)
• Given: Set of receivers $S = \{s_1, \dots, s_n\}$, their loads ($c_{s_i}$), and locations ($x_{s_i}, y_{s_i}$)
• Find: A set of clusters, $S_{clusters} = \{c_1, \dots, c_m\}$, such that $\sum_{i=1}^{m} \left(\alpha + \mathrm{MST}(c_i)\right)$ is minimum
• Subject to Constraints (or Design Parameters):
  Maximum # of receivers: due to process, routing, etc.
  Maximum load in a cluster: due to library
  Bounding box width/height: to control RC delay and variations in it
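To make the cost function concrete, the sketch below evaluates the sum of (α + MST(c_i)) for a given clustering. It assumes that MST(c_i) is measured as Manhattan wirelength between receiver locations (a stand-in for routing capacitance) and that α is a constant per-cluster cost; both are illustrative assumptions, as are the function names. The MST is computed with a simple O(n^2) version of Prim's algorithm.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def mst_length(points):
    """Total Manhattan length of a minimum spanning tree over the points."""
    if len(points) < 2:
        return 0
    # best[i]: cheapest connection from point i to the tree grown so far.
    best = {i: manhattan(points[0], points[i]) for i in range(1, len(points))}
    total = 0
    while best:
        k = min(best, key=best.get)
        total += best.pop(k)
        for i in best:
            best[i] = min(best[i], manhattan(points[k], points[i]))
    return total

def clustering_cost(clusters, alpha):
    """Sum over clusters of (alpha + MST length of the cluster)."""
    return sum(alpha + mst_length(c) for c in clusters)

# Four hypothetical receivers on the corners of a 1 x 4 rectangle.
a, b, c, d = (0, 0), (1, 0), (0, 4), (1, 4)
print(clustering_cost([[a, b], [c, d]], alpha=3))  # pairs along the short side
print(clustering_cost([[a, c], [b, d]], alpha=3))  # pairs along the long side
```

Both clusterings use two clusters, yet their costs differ, which is the effect described for solutions with the same number of clusters.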
• Similar to Kruskal's MST construction algorithm
• Steps in algorithm:
  Create complete graph G(S, E, W)
  Assign each edge estimated capacitance as the weight
  Create trivial solution with each cluster containing a receiver
  For each edge, in ascending order of weights
    Merge clusters till the cost function is minimized
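The steps above can be sketched with a union-find structure, as in Kruskal's algorithm. The sketch rests on assumptions that go beyond the slides: edge weights are Manhattan distances used as a proxy for estimated capacitance, the cost is the number of clusters times a constant α plus the accumulated tree weight, a merge is accepted only while the edge weight is below α (merging over an edge of weight w changes this cost by w – α), and only the maximum-receivers constraint is checked. The function name is hypothetical.

```python
from itertools import combinations

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def power_aware_clusters(locations, alpha, max_receivers):
    """Kruskal-like greedy clustering; returns (clusters, total cost)."""
    n = len(locations)
    parent = list(range(n))   # union-find forest, one cluster per receiver
    size = [1] * n            # receivers per cluster (valid at roots)
    wire = [0] * n            # accumulated tree weight per cluster (at roots)

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    # Complete graph, edges in ascending order of weight.
    edges = sorted((manhattan(locations[i], locations[j]), i, j)
                   for i, j in combinations(range(n), 2))
    for w, i, j in edges:
        if w >= alpha:
            break             # further merges would not lower the cost
        a, b = find(i), find(j)
        if a == b or size[a] + size[b] > max_receivers:
            continue          # same cluster, or capacity constraint violated
        parent[b] = a
        size[a] += size[b]
        wire[a] += wire[b] + w

    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), []).append(v)
    groups = sorted(clusters.values())
    cost = alpha * len(groups) + sum(wire[r] for r in clusters)
    return groups, cost

groups, cost = power_aware_clusters([(0, 0), (1, 0), (10, 0), (11, 0)],
                                    alpha=3, max_receivers=3)
print(groups, cost)
```

Sorting all pairs dominates the running time, giving O(n^2 log n) in the number of receivers. Load and bounding-box constraints could be checked at the same place as the receiver count.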
Example 1
[Figure, built up over four slides: receivers grouped into clusters and joined by edges labeled with their weights (recoverable labels: 4, 5, 5, 4, 2)]
• Constraint: maximum # of receivers per cluster is 3
• Power-aware clustering results in clusters with total MST value of 3, which is optimal in this case
• Ensures optimality when no capacity constraints (max. load, # of receivers) are specified
  Reduces to minimum spanning forest problem
• Runs in $O(n^2 \log n)$ time in the number of receivers
  Handles blocks with ~5K sequentials easily
  1.34 seconds for clustering of 1037 sequentials
• Run-times practical and comparable to competitive algorithms
  Clock buffer duplication takes minutes on ~5K sequential blocks