A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug
Yuting Cao, Hao Zheng
Computer Science & Engineering, U. South Florida, Tampa, FL 33620
{cao2, haozheng}@usf.edu

Sandip Ray, Jin Yang
Strategic CAD Lab, Intel, Hillsboro, OR
{sandip.ray, jin.yang}@intel.com

arXiv:2005.02550v1 [cs.AR] 6 May 2020

ABSTRACT
Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on collaborative insights of the architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct design behavior from partially observed silicon traces in the presence of ambiguous and noisy data. The output of the approach is a set of all potential interpretations of a system's internal executions abstracted to the system-level protocols. To support the trace analysis approach, a companion trace signal selection framework guided by system-level protocols is also presented, and its impacts on the complexity and accuracy of the analysis approach are discussed. The approach and the framework have been evaluated on a multi-core system-on-chip prototype that implements a set of common industrial system-level protocols.

1. INTRODUCTION
Post-silicon validation makes use of pre-production silicon integrated circuits (ICs) to ensure that the fabricated system works as desired under actual operating conditions with real software. It is a critical component of the design validation life-cycle for modern microprocessors and system-on-chip (SoC) designs. Unfortunately, it is also highly complex, performed under aggressive schedules, and accounts for more than 50% of the overall design validation cost [12]. An SoC design is often composed of a large number of pre-designed hardware or software blocks (often referred to as "intellectual properties" or "IPs") that coordinate through complex protocols to implement system-level behavior [6]. An execution trace of a system typically involves activities from the CPU, audio controller, display controller, wireless radio antenna, etc., reflecting the interleaved execution of a potentially large number of communication protocols. As SoCs integrate more IPs, the interactions among the IPs become increasingly complex. Moreover, modern interconnects are highly concurrent, allowing multiple transactions to be processed simultaneously for scalability and performance. They are an important source of design errors. On the other hand, observability limitations allow only a small number of participating signals to be actually traced during silicon execution. Furthermore, electrical perturbations cause silicon data to be noisy, lossy, and ambiguous. It is non-trivial during post-silicon debug to identify all participating protocols and pinpoint the interleavings that result in an observed trace.

Previous work [15] proposed a method for correlating silicon traces with system-level protocol specifications. The idea was to reconstruct protocol execution scenarios from a partially observed silicon trace, which provide abstract views of system internal executions to facilitate post-silicon SoC debug. While that work showed promising results, it has a number of deficiencies precluding its applicability in practice. First, there was no way to qualify or rank the quality of protocol execution scenarios generated by the reconstruction procedure. Under poor observability conditions, it was possible for the algorithm to generate hundreds or thousands of potential protocol execution scenarios consistent with a partially observed trace. Without a metric to rank the quality of these reconstructions, the debugger is faced with the unenviable task of wading through these potential scenarios to infer what may actually have happened in a specific silicon execution. Moreover, based on past experience, interleavings of different protocol executions are a major source of functional bugs. Since the method developed in [15] does not capture orderings among different protocol executions, the results obtained with that method offer little help for bug localization and root causing.

This paper addresses the above deficiencies by introducing an optimized trace analysis approach. Central to this optimized approach is a new formulation of protocol execution scenarios that comprehends ordering relations among protocol executions. Quantitative metrics are also developed so that the quality of the derived results and the efficiency of the analysis approach can be measured. Trace signal selections can have great impacts on the complexity and accuracy of the trace analysis. Therefore, a companion trace signal selection framework is proposed. This framework is communication-centric and guided by system-level protocols. Its objective is to help the trace analysis produce high quality interpretations of observed silicon traces efficiently. Various trace signal selection strategies are evaluated and analyzed based on their impacts on the trace analysis approach applied to a non-trivial multi-core SoC model that implements a number of common industrial system-level protocols.

2. FLOW SPECIFICATION
An SoC model as shown in Figure 1 is used to illustrate and experiment with the work described in this paper. It consists of two CPUs (CPU_X), each with a private data cache (Cache_X), a graphics engine (GFX), a power management unit (PMU), a system memory, and three peripheral blocks:
an audio control unit (Audio), a UART controller (UART), and a USB controller (USB). All these blocks are connected through an interconnect fabric (Bus).

[Figure 1: The block diagram of the simple SoC model. Blocks shown: CPU 0, CPU 1, Cache 0, Cache 1, GFX, PMU, Audio, USB, UART, Memory, Bus.]

System operations are realized by executions performed in various blocks that are coordinated by system-level protocols. These protocols are typically specified in architecture documents as message flow diagrams, where the words "protocol" and "flow" are used interchangeably. In this paper, as in [15], system flows are formalized using Labeled Petri-nets (LPNs). Figure 2 shows a memory write protocol initiated from a CPU CPU_X in LPN, where X ∈ {0, 1} and X' = 1 − X.

[Figure 2: LPN formalization of a CPU write protocol. Each LPN transition is labeled with an event (src, dest, cmd) where cmd is a command sent from a source component src to a destination component dest. The places without outgoing edges are terminals, which indicate termination of protocols represented by the LPNs. Places shown: p1 to p9. Transition labels recoverable from the figure:
  t1: (CPU_X : Cache_X : wr req)
  t2: (Cache_X : Cache_X' : snp wr req)
  t3: (Cache_X' : Cache_X : snp wr resp)
  t4: (Cache_X : Bus : wr req)
  t5: (Bus : Mem : rd req)
  t6: (Mem : Bus : rd resp)
  t7: (Bus : Cache_X : wr resp)
  t8: (Cache_X : CPU_X : wr resp)
  t9: (Cache_X : CPU_X : wr resp)
  t10: (Cache_X : CPU_X : wr resp)]

An LPN is a tuple (P, T, E, L, s_0) where P is a finite set of places, T is a finite set of transitions, E is a finite set of events, and L : T → E is a labeling function that maps each transition t ∈ T to an event e ∈ E. For each transition t ∈ T, its preset, denoted as •t ⊆ P, is the set of places connected to t, and its postset, denoted as t• ⊆ P, is the set of places that t is connected to. A state s ⊆ P of an LPN is a subset of places marked with tokens. There are two special states associated with each LPN: s_0 ⊆ P, which is the set of initially marked places, also referred to as the initial state, and the end state s_end, which is the set of places not going to any transitions.

A transition t can be executed in a state s if •t ⊆ s. Executing t causes the labeled event to be emitted, and leads to a new state s' = (s − •t) ∪ t•. Therefore, executing an LPN leads to a sequence of events. Execution of an LPN completes if its s_end is reached.

For example, in Figure 2, t1 can be executed in s_0 = {p1}. Event (CPU_X : Cache_X : wr req) is emitted after t1 is executed, and the LPN state becomes {p2}. The end state is s_end = {p9}.

A flow specification may also contain multiple branches describing different ways a system can execute such a flow. For example, the flow shown in Figure 2 has three branches covering the cases where the cache (snoop) operation is a hit or a miss.

3. POST-SILICON TRACE ANALYSIS

3.1 Previous Work
This section recaps the previous approach in [15]. The objective of the trace analysis is to reconstruct design internal behavior with respect to given system-level flow specifications F from a partially observed silicon trace on a small number of hardware signals. The off-chip analysis includes two broad phases: (1) trace abstraction, which maps a silicon trace into a sequence of flow events, i.e., higher-level architectural constructs such as messages and operations, and (2) trace interpretation, which infers possible flow execution scenarios that are compliant with the abstracted event sequence.

To illustrate the basic idea, consider the system flow in Figure 2, which we call F_1. Suppose that the following flow execution trace is abstracted from an observed silicon trace by executing a design that implements F_1.

    t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t6 ...

Here the flow events are referred to by their transition names in the LPN. The first four events result in the following flow execution scenario

    {(F_{1,1}, {p3}), (F_{1,2}, {p3})}.    (1)

A flow execution scenario is defined as a set of flow instances and their respective current states after some events are processed [15]. It can be viewed as an abstraction of system states with respect to system flows. The above execution scenario indicates that the sequence of the first four events is a result of executing those two flow instances of F_1 from their initial states to the shown states. For the first event t3, it may be a result of executing F_{1,1} or F_{1,2}, but exactly which one is unknown due to limited observability. Both possible cases are considered, and two execution scenarios below are
derived as a result of interpreting t3.

    {(F_{1,1}, {p4}), (F_{1,2}, {p3})}
    {(F_{1,1}, {p3}), (F_{1,2}, {p4})}.    (2)

After handling the next event t3, the above two execution scenarios are reduced to the one shown below.

    {(F_{1,1}, {p4}), (F_{1,2}, {p4})}.

After the remaining six events are handled, the following execution scenario is derived.

    {(F_{1,1}, {p7}), (F_{1,2}, {p7})}

As another example, now suppose that the design with a bug generates the flow trace below.

    t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t11 ...

This sequence is almost the same as the previous one except that the last event is t11 : (Cache_X : CPU_X : rd_resp) instead of t6 : (Mem : Bus : rd_resp) in the previous trace. t11 is an event used in a different flow specification describing a CPU memory read protocol. Analyzing the trace right before t11 leads to the execution scenarios below.

    {(F_{1,1}, {p7}), (F_{1,2}, {p6})}
    {(F_{1,1}, {p6}), (F_{1,2}, {p7})}.    (3)

However, t11 cannot be a result of executing either flow instance in either scenario, which indicates a noncompliance of the design implementation with respect to the given flow specification. Such an event is referred to as being inconsistent. In this case, the algorithm halts, and returns t11 and the derived flow execution scenarios as shown in (3) for the debugger to examine further.

3.2 Flow Execution Scenarios
The trace analysis approach in [15] does not capture orderings among flow instances for execution scenarios. However, from a debugger's point of view, communication protocols can be related. For example, a firmware loading protocol always happens before a firmware execution protocol. If a firmware execution protocol is found to happen before a firmware loading protocol, that possibly indicates an error in the system implementing such protocols. Such properties cannot be checked by the previous approach.

To address that problem, this paper presents a new definition of flow execution scenarios as

    {(F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j}) | F_i ∈ F}

where start_{i,j} and end_{i,j} are two indices representing the relative time when F_{i,j} is initiated and completed. The ordering relations can be derived by comparing their start and end indices. For example, for two flow instances in an execution scenario, (F_{u,v}, s_{u,v}, start_{u,v}, end_{u,v}) and (F_{x,y}, s_{x,y}, start_{x,y}, end_{x,y}), F_{u,v} is initiated before F_{x,y} if start_{u,v} < start_{x,y}, or F_{x,y} is initiated after F_{u,v} is completed if end_{u,v} < start_{x,y}. The ordering relations can provide more accurate information for understanding system execution under limited observability. Section 3.3 explains how start_{i,j} and end_{i,j} are decided during the trace analysis.

In order to support the new definition of flow execution scenarios, the trace abstraction, which maps an observed silicon trace to a linear sequence of flow events as in [15], is also generalized. An SoC design can be viewed as a group of IP blocks networked by an on-chip interconnect fabric. These blocks communicate with each other through communication links, each of which implements a protocol, such as ARM AXI, over a set of wires. The approach presented in this paper is communication-centric in that it works on silicon traces on a selected number of wires of a selected number of communication links for observation. Suppose that there are n communication links, and some wires from each link are selected for observation. A silicon trace is assumed to be a sequence α_0, α_1, ... such that each α_i is a vector defined as

    α_i = ⟨α_{0,i}, ..., α_{n,i}⟩

where α_{k,i} is a state on link k in step i.

If all wires of a link are observable, then a state on that link can be uniquely mapped to a flow event of the same link. Under limited observability, a state on a link is typically mapped to a set of flow events. Therefore, a silicon trace is abstracted to a sequence E⃗_0, E⃗_1, ... where

    E⃗_i = ⟨E_{0,i}, ..., E_{n,i}⟩    (4)

is a vector of sets of flow events abstracted from α_i, and each E_{k,i} in E⃗_i is a set of flow events abstracted from state α_{k,i} in α_i. No temporal orderings exist among the events in E⃗_i. On the other hand, for two events e_i ∈ E⃗_i and e_j ∈ E⃗_j such that i < j, e_i happens before e_j.

Based on different levels of information captured, this paper classifies flow execution scenarios as follows.

• Type-1 execution scenarios capture the number of instances of each flow specification initiated from a silicon trace, and the relative orderings of their initiations.
• Type-2 execution scenarios, on top of what is captured by Type-1 scenarios, capture completion of each flow instance. This additional information can be used to identify potential problems if there is any flow instance that is not completed. Furthermore, Type-2 execution scenarios capture the relative orderings among all flow instances as described above.
• Type-3 execution scenarios, on top of what is captured by Type-2 scenarios, capture information on execution paths followed by individual flow instances. This information gives debuggers a means to examine in detail how each flow instance is executed.

These different execution scenarios can be used to provide different views of system execution, from coarse-grained to more detailed ones, at different stages of debug.

3.3 Algorithms
Algorithm 1 shows the top-level procedure that detects internal flow executions based on a partially observed silicon trace and checks the compliance with respect to a given flow specification. It takes as inputs F, a set of system-level flow specifications, and a signal trace ρ, which is assumed to be a sequence of states on a set of observable trace signals, where each state is uniquely indexed starting from 0.

This algorithm scans trace ρ starting from index h initialized to 0, extracts all possible flow events from ρ at index h as described in Section 3.2 (line 6), and maps each of those extracted flow events to update already detected execution scenarios (line 11). The algorithm terminates if one of two conditions holds. If an inconsistency is encountered, the set
of detected partial execution scenarios along with two indices h and i are returned (line 16). Index h provides temporal information on when the inconsistency occurs, while i provides spatial information on which communication link an inconsistent event is transmitted. If no inconsistency is found, the set of all execution scenarios compliant with the observed trace is returned (line 18) once index h reaches the length of the trace.

Algorithm 1: Check-Compliance(F, ρ)
 1  /* F: a set of flow specifications */
 2  /* ρ: a partially observed silicon trace */
 3  M ← {∅}
 4  h ← 0
 5  while h < |ρ| do
 6      E⃗ ← abstract(ρ, h)
 7      foreach E_i of E⃗ do
 8          inconsistent ← true
 9          foreach e ∈ E_i do
10              foreach scen ∈ M do
11                  Scens ← analysis(F, scen, e, h)
12                  if Scens ≠ ∅ then
13                      inconsistent ← false
14                      M ← (M − {scen}) ∪ Scens
15          if inconsistent = true then
16              return (M, h, i)
17      h ← h + 1
18  return (M, −1, −1)

Algorithm 2: Analysis(F, scen, e, h)
 1  /* scen = {(F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j})} */
 2  /* e is a flow event abstracted from silicon trace at index h */
 3  R ← ∅
 4  /* Check if e can change the state of any existing flow instance of scen */
 5  foreach (F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j}) ∈ scen do
 6      s'_{i,j} ← accept(F_{i,j}, s_{i,j}, e)
 7      if s'_{i,j} ≠ ∅ then
 8          Let scen' be a copy of scen
 9          Replace s_{i,j} of scen' with s'_{i,j}
10          if s'_{i,j} = F_i.s_end then
11              Update end_{i,j} of scen' with h
12          R ← R ∪ {scen'}
13  /* Check if e can extend scen by initiating new flow instances */
14  foreach F_i ∈ F do
15      create a new instance F_{i,h}
16      s'_{i,h} ← accept(F_{i,h}, F_i.s_0, e)
17      if s'_{i,h} ≠ ∅ then
18          Let scen' be a copy of scen
19          scen' ← scen' ∪ {(F_{i,h}, s'_{i,h}, h, −1)}
20          R ← R ∪ {scen'}
21  return R

Algorithm 2 takes the specification F, an execution scenario scen, a flow event e, and index h of the trace where e is extracted, and it produces a set of execution scenarios R consistent with e. This algorithm performs two tasks. In the first task (lines 5-12), the algorithm checks every flow instance to decide if e can be accepted. If such an instance is found (line 7), then it is updated with the new state as the result of e (line 9). Furthermore, if e causes the flow instance to complete, its index end_{i,j} is set to h (lines 10-11), indicating the completion of that instance due to event e at step h of the trace. In the second task, all possibilities where e can initiate a new flow instance are considered (lines 14-20). If a new instance can be initiated, its start_{i,j} is set to h, indicating the initiation of that instance due to a signal event at step h of the trace.

3.4 On the Complexity and Accuracy
Due to the limited observability, reconstructing system-level executions from an observed silicon trace is an imprecise process. The large number of execution scenarios typically derived during the analysis takes large amounts of runtime and memory to process and store, making the analysis less efficient. This is referred to as the complexity problem of the trace analysis. After the analysis is done, a large number of derived execution scenarios makes the analysis results difficult to understand, and thus less helpful for debugging. A single flow execution scenario derived at the end of the trace analysis provides much more precise information for debug than ten candidate flow execution scenarios. This is referred to as the accuracy problem of the trace analysis.

The contributing factors to the complexity and accuracy problems are explained below.

1. A signal event mapped to a set of flow events: Due to the limited observability, a signal event of an observed silicon trace is often interpreted as a number of different flow events, which typically leads to derivation of a number of different execution scenarios. This situation is exacerbated by the fact that silicon traces are often very long, which could lead to excessively large numbers of possible execution scenarios derived during or at the end of the analysis.
2. A flow event mapped to different temporal flow instances: Temporal flow instances refer to the flow instances activated by the same component, e.g. read/write flows activated by CPU_0. If several temporal instances of some flows are activated by a component, mapping flow events to those flow instances can be ambiguous. For example, suppose that an execution scenario includes two instances of the flow shown in Figure 2 activated by CPU_0, one in state {p2}, and the other one in state {p8}. An instance of flow event (Cache_0 : CPU_0 : wr resp) can be mapped to either flow instance, leading to two new execution scenarios from the current one.
3. A flow event mapped to flow instances activated by different components: This situation can happen when flow instances that share some common events are activated by different components. For example, suppose an execution scenario has two instances of the flow shown in Figure 2, one activated by CPU_0 and the other one by CPU_1, and both are in state {p6}. A flow event (Mem : Bus : rd resp) can be mapped to either one of these two instances, leading to two new execution scenarios derived from the current one.

The above issues can be mitigated by good signal selections, to be discussed in the following section. In order to evaluate the impacts of different trace signal selections on the complexity and accuracy of the trace analysis, this paper introduces two quantitative metrics. The complexity is measured by the peak count of flow execution scenarios encountered during the analysis process, i.e., the largest size of M encountered during the execution of Algorithm 1. The accuracy is measured by the final count of flow execution scenarios derived at the end of the analysis process, i.e., the size of M returned on either line 16 or 18 of Algorithm 1.

[Figure 3: A framework of trace signal selection. Blocks shown: Flow Specification F, System-Level Selection, Sets of Flow Events, RTL Model, Bit-Level Selection, Trace Signals.]
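The procedure of Section 3.3 and the two metrics of Section 3.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names Flow, accept, analysis, check_compliance, and linear_flow are hypothetical, the flow F1 is simplified to a single linear chain p1 to p9 (the branches of Figure 2 are ignored), and the sketch assumes that the event sets of a link are alternatives, that a scenario unable to explain any candidate event is dropped, that an empty event set means an idle link, and that at most one instance of a flow starts per trace index.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    name: str
    transitions: tuple   # ((event, preset, postset), ...), places as frozensets
    s0: frozenset        # initial state
    s_end: frozenset     # end state

def accept(flow, state, event):
    """Fire the transition labeled `event` if it is enabled in `state`, else None."""
    for label, pre, post in flow.transitions:
        if label == event and pre <= state:
            return (state - pre) | post          # s' = (s - pre(t)) | post(t)
    return None

def analysis(flows, scen, event, h):
    """Sketch of Algorithm 2: scenarios obtained from `scen` by consuming `event`."""
    out = set()
    for inst in scen:                            # task 1: advance an existing instance
        flow, state, start, end = inst
        nxt = accept(flow, state, event)
        if nxt is not None:
            done = h if nxt == flow.s_end else end
            out.add((scen - {inst}) | {(flow, nxt, start, done)})
    for flow in flows:                           # task 2: initiate a new instance
        nxt = accept(flow, flow.s0, event)
        if nxt is not None:
            done = h if nxt == flow.s_end else -1
            out.add(scen | {(flow, nxt, h, done)})
    return out

def check_compliance(flows, trace):
    """Sketch of Algorithm 1. trace[h] is a list of event sets, one per link.
    Returns (M, h, i, peak); h = i = -1 means no inconsistency was found."""
    M = {frozenset()}                            # start from one empty scenario
    peak = 1                                     # complexity metric: peak |M|
    for h, step in enumerate(trace):
        for i, candidates in enumerate(step):
            if not candidates:                   # assumption: idle link
                continue
            nxt = set()
            for e in candidates:
                for scen in M:
                    nxt |= analysis(flows, scen, e, h)
            if not nxt:                          # inconsistent event on link i
                return M, h, i, peak
            M = nxt
            peak = max(peak, len(M))
    return M, -1, -1, peak                       # accuracy metric: len(M)

def linear_flow(name, n):
    """Hypothetical chain p1 -t1-> p2 ... -tn-> p(n+1)."""
    ts = tuple((f"t{k}", frozenset({f"p{k}"}), frozenset({f"p{k + 1}"}))
               for k in range(1, n + 1))
    return Flow(name, ts, frozenset({"p1"}), frozenset({f"p{n + 1}"}))

F1 = linear_flow("F1", 8)
events = "t1 t2 t1 t2 t3 t3 t4 t4 t5 t6 t5 t6".split()
trace = [[{e}] for e in events]                  # one fully observed link
M, h, i, peak = check_compliance([F1], trace)
print(len(M), peak, h, i)
```

On the example trace of Section 3.1 the sketch ends with a single scenario in which both instances are in state {p7}, with a peak of two scenarios, and the start indices 0 and 2 record that the first instance was initiated before the second. Replacing the last event with t11 makes the sketch stop at index 11 on link 0 and return the two scenarios of (3).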
red during the analysis process, i.e., the largest size of 4.4. TRACE TRACESIGNAL SIGNALSELECTION ountered during the execution of Algorithm 1. The SELECTION Trace Tracesignal signalselection selectionisisaacritical criticalstep stepininpost-silicon post-silicon de- de- cy is measured Figure 3: AA framework framework of trace trace signal signal selection. selection. bug.by bug. the ItItincludesfinal includes twocount two different di↵erentofefforts: flow execution e↵orts: pre-silicon pre-siliconandand post- post- Bit-Level Figure 3: Selection of os derived at the end silicon. silicon.During of the During analysis pre-silicon pre-silicon process, selection, selection, aafew i.e., few the signals thousand thousand signals M returnedamongon either among a avast line vastnumber number16ofof orinternal 18 ofsignals internal Algorithm signalsare 1. for aretapped tapped forobser- obser- vation. vation.All Allnecessary necessarysignals signalsmust mustbe beselected selectedat atthis thisstage, stage, Trace Signals otherwise,expensive otherwise, expensivere-design re-designalong alongwith withsilicon siliconre-spinre-spinare are TRACE SIGNAL SELECTION required.During required. Duringpost-silicon post-silicondebug, debug,aasmall smallsubset subsetofofthose those tappedis tapped e signal selection signals signals arerouted are a critical routed steptoto the inthe chipinterface chip post-silicon interface de- for tracing for tracing duringsystem during systemexecution. execution. Figure 3: A framework of trace signal selection. t includes twoPrevious di↵erent Previous work e↵orts: work suchasas[4] such pre-silicon [4]isistypically typically and post- applied applied totogate gatelevel level (a) (b) During pre-silicon designmodels, design selection, models, andandthethe aquality few thousand quality ofofthe theresults signals results isisevaluated evaluatedby by a vast number theof the internalused commonly commonly signals used state state are tapped ratio. 
restoration restoration for obser- ratio. However, itit isis However, difficult difficult totoscale scale those those methodstoto large and complex SoC SoC Figure 4: Examples of flow structures. All necessary signals must be methods selected atlarge thisand stage,complex designs.More designs. Moreimportantly, importantly,signals signalsselected selectedat atthe thegategatelevel level ise, expensivearere-design oftenirrelevant along with irrelevanttotosystem-level silicon re-spin system-level functionalities. are functionalities. There There are often d. During post-silicon attemptdebug, isisananattempt raisea totoraise small the the subsetlevel abstraction abstraction of those level fortrace for trace signal signal • In Figure 4(b), branches split and then join, and the signals areselection routedtoto selection to the the chiptransfer register interface transfer for level(RTL) (RTL) tracing guidedby byasser- asser- flow ends with a common event. In this case, an unique the register level guided tions [11], however that work does not consider system level event needs to be selected for each branch. system execution. tions [11], however that work does not consider system level functionalitieseither. either.InIn[3], [3],aasystem systemlevel levelprotocol protocolguided guided functionalities ious work such as [4] is typically applied to gate level The flow shown (a) in (a) (b) branches with a (b) approachisisproposed. approach models, andare the arebased quality basedon proposed.ItItisissimilar of onsystem the systemlevel results similartotoour levelprotocols.is ourwork evaluated protocols. However, workininthat However, theby (a) thatbothboth the selec- selec- (b) Figure 2 has three structure similar to Figure 4(b). Its start and end events mmonly used state restoration ratio. are {t1 , t8 , t9 , t10 }. 
Note that t8 , t9 , and t10 actually refer to tion tion techniques techniques developed developed [3]However, inin[3] are simple it are simple and and isirrelevant irrelevant Figure 4: Examples of flow structures. the same event. Figure 4: Examples There is no of choiceflow for structures. the right branch as t to scale those methods to totounderstanding understanding large silicon silicon and traces traces atatcomplex systemSoC thesystem the level,and level, andthe the t10 must be selected. To identify the left two branches from evaluationwas . More importantly, evaluation wasperformed signals performed selected ononan anabstract at abstract the transaction gatetransaction level level level the right one, either t2 or t3 needs to be selected. Similarly, model. model. • In Figure 4(b), branches split and then join, and the en irrelevant to This system-level section introduces functionalities. a framework There shown in Figure 3 for one of events in {t4 , t5 , t6 , t7 } needs to be selected in order flow ends with a commonsplit event.and In this case, an unique This section introduces a framework shown in Figure 3 for • IntoFigure identify 4(b), the leftbranches then one. join, and the ttempt to raise trace on to the register totothe the tracesignal signal thetransfer pagelimit, page Figure 4: Examples of flow structures. abstraction selectionguided selection level limit, this(RTL) this level guided paperonly paper byfor by guided trace signal system-level system-level onlyconsiders by asser- considers protocols. Due protocols. the pre-silicon the Due pre-silicon flow ends all possible withevent a branch from event needs to be selected for each branch. common selections event. for the middle that flow In this Therefore, are case, an unique tracesignal 11], howevertrace that signalselection. work selection. 
does notSince Since consider the pre-silicon system selection level the pre-silicon selection needs needs eventTheneedsflow{tshown to beinselected Figure 2for haseachthreebranch. 1 , t8 , t9 , t10 } ⇥ {t2 , t3 } ⇥ {t4 , t5 , t6 , t7 }. branches with a to support all types of execution scenarios, it is sufficient to structure similar to Figure 4(b). Its start and end events to support nalities either. In [3], allatypes system of execution level protocol scenarios,guidedit is sufficient to Among consideronly onlyType-3 Type-3scenarios scenariosasasthey theysupersede supersedeType-1 Type-1or orThe are flow t8 , tpossibly {t1 ,shown 9 , t10in }. Note largethat Figure number t9of 2t8 ,has , andevent three t10 selections, actually branches referthereto a with consider ch is proposed. It is similar to our work in that both are two types of events that have interesting characteris- the same event. There is no choice for the right branch as -2-2scenarios. scenarios. structure similar to includes Figure 4(b). that Its are start andtoend events sed on system level protocols. However, the selec- ttics. 10 must One betypeselected. events To identify the left unique two branches specific from 4.1 System 4.1 System in [3]LevelLevel Selection Selection are {t1the ,flows t8 ,right t9(ref. , t10 }.unique Noteevents),t2that or t3tneeds , t9 , to and t10 actually refer to chniques developed erstanding silicon • In Figure 4(b), branches split and then join, and the During Duringtraces the system the system are simple at the level level and irrelevant selection, system selection, level,di↵erent subsets andsubsets different the of flow of the same flow one event. events of events section one, shared considers either There 4 , tis inby{tmultiple their 5 , tno impacts 8while 7choice 6 , tflows } needs the (ref.for on the beother toshared be selected. the typeSimilarly, right events). 
selected includes inbranch This as order eventsfor forobservation observationare areselected selectedfrom fromgivengivenflow flowspecifica- specifica-t10 mustto identify be selected. the leftTo branch identify from the the leftcomplexity middle two one.branches and ac-from Therefore, events tion was performed tions. flow ends with a common event. In this case, an unique on those Then, an abstract results are transaction passed tions. Then, those results are passed to the more refined bit levelselection. selection. to the levelrefined more bit the right curacy all possible theone, of the event either and complexity trace t2 or analysisforand selections t3 needs accuracy that of the the flow to trace signal are be selected. selection. For analysis, only Similarly, is- level one of events in{t {t , t , , tt , , t t }, ×t } {t needs , t } × to {t , be t , t selected , t }. in order section introduces To allallflow ignal selection guided event needs to be selected for each branch. Tosupport a framework support Type-3scenarios, Type-3 flowspecifications specifications by system-level shownthe scenarios, mustbe must in theFigure beselected. startand start and selected. IfIfaaDue protocols. 3 endfor endevents eventsof flowspecifica- flow specifica- of sues to identify Issue Among #2 the #1 is and 1 left #3 possibly 4 in5 section 8 9 branch relevant 106 7 3.4 are considered in this section. to the large from 2 number 3 theof middle bit level 4 selection. 5 6 event selections, 7 one. Therefore, there tion has multiple branches, additional events may need all to possible are It is two important eventtypes selections of to events select for that events that have that flow can interesting are be mapped to characteris- page limit, tion has thisselected papermultipleonly branches, considers additional the events may need to pre-silicon smallest Onenumber of flow instances during the trace to analysis bebeselected ignal selection. 
Since soso the thatthe that thebranch pre-silicon branchfollowed followedby selection byaaflow needs flow instance instance tics. in order to type includes reduce the events that complexity are unique and improve the specific accu- duringsystem systemexecution executioncan canbe becaptured. captured. Figure Figure 44 shows shows flows{t(ref. t9 , t10 }events), 1 , t8 , unique ⇥ {t2 , twhile 3 } ⇥the {t4 ,othert5 , t6type , t7 }.includes The flow shown in Figure 2 has three branches with a port all types during two ofexamples two execution examples scenarios, ofofdifferent di↵erent it is sufficient branching branching structuresfor structures to forflows. flows. Among racy. Refer to the flow shown events shared by multiple flows (ref. shared events). This t3 arepossibly used only largein that number in flow. Duringof Figure event 2. Events the selections, t trace analysis, 2 and there er only Type-3 scenarios section considers their impacts on the complexity and ac- • In Figure as they eachsupersede branch endsType-1 or arios.structure similar to Figure 4(b). Its start and end events • In Figure 4(a), 4(a), each branch ends with an unique with There is no need to select additional events as observ- an unique event. event. are two they types curacy tics. One are just of of the mappedthat events trace to the analysis have and instances flow. Issue #2 can be addressed in that instances of di↵er- typeinitiated includes the interesting signal of that particular characteris- selection. For by events the samethat are uniqueanalysis, to are ignoredspecific There is no need to select additional events as observ- the complexity and accuracy of the trace only is- ing di↵erent end events can clearly identify the branch ent flows component for SystemareLevel{t1Selection , t8 , t9 , t10 }. Note that followed during system execution. followed during system execution. 
t8 , t9 , and t10 actually refer to ing different end events can clearly identify the branch flows sues #2unique (ref. and #3 events), in section 3.4 are considered those events. Issue #3 is addressed in that instances ofincludes Issue #1 is relevant to the bit level selection. while the other in this type section. any ng the system level selection, di↵erent subsets of flow events shared by multiple flows (ref. shared events). This the same for observation event. are selected There from given flow specifica- is no sectionchoice for considers their the impacts on right branch the complexity and ac- as curacy of the trace analysis and the signal selection. For Then, those results are passed to the more refined bit t10 must be selected. To identify lection. the complexity sues #2 and the #3 in left3.4two and accuracy section are branches of the trace analysis, only from considered in this is- section. upport Type-3 scenarios, the start and end events of the right w specifications one, either must be selected. t or tIssueneeds If a flow specifica- totobe #1 is relevant selected. the bit level selection. Similarly,
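The event selections for the flow in Figure 2 can be enumerated mechanically. The following is a minimal sketch, assuming the product notation means "all start and end events, plus one event from each branch-identifying group"; that reading, the names MANDATORY and BRANCH_CHOICES, and the helper enumerate_selections are illustrative assumptions and not part of the original framework.

```python
from itertools import product

# Start and end events of the flow in Figure 2; these are always selected.
MANDATORY = ("t1", "t8", "t9", "t10")

# One event must be picked from each group to tell the branches apart
# (grouping taken from the text; the variable names are hypothetical).
BRANCH_CHOICES = [
    ("t2", "t3"),              # separates the two left branches from the right one
    ("t4", "t5", "t6", "t7"),  # separates the left branch from the middle one
]


def enumerate_selections(mandatory, branch_choices):
    """Return every event selection: all mandatory events plus one pick per group."""
    return [
        frozenset(mandatory) | frozenset(picks)
        for picks in product(*branch_choices)
    ]


def event_index(name):
    """Sort key: the numeric suffix of an event name such as 't10'."""
    return int(name[1:])


selections = enumerate_selections(MANDATORY, BRANCH_CHOICES)
for selection in selections:
    print(sorted(selection, key=event_index))
```

Under this reading, the number of candidate selections is the product of the group sizes, so it grows quickly with the number of branching points in a flow, which is why the choice between unique and shared events matters.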
It is important to select events that can be mapped to the smallest number of flow instances during the trace analysis in order to reduce the complexity and improve the accuracy. Refer to the flow shown in Figure 2. Events t2 and t3 are used only in that flow. During the trace analysis, they are mapped only to the instances of that particular flow. Issue #2 can be addressed in that instances of different flows initiated by the same component are ignored for those events. Issue #3 is addressed in that instances of any flows initiated by different components are ignored for them. t4 and t7 have a similar characteristic.

On the other hand, events t5 and t6 are used in many different flows of different components. Those flows can be read/write flows of CPU_0 or CPU_1. During the trace analysis, if there are multiple instances of such flows, it is impossible to know which of those flow instances caused those events to be generated. Therefore, the analysis algorithm has to map those events to those flow instances in all possible ways. That can have a large negative impact on the complexity and accuracy of the trace analysis.

In terms of trace signal selection, those two types of events can lead to different results. If unique events are selected, then the total number of events selected can be large, and as a result, a large number of trace signals need to be selected in order to observe those events. On the other hand, the total number of events can be smaller if shared events are selected. That leads to a smaller number of trace signals that need to be selected. The negative impacts of selecting shared events can also be mitigated if certain implementation details are available. The next section discusses that point further.

4.2 Bit Level Selection

The bit level selection takes as inputs the set of event selections produced in the previous step and an RTL model that implements the system flow specifications, and performs two tasks for each event selection:

1. Evaluate its quality with respect to the three issues discussed in Section 3.4;

2. Choose one selection, and generate a set of candidate trace signals that implement the selected events.

The ultimate goal of the bit level selection is to produce a reduced set of candidate trace signals optimized for the trace analysis approach. Since the bit level selection depends on implementation specifics, this section can only discuss some general guidelines and tradeoffs. Note that flow specifications are typically independent of memory address and data information. Therefore, the address and data bits included in event implementations can generally be ignored.

Signals that implement the Cmd field of flow events are selected based on their respective distinguishing power. Given a set of flow events E and a set of signals W that implement E, the distinguishing power of Wi ⊆ W is defined by how finely E can be partitioned with respect to Wi. A finer partition means higher distinguishing power. For example, suppose two flow events on link (cpu0, DCache0) are implemented by eight signals b7 ... b0 with the following encodings.

(cpu0 : DCache0 : wr_req)    0100 0000
(cpu0 : DCache0 : rd_req)    1000 0000

Under these encodings, signals b5 ... b0 have zero distinguishing power. b7 and b6 have equal power, therefore selecting either one would be fine. Selecting signals with high distinguishing power helps to address issue #1 as discussed in Section 3.4.

RTL models may contain additional implementation information that can help to address issues #2 and #3. For example, memory operations may be executed out-of-order. In this case, CPUs usually assign unique sequence IDs to flow instances to maintain data and control dependency in the original programs. If sequence IDs are available, selecting signals implementing them can help address issue #2.

If the on-chip interconnect needs to handle events from different components in a system, the events are usually assigned tags to identify their originating components. Selecting tags can affect how events are selected. Refer to Figure 2 for the following discussion.

1. If unique events such as t4 or t7 are selected, observing tags is not needed.

2. If shared events such as t5 or t6 are selected, tags are selected along with them.

For option 2, tags can help to map events to the flow instances with the same tags during the trace analysis, thus addressing issue #3. Even though additional signals for tags are selected, the total number of events may be smaller if the shared events are used in many different flows, therefore resulting in fewer signals for observation overall.

The following discussion illustrates yet another example of how implementation information can allow different events to be selected. Refer to Figure 2. That flow contains two branching places, p2 and p4. When a flow instance reaches p2, which branch to take next depends on whether the cache operation is a hit or a miss. Similarly, which branch to take at p4 depends on whether the cache snoop operation is a hit or a miss. If these two status signals are available and included for observation, there is no need to select branch events. Observing start/end events plus those status signals is sufficient to identify the branches followed by a flow instance during system execution.

5. EXPERIMENTAL RESULTS

To the best of our knowledge, this work is the first to present a systematic approach to post-silicon trace analysis guided by system level protocols. We are not able to find any similar previous work against which ours can be evaluated and compared. The closest work to ours is [15]. However, our work is more general and developed with practical considerations. Additionally, the work in [15] is discussed and evaluated based on an abstract transaction-level model while our approach is evaluated on an RTL model.

5.1 The Model

The ideas and techniques presented in this paper are evaluated on a multi-core SoC prototype, as shown in Figure 1, which implements a number of common industrial system-level protocols including cache coherence and power management. This prototype is a cycle- and pin-accurate RTL model written in VHDL. Even though this model is simple compared to real SoC designs, it is much more sophisticated than the gate-level benchmark suites typically considered as targets for post-silicon analysis [10, 4, 5].

Since the proposed trace analysis approach is communication centric, the focus of this model is the implementation of system-level protocols. The CPUs are treated as a test environment where software programs are simulated in VHDL to trigger various protocols. Therefore, there is no instruction
cache as no instructions are involved when the CPUs are simulated. The peripheral blocks, such as GFX, PMU, and Audio, are also described as abstract models that generate events to initiate flows or to respond to incoming requests.

More details of some system-level protocols implemented in our model can be found in [3]. They include downstream read/write protocols for each CPU, upstream read/write for the peripheral blocks, and system power management protocols, which are abstracted from real industrial protocols. These system-level protocols are supported by inter-block communication protocols based on the ARM AXI4-lite [1]. A total of 16 flows are implemented for this prototype.

A flow event is generated by a source and consumed by a destination through messages transmitted over the link between them. In our model, each message is organized as follows.

⟨Val(1), Cmd(8), Tag(8), Sid(8), Addr(32), Data(32)⟩

The meanings of the message fields are given below. The numbers following the individual fields indicate their respective widths. Note that not all fields are used on all links. That model has over four thousand single bit signals.

Val indicates validity of a message.

Cmd carries operations to be performed by the target block.

Tag is used by Bus to identify the original sources of messages from different blocks that go to the same destination, e.g. memory wr_req from Bus in response to wr_req from both CPUs.

Sid is a unique number generated by a component to represent sequencing information of flows initiated by the same component.

Addr carries the memory address at the target block where Cmd is applied.

Data carries data to a target or from a source. Its width can vary depending on the links where a message is sent. On the links between Cache and Bus, the width is equal to the size of the cache block, which is 64 bytes. For all the other links, the width is 32 bits.

5.2 Experiment Setup

Test Environment. The prototype is simulated in a random test environment where CPUs, GFX, and other peripheral blocks are programmed to randomly select a flow to initiate in each clock cycle. The contents of Cmd, Addr, and Data in each activated flow are set randomly. Additionally, CPUs can activate power management protocols non-deterministically. Each of these blocks activates a total of 100 flow instances during the entire simulation.

Trace Signal Selection. In the experiments, different selections of trace signals are produced as discussed in Section 4, and their impacts on the complexity and accuracy of the trace analysis approach are evaluated. The list below explains the selections at the system level while information on the bit level selection is given in Table 1.

S1 All events of all flow specifications, and all signals implementing each event are selected. This selection offers full observation, and provides a baseline for comparing with other selections.

S2 The start and end events of all protocols are selected. Furthermore, for each branch in each flow, one unique event is selected.

S3 The start and end events of all protocols are selected. Furthermore, for each branch in each flow, a highly shared event is selected.

S4 The start and end events of all protocols are selected. Instead of selecting events for branches in each flow, signals whose states control the flow branching are selected.

At the bit level, the Addr and Data fields are not considered. On the other hand, the Val bit is always selected so that valid messages can be identified from observed traces. For selections S2, S3, and S4, experiments are performed to evaluate all combinations of the Cmd, Tag and Sid fields.

5.3 Result Analysis

In Table 1, a marked entry means that all signals implementing a particular field for all selected events in selection SX are traced. Otherwise, none of those signals are traced. The third row (# Bits) shows the total number of single-bit signals traced for different selections. As discussed in Section 4, system-level selection may choose events unique to particular flows or events shared by multiple flows. From the table, we can see that selecting shared events leads to a smaller number of trace signals (S3) compared with selecting unique events (S2). However, if status signals controlling flow branching are selected without selecting any branch events, S4 leads to the smallest trace signal selection.

From the table, it is clear that not selecting Cmd or Sid has severe impacts on the trace analysis, as explained in issues #1 and #2 in Section 3.4. On the other hand, not selecting Tag has negative impacts, but not as severe. The trace analysis can still finish even though it takes more time and memory. Next, compare the results obtained by selecting Cmd and Sid but no Tag under S2 to S4. The results with S4 are much better than those with S2 or S3. This is because no branch events are selected for S4, therefore, issues #2 and #3 are avoided. Combined with the benefit of reduced trace signals, S4 appears to be the best option. On the other hand, not selecting any branch events may cause difficulty in understanding flow execution if a branch is long and a system execution fails to reach the end of that branch.

In the above discussion, selections of Cmd, Tag and Sid are applied to all events as the result of the system-level selection. A finer selection can be used to reduce trace signals if unique events and shared events are considered separately. For unique events, the sources where they are generated are known from flows, therefore Tags need not be traced. Shared events may be results of flow instances initiated by different components, therefore tracing Tags is necessary. On the other hand, tracing or not tracing Cmds has little impact on the trace analysis. These points are supported by the results shown in columns under “U S”. Under S2, compare the results under “U S” against those with all three fields selected. We can see that the runtime performance and the complexity and accuracy of the trace analysis are similar while the trace signals are reduced with the finer selection. Comparing the results under “U S” against those obtained with only Cmd and Sid selected, the complexity drops significantly. The same conclusion can be drawn for S3 and S4.

From the above discussion, it is necessary to trace signals implementing Cmd and Sid whenever possible, and trace as many signals implementing Tag as allowed to reduce complexity of the trace analysis even more. If Tag or Sid is not