The Xbox One System on a Chip and Kinect Sensor
John Sell, Patrick O'Connor, Microsoft Corporation

1 Abstract

The System on a Chip (SoC) at the heart of the Xbox One entertainment console is one of the largest consumer designs to date, with five billion transistors. The Xbox One Kinect image and voice sensor uses time of flight technology to provide high-resolution, low-latency, lighting-independent three-dimensional image sensing. Together they provide unique voice and gesture interaction with high-performance games and other entertainment applications.

2 Terms

CPU     Central Processing Unit
DRAM    Dynamic Random Access Memory
DSP     Digital Signal Processor
GPU     Graphics Processing Unit
HDMI    High-Definition Multimedia Interface
MMU     Memory Management Unit
PCI(e)  Peripheral Component Interconnect (Express)
SoC     System on a Chip
SRAM    Static Random Access Memory

3 Xbox One System

The Xbox One system pictured in figure 1 includes the Kinect image and audio sensors, the console, and a wireless controller.

Figure 1: Xbox One Kinect, Console, and Wireless Controller

Figure 2 shows a block diagram of the system. The main SoC contains all of the principal computation components. The South Bridge chip expands the SoC input and output to access optical disc, hard disc, and flash storage, HDMI input, Kinect, and wireless devices.

Figure 2: Xbox One System

4 Main SoC

A single SoC departs from the initial implementations of previous high-performance consoles. One chip enables the most efficient allocation of memory and other resources, and avoids the latency, bandwidth limitations, and power consumption of communicating between separate computation chips. Microsoft collaborated with Advanced Micro Devices (AMD) to develop the SoC. SRAM and GPU circuits with redundancy comprise more than 50% of the 370-mm² chip, resulting in yield comparable to much smaller designs. Figure 3 shows the SoC organization.

The SoC provides simultaneous system and user services, video input and output, voice recognition, and three-dimensional image recognition.
Significant features include:
- Unified, but not uniform, main memory
- Universal host-guest virtual memory management
- High-bandwidth CPU cache coherency
- Power islands matching features and performance to active tasks

Figure 3: SoC Organization

4.1 Main Memory

Main memory consists of 8 Gbytes of low-cost DDR3 external DRAM and 32 Mbytes of internal SRAM. This combination provides the necessary bandwidth while saving power and considerable cost over wider or faster external DRAM-only alternatives.

Peak DRAM bandwidth is 68 Gbytes per second. Peak SRAM bandwidth ranges between 109 and 204 Gbytes per second, depending on the mix of transactions. Sustainable total peak bandwidth is about 200 Gbytes per second.

MMU hardware maps guest virtual addresses to guest physical addresses to physical addresses for virtualization and security. The implementation sizes caching of fully translated page addresses and uses large pages where appropriate to avoid significant performance impact from the two-dimensional translation.

System software manages physical memory allocation. System software and hardware keep page tables synchronized so that the CPU, GPU, and other processors can share memory, pass pointers rather than copying data, and a linear data structure in a GPU or CPU virtual space can have physical pages scattered in DRAM and SRAM. The unified memory system frees applications from the mechanics of where data is located, but GPU-intensive applications can specify which data should be in SRAM for best performance.

The GPU graphics core and several specialized processors share the GPU MMU, which supports 16 virtual spaces. PCIe input and output and the audio processors share the IO MMU, which supports virtual spaces for each PCI bus/device/function. Each CPU core has its own MMU (CPU access to SRAM maps through a CPU MMU and the GPU MMU).

The design provides 32 Gbytes/second peak DRAM access with hardware-maintained CPU cache coherency for data shared by the CPU, GPU, and other processors. Hardware-maintained coherency improves performance and software reliability. The implementation restricts shared CPU-cache-coherent data (and PCIe and audio data, most of which is CPU-cache-coherent) to DRAM for simplification and cost savings. GPU SRAM access and non-CPU-cache-coherent DRAM access bypass CPU cache coherency checking.
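As a rough consistency check on these bandwidth figures, the sketch below recomputes the quoted peaks. The interface parameters (a 256-bit DDR3 interface at 2133 MT/s, and an SRAM port moving 128 bytes per cycle at the 853-MHz GPU clock) are assumptions for illustration, not specifications taken from this article.

    # Back-of-envelope check of the main memory bandwidth figures.
    # Bus widths and transfer rates below are assumptions.

    def peak_gbytes_per_sec(transfers_per_sec, bytes_per_transfer):
        """Peak bandwidth = transfer rate x bytes moved per transfer."""
        return transfers_per_sec * bytes_per_transfer / 1e9

    # Assumed: 256-bit (32-byte) DDR3 interface at 2133 MT/s.
    dram = peak_gbytes_per_sec(2133e6, 32)       # ~68 Gbytes/s
    # Assumed: SRAM moves 128 bytes/cycle at 853 MHz in one direction;
    # mixes of simultaneous reads and writes raise the peak toward 204.
    sram = peak_gbytes_per_sec(853e6, 128)       # ~109 Gbytes/s

    print(f"DRAM peak: {dram:.1f} Gbytes/s (article: 68)")
    print(f"SRAM one-direction peak: {sram:.1f} Gbytes/s (article: 109-204)")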
4.2 CPU

The CPU contains eight AMD Jaguar single-thread 64-bit x86 cores in two clusters of four. The cores contain individual first-level code and data caches. Each cluster contains a shared 2-Mbyte second-level cache.

The CPU cores operate at 1750 MHz in full performance mode. Each cluster can operate at a different frequency. The system selectively powers individual cores and clusters to match workload requirements.

The Jaguar cores provide good performance and excellent power-performance efficiency. The CPU contains minor modifications from earlier Jaguar implementations to support two clusters and increased CPU cache-coherent bandwidth.

4.3 GPU

Figure 4 shows the graphics core and the independent processors and functions sharing the GPU MMU. The GPU contains AMD graphics technology supporting a customized version of Microsoft DirectX graphics features. Hardware and software customizations provide more direct access to hardware resources than standard DirectX. They reduce the CPU overhead to manage graphics activity and combined CPU and GPU processing. Kinect makes extensive use of combined CPU-GPU computation.

The graphics core contains two graphics command and two compute command processors. Each command processor supports 16 work streams. The two geometry primitive engines, 12 compute units, and four render backend depth and color engines in the graphics core support two independent graphics contexts.

The graphics core operates at 853 MHz in full performance mode. System software selects lower frequencies, and powers the graphics core and compute unit resources to match tasks.

Figure 4: GPU
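The compute unit count and clock above imply the graphics core's peak arithmetic throughput. The sketch below computes it, assuming the 64 lanes per compute unit and two operations per fused multiply-add that characterize AMD's GCN-generation designs; neither number is stated in this article.

    # Peak single-precision throughput implied by the graphics core figures.
    compute_units = 12
    lanes_per_cu  = 64      # assumption: GCN-style SIMD width
    ops_per_lane  = 2       # assumption: multiply + add per cycle (FMA)
    clock_hz      = 853e6   # full performance mode

    tflops = compute_units * lanes_per_cu * ops_per_lane * clock_hz / 1e12
    print(f"Peak FP32 throughput: {tflops:.2f} TFLOPS")   # ~1.31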
4.4 Independent GPU Processors and Functions

Eight independent processors and functions share the GPU MMU. These engines support applications and system services. They augment GPU and CPU processing, and are more power-performance efficient at their tasks.

Four of the engines provide copy, format conversion, compression, and decompression services. The video decode and encode engines support multiple streams and a range of formats. The audio-video input and output engines support multiple streams, synchronization, and digital rights management. Audio-video output includes resizing and compositing three images, and saving the results in main memory in addition to display output.

4.5 Audio Processors

The SoC contains eight audio processors and supporting hardware, shown in figure 5. The processors support applications and system services with multiple work queues. Collectively, they would require two CPU cores to match their audio processing capability.

The four DSP cores are Tensilica-based designs incorporating standard and specialized instructions. Two include single-precision vector floating point totaling 15.4 billion operations per second. The other four audio processors implement:
- Sample rate conversion
- Equalization and dynamic range compression
- Filter and volume processing
- 512-stream Xbox Media Audio format decompression

Figure 5: Audio Processors

The audio processors use the IO MMU. This path to main memory provides lower latency than the GPU MMU path. Low latency is important for games, which frequently make instantaneous audio decisions, and for Kinect audio processing.
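To make the offloaded work concrete, the sketch below shows sample rate conversion, one of the fixed-function tasks listed above, in its simplest form. Linear interpolation is used only for brevity; the article does not describe the hardware's filters, which would be of far higher quality.

    import math

    def resample_linear(samples, src_rate, dst_rate):
        """Resample a mono buffer from src_rate to dst_rate (naive linear)."""
        ratio = src_rate / dst_rate
        out = []
        for i in range(int(len(samples) / ratio)):
            pos = i * ratio
            j = int(pos)
            frac = pos - j
            nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
            out.append(samples[j] * (1.0 - frac) + nxt * frac)
        return out

    # Example: convert a 440 Hz tone from 48 kHz to 44.1 kHz.
    tone = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(480)]
    print(len(tone), "->", len(resample_linear(tone, 48000, 44100)))  # 480 -> 441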
5 Xbox One Kinect

The Xbox One Kinect is the second-generation Microsoft three-dimensional image and audio sensor. It is an integral part of the Xbox One system. The three-dimensional image and audio sensors and the SoC computation capabilities, operating in parallel with games and other applications, provide an unprecedented level of voice, gesture, and physical interaction with the system.

5.1 Image Sensor Goals and Requirements

User experience drove the image sensor goals:
- Resolution sufficient for software to reliably detect and track the range of human sizes from young children to small and large adults; a limiting dimension is the diameter of a small child's wrist, approximately 2.5 cm
- Camera field of view wide enough for users to interact close to the camera in small spaces and relatively far away in larger rooms
- Camera dynamic range sufficient for users throughout the space with widely varying clothing colors
- Lighting independence
- Stability and repeatability
- Sufficiently low latency for natural-feeling gesture and physical interaction

These goals led to the key requirements:
- Field of view of 70 degrees horizontal by 60 degrees vertical
- Aperture F# < 1.1
- Depth resolution within 1% of distance
- Minimum software-resolvable object less than 2.5 cm
- Operating range from 0.8 m to 4.2 m from the camera
- Illumination from the camera and operation independent of room lighting
- Maximum of 14 milliseconds exposure time
- Less than 20 milliseconds latency from the beginning of each exposure to data delivered over USB 3.0 to main system software
- Depth accuracy within 2% across all lighting, color, users, and other conditions in the operating range

5.2 Time of Flight Camera Architecture

Figure 6 shows the three-dimensional image sensor system. The system consists of the sensor chip and a camera SoC. The camera SoC manages the sensor and communications with the Xbox One console.

Figure 6: Three-dimensional Image Sensor System
The time of flight system modulates a camera light source with a square wave. It uses phase detection to measure the time it takes light to travel from the light source to the object and back to the sensor, and calculates distance from the results.

The timing generator creates a modulation square wave. The system uses this signal to modulate both the local light source (transmitter) and the pixel (receiver).

The light travels to the object and back in time Δt. The system calculates Δt by estimating received light phase at each pixel with knowledge of the modulation frequency. The system calculates depth from the speed of light in air: 1 cm in 33 picoseconds.

5.3 Differential Pixels

Figure 7 shows the time of flight sensor and signal waveforms. A laser diode illuminates the subjects. The time of flight differential pixel array receives the reflected light.

A differential pixel distinguishes the time of flight sensor from a classic camera sensor. The modulation input controls the conversion of incoming light to charge in the differential pixel's two outputs. The timing generator creates clock signals to control the pixel array and a synchronous signal to modulate the light source. The waveforms illustrate phase determination.

Figure 7: Time of Flight Sensor

The light source transmits the light signal. It travels out from the camera, reflects off any object in the field of view, and returns to the sensor lens with some delay (phase shift) and attenuation.

The lens focuses the light on the sensor pixels. A synchronous clock modulates the pixel receiver. When the clock is high, photons falling on the pixel contribute charge to the A-out side of the pixel. When the clock is low, photons contribute charge to the B-out side of the pixel.

The (A-B) differential signal provides an output whose value depends both on the returning light level and on the time it arrives with respect to the pixel clock. This is the essence of time of flight phase detection.

Some interesting properties of the pixel output lead to a very useful set of output images:
- (A+B) gives a normal grey scale image illuminated by ambient (room) lighting (the "ambient image")
- (A-B) gives phase information after an arctangent calculation (the "depth image")
- The magnitude of (A-B) gives a grey scale image that is independent of ambient (room) lighting (the "active image")

Chip optical and electrical parameters determine the quality of the resulting image. It does not depend significantly on mechanical factors.
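The sketch below shows how the three images fall out of the differential outputs for a single pixel. The article does not give the capture sequence; a common time-of-flight scheme, assumed here, samples (A-B) with the pixel clock at two offsets 90 degrees apart and combines them in quadrature.

    import math

    def pixel_images(a_minus_b_0, a_minus_b_90, a_plus_b):
        """Ambient, active, and phase values for one pixel.

        a_minus_b_0  : (A-B) with the pixel clock in phase with the light source
        a_minus_b_90 : (A-B) with the pixel clock shifted 90 degrees
        a_plus_b     : (A+B), the total collected charge
        """
        ambient = a_plus_b                              # ordinary grey scale image
        active = math.hypot(a_minus_b_0, a_minus_b_90)  # lighting-independent image
        phase = math.atan2(a_minus_b_90, a_minus_b_0)   # input to the depth image
        return ambient, active, phase

    ambient, active, phase = pixel_images(0.30, 0.52, 1.80)
    print(f"phase = {phase:.3f} rad, active = {active:.3f}, ambient = {ambient:.2f}")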
Multiphase captures cancel linearity errors, and simple temperature compensation ensures accuracy stays within specifications.

Key benefits of the time of flight system are:
- One depth sample per pixel: X-Y resolution is determined by chip dimensions
- Depth resolution is a function of the signal-to-noise ratio and the modulation frequency, that is: transmit light power, receiver sensitivity, modulation contrast, and lens f-number
- Higher frequency: the phase-to-distance ratio scales directly with modulation frequency, resulting in finer resolution
- Complexity is in the circuit design; the overall system, and particularly the mechanical aspects, are simplified
- The sensor outputs three possible images from the same pixel data:
  1. Depth reading per pixel
  2. "Active" image, independent of room/ambient lighting
  3. Standard "passive" image, based upon room/ambient lighting

5.4 Dynamic Range

High dynamic range is important. To provide a robust experience in multiplayer situations, we want to detect someone wearing bright clothes standing close to the camera and simultaneously detect someone wearing very dark clothes standing at the back of the play space.

With time of flight, depth resolution is a function of the signal-to-noise ratio at the sensor, where the signal is the received light power and the noise is a combination of shot noise in the light and circuit noise in the sensor electronics. We want to exceed a minimum signal-to-noise ratio for all pixels imaging the users in the room, independent of how many users there are, the clothes they are wearing, or where they are in the room.

For an optical system, the incident power density falls off with the square of distance. Reflectivity of typical clothes can vary from more than 95% to less than 10%. This requires that the sensor show a per-pixel dynamic range in excess of 2500x.

A photographer can adjust aperture and shutter time in a camera to achieve optimal exposure for a subject. The Kinect time of flight system must keep the aperture wide open to minimize the light power required. It takes two images back-to-back with different but fixed shutter times of approximately 100 and 1000 microseconds, and selects the best result pixel by pixel. The design provides non-destructive pixel reading, and light integration involves reading each pixel multiple times to select the best result.
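A minimal sketch of that pixel-by-pixel selection between the two exposures might look as follows. The saturation threshold is an illustrative assumption; the article specifies only the two shutter times and the per-pixel choice.

    SATURATION = 4000  # assumed ADC full-scale, in counts

    def select_exposure(long_px, short_px, long_us=1000, short_us=100):
        """Pick the better of two back-to-back exposures for one pixel.

        Prefer the long exposure for signal-to-noise; where it saturates
        (bright clothing close to the camera), fall back to the short
        exposure, rescaled to the long exposure's time base.
        """
        if long_px < SATURATION:
            return float(long_px)
        return short_px * (long_us / short_us)

    print(select_exposure(long_px=350, short_px=35))    # dark subject: 350.0
    print(select_exposure(long_px=4000, short_px=900))  # bright subject: 9000.0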
5.5 Sensing over Long Range with Fine Resolution

The system measures the phase shift of a modulated signal, then calculates depth from the phase using:

    d = (c × φ) / (4π × f_mod)

where d is depth, c is the speed of light, φ is the measured phase shift, and f_mod is the modulation frequency.

Increasing the modulation frequency increases resolution, that is, the depth resolution for a given phase uncertainty. Power limits what modulation frequencies can practically be used, and higher frequency increases phase aliasing.

Phase wraps around at 360°. This causes the depth reading to alias. For example, aliasing starts at a depth of 1.87 m with an 80 MHz modulation frequency.

Kinect acquires images at multiple modulation frequencies, illustrated in figure 8. This allows ambiguity elimination as far away as the equivalent of the beat frequency of the different frequencies, which is greater than 10 m for Kinect with the chosen frequencies of approximately 120 MHz, 80 MHz, and 16 MHz.

Figure 8: Multiple Modulation Frequencies
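The sketch below applies the depth formula above and de-aliases with the three frequencies by brute force: each frequency's phase constrains depth to a wrapped remainder, and a search finds the one depth consistent with all of them. The search-based method is an illustration, not the production algorithm.

    import math

    C = 299_792_458.0  # speed of light in m/s

    def unambiguous_range(f_mod):
        """Depth at which the phase wraps for one modulation frequency."""
        return C / (2 * f_mod)

    def dealias(phases, freqs, max_depth=16.0, step=0.001):
        """Find the depth whose wrapped phases best match all readings."""
        best_d, best_err = 0.0, float("inf")
        for k in range(int(max_depth / step)):
            d = k * step
            err = 0.0
            for phi, f in zip(phases, freqs):
                expected = (4 * math.pi * f * d / C) % (2 * math.pi)
                diff = abs(expected - phi)
                err += min(diff, 2 * math.pi - diff)
            if err < best_err:
                best_d, best_err = d, err
        return best_d

    freqs = [120e6, 80e6, 16e6]   # approximate Kinect modulation frequencies
    true_depth = 4.0              # metres, beyond any single frequency's range
    phases = [(4 * math.pi * f * true_depth / C) % (2 * math.pi) for f in freqs]

    print(f"80 MHz wraps at {unambiguous_range(80e6):.2f} m")   # ~1.87 m
    print(f"recovered depth: {dealias(phases, freqs):.3f} m")   # ~4.000 m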
5.6 Depth Image

The GPU in the main SoC calculates depth from the phase information delivered by the camera. This takes a small part of each frame time.

Figure 9 shows a depth image captured at a distance of approximately 2.5 m, direct from the camera, without averaging or further processing. The coloring is a result of test software that assigns a color to each recognized user for engineering use.

Figure 9: Depth Image

Figure 10 illustrates de-aliasing performance. It shows an image of a long corridor. The system obtains smooth depth readings out to 16 m in this example without wrapping.

Figure 10: Depth Range

Figure 11 illustrates the wide dynamic depth range applied to human figure recognition. One figure is close to the camera and the other is far away. The system captures both clearly.

Figure 11: Dynamic Range Figure Recognition

5.7 Face Recognition

Face recognition is important for a personalized user experience. It is difficult to achieve high-quality results in many situations with normal photography due to the wide variety of room lighting conditions. The photo in figure 12 is an example of how room lighting and the resulting shadowing can dramatically change how a person looks to a camera, in this case from a lamp to the side of the TV.

Figure 12: High Contrast Ambient Lighting Situation

Figure 13 shows the same scene captured with the Kinect three-dimensional sensor. The sensor data provides an image that is independent of the wide variation in room lighting.

Figure 13: Kinect Image in High Contrast Ambient Lighting Situation

The resolution is lower than that of the high definition RGB camera that Kinect also contains. However, the fixed illumination more than compensates, so that the system can provide robust face recognition to applications.

6 Conclusion

The Xbox One SoC incorporates five billion transistors to provide high-performance computation, graphics, audio processing, and audio-video input and output for multiple, simultaneous applications and system services. The Xbox One Kinect adds low-latency three-dimensional image and voice sensing. Together, the SoC and Kinect provide unique voice and gesture control. The system recognizes individual users. They can use voice and movement within many applications, switch instantly between functions, and combine games, TV, and music, while interacting with friends via services such as Skype audio and video.

7 References

1. J. Andrews and N. Baker, "Xbox 360 System Architecture," IEEE Micro, March/April 2006.
2. "AMD-V Nested Paging," AMD white paper, July 2008, http://developer.amd.com/wordpress/media/2012/10/NPT-WP-1%201-final-TM.pdf
3. J. Rupley, "Jaguar," Hot Chips 24 Proceedings, August 2012, http://www.hotchips.org/archives/hc24
4. D. Piatti and F. Rinaudo, "SR-4000 and CamCube3.0 Time of Flight (ToF) Cameras: Tests and Comparison," Remote Sensing, pp. 1069-1089, 2012.
5. C. S. Bamji et al., "A 512×424 CMOS 3D Time-of-Flight Image Sensor with Multi-Frequency Photo-Demodulation up to 130 MHz and 2 GS/s ADC," ISSCC Proceedings, Feb. 2014.

John Sell is a hardware architect at Microsoft, and chief architect of the Xbox One SoC. Sell has an MS in electrical engineering and computer science from the University of California at Berkeley, and a BS in engineering from Harvey Mudd College, Claremont, CA.

Patrick O'Connor is a Senior Director of Engineering at Microsoft, responsible for hardware and software development of sensors and custom silicon. O'Connor has a BS in electrical engineering from Trinity College, Dublin.

Microsoft Corporation
1065 La Avenida
Mountain View, CA 94043