DocumentCode
1361453
Title
Multicore Processing and Efficient On-Chip Caching for H.264 and Future Video Decoders
Author
Finchelstein, Daniel F.; Sze, Vivienne; Chandrakasan, Anantha P.
Author_Institution
Nvidia Corp., Santa Clara, CA, USA
Volume
19
Issue
11
fYear
2009
Firstpage
1704
Lastpage
1713
Abstract
Performance requirements for video decoding will continue to rise due to the adoption of higher resolutions and faster frame rates. Multicore processing is an effective way to handle the resulting increase in computation. For power-constrained applications such as mobile devices, the extra performance can be traded off for lower power consumption via voltage scaling. Because memory power is a significant part of system power, it is also important to reduce unnecessary on-chip and off-chip memory accesses. This paper proposes several techniques that enable multiple parallel decoders to process a single video sequence; it also demonstrates several on-chip caching schemes. First, we describe techniques that can be applied to the existing H.264 standard, such as multiframe processing. Second, with an eye toward future video standards, we propose replacing traditional raster-scan processing with an interleaved macroblock ordering, which can increase parallelism with minimal impact on coding efficiency and latency. The proposed architectures allow N parallel hardware decoders to achieve a speedup of up to a factor of N. For example, with N = 3, the proposed multiple-frame and interleaved entropy slice multicore processing techniques achieve performance improvements of 2.64× and 2.91×, respectively. This extra hardware performance can be used to decode higher definition video; alternatively, it can be traded off for dynamic power savings of 60% relative to a single nominal-voltage decoder. Finally, on-chip caching methods are presented that significantly reduce off-chip memory bandwidth, leading to a further increase in performance and energy efficiency. Data-forwarding caches reduce off-chip memory reads by 53%, while a last-frame cache eliminates 80% of the off-chip reads. The proposed techniques were validated and benchmarked using full-system Verilog hardware simulations based on an existing decoder; they should also be applicable to most other decoder architectures. The metrics used to evaluate the ideas in this paper are performance, power, area, memory efficiency, coding efficiency, and input latency.
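As a rough illustration of the interleaved macroblock ordering mentioned in the abstract, the sketch below (not taken from the paper; the function and parameter names interleaved_row_assignment, mb_rows, and num_cores are hypothetical) assigns macroblock rows to N decoder cores in round-robin order, so each core decodes every Nth row rather than the whole frame in raster-scan order. It is only a conceptual sketch of how such an assignment could be expressed, under the assumption that parallelism is exploited at macroblock-row granularity.

def interleaved_row_assignment(mb_rows: int, num_cores: int) -> dict[int, list[int]]:
    """Map each decoder core to the macroblock rows it processes.

    Row r goes to core r % num_cores, so core k handles rows
    k, k + N, k + 2N, ...; a core can start a row once the core
    processing the row above is a few macroblocks ahead, which is
    what keeps intra/motion-vector prediction dependencies satisfied.
    """
    assignment: dict[int, list[int]] = {k: [] for k in range(num_cores)}
    for r in range(mb_rows):
        assignment[r % num_cores].append(r)
    return assignment

if __name__ == "__main__":
    # 1080p video has 68 macroblock rows (1088 / 16); N = 3 cores,
    # matching the example configuration in the abstract.
    print(interleaved_row_assignment(68, 3))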
Keywords
cache storage; decoding; image resolution; image sequences; microprocessor chips; multiprocessing systems; system-on-chip; video coding; H.264; coding efficiency; interleaved entropy slice technique; interleaved macroblock ordering; lower power consumption; memory power; mobile device; multicore processing; multiframe processing; off-chip memory access; on-chip caching; parallel decoder; parallel hardware decoder; video decoder; video sequence; voltage scaling; low-power; multicore; parallelism; video decoders
fLanguage
English
Journal_Title
IEEE Transactions on Circuits and Systems for Video Technology
Publisher
IEEE
ISSN
1051-8215
Type
jour
DOI
10.1109/TCSVT.2009.2031459
Filename
5229295