RFC 2032: RTP Payload Format for H.261 Video Streams

Network Working Group                                        T. Turletti
Request for Comments: 2032                                           MIT
Category: Standards Track                                     C. Huitema
                                                                Bellcore
                                                            October 1996
 RTP Payload Format for H.261 Video Streams
Status of this Memo
 This document specifies an Internet standards track protocol for the
 Internet community, and requests discussion and suggestions for
 improvements. Please refer to the current edition of the "Internet
 Official Protocol Standards" (STD 1) for the standardization state
 and status of this protocol. Distribution of this memo is unlimited.
Table of Contents
 1. Abstract
 2. Purpose of this document
 3. Structure of the packet stream
    3.1 Overview of the ITU-T recommendation H.261
    3.2 Considerations for packetization
 4. Specification of the packetization scheme
    4.1 Usage of RTP
    4.2 Recommendations for operation with hardware codecs
 5. Packet loss issues
    5.1 Use of optional H.261-specific control packets
    5.2 H.261 control packets definition
        5.2.1 Full INTRA-frame Request (FIR) packet
        5.2.2 Negative ACKnowledgements (NACK) packet
 6. Security Considerations
 Authors' Addresses
 Acknowledgements
 References
1. Abstract
 This memo describes a scheme to packetize an H.261 video stream for
 transport using the Real-time Transport Protocol, RTP, with any of
 the underlying protocols that carry RTP.
 This specification is a product of the Audio/Video Transport working
 group within the Internet Engineering Task Force. Comments are
 solicited and should be addressed to the working group's mailing list
 at rem-conf@es.net and/or the authors.
Turletti & Huitema Standards Track [Page 1]

RFC 2032 RTP Payload Format for H.261 Video October 1996
2. Purpose of this document
 The ITU-T recommendation H.261 [6] specifies the encodings used by
 ITU-T compliant video-conference codecs. Although these encodings
 were originally specified for fixed data rate ISDN circuits,
 experiments [3],[8] have shown that they can also be used over
 packet-switched networks such as the Internet.
 The purpose of this memo is to specify the RTP payload format for
 encapsulating H.261 video streams in RTP [1].
3. Structure of the packet stream
3.1. Overview of the ITU-T recommendation H.261
 The H.261 coding is organized as a hierarchy of groupings. The video
 stream is composed of a sequence of images, or frames, which are
 themselves organized as a set of Groups of Blocks (GOB). Note that
 H.261 "pictures" are referred to as "frames" in this document. Each GOB
 holds a set of 3 lines of 11 macro blocks (MB). Each MB carries
 information on a group of 16x16 pixels: luminance information is
 specified for 4 blocks of 8x8 pixels, while chrominance information
 is given by two "red" and "blue" color difference components at a
 resolution of only 8x8 pixels. These components and the codes
 representing their sampled values are as defined in the ITU-R
 Recommendation 601 [7].
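
 The grouping arithmetic above can be checked with a few lines of
 Python. This is only an illustrative sketch: the constant and function
 names are assumptions made for this example and do not come from
 H.261 itself.

```python
from typing import Tuple

# Sizes taken from the description above; the names are illustrative only.
MB_SIZE = 16           # a macroblock covers 16x16 luminance pixels
BLOCK_SIZE = 8         # each block is 8x8 samples
MBS_PER_GOB_LINE = 11  # 11 macroblocks per line
LINES_PER_GOB = 3      # 3 lines of macroblocks per GOB


def blocks_per_mb() -> int:
    """4 luminance blocks plus one block per chrominance component."""
    luminance_blocks = (MB_SIZE // BLOCK_SIZE) ** 2
    chrominance_blocks = 2  # the "red" and "blue" difference components
    return luminance_blocks + chrominance_blocks


def mbs_per_gob() -> int:
    """Number of macroblocks in one GOB."""
    return MBS_PER_GOB_LINE * LINES_PER_GOB


def gob_size_pixels() -> Tuple[int, int]:
    """(width, height) of the luminance area covered by one GOB."""
    return (MBS_PER_GOB_LINE * MB_SIZE, LINES_PER_GOB * MB_SIZE)
```

 With these values a GOB holds 33 MBs, which matches the MBA range of
 1-33 used later in this memo.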
 This grouping is used to specify information at each level of the
 hierarchy:
 - At the frame level, one specifies information such as the
 delay from the previous frame, the image format, and
 various indicators.
 - At the GOB level, one specifies the GOB number and the
 default quantizer that will be used for the MBs.
 - At the MB level, one specifies which blocks are present
 and which did not change, and optionally a quantizer and
 motion vectors.
 Blocks which have changed are encoded by computing the discrete
 cosine transform (DCT) of their coefficients, which are then
 quantized and Huffman encoded (Variable Length Codes).
 The H.261 Huffman encoding includes a special "GOB start" pattern,
 composed of 15 zeroes followed by a single 1, that cannot be imitated
 by any other code words. This pattern is included at the beginning of
 each GOB header (and also at the beginning of each frame header) to
 mark the separation between two GOBs, and is in fact used as an
 indicator that the current GOB is terminated. The encoding also
 includes a stuffing pattern, composed of seven zeroes followed by
 four ones; that stuffing pattern can only be entered between the
 encoding of MBs, or just before the GOB separator.
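
 Because the "GOB start" pattern cannot be imitated by other code
 words, a packetizer can locate it with a plain bit-level scan. The
 following Python sketch illustrates the idea; the function names are
 hypothetical, and the string-of-bits representation is chosen for
 readability rather than speed.

```python
from typing import List

# 15 zeroes followed by a single 1, as described above.
GOB_START = "0" * 15 + "1"


def to_bits(data: bytes) -> str:
    """Render octets as a string of '0'/'1' characters, MSB first."""
    return "".join(f"{octet:08b}" for octet in data)


def find_start_codes(data: bytes) -> List[int]:
    """Return the bit offsets at which the start pattern begins."""
    bits = to_bits(data)
    offsets = []
    pos = bits.find(GOB_START)
    while pos != -1:
        offsets.append(pos)
        # Continue the search after the pattern just found.
        pos = bits.find(GOB_START, pos + len(GOB_START))
    return offsets
```

 The pattern need not be octet aligned: for the three octets 0x80,
 0x00, 0x80 the scan reports a start code at bit offset 1.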
3.2. Considerations for packetization
 H.261 codecs designed for operation over ISDN circuits produce a bit
 stream composed of several levels of encoding specified by H.261 and
 companion recommendations. The bits resulting from the Huffman
 encoding are arranged in 512-bit frames, containing 2 bits of
 synchronization, 492 bits of data and 18 bits of error correcting
 code. The 512-bit frames are then interlaced with an audio stream
 and transmitted over px64 kbps circuits according to specification
 H.221 [5].
 When transmitting over the Internet, we will directly consider the
 output of the Huffman encoding. All the bits produced by the Huffman
 encoding stage will be included in the packet. We will not carry the
 512-bit frames, as protection against bit errors can be obtained by
 other means. Similarly, we will not attempt to multiplex audio and
 video signals in the same packets, as UDP and RTP provide a much more
 efficient way to achieve multiplexing.
 Directly transmitting the result of the Huffman encoding over an
 unreliable stream of UDP datagrams would, however, have poor error
 resistance characteristics. The result of the hierarchical structure
 of the H.261 bit stream is that one needs to receive the information
 present in the frame header to decode the GOBs, as well as the
 information present in the GOB header to decode the MBs. Without
 precautions, this would mean that one has to receive all the packets
 that carry an image in order to properly decode its components.
 If each image could be carried in a single packet, this requirement
 would not create a problem. However, a video image or even one GOB by
 itself can sometimes be too large to fit in a single packet.
 Therefore, the MB is taken as the unit of fragmentation. Packets
 must start and end on a MB boundary, i.e. a MB cannot be split across
 multiple packets. Multiple MBs may be carried in a single packet
 when they will fit within the maximal packet size allowed. This
 practice is recommended to reduce the packet send rate and packet
 overhead.
 To allow each packet to be processed independently for efficient
 resynchronization in the presence of packet losses, some state
 information from the frame header and GOB header is carried with each
 packet to allow the MBs in that packet to be decoded. This state
 information includes the GOB number in effect at the start of the
 packet, the macroblock address predictor (i.e. the last MBA encoded
 in the previous packet), the quantizer value in effect prior to the
 start of this packet (GQUANT, MQUANT or zero in case of a beginning
 of GOB) and the reference motion vector data (MVD) for computing the
 true MVDs contained within this packet. The bit stream cannot be
 fragmented between a GOB header and MB 1 of that GOB.
 Moreover, since the compressed MB may not fill an integer number of
 octets, the data header contains two three-bit integers, SBIT and
 EBIT, to indicate the number of unused bits in the first and last
 octets of the H.261 data, respectively.
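
 As an illustration of how SBIT and EBIT arise, the sketch below cuts
 a run of MBs out of a bit stream given the bit offsets of its first
 and one-past-last bits. The function names and the half-open
 [start_bit, end_bit) convention are assumptions made for this example.

```python
from typing import Tuple


def slice_on_bit_boundaries(
    stream: bytes, start_bit: int, end_bit: int
) -> Tuple[bytes, int, int]:
    """Return (octets, SBIT, EBIT) covering bits [start_bit, end_bit)."""
    if not 0 <= start_bit < end_bit <= 8 * len(stream):
        raise ValueError("bit range lies outside the stream")
    first_octet = start_bit // 8
    last_octet = (end_bit + 7) // 8   # round up to a whole octet
    sbit = start_bit % 8              # unused leading bits of first octet
    ebit = (8 - end_bit % 8) % 8      # unused trailing bits of last octet
    return stream[first_octet:last_octet], sbit, ebit


def valid_bits(payload: bytes, sbit: int, ebit: int) -> int:
    """Number of H.261 bits actually carried by the payload."""
    return 8 * len(payload) - sbit - ebit
```

 Both values always fall in the range 0-7, so each fits in the three
 bits reserved for it.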
4. Specification of the packetization scheme
4.1. Usage of RTP
 The H.261 information is carried as payload data within the RTP
 protocol. The following fields of the RTP header are specified:
 - The payload type should specify the H.261 payload format (see
 the companion RTP profile document, RFC 1890).
 - The RTP timestamp encodes the sampling instant of the
 first video image contained in the RTP data packet. If a
 video image occupies more than one packet, the timestamp
 will be the same on all of those packets. Packets from
 different video images must have different timestamps so
 that frames may be distinguished by the timestamp. For
 H.261 video streams, the RTP timestamp is based on a
 90kHz clock. This clock rate is a multiple of the natural
 H.261 frame rate (i.e. 30000/1001 or approx. 29.97 Hz).
 That way, for each frame time, the clock is just
 incremented by the multiple and this removes inaccuracy
 in calculating the timestamp. Furthermore, the initial
 value of the timestamp is random (unpredictable) to make
 known-plaintext attacks on encryption more difficult, see
 RTP [1]. Note that if multiple frames are encoded in a
 packet (e.g. when there are very few changes between
 two images), it is necessary to calculate display times
 for the frames after the first using the timing
 information in the H.261 frame header. This is required
 because the RTP timestamp only gives the display time of
 the first frame in the packet.
 - The marker bit of the RTP header is set to one in the
 last packet of a video frame, and otherwise, must be
 zero. Thus, it is not necessary to wait for a following
 packet (which contains the start code that terminates the
 current frame) to detect that a new frame should be
 displayed.
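
 The timestamp rule above can be worked through as follows. With a
 90 kHz clock and a frame rate of 30000/1001 Hz, one frame period is
 exactly 90000 * 1001 / 30000 = 3003 ticks. The Python sketch below
 shows this; the function names are hypothetical, and frame_index is
 assumed to count frame periods at the natural frame rate.

```python
import secrets
from fractions import Fraction

RTP_CLOCK_HZ = 90_000
FRAME_RATE = Fraction(30000, 1001)          # approx. 29.97 Hz
TICKS_PER_FRAME = Fraction(RTP_CLOCK_HZ) / FRAME_RATE


def initial_timestamp() -> int:
    """Random (unpredictable) 32-bit starting value."""
    return secrets.randbits(32)


def frame_timestamp(initial: int, frame_index: int) -> int:
    """Timestamp of the frame frame_index periods after the first one.

    RTP timestamps are 32 bits wide, so the value wraps modulo 2**32.
    """
    return (initial + frame_index * int(TICKS_PER_FRAME)) % 2**32
```

 Because the increment is an integer, no rounding error accumulates
 from frame to frame.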
 The H.261 data will follow the RTP header, as in:
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 .                                                               .
 .                          RTP header                           .
 .                                                               .
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         H.261 header                          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                       H.261 stream ...                        .
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 The H.261 header is defined as follows:
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |SBIT |EBIT |I|V| GOBN  |   MBAP  |  QUANT  |  HMVD   |  VMVD   |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 The fields in the H.261 header have the following meanings:
 Start bit position (SBIT): 3 bits
 Number of most significant bits that should be ignored
 in the first data octet.
 End bit position (EBIT): 3 bits
 Number of least significant bits that should be ignored
 in the last data octet.
 INTRA-frame encoded data (I): 1 bit
 Set to 1 if this stream contains only INTRA-frame coded
 blocks. Set to 0 if this stream may or may not contain
 INTRA-frame coded blocks. The sense of this bit may not
 change during the course of the RTP session.
 Motion Vector flag (V): 1 bit
 Set to 0 if motion vectors are not used in this stream.
 Set to 1 if motion vectors may or may not be used in
 this stream. The sense of this bit may not change during
 the course of the session.
 GOB number (GOBN): 4 bits
 Encodes the GOB number in effect at the start of the
 packet. Set to 0 if the packet begins with a GOB header.
 Macroblock address predictor (MBAP): 5 bits
 Encodes the macroblock address predictor (i.e. the last
 MBA encoded in the previous packet). This predictor ranges
 from 0-32 (to predict the valid MBAs 1-33), but because
 the bit stream cannot be fragmented between a GOB header
 and MB 1, the predictor at the start of the packet can
 never be 0. Therefore, the range is 1-32, which is biased
 by -1 to fit in 5 bits. For example, if MBAP is 0, the
 value of the MBA predictor is 1. Set to 0 if the packet
 begins with a GOB header.
 Quantizer (QUANT): 5 bits
 Quantizer value (MQUANT or GQUANT) in effect prior to the
 start of this packet. Set to 0 if the packet begins with
 a GOB header.
 Horizontal motion vector data (HMVD): 5 bits
 Reference horizontal motion vector data (MVD). Set to 0
 if V flag is 0 or if the packet begins with a GOB header,
 or when the MTYPE of the last MB encoded in the previous
 packet was not MC. HMVD is encoded as a 2's complement
 number, and `10000' corresponding to the value -16 is
 forbidden (motion vector fields range from -15 to +15).
 Vertical motion vector data (VMVD): 5 bits
 Reference vertical motion vector data (MVD). Set to 0 if
 V flag is 0 or if the packet begins with a GOB header, or
 when the MTYPE of the last MB encoded in the previous
 packet was not MC. VMVD is encoded as a 2's complement
 number, and `10000' corresponding to the value -16 is
 forbidden (motion vector fields range from -15 to +15).
 Note that the I and V flags are hint flags, i.e. they can be inferred
 from the bit stream. They are included to allow decoders to make
 optimizations that would not be possible if these hints were not
 provided before the bit stream was decoded. Therefore, these bits cannot
 change for the duration of the stream. A conformant implementation
 can always set V=1 and I=0.
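
 The 32-bit H.261 header can be packed and unpacked with ordinary
 shifts and masks. The following Python sketch is one possible
 rendering, not a reference implementation: the H261Header type and
 the function names are assumptions of this example, and MBAP is kept
 in its on-the-wire (biased) form.

```python
import struct
from typing import NamedTuple


class H261Header(NamedTuple):
    sbit: int   # 3 bits
    ebit: int   # 3 bits
    i: int      # 1 bit
    v: int      # 1 bit
    gobn: int   # 4 bits
    mbap: int   # 5 bits, as stored on the wire (biased by -1)
    quant: int  # 5 bits
    hmvd: int   # 5 bits, 2's complement, -15..+15
    vmvd: int   # 5 bits, 2's complement, -15..+15


def _signed5(value: int) -> int:
    """Interpret a 5-bit field as a 2's complement number."""
    return value - 32 if value & 0x10 else value


def pack_header(h: H261Header) -> bytes:
    """Encode the header in network byte order (4 octets)."""
    if not (-15 <= h.hmvd <= 15 and -15 <= h.vmvd <= 15):
        raise ValueError("motion vector data must lie in -15..+15")
    # Masks silently truncate out-of-range values; a stricter
    # implementation would reject them instead.
    word = (
        (h.sbit & 0x7) << 29
        | (h.ebit & 0x7) << 26
        | (h.i & 0x1) << 25
        | (h.v & 0x1) << 24
        | (h.gobn & 0xF) << 20
        | (h.mbap & 0x1F) << 15
        | (h.quant & 0x1F) << 10
        | (h.hmvd & 0x1F) << 5
        | (h.vmvd & 0x1F)
    )
    return struct.pack("!I", word)


def unpack_header(data: bytes) -> H261Header:
    """Decode the first 4 octets of an RTP payload."""
    (word,) = struct.unpack("!I", data[:4])
    return H261Header(
        sbit=(word >> 29) & 0x7,
        ebit=(word >> 26) & 0x7,
        i=(word >> 25) & 0x1,
        v=(word >> 24) & 0x1,
        gobn=(word >> 20) & 0xF,
        mbap=(word >> 15) & 0x1F,
        quant=(word >> 10) & 0x1F,
        hmvd=_signed5((word >> 5) & 0x1F),
        vmvd=_signed5(word & 0x1F),
    )
```

 A packet that begins with a GOB header and uses the always-conformant
 flags V=1 and I=0 would carry the four octets 0x01 0x00 0x00 0x00,
 provided SBIT and EBIT are both 0.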
4.2. Recommendations for operation with hardware codecs
 Packetizers for hardware codecs can trivially figure out GOB
 boundaries using the GOB-start pattern included in the H.261 data.
 (Note that software encoders already know the boundaries.) The
 cheapest packetization implementation is to packetize at the GOB
 level all the GOBs that fit in a packet. But when a GOB is too
 large, the packetizer has to parse it to do MB fragmentation. (Note
 that only the Huffman encoding must be parsed and that it is not
 necessary to fully decompress the stream, so this requires relatively
 little processing; example implementations can be found in some
 public H.261 codecs such as IVS [4] and VIC [9].) It is recommended
 that MB level fragmentation be used when feasible in order to obtain
 more efficient packetization. Using this fragmentation scheme reduces
 the output packet rate and therefore reduces the overhead.
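
 The GOB-level strategy described above amounts to a greedy grouping.
 The Python sketch below illustrates it under simplifying assumptions:
 GOB sizes are given in bits, max_payload_bits is the budget left after
 all headers and any SBIT/EBIT padding, and the function names are
 hypothetical.

```python
from typing import List


def group_gobs(gob_bits: List[int], max_payload_bits: int) -> List[List[int]]:
    """Greedily group consecutive GOBs (by index) into packets.

    A GOB larger than the budget ends up alone in its group; such a
    group still has to be split at MB boundaries by the caller.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(gob_bits):
        # Close the current packet if this GOB would not fit in it.
        if current and used + size > max_payload_bits:
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        groups.append(current)
    return groups


def needs_mb_fragmentation(
    group: List[int], gob_bits: List[int], max_payload_bits: int
) -> bool:
    """True if the group exceeds the budget and must be parsed further."""
    return sum(gob_bits[i] for i in group) > max_payload_bits
```

 For GOB sizes of 3000, 2000, 9000, 1000 and 1000 bits and a budget of
 8000 bits, the grouping is [0, 1], [2], [3, 4], and only the second
 group requires MB level fragmentation.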
 At the receiver, the data stream can be depacketized and directed to
 a hardware codec's input. If the hardware decoder operates at a
 fixed bit rate, synchronization may be maintained by inserting the
 stuffing pattern between MBs (i.e., between packets) when the packet
 arrival rate is slower than the bit rate.
5. Packet loss issues
 On the Internet, most packet losses are due to network congestion
 rather than transmission errors. Using UDP, no mechanism is available
 at the sender to know if a packet has been successfully received. It
 is up to the application, i.e. coder and decoder, to handle the
 packet loss. Each RTP packet includes a sequence number field which
 can be used to detect packet loss.
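
 A receiver might detect losses from the sequence numbers as sketched
 below in Python. The RTP sequence number is a 16-bit field [1], so the
 comparison has to allow for wrap-around. The function name and the
 policy of treating a backward jump of half the number space or more as
 reordering are assumptions of this example.

```python
from typing import List

SEQ_MODULUS = 1 << 16  # RTP sequence numbers are 16 bits wide


def missing_between(last_seq: int, new_seq: int) -> List[int]:
    """Sequence numbers skipped between two received packets."""
    gap = (new_seq - last_seq) % SEQ_MODULUS
    if gap == 0 or gap >= SEQ_MODULUS // 2:
        # Duplicate, or an old packet arriving late: nothing is missing.
        return []
    return [(last_seq + k) % SEQ_MODULUS for k in range(1, gap)]
```

 For example, receiving sequence number 1 directly after 65534 reports
 packets 65535 and 0 as lost.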
 H.261 uses the temporal redundancy of video to perform compression.
 This differential coding (or INTER-frame coding) is sensitive to
 packet loss. After a packet loss, parts of the image may remain
 corrupt until all corresponding MBs have been encoded in INTRA-frame
 mode (i.e. encoded independently of past frames). There are several
 ways to mitigate packet loss:
 (1) One way is to use only INTRA-frame encoding and MB level
 conditional replenishment. That is, only MBs that change
 (beyond some threshold) are transmitted.
 (2) Another way is to adjust the INTRA-frame encoding
 refreshment rate according to the packet loss observed by
 the receivers. The H.261 recommendation specifies that a
 MB is INTRA-frame encoded at least every 132 times it is
 transmitted. However, the INTRA-frame refreshment rate
 can be raised in order to speed the recovery when the
 measured loss rate is significant.
 (3) The fastest way to repair a corrupted image is to request
 an INTRA-frame coded image refreshment after a packet
 loss is detected. One means to accomplish this is for the
 decoder to send to the coder a list of packets lost. The
 coder can decide to encode every MB of every GOB of the
 following video frame in INTRA-frame mode (i.e. Full
 INTRA-frame encoded), or if the coder can deduce from the
 packet sequence numbers which MBs were affected by the
 loss, it can save bandwidth by sending only those MBs in
 INTRA-frame mode. This mode is particularly efficient in
 point-to-point connection or when the number of decoders
 is low. The next section specifies how the refresh
 function may be implemented.
 Note that method (1) is currently implemented in the VIC
 videoconferencing software [9]. Methods (2) and (3) are currently
 implemented in the IVS videoconferencing software [4].
5.1. Use of optional H.261-specific control packets
 This specification defines two H.261-specific RTCP control packets,
 "Full INTRA-frame Request" and "Negative Acknowledgement", described
 in the next section. Their purpose is to speed up refreshment of the
 video in those situations where their use is feasible. Support of
 these H.261-specific control packets by the H.261 sender is optional;
 in particular, early experiments have shown that the usage of this
 feature could have very negative effects when the number of sites is
 very large. Thus, these control packets should be used with caution.
 The H.261-specific control packets differ from normal RTCP packets in
 that they are not transmitted to the normal RTCP destination
 transport address for the RTP session (which is often a multicast
 address). Instead, these control packets are sent directly via
 unicast from the decoder to the coder. The destination port for
 these control packets is the same port that the coder uses as a
 source port for transmitting RTP (data) packets. Therefore, these
 packets may be considered "reverse" control packets.
 As a consequence, these control packets may only be used when no RTP
 mixers or translators intervene in the path from the coder to the
 decoder. If such intermediate systems do intervene, the address of
 the coder would no longer be present as the network-level source
 address in packets received by the decoder, and in fact, it might not
 be possible for the decoder to send packets directly to the coder.
 Some reliable multicast protocols use similar NACK control packets
 transmitted over the normal multicast distribution channel, but they
 typically use random delays to prevent a NACK implosion problem [2].
 The goal of such protocols is to provide reliable multicast packet
 delivery at the expense of delay, which is appropriate for
 applications such as a shared whiteboard.
 On the other hand, interactive video transmission is more sensitive
 to delay and does not require full reliability. For video
 applications it is more effective to send the NACK control packets as
 soon as possible, i.e. as soon as a loss is detected, without adding
 any random delays. In this case, multicasting the NACK control
 packets would generate useless traffic between receivers since only
 the coder will use them. But this method is only effective when the
 number of receivers is small. For example, in IVS [4] the H.261-specific
 control packets are used only in point-to-point connections or in
 point-to-multipoint connections when there are fewer than 10
 participants in the conference.
5.2. H.261 control packets definition
5.2.1. Full INTRA-frame Request (FIR) packet
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |V=2|P|   MBZ   |  PT=RTCP_FIR  |           length              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                              SSRC                             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 This packet indicates that a receiver requires a full encoded image
 in order to either start decoding with an entire image or to refresh
 its image and speed the recovery after a burst of lost packets. The
 receiver requests the source to force the next image in full "INTRA-
 frame" coding mode, i.e. without using differential coding. The
 various fields are defined in the RTP specification [1]. SSRC is the
 synchronization source identifier for the sender of this packet. The
 value of the packet type (PT) identifier is the constant RTCP_FIR
 (192).
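
 A FIR packet is small enough to build with a single struct.pack call.
 The Python sketch below assumes no padding (P=0) and uses the RTCP
 convention from the RTP specification [1] that the length field holds
 the packet length in 32-bit words minus one; the function name is
 hypothetical.

```python
import struct

RTCP_FIR = 192


def build_fir(ssrc: int) -> bytes:
    """Build an 8-octet Full INTRA-frame Request packet."""
    first_octet = 2 << 6  # V=2, P=0, MBZ=0
    length = 1            # two 32-bit words in total, minus one
    return struct.pack("!BBHI", first_octet, RTCP_FIR, length, ssrc)
```

 For an SSRC of 0x12345678 this yields the octets
 80 c0 00 01 12 34 56 78 (hexadecimal).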
5.2.2. Negative ACKnowledgements (NACK) packet
 The format of the NACK packet is as follows:
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |V=2|P|   MBZ   | PT=RTCP_NACK  |           length              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                              SSRC                             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |              FSN              |              BLP              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 The various fields V, P, PT, length and SSRC are defined in the RTP
 specification [1]. The value of the packet type (PT) identifier is
 the constant RTCP_NACK (193). SSRC is the synchronization source
 identifier for the sender of this packet.
 The two remaining fields have the following meanings:
 First Sequence Number (FSN): 16 bits
 Identifies the first sequence number lost.
 Bitmask of following lost packets (BLP): 16 bits
 A bit is set to 1 if the corresponding packet has been lost,
 and set to 0 otherwise. BLP is set to 0 only if no packet
 other than that being NACKed (using the FSN field) has been
 lost. BLP is set to 0x0001 if the packet corresponding to
 the FSN and the following packet have been lost, etc.
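
 The FSN and BLP fields could be derived from a list of lost sequence
 numbers as in the Python sketch below. The text above fixes only the
 meaning of the least significant BLP bit (the packet following FSN);
 mapping bit k-1 to packet FSN+k for k up to 16 is an assumption of
 this example, as are the function name and the requirement that the
 list be given in sending order.

```python
import struct
from typing import List

RTCP_NACK = 193
SEQ_MODULUS = 1 << 16


def build_nack(ssrc: int, lost: List[int]) -> bytes:
    """Build a 12-octet NACK covering lost[0] and up to 16 followers."""
    if not lost:
        raise ValueError("at least one lost sequence number is required")
    fsn = lost[0]
    blp = 0
    for seq in lost[1:]:
        offset = (seq - fsn) % SEQ_MODULUS
        if 1 <= offset <= 16:
            blp |= 1 << (offset - 1)
        # Losses further than 16 packets away need another NACK packet.
    first_octet = 2 << 6  # V=2, P=0, MBZ=0
    length = 2            # three 32-bit words in total, minus one
    return struct.pack("!BBHIHH", first_octet, RTCP_NACK, length, ssrc, fsn, blp)
```

 With this mapping, losing packets 100, 101 and 103 gives FSN=100 and
 BLP=0x0005.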
6. Security Considerations
 Security issues are not discussed in this memo.
Authors' Addresses
 Thierry Turletti
 INRIA - RODEO Project
 2004 route des Lucioles
 BP 93, 06902 Sophia Antipolis
 FRANCE
 EMail: turletti@sophia.inria.fr
 Christian Huitema
 MCC 1J236B Bellcore
 445 South Street
 Morristown, NJ 07960-6438
 EMail: huitema@bellcore.com
Acknowledgements
 This memo is based on discussion within the AVT working group chaired
 by Stephen Casner. Steve McCanne, Stephen Casner, Ronan Flood, Mark
 Handley, Van Jacobson, Henning G. Schulzrinne and John Wroclawski
 provided valuable comments. Stephen Casner and Steve McCanne also
 helped greatly with getting this document into readable form.
References
 [1] Schulzrinne, H., Casner, S., Frederick, R., and
 V. Jacobson, "RTP: A Transport Protocol for Real-Time
 Applications", RFC 1889, January 1996.
 [2] Sridhar Pingali, Don Towsley and James F. Kurose, A
 comparison of sender-initiated and receiver-initiated
 reliable multicast protocols, IEEE GLOBECOM '94.
 [3] Thierry Turletti, H.261 software codec for
 videoconferencing over the Internet, INRIA Research Report
 no 1834, January 1993.
 [4] Thierry Turletti, INRIA Videoconferencing tool (IVS),
 available by anonymous ftp from zenon.inria.fr in the
 "rodeo/ivs/last_version" directory. See also URL
 <http://www.inria.fr/rodeo/ivs.html>.
 [5] Frame structure for Audiovisual Services for a 64 to 1920
 kbps Channel in Audiovisual Services ITU-T (International
 Telecommunication Union - Telecommunication
 Standardisation Sector) Recommendation H.221, 1990.
 [6] Video codec for audiovisual services at p x 64 kbit/s
 ITU-T (International Telecommunication Union -
 Telecommunication Standardisation Sector) Recommendation
 H.261, 1993.
 [7] Digital Methods of Transmitting Television Information
 ITU-R (International Telecommunication Union -
 Radiocommunication Standardisation Sector) Recommendation
 601, 1986.
 [8] M.A Sasse, U. Bilting, C-D Schulz, T. Turletti, Remote
 Seminars through MultiMedia Conferencing: Experiences
 from the MICE project, Proc. INET'94/JENC5, Prague, June
 1994, pp. 251/1-251/8.
 [9] Steve McCanne, Van Jacobson, VIC Videoconferencing tool,
 available by anonymous ftp from ee.lbl.gov in the
 "conferencing/vic" directory.
 Document status: Proposed Standard, October 1996. Obsoleted by RFC 4587.
 Was draft-ietf-avt-h261 (avt WG).