Multiple audio streams, video streams and subtitles are stored in one container. These are played back together.
Where is the synchronization information stored in the container format? In meta-data or somewhere else?
It depends on the container. Different containers handle synchronization differently; the difference lies in how the data itself is stored.
For example, a container can simply store parts that should be presented at the same time next to each other, insert synchronization points every once in a while, or even give every data chunk a timestamp.
Some containers can be time-based, while others are just a stream of data. Of course, the latter has performance advantages, but synchronization can be off, and seeking isn't possible without a specific index (like in AVI).
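To make the last approach concrete, here is a toy sketch (not modelled on any real container format) of per-chunk timestamps that a demuxer could use to keep streams in sync:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    stream_id: int     # e.g. 0 = video, 1 = audio
    timestamp: float   # presentation time in seconds
    payload: bytes

def next_chunk_to_present(chunks):
    """Hand the decoder whichever chunk has the earliest presentation time."""
    return min(chunks, key=lambda c: c.timestamp)
```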
For example, the MPEG-2 Transport Stream, a very early container format, uses packetized elementary streams to store the data: you'll have one packetized stream for video and one for audio. To synchronize them, there is the Program Clock Reference field:
To enable a decoder to present synchronized content, such as audio tracks matching the associated video, at least once each 100 ms a Program Clock Reference, or PCR is transmitted in the adaptation field of an MPEG-2 transport stream packet. The PID with the PCR for an MPEG-2 program is identified by the pcr_pid value in the associated Program Map Table. The value of the PCR, when properly used, is employed to generate a system_timing_clock in the decoder. The STC decoder, when properly implemented, provides a highly accurate time base that is used to synchronize audio and video elementary streams.
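As a rough illustration of how that looks on the wire, here is a sketch that pulls the PCR out of a single 188-byte transport stream packet. It assumes a well-formed packet and skips the pcr_pid lookup in the Program Map Table:

```python
TS_PACKET_SIZE = 188
PCR_CLOCK_HZ = 27_000_000  # the MPEG-2 system clock runs at 27 MHz

def extract_pcr(packet: bytes):
    """Return the PCR carried in a TS packet in seconds, or None."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != 0x47:   # 0x47 = sync byte
        return None
    adaptation_field_control = (packet[3] >> 4) & 0x3
    if not (adaptation_field_control & 0x2):                 # no adaptation field
        return None
    if packet[4] < 7 or not (packet[5] & 0x10):              # PCR_flag not set
        return None
    b = packet[6:12]
    pcr_base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
    pcr_ext = ((b[4] & 0x01) << 8) | b[5]
    return (pcr_base * 300 + pcr_ext) / PCR_CLOCK_HZ         # 33-bit base, 9-bit extension
```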
The popular MP4 container is based on the QuickTime File Format and therefore shares a common base of features. Its specification can be found here. Synchronization is not part of the metadata; it's part of the actual data itself. I haven't looked into the details, though.
When using RTP Hint Tracks, an MP4 file can be prepared for streaming by indicating which packets correspond to which timestamps in the media. For example, a hint track will say: this packet is made up of this video data and that audio data. Hint tracks are ignored for local playback, though.
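To get a feel for that shared QuickTime heritage, here is a hedged sketch that walks the top-level box ("atom") structure of an MP4 file; the file name is just a placeholder, and 64-bit box sizes are not handled:

```python
def iter_boxes(data: bytes):
    """Yield (box_type, payload) for each top-level ISO BMFF / QuickTime box.

    A minimal sketch: skips 64-bit sizes (size == 1) and "to end of file"
    sizes (size == 0), and does not recurse into container boxes.
    """
    offset = 0
    while offset + 8 <= len(data):
        size = int.from_bytes(data[offset:offset + 4], "big")
        box_type = data[offset + 4:offset + 8].decode("ascii", "replace")
        if size < 8:
            break
        yield box_type, data[offset + 8:offset + size]
        offset += size

# Typically prints e.g. ftyp, moov, mdat -- the interleaved media samples
# (and, in a hinted file, the hint samples) live inside mdat.
with open("example.mp4", "rb") as f:   # placeholder file name
    for box_type, _payload in iter_boxes(f.read()):
        print(box_type)
```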
AVI containers are based on RIFF. They store the individual data frames in "chunks". As the data itself can be split into individual frames (e.g. video frames, audio frames), this even works with variable bitrate content, because each frame covers the same amount of time even when its size in bytes varies. This document has an explanation of the AVI format in detail. The important aspect of synchronization is that the AVI is correctly multiplexed.
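As a sketch of that chunked RIFF layout (assuming the whole file fits in a buffer, and without descending into LIST sub-chunks such as 'movi'):

```python
import struct

def iter_riff_chunks(data: bytes, offset: int = 12):
    """Yield (fourcc, payload) for the chunks of a RIFF/AVI buffer.

    A minimal sketch: starts after the 12-byte 'RIFF <size> AVI ' header
    and stays at the top level.
    """
    while offset + 8 <= len(data):
        fourcc = data[offset:offset + 4].decode("ascii", "replace")
        size, = struct.unpack_from("<I", data, offset + 4)
        yield fourcc, data[offset + 8:offset + 8 + size]
        offset += 8 + size + (size & 1)   # chunks are padded to even lengths

# With a constant frame duration, the n-th video chunk (FourCC '00dc') in
# the 'movi' list is presented at n / fps seconds, so synchronization
# depends on the chunks being multiplexed in the right order.
```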
The Matroska container is similar to MP4 and others. It is entirely timecode-based, as you can see in this diagram. The data itself (e.g. video, audio) is split into Clusters and then Blocks. The timecode is more-or-less treated like a presentation timestamp.
One thing that I do want to mention, however, to avoid confusion, is the Timecode. The quick eye will notice that there is one Timecode shown per Cluster, and then another within the Block structure itself. The way that this works is that the Timecode in the Cluster is relative to the whole file. It is usually the Timecode that the first Block in the Cluster needs to be played at. The Timecode in the Block itself is relative to the Timecode in the Cluster. For example, let's say that the Timecode in the Cluster is set to 10 seconds, and you have a Block in that Cluster that is supposed to be played 12 seconds into the clip. This means that the Timecode in the Block would be set to 2 seconds.
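In code, that relationship is just an addition of the two Timecodes, scaled by the segment's TimecodeScale (1 ms per tick by default); a quick sketch:

```python
DEFAULT_TIMECODE_SCALE_NS = 1_000_000   # Matroska default: one tick = 1 ms

def block_time_seconds(cluster_timecode: int, block_timecode: int,
                       timecode_scale_ns: int = DEFAULT_TIMECODE_SCALE_NS) -> float:
    """Absolute presentation time of a Block, following the rule quoted above."""
    return (cluster_timecode + block_timecode) * timecode_scale_ns / 1e9

# The example from the quote: a Cluster at 10 s with a Block meant for 12 s
# stores a relative Block timecode worth 2 s of ticks.
print(block_time_seconds(cluster_timecode=10_000, block_timecode=2_000))  # 12.0
```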
The Ogg container, which is not that popular, handles interleaving as follows:
Grouping defines how to interleave several logical bitstreams page-wise in the same physical bitstream. Grouping is for example needed for interleaving a video stream with several synchronised audio tracks using different codecs in different logical bitstreams.
However, Ogg knows nothing about the codec and has no concept of "time":
Ogg does not have a concept of 'time': it only knows about sequentially increasing, unitless position markers. An application can only get temporal information through higher layers which have access to the codec APIs to assign and convert granule positions or time.
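As a small illustration of what that means in practice: for a Vorbis stream, the codec layer defines the granule position as a running PCM sample count, so only with that codec knowledge can a player turn it into seconds (the 48 kHz rate below is just an example value):

```python
def vorbis_granule_to_seconds(granulepos: int, sample_rate: int = 48_000) -> float:
    """Convert an Ogg page's granule position to seconds for a Vorbis stream.

    Ogg itself only stores the unitless granule position; the Vorbis mapping
    says it counts PCM samples, so dividing by the sample rate gives a time.
    """
    return granulepos / sample_rate

print(vorbis_granule_to_seconds(1_440_000))  # 30.0 seconds at 48 kHz
```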