Ogg Theora Cook Book

A Bit of Theory

Editing Theora videos using command line tools like Oggz Tools and Ogg Video Tools does not require knowledge about the intricate details of how Theora works. However, because these command line tools process Theora files at a very low level, some side effects show up that cannot be explained without going into some of the details of the Theora format.

If you think you can live with minor inaccuracies caused by editing, feel free to skip reading this chapter.

Anatomy of a Theora Video

A video file normally consists of a video stream and an audio stream. To store both streams in one file, a so-called container format is used. The container used with Theora video is the Ogg container, which holds a Theora video stream and one or more Vorbis audio streams. Audio and video streams are stored interleaved: every stream is segmented into several blocks of data of nearly equal size. These blocks are called pages.

Every page has a timestamp that indicates where the page is placed within the stream.

The video and audio streams are interleaved by concatenating the pages of both streams in ascending order of their timestamps.
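
Conceptually, interleaving is just a merge of two sequences of pages sorted by time. The following Python sketch only illustrates the idea, assuming the time of every page has already been computed; a real multiplexer works with granule positions instead (see below).

    import heapq

    def interleave(video_pages, audio_pages):
        """Merge two lists of (time_in_seconds, raw_page_bytes) tuples
        into a single list ordered by ascending time.

        Conceptual sketch of multiplexing only: a real muxer operates
        on granule positions and codec-specific interpreters.
        """
        return list(heapq.merge(video_pages, audio_pages, key=lambda page: page[0]))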

Demultiplexing

Technically, splitting a video file into its video and audio streams is easy: the two streams can be separated simply by collecting the video and audio pages into different files. The process of splitting a video file into its streams is called demultiplexing.

As all information required for playback is contained within the streams themselves, and the Ogg container does not add any further information, each of the split files constitutes a standards-compliant Ogg stream in its own right and can be read by any Theora/Vorbis aware video or audio player.
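
Because every page carries the serial number of the stream it belongs to, the idea behind demultiplexing can be sketched in a few lines of Python. The routine below is an illustration only, assuming a well-formed file; for real work, the Oggz Tools and Ogg Video Tools mentioned above are the right choice.

    import struct

    def demultiplex(path):
        """Split an Ogg file into one output file per contained stream.

        Every page starts with a 27-byte header: the capture pattern
        'OggS', version, header type, granule position (8 bytes), the
        stream serial number (4 bytes), a page sequence number, a CRC
        and the number of segments, followed by the segment table.
        Pages are copied verbatim into per-stream output files.
        """
        data = open(path, 'rb').read()
        outputs = {}  # serial number -> output file handle
        pos = 0
        while pos < len(data):
            assert data[pos:pos + 4] == b'OggS', 'lost page sync'
            serial, = struct.unpack_from('<I', data, pos + 14)
            nsegs = data[pos + 26]
            body_len = sum(data[pos + 27:pos + 27 + nsegs])
            page_len = 27 + nsegs + body_len
            if serial not in outputs:
                outputs[serial] = open('stream-%08x.ogg' % serial, 'wb')
            outputs[serial].write(data[pos:pos + page_len])
            pos += page_len
        for handle in outputs.values():
            handle.close()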

Page Timestamps

As mentioned earlier, streams are stored segmented into several pages. A page has a header that holds so-called metadata describing this part of the video/audio stream. The header includes timing information, a serial number that uniquely identifies the stream, a page number and some other information.

According to the Ogg standard, timing information for every page is given by the granule position, a 64-bit value contained in the header. How the granule position relates to an actual time position, such as the number of milliseconds from the start of the video, is entirely defined by the stream's codec and is not covered by the Ogg container specification. For that reason, software handling Ogg files needs a granule position interpreter to handle the file correctly. This interpreter needs to be aware of the codec and stream specific information to produce timing information that can be compared across different streams.

While not defining how to interpret granule positions, the Ogg standard specifies that all pages within an Ogg file must be stored in ascending order of the corresponding time position. So any tool that cuts or concatenates streams needs working granule position interpreters for every contained stream in order to interleave the pages correctly.
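
As an illustration, granule position interpreters for the two codecs involved here can be sketched as follows. The Theora mapping is simplified (exact off-by-one handling differs between codec versions); the key frame granule shift and the frame rate are values taken from the Theora stream header.

    def theora_time(granulepos, fps, kfg_shift):
        """Convert a Theora granule position to seconds (roughly).

        The upper bits hold the frame number of the last key frame,
        the lower kfg_shift bits the number of frames since that
        key frame.
        """
        key_frame = granulepos >> kfg_shift
        delta = granulepos & ((1 << kfg_shift) - 1)
        return (key_frame + delta) / fps

    def vorbis_time(granulepos, sample_rate):
        """A Vorbis granule position simply counts PCM samples."""
        return granulepos / sample_rate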

Encapsulating Codec Data

Pages are of nearly equal size by default (around 4096 bytes). However, audio and video packets created by a specific codec usually do not fit exactly into a page. Audio packets are usually much smaller. Video packets can vary greatly in size, smaller or larger than a page, depending on various factors.

Data produced by the video and audio codecs is first encapsulated into Ogg packets. These packets are then placed into the Ogg pages. A packet can either be split over multiple pages or combined with other packets to fill up an otherwise underfull page, as required.
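
The mapping of packets onto pages is expressed with so-called lacing values in the page header's segment table: a packet occupies a run of 255-byte segments terminated by a shorter one, and a packet whose last lacing value on a page is 255 continues on the next page. The small Python sketch below only illustrates this bookkeeping rule.

    def lacing_values(packet_length):
        """Compute the lacing values describing one packet inside a page.

        Every value of 255 means 'the packet continues'; the first value
        below 255 (possibly 0) marks the packet's last segment.
        """
        values = [255] * (packet_length // 255)
        values.append(packet_length % 255)
        return values

    print(lacing_values(600))  # a 600-byte packet -> [255, 255, 90]
    print(lacing_values(510))  # a 510-byte packet -> [255, 255, 0]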

Encapsulating Theora Video Data

Theora video data created by the encoder consists of two types of Ogg packets: so-called key frames (often also called I-Frames), which are full pictures, and P-Frames, which carry only the differences between the previous and the current picture.

In order to display a given frame at a given time, the decoder must locate the preceding key frame and decode every frame (including that key frame) up to the time position of the given frame.
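
The following sketch illustrates this dependency. The frame objects and their decode() method are hypothetical and do not correspond to a real Theora decoder API; the point is merely that decoding has to start at a key frame.

    def decode_frame_at(frames, target_index):
        """Decode the frame at target_index from a list of encoded frames.

        'frames' is a hypothetical list of objects with an is_key_frame
        flag and a decode(previous_picture) method.
        """
        # Walk backwards to the last key frame at or before the target,
        # because every P-Frame only stores the difference to its
        # predecessor.
        start = target_index
        while start > 0 and not frames[start].is_key_frame:
            start -= 1

        # Decode forward from that key frame up to the requested frame.
        picture = None
        for frame in frames[start:target_index + 1]:
            picture = frame.decode(picture)
        return picture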

To be able to decode a video at all, the decoder needs information about the stream itself, such as the video frame size. This information is placed into the header packets at the beginning of the stream.

Encapsulating Vorbis Audio Data

Vorbis audio data created by the encoder carries a certain number of audio samples. A sample is a unit of audio data. The number of samples per Ogg packet is not arbitrary: each packet uses one of two possible block sizes, both defined in the stream header.

Similar to video data, audio data packets depend on each other: to decode one audio packet, the previous packet is needed.

The audio decoder also depends on stream parameters such as the sample rate and bitrate, which are stored in header packets at the beginning of the stream.
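
As a sketch of where these parameters live, the following Python reads the channel count, sample rate and the two block sizes from a Vorbis identification header packet (the first of the three Vorbis header packets). The field offsets follow the Vorbis I specification, but this is an illustration rather than a robust parser.

    import struct

    def parse_vorbis_id_header(packet):
        """Extract basic stream parameters from the Vorbis identification
        header: packet type 0x01, the string 'vorbis', version, channel
        count, sample rate, three bitrate fields and one byte holding
        the two block size exponents.
        """
        assert packet[0] == 0x01 and packet[1:7] == b'vorbis'
        channels = packet[11]
        sample_rate, = struct.unpack_from('<I', packet, 12)
        blocksizes = packet[28]
        block_0 = 1 << (blocksizes & 0x0F)  # short block, e.g. 256 samples
        block_1 = 1 << (blocksizes >> 4)    # long block, e.g. 2048 samples
        return channels, sample_rate, block_0, block_1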

Ogg Skeleton

As was stated before, a video stream cannot start at an arbitrary time position because it consists of key frames (I-Frames) and delta frames (P-Frames). In addition, audio is stored in packets, which impose a certain timing granularity.

To ensure synchronization and a correct starting point, Ogg Skeleton pages carry information about the starting position of each stream. A decoder that reads the Ogg Skeleton information can then seek to the correct audio and video positions and start playing from there.
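
As a rough sketch of what such a decoder looks at: the starting point is stored in the Skeleton 'fishead' packet as a 64-bit numerator/denominator pair called the presentation time. The following Python illustrates reading it, assuming the Skeleton 3.0 packet layout (an 8-byte magic, a 4-byte version and then the presentation time).

    import struct

    def skeleton_presentation_time(fishead_packet):
        """Read the presentation time in seconds from an Ogg Skeleton
        'fishead' packet: the point from which playback of all streams
        should start.
        """
        # Magic is the string 'fishead' followed by a zero byte.
        assert fishead_packet[:8] == b'fishead\x00'
        numerator, denominator = struct.unpack_from('<qq', fishead_packet, 12)
        return numerator / denominator if denominator else 0.0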