Multimedia streams are difficult to transmit in full quality and in real-time due to their size and the constraints imposed by their timely presentation. Media streaming is especially hard over slow channels with variable throughput - such as typical Internet connections. We introduce a new iterative method for the transmission of media streams. The idea is to match each iteration of the transmission process exactly to the play-out time of the respective media stream, thus allowing synchronous presentation of the stream's contents during the transmission. The quality of the playback improves with the number of completed iterations, from rough during the first pass to full quality at the end. The WebMovie system implements this iterative streaming scheme for video data. After an introduction of the system, performance measurements are discussed and compared to conventional methods of movie transmission.
Iterative transmission; real time transmission; full quality transmission; layered media encoding; Internet; WWW; multimedia; media streams; scaleable media streams; refineable media streams; data segmentation; WebMovie.
The World Wide Web (WWW) has evolved from a static document-oriented Internet service to a feature-rich multimedia system. Audio and video have become part of many Web pages; however, their integration has not been completed yet: many issues in the area of timely delivery to the client remain unsolved.
The main problem for the transmission of multimedia streams via the Web is the severe limitation in bandwidth. Although the available data rate may be exploited efficiently with sophisticated compression techniques, even media clips with a playback duration of a few seconds (e.g. short audio or video clips) are typically of considerable size (≈ 10^6 bytes). Hence, the transmission time is typically much longer than the play-out time of the data. Consequently, it is impossible to present the data in real-time during a transfer that follows the commonly used file transmission paradigm: typical systems transfer the whole clip before they start the playback.
In order to make real-time transmission of multimedia data possible despite its volume and despite low-bandwidth connections, recent software systems like WebAudio/WebVideo (see [19]), WaveVideo (see [4]) and many others avoid transmitting media streams (e.g. video) in full quality. Instead, they adjust the media quality - and with it the amount of transferred data - to the available bandwidth. In other words, the lower the bandwidth, the poorer the temporal and/or spatial resolution of the playback. Furthermore, no error correction is performed: retransmissions due to packet losses would impair the presentation.
Media streams suitable for these schemes are called scaleable media streams because the amount of data to be transmitted is scaled according to the currently available bandwidth. The recent MPEG-2 standard (see [8]) supports the transmission of scaleable video streams by a layered encoding of the data. Scaleable media streams are considered well-suited for many applications demanding live transmission of video and audio streams. For other applications with clients requesting media data in full quality (e.g. video on demand), scaleable media streams are not sufficient due to the inevitable scaling-loss.
In a sense, our approach is an extension of scaleable media streams: based on the idea of adjusting the stream to the available throughput, the streams are filtered to achieve real-time presentation during the transmission. However, the data filtered out is not discarded, but transferred later to ensure complete delivery (see listing 1).
...
Open (dataChannel, UDP);                      // movie data: unreliable, no retransmission delays
Open (cntlChannel, TCP);                      // acknowledgements and control: reliable, in-order
ResetAcks ();
while (unsent frame available) do
    nextFrame := SelectNextFrame (bandwidth); // pick the frame due next at the current data rate
    Transmit (dataChannel, nextFrame);
    if (Receive (cntlChannel, ack, frame))    // non-blocking check for a client acknowledgement
        UpdateAcks (frame);                   // unacknowledged frames remain "unsent" (see 3.4)
        bandwidth := EstimateBandwidth ();
    fi
od
Close ();
...
listing 1: code fragment of a server for the iterative transmission of movies (using vertical segmentation only; see section 3);
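For illustration, the following Python sketch renders listing 1 executable under simplifying assumptions: frames are pre-encoded byte strings, an acknowledgement is a 4-byte frame index on the TCP control channel, and the bandwidth estimator is a stub. The names and the wire format are ours, not the actual WebMovie protocol.

import select
import socket

def estimate_bandwidth():
    # Placeholder: a real server would derive this from probe/echo round
    # trips (see section 3.3); we assume a constant 28.8 kbps modem here.
    return 28_800 / 8   # bytes per second

def serve_movie(frames, client_host, data_port, ctrl_sock):
    """Iteratively transmit `frames` (encoded frames in play-out order) until
    every frame has been acknowledged. `ctrl_sock` is a connected TCP socket."""
    data_ch = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # data channel: UDP
    acked = set()                                   # ResetAcks()
    bandwidth = estimate_bandwidth()
    pending = list(range(len(frames)))              # frame indices still "unsent"

    while pending:                                  # unsent frame available
        idx = pending.pop(0)                        # simplified SelectNextFrame(bandwidth)
        data_ch.sendto(idx.to_bytes(4, "big") + frames[idx], (client_host, data_port))
        ready, _, _ = select.select([ctrl_sock], [], [], 0)   # non-blocking ack poll
        if ready:
            ack = int.from_bytes(ctrl_sock.recv(4), "big")    # assumes one complete 4-byte ack per read
            acked.add(ack)                                    # UpdateAcks(frame)
            bandwidth = estimate_bandwidth()
        if idx not in acked:
            pending.append(idx)                     # not acked yet: retry in a later iteration
    data_ch.close()

A real server would additionally pace the sends according to the estimated bandwidth, so that each iteration matches the play-out time of the movie.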
Transmission in full quality even through low-bandwidth connections is achieved by folding the transfer time into the presentation time: if a prerecorded media stream is too large to be streamed in real-time due to limited bandwidth, it is broken up into several layers. The layers have similar sizes, but differ in their significance for the presentation. Each of them represents a partial stream and needs less bandwidth for timely transmission than the original stream (see figure 1).
figure 1: principle of iterative streaming
In the first iteration we deliver as many layers as possible according to the available bandwidth - similar to scaleable media streams (see [4], [5]). But in contrast to scaleable media streams, the transmission does not stop here: in subsequent iterations, the layers neglected in the first iteration are delivered.
The most significant layer, carrying the basic information, is transmitted first, followed by the others in order of descending significance for the presentation. The first layer's data allows the presentation of the original stream's contents - even though in rather limited quality. The subsequent layers' information refines the basic information, improving the quality of the playback step by step. Finally, the arrival of the last layer completes the data for the client and allows presentation and even subsequent processing in full quality. Hence, a stream that allows an iterative transmission is called a refineable media stream; an exact definition is given in 2.1.
Although the methods for iterative streaming introduced below were initially developed as a transmission technique for media data via low-bandwidth channels, they have turned out to be quite useful in other fields of application, too: if not the complete data of a stream is required, but only a characteristic fingerprint, storage capacity and time may be saved by storing only a limited number of layers instead of the complete stream. Moreover, the reduced data volume of these fingerprints, e.g. in movie databases, may improve search times and therefore the access speed.
The rest of the paper is organized as follows: section 2 introduces new techniques for the iterative streaming of time-critical media data. Section 3 describes the recently developed WebMovie system, which implements the streaming of video data using vertical segmentation, presents our experience with the implementation and operation of this system, and reports performance measurements. Section 4 summarizes the most important issues presented and draws conclusions. Further work is outlined in section 5.
The algorithms introduced in this paper may be applied to almost all prerecorded real-time media streams, the most common cases probably being video and audio streams. Again, the idea is not to transmit the respective stream byte by byte, but folded in time. Since the data is prerecorded, all data excluded from the first iteration of the transmission can be transmitted in further phases. The stream is presented several times during the transmission in steadily improving quality: it is refined iteratively. This allows users to follow as many iterations of the presentation as they like and to interrupt the transmission at any time - even before it has been completed. During the last iteration, towards the completion of the transmission, the presentation reaches full quality. Transmitting media streams this way unfolds the play-out time of the streams' contents into the transmission time (see figure 1).
After a short definition of refineable media streams, several approaches to the iterative transmission of these streams are presented.
A media stream is called refineable if and only if it fulfills the following conditions:
Probably the easiest way to split the information of a refineable media stream into several layers lies in the time domain: a subsampling of the respective stream yields the first layer. The resulting stream resembles a scaleable media stream. All further layers are built in the same way as the first layer.² Of course, each layer contains only samples the other layers do not contain (see figure 2); for the sake of efficient error correction, exceptions from this rule may be made (see 3.4). This subsampling method is referred to as vertical segmentation.
figure 2: vertical segmentation of a refineable generic media stream
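To make the idea of figure 2 concrete, the following sketch splits a frame sequence into k layers by round-robin subsampling. This corresponds to a static vertical segmentation; WebMovie composes its layers dynamically (see section 3). The function name and the representation of frames are illustrative.

def vertical_layers(frames, k):
    """Temporal subsampling: frame i is assigned to layer (i mod k), so the
    layers have similar sizes, each spans the whole play-out time, and no
    two layers share a frame."""
    layers = [[] for _ in range(k)]
    for i, frame in enumerate(frames):
        layers[i % k].append((i, frame))   # keep the index for reassembly
    return layers

# Example: with 10 frames and k = 3, layer 0 holds frames 0, 3, 6, 9 and can
# be played back alone as a (jerky) preview of the whole stream.
layers = vertical_layers([f"frame{i}" for i in range(10)], 3)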
There are two ways to disassemble a media stream into layers according to the vertical segmentation: a static scheme, in which the subsampling rate is fixed in advance based on an estimate of the available bandwidth, and a dynamic scheme, in which the layers are composed on the fly during the transmission.
With the static approach there is no way to adapt the transmission parameters (i.e. the subsampling rate) to the actually available bandwidth during the transmission. If the bandwidth decreases significantly - e.g. due to common fluctuations of bandwidth somewhere in the Internet - concurrent real-time playback and transmission may no longer be achievable. On the other hand, if the bandwidth increases above the estimate, it cannot be exploited and will be wasted.
The more sophisticated and flexible approach to vertical segmentation, the dynamic scheme, offers significantly more potential. Due to the continual observation of several network parameters (e.g. bandwidth), the dynamic variant is more complicated. Depending on the type of media and the implementation, some problems may arise, as discussed for video streams in section 3.
Using vertical segmentation, the transmission of a media stream may be performed concurrently with the presentation of its contents, which comes at the price of a jerky presentation during the first iterations. After the completion of the transmission, the stream's playback equals the original.
Another method that may be used to disassemble refineable media streams is to break up the information atoms themselves. As postulated in 2.1, an atom itself will not be broken up any further, but its representation may be. In other words, the total amount of information stored in an atom is not sent in one single step, but is distributed over several layers. In effect, it is not the media stream itself that is refined, but the single information atoms: first, a rough approximation of each atom is transmitted, which is then refined in subsequent iterations of the transmission. This procedure can be applied to each atom and is referred to as horizontal segmentation (see figure 3).
figure 3: horizontal segmentation of a refineable generic media stream
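As one purely illustrative realization of horizontal segmentation - not the encoding used by WebMovie - the sketch below distributes an atom consisting of 8-bit samples over eight layers of bit-planes, most significant plane first: the first layer yields a rough approximation of the atom, and every further layer halves the quantization error.

def horizontal_layers(atom, bits=8):
    """Split one information atom (a list of 8-bit samples) into `bits`
    layers: layer j carries bit-plane (bits-1-j) of every sample."""
    return [[(s >> (bits - 1 - j)) & 1 for s in atom] for j in range(bits)]

def approximate(planes, bits=8):
    """Reassemble the best-known approximation from the planes received so far."""
    atom = [0] * len(planes[0])
    for j, plane in enumerate(planes):
        for i, bit in enumerate(plane):
            atom[i] |= bit << (bits - 1 - j)
    return atom

# After the first plane a sample value of 200 is reconstructed as 128;
# with the first four planes it is already 200 & 0b11110000 = 192.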
Compared to the vertical segmentation, the advantage of this method is improved scaleability: not only may whole atoms be omitted in a layer, but the size of the atoms themselves may be scaled. This allows an almost perfect adaptation of the data rate to the available bandwidth.
However, with most easy-to-implement approaches to horizontal segmentation (e.g. defining the color components of video frames as layers), the limits of adapting the transmission to the bandwidth will be reached rather soon. Hence, the information atoms should be defined carefully.
Again, iterative transmission of a media stream and its concurrent presentation can be achieved with horizontal segmentation. Since the resolution of the single information atoms is rather low at the beginning of the transmission, the playback of the stream's contents will be rather poor during the first iterations - depending on the available bandwidth. Consequently, the first replays may look as if a low-pass filter had been applied to the atoms. As with vertical segmentation, the quality of the presentation rapidly improves - up to that of the original stream after the transmission has been completed.
A third approach to segmentation, the hybrid segmentation, combines the two techniques presented above and is naturally the most flexible. Again, there are two possibilities to perform the merging: the VF-hybrid segmentation first segments the stream vertically and then breaks up the atoms within each layer (see figure 4), whereas the HF-hybrid segmentation breaks up the information atoms first and then subsamples the resulting partial streams (see figure 5).
figure 4: VF-hybrid segmentation of a refineable generic media stream
figure 5: HF-hybrid segmentation of a refineable generic media stream
Combining vertical and horizontal segmentation, the hybrid segmentation provides a flexible and efficient strategy for disassembling media streams. Alleviating the drawbacks of both techniques, the advantages of both may be exploited: the gaps between the single information atoms transmitted in the first iteration (due to the subsampling of the vertical segmentation) are minimized, while the initial quality (i.e. resolution) of these atoms (which would be rather poor if only horizontal segmentation were used) is maximized. The segmentation is said to have a finer granularity, which allows reacting carefully to fluctuations of the available bandwidth. Furthermore, the hybrid segmentation allows real-time transmission of media streams in full quality via connections with very low bandwidth, where vertical or horizontal segmentation alone would fail.
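A minimal sketch of a VF-style hybrid schedule - reusing the frame/plane decomposition of the two sketches above, both of which are our own illustrations - might enumerate (frame, bit-plane) pairs in a fixed order as follows; a real implementation would pick the next pair adaptively from the measured bandwidth.

def hybrid_schedule(n_frames, k, planes):
    """Yield (frame index, bit-plane) pairs in transmission order: coarse
    planes of all vertical layers first, so early iterations narrow the
    temporal gaps while later ones refine the atoms themselves."""
    for plane in range(planes):            # horizontal refinement, coarse planes first
        for layer in range(k):             # vertical layers in order of significance
            for frame in range(layer, n_frames, k):
                yield frame, plane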
The advantages of hybrid segmentation come at the price of significant overhead and a complex implementation: sophisticated bookkeeping is needed to track which data has been sent and which has already been acknowledged. If the implementation uses an unreliable transport protocol like UDP - as strongly recommended to avoid retransmission delays - and performs the necessary error correction on its own,⁷ even the error correction becomes complicated.
The hybrid segmentation is the most sophisticated, flexible, and elegant way to perform the disassembly of media streams. On the other hand, the implementation, especially regarding the necessary error correction, is not trivial at all: the retransmission overhead grows dramatically and the algorithm becomes inefficient. Moreover, it has turned out that for practical use, such as the transmission of common video data (e.g. in the Common Intermediate Format) through low-bandwidth channels (e.g. current V.34 modems at 33.6 kbps), either vertical or horizontal segmentation alone suffices.
Recently, considerable effort has been invested in the field of real-time transmission of bulky media streams via low-bandwidth networks - namely the Internet. Basically, the proposed systems use techniques such as bandwidth negotiation (with and without renegotiation) or layered respectively scaleable media streams (see [4], [10]). All of them sacrifice full-quality delivery of the data if the available bandwidth drops below a certain limit. Instead, only as much data is delivered as the available bandwidth allows for - at the price of a reduced quality of the presentation.
We introduced a new class of algorithms for transmitting time-critical media data in both full quality and real-time via low-bandwidth connections. Three subclasses of these methods were outlined.
Having introduced refineable media streams in section 2, we now give an overview of the WebMovie system (see [12], [13]). WebMovie implements refineable media streams with vertical segmentation and focuses on the full-quality delivery of video data. The information atoms (see 2.1) are defined as the single frames of a video stream.
The WebMovie system consists of three components:
The server and the client communicate via two channels: the data channel carries the movie data, the control channel the acknowledgement and control information. The control channel uses the TCP protocol, which ensures correct in-order delivery of all data transmitted - essential for the sensitive control data. The data channel uses unreliable UDP, which allows WebMovie to implement its own sophisticated error-correction scheme (see 3.4); moreover, the server may choose the data rate freely.
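The client side of this two-channel setup might look as follows; as with the server sketch after listing 1, the port handling and the packet format are our illustrative assumptions, not the WebMovie protocol.

import socket

def run_client(server_host, ctrl_port, data_port, n_frames):
    """Receive frames via unreliable UDP and acknowledge each one via the
    reliable TCP control channel."""
    ctrl_ch = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # control: TCP
    ctrl_ch.connect((server_host, ctrl_port))
    data_ch = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # data: UDP
    data_ch.bind(("", data_port))

    frames = {}
    while len(frames) < n_frames:
        packet, _ = data_ch.recvfrom(65535)          # one frame per datagram (simplified)
        idx = int.from_bytes(packet[:4], "big")
        frames[idx] = packet[4:]                     # later arrivals refine/replace earlier ones
        ctrl_ch.sendall(idx.to_bytes(4, "big"))      # acknowledge over the reliable channel
        # a real client would decode and present the frames here, in play-out order
    return frames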
Several data compression standards are available to encode the image data. The most important ones are methods for compressing single frames, like GIF (see [3]) and JPEG (see [7]), on the one hand, and moving-picture compression techniques, like H.261 (see [9]) or MPEG (see [6]), on the other hand.
For the transmission of time-critical media streams, the reduction of the data volume in advance of its transmission is considered the most important issue. Standards like H.261 and particularly MPEG seem to offer the best solution: by eliminating the temporal redundancy between the single frames of a video stream almost completely, compression ratios from 50:1 (H.261, see [16]) up to 200:1 (MPEG, see [16]) may be achieved. The shortcoming of these compression algorithms is that they introduce strong dependencies between the single frames. In other words, decompressing one (differentially coded) frame typically requires the decompression of other (intra-coded) frames, too. The better the compression ratio, the stronger these dependencies. True random access to the single frames of a video stream is not possible.⁸
We decided to use JPEG, a compression standard for single frames. Its compression ratio ranges from 8:1 (almost original quality) up to 50:1 (poor quality). JPEG does not introduce any dependencies between the single frames, allowing the transmission of the frames in random order - as needed for vertical segmentation.
In order not to forego the drastically improved compression ratio of temporal redundancy reduction, we use an action-block algorithm in combination with JPEG:⁹ each 8x8 block of a JPEG-coded frame is compared to the corresponding block of a nearby frame that has already been acknowledged by the client. If no significant changes are detected, the respective block is not transmitted but reused from the already transmitted frame. This additional method allows a significant reduction of the data volume, although we do not achieve compression ratios in the range of MPEG, which uses motion vectors in addition to differential encoding.
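The block comparison at the heart of the action-block algorithm can be sketched as follows, with frames as 2-D 8-bit arrays; for simplicity the comparison is done on raw pixels rather than on JPEG-coded blocks, and the threshold value and the mean-absolute-difference metric are our assumptions - the text only states that "no significant changes" suppress a block's transmission.

import numpy as np

def changed_blocks(frame, reference, threshold=10.0, bs=8):
    """Compare each 8x8 block of `frame` with the co-located block of
    `reference` (a frame the client has already acknowledged) and return
    the coordinates of blocks that changed significantly; only those are
    transmitted, the client reuses the remaining blocks unchanged."""
    h, w = frame.shape
    coords = []
    for y in range(0, h - h % bs, bs):
        for x in range(0, w - w % bs, bs):
            diff = np.abs(frame[y:y+bs, x:x+bs].astype(int) -
                          reference[y:y+bs, x:x+bs].astype(int))
            if diff.mean() > threshold:
                coords.append((y, x))
    return coords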
The main difference from H.261 and MPEG is the on-the-fly removal of the temporal redundancy. Moreover, the reference frame for each differentially encoded frame may be chosen on the fly, considering the frames that have already been transmitted and the available bandwidth. The latter is an important difference from the schemes found in the literature: our compression algorithm refers to the last frame transmitted, which is not necessarily the predecessor in the movie.
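One plausible reference-selection rule under this scheme - a variant we sketch for illustration, since the text specifies only that an already-acknowledged frame near the current one is used - is to pick the acknowledged frame closest in play-out order:

def choose_reference(idx, acked):
    """Pick the already-acknowledged frame closest in play-out order to
    frame `idx` as the reference for its differential (action-block)
    encoding; None means the frame must be intra-coded."""
    return min(acked, key=lambda a: abs(a - idx)) if acked else None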
It has turned out that some problems arise if vertical segmentation is used in combination with dynamic layer generation: the frame to be transmitted next has to be chosen carefully in order to make its just-in-time delivery possible (as required for real-time transmission). This calculation crucially depends on the estimated bandwidth and on the estimated size of the frame to be transmitted next:
The bandwidth is estimated with probe-echo-based methods measuring the round-trip delay, as proposed in many publications (see e.g. [18]).
Estimating the size of the frame to be delivered next is significantly more complicated: the size of a frame depends on the reference frame used for its differential encoding (see 3.2). One would have to encode all frames that are candidates for being sent next, just to obtain their sizes; moreover, this information becomes useless as soon as the chosen frame has been sent, because the frame sizes change when the frames are encoded differentially against another reference. Hence, the single frames cannot be delivered exactly just-in-time, only approximately, and encoding all transmission candidates differentially would be highly inefficient. Our straightforward approach is to estimate a frame's size based on the sizes of its predecessors; this linear predictor has proven to be an adequate heuristic (see the sketch below).
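The frame-size predictor can be as simple as the following sketch; the window length, the uniform weighting, and the fallback value are our assumptions, as the text only specifies a linear predictor over the sizes of preceding frames.

from collections import deque

class FrameSizePredictor:
    """Estimate the next differential frame's size as the mean of the last
    `window` observed sizes (a simple linear predictor)."""
    def __init__(self, window=5, fallback=4096):
        self.sizes = deque(maxlen=window)
        self.fallback = fallback           # used before any frame was observed

    def predict(self):
        return sum(self.sizes) / len(self.sizes) if self.sizes else self.fallback

    def observe(self, actual_size):
        self.sizes.append(actual_size)

# Together with the bandwidth estimate, predict() / bandwidth gives the
# expected transmission time used to decide whether a frame can still be
# delivered (approximately) just in time.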
This scheme does not cope well with scene changes (cuts) in movies: the difference between a picture before a cut and the picture after it is typically drastic; nearly the entire frame changes and almost no block stays the same, so differential encoding gains little and the predicted frame size is far too small. There are two ways to handle this cut problem:
As mentioned above, using a transport protocol like TCP would foil all the opportunities opened up by refineable media streams: the immediate, automatic error correction with data retransmission, and mechanisms like "slow start" that significantly reduce the data rate after an error has occurred, are not suitable for the transmission of time-critical media streams such as video streams.
However, we eventually have to employ data retransmission because we want to deliver all data in full quality. The characteristics of iterative transmissions suggest special error-correction algorithms, such as delayed data retransmission: packets lost due to the unreliable nature of UDP are not retransmitted immediately. Instead, we continue transmitting the current iteration as if no data had been lost; hence, the real-time presentation can be maintained despite data losses. All data that has not been acknowledged for a certain span of time is considered "unsent". Since WebMovie composes the layers of the respective stream dynamically, lost data is automatically retransmitted in a later iteration of the transmission (see figure 6).
figure 6: principle of delayed data retransmission
After the transmission has been completed, all data has been delivered. This elegant integration of the error correction into the iterative movie transmission comes at no extra cost in the implementation; it actually reduces the algorithmic overhead compared to other schemes.
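The bookkeeping behind delayed data retransmission can be sketched as follows; the timeout value is an assumption (the text only says "a certain span of time"), and the class and method names are ours.

import time

class DelayedRetransmission:
    """Track sent frames; a frame not acknowledged within `timeout` seconds
    is simply considered "unsent" again, so it re-enters a later iteration
    instead of being retransmitted immediately."""
    def __init__(self, n_frames, timeout=5.0):
        self.timeout = timeout
        self.sent_at = {}                        # frame index -> send timestamp
        self.unsent = set(range(n_frames))
        self.acked = set()

    def mark_sent(self, idx):
        self.unsent.discard(idx)
        self.sent_at[idx] = time.monotonic()

    def mark_acked(self, idx):
        self.acked.add(idx)
        self.sent_at.pop(idx, None)

    def expire(self):
        """Move timed-out frames back into the unsent pool."""
        now = time.monotonic()
        for idx, t in list(self.sent_at.items()):
            if now - t > self.timeout:
                del self.sent_at[idx]
                self.unsent.add(idx)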
Our testbed consists of the WebMovie server running on a PC with an Intel-Pentium-200 CPU and the client running on a PC with an Intel 80486-66 CPU. The connection is established using a V.34 modem, which allows data rates up to 33,600 bps. In our experiments we limited the bandwidth to 28,800 bps in order to get results valid for the majority of Internet users.
To measure the performance of the WebMovie system, we chose three different short video clips, all recorded from television ("Al Bundy", "Wallace & Gromit", and a music video; see table 1).
The characteristics of the three movies are summarized in table 1.
table 1: characteristics of the test movies
                      Al Bundy    Wallace & Gromit    Music Video
Number of Frames      304         162                 608
Frame Dimensions      192x136     240x168             160x120
MPEG (KBytes)         1,057       unknown             766¹⁰
JPEG/M (KBytes)       1,632       931                 1,427
Frame Rate (fps)      10          10                  10
The parameters measured using the test movies introduced above are listed in table 2: the time from the start of the transmission until the presentation of the first frame, and the time it takes to complete one whole iteration of the transmission (both measured in seconds). The next two lines contain the average number of frames transmitted during the first iteration and the total number of iterations of the transmission. Finally, the total transmission time was measured for all movies. This parameter is given relative to the time needed to transmit the MPEG-compressed movie sequentially with ftp. To a first approximation, the total transmission time is directly proportional to the amount of data actually transmitted; since the parameter is given only relative to ftp/MPEG, it may equally be read as the relative volume of the transmitted data.
table 2: results of video transmissions with WebMovie (at 28.8 kbps)
                                              Al Bundy    Wallace & Gromit    Music Video
Presentation of 1st Frame                     < 3 sec     < 5 sec             < 2 sec
Duration of 1st Iteration                     30.4 sec    16.2 sec            60.8 sec
Number of Frames in 1st Iteration (average)   10          5                   30
Number of Iterations                          16          22                  20
Transmission Time relative to MPEG/ftp        125%        < 400 sec¹¹         135%
According to these results, the first frames of the movies are presented very quickly, only a few seconds after the beginning of the transmission. During the first iterations of the transmission, frames of all scenes of the respective movie are presented - not only frames from the movie's beginning. Each iteration of the transmission takes exactly as long as the movie's real-time playback. Hence, users may watch as many presentations in improving quality as the transmission needs iterations; they may observe the progress of the transmission and abort it at any time.
These advantages of presentation-oriented transmission of the movies come at the expense of the total transmission time, which increases by about 30% compared to the transmission of the movies with standard methods (plain TCP, e.g. ftp) using the highly efficient MPEG compression standard. Although not the focus of this research, the latter result is quite surprising: MPEG's advantage over our simple action-block scheme is much smaller than anticipated.
We have introduced the completely new iterative transmission paradigm, which is based on the ideas of scaleable media streams and combines the advantages of the two conventional paradigms: it makes real-time transmission of media streams possible in combination with the delivery of the data in full quality - even via the Internet with its severely limited bandwidth.
We have shown several approaches to possible implementations of this new transmission paradigm and discussed the advantages and disadvantages of each of them.
We then briefly introduced the WebMovie system as a showcase implementation of iterative streaming, focusing on the transmission of prerecorded video data. A new error-correction mechanism, delayed retransmission, has been introduced, which supports the iterative nature of the employed transmission paradigm. Our measurements prove the potential of iterative streaming: only seconds after the beginning of the transmission, the first frames are presented in real-time - and not only frames from the beginning of the movie, but from all of its scenes. Due to the action-block algorithm, the data volume is reduced in comparison with the volume of the motion-JPEG-encoded movie.
It has also been shown that if the data is to be transmitted in full quality, the transmission takes about 130% of the time needed by the conventional file transmission paradigm (e.g. FTP) in combination with MPEG compression. However, the file transmission paradigm does not allow real-time presentation of the movie concurrently with the transmission, thus straining the patience of the user.
The client software - a Netscape plug-in currently available for Windows 95 only - can be downloaded at [13]. The reader is invited to test it with our WebMovie server (see [13]) in order to get a demonstration of iterative streaming.
Iterative streaming is well-suited for any field of application where the data is eventually needed in full quality, for instance video on demand. Moreover, users may save a lot of time and storage capacity: they may choose the quality in which they want to receive the data, and the transmission may be gracefully aborted at any point; if it is interrupted, the data is simply not stored in full quality.
The WebMovie system is currently implemented as a prototype only. Although all basic functionality is available in this prototype, there are many items to be improved and optimized. The following three paragraphs may give an idea of what needs to be done next to make the WebMovie system even more attractive:
1 However, the internal representation of these atoms can possibly be broken up further (see 2.3).
2 This subsampling is always possible due to the postulation of information atoms (see 2.1). It may be performed at least based on the respective stream's atoms.
3 An already existing vertical segmentation may be subsequently extended very easily by horizontal methods.
6 It has turned out that this procedure requires much more bookkeeping than the VF-hybrid, since the information atoms have been broken up in the first step already.
7 Since the complete stream information needs to be transmitted without omitting a single byte, data retransmission mechanisms must be provided - in contrast to scaleable media streams.
8 We considered several publications (e.g. [14], [15], [17]) on software providing full VCR functionality for MPEG-encoded data. All of them have to decode several frames in order to obtain, for instance, a single B-frame, because the information atoms of an MPEG stream are the "groups of pictures" (see [6]), which, however, are too large for our purposes.
9 Our action block approach is very similar to the "Intra-H.261" as proposed in [11].
10 This movie has been converted from the MPEG format into the motion JPEG format with the public-domain software ImageMagick (available at ftp://sunsite.cnlab-switch.ch/mirror/MachTen/applications/68k/X11/graphics).
11 This movie was initially motion JPEG coded.