INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC1/SC29/WG11 N2079
MPEG98/
Tokyo March 1998
Source: Requirements, Audio, DMIF, SNHC, Systems, Video
Status: Approved
Title: Overview of MPEG-4 functionalities supported in MPEG-4 Version 2
In January 1999, the first set of MPEG-4 Standards will be ready. Work on MPEG-4 will continue after that date for Version 2. Version 2, on which work has already started, will add tools to the MPEG-4 Standard. Existing tools and profiles from Version 1 will not be replaced in Version 2; technology will be added to MPEG-4 in the form of new profiles.
This document, which should be read in conjunction with the MPEG-4 Version 1 Overview (N1909), describes what will be added in MPEG-4 Version 2.
Version 2 of the MPEG-4 standard will support, in addition to the tools in Version 1:
Scene description for composition (2D/3D spatio-temporal synchronization with time response behavior) of multiple AV Objects. The description provides for:
1. 2D/3D objects grouping for ease in editing and composition,
2. Spatio-temporal 2D/3D AV Object positioning and transformation,
3. 2D/3D AV Object attribute value selection.
Specification of an API for describing AV Object behavior,
Specification of APIs for 2D composition,
Specification of an API for 2D/3D composition,
Support of downloadable executable code,
Server-side interaction via attribute value modification using standardized parametric description,
AV Objects descriptors to carry MPEG-7 data. (MPEG-7 will define a framework for identifying and describing what is inside the content.)
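The grouping, positioning and attribute selection listed under scene description above can be illustrated with a small sketch. It is not the MPEG-4 scene description syntax: the `AVObject` and `Group` classes, their fields and the `flatten` helper are hypothetical names chosen for this example, and only 2D translation and a start time are modelled.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class AVObject:
    """Hypothetical leaf node: one audio-visual object with attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Group:
    """Hypothetical grouping node with a 2D translation and a start time."""
    children: list = field(default_factory=list)
    translation: Tuple[float, float] = (0.0, 0.0)
    start_time: float = 0.0  # seconds, relative to the parent group


def flatten(node, offset=(0.0, 0.0), t0=0.0):
    """Resolve the absolute position and start time of every object."""
    if isinstance(node, AVObject):
        return [(node.name, offset, t0)]
    ox = offset[0] + node.translation[0]
    oy = offset[1] + node.translation[1]
    placed = []
    for child in node.children:
        placed.extend(flatten(child, (ox, oy), t0 + node.start_time))
    return placed


logo = AVObject("logo")
logo.attributes["transparency"] = 0.5  # attribute value selection
overlay = Group([logo], translation=(10.0, 5.0), start_time=2.0)
scene = Group([overlay, AVObject("background")], translation=(100.0, 50.0))
print(flatten(scene))
```

Moving or delaying the `overlay` group changes every object inside it at once, which is the editing convenience that grouping is meant to provide.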
Intellectual Property Management and Protection (IPMP)
While MPEG-4 Version 1 offers only the possibility to identify intellectual property (by means of using international standard numbers such as ISAN, ISRC, etc.), MPEG-4 Version 2 deals as well with the management and the protection of intellectual property.
MPEG-4 Version 2 provides hooks to (non-normative) »IP Management & Protection Systems« (IPMPS, see Figure 1) to support appropriate on-line and off-line transactions among users, IP providers and/or rights holders. Amongst the functions supported by different IPMPS are:
Conditional access to intellectual property, based on criteria to be defined by the IP provider;
Verification of authenticity of source of intellectual property and integrity of intellectual property;
Identification and, wherever possible, prevention of illegal copying;
Audit trails.
Since IPMP technology is changing rapidly, methods that are believed to be adequate today may not be adequate in the future. To provide users with a persistent solution, MPEG-4 Version 2 will not standardize the IPMPS itself but will facilitate the injection of external solutions by defining the interface to the IPMPS.
Figure 1: Overview of the Intellectual Property Management & Protection Framework
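The idea of standardizing only the interface, while leaving the IPMPS itself exchangeable, can be sketched as follows. The interface shown here is an assumption made for illustration: `IPMPSystem`, `AllowListIPMPS`, `open_stream` and the identifier `content-001` are hypothetical and do not come from the MPEG-4 specification.

```python
from abc import ABC, abstractmethod


class IPMPSystem(ABC):
    """Hypothetical interface between a terminal and an external IPMPS."""

    @abstractmethod
    def may_access(self, user: str, content_id: str) -> bool:
        """Conditional access decision, defined by the IP provider."""

    @abstractmethod
    def log(self, user: str, content_id: str, granted: bool) -> None:
        """Record the request so that an audit trail can be produced."""


class AllowListIPMPS(IPMPSystem):
    """One possible plug-in: access is granted from a fixed allow list."""

    def __init__(self, grants):
        self.grants = set(grants)
        self.audit_trail = []

    def may_access(self, user, content_id):
        return (user, content_id) in self.grants

    def log(self, user, content_id, granted):
        self.audit_trail.append((user, content_id, granted))


def open_stream(ipmps: IPMPSystem, user: str, content_id: str) -> bool:
    """The terminal only talks to the interface, never to a concrete IPMPS."""
    granted = ipmps.may_access(user, content_id)
    ipmps.log(user, content_id, granted)
    return granted


ipmps = AllowListIPMPS([("alice", "content-001")])
print(open_stream(ipmps, "alice", "content-001"))
print(open_stream(ipmps, "bob", "content-001"))
print(ipmps.audit_trail)
```

Replacing `AllowListIPMPS` with a different implementation requires no change to `open_stream`, which mirrors the reason given above for not standardizing the IPMPS.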
The Visual part of the MPEG-4 standard will be extended with tools in the following areas:
Gray Scale or alpha Shape Coding
An alpha plane defines the transparency of an object, which is not necessarily uniform. Multilevel alpha maps are frequently used to blend different layers of image sequences. Other applications that benefit from binary alpha maps associated with images are content-based image representations for image databases, interactive games, surveillance, and animation. Techniques are provided for the efficient coding of binary as well as gray scale alpha planes. A binary alpha map defines whether or not a pixel belongs to an object: each value is either on or off. A gray scale map makes it possible to define the exact transparency of each pixel.
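The difference between a binary and a gray scale alpha map is easiest to see when a foreground object is blended over a background. The following is a minimal sketch of that blending step only; it does not show how alpha planes are coded. The function name `composite`, the 8-bit value range and the sample pixel values are assumptions for this example.

```python
def composite(fg, bg, alpha, max_alpha=255):
    """Blend one row of foreground pixels over a background row.

    alpha == 0 shows only the background, alpha == max_alpha only the
    foreground; values in between mix the two (integer arithmetic).
    """
    return [
        (a * f + (max_alpha - a) * b) // max_alpha
        for f, b, a in zip(fg, bg, alpha)
    ]


fg = [200, 200, 200, 200]
bg = [100, 100, 100, 100]

binary_alpha = [0, 255, 255, 0]   # pixel is either outside or inside the object
gray_alpha = [0, 51, 204, 255]    # per-pixel transparency

print(composite(fg, bg, binary_alpha))  # [100, 200, 200, 100]
print(composite(fg, bg, gray_alpha))    # [100, 120, 180, 200]
```

With the binary map every output pixel is taken entirely from one layer, while the gray scale map produces intermediate values such as soft object edges.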
increased flexibility in object-based scalable coding,
improved coding efficiency, especially for very low bitrate applications requiring simple decoders,
improved error robustness.
Coding of Multiple Views: Intermediate views or stereoscopic views will be supported based on the efficient coding of multiple images or video sequences. A particular example is the coding of stereoscopic images or video by reducing the redundancy between the images of different views.
Body animation may be added to face animation; it uses similar tools.
Coding of generic 3D meshes allows synthetic 3D objects to be coded efficiently.
LOD (Level Of Detail) scalability allows the decoder to decode a subset of the total bit stream generated by the encoder and reconstruct a simplified version of the mesh containing fewer vertices than the original. Such simplified representations reduce the rendering time of objects that are distant from the viewer (LOD management), and also allow less powerful rendering engines to render the object at a reduced quality.
Spatial scalability allows the decoder to decode a subset of the total bit stream generated by the encoder to reconstruct the mesh at a reduced spatial resolution. This feature is most useful when combined with LOD scalability.
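Decoding only a subset of the bit stream can be pictured with a toy layered mesh. The layer format below (a list of dictionaries with `vertices` and `faces`) and the function `decode_mesh` are assumptions made for illustration; they are not the MPEG-4 bit stream syntax, and real progressive mesh schemes refine geometry in more elaborate ways than simply appending vertices.

```python
def decode_mesh(layers, max_layers):
    """Rebuild a mesh from the first `max_layers` layers of a layered stream."""
    vertices, faces = [], []
    for layer in layers[:max_layers]:
        vertices.extend(layer["vertices"])
        faces.extend(layer["faces"])
    return vertices, faces


# Toy stream: a base layer followed by two enhancement layers.
layers = [
    {"vertices": [(0, 0, 0), (1, 0, 0), (0, 1, 0)], "faces": [(0, 1, 2)]},
    {"vertices": [(1, 1, 0)], "faces": [(1, 3, 2)]},
    {"vertices": [(0.5, 0.5, 1)], "faces": [(0, 1, 4), (1, 3, 4)]},
]

coarse_vertices, coarse_faces = decode_mesh(layers, 1)  # distant object
full_vertices, full_faces = decode_mesh(layers, 3)      # close-up view
print(len(coarse_vertices), len(full_vertices))
```

A renderer with limited power, or one drawing an object far from the viewer, would stop after the base layer and skip the remaining data.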
3D geometry compression
3D progressive geometric meshes (temporal enhancement of 3D mesh detail through progressive transmission of differential geometry and structure - similar in functionality to progressive texture and different from discrete level-of-detail modeling in graphics where detail is enhanced by replacement)
Basic mechanism to select text containers by user picking for changing text
Conditional text behavior based upon user or visual events
Future audio studies for Version 2 will center on the development of:
Version 2 will add new tools to the audio algorithms to improve their error resilience. There are two classes of tools: The first class contains algorithms to improve the robustness of the source coding itself, e.g. Huffman codeword reordering for AAC. The second class consists of general tools for error protection, allowing equal and unequal error protection of the MPEG-4 audio coding schemes. Since these tools are based on convolutional codes, they allow very flexible use of different error correction overheads and capabilities, thus accommodating very different channel conditions.
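How convolutional codes permit different error correction overheads can be sketched with a textbook encoder and puncturing. This is not the MPEG-4 error protection tool: the rate-1/2 code with generators 7 and 5 (octal), the puncturing pattern, the omission of tail bits, and the function names are all assumptions chosen to keep the example small.

```python
def conv_encode(bits):
    """Rate-1/2 convolutional encoder, constraint length 3, generators 7 and 5 (octal)."""
    s1 = s2 = 0  # the two previous input bits
    out = []
    for b in bits:
        out.append(b ^ s1 ^ s2)  # generator 111
        out.append(b ^ s2)       # generator 101
        s1, s2 = b, s1
    return out


def puncture(coded, pattern):
    """Drop coded bits where the repeating pattern holds a 0."""
    return [c for i, c in enumerate(coded) if pattern[i % len(pattern)]]


def protect(sensitive_bits, robust_bits):
    """Unequal error protection: strong code for one class, weaker for the other."""
    strong = conv_encode(sensitive_bits)                     # rate 1/2
    weak = puncture(conv_encode(robust_bits), [1, 1, 1, 0])  # rate 2/3
    return strong + weak


payload = [1, 0, 1, 1]
print(conv_encode(payload))              # [1, 1, 1, 0, 0, 0, 0, 1]
print(len(protect(payload, payload)))    # 8 + 6 = 14 coded bits
```

Using the same pattern for both classes gives equal error protection; choosing the pattern per class, as in `protect`, trades overhead against robustness for each part of the audio bit stream.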
Environmental spatialization.
These new tools allow parametrization of the acoustical properties of an MPEG-4 scene (e.g. a 3-D model of a furnished room or a concert hall) created with the BIFS scene description tools. Such properties are, e.g., room reverberation time, speed of sound, boundary material properties (reflection, transmission), and sound source directivity. New functionality made possible with these scene description parameters includes advanced and immersive audiovisual rendering, detailed room acoustical modeling, and enhanced 3-D sound presentation.
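As a rough illustration of how a renderer might use such parameters, the sketch below derives a propagation delay and a reverberation level from them. The class `AcousticEnvironment`, its field names and the sample values are hypothetical and are not BIFS node definitions; the decay formula assumes the common convention that the reverberation time is the time taken for the sound level to fall by 60 dB.

```python
from dataclasses import dataclass


@dataclass
class AcousticEnvironment:
    """Hypothetical container for a few of the scene's acoustic parameters."""
    speed_of_sound: float      # metres per second
    reverberation_time: float  # seconds (assumed 60 dB decay time)
    wall_reflection: float     # fraction of amplitude reflected, 0..1


def direct_delay(env, distance_m):
    """Time in seconds for sound to travel from source to listener."""
    return distance_m / env.speed_of_sound


def reverb_level_db(env, t):
    """Level of the reverberant tail after t seconds, relative to its start."""
    return -60.0 * t / env.reverberation_time


hall = AcousticEnvironment(speed_of_sound=340.0,
                           reverberation_time=2.0,
                           wall_reflection=0.8)
print(direct_delay(hall, 34.0))    # seconds of delay over 34 m
print(reverb_level_db(hall, 1.0))  # dB after one second
```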
Low delay generic audio coding
Invoke SRM (Session and Resource Management) on demand after an initial Session has been established (with the tools present in DMIF v.1)
This allows a seamless transition from a session on a homogeneous network using DMIF v.1 to a more involved one using DMIF SRM (see below). A homogeneous network is a network composed of one transport technology only.
Allow heterogeneous connections with end-to-end agreed QoS level
A heterogeneous network is composed of different transport technologies connected in tandem through InterWorking Units. Typically, an end-to-end connection may comprise access distribution networks at both ends, to which the peers are connected, and a core network between them. No restrictions are enforced on the transport technology that each segment of the end-to-end connection may use. Peers can request either a best-effort or a guaranteed end-to-end QoS on the connection. For example, a DMIF end-to-end connection may use an Internet core with RSVP instead of an ATM core. This work will also incorporate network processing resources in an end-to-end connection, such as transcoders, audio and video bridges, multicast servers and switched broadcast servers, within a DMIF network session. Standardized signaling between the SRM and InterWorking Units will be developed.
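One way to reason about an end-to-end QoS level over segments connected in tandem is sketched below. The composition rules used here (bandwidth limited by the narrowest segment, delays adding up, a guarantee holding only if every segment guarantees it) are a simplifying assumption for illustration, not DMIF signaling; `Segment`, `end_to_end_qos` and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical description of one segment of an end-to-end connection."""
    technology: str
    bandwidth_kbps: int
    delay_ms: float
    guaranteed: bool  # False means best effort


def end_to_end_qos(path):
    """Combine per-segment figures into one end-to-end view."""
    return {
        "bandwidth_kbps": min(s.bandwidth_kbps for s in path),
        "delay_ms": sum(s.delay_ms for s in path),
        "guaranteed": all(s.guaranteed for s in path),
    }


path = [
    Segment("access network A", 2000, 5.0, True),
    Segment("ATM core", 155000, 20.0, True),
    Segment("access network B", 512, 5.0, True),
]
print(end_to_end_qos(path))
```

Swapping the core segment for a best-effort one would leave the bandwidth and delay estimates computable but turn the end-to-end guarantee into best effort.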
Fully symmetric consumer and producer operations within a single device.
DMIF extends the DSM-CC SRM with the concept of peers carrying receiver (consumer) or sender (producer) roles as opposed to mere Clients or Servers. This is in line with symmetric video codecs and allows any peer to initiate a session and add a connection. The role of a peer as a consumer or a producer is defined after a session is established. This allows DMIF v.2 to be used in conversational as well as (DSM-CC) multimedia retrieval applications.
End-to-end "session" across multiple network provider implementations
In this case many Session and Resource Managers (SRMs), each belonging to a different administrative entity and having its own subscribed peers, will interoperate so that a session may be established with peers across different SRM nodes. This will require standardized signaling to be developed between the SRM nodes. DMIF SRM groups the resources used in a service instance using session ID tags. Resources are not limited to network resources; anything that is dynamically allocated and de-allocated can be treated as a resource. The resources used in a DMIF network session can be logged, e.g. for billing. The session ID is also used for retiring the resources for later reuse once the session is terminated. All the connection resources are set up and cleared as one unit.
DMIF SRM allows the collection of accounting logs so that the revenues collected from the peers in a DMIF session are properly disbursed to those providers that supply the resources within that session.
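The grouping of resources under a session ID, their release as one unit and the accounting log can be sketched as follows. This is a minimal model under stated assumptions: the class `SessionResourceManager`, its methods, and the provider and resource names are hypothetical and do not represent DMIF or DSM-CC messages.

```python
from collections import defaultdict


class SessionResourceManager:
    """Hypothetical SRM that tags every resource with a session ID."""

    def __init__(self):
        self._resources = defaultdict(list)  # session ID -> [(provider, resource)]
        self.accounting_log = []             # kept after the session ends

    def allocate(self, session_id, provider, resource):
        self._resources[session_id].append((provider, resource))
        self.accounting_log.append((session_id, provider, resource))

    def terminate(self, session_id):
        """Release all resources of a session as one unit and return them."""
        return self._resources.pop(session_id, [])

    def usage_by_provider(self, session_id):
        """Count logged resources per provider, e.g. as a basis for billing."""
        counts = defaultdict(int)
        for sid, provider, _ in self.accounting_log:
            if sid == session_id:
                counts[provider] += 1
        return dict(counts)


srm = SessionResourceManager()
srm.allocate("session-1", "provider-A", "access connection")
srm.allocate("session-1", "provider-B", "core connection")
srm.allocate("session-1", "provider-B", "transcoder")
released = srm.terminate("session-1")
print(len(released), srm.usage_by_provider("session-1"))
```

Because the log is separate from the live resource table, the usage figures remain available for disbursing revenue after the session's resources have been retired.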