Representation and linking mechanisms for audio in MPEG-7
Abstract
This paper proposes a general framework for the description of audio within audiovisual sequences for MPEG-7. The related descriptors and description schemes were initially defined during the first phase of MPEG-7 and then evaluated at the Lancaster Meeting held in February 1999. These proposals rest on the underlying premise that audio content can be expressed by a combination of two synergistic representations, both of which are necessary to represent audio content accurately. The first is a structured or semantic representation of audio, such as a sentence, paragraph, score, or class. The second is an unstructured representation of the audio, simply a continuous stream of data. Since it is not possible to express all aspects of audio in a structured representation, powerful linking mechanisms are required between these two representations. We propose an audio description scheme, based on hierarchical temporal segments, as a basic structure and representation for audio. Such a description scheme is essential both for ease of description and to support content-based indexing and retrieval of audio. We also propose a description scheme for the representation of larger structures, such as spoken content in audio, where the annotation is generated by automatic speech recognition. Finally, we propose linking mechanisms between structured descriptions and unstructured audio content, as an example of a facility that would add great power to both of the previously mentioned description frameworks.