Media Mining Indexer is designed to run in real time or faster on common off-the-shelf PC hardware. We currently recommend Dell Precision workstations (model 450 or model 650) with dual 2.8 GHz Pentium 4 processors, 2 GB of ECC DDR memory and the Intel 7208 chipset. The smallest configuration we have tested that runs in real time on our standard broadcast-news test set is a dual Pentium III 1 GHz system with 2 GB of ECC SDRAM.
The minimum hardware configuration for running Media Mining Indexer is a single Pentium III 733 MHz processor with 2 GB of memory; this configuration runs approximately two times slower than real time.
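To make the "slower than real time" figure concrete, the following minimal sketch converts a real-time factor into wall-clock processing time. The function name and the 30-minute example are illustrative assumptions, not part of the product.

```python
def processing_minutes(audio_minutes: float, real_time_factor: float) -> float:
    """Wall-clock minutes needed to index `audio_minutes` of media.

    real_time_factor is processing time divided by media duration:
    1.0 is exactly real time, 2.0 is two times slower than real time.
    """
    if audio_minutes < 0 or real_time_factor <= 0:
        raise ValueError("duration must be >= 0 and factor must be > 0")
    return audio_minutes * real_time_factor


# A 30-minute broadcast on the minimum configuration (factor of about 2.0).
print(processing_minutes(30, 2.0))  # 60.0
```

With a factor of 1.0 or below, a live feed can be indexed without falling behind; with a factor above 1.0, only recorded material can be processed completely.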
Media Mining Indexer requires a sound card if the input is anything other than PCM audio files.
To feed video from cable TV, video recorders or other input devices via a composite or S-Video input, Media Mining Indexer can use a WDM-compatible analog TV card. SAIL LABS supports the following TV cards:
- Hauppauge WinTV GO
- Pinnacle Studio PCTV (Pro)
Input can also be fed directly from a digital satellite dish. In this configuration, the dish is connected to a digital satellite receiver card installed in a PC. Because of the CPU requirements of MPEG encoding and recoding, we suggest installing the satellite card in a separate machine. SAIL LABS supports the Pinnacle PCTV SAT card. (Please note that the PCTV SAT/CI is fundamentally different in terms of hardware and drivers and is therefore not supported.)
Configuration hints for real-time indexing:
- Pentium 4 with Hyper-Threading at 2.8 GHz, or equivalent.
- 2 GB ECC DDR RAM compatible with your system.
- Disk space: 60 GB recommended for production use on indexers, more for machines acting as DMC; minimum 30 GB (between 350 and 500 MB per Language Feature plus temporary disk space).
See also: Media Mining System Sample Configurations
Software prerequisites
- Microsoft Windows XP or Windows Server 2003
- DirectX 9.0 or later. This is required for indexing MPEG, MP3, WMV and AVI files.
- Sun Java Runtime Environment 1.3.1.02 (j2re-1_3_1_02-win.exe). This can be found on the Indexer CD (DVD) or downloaded from http://java.sun.com. Note: only this version of the JRE is guaranteed to work with Indexer.
- MS Internet Information Services (primary machine only). This is included on the Windows installation CD and can be installed via Start\Settings\Control Panel\Add Remove Programs\Windows Components.
- Real Player (current version). This optional component is needed only if you index Real Media files; it can be downloaded from www.real.com.
Language Options
Languages supported by Media Mining Indexer are:
- Arabic
- English (US)
- English (US/UK)
- French
- German
- Greek
- Spanish (American)
- Spanish (Spain)
Integration
The open architecture of the Media Mining Indexer enables easy integration with complementary technologies for diverse applications. Products that have integrated the Media Mining Indexer include Virage Videologger®, Oracle9i interMedia and blue order's media archive®. Integration can be done at two levels:
- Using command line tools provided by SAIL LABS
- By linking with the API C++ library and directly accessing Indexer components
Two command line tools are currently available: a Transcriber tool and an Indexer tool. They provide different, complementary functionality and must be used together to run the complete indexing process.
The API facilitates communication with the Media Mining System components. It provides a set of C++ classes that act as proxies for the interface components. These classes are declared in an API library header file. There are two typical usage scenarios for the Media Mining Indexer API: an Audio Feeder Client and an Administrator Client.
An Audio Feeder Client would typically perform the following actions:
- Connect to the Dispatcher using an Indexer Cluster port, or directly to an Indexer instance by providing an instance ID
- Request an input pin
- Optionally register a result handler to receive progress updates and/or the result transcription (by default the results are uploaded to Media Mining Server)
- Send audio/streaming media to the pin
- Wait for results
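The Audio Feeder Client steps listed above can be sketched as follows. The real API is a C++ library whose class and method names are not given in this document, so every name here (FakeIndexerConnection, FakeInputPin, request_input_pin, register_result_handler, wait_for_results, port 9000) is a hypothetical in-memory stand-in that only shows the order of the calls.

```python
from typing import Callable, List, Optional


class FakeInputPin:
    """Hypothetical stand-in for an input pin; it only counts received bytes."""

    def __init__(self) -> None:
        self.bytes_received = 0

    def send(self, chunk: bytes) -> None:
        self.bytes_received += len(chunk)


class FakeIndexerConnection:
    """Hypothetical stand-in for a Dispatcher/Indexer connection (not the real API)."""

    def __init__(self, cluster_port: int, instance_id: Optional[str] = None) -> None:
        self.cluster_port = cluster_port
        self.instance_id = instance_id
        self._pin: Optional[FakeInputPin] = None
        self._handler: Optional[Callable[[str], None]] = None

    def request_input_pin(self) -> FakeInputPin:
        self._pin = FakeInputPin()
        return self._pin

    def register_result_handler(self, handler: Callable[[str], None]) -> None:
        self._handler = handler

    def wait_for_results(self) -> None:
        # A real indexer would deliver a transcription; this stub reports a byte count.
        if self._pin is None:
            raise RuntimeError("no input pin was requested")
        if self._handler is not None:
            self._handler(f"received {self._pin.bytes_received} bytes")


def feed_audio(chunks: List[bytes], cluster_port: int,
               instance_id: Optional[str] = None) -> List[str]:
    results: List[str] = []
    conn = FakeIndexerConnection(cluster_port, instance_id)  # 1. connect
    pin = conn.request_input_pin()                            # 2. request an input pin
    conn.register_result_handler(results.append)              # 3. optional result handler
    for chunk in chunks:                                      # 4. send audio
        pin.send(chunk)
    conn.wait_for_results()                                   # 5. wait for results
    return results


print(feed_audio([b"\x00" * 320, b"\x00" * 320], cluster_port=9000))
# ['received 640 bytes']
```

Registering the handler before sending any audio means progress updates cannot be missed, which is why step 3 is placed ahead of step 4 in this sketch.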
An Administrator Client would typically do the following:
- Connect to an Indexer instance with an Indexer Cluster port and instance ID
- Query the status of an instance, and start and stop the instance
Technologies
The Media Mining System harnesses the synergies of some of the best speech processing and language technologies produced or currently in development. Technologies such as Automatic Speech Recognition, Speaker Identification, Name Spotting, Topic Classification, and Story Segmentation have been integrated in our system, and together produce comprehensively indexed text files from the media stream input.
Architecture/SW
The software consists of CORBA components performing the various steps of processing. The Engine can be run on both Linux and Win32 platforms.
Automatic Speech Recognition
Automatic Speech Recognition is performed in a sequence of steps: the system first processes the incoming audio, then chunks it into sections of speech and non-speech, and finally applies speech recognition to the segments identified as containing speech. This can be done in real time for large vocabularies (64K entries) and for audio sampled at 8 kHz and 16 kHz (but is not limited to these rates).
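The chunking step can be illustrated with a simple energy threshold. This is only a sketch: the document does not state which speech/non-speech detector the Indexer uses, and the frame length and threshold below are arbitrary illustration values.

```python
from typing import List, Optional, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(samples: List[float], frame_len: int,
                    threshold: float) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, with energy above threshold."""
    energies = frame_energies(samples, frame_len)
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # a speech region begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # the speech region ends
            start = None
    if start is not None:                  # audio ended inside a speech region
        segments.append((start, len(energies)))
    return segments


# Silence, a burst of signal, silence again.
signal = [0.0] * 4 + [0.5, -0.5] * 4 + [0.0] * 4
print(speech_segments(signal, frame_len=4, threshold=0.01))  # [(1, 3)]
```

Only the frames inside the returned segments would be passed on to the recognizer; the rest is treated as non-speech.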
Our speech recognition engine is language independent (apart from changes to the acoustic front end, e.g. for tonal languages). The recognizer has so far been run on English, French, Spanish, German, Arabic and Mandarin.
Front-End: Standard MFCC coefficients, energy, 5/3 frame regression, deltas, delta-deltas, cepstral mean subtraction.
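Two of the front-end steps named above, cepstral mean subtraction and regression-based deltas, can be sketched in a few lines. The sketch uses the standard regression formula for deltas and assumes that "5/3 frame regression" means a 5-frame window for deltas and a 3-frame window for delta-deltas; that reading, the edge handling (first and last frames are repeated) and the function names are assumptions.

```python
from typing import List

Frames = List[List[float]]


def cepstral_mean_subtraction(frames: Frames) -> Frames:
    """Subtract the per-coefficient mean, computed over all frames."""
    n, dims = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def deltas(frames: Frames, half_window: int) -> Frames:
    """Regression deltas over 2 * half_window + 1 frames; edge frames are repeated."""
    n, dims = len(frames), len(frames[0])
    denom = 2 * sum(k * k for k in range(1, half_window + 1))
    out: Frames = []
    for t in range(n):
        row = []
        for d in range(dims):
            num = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


# Toy features: coefficient 0 rises by 1 per frame, coefficient 1 is constant.
feats = [[float(t), 2.0] for t in range(6)]
normed = cepstral_mean_subtraction(feats)
d1 = deltas(normed, half_window=2)   # 5-frame regression (assumed for deltas)
d2 = deltas(d1, half_window=1)       # 3-frame regression (assumed for delta-deltas)
print(d1[2])  # [1.0, 0.0]
```

In the interior of the ramp the delta equals the slope (1.0 per frame), and the constant coefficient has a delta of zero, which is the behaviour the regression formula is meant to give.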
Acoustic Models: Speaker- and gender-independent acoustic models, 3- and 5-phone context models, with Gaussian prototypes tied at the prototype as well as the mixture-weight level.
Language Models: Our engine mainly uses bigrams and trigrams for the language model, with a Witten-Bell type back-off model. The language model and acoustic scores are combined for maximum accuracy.
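As an illustration of Witten-Bell smoothing, the sketch below estimates bigram probabilities from a toy corpus. It uses the interpolated form of Witten-Bell, P(w|h) = (c(h,w) + T(h) P(w)) / (c(h) + T(h)), where T(h) is the number of distinct words seen after h. This is a common textbook variant chosen for brevity; the engine's actual back-off scheme is not described in this document, and the class name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


class WittenBellBigram:
    """Toy bigram model with interpolated Witten-Bell smoothing."""

    def __init__(self, sentences: List[List[str]]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        self.history_counts: Counter = Counter()
        self.followers: Dict[str, Set[str]] = defaultdict(set)
        for words in sentences:
            self.unigrams.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigrams[(prev, cur)] += 1
                self.history_counts[prev] += 1
                self.followers[prev].add(cur)
        self.total = sum(self.unigrams.values())

    def unigram_prob(self, word: str) -> float:
        return self.unigrams[word] / self.total

    def prob(self, prev: str, word: str) -> float:
        c_h = self.history_counts[prev]
        if c_h == 0:
            # Unseen history: fall back entirely to the unigram estimate.
            return self.unigram_prob(word)
        t_h = len(self.followers[prev])  # distinct word types seen after prev
        return (self.bigrams[(prev, word)] + t_h * self.unigram_prob(word)) / (c_h + t_h)


lm = WittenBellBigram([["the", "news", "at", "nine"], ["the", "news", "today"]])
print(round(lm.prob("news", "today"), 4))  # 0.3214 (seen bigram)
print(round(lm.prob("news", "nine"), 4))   # 0.0714 (unseen bigram, mass from unigram)
```

The more distinct continuations a history has, the more probability mass is reserved for unseen words, which is the defining idea of Witten-Bell estimation.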
Decoder Search: We use a multi-pass, time-synchronous search in which increasingly detailed models are used at each pass. After a forward pass and a backward pass, the resulting N-best list is re-scored using the most detailed acoustic models.
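The final re-scoring step can be pictured as re-ranking an N-best list by a combined acoustic and language model score. The sketch below assumes a simple log-linear combination with a language model weight; the weight, the scores and the Hypothesis type are made-up illustration values, not the engine's actual scoring.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    words: List[str]
    acoustic_logprob: float  # log score from the (detailed) acoustic models
    lm_logprob: float        # log score from the language model


def rescore(nbest: List[Hypothesis], lm_weight: float) -> List[Hypothesis]:
    """Return the hypotheses sorted by combined log score, best first."""
    return sorted(
        nbest,
        key=lambda h: h.acoustic_logprob + lm_weight * h.lm_logprob,
        reverse=True,
    )


nbest = [
    Hypothesis(["wreck", "a", "nice", "beach"], -118.0, -15.0),
    Hypothesis(["recognize", "speech"], -120.0, -8.0),
]
best = rescore(nbest, lm_weight=10.0)[0]
print(" ".join(best.words))  # recognize speech
```

On acoustic score alone the first hypothesis would win; adding the weighted language model score reverses the ranking, which is the effect the combination of the two knowledge sources is meant to have.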
Speaker ID/Clustering
The Speaker Identification (SID) system identifies speakers in broadcast news transcription. The incoming audio is first divided into speech and non-speech regions, and speaker turns are hypothesized on these chunks. Speaker Clustering (SC) and Speaker ID (SID) are then run on the resulting chunks. Typically about 20 to 100 speakers can be identified; for non-target speakers the gender is detected and the unknown speaker's segments are labelled accordingly. Speaker ID/Clustering is language independent.
Speaker Identification: SID uses Gaussian Mixture Models (GMMs) for a number of pre-selected target speakers, plus a number of cohort speakers that serve for normalization and gender detection.
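A minimal sketch of GMM-based identification with cohort normalization follows. It scores one-dimensional "features" against each target GMM, subtracts the average cohort score, and reports a target only if the normalized score exceeds a threshold. The one-dimensional features, the averaging of cohort scores, the threshold and all speaker names are simplifying assumptions; the document does not specify how the cohort is used in detail.

```python
import math
from typing import Dict, List, Optional, Tuple

Gmm = List[Tuple[float, float, float]]  # (weight, mean, variance) per component, 1-D


def avg_log_likelihood(frames: List[float], gmm: Gmm) -> float:
    """Average per-frame log-likelihood under a 1-D GMM (log-sum-exp for stability)."""
    total = 0.0
    for x in frames:
        logs = [
            math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
            for w, mu, var in gmm
        ]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total / len(frames)


def identify(frames: List[float], targets: Dict[str, Gmm], cohort: List[Gmm],
             threshold: float = 0.0) -> Optional[str]:
    """Return the best target speaker, or None if no target beats the cohort."""
    cohort_score = sum(avg_log_likelihood(frames, g) for g in cohort) / len(cohort)
    best_name, best_score = None, -math.inf
    for name, gmm in targets.items():
        score = avg_log_likelihood(frames, gmm) - cohort_score  # cohort-normalized
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score > threshold else None


targets = {"anchor_a": [(1.0, 0.0, 1.0)], "anchor_b": [(1.0, 5.0, 1.0)]}
cohort = [[(1.0, 2.5, 4.0)]]
print(identify([4.8, 5.1, 5.3], targets, cohort))  # anchor_b
print(identify([2.4, 2.6, 2.5], targets, cohort))  # None (treated as non-target)
```

Normalizing by the cohort makes the decision depend on how much better a target model fits than a generic speaker does, rather than on the raw likelihood, which varies with the audio conditions.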
Speaker Clustering: Speaker Clustering groups the pre-determined chunks (from the initial segmentation) into k clusters. The within-class dispersion is used as the quality measure. Segments are clustered using a variant of the generalized likelihood ratio criterion.
Speaker Change Detection
Speaker Change Detection (SCD) is done using a phone-level decoder, which employs a set of broad phonetic classes of speech sounds as well as non-speech sounds. Using the information produced by the decoder (i.e. the "transcript"), the SCD system sequentially hypothesizes speaker turns at phoneme boundaries. A generalized likelihood ratio test is used to determine whether a change should be made.
SCD works on the cepstral features and energy; no derivatives are used. The acoustic models employed are tied on the phoneme level with a language model based on the phoneme classes.
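The generalized likelihood ratio test mentioned above can be sketched for the simplest case: each segment is modelled by a single one-dimensional Gaussian, and the test compares "two separate Gaussians" against "one shared Gaussian". The single-Gaussian model, the variance floor and the threshold value are assumptions made for illustration.

```python
import math
from typing import List


def ml_variance(values: List[float], floor: float = 1e-6) -> float:
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(values) / len(values)
    return max(sum((v - mean) ** 2 for v in values) / len(values), floor)


def glr_distance(left: List[float], right: List[float]) -> float:
    """Log likelihood ratio of 'two Gaussians' versus 'one pooled Gaussian' (1-D)."""
    pooled = left + right
    return 0.5 * (
        len(pooled) * math.log(ml_variance(pooled))
        - len(left) * math.log(ml_variance(left))
        - len(right) * math.log(ml_variance(right))
    )


def is_speaker_change(left: List[float], right: List[float], threshold: float) -> bool:
    """Hypothesize a change when separate models fit much better than a shared one."""
    return glr_distance(left, right) > threshold


speaker_1 = [0.0, 0.1, -0.1, 0.05]
speaker_1_again = [0.02, -0.05, 0.08, -0.02]
speaker_2 = [3.0, 3.1, 2.9, 3.05]
print(is_speaker_change(speaker_1, speaker_2, threshold=5.0))        # True
print(is_speaker_change(speaker_1, speaker_1_again, threshold=5.0))  # False
```

The same distance can serve as a merge criterion in clustering: two chunks with a small value are likely to come from the same speaker.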
Named Entity Detection
The Named Entity System (NE) identifies named locations, persons, organizations, dates, times, monetary amounts and percentages in the text. Name detection is approached as a classification problem in which every word is either part of some name or not part of a name. A variant of an HMM is employed for this task: each class of names (persons, locations, etc.) is modelled by a separate state, and the "not-a-name" class by a further state.
Within the states for named entities, a statistical bigram language model is used to compute the likelihood of a sequence of words. Decoding consists of finding the most likely sequence of name-class states.
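Decoding the most likely name-class sequence is a Viterbi search. The sketch below is a simplified illustration: it uses per-state unigram word probabilities instead of the bigram models described above, and the states, probabilities and probability floor for unknown words are toy assumptions.

```python
import math
from typing import Dict, List


def viterbi(words: List[str], states: List[str], start: Dict[str, float],
            trans: Dict[str, Dict[str, float]], emit: Dict[str, Dict[str, float]],
            floor: float = 1e-6) -> List[str]:
    """Most likely state sequence for `words`, computed in the log domain."""

    def log_emit(state: str, word: str) -> float:
        return math.log(emit[state].get(word, floor))

    scores = {s: math.log(start[s]) + log_emit(s, words[0]) for s in states}
    backpointers: List[Dict[str, str]] = []
    for word in words[1:]:
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = scores[prev] + math.log(trans[prev][s]) + log_emit(s, word)
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):   # trace the best path backwards
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["PERSON", "LOCATION", "OTHER"]
start = {"PERSON": 0.2, "LOCATION": 0.2, "OTHER": 0.6}
trans = {s: {t: (0.6 if s == t else 0.2) for t in states} for s in states}
emit = {
    "PERSON": {"john": 0.5, "smith": 0.5},
    "LOCATION": {"vienna": 0.6, "paris": 0.4},
    "OTHER": {"visited": 0.5, "in": 0.5},
}
print(viterbi(["john", "smith", "visited", "vienna"], states, start, trans, emit))
# ['PERSON', 'PERSON', 'OTHER', 'LOCATION']
```

Consecutive words with the same label form one entity, so "john smith" is reported as a single person name.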
Story Detection
Story Detection consists of several phases. First an episode is partitioned into sections of homogeneous topics. The boundaries of these sections are adjusted further in a subsequent step. Finally, these stable sections are scored against an HMM modelling the individual topics. The top ranking topics are used to determine what the story is about.
The topic model is an HMM with one model per topic and one additional model for general language (filler words that are not specific to any topic). Each state represents a topic and emits topic-dependent words probabilistically. Transitions between topics are allowed at every word and are not observed. At decoding time, the most likely set of topics given the recognized text is determined.
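The topic scoring step can be sketched under a strong simplifying assumption: if the transition probabilities do not depend on the previous state, the likelihood of a two-state HMM (one topic state plus the general-language state) reduces to a per-word mixture of the two word distributions. The sketch ranks topics by that likelihood; the mixture weight, the word probabilities and the probability floor are toy values, and the real system's models are not described in this document.

```python
import math
from typing import Dict, List

WordModel = Dict[str, float]


def section_log_likelihood(words: List[str], topic_model: WordModel,
                           general_model: WordModel, topic_weight: float = 0.3,
                           floor: float = 1e-6) -> float:
    """Log-likelihood of a section under a topic/general-language word mixture."""
    total = 0.0
    for word in words:
        p_topic = topic_model.get(word, floor)
        p_general = general_model.get(word, floor)
        total += math.log(topic_weight * p_topic + (1 - topic_weight) * p_general)
    return total


def rank_topics(words: List[str], topics: Dict[str, WordModel],
                general_model: WordModel, top_n: int = 2) -> List[str]:
    """Return the names of the top_n best-scoring topics, best first."""
    scored = sorted(
        ((section_log_likelihood(words, model, general_model), name)
         for name, model in topics.items()),
        reverse=True,
    )
    return [name for _, name in scored[:top_n]]


general = {"the": 0.5, "in": 0.3, "a": 0.2}
topics = {
    "sports": {"match": 0.5, "goal": 0.5},
    "finance": {"market": 0.5, "shares": 0.5},
    "weather": {"rain": 0.5, "storm": 0.5},
}
section = ["the", "match", "goal", "in", "the", "rain"]
print(rank_topics(section, topics, general))  # ['sports', 'weather']
```

The general-language model absorbs the filler words, so only the topic-specific words ("match", "goal", "rain") separate the topic scores.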