mClust

This system is no longer maintained by LIUM; please switch to LIUM_SpkDiarization.

mClust is a software package dedicated to speaker diarization (i.e. speaker segmentation and clustering). Most of the tools in the package take a segmentation as input and generate a new segmentation as output. The provided tools perform BIC hierarchical clustering, Viterbi decoding using GMMs trained by EM or MAP, and CLR hierarchical clustering using GMMs (clustering based on automatic speaker recognition methods).
 
Please read the Interspeech 2005 paper [1] first: "The LIUM speech transcription system: a CMU Sphinx III-based system for French broadcast news".

Licence

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Installation

The LIUM diarization package is written in C++ and depends on no external library (such as LAPACK or SPRO) other than the standard ones (such as the STL). The documentation is generated with doxygen; a doxygen configuration file is provided in the root of the package. The simplest way to compile the package is the standard GNU procedure:

  1. "cd" to the directory containing the package's source code and type "./configure" to configure the package for your system. If you're using "csh" on an old version of System V, you might need to type "sh ./configure" instead to prevent "csh" from trying to execute "configure" itself.
    Running "configure" takes a while. While running, it prints messages telling which features it is checking for.
  2. Type "make" to compile the package.
  3. Type "make install" to install the programs.
  4. Type "doxygen" to generate the documentation.
  5. You can remove the program binaries and object files from the source code directory by typing "make clean". To also remove the files that "configure" created (so you can compile the package for a different kind of computer), type "make distclean". There is also a "make maintainer-clean" target, but that is intended mainly for the package's developers. If you use it, you may have to get all sorts of other programs in order to regenerate files that came with the distribution.

No optional feature is required, but you can specify an installation prefix. The option "--exec-prefix=PATH" given to "configure" installs the programs and libraries using PATH as the prefix.

"configure" also accepts initial values for variables on the command line, for example:
./configure CXXFLAGS="-Wall -O3"
"CXXFLAGS" sets the C++ compiler options; it can be useful to add a "-march" option there (see "man gcc").


Acoustic features

The supported acoustic feature formats are SPRO 4.0 and SPHINX. We observed better results with SPRO features. The HTK format is also available but has not been tested. SPRO is distributed under the GPL. We made some minor patches to the SPRO 4.0 version of sfbcep; the patch is provided in "mClust/spro".

LIUM segmentation format

The segmentation file format is close to the NIST MDTM and STM formats. Each line corresponds to a segment.
Example: "19981217_0700_0800_inter_fm_dga 1 1 317 U U U spk0"

  • field 1: "19981217_0700_0800_inter_fm_dga" the show name
  • field 2: "1" the channel number
  • field 3: "1" the start of the segment (counted in feature vectors)
  • field 4: "317" the length of the segment (counted in feature vectors)
  • field 5: "U" the speaker gender
  • field 6: "U" the type of band (telephone, studio)
  • field 7: "U" the type of environment (music, speech only, ...)
  • field 8: "spk0" the speaker label
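
As an illustration of the format above, here is a minimal Python sketch that splits one segmentation line into its eight fields. It is not part of the package: the names Segment and parse_segment are hypothetical, and the rate of 100 features per second (10 ms frame shift) is only an assumption used to show a conversion to seconds; the real rate depends on how the features were extracted.

```python
from dataclasses import dataclass

# Assumption: 100 feature vectors per second (10 ms frame shift).
FEATURES_PER_SECOND = 100.0


@dataclass
class Segment:
    show: str
    channel: int
    start: int        # index of the first feature vector of the segment
    length: int       # number of feature vectors in the segment
    gender: str
    band: str
    environment: str
    speaker: str

    @property
    def end(self) -> int:
        """Index of the first feature vector after the segment."""
        return self.start + self.length


def parse_segment(line: str) -> Segment:
    """Parse one whitespace-separated segmentation line (8 fields)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    show, channel, start, length, gender, band, environment, speaker = fields
    return Segment(show, int(channel), int(start), int(length),
                   gender, band, environment, speaker)


seg = parse_segment("19981217_0700_0800_inter_fm_dga 1 1 317 U U U spk0")
duration_seconds = seg.length / FEATURES_PER_SECOND
print(seg.speaker, seg.start, seg.end, duration_seconds)
```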

Descriptions of the main tools

Every program provides help (--help) and debugging information (--trace). A given parameter name always has the same meaning in all programs.

mSegInit

Performs two safety checks on a given feature file:

  • Checks that the sections of the file on which segmentation is to be done actually fit completely in the file (it sometimes happens in evaluation campaigns that some sound files are not as long as they are supposed to be); these sections are given to the program as a segmentation file;
  • Checks the feature vectors to ensure that there is no sequence of several identical vectors (usually resulting from a problem when recording the sound), as such sequences would disturb the segmentation process.
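
The two checks can be sketched as follows in Python. This is only an illustration of the idea, not the mSegInit code: the function names, the in-memory list of feature vectors and the minimum run length of 2 are all assumptions, and what mSegInit does once a problem is found is not covered here.

```python
def segment_fits(start, length, num_features):
    """True if the segment [start, start + length) lies inside the file."""
    return start >= 0 and length > 0 and start + length <= num_features


def find_repeated_runs(features, min_run=2):
    """Return (start, run_length) for every run of identical consecutive vectors.

    min_run is a hypothetical threshold: runs shorter than it are ignored.
    """
    runs = []
    i = 0
    n = len(features)
    while i < n:
        j = i + 1
        while j < n and features[j] == features[i]:
            j += 1
        if j - i >= min_run:
            runs.append((i, j - i))
        i = j
    return runs


# Toy feature file: 5 vectors of dimension 2, vectors 1 to 3 are identical.
features = [[0.1, 0.2], [0.3, 0.4], [0.3, 0.4], [0.3, 0.4], [0.5, 0.6]]

print(segment_fits(0, 5, len(features)))  # segment covers the whole file
print(segment_fits(3, 5, len(features)))  # segment runs past the end of the file
print(find_repeated_runs(features))
```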

mSeg

Distance-based segmentation software. The distance can be BIC, GLR, Gaussian divergence or KL2. Caution: there is a known bug in the current version, which only works on ONE segment.

mClust

BIC-based clustering software. It works well with BIC on mono-Gaussian models; GMMs are also implemented, using the CLR distance.

mTrainInit

Software for initialization of GMMs.

mTrainEM

Software for training a GMM with the EM algorithm (the GMM must first be initialized with mTrainInit).

mTrainMAP

Software for training a GMM with the MAP algorithm (the GMM must first be initialized with mTrainInit).

mDecode

Basic Viterbi decoder using a set of GMMs.

mScore

A program that computes the likelihood scores given a set of GMMs.

Examples

BIC hierarchical clustering

The script segBIC.sh provides the speaker segmentation process which was used during the ESTER evaluation campaign in 2005 [1]. The diarization error rate is 19.18% for the ESTER evaluation corpus. This corpus is composed of 18 shows (10 h) recorded from six radio stations: France Inter, France Info, RFI, RTM, France Culture and Radio Classique. The official results are shown in [3]. The score is computed using the NIST scoring tool (version 23).

The script takes a Sphere file and yields a speaker segmentation file. Wave files can easily be used instead: simply replace "sfbcep -v -F sphere ..." with "sfbcep -v -F wave ...".
BIC distances, based on mono Gaussians, are used in the segmentation and clustering process. Viterbi decoding is then performed using an ergodic HMM trained by EM (8 diagonal Gaussians per speaker class model).
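
To make the BIC distance between two mono-Gaussian segments more concrete, here is a minimal Python sketch of the usual delta BIC criterion. It is not the package's implementation: diagonal covariances are assumed only to keep the code free of external libraries, and the penalty weight, the variance floor and all function names are assumptions. With this sign convention, a positive value suggests two different speakers and a negative value suggests that the two segments can be merged.

```python
import math


def diagonal_variances(vectors, floor=1e-10):
    """Per-dimension maximum-likelihood variances, floored to avoid log(0)."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, floor)
        for d in range(dim)
    ]


def log_det(variances):
    """Log-determinant of a diagonal covariance matrix."""
    return sum(math.log(v) for v in variances)


def delta_bic(x, y, penalty_weight=1.0):
    """Delta BIC between segments x and y, each modeled by one diagonal Gaussian."""
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    dim = len(x[0])
    gain = (0.5 * n * log_det(diagonal_variances(x + y))
            - 0.5 * n_x * log_det(diagonal_variances(x))
            - 0.5 * n_y * log_det(diagonal_variances(y)))
    # A diagonal Gaussian has 2 * dim free parameters (means and variances).
    penalty = 0.5 * (2 * dim) * math.log(n)
    return gain - penalty_weight * penalty


# Toy data: x and z are similar, y is shifted far away from x.
x = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9], [-0.2, 1.2], [0.0, 1.0]]
z = [[0.1, 1.1], [-0.1, 0.9], [0.0, 1.0], [0.2, 1.2], [-0.2, 0.8], [0.1, 0.9]]
y = [[a + 10.0, b + 10.0] for a, b in x]

print(delta_bic(x, y))  # positive: the segments look different
print(delta_bic(x, z))  # negative: the segments look alike
```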

Score over ESTER evaluation corpus

The diarization score is given in the file BIC.result. The segmentation is also provided in mdtm and seg formats.

TOTAL TIME = 36399.21 secs
TOTAL SPEECH = 34706.44 secs (95.3 percent of total time)
SCORED TIME = 34808.38 secs (95.6 percent of total time)
SCORED SPEECH = 33870.76 secs (97.3 percent of scored time)
---------------------------------------------
TIME IN OVERLAPS = 89.85 secs (0.3 percent of scored time)
TIME*SPEAKERS IN OVERLAPS = 179.70 secs (0.5 percent of scored time)
---------------------------------------------
MISSED SPEECH = 0.06 secs (0.0 percent of scored time)
FALARM SPEECH = 937.63 secs (2.7 percent of scored time)
---------------------------------------------
SCORED SPEAKER TIME = 33960.60 secs (100.3 percent of scored speech)
MISSED SPEAKER TIME = 89.91 secs (0.3 percent of scored speaker time)
FALARM SPEAKER TIME = 937.63 secs (2.8 percent of scored speaker time)
SPEAKER ERROR TIME = 5486.75 secs (16.2 percent of scored speaker time)
---------------------------------------------
OVERALL SPEAKER DIARIZATION ERROR = 19.18 percent of scored speaker time (Overall)

CLR hierarchical clustering

The script segCLR.sh provides a speaker segmentation process which is close to the best method proposed during the ESTER and RT'05F evaluations [2,3]. The method uses CLR distances between two speaker clusters which are modeled by GMMs. GMMs are adapted by MAP. The UBM ("gmm128_1_3_2_0_13_1_1_0_sil_0.1_split.gmm") is the merge of 4 gender- and bandwidth-dependent GMMs.
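
As a reminder of what a CLR (cross likelihood ratio) between two clusters looks like, here is a small Python sketch based on the usual definition from the speaker recognition literature. Whether mClust uses exactly this normalization is an assumption, the function name is hypothetical, and the log-likelihood values in the example are made up for illustration. With this definition, a higher value means that the two clusters are more similar.

```python
def cross_likelihood_ratio(loglik_i_given_j, loglik_i_given_ubm, n_i,
                           loglik_j_given_i, loglik_j_given_ubm, n_j):
    """CLR between clusters i and j.

    loglik_a_given_m is the total log-likelihood of the features of cluster a
    under model m (the MAP-adapted GMM of the other cluster, or the UBM);
    n_i and n_j are the numbers of feature vectors in each cluster.
    """
    return ((loglik_i_given_j - loglik_i_given_ubm) / n_i
            + (loglik_j_given_i - loglik_j_given_ubm) / n_j)


# Made-up values: each cluster is better explained by the other cluster's
# model than by the UBM, so the CLR is positive.
value = cross_likelihood_ratio(-950.0, -1000.0, 100, -1880.0, -2000.0, 200)
print(value)
```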

Note: The CLR segmentation needs to be filtered: non-speech regions are removed according to the transcription. This filtering gives an absolute gain of about 1.3 points.

Score over ESTER evaluation corpus

The diarization score is given in the file CLR.result. The segmentation is also provided in mdtm and seg formats.

TOTAL TIME = 36399.21 secs
TOTAL SPEECH = 34706.44 secs (95.3 percent of total time)
SCORED TIME = 34808.38 secs (95.6 percent of total time)
SCORED SPEECH = 33870.76 secs (97.3 percent of scored time)
---------------------------------------------
TIME IN OVERLAPS = 89.85 secs (0.3 percent of scored time)
TIME*SPEAKERS IN OVERLAPS = 179.70 secs (0.5 percent of scored time)
---------------------------------------------
MISSED SPEECH = 0.04 secs (0.0 percent of scored time)
FALARM SPEECH = 937.62 secs (2.7 percent of scored time)
---------------------------------------------
SCORED SPEAKER TIME = 33960.60 secs (100.3 percent of scored speech)
MISSED SPEAKER TIME = 89.89 secs (0.3 percent of scored speaker time)
FALARM SPEAKER TIME = 937.62 secs (2.8 percent of scored speaker time)
SPEAKER ERROR TIME = 3493.49 secs (10.3 percent of scored speaker time)
---------------------------------------------
OVERALL SPEAKER DIARIZATION ERROR = 13.31 percent of scored speaker time (Overall)
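
The overall error in both reports is the sum of the missed, false alarm and speaker error times divided by the scored speaker time. The short Python check below recomputes it from the figures of the BIC and CLR reports; the function name is hypothetical.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, scored_speaker_time):
    """Overall speaker diarization error, in percent of scored speaker time."""
    return 100.0 * (missed + false_alarm + speaker_error) / scored_speaker_time


# Figures copied from the BIC and CLR reports (in seconds).
bic_der = diarization_error_rate(89.91, 937.63, 5486.75, 33960.60)
clr_der = diarization_error_rate(89.89, 937.62, 3493.49, 33960.60)

print(f"BIC: {bic_der:.2f} percent, CLR: {clr_der:.2f} percent")
```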

Gender and bandwidth detection

Gender and bandwidth detection is also provided. The system is based on 4 GMMs (gender and bandwidth dependent) composed of 128 diagonal Gaussians each. The models are contained in the file "gender.gmms". The "mScore" program labels each segment with the name of the most likely model. The score of each model is also provided. The file make_gender.sh contains an example of GMM training.
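
The labeling principle (score a segment against each model and keep the name of the most likely one) can be sketched in Python as follows. This is not the mScore code: the model names, the toy 2-component models and the per-frame averaging are assumptions made for the example, whereas the real models have 128 diagonal Gaussians each.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def label_segment(frames, models):
    """Return (best model name, per-model average log-likelihood per frame)."""
    scores = {
        name: sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)
        for name, gmm in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: 2 components each, 2-dimensional features.
models = {
    "male_studio": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "female_studio": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
frames = [[4.2, 3.9], [4.8, 5.1], [4.5, 4.4]]

label, scores = label_segment(frames, models)
print(label, scores)
```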


[1] Paul Deléglise, Yannick Estève, Sylvain Meignier, Teva Merlin (2005), The LIUM speech transcription system: a CMU Sphinx III-based system for French broadcast news, In: Interspeech'05, ISCA, Sept 2005, Lisbon.
[2] Xuan Zhu, Claude Barras, Sylvain Meignier, Jean-Luc Gauvain (2005), Combining speaker identification and BIC for speaker diarization, In: Interspeech'05, ISCA, Sept 2005, Lisbon.
[3] S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.-F. Bonastre, G. Gravier, The ESTER phase II evaluation campaign for the rich transcription of French broadcast news, In: Proceedings of European Conference on Speech Communication and Technology (ISCA, Eurospeech 05), Lisbon, Portugal, September 2005, pp. 1149-1152.

 