Phonexia Product Portfolio
Turning Voice into Knowledge
TABLE OF CONTENTS
About Phonexia
Phonexia Speaker Identification
Phonexia Language Identification
Phonexia Gender Identification
Phonexia Keyword Spotting
Phonexia Speech Transcription
Phonexia Speaker Diarization
Phonexia Voice Activity Detection
Phonexia Speech Quality Estimation
Phonexia Age Estimation
Phonexia Voice Inspector
Phonexia Denoiser
Integration Possibilities and Licensing
About Phonexia
Customers and Partners
Phonexia Products
Phonexia transforms voice into knowledge with its
innovative speech analytics and voice biometrics
technologies. Its Phonexia Speech Platform is
the first on the market using exclusively deep
neural networks to allow speaker identification
with extremely accurate and fast results. The
Phonexia Speech Platform packs a wide range of
speech technologies into a single, highly modular
platform that is easy to integrate with other
solutions. Phonexia innovation is available through
its network of integration partners. A university
spin-off, Phonexia has been delivering its
technologies to call centers, financial institutions
and security agencies in more than 60 countries
since 2006.
Phonexia Voice Biometrics helps identify or
search for a speaker by comparing their voice with
a previously created voiceprint.
Similar to a fingerprint, Phonexia voice biometrics
can be used for voice authentication, fraud
prevention or speaker search.
Phonexia Speech Analytics provides ready-to-analyze
data on speech content using either full
Speech Transcription, Keyword Spotting
(a phonetically based keyword search)
or Language Identification.
Phonexia Voice Inspector is an out-of-the-box
solution providing police forces and
forensic experts with a highly accurate Speaker
Identification tool that supports criminal
investigations.
Phonexia Denoiser software removes reverberation
and other noise from audio signals to make
speech more audible to the human ear.
14 YEARS ON THE MARKET
BASED IN CZ, THE EUROPEAN UNION
PROJECTS IN 60 COUNTRIES
Phonexia Speaker Identification
Output
XML/JSON format with all results or results
files with a log likelihood ratio (-∞;∞) and/or
percentage metric scoring <0–100%>
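How the two score types relate is not stated here. As an illustration only, a log-likelihood ratio can be mapped to a 0–100% figure with a logistic function, which equals the posterior probability when both hypotheses are equally likely beforehand and the ratio uses the natural logarithm. Both of those conditions, and the name `llr_to_percent`, are assumptions made for this sketch and do not describe Phonexia's own calibration.

```python
import math


def llr_to_percent(llr: float) -> float:
    """Map a natural-log likelihood ratio to a 0-100 score.

    Assumes equal prior odds. This is a generic logistic mapping,
    not a documented Phonexia formula.
    """
    # Branch on the sign so that exp() never receives a large positive value.
    if llr >= 0:
        probability = 1.0 / (1.0 + math.exp(-llr))
    else:
        e = math.exp(llr)
        probability = e / (1.0 + e)
    return 100.0 * probability


print(llr_to_percent(0.0))  # 50.0
```

A ratio of 0 means the evidence favours neither hypothesis, which is why it lands on 50%.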
Accuracy and speed
Phonexia provides several technology models
optimized for different use cases.
The most precise model (XL4) is optimized for best
performance on a short speech signal. In this way,
the speaker can be verified through 3 seconds
of net speech with more than 92% accuracy.
With a longer speech signal, the accuracy can
reach 97% without calibration. This result can
be improved through calibration within the
customer’s environment. The model runs
approximately 64 times faster than real time for a
typical call center call (one channel).
The accuracy was measured on the NIST SRE16
dataset.
Technology
• A calibration tool for even higher accuracy
• 1:1 (verification), 1:n and n:m (identification)
comparison possible
• The technology is language-, accent-, text-,
and channel-independent
• Uses deep neural networks to generate highly
representative voiceprints
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
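The comparison modes listed above can be pictured with a small sketch. It treats voiceprints as plain numeric vectors and scores them with cosine similarity. Cosine similarity is a stand-in chosen for illustration, not the scoring method used by the product, and all names and vectors below are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(probe: list[float], references: dict[str, list[float]]) -> str:
    """1:n identification: return the reference most similar to the probe."""
    return max(references, key=lambda name: cosine(probe, references[name]))


def score_matrix(probes: dict[str, list[float]],
                 references: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """n:m comparison: score every probe against every reference."""
    return {
        (p_name, r_name): cosine(p_vec, r_vec)
        for p_name, p_vec in probes.items()
        for r_name, r_vec in references.items()
    }


# Toy two-dimensional "voiceprints"; real voiceprints are much larger.
references = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
print(best_match([0.9, 0.1], references))  # speaker_a
```

A 1:1 verification is the single-cell case of the same matrix, compared against a decision threshold.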
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
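A recording can be checked against the 8 kHz+ requirement before it is submitted. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV only, so A-law, Mu-law, ADPCM, FLAC and OPUS files are outside its scope. The function name is hypothetical.

```python
import wave

MIN_SAMPLE_RATE_HZ = 8000  # "8 kHz+ sampling" from the input requirements


def wav_meets_minimum(path: str) -> bool:
    """Return True for an 8- or 16-bit PCM WAV file sampled at 8 kHz or more."""
    with wave.open(path, "rb") as wav:
        sample_width_bytes = wav.getsampwidth()  # 1 byte = 8 bits, 2 bytes = 16 bits
        sample_rate_hz = wav.getframerate()
    return sample_width_bytes in (1, 2) and sample_rate_hz >= MIN_SAMPLE_RATE_HZ
```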
Minimum speech signal for enrolment:
recommended 20+ secs
Minimum speech signal for identification:
recommended 3+ secs
In specific use cases the time required for
the speaker enrolment and identification can
be much shorter.
Phonexia Speaker Identification uses the power of voice biometrics to recognize a speaker automatically by their voice. Its latest generation, called Deep Embeddings™, uses deep neural networks for even greater performance.
Voice Biometrics
Phonexia Language Identification
Technology
• The technology is text- and channel-independent
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Supported languages
Oromo, Albanian, Amharic, Arabic_Egypt,
Arabic_Gulf, Arabic_Iraqi, Arabic_Levantine,
Arabic_Maghrebi, Arabic_MSA, Assamese, Azerbaijani,
Bangla_Bengali, Basque, Belarusian, Bulgarian,
Burmese, Catalan, Cebuano, Chinese_Cantonese,
Chinese_Mandarin, Chinese_Min_Nan, Chinese_Wu,
Chuvash, Czech, Dari, Dutch, English_American,
English_British, English_Indian, Estonian, Farsi,
French, Georgian, German, German_Switzerland,
Greek, Guarani, Haitian_Creole, Hausa, Hindi,
Hungarian, Indonesian, Italian, Japanese, Kazakh,
Khmer, Kirundi_Kinyarwanda, Korean, Kurdish, Lao,
Lithuanian, Luxembourgish, Macedonian, Ndebele,
Pashto, Polish, Portuguese, Punjabi, Romanian,
Russian, Serbo-Croat-Bosnian, Shona, Slovak,
Slovenian, Somali, Spanish_American,
Spanish_European, Swahili, Swedish, Tagalog, Tamil,
Telugu, Thai, Tibetan, Tigrignya, Tok_Pisin, Turkish,
Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Zulu
The Phonexia Language Identification (LID) system allows the automatic detection of spoken language or dialect.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Minimum speech signal for identification:
recommended 7+ secs
Output
XML/JSON format with all results or results files
with a logarithm of probabilities scoring (-∞;0>
and/or percentage metric scoring <0-100%>
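The two scoring forms can be related as follows. This is a minimal sketch that assumes the scores are natural-log probabilities and that the percentages are renormalized over the languages scored. Neither assumption is confirmed here, and the function name is hypothetical.

```python
import math


def log_probs_to_percent(log_probs: dict[str, float]) -> dict[str, float]:
    """Convert per-language log-probabilities (each <= 0) to percentages summing to 100."""
    # Subtract the largest value before exponentiating for numerical stability.
    peak = max(log_probs.values())
    weights = {lang: math.exp(lp - peak) for lang, lp in log_probs.items()}
    total = sum(weights.values())
    return {lang: 100.0 * w / total for lang, w in weights.items()}
```

A log-probability of 0 corresponds to certainty, which is why the interval is closed at 0 and open toward -∞.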
Processing speed
Approx. 20× faster than real-time processing
on 1 CPU core with the most precise model, i.e.,
a standard 1 CPU core server processes 480 hours
of audio in one day of computing time.
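The daily figure follows from multiplying the speed factor by 24 hours. The short sketch below reproduces that arithmetic. The `cpu_cores` parameter assumes throughput scales linearly with the number of cores, which is a simplification not stated in this document.

```python
HOURS_PER_DAY = 24


def audio_hours_per_day(speed_factor: float, cpu_cores: int = 1) -> float:
    """Hours of audio processed in one day at `speed_factor` times real time per core."""
    return speed_factor * HOURS_PER_DAY * cpu_cores


print(audio_hours_per_day(20))  # 480
```

The same calculation gives the daily volumes quoted for the other technologies, for example 200 × 24 = 4,800 hours.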
A user can add new languages to the system;
no assistance from Phonexia is necessary.
Approx. 20 hours of audio recordings
recommended for new language training.
Technology
• Uses the acoustic characteristics of speech
• Speech is converted to frequency spectra
and modeled with advanced statistical
methods
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Minimum speech signal for identification:
recommended 7+ secs
Output
XML/JSON format with all results or results files
with processed information (scores for male
and female)
Phonexia Gender Identification
Phonexia Gender Identification (GID) automatically recognizes the gender of a speaker.
Processing speed
Approx. 200× faster than real-time processing
on 1 CPU core with the most precise
model, i.e., a standard 1 CPU core server
processes 4,800 hours of audio in one day
of computing time.
Technology
• Robust acoustic-based technology, even with
noisy recordings
• Keywords are automatically converted into
phonemes and searched for
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted). List of keywords or key
phrases to be searched for.
Output
XML/JSON format with all results or results files
generated with detected keywords (containing the
keyword, start/end time, path, probability, etc.)
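Results of this kind are typically filtered by probability before review. The sketch below shows one way to do that. The JSON layout and field names are hypothetical, because the actual result schema is not given in this document, and the sample values are invented for illustration.

```python
import json

# Hypothetical result layout; the real field names may differ.
SAMPLE_RESULT = """
[
  {"keyword": "invoice", "start": 12.4, "end": 12.9, "probability": 0.91},
  {"keyword": "refund",  "start": 40.0, "end": 40.6, "probability": 0.42}
]
"""


def confident_hits(result_json: str, min_probability: float) -> list[dict]:
    """Keep detections whose probability is at least `min_probability`."""
    detections = json.loads(result_json)
    return [d for d in detections if d["probability"] >= min_probability]


for hit in confident_hits(SAMPLE_RESULT, 0.5):
    print(f'{hit["keyword"]}: {hit["start"]}-{hit["end"]} s')
```

Raising the threshold reduces false alarms at the cost of missing some true occurrences.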
Phonexia Keyword Spotting
Phonexia Keyword Spotting (KWS) identifies the occurrences of keywords and/or keyphrases in audio recordings.
Processing speed
The 5th generation is approximately 30× faster
than real-time processing on 1 CPU core, i.e.,
a standard 1 CPU core server processes 720 hours
of audio in one day of computing time.
The 4th generation is approximately 10× faster
than real-time processing on 1 CPU core.
Supported languages
Language                   Code    Note
Arabic (Levantine)         ar-XL   5th Gen.
Arabic (Gulf)              ar-KW   4th Gen.
Chinese                    zh-CN   4th Gen. – Beta
Croatian                   hr-HR   5th Gen.
Czech                      cs-CZ   5th Gen.
Dutch                      nl-NL   5th Gen.
English UK                 en-UK   4th Gen.
English US                 en-US   5th Gen.
Farsi                      fa-IR   4th Gen. – Beta
French                     fr-FR   4th Gen.
German                     de-DE   4th Gen.
Hungarian                  hu-HU   4th Gen. – Beta
Italian                    it-IT   4th Gen.
Pashtu                     ps-AR   4th Gen.
Polish                     pl-PL   5th Gen.
Russian                    ru-RU   5th Gen.
Slovak                     sk-SK   5th Gen.
Spanish – Latin America    es-LA   5th Gen.
Swedish                    sw-SE   5th Gen.
Turkish                    tr-TR   4th Gen. – Beta
A user can add an unlimited number of keywords
to the system, as well as an unlimited number of
pronunciation variants for each keyword.
Technology
• In the fifth generation a Language Model
Customization tool is available for the optional
addition of desired words to the model
• Trained with an emphasis on spontaneous
telephone conversation
• Based on state-of-the-art techniques for
acoustic modeling, including discriminative
training and neural network-based features
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Processing speed
The 5th generation is approximately 7× faster than
real-time processing on 1 CPU core, i.e., a standard
1 CPU core server processes 168 hours of audio in
one day of computing time. The 4th generation is
approximately 1.2× faster than real-time processing.
Phonexia Speech Transcription
Phonexia Speech Transcription (STT) converts speech signals into plain text.
Output
XML/JSON format with all results or results
files with:
• One-best transcription
i.e., a file with a time-aligned speech transcript
(the time of the words’ start and end)
• n-best transcription
i.e., a confusion network with hypotheses for
words at each moment
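The relationship between the two outputs can be sketched briefly: choosing the most probable hypothesis in every slot of a confusion network yields a one-best word sequence. The data layout below is hypothetical and the sample words are invented. The actual result format is not described in this document.

```python
# A confusion network as a list of time slots; each slot holds competing
# (word, probability) hypotheses. This layout is an assumption.
ConfusionNetwork = list[list[tuple[str, float]]]


def one_best(network: ConfusionNetwork) -> list[str]:
    """Pick the most probable word in every non-empty slot."""
    return [max(slot, key=lambda hyp: hyp[1])[0] for slot in network if slot]


network = [
    [("please", 0.8), ("police", 0.2)],
    [("hold", 0.6), ("old", 0.4)],
]
print(" ".join(one_best(network)))  # please hold
```

Keeping the lower-ranked hypotheses is useful for search, since a word missed in the one-best transcript may still appear as an alternative.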
Supported languages
Language                   Code    Note
Arabic (Levantine)         ar-XL   5th Gen.
Arabic (Gulf)              ar-KW   4th Gen. – Beta
Chinese                    zh-CN   4th Gen. – Beta
Croatian                   hr-HR   5th Gen.
Czech                      cs-CZ   5th Gen.
Dutch                      nl-NL   5th Gen.
English UK                 en-UK   4th Gen.
English US                 en-US   5th Gen.
Farsi                      fa-IR   4th Gen. – Beta
French                     fr-FR   4th Gen.
German                     de-DE   4th Gen.
Italian                    it-IT   4th Gen.
Polish                     pl-PL   5th Gen.
Russian                    ru-RU   5th Gen.
Slovak                     sk-SK   5th Gen.
Spanish – Latin America    es-LA   5th Gen.
Swedish                    sw-SE   5th Gen.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with segmentation of speech, silence, and
technical signals (i.e., elimination of phone line
beeps, DTMF tones, music, etc.)
Audio file extracted for each speaker
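A segmentation of this kind lends itself to simple summaries, such as the total speaking time per speaker. The sketch below assumes each segment is a `(speaker, start_seconds, end_seconds)` tuple. That layout is hypothetical, as the real output schema is not described here.

```python
from collections import defaultdict


def speech_seconds_per_speaker(
    segments: list[tuple[str, float, float]],
) -> dict[str, float]:
    """Total the duration of all segments attributed to each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)


segments = [("A", 0.0, 5.0), ("B", 5.0, 8.0), ("A", 8.0, 10.0)]
print(speech_seconds_per_speaker(segments))  # {'A': 7.0, 'B': 3.0}
```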
Phonexia Speaker Diarization
Phonexia Speaker Diarization (DIAR) enables segmentation of voices in a single mono-channel audio recording.
Processing speed
Approx. 50× faster than real-time processing
on 1 CPU core with the most precise model,
i.e., a standard 1 CPU core server processes
1,200 hours of audio in one day of computing time.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with segmentation of speech, silence, and
technical signals (i.e., elimination of phone line
beeps, DTMF tones, music, etc.)
Audio file extracted for each speaker
Phonexia Voice Activity Detection
Phonexia Voice Activity Detection (VAD) identifies parts of audio recordings with speech content vs. non-speech content.
Processing speed
Approx. 50× faster than real-time processing
on 1 CPU core with the most precise
model, i.e., a standard 1 CPU core server
processes 1,200 hours of audio in one day of
computing time.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results files
with labels (speech vs. non-speech segments)
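Speech and non-speech labels make it possible to measure the net speech in a recording and compare it with the recommended minimums given for Speaker Identification (20+ seconds for enrolment, 3+ seconds for identification). The sketch below assumes each label is a `(label, start_seconds, end_seconds)` tuple with the label text `"speech"`. That layout and the function names are hypothetical.

```python
# Recommended minimums taken from the Speaker Identification input section.
RECOMMENDED_MIN_SECONDS = {"enrolment": 20.0, "identification": 3.0}


def net_speech_seconds(labels: list[tuple[str, float, float]]) -> float:
    """Sum the duration of all segments labelled as speech."""
    return sum(end - start for label, start, end in labels if label == "speech")


def enough_speech(labels: list[tuple[str, float, float]], task: str) -> bool:
    """Check the net speech against the recommended minimum for `task`."""
    return net_speech_seconds(labels) >= RECOMMENDED_MIN_SECONDS[task]


labels = [("speech", 0.0, 2.5), ("non-speech", 2.5, 4.0), ("speech", 4.0, 6.0)]
print(net_speech_seconds(labels))               # 4.5
print(enough_speech(labels, "identification"))  # True
print(enough_speech(labels, "enrolment"))       # False
```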
Phonexia Speech Quality Estimation
Phonexia Speech Quality Estimator (SQE) measures the quality parameters of the speech in an audio recording.
Processing speed
Approx. 150× faster than real-time processing
on 1 CPU core with the most precise
model, i.e., a standard 1 CPU core server
processes 3,600 hours of audio in one day
of computing time.
Technology
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with:
• Global score
i.e., a percentage expression of audio quality
(range <0; 100>); by default, the global score
is calculated from the waveform_n_bits
and waveform_snr variables
• Detailed outputs
i.e., clipped signal, amplitude, sample values,
sampling frequency, SNR, technical signal,
encoding, etc.
Phonexia Age Estimation
Phonexia Age Estimation (AGE) estimates the age of a speaker from an audio recording.
Processing speed
Approx. 2,000× faster than real-time
processing on 1 CPU core with the most precise
model, i.e., a standard 1 CPU core server
processes 48,000 hours of audio in one day
of computing time.
Phonexia Voice Inspector
Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with highly accurate, AI-powered automatic speaker recognition to support criminal investigations.
Technology
• Deep Embeddings™ - uses deep neural
networks to generate highly representative
voiceprints
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatibility with the widest range of audio
sources possible: GSM/CDMA, 3G, VoIP,
landlines, etc.
• Independent of language, accent, text and
channel
Input
• WAV (8 or 16 bits linear coding), A-law and
Mu-law, PCM, 8 kHz+ sampling
• 7 seconds recommended minimum speech
signal duration for a questioned recording
• 20 seconds recommended minimum speech
signal duration for a suspected speaker
Features and Benefits
• 1:1 speaker comparison in accordance with
ENFSI guidelines
• 1:N speaker identification for more
complex cases
• Automatic Forensic Voice Comparison
• A diarization tool to make working with audio
recordings containing multiple speakers easier
• A phoneme recognizer for the searching and
visualization of the same phoneme sequences
across multiple audio files
• An evaluation tool for the measurement of
accuracy in a user’s data sets
• A waveform editor with tools such as
a spectrum panel, voice activity detection
and more
• Easy management of investigation cases
Output
• Scoring to a likelihood ratio (LR), log-likelihood
ratio (LLR) and verbal presentation of results
• Graphic presentation of the likelihood ratio (LR)
• Detailed report output (expert opinion
template automatically generated) for
presentation of results (to a court or an
investigation team)
Phonexia Voice Inspector User Interface
A visualization of scores from a sample case
Phonexia Denoiser
Phonexia Denoiser software removes reverberation and other noise from audio signals to make speech more audible to the human ear.
Technology
• Denoiser is distributed as a part of Phonexia
Speech Engine and is accessible via REST
API. Its algorithms use deep neural networks
to achieve the automatic cleaning and
reconstruction of the processed audio signals.
Removing noises and enhancing the speech
signal provide better audibility and the ability
to understand the speech content. For each
denoised file, information is provided about the
difference of the signal-to-noise ratio to indicate
the improvement in the signal achieved by the
process of denoising.
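The reported difference can be read as a change in decibels. As a minimal sketch, the signal-to-noise ratio is ten times the base-10 logarithm of the ratio of signal power to noise power, and the improvement is the value after denoising minus the value before. How the Denoiser itself estimates these powers is not described here, and the function names and sample numbers below are hypothetical.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from average powers."""
    return 10.0 * math.log10(signal_power / noise_power)


def snr_improvement_db(snr_before_db: float, snr_after_db: float) -> float:
    """Positive values mean the processed signal is cleaner than the original."""
    return snr_after_db - snr_before_db


before = snr_db(signal_power=10.0, noise_power=1.0)    # 10 dB
after = snr_db(signal_power=10.0, noise_power=0.1)     # 20 dB
print(round(snr_improvement_db(before, after), 6))     # 10.0
```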
Input
• A WAVE (*.wav) container including any of the
following:
• signed 8-bit PCM (s8)
• signed 16-bit PCM (s16le)
• IEEE float 32-bit (f32le)
• IEEE float 64-bit (f64le)
• A-law (alaw)
• µ-law (mulaw)
• ADPCM
• FLAC codec inside a FLAC (*.flac) container
• OPUS codec inside an OGG (*.opus) container
Output
• A RAW or WAV audio file (8 or 16 bits)
The processed audio is to be listened to and
examined by an expert and is not to be used as an
input for other automatic processing.
Interfaces
• REST API interface
• Command line interface
• Graphical user interface (GUI) for evaluation
Supported OS
• Windows 64 bit (x86_64)
• Linux 64 bit (x86_64)
Licensing options
• USB dongle licensing key (offline license,
on-premises installation)
• HW profile licensing key (offline license,
on-premises installation)
• Licensing server (offline license, on-premises
installation, used for HA)
• NET-based license (for demo purposes)
Integration Possibilities and Licensing
Phonexia offers multiple integration and licensing possibilities, as well as custom development.
Recommended hardware
For the production system, a 64-bit server
processor with a large L3 cache is recommended
(the larger, the better), for example,
the Intel® Xeon® processors E5/E7/Gold/
Platinum or Intel® Core™ processors i5/i7/i9.
Phonexia technologies also work in a virtualized
environment.
An advanced consultation on hardware
configuration will be provided upon a specific
deployment request.
Customization
Phonexia provides research and development
services such as speech technology optimization
for target channels, development of new language
versions, etc. Phonexia also offers multiple
engines balancing speed and accuracy according
to the specific use case. Contact our team for
more details.
More information
To learn more about Phonexia technologies,
please do not hesitate to
contact us at [email protected]
V-2020-10
Phonexia s.r.o.
+420 511 205 265 [email protected] Chaloupkova 3002/1a, 612 00 Brno, Czech Republic, European Union
phonexia.com