Phonexia Product Portfolio
Turning Voice into Knowledge
TABLE OF CONTENTS
About Phonexia
Phonexia Speaker Identification
Phonexia Language Identification
Phonexia Gender Identification
Phonexia Keyword Spotting
Phonexia Speech Transcription
Phonexia Speaker Diarization
Phonexia Voice Activity Detection
Phonexia Speech Quality Estimation
Phonexia Age Estimation
Phonexia Voice Inspector
Phonexia Denoiser
Integration Possibilities and Licensing
About Phonexia
Customers and Partners
Phonexia Products
Phonexia transforms voice into knowledge with its
innovative speech analytics and voice biometrics
technologies. Its Phonexia Speech Platform is
the first on the market using exclusively deep
neural networks to allow speaker identification
with extremely accurate and fast results. The
Phonexia Speech Platform packs a wide range of
speech technologies into a single, highly modular
platform that is easy to integrate with other
solutions. Phonexia innovation is available through
its network of integration partners. A university
spin-off, Phonexia has been delivering its
technologies to call centers, financial institutions
and security agencies in more than 60 countries
since 2006.
Phonexia Voice Biometrics identifies or searches for
a speaker by comparing speech against a previously
created voiceprint. Like a fingerprint, a voiceprint
can be used for voice authentication, fraud
detection, speaker search, or speaker spotting.
Phonexia Speech Analytics provides ready-to-analyze
data on speech content using full Speech
Transcription, Keyword Spotting (a phonetically
based keyword search), or Language Identification.
Phonexia Voice Inspector is an out-of-the-box
solution providing police forces and
forensic experts with a highly accurate Speaker
Identification tool that supports criminal
investigations.
Phonexia Denoiser software cleans the audio
signals of reverberation and other noises to make
them more audible to the human ear.
13 YEARS ON THE MARKET
BASED IN CZ, THE EUROPEAN UNION
PROJECTS IN 60 COUNTRIES
Phonexia Speaker Identification
Output
XML/JSON format with all results or results
files with a log-likelihood ratio (−∞; +∞) and/or
percentage metric scoring <0–100%>
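The brochure does not say how the log-likelihood ratio relates to the percentage score. As a minimal sketch, assuming a natural-log ratio and equal prior odds, Bayes' rule gives a posterior probability that can be shown as a percentage. The function name `llr_to_percent` is hypothetical and this is not necessarily the mapping the product applies.

```python
import math

def llr_to_percent(llr: float) -> float:
    """Map a natural-log likelihood ratio to a 0-100 % score.

    Assumes equal prior odds, so the result is the Bayes posterior.
    This is an illustrative mapping, not the vendor's documented one.
    """
    # Numerically stable logistic function for large |llr|.
    if llr >= 0:
        posterior = 1.0 / (1.0 + math.exp(-llr))
    else:
        e = math.exp(llr)
        posterior = e / (1.0 + e)
    return 100.0 * posterior
```

Under these assumptions an LLR of 0 (the evidence favours neither hypothesis) maps to 50%, and large positive or negative values approach 100% or 0%.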
Accuracy and speed
Achieves more than 99% accuracy (0.96% Equal
Error Rate based on NIST evaluation data set).
Up to 182× faster than real-time processing
on 1 CPU core with the most precise model—
for example, a standard 8 CPU core server
processes up to 28,992 hours of audio in one day
of computing time.
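The Equal Error Rate quoted above is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be measured on a user's own trial scores follows; the function name and the simple threshold sweep are illustrative assumptions, not the NIST scoring tool.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) from same-speaker and different-speaker scores.

    Sweeps every observed score as a candidate threshold and keeps the
    one where false acceptance and false rejection rates are closest.
    """
    best_gap, best_eer, best_thr = float("inf"), None, None
    for thr in sorted(set(genuine) | set(impostor)):
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        # False acceptance: impostor trials scoring at or above it.
        far = sum(s >= thr for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr
```

With only a handful of trials the estimate is coarse; evaluation sets normally contain many thousands of comparisons.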
Technology
• A calibration tool for even higher accuracy
• 1:1 (verification), 1:n and n:m (identification)
comparison possible
• The technology is language-, accent-, text-,
and channel-independent
• Uses deep neural networks to generate highly
representative voiceprints
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Minimum speech signal for enrolment:
recommended 20+ secs
Minimum speech signal for identification:
recommended 7+ secs
In specific use cases the time required for
the speaker enrolment and identification can
be much shorter.
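The input requirements above (8 kHz+ sampling, recommended minimum durations) can be checked before a file is submitted. The sketch below uses Python's standard `wave` module, which reads PCM WAV files only, so A-law, Mu-law, FLAC, and OPUS inputs would need another reader. The helper names are hypothetical, and file duration is only an upper bound on the net speech the file contains.

```python
import wave

MIN_SAMPLE_RATE_HZ = 8000  # "8 kHz+ sampling" from the input specification

def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return rate, duration

def meets_recommendation(path, min_seconds):
    """True if the file is sampled at 8 kHz+ and lasts at least min_seconds."""
    rate, duration = describe_wav(path)
    return rate >= MIN_SAMPLE_RATE_HZ and duration >= min_seconds
```

For example, `meets_recommendation(path, 20)` would screen enrolment recordings and `meets_recommendation(path, 7)` identification recordings.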
Phonexia Speaker Identification uses the power of voice biometrics to recognize a speaker automatically by their voice. Its latest generation, called Deep Embeddings™, uses deep neural networks for even greater performance.
Voice Biometrics
Phonexia Language Identification
Technology
• The technology is text- and channel-independent
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Supported languages
Afan_Oromo, Albanian, Amharic, Arabic, Arabic_Gulf,
Arabic_Iraqi, Arabic_Levantine, Arabic_Maghrebi, Arabic_MSA,
Azerbaijani, Bangla_Bengali, Bosnian, Burmese, Chinese_Cantonese,
Chinese_Dialects, Chinese_Mandarin, Creole, Croatian, Czech, Dari,
English_American, English_British, English_Indian, Farsi, French,
Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian,
Indonesian, Italian, Japanese, Khmer, Kirundi_Kinyarwanda, Korean,
Lao, Macedonian, Ndebele, Pashto, Polish, Portuguese, Punjabi,
Russian, Serbian, Shona, Slovak, Somali, Spanish, Swahili, Swedish,
Tagalog, Tamil, Thai, Tibetan, Tigrigna, Turkish, Ukrainian, Urdu,
Uzbek, Vietnamese
Users can add new languages to the system
without assistance from Phonexia.
Approx. 20 hours of audio recordings are
recommended for training a new language.
The Phonexia Language Identification (LID) system allows the automatic detection of spoken language or dialect.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Minimum speech signal for identification:
recommended 7+ secs
Output
XML/JSON format with all results or results files
with a logarithm of probabilities scoring (−∞; 0>
and/or percentage metric scoring <0–100%>
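The two scoring scales can be related with a short calculation. The sketch below assumes each language receives one log score no greater than zero and that the percentages are obtained by exponentiating and normalising over all candidate languages. That normalisation and the function name are assumptions for illustration; the brochure does not document the exact conversion.

```python
import math

def log_scores_to_percent(log_scores):
    """Normalise per-language log scores (each <= 0) into percentages."""
    peak = max(log_scores.values())
    # Subtract the peak before exponentiating to avoid underflow.
    weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    total = sum(weights.values())
    return {lang: 100.0 * w / total for lang, w in weights.items()}
```

A language whose log score is close to 0 dominates the result, while strongly negative scores shrink towards 0%.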
Processing speed examples
Approx. 20x faster than real-time processing
on 1 CPU core with the most precise model;
for example, a standard 8 CPU core server processes
3,840 hours of audio in 1 day of computing time.
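The daily throughput figure follows from the speed factor by simple arithmetic: 20 × 8 cores × 24 hours = 3,840 hours of audio. The sketch below makes that calculation reusable for capacity planning. It assumes throughput scales linearly with the number of cores, which matches this figure but is not guaranteed on every deployment; the function names are illustrative.

```python
import math

def audio_hours_per_day(speed_factor, cpu_cores, hours_per_day=24):
    """Hours of audio processed per day, assuming linear scaling across cores."""
    return speed_factor * cpu_cores * hours_per_day

def cores_needed(audio_hours, speed_factor, hours_per_day=24):
    """Smallest whole number of cores that clears audio_hours within one day."""
    return math.ceil(audio_hours / (speed_factor * hours_per_day))
```

For instance, `cores_needed(10000, 20)` estimates the cores required to process 10,000 hours of audio per day at a 20x speed factor.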
Phonexia Gender Identification
Phonexia Gender Identification (GID) automatically recognizes the gender of a speaker.
Technology
• Uses the acoustic characteristics of speech
• Speech is converted to frequency spectra
and modeled with advanced statistical
methods
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Minimum speech signal for identification:
recommended 7+ secs
Output
XML/JSON format with all results or results files
with processed information (scores for male
and female)
Processing speed
Approx. 200x faster than real-time processing
on 1 CPU core; for example, a standard 8 CPU core
server processes 38,400 hours of audio in 1 day of
computing time.
Phonexia Keyword Spotting
Phonexia Keyword Spotting (KWS) identifies the occurrences of keywords and/or key phrases in audio recordings.
Technology
• Robust acoustic-based technology, even with
noisy recordings
• Keywords are automatically converted into
phonemes and searched for
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted). List of keywords or key
phrases to be searched for.
Output
XML/JSON format with all results or results files
generated with detected keywords (containing the
keyword, start/end time, path, probability, etc.)
Processing speed
The 5th generation is approximately 30x faster
than real-time processing on 1 CPU core;
for example, a standard 8 CPU core server
processes 5,760 hours of audio in one day of
computing time.
The 4th generation is approximately 10x faster
than real-time processing on 1 CPU core.
Supported languages
Language                  Code    Note
Arabic                    ar-KW   4th Gen.
Chinese                   zh-CN   4th Gen. – Beta
Croatian                  hr-HR   4th Gen.
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Hungarian                 hu-HU   4th Gen. – Beta
Italian                   it-IT   4th Gen.
Pashtu                    ps-AR   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.
Turkish                   tr-TR   4th Gen. – Beta
A user can add an unlimited number of keywords
to the system, as well as an unlimited number of
pronunciation variants for each keyword.
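The idea of keywords with several pronunciation variants, converted to phonemes and searched for, can be sketched as follows. The dictionary, the phoneme symbols, and the function name are hypothetical, and the search is an exact match over an already decoded phoneme sequence. The product works acoustically and returns probabilities, so this is a simplified illustration only.

```python
# Hypothetical pronunciation dictionary: each keyword maps to one or more
# phoneme sequences (pronunciation variants). The symbols are illustrative.
PRONUNCIATIONS = {
    "data": [["d", "ey", "t", "ah"], ["d", "ae", "t", "ah"]],
    "router": [["r", "uw", "t", "er"], ["r", "aw", "t", "er"]],
}

def find_keyword(keyword, phonemes, pronunciations=PRONUNCIATIONS):
    """Return sorted (start, end) index pairs where any variant of keyword occurs."""
    hits = set()
    for variant in pronunciations[keyword]:
        n = len(variant)
        # Slide a window of the variant's length over the phoneme stream.
        for start in range(len(phonemes) - n + 1):
            if phonemes[start:start + n] == variant:
                hits.add((start, start + n))
    return sorted(hits)
```

Adding a new pronunciation variant is then a matter of appending another phoneme list to the keyword's entry.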
Phonexia Speech Transcription
Phonexia Speech Transcription (STT) converts speech signals into plain text.
Technology
• In the fifth generation a Language Model
Customization tool is available for the optional
addition of desired words to the model
• Trained with an emphasis on spontaneous
telephone conversation
• Based on state-of-the-art techniques for
acoustic modeling, including discriminative
training and neural network-based features
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with:
• One-best transcription
i.e., a file with a time-aligned speech transcript
(the start and end time of each word)
• n-best transcription
i.e., a confusion network with hypotheses for
words at each moment
Processing speed
The 5th generation is approximately 7x faster than
real-time processing on 1 CPU core; for example, a
standard 8 CPU core server processes 1,344 hours of
audio in one day of computing time.
The 4th generation is approximately 1.2x faster than
real-time processing on 1 CPU core.
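The relationship between the n-best confusion network and the one-best transcription can be illustrated with a small sketch. It assumes the network is a list of time-ordered slots, each holding (word, probability) pairs, with an empty string meaning "no word in this slot". This data layout and the function name are assumptions for illustration, not the product's output schema.

```python
def one_best(confusion_network):
    """Pick the most probable word in each slot of a confusion network.

    Each slot is a list of (word, probability) pairs; the empty string
    stands for "no word here" and is dropped from the transcript.
    """
    words = []
    for slot in confusion_network:
        word, _ = max(slot, key=lambda pair: pair[1])
        if word:
            words.append(word)
    return " ".join(words)
```

Keeping the full network rather than only the one-best result lets a downstream search also match lower-ranked hypotheses.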
Supported languages
Language                  Code    Note
Arabic                    ar-KW   4th Gen. – Beta
Chinese                   zh-CN   4th Gen. – Beta
Czech                     cs-CZ   5th Gen.
Dutch                     nl-NL   5th Gen.
English UK                en-UK   4th Gen.
English US                en-US   5th Gen.
Farsi                     fa-IR   4th Gen. – Beta
French                    fr-FR   4th Gen.
German                    de-DE   4th Gen.
Italian                   it-IT   4th Gen.
Polish                    pl-PL   5th Gen.
Russian                   ru-RU   5th Gen.
Slovak                    sk-SK   5th Gen.
Spanish – Latin America   es-LA   5th Gen.
Phonexia Speaker Diarization
Phonexia Speaker Diarization (DIAR) segments the voices of individual speakers in a single-channel (mono) audio recording.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with segmentation of speech, silence, and
technical signals (i.e., elimination of phone line
beeps, DTMF tones, music, etc.)
Audio file extracted for each speaker
Processing speed
Approx. 50x faster than real-time processing on
1 CPU core; for example, a standard 8 CPU core
server processes 9,600 hours of audio in 1 day of
computing time.
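Once a recording has been split by speaker, a common next step is to total how long each speaker talked. The sketch below assumes the diarization result has been parsed into (start, end, speaker) tuples in seconds. The tuple layout, the speaker labels, and the function name are hypothetical and do not reflect the product's XML/JSON schema.

```python
from collections import defaultdict

def talk_time(segments):
    """Sum segment durations per speaker; segments are (start_s, end_s, speaker)."""
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)
```

The per-speaker totals can then be compared with the recommended minimum speech lengths before passing each speaker's audio to another technology.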
Phonexia Voice Activity Detection
Phonexia Voice Activity Detection (VAD) identifies parts of audio recordings with speech content vs. non-speech content.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results files
with labels (speech vs. non-speech segments)
Processing speed
Approx. 150x faster than real-time processing
on 1 CPU core; for example, a standard 8 CPU core
server processes 28,800 hours of audio in 1 day of
computing time.
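Speech vs. non-speech labels are typically reported as segments rather than per-frame decisions. As a minimal sketch, assuming a detector that yields one boolean flag per fixed-length frame, consecutive frames with the same flag can be collapsed into labelled segments. The frame length, the label strings, and the function name are assumptions for illustration and say nothing about how the product itself detects speech.

```python
def frames_to_segments(is_speech, frame_seconds=0.01):
    """Collapse per-frame speech flags into (start_s, end_s, label) segments."""
    segments = []
    start = 0
    for i in range(1, len(is_speech) + 1):
        # Close the current run at the end of input or when the label flips.
        if i == len(is_speech) or is_speech[i] != is_speech[start]:
            label = "speech" if is_speech[start] else "non-speech"
            segments.append((start * frame_seconds, i * frame_seconds, label))
            start = i
    return segments
```

Summing the durations of the "speech" segments gives the net speech length of a recording.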
Phonexia Speech Quality Estimation
Phonexia Speech Quality Estimator (SQE) measures the quality parameters of the speech in an audio recording.
Technology
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results
files with:
• Global score
i.e., a percentage expression of audio quality
(range <0; 100>); by default, the global score
is calculated from the waveform_n_bits
and waveform_snr variables
• Detailed outputs
i.e., clipped signal, amplitude, sample values,
sampling frequency, SNR, technical signal,
encoding, etc.
Processing speed
Approx. 2,000x faster than real-time processing
on 1 CPU core; for example, a standard 8 CPU core
server processes 384,000 hours of audio in 1 day of
computing time.
Phonexia Age Estimation
Phonexia Age Estimation (AGE) estimates the age of a speaker from an audio recording.
Technology
• Trained with an emphasis on spontaneous
telephone conversation
• The technology is language-, accent-, text-,
and channel-independent
• Compatible with the widest range of
audio sources possible (applies channel
compensation techniques): GSM/CDMA, 3G,
VoIP, landlines, satphones, etc.
Input
Input format for processing:
WAV or RAW (PCM unsigned 8 or 16 bits, IEEE
float 32-bit, A-law or Mu-law, ADPCM), FLAC,
OPUS; 8 kHz+ sampling (other audio formats
automatically converted)
Output
XML/JSON format with all results or results files
with age estimates
Processing speed
Up to 182× faster than real-time processing
on 1 CPU core with the most precise model—
for example, a standard 8 CPU core server
processes up to 28,992 hours of audio in one day
of computing time.
Phonexia Voice Inspector
Phonexia Voice Inspector is an out-of-the-box solution providing police forces and forensic experts with highly accurate, AI-powered automatic speaker recognition to support criminal investigations.
Technology
• Deep Embeddings™ - uses deep neural
networks to generate highly representative
voiceprints
• Applies state-of-the-art channel compensation
techniques, verified by NIST evaluation
• Compatibility with the widest range of audio
sources possible: GSM/CDMA, 3G, VoIP,
landlines, etc.
• Independent of language, accent, text and
channel
Input
• WAV (8 or 16 bits linear coding), A-law and
Mu-law, PCM, 8 kHz+ sampling
• 7 seconds recommended minimum speech
signal duration for a questioned recording
• 20 seconds recommended minimum speech
signal duration for a suspected speaker
Features and Benefits
• 1:1 speaker comparison in accordance with
ENFSI guidelines
• 1:N speaker identification for more
complex cases
• Automatic Forensic Voice Comparison
• A diarization tool to make working with audio
recordings containing multiple speakers easier
• A phoneme recognizer for the searching and
visualization of the same phoneme sequences
across multiple audio files
• An evaluation tool for the measurement of
accuracy in a user’s data sets
• A waveform editor with tools such as
a spectrum panel, voice activity detection
and more
• Easy management of investigation cases
Output
• Scoring to a likelihood ratio (LR), log-likelihood
ratio (LLR) and verbal presentation of results
• Graphic presentation of the likelihood ratio (LR)
• Detailed report output (expert opinion
template automatically generated) for
presentation of results (to a court or an
investigation team)
Phonexia Voice Inspector User Interface
A visualization of scores from a sample case
Phonexia Denoiser
Phonexia Denoiser software cleans the audio signals of reverberation and other noises to make them more audible to the human ear.
Technology
• Denoiser is distributed as a part of Phonexia
Speech Engine and is accessible via REST
API. Its algorithms use deep neural networks
to achieve the automatic cleaning and
reconstruction of the processed audio signals.
Removing noises and enhancing the speech
signal provide better audibility and the ability
to understand the speech content. For each
denoised file, information is provided about the
difference of the signal-to-noise ratio to indicate
the improvement in the signal achieved by the
process of denoising.
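The signal-to-noise ratio difference reported for each denoised file can be read with the standard decibel definition, SNR = 10 · log10(signal power / noise power). The sketch below assumes the mean signal and noise powers are already known; how Denoiser estimates them is not stated here, and the function names are illustrative.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from mean signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)

def snr_improvement_db(snr_before_db, snr_after_db):
    """Positive values mean the denoised file has a higher SNR."""
    return snr_after_db - snr_before_db
```

On this scale, every 10 dB of improvement corresponds to a tenfold reduction in noise power relative to the speech signal.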
Input
• A WAVE (*.wav) container including any of the
following:
• signed 8-bit PCM (s8)
• signed 16-bit PCM (s16le)
• IEEE float 32-bit (f32le)
• IEEE float 64-bit (f64le)
• A-law (alaw)
• µ-law (mulaw)
• ADPCM
• FLAC codec inside a FLAC (*.flac) container
• OPUS codec inside an OGG (*.opus) container
Output
• A RAW or WAV audio file (8 or 16 bits)
The processed audio is to be listened to and
examined by an expert and is not to be used as an
input for other automatic processing.
Integration Possibilities and Licensing
Phonexia offers multiple integration and licensing possibilities, as well as custom development.
Interfaces
• REST API interface
• Command line interface
• Graphical user interface (GUI) for evaluation
Supported OS
• Windows 64 bit (x86_64)
• Linux 64 bit (x86_64)
Licensing options
• USB dongle licensing key (offline license,
on-premise installation)
• HW profile licensing key (offline license,
on-premise installation)
• Licensing server (offline license, on-premise
installation, used for high availability)
• NET-based license (for demo purposes)
Recommended hardware
For production systems, a 64-bit server-class
processor with a large L3 cache (the larger, the
better) is recommended, e.g., Intel® Xeon® E5 or E7,
or Intel® Core™ i5 or i7. Phonexia technologies
also work in a virtualized environment.
A detailed consultation on hardware configuration
is provided upon a specific deployment request.
Customization
Phonexia provides research and development
services such as speech technology optimization
for target channels, development of new language
versions, etc. Phonexia also offers multiple
engines balancing speed and accuracy according
to the specific use case. Contact our team for
more details.
More information
If you would like to know more about
Phonexia technologies, please
contact us at [email protected]
Phonexia s.r.o.
+420 511 205 265
[email protected]
Chaloupkova 3002/1a, 612 00 Brno, Czech Republic, European Union
phonexia.com
PARTNER:
Pangea Communications
Anche Botha
Pangea Communications (Pty) [email protected]
+27 82 570 5862