Activity Recognition System for Baby Monitoring

BARREAU Pierrick
INSA Lyon, Telecommunications Dept. | DCU, MSc in Electronic Commerce

Tutor: Dr Cathal GURRIN, Practicum Coordinator, Dublin City University
31/08/2012

Final report for a final-year project at the INSA Lyon Telecommunications, Services & Usages department

SensAnalytics - Activity Recognition System for baby monitoring


This report explains the technological system deployed behind the SensAnalytics mobile application. In particular, it justifies the network architecture and machine learning algorithms used, and presents the results of my experiments. For more information about the SensAnalytics project, please consult its business plan.




Contents

1. Context of the study
   1.1 Practicum presentation
   1.2 SensAnalytics: Goals and motivations
   1.3 Market Research: Process and Findings
2. Product specifications
   2.1 Product definition
   2.2 Functional analysis
3. State of the art Overview
   3.1 Baby activity recognition: Characteristics and Challenges
   3.2 Current state-of-the-art
   3.3 Complete solution overview
4. Solution development & optimization
   4.1 Development environment
   4.2 The jAudio library
       4.2.1 Presentation and reliability
       4.2.2 How will we use it?
   4.3 Solution development
       4.3.1 Recording sound with Android
       4.3.2 Signal pre-processing system
       4.3.3 Feature extractors
       4.3.4 Matching function
       4.4.1 Recognition testing
       4.4.2 Performance testing
       4.4.3 Future improvements
5. Experience Feedback
Appendices
References


1. Context of the study

1.1 Practicum presentation

Currently pursuing a Master in Electronic Commerce at Dublin City University, I carried out an innovative company-creation project (called the Practicum) over the summer with a team of five people. To fulfil my duties as an INSA Lyon engineering student at the same time, I turned this exercise into a Research & Development project meeting the requirements of both programmes.

Before detailing the content of this R&D project any further, let us introduce the Practicum principles. Similar to the Innovation project conducted during the fourth year of the Telecommunications department's curriculum, the Practicum aims to assess our understanding of the subjects (both technical and business) taught during the year. The outcome is a start-up creation project containing insights into the business aspects (business model and processes, marketing) and a technical mock-up proving the viability of the concept supported by the team.

As part of an international team composed of three business- and two engineering-background students, we developed a start-up called SensAnalytics. I took the roles of quality manager, business analyst and developer, which gave me a complete overview of the R&D process and allowed me to carry out the technical implementation required for my engineering degree. Let us introduce the initial goals and motivations of my team.

1.2 SensAnalytics: Goals and motivations

In our everyday lives, millions of events take place around and inside our bodies. We as humans naturally capture and interpret some of this data, but most of it is lost or not understood. Our start-up SensAnalytics seeks to acquire these complex data and turn them into human-readable communication.

Wishing to establish our product in an original market not yet touched by recent high technologies, we aimed at the baby market. The most promising segment appeared to be baby care, because it represents parents' largest expense after food. From there, we chose to design a baby monitor, the most technology-related product in this segment. Our initial idea was to build a sustainable technological advantage through a heterogeneous Wireless Sensor Network (WSN) gathering complex biometric data in order to infer the child's status (health, sleep cycle, emotions, ...) using machine learning algorithms. However, our idea changed as we considered the market environment.

1.3 Market Research: Process and Findings

To best answer the needs of the parenting market it aims to serve, SensAnalytics' first task was to analyse the worldwide market in order to identify and gather all the positive business drivers that could help position its products.


Following the process outlined in Figure 1, the whole team began by defining the core elements of its offer. We summarized our project as an improvement over existing baby monitors that allows parents to access valuable analytics at any time and from any place. The core element of the value proposition is access to comprehensive data that helps parents in their daily decision-making and empowers the child to communicate with them. After determining the UK as our target market based on a set of indicators, we surveyed a pool of first-time parents to collect their needs. We then studied the marketplace, identified the strengths and weaknesses of competitors' offers, and designed an offer that best fills the gap between the two.

To put our decisions in context, let us review our key findings.

- 78% of respondents rank security guarantees (reliable communication, medical certifications, ...) as the most important characteristic of a baby monitor. Price is the second determinant factor in the purchase decision, with 82% ranking it above size and simplicity of use.

- 84% of our target market owns a smartphone, a higher penetration rate than the UK population as a whole (62%).

- 96% had already searched the Internet to check whether their child was developing normally compared with age norms. They also admit feeling stressed about their child's mental and physical development and frustrated by the poor results of a Google search.

- The baby monitor marketplace is crowded and involves big players such as Philips and Motorola. Establishing a hardware product is difficult, as the R&D process involves heavy costs and competitors have already rationalised their production-chain expenses.

- The baby monitor smartphone app market is competitive as well, but with low-quality products and no established big players.

These findings allowed us to analyse the market environment and define a product that best fills the gap between users' needs and current competitors' offers.

Figure 1: Market Study Process


2. Product specifications

2.1 Product definition

After analysing the results of the survey and the market analysis, we defined our final solution and opted for a three-step product specification integrating the success factors presented earlier. As illustrated below, these three steps take place over a three-year plan to continuously develop a sustainable business.

In the following technical study, we only consider the app-to-app baby monitor. It works by using two smartphones, one placed with the child as a monitor and the other with the parent as a receiver. The monitor detects auditory events, in particular crying and talking, and then notifies the receiver, which alerts parents about activity and allows them to listen to the monitor in real time. Unlike traditional baby-monitoring solutions, our product works over Wi-Fi and 3G, so parents can monitor their child from any location.

This first development step thus consists of offering a simple smartphone baby-monitor application with the most important features, ensuring a secure, simple and efficient service that is easily accessible and open to significant sophistication in future steps.

Features:

- Simple audio monitoring: the user can listen to their child at any time, anywhere.

- Two-way audio talk: the user can talk to their baby at any time, anywhere, via their smartphone, and the audio is played on the other smartphone's speakers.

- Alerts when the baby is awake and crying, with sensitivity control.

- Customizable events associated with actions: if the baby cries, the user can configure the application to automatically play a song or another audio track.

- Sleep cycle analysis via auto-generated tables.

Figure 2: Product specification plan


Having identified the core product we aim to bring to the market, we started the development of version 1.0 of the mobile application by conducting a functional analysis. This first step allowed us to write a complete specification with requirements and constraints, and to identify the different development tasks the R&D team would have to perform.

2.2 Functional analysis

External analysis

The functions resulting from the external functional analysis are listed in the tables below. To translate their importance and give concrete objectives to the development teams, each function is given a weight along with one or more indicators of success and a target range the product has to comply with at the end of the development.

N° | Service functions (FS) | Weight | Objective indicator(s) | Range

Related to baby security
S1 | Help configuring settings | 2 | Learning time | 1-5 min
S2 | Acquire baby data | 5 | Data loss | < 10%
S3 | Recognize baby activity | 4 | Positive recognition rate / False positive rate | > 70% / < 30%
S4 | Trigger actions accordingly | 3 | Nb possible actions / Nb possible events | 5 / 5

Related to baby evolution
E1 | Gather baby evolution information | 2 | Nb milestones info / Nb medical info / Nb monitored info | 10 / 5 / All possible
E2 | Compare with norms | 4 | Nb comparison indicators / Comparison time | All possible / < 10 s

N° | Constraint functions (FC) | Weight | Objective indicator(s) | Range

Constraints related to parents
1.1 | Intuitive interface | 4 | Learning time / Languages supported | 1-5 min / ENG, FR
1.2 | Interface accessible everywhere | 3 | Access supported | Wi-Fi, 3G, Internet
1.3 | Need reliability and security guarantees | 3 | Application downtime / Security guarantees | < 48 h/year / Interference resilience, battery monitoring
1.4 | Need quality certifications | 2 | ISO norms | ISO 9001

Constraints related to the baby
2.1 | Should be harmless to health | 5 | Intensive test phases / Smartphone size / Toxic products | > 3 / H > 6 cm, W > 6 cm / 0%
2.2 | Should not be reachable from the bed | 2 | Min acquisition range | > 50 cm
2.3 | Should be strong | 2 | Resilience to shocks | Yes
2.4 | Baby-side application should be silent | 4 | Block calls & messages / Ensure silent mode | Yes / 100% of use time

Constraints related to smartphones
3.1 | Should be hosted on smartphones with good battery life and computation power | 4 | Battery life / Processor power | > 10 h / > 1.5 GHz
3.2 | Should use little computation | 3 | CPU use rate | < 15%
3.3 | Should use little battery | 3 | Battery impact | < 10%
3.4 | Should always keep top priority | 2 | Priority downtime | < 5%

Constraints related to the server
4.1 | Should be hosted on a server with good response time and high availability | 4 | Server response time / Server downtime | < 1 s / < 48 h/year
4.2 | Should have enough space to store data | 3 | Database initial capacity / DB capacity growth | > 1 TB / > 5 TB/year

Constraints related to the environment
5.1 | Should adapt to background noises | 4 | SNR | > 5 dB
5.2 | Should be resilient to interference | 5 | Intersymbol interference | BER < 15%
5.3 | Should provide a good acquisition range | 3 | Max acquisition range | > 5 m

Constraints related to price
6.1 | Cheap app purchasing cost | 5 | Price | £0
6.2 | Should only use smartphones | 3 | Extra devices | 0

Legal constraints
7.1 | Medical data stored anonymously | 5 | Anonymous storage | Yes
7.2 | Sensitive data are secured | 5 | Development security guidelines / Communication protocol | Common Criteria level 2 / SSL-TLS
7.3 | Customers informed about stored data | 5 | Terms and conditions | Yes

Constraints related to the mobile network
8.1 | Should be always connected | 4 | Alerts to parents | Yes
8.2 | Should ensure alerts are always forwarded | 5 | Message loss | < 2%

The "Octopus" chart resulting from the analysis is presented in Appendix 1.

Internal analysis

Having identified the market requirements and constraints, we then conducted a FAST analysis (see Appendix 2), giving us the different development parts to consider. The diagram highlights the main function of the system, performed through six service functions. Each of these service functions involves technical functions internal to the system, corresponding to physical and hardware solutions. Considering the set of identified technical functions and selected solutions, the FAST diagram imposes several features for the implementation:

- An efficient network architecture,

- A reliable app-server communication over 3G network,

- An activity recognition system,

- An intuitive application design aimed at parents.

Partnering with another engineer, we split the workload into two parts. I chose to focus on the network architecture and on the implementation of an activity recognition system (ARS). In the following, I will outline the R&D process associated with the ARS solution.

To explain our solution progressively, we will briefly highlight the main aspects and challenges of the baby activity recognition field. With a clearer understanding of the domain, we then review the most useful sound features and the most common techniques for waking and cry detection. By assessing them against the product's functional requirements, we conclude with the solution that best fits our needs.

3. State of the art Overview

3.1 Baby activity recognition: Characteristics and Challenges

Like every human activity recognition field, baby activity recognition faces the same fundamental difficulty: translating the complex stimuli of the human body into computer-understandable data [1]. What our brain performs innately through cortex and synapse interactions demands considerable computational power from machines and careful study from scientists [2]. Moreover, in the same way that our understanding of someone's behaviour is not entirely reliable, computer algorithms can only be partially trusted when concluding on emotions or activity recognition. Two characteristics of the domain, inherent to human nature, thus appear:

- The complexity of the computation algorithms (regarding both implementation and execution).

- The uncertainty of results, which makes any solution only partially reliable.

The fact that we only use sound as an input for recognition is also determinant. The auditory environment surrounding the baby can at any moment be polluted by noises coming from different sources that we cannot identify upfront [3]. These noises can trick the algorithms by presenting the same characteristics as a baby cry or waking signal, thus altering the results (false positives). They can also overlap and distort an actual baby signal, causing the algorithm to miss the activity (false negatives) [4]. A major challenge for our solution is therefore to reliably differentiate baby signals from any background noise. A first processing step will thus isolate and amplify this signal to ensure that the recognition algorithms are always passed signals of sufficient quality.

Other difficulties are inherent to the evolution of the voice in the early stages of the child's life. All humans have different vocal attributes, but some frequencies are similar, so retrieval of activities based on sound can be done quite reliably. However, during the first 6 to 12 months of an infant's existence, the voice evolves towards its first stabilized form. This initial state greatly influences cry and voice signals and varies with ethnic origin [5-6], possible diseases [7-8], the prenatal conditions of birth (drug [9] or alcohol [10] consumption by the mother, pre-term/full-term birth [10-11], ...) and auditory capabilities [12]. This can be seen as a problem, as it creates a requirement for recognizing specific cases; but since the research field is currently well


documented on their sound characteristics, it can also be seen as a future opportunity, as the application could be turned into a disease detector [7-8]. However, even given these slight variations, the neonatal cry is a reasonably patterned vocal behaviour considered to have an innate biological function.

To sum up, our product needs a signal-segmentation pre-processing step and must take into account the sound features relevant to the major part of the baby population. Considering that activity recognition aims at triggering alerts and actions, false negatives (when a real cry signal is not detected) are far more critical than false positives (when the algorithm is tricked into recognizing a fake cry signal), because parents would rather be alerted more often than needed than miss an important moment. The solution will thus be selected according to three criteria:

- Its false negative rate.

- Its ability to recognize specific cases.

- Its computational power demand.

To conclude, even if this research field is still full of technical challenges yet to be addressed, some solutions are able to recognize a baby cry and waking status with a promising success rate. Let us review the current state of the art and compare the candidates against our product's requirements.

3.2 Current state-of-the-art

Automatic classification of infant activity is generally a pattern recognition problem comprising two main stages: signal processing and pattern classification [13]. For our product, however, a preliminary stage is added, consisting of detecting infant cries in audio recordings. Once cry samples are detected and extracted from the recordings, the signal processing and pattern recognition steps can be applied. The signal processing step aims at normalizing, cleaning and filtering the raw signal before using suitable feature extraction techniques to build a vector of relevant values. This vector then serves as input for the classification algorithms, which compare it against reference patterns to decide whether a given activity is recognized.

Each step has its own set of technical solutions, which can then be associated to form a complete baby activity recognition system. We will analyse the different techniques available at each stage and conclude with the most suitable combination for the product.

3.2.1 Signal Pre-processing

The pre-processing step consists of isolating the baby's sound signal by filtering and amplifying it. The challenge here is to design a digital filter that can process the sound in real time without excessive resources. We opt for a low-pass Finite Impulse Response (FIR) filter with its cutoff at the highest frequency of the baby cry spectrum. As the fundamental frequency and the formants range from 0.1 to 10 kHz, we opt for a FIR filter with a 10 kHz cutoff, an attenuation of 30 to 50 dB in the stop-band and a ripple of 3 dB in the pass-band. It filters out the high frequencies coming from mobile networks or home appliances surrounding the baby [4-5]. By computing a time-domain convolution, we end up with a filtered signal. Associated with a peak detector and an amplifier, the resulting sound is then altered to amplify only the frequencies coming from the infant.
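As an illustration, the filtering step above can be sketched in a few lines of Python. This is a minimal windowed-sinc FIR design followed by a time-domain convolution, not the product's actual Android implementation; the tap count and the Hamming window are our own illustrative choices.

```python
import math

def design_lowpass_fir(num_taps, cutoff_hz, fs_hz):
    """Windowed-sinc low-pass FIR design (Hamming window).

    A sketch of the kind of filter described above; the real product
    targets a 10 kHz cutoff with 30-50 dB stop-band attenuation, which
    would need the tap count tuned accordingly."""
    fc = cutoff_hz / fs_hz          # normalized cutoff (cycles/sample)
    m = num_taps - 1
    taps = []
    for n in range(num_taps):
        if n == m / 2:
            h = 2 * fc              # sinc limit at the centre tap
        else:
            x = n - m / 2
            h = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # Hamming window tames the side lobes of the truncated sinc
        h *= 0.54 - 0.46 * math.cos(2 * math.pi * n / m)
        taps.append(h)
    s = sum(taps)                   # normalize for unity DC gain
    return [t / s for t in taps]

def fir_filter(signal, taps):
    """Time-domain convolution, as in the pre-processing step."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * signal[n - k]
        out.append(acc)
    return out
```

With a 10 kHz cutoff at a 44.1 kHz sampling rate, a 1 kHz tone passes essentially unchanged while an 18 kHz tone is attenuated by tens of decibels.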


The audio also has to be sampled at an appropriate frequency in order to reduce computational complexity while keeping sufficient sound quality for the subsequent cry detection and feature extraction steps. An 8 kHz sampling frequency is generally used for infant speech analysis [14], but 20 kHz sampling with 16-bit quantization has also been used successfully by Robb et al. for determining the fundamental frequency and formants of baby cries [28]. Both will be tested during the implementation.

3.2.2 Features extraction

Once the signal has been cleaned, we can study the most important features for baby activity recognition and their extraction techniques. Most techniques found in the literature relate to cry detection. The waking process is well described in theory, but its auditory detection has not been addressed by researchers; we therefore suggest our own way to detect it at the end of this section.

3.2.2.1 STE and STZC approach

Because of the physical limitations of human beings, speech analysis systems have to consider short-duration speech segments. Indeed, speech over short time intervals can be considered stationary; overlapping these 10-30 ms segments by half is a common way to reduce the amount of computation needed to analyse the infant cry signal [15].

The combination of two mathematical tools may be used to detect cry events in a pre-processed audio recording: the Short-Time Energy (STE) and the Short-Time Zero Crossing (STZC) rate.

Short-Time Energy (STE)

Short-time energy (STE) is defined as the average of the squares of the sample values in a suitable window. It can be mathematically described as follows [15]:

E(n) = (1/N) Σ_m [x(m) w(n − m)]²

where x(m) is the speech signal and w(m) are the coefficients of a suitable window function of length N. As previously mentioned, short-time processing of speech should use segments between 10 and 30 ms in length. For signals sampled at 8 kHz, a window of 128 samples (a 16 ms segment) is suitable. STE estimation is useful as a speech detector because there is a noticeable difference in average energy between voiced and unvoiced speech, and between speech and silence [15]. This technique is usually paired with short-time zero crossing for a robust detection scheme.

Figure 3: Pre-processing system overview

Formula 1: STE formula


Short-Time Zero Crossing (STZC)

Short-time zero crossing (STZC) is defined as the rate at which the signal changes sign. STZC estimation is useful as a speech detector because there are noticeably fewer zero crossings in voiced speech than in unvoiced speech. It can be mathematically described as follows [15]:

Z(n) = (1/2N) Σ_m |sgn(x(m)) − sgn(x(m − 1))| w(n − m)

where sgn(x(m)) = 1 if x(m) ≥ 0, and −1 otherwise.

Figure 4 displays the results of short-time signal detection using both the STE and STZC tools. STZC delimits envelopes where the signal changes sign at a significant rate (identified as speech events), while STE detects significant normalized energy within these envelopes to conclude on infant cry events.

In order to consistently pick up desired cry events, a desired cry was defined as a voiced segment of sufficiently long duration and sufficiently noticeable STE. We can express this with two quantifiable threshold conditions that must both be met to constitute a desired cry:

(1) Normalized energy > 0.05: to eliminate non-voiced artefacts and cry precursors (breathing, whimpering).

(2) Signal envelope period > 0.1 seconds: to eliminate impulsive voiced artefacts such as coughing.

In Figure 4, each cry envelope is bounded by the STZC, and the voiced portion of each cry is bounded by where the STE meets the t = 0 axis. Figure 4(a) contains two false signals where the STZC suggests an infant vocalization has occurred; however, there is no significant STE to indicate the presence of a voiced infant cry until the third vocalization. Even though this third vocalization meets the normalized energy threshold of a voiced event, its duration does not meet the minimum time period: it was actually a cough.

Figure 4: Cry signal detection examples

Formula 2: STZC formula


The STZC in Figure 4(b) suggests that five vocalizations have occurred, four of which meet the criterion for a voiced cry. However, two of these voiced vocalizations are impulsive and of too short a duration, and are thus ruled out as cries by the envelope period threshold. The final vocalization lacks the energy to be analysed as a cry event.
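To make the scheme concrete, here is a small Python sketch of STE/STZC-based cry-event detection using the 128-sample window (16 ms at 8 kHz) and the two thresholds given above. The function names and the zero-crossing bound used to label a frame "voiced" are our own illustrative choices, not values from the report.

```python
import math

FS = 8000          # sampling rate used in the text for infant speech
WIN = 128          # 16 ms analysis window at 8 kHz

def short_time_energy(frame):
    """STE: average of the squared samples over the window."""
    return sum(x * x for x in frame) / len(frame)

def short_time_zero_crossings(frame):
    """STZC: rate at which the signal changes sign within the window."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / len(frame)

def detect_cry_events(signal, energy_thr=0.05, min_dur_s=0.1, zc_max=0.25):
    """Keep voiced segments that satisfy both thresholds from the text:
    normalized STE > 0.05 and an envelope longer than 0.1 s. The zc_max
    bound (voiced speech has few zero crossings) is illustrative.
    Returns (start_time, end_time) pairs in seconds."""
    hop = WIN // 2   # half-overlapping windows, as described above
    frames = [signal[i:i + WIN] for i in range(0, len(signal) - WIN, hop)]
    peak_e = max(short_time_energy(f) for f in frames) or 1.0
    events, start = [], None
    for i, f in enumerate(frames):
        voiced = (short_time_energy(f) / peak_e > energy_thr
                  and short_time_zero_crossings(f) < zc_max)
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            t0, t1 = start * hop / FS, i * hop / FS
            if t1 - t0 > min_dur_s:     # rule out impulsive artefacts
                events.append((t0, t1))
            start = None
    return events
```

On a synthetic signal, a 0.5 s low-frequency voiced burst is kept as a cry event, while a 30 ms cough-like impulse is rejected by the duration threshold, mirroring the Figure 4 discussion.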

3.2.2.2 Frequency domain approach

Another approach to cry detection is to study the frequency domain of the signal by extracting:

- The vocal fundamental frequency (F0), the lowest frequency of the voice waveform.

- The formant frequencies, which indicate the acoustic resonances of the human vocal tract. They are measured as amplitude peaks in the frequency spectrum of the sound (see Figure 5).

- The Mel-Frequency Cepstral Coefficients (MFCC), features that capture the spectral discriminants of each signal.

To study the spectral domain, a first step is to transform the signal representation from the time to the frequency domain using the Discrete Fourier Transform (DFT), which reveals the main frequencies of a signal. Our product requires a fast and computation-efficient algorithm to compute the DFT; after reviewing a benchmark of existing Fast Fourier Transform (FFT) algorithms [16], we chose the solution of Pei-Chen et al. [17], as it allows real-time FFT computation using few computational resources.

Once the frequency domain of the signal is determined, the fundamental frequency and the formants can be measured using a peak detector, i.e. a function that finds maxima in the value range. To increase the reliability of the detection, some techniques smooth the signal to make the real maxima stand out. The Smoothed Spectrum Method (SSM) seems the most promising, with an efficiency of 97.99% against 95.50% for a classical local-maximum detector and 96.86% for cepstrum analysis [23]. The idea is to use a weighted addition to smooth the spectrum and increase the detection reliability.

Figure 5: Signal frequency domain representation
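A toy Python sketch of this peak-picking idea: compute a magnitude spectrum (a naive DFT stands in for the real-time FFT of [17]), apply a simple weighted smoothing in the spirit of the SSM, and return the strongest peak as the F0 estimate. The three-point smoothing weights are an illustrative simplification, not the published SSM.

```python
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitude; a real implementation would use an FFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[m] * math.cos(2 * math.pi * k * m / n) for m in range(n))
        im = sum(frame[m] * math.sin(2 * math.pi * k * m / n) for m in range(n))
        mags.append(math.hypot(re, im))
    return mags

def smooth(mags, weights=(0.25, 0.5, 0.25)):
    """Weighted addition over neighbouring bins: a simplified version of
    the spectrum-smoothing idea behind the SSM peak detector."""
    half = len(weights) // 2
    out = []
    for k in range(len(mags)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = min(max(k + j - half, 0), len(mags) - 1)
            acc += w * mags[idx]
        out.append(acc)
    return out

def fundamental_frequency(frame, fs):
    """Estimate F0 as the strongest smoothed spectral peak."""
    mags = smooth(magnitude_spectrum(frame))
    k = max(range(1, len(mags)), key=lambda i: mags[i])  # skip the DC bin
    return k * fs / len(frame)
```

For a pure tone falling exactly on a DFT bin (e.g. 437.5 Hz in a 512-sample frame at 8 kHz), the estimator returns the tone's frequency; formants could be found the same way by collecting several local maxima instead of the single global one.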


To determine the MFCCs, we follow the process proposed by Vempada et al. [18]:

- Divide the cry signal into a sequence of frames, with a frame size of 20 ms and a shift of 10 ms.

- Apply a Hamming window to each frame.

- Compute the magnitude spectrum of each windowed frame by applying the DFT.

- Compute the Mel spectrum by passing the DFT output through a Mel filter bank.

- Apply the DCT to the log Mel-frequency coefficients to derive the desired MFCCs.

The computation of these coefficients is CPU-intensive and is only supported in real time on large, optimized infrastructures. Yet it offers interesting further development: new initiatives to improve the algorithm are under way, and MFCCs allow distinguishing the cry cause among three main types (hunger, pain, wet diaper) with good reliability [18].
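The five steps above can be sketched end to end for a single frame. This pure-Python version (naive DFT, hand-built Mel filter bank, direct DCT-II) is only meant to make the pipeline concrete; parameter values such as 26 filters and 13 coefficients are common defaults, not figures from [18].

```python
import math

def hz_to_mel(f): return 2595.0 * math.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming window + naive DFT power spectrum (an FFT in practice)."""
    n = len(frame)
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
           for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(win[m] * math.cos(2 * math.pi * k * m / n) for m in range(n))
        im = sum(win[m] * math.sin(2 * math.pi * k * m / n) for m in range(n))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters evenly spaced on the mel scale."""
    top = hz_to_mel(fs / 2)
    edges = [mel_to_hz(i * top / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(round(e * n_fft / fs)) for e in edges]
    bank = []
    for f in range(1, n_filters + 1):
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(bins[f - 1], bins[f]):        # rising slope
            if bins[f] > bins[f - 1]:
                filt[k] = (k - bins[f - 1]) / (bins[f] - bins[f - 1])
        for k in range(bins[f], bins[f + 1]):        # falling slope
            if bins[f + 1] > bins[f]:
                filt[k] = (bins[f + 1] - k) / (bins[f + 1] - bins[f])
        bank.append(filt)
    return bank

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """MFCCs for one 20 ms frame, following the steps listed above:
    window -> power spectrum -> mel filter bank -> log -> DCT."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), fs)
    mel_e = [max(sum(w * s for w, s in zip(filt, spec)), 1e-10)
             for filt in bank]
    log_e = [math.log(e) for e in mel_e]
    # DCT-II of the log mel energies yields the cepstral coefficients
    return [sum(log_e[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                for j in range(n_filters)) for i in range(n_coeffs)]
```

Feeding a 160-sample frame (20 ms at 8 kHz) yields the 13-coefficient feature vector that the classification stage would consume.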

3.2.2.3 Rhythmic organisation of the sound

A final approach to cry detection is to consider it as a dynamic signal. The rhythmic organisation analysis of the sound looks at the durations of the infant's noise bursts and pauses. By monitoring the magnitude spectrum of the infant's expiratory sounds over time, an algorithm proposed by Sandford Zeskind & al. [19] tries to find temporal feature correlations among different individuals. However, even though this solution can run in real time without requiring powerful hardware, recent investigations have shown that rhythmic organisation is not yet a reliable indicator for cry detection.

3.2.2.4 Waking detection system

In the literature, the detection of infant waking is mainly addressed by recognizing cries. However, we believe that parents can find value in knowing when their child is awake, not only when they cry, but also to feed or change them. Current research attempts focus mainly on sleep-stage recognition using complex biometric sensors such as the electroencephalogram (EEG), accelerometers or Galvanic Skin Response (GSR) [20-21], but no dedicated auditory study of the temporal waking process of an infant can be found.

According to Karraker & al. [22], the waking process has some detectable auditory events such as

giggles, sheets movements, or shocks. These are sudden noises, thus sudden changes in the signal

spectrum. This gave us the idea to monitor the signal spectrum changes over time. When sudden

peaks appear in several previously determined frequency ranges (e.g. voice spectrum) at repeated

instants over time, then conclusive evidence of infant waking can be inferred. To support that idea, one approach would be to compute the power spectral density (PSD) of the signal for every sample and to keep track of the past PSD values. If a sudden change appears at a specified frequency, a variable is incremented. If, after a number of samples, no other change is detected, the variable is reset to zero. Otherwise, if the variable exceeds a threshold value, the waking activity is recognized.
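A minimal sketch of this counter logic, with hypothetical class and parameter names (the actual thresholds are, as noted below, to be determined empirically):

```java
public class WakeDetector {
    private final double changeThreshold; // PSD jump treated as a "sudden change"
    private final int resetAfter;         // samples without change before resetting
    private final int wakeCount;          // changes needed to declare waking
    private int changes = 0;
    private int quiet = 0;
    private double lastPsd = 0.0;

    WakeDetector(double changeThreshold, int resetAfter, int wakeCount) {
        this.changeThreshold = changeThreshold;
        this.resetAfter = resetAfter;
        this.wakeCount = wakeCount;
    }

    // Feed the PSD value observed in the monitored frequency band for one sample;
    // returns true once repeated sudden changes suggest the infant is waking.
    boolean onSample(double psd) {
        if (Math.abs(psd - lastPsd) > changeThreshold) {
            changes++;
            quiet = 0;
        } else if (++quiet >= resetAfter) {
            changes = 0; // no activity for a while: discard the evidence so far
        }
        lastPsd = psd;
        return changes >= wakeCount;
    }

    public static void main(String[] args) {
        WakeDetector d = new WakeDetector(1.0, 5, 3);
        boolean woke = false;
        for (double v : new double[]{0, 5, 0, 5}) woke = d.onSample(v);
        System.out.println(woke); // prints true: three sudden changes in a row
    }
}
```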

The frequencies and variables involved in this solution will be defined during test sessions with babies, as it is a rather empirical system. The assumptions surrounding this idea will also be further tested with different baby noises and environments before adding it to the customer-facing application.


3.2.2.5 Sound features extractors’ comparison

As we previously said, any feature extractor can be coupled with any classification algorithm in order

to form a complete activity recognition system. Therefore, before examining the pattern detection techniques, we decide below which features to retain for the final solution, considering our product requirements: reliability rate, computational cost, adaptability, and evolution potential towards new features.

With a false positive/negative rate of 75.6/86.5% (measured on a real database) [15], the STE and STZC approach seems promising, but it needs to be complemented by another solution in order to improve its reliability. It can run with few resources if optimized, and it is adaptable and extensible for further speech-related functionalities: it could, for example, detect when the baby pronounces their first intelligible words [14].

The frequency domain approach is a very interesting solution. Detecting the fundamental frequency

can be done at a reliable rate (97.9%) using the Smoothed Spectrum Method (SSM) and helps detect cries with 99% success (when associated with a neural network and evaluated on the Baby Chillanto database) [24]. It does not require extensive computational resources; its only drawback is that it can only be used for very specific detection tasks.

As for the formants, their determination is similar to that of the fundamental frequency. Their computation can be done in real time but may impact the smartphone's performance; extensive testing will be done on this part after implementation. Since the formants are a good indicator of the human vocal tract, they could constitute the basis for further development of functionalities related to emotions and speech. Moreover, by monitoring the signal's frequency spectrum, changes can be detected, so the waking recognition system could subsequently be implemented and tested.

The MFCCs are too CPU-intensive to be kept for implementation [24-26]. With the rise of smartphone CPU power in the upcoming years, real-time computation could be imagined, but it is currently unfeasible. However, the MFCCs would be really interesting for classifying cry causes and emotions.

Finally, the rhythmic organisation is also not chosen, because of its low reliability rate (30-40%). If further investigations are made in that area and new reliable temporal indicators are found, this solution could become interesting, as it does not require much computational power [19]. It would also bring more context to the cry signal and, with the study of expiratory bursts, open new perspectives for detecting safety risks and diseases.

Considering these points, we choose to implement the extraction of the STE/STZC, the fundamental frequency and the formants as the features used by the child-status detection algorithms. The comparison of the techniques is summarized in the chart below.

Technique | Reliability rate | Computation cost | Adaptability | Evolution potential

STE/STZC | + | + | + | +

Frequency F0 | ++ | ++ | - | -

Formants | ++ | + | + | +

Mel frequency | ++ | -- | ++ | ++

Rhythm | -- | ++ | ++ | ++


3.2.3 Pattern recognition algorithms

Once the features have been extracted and form a vector of values, there are two main approaches to recognizing a pattern from this data:

- A static matching function, which compares the values against known and identified norms, giving a matching score between the signal and an ideal activity-related signal. If the score passes a decision threshold, the activity is recognized.

- Machine learning algorithms, which, rather than processing the data explicitly, act as a black box that learns its own classification and regression model from previous outputs and concludes directly on a recognized activity given the vector's position in the data space.

Let us further detail and compare them.

3.2.3.1 Matching functions

The design of a matching function is empirical and involves three decisions that can severely impact its performance. Firstly, different functions can be employed. The simplest one, and the best adapted to our case, is the weighted differential addition (see Formula 3). Given the set of features we have previously determined (normalised energy (STE), signal envelope period (STZC), fundamental frequency (F0) and formants (F1 – Fx)), the function is the weighted sum of the differences between the feature values of a given signal and those of an ideal activity signal. If the result of that function is lower than a threshold, the activity is recognized.

w = ∑_{i=1..n} w_i · |v_signal,i − v_norm,i|

Once the function has been defined, the feature weights and the threshold values should be determined. The weights can be attributed considering the importance of each feature for the activity recognition, its reliability (increased if reliable, lowered otherwise), but also the usual gap range between the signal and the norm, in order to reduce the unwanted impact of a non-determinant feature difference. The threshold value is defined through testing and experimentation in order to lower the false positive and negative rates.
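A minimal Java sketch of the weighted differential addition, assuming absolute differences so that deviations in either direction count against the match (the feature ordering and the example values are illustrative):

```java
public class MatchingFunction {
    private final double[] weights;   // one weight per feature (STE, STZC, F0, ...)
    private final double[] norm;      // feature values of the ideal activity signal
    private final double threshold;   // decision threshold

    MatchingFunction(double[] weights, double[] norm, double threshold) {
        this.weights = weights;
        this.norm = norm;
        this.threshold = threshold;
    }

    // Weighted sum of absolute differences between observed and ideal features.
    double score(double[] features) {
        double w = 0.0;
        for (int i = 0; i < weights.length; i++)
            w += weights[i] * Math.abs(features[i] - norm[i]);
        return w;
    }

    // The activity is recognized when the score falls below the threshold.
    boolean matches(double[] features) {
        return score(features) < threshold;
    }

    public static void main(String[] args) {
        MatchingFunction cry = new MatchingFunction(
                new double[]{1.0, 1.0}, new double[]{10.0, 20.0}, 5.0);
        System.out.println(cry.matches(new double[]{11.0, 21.0})); // prints true
    }
}
```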

Once the matching function has been designed, it can be deployed anywhere. Considering our small set of features, it uses little computational power. Its only drawback is that the weights and threshold values must be determined again every time a new feature is added. However, once the matching function class has been implemented, it can be reused for other functionalities without any further development.

3.2.3.2 Machine learning algorithms

Machine learning is the branch of artificial intelligence that studies and develops architectures and

algorithms to equip an agent (a machine which is usually a computer) with certain behaviour and an

ability to build internal models from empirical training data in order to solve a certain task [27].

Among these algorithms, we distinguish the Support Vector Machine (SVM) and the Neural Network (NN), which are often used for auditory event and activity classification.

where w is the output, n the number of indicators, w_i the feature weights, v_signal the feature value for the studied signal, and v_norm the feature value for the ideal signal.

Formula 3: Weighted differential addition formula


Figure 7: Neural network model

Support Vector Machine (SVM)

An SVM is a binary (two-class) classifier, i.e. it can be used to conclude whether a given activity is recognized or not. It is built around an internal model which separates the value space into two parts: the recognised pattern space and the rest. When the SVM receives a feature vector, it projects it into the value space and concludes on the recognition of the activity from the position of the resulting point relative to the separating model. This internal model is built (“learnt”) using training algorithms (see Figure 6).

These algorithms are based on training samples. At each iteration, the SVM is presented with a set of sample feature vectors and their associated activity (e.g. crying / not crying). By processing these examples, the SVM maps them into its internal value space and computes the separating model (e.g. segmenting the space between crying and not-crying activities). Once the SVM is trained and unlabelled feature vectors are given to it, it is able to recognize the pattern in a time- and computation-efficient manner.
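Once trained, the decision step of a linear SVM reduces to checking which side of the separating hyperplane a feature vector falls on; a minimal sketch, where the weights, bias and label convention are illustrative:

```java
public class LinearSvm {
    // Classify a feature vector by the sign of w·x + b, the hyperplane learnt
    // during training. +1 and -1 are arbitrary labels (e.g. crying / not crying).
    static int classify(double[] w, double b, double[] x) {
        double s = b;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {0.8, -0.4}; // learned weights (illustrative values)
        double b = -0.1;          // learned bias
        System.out.println(classify(w, b, new double[]{1.0, 0.5})); // prints 1
    }
}
```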

Neural Network (NN)

A neural network is a multi-class classifier. It is composed of an interconnected, multi-layered set of entities called “neurons”, where each neuron can be “activated”, outputting its “activity”, which is a level of confidence in the recognition of a pattern. Each neuron is connected to the neurons of the next layer by weighted links.

The whole concept relies on the “firing” function φ(). When the sum of all inputs multiplied by their associated weights exceeds a certain threshold, the neuron is activated and outputs a value y_j, as explained in Figure 7. The decision-making algorithm is thus the combination of multiple neurons’

x_j = ∑_{i=1..n} w_i · y_i,  y_j = φ(x_j)

(Inputs y_1 … y_n come from the previous layer, are weighted by w_1 … w_n and summed, and the neuron outputs y_j to the next layer.)

Figure 6: Support Vector Machine principles overview


decisions. Initially, scientists configure the network hierarchy and the “firing” rule of each neuron. The training algorithms then “teach” the network by changing the weights assigned to the links.

For each data sample, some neurons will “fire” (i.e. tell the next level that they recognize the pattern) and some will not. During training, data samples are labelled as belonging to one category or another. Their features are extracted and serve as inputs to the neural network. The objective of the training algorithms is then to minimize the quadratic error of the output by reducing the weights of the neurons that were wrong and increasing those of the others, depending on the level of confidence they output.
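A single neuron's forward step can be sketched as follows; the sigmoid is one common choice of firing function φ(), not necessarily the one used in the cited work:

```java
public class Neuron {
    // x_j = sum_i w_i * y_i ; y_j = phi(x_j), with a sigmoid "firing" function.
    static double fire(double[] weights, double[] inputs) {
        double x = 0.0;
        for (int i = 0; i < weights.length; i++) x += weights[i] * inputs[i];
        return 1.0 / (1.0 + Math.exp(-x)); // confidence in the recognized pattern
    }

    public static void main(String[] args) {
        // With zero weights the weighted sum is 0 and the sigmoid outputs 0.5,
        // i.e. maximal uncertainty before any training has taken place.
        System.out.println(fire(new double[]{0.0, 0.0}, new double[]{1.0, 1.0}));
    }
}
```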

3.2.3.3 Pattern recognition algorithm comparison

As we have previously said, the matching function is an interesting solution because it does not

require much computation power and after a careful design and testing stage can achieve pattern

recognition with good false positive/negative rates (around 90% [24]). Moreover, once the Java class has been implemented, it is easily reusable for other functionalities. Its only drawback is that the design stage (determination of the feature weights and threshold values) must be performed again for each new functionality, which requires expensive experimentation and testing.

On the other hand, the machine learning algorithms feature better recognition rates (from 95 up to

99% [24]). Their main advantage is that once the neural network (NN) has been designed and once

the SVM or the NN training algorithms and procedures have been defined, the deployment of new

functionalities over these solutions only requires computation power and time. No extensive

development is needed. The only drawback is that every time a new feature is considered, the form of the input feature vector changes, and the algorithm has to be retrained from scratch to rebuild its internal model.

Moreover, the complexity of the training algorithms (e.g. for feed-forward networks) leads to strong requirements on the computational power of the infrastructure that supports the operation [27]. However, once they have been trained, these algorithms are able to recognize complex patterns quickly and efficiently and can be deployed on a smartphone platform. It is thus possible to run the training algorithms on an on-demand cloud computing platform, covering this need while avoiding huge infrastructure costs at the foundation of our start-up. The deployment and integration of the pre-trained recognition algorithms would then be performed directly within the mobile application.

Nevertheless, ethical issues are also raised by the training phase. To reach good enough performance, the algorithms need to be trained with samples recorded in real-life conditions. But as stated by Robb & al. [28], the techniques employed for eliciting cry vocalizations and their subsequent use for research purposes can be subject to ethical questioning. Moreover, this could create a communication and marketing problem with parents if the technical principles at the root of our technology were publicly denounced. However, auditory databases of baby cries have been gathered by scientists and could be used. Thus, these solutions raise extra requirements for transparency and a careful definition of the techniques employed for the sample collection, along with a dedicated risk strategy.

To conclude, considering that our current product only has to recognize 3 activities and would mainly perform cry recognition, we choose SVMs over NNs because they need less design effort and computation power for training. If the importance of the app's activity recognition functionality increases in the future, we might consider migrating to a neural network solution. The chart below


summarizes the assessment of the different solutions. The matching function solution is chosen over the SVM for a first attempt because it is simpler to implement and performs most of the job. It would enable a faster release of the first version of the mobile app and does not raise the extra risk of communication problems related to ethical issues.

Figure 8: Activity recognition system overview

Techniques | Matching function | Support Vector Machine | Neural Network

Reliability | + | ++ | ++

Computation power | ++ | + | -

Evolution potential | - | + | +

Ethical issues | ++ | -- | --

Development simplicity | ++ | - | --

3.3 Complete solution overview

When aggregating all these design choices into one solution, we end up with the following Activity

recognition system architecture detailed in Figure 8 below.


Figure 9: Development Environment overview

4. Solution development & optimization

The most difficult part in designing an activity recognition system is the choice of its components (the features and their extractors, the pattern recognition technique), a choice we previously justified. Once the components are set, developing the application itself first involves configuring the development environment, which we review below, and then implementing the solution. In our case, the implementation was eased by the high-level functions provided by the Android API. We therefore spend less time on the development part, as it carries fewer technical challenges and has less impact than the previous part.

4.1 Development environment

Before starting to develop any application, it is important to install and configure an appropriate and efficient development environment. Google provides (in addition to the operating system) a set of tools for application development projects.

The development environment we use is composed of several layers with specific roles:

- A Java runtime environment - JRE

- A Java development Kit - JDK

- An Android development Kit - SDK

- A development platform – Eclipse

- Modules and libraries related to the project

- An Android device

The architecture of the development environment is detailed in the following diagram.


Each of these components has specific roles and provides a set of services to the layer above. Ultimately,

we use the software Eclipse with add-ons for Android development and a library called jAudio to

perform the auditory features extraction (see § 5.2.2). We define the components in the chart below.

JRE

The JRE provides a Java Virtual Machine, which allows executing Java applications on a device. Most users already have a JRE installed on their computer, especially to browse the Internet and run specific Java applications. However, a JRE does not allow creating Java applications.

JDK The JDK (Java Development Kit) includes development tools such as compilers, debuggers and Java libraries to create Java applications. We can notice that a JDK often includes a JRE, so installing a JDK is sufficient to have a JRE.

SDK

The SDK is a development kit provided by Google that includes a set of tools for Android development projects. Especially, it includes APIs (a set of classes with available functions for developers), code examples, technical documentation, and an emulator. It is freely available on the Google’s website.

Eclipse

Eclipse is a multi-language software development environment comprising an integrated development environment (IDE) and an extensible plug-in system. It is written mostly in Java. It can be used to develop applications in Java and, by means of various plug-ins, in other languages.

ADT Google provides a module compatible with Eclipse to assist Android application development projects.

Additional Java Libraries

It is possible to import additional Java libraries to the project to take advantage of existing Java classes and functionalities. For example, in our project we have imported the JAudio library to perform audio treatment.

Test Devices

It is possible to test an application either on the emulator provided by the Android SDK or directly on an Android smartphone. It is necessary to configure an emulator before being able to use it; this especially means specifying the screen type, the size of the SD card, etc.

4.2 The jAudio library

4.2.1 Presentation and reliability

JAudio is a new framework for feature extraction designed to eliminate the duplication of effort in

calculating features from an audio signal. This system meets the needs of audio processing

researchers by providing a library of analysis algorithms that are suitable for a wide array of sound

analysis tasks. It provides an easy-to-use GUI that makes the process of selecting desired features straightforward, as well as a command-line interface to manipulate its services via scripting.

Here is the common process of using jAudio. The system takes a sequence of audio files as input. In

the GUI, users select the features that they wish to have extracted—letting jAudio take care of all

dependency problems—and either execute directly from the GUI or save the settings for batch

processing. The output is either an ACE XML file or an ARFF file depending on the user’s preference.

In order to address issues related to audio feature extraction, jAudio was designed against clear technical specifications, and several design decisions were taken. Many of these design decisions match our needs for the implementation of the cry detection system presented above:


Java based

JAudio is implemented in Java in order to capitalize on Java’s cross-platform portability and design

advantages. A custom low-level audio layer was implemented in order to supplement Java’s limited

core audio support and allow those writing jAudio features to deal directly with arrays of sample

values rather than needing to concern themselves directly with low-level issues such as buffering and

format conversions. By importing the jAudio library into our project development environment, it is possible to directly use the implemented jAudio classes and feature extraction methods. This gives us a homogeneous Java-based code base shared between the Android application development and the back-end audio treatment implementation.

XML & ARFF output

JAudio supports multiple output formats, including both the native XML format and the ARFF format.

Both of them provide structured data easily extractable and usable as input for matching functions.

Handling dependencies

In order to reduce the complexity of calculations, it is often advantageous to reuse the results of an

earlier calculation in other modules. JAudio provides a simple way for a feature class to declare which

features it requires in order to be calculated. An example is the magnitude spectrum of a signal. It is

used by a number of features, but only needs to be calculated once. Just before execution begins,

jAudio reorders the execution of feature calculations such that every feature’s calculation is executed

only after all of its dependencies have been executed. Furthermore, unlike any other system, the

user need not know the dependencies of the features selected. Any feature selected for output that

has dependencies will automatically and silently calculate dependent features as needed without

replication. This is especially interesting in terms of calculation speed and power consumption reduction.
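The dependency handling described above can be illustrated with a generic depth-first ordering. This is a sketch of the concept only, not jAudio's actual implementation, and the feature names below are made up:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FeatureScheduler {
    // Order the requested features so that every dependency is computed first,
    // and each feature is computed only once.
    static List<String> order(Map<String, List<String>> deps, List<String> requested) {
        List<String> sorted = new ArrayList<>();
        Set<String> done = new HashSet<>();
        for (String f : requested) visit(f, deps, done, sorted);
        return sorted;
    }

    private static void visit(String f, Map<String, List<String>> deps,
                              Set<String> done, List<String> sorted) {
        if (done.contains(f)) return;          // already scheduled: no replication
        done.add(f);
        for (String d : deps.getOrDefault(f, List.of())) visit(d, deps, done, sorted);
        sorted.add(f);                         // all dependencies come before f
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
                "F0", List.of("MagnitudeSpectrum"),
                "Formants", List.of("MagnitudeSpectrum"),
                "MagnitudeSpectrum", List.of("FFT"));
        // The user asks for F0 and the formants; the FFT and the spectrum are
        // pulled in automatically and scheduled a single time.
        System.out.println(order(deps, List.of("F0", "Formants")));
        // prints [FFT, MagnitudeSpectrum, F0, Formants]
    }
}
```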

Extensibility

Effort was taken to make it as easy as possible to add new features and associated documentation to

the system. An abstract class is provided that includes all the features needed to implement a

feature. Moreover, meta-features are templates that can be applied against any feature to create

new features. Examples of meta-features include Derivative, Mean, and Standard Deviation. Each of

these meta-features may be automatically applied to all features without the user needing to

explicitly create these derivative features. This allows us to compute exactly the features we previously selected.

4.2.2 How will we use it?

As previously stated, jAudio will allow us to define our own feature extractors using its low-level audio characteristics extraction library. This will allow us to implement the STE and STZC computation components. There are already built-in functions to extract the fundamental frequency and the formants. However, we will need to run tests on a benchmark of smartphones to verify that the jAudio solutions do not use too much computational power and memory.

Once the feature extractors are developed, the Eclipse IDE allows us to link all the libraries transparently. Thanks to the ADT plugin, direct interaction between the Android SDK and our software development platform is possible, and once the jAudio library has been added to the Java build path, code combining the Android, jAudio and standard Java libraries compiles successfully.


4.3 Solution development

The principal component of an Android application is the Activity. It represents a single, focused thing that the user can do, and it is the entry point of the SDK. From that central point, one can invoke any object necessary for the application. In order to start the development with a good overview of which objects needed to be implemented, we first drew a UML class diagram. It separates the concerns between 4 main components:

- The activity recognition system, and how to record sound using the Android SDK.

- The pre-processing system, and how to filter the sound to improve its quality.

- The feature extractors, and how to use jAudio to quickly craft our own extractors.

- The matching functions, and how to adapt the weights and threshold to make them more reliable.

4.3.1 Recording sound with Android

The AudioRecord object is provided by Android to directly pull sound from any audio source for the

smartphone. We configure it to take the microphone as an input (MediaRecorder.AudioSource.MIC).

As we previously said we will try two different sampling frequencies. When using the Eclipse

emulator (AVD), the sample frequency is set to 8 KHz as it cannot support more. When deployed on a

real-world smartphone, it is set to 20 KHz. As for the audio encoding, we choose to quantize on 16-

bit (using AudioFormat.ENCODING_PCM_16BIT) as it proves to have good enough results for Robb &

al. [28]. Finally, we set the channel configuration to CHANNEL_IN_MONO to effectively pull voice sound from the microphone. Thus we end up with:

audioRecord = new AudioRecord(MediaRecorder.AudioSource.MIC, 8000,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT, bufferSize);

This creates a stream to which we can then apply the noise suppression (via the NoiseSuppressor object), the echo cancelation (via the EchoCanceler object) and the signal normalization (via the AutomaticGainControl object) pre-processing algorithms. We can then store this stream as an array of shorts in a buffer, in order to forward it to the filtering section and then to the feature extractors. The full code used to record sound on Android is provided in Appendix 4.

Figure 10: UML class diagram

Figure 11: Audio recording source code


4.3.2 Signal pre-processing system

In order to further remove the noise coming from the external environment, we filter the signal using a digital FIR filter at 10 kHz, with an attenuation of -30 to -50 dB in the stop-band and a ripple of 3 dB in the pass-band (see §3.2.1). We generate it with Matlab's sptool functionality, configured to employ the Hanning window method. This produces a table of coefficients that we store in the Coefficients attribute of the Filter class. To filter the signal, we then just have to implement a convolution algorithm.
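The convolution step can be sketched as follows; the 3-tap moving-average coefficients below are a placeholder, not the real sptool-generated low-pass design:

```java
public class FirFilter {
    // Convolve the input signal with the FIR coefficients (direct form).
    static double[] filter(double[] coeffs, double[] signal) {
        double[] out = new double[signal.length];
        for (int n = 0; n < signal.length; n++) {
            double acc = 0.0;
            for (int k = 0; k < coeffs.length && k <= n; k++)
                acc += coeffs[k] * signal[n - k];
            out[n] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] coeffs = {1.0 / 3, 1.0 / 3, 1.0 / 3}; // placeholder coefficients
        double[] impulse = {1, 0, 0, 0, 0};
        // Filtering an impulse returns the coefficients (the impulse response).
        System.out.println(java.util.Arrays.toString(filter(coeffs, impulse)));
    }
}
```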

4.3.3 Feature extractors

Using the jAudio library, we are able to directly apply an FFT algorithm (using the FFT object) on the

buffer. Then, using the PeakFinder object, we can conclude on the fundamental frequency and the

signal formants. To implement the STE and STZC, we use the FeatureExtractor interface. It provides a

set of common methods that fits well with our project.
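Independently of the jAudio interface details, the two features themselves can be sketched as follows; the exact formulations (mean squared energy and sign-change count) are common definitions and may differ slightly from those in the cited work:

```java
public class ShortTimeFeatures {
    // Short-Time Energy of one frame: mean of the squared sample values.
    static double ste(short[] frame) {
        double e = 0.0;
        for (short s : frame) e += (double) s * s;
        return e / frame.length;
    }

    // Short-Time Zero-Crossing count: sign changes between consecutive samples.
    static int stzc(short[] frame) {
        int z = 0;
        for (int i = 1; i < frame.length; i++)
            if ((frame[i - 1] >= 0) != (frame[i] >= 0)) z++;
        return z;
    }

    public static void main(String[] args) {
        short[] frame = {100, -100, 100, -100}; // toy PCM frame
        System.out.println(ste(frame) + " " + stzc(frame)); // prints 10000.0 3
    }
}
```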

4.3.4 Matching function

The implementation of the matching function is straightforward. It is a class whose attributes hold the weights, threshold and ideal values for activity recognition. In order to be able to adjust the sensitivity of the app according to the baby's voice characteristics, we define getters and setters to update the weights and the threshold.

The pattern recognition is performed by the method computeFunction. It is a simple implementation

of the weighted differential addition formula presented in section 3.2.3.1. The source code can be

found in Appendix 5.

4.4 Solution testing and optimization

4.4.1 Recognition testing

In order to efficiently test our ARS, we need a database to benchmark the system against. We first thought of using the Baby Chillanto database [30] and asked the researchers involved at the Instituto Nacional de Astrofisica Optica y Electronica for access to it. However, as our requests went unanswered, we chose to build our own baby sounds database. The main drawback is that we are unable to provide standard figures to compare against other existing solutions.

To constitute this collection, we searched online (mainly on sound sharing platforms such as findsounds.com) and gathered 15 baby sound samples. We then played these sounds near the smartphone microphone in various environments and assessed our system. By repeatedly refining our weights and threshold, we finally reached a recognition rate of 40% (6 samples out of 15).

A second optimisation step was to quantify the effect of the audio effects added during the sound

acquisition on the recognition rate. By successively disabling these extra functionalities, we

discovered that the two most important pre-processing effects were the noise suppression and the

signal normalisation algorithms. Considering those results, we chose to disable the echo cancelation

algorithm, saving computation resources in the process.


The results of our experiments supporting these conclusions are summarized in the chart below:

With all effects | Without noise suppression | Without normalisation | Without echo canceler

40% (6/15) | 13% (2/15) | 13% (2/15) | 40% (6/15)

As we previously said in the functional analysis, our design goals are a recognition rate of at least

70% to have a reliable baby monitor. We are far from this requirement, mainly because the matching

function algorithm showed its limits. As future work, we plan to migrate towards an SVM solution (see §4.4.3).

4.4.2 Performance testing

To test the performance of our application we use two separate environments:

- The Android Virtual Devices (AVD), which emulate a smartphone on a computer. Directly integrated within Eclipse, they allow testing the application on several platforms without the need to buy them physically. We use that tool to test the application on the Samsung Galaxy Nexus and the Motorola MT870.

- 2 real-world smartphones (the HTC Sense and the HTC One).

The logs and resource consumption can be viewed directly in the Eclipse feedback console. This allowed us to observe the performance of our application when deployed on a broad range of smartphones.

As previously said in the functional analysis (see §2.2), the application requires a smartphone CPU clocked at 1.5 GHz or more, so we chose the smartphones available as AVDs according to that characteristic. Moreover, our design goal is to provide an application that uses at worst 15% of the CPU. The application's CPU use depends on the smartphone's performance and on the Android version deployed [31]. We therefore benchmarked different smartphones on different Android OS versions.

The performance test results are summarised in the table below. Each reported percentage is the average of the maximal CPU-use rates reported by Eclipse during a test session.

Smartphone      HTC Sense (real)   HTC One (real)   Samsung Galaxy Nexus (AVD)   Motorola MT870 (AVD)
Android 4.1     19.4%              /                20.5%                        /
Android 4.0     /                  /                21.2%                        /
Android 3.2     /                  20.2%            22.4%                        24.3%
Android 2.3.3   /                  /                26.7%                        27.6%

As we can see, the performance design goal is rarely reached. However, we examined some possible optimisations. By recording the sound asynchronously, we would save some resources, as processing is given priority while recording takes place when resources are free. To do so, we changed the implementation of SoundRecorder so that it extends the AsyncTask class. This change only requires implementing the doInBackground() method, which contains the code to


record sound. We also chose to disable the echo cancellation algorithm, because it does not strongly affect the recognition performance (see §4.4.1). With those changes, the performance results are as follows:

Smartphone      HTC Sense   HTC One   Samsung Galaxy Nexus   Motorola MT870
Android 4.1     13.8%       /         14.5%                  /
Android 4.0     /           /         16.4%                  /
Android 3.2     /           14.5%     17.3%                  19.3%
Android 2.3.3   /           /         19.8%                  20.3%

For the most recent smartphones and OS versions the design goal is met, but only narrowly. Further work needs to be conducted to improve the application's resource use. A possible future improvement would be to store and share the acquired audio signal in a dynamic buffer.
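The asynchronous recording idea discussed above can be sketched outside the Android framework. The following plain-Java analogue (class and method names are hypothetical) replaces AsyncTask with an ExecutorService, so the capture loop runs off the caller's thread; the microphone read is simulated by allocating a frame:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal analogue of the AsyncTask-based recorder: the capture loop runs on a
// background thread while the caller stays free to do other processing.
public class AsyncRecorderSketch {
    private final AtomicBoolean isRecording = new AtomicBoolean(false);

    // Stand-in for doInBackground(): counts captured frames instead of
    // reading from a real microphone.
    public int recordLoop(int maxChunks) {
        isRecording.set(true);
        int chunks = 0;
        while (isRecording.get() && chunks < maxChunks) {
            short[] frame = new short[160]; // one 20 ms frame at 8 kHz
            // on Android this would be audioRecord.read(frame, 0, frame.length)
            chunks++;
        }
        return chunks;
    }

    public void stopRecording() { isRecording.set(false); }

    public static void main(String[] args) throws Exception {
        AsyncRecorderSketch rec = new AsyncRecorderSketch();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Submit the recording loop asynchronously, like AsyncTask.execute()
        Future<Integer> result = pool.submit(() -> rec.recordLoop(50));
        System.out.println("chunks=" + result.get()); // prints chunks=50
        pool.shutdown();
    }
}
```

The Future lets the caller collect the result later, mirroring how the Android UI thread stays responsive while doInBackground() runs.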

4.4.3 Future improvements

As the matching function proved limited as a pattern-recognition technique, we plan to implement a Support Vector Machine (SVM) integrated into the Android application using the Native Development Kit (NDK). Indeed, the NDK allows programming in C/C++, a language better suited than Java to implementing this type of solution. The ultimate goal is to implement some advanced training algorithms and reach a 60-70% recognition rate.
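As a purely illustrative sketch of this direction (the real implementation would live in C/C++ via the NDK), a linear SVM classifies a feature vector by the sign of a weighted sum. The weights and bias below are placeholders, not trained values:

```java
// Sketch of a linear SVM decision function: sign(w·x + b).
// The weight vector and bias are illustrative placeholders; in practice they
// would come from training (e.g. SMO) on labelled cry features.
public class LinearSvmSketch {
    private final double[] w;   // one weight per extracted feature
    private final double b;     // bias term

    public LinearSvmSketch(double[] w, double b) {
        this.w = w;
        this.b = b;
    }

    // Returns +1 for "cry detected", -1 otherwise.
    public int classify(double[] features) {
        double score = b;
        for (int i = 0; i < w.length; i++) {
            score += w[i] * features[i];
        }
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        LinearSvmSketch svm = new LinearSvmSketch(new double[]{0.5, -0.25}, -1.0);
        System.out.println(svm.classify(new double[]{4.0, 2.0})); // 0.5*4 - 0.25*2 - 1 = 0.5 -> prints 1
        System.out.println(svm.classify(new double[]{1.0, 2.0})); // 0.5 - 0.5 - 1 = -1  -> prints -1
    }
}
```

Training (finding w and b from labelled examples) is the hard part that would justify the NDK port.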

Moreover, we plan to use memory shared among threads to store the audio buffer containing the sound signal pulled from the microphone. This would let the recorder store sound continuously and asynchronously while the activity-recognition object consumes that data to conclude on a recognised baby state. The goal would then be a CPU use rate lower than 10%.
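The shared-buffer design described above is essentially a producer-consumer pattern. A minimal plain-Java sketch (class and method names are hypothetical), using a bounded blocking queue in place of the Android audio buffer:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The recorder thread pushes audio chunks into a bounded queue while the
// recognition thread consumes them, so recording never waits on recognition.
public class SharedAudioBufferSketch {
    // Bounded queue: blocks the producer when full, the consumer when empty.
    private final BlockingQueue<short[]> buffer = new ArrayBlockingQueue<>(16);

    public void produce(short[] chunk) throws InterruptedException {
        buffer.put(chunk);      // recorder side
    }

    public short[] consume() throws InterruptedException {
        return buffer.take();   // recognition side
    }

    public static void main(String[] args) throws Exception {
        SharedAudioBufferSketch shared = new SharedAudioBufferSketch();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    shared.produce(new short[]{(short) i});
                }
            } catch (InterruptedException ignored) { }
        });
        producer.start();
        int total = 0;
        for (int i = 0; i < 5; i++) {
            total += shared.consume()[0];   // 0 + 1 + 2 + 3 + 4
        }
        producer.join();
        System.out.println("sum=" + total); // prints sum=10
    }
}
```

The bounded capacity also gives natural back-pressure: if recognition falls behind, recording pauses instead of exhausting memory.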


5. Experience Feedback

The practicum aims at bringing a focused, well-adapted product to market. The technical solution was therefore constantly reshaped as the project went on, to fit the market needs and the evolutions of the business model. This challenging experience of continuously refining the network architecture and software to answer evolving user issues, while keeping the whole product coherent, was my first real experience of an R&D process.

Moreover, I had the chance to collaborate with individuals from different backgrounds (business, management, and marketing) and from different countries (France, Ireland, and Spain). This combination of competences, work methodologies and cultures within one team constituted an interesting insight into a real-world international start-up environment. In addition to the technical knowledge I developed throughout the solution's implementation, I also had the opportunity to help the business team define our key value proposition and business model, and to identify potential future prospects. This complete overview of a project's development, from both a business and a technical perspective, added an entrepreneurial competence to my resume.

Furthermore, with users' growing desire to capture their daily activity and the maturity of wireless body sensor networks, the importance of pattern-recognition systems will grow in the upcoming years. Having a strong interest in these technologies, and more particularly in machine-learning algorithms, I found this technological study a good fit with my professional career expectations.

Finally, collaborating with the CLARITY research center1 gave me a first experience of research activity. Indeed, as our project was supervised by Cathal Gurrin and Alan Smeaton, two senior members of that center, the state-of-the-art review and the solution definition were performed in a laboratory context. This master's thesis project thus gave me the opportunity to immerse myself in a highly technological start-up working with several important stakeholders of the field.

1 CLARITY Center for Sensor Web Technologies: http://www.clarity-centre.org/


Appendices

Appendix 1: The Octopus Chart

[Octopus diagram: the main function (FP), the constraint functions FC1-FC8 and the service functions FS-S1-FS-S4 and FS-E1-FS-E2 link the system to its surrounding elements: Parents, Baby, Smartphone, Server, Mobile Network, Legal environment, Physical environment and Cost.]


Appendix 2: FAST diagrams

Main Function (FP)

Service Functions (FS)

FS1: “Configure Settings”


FS2: “Acquire baby data”


FS3: “Recognize baby activity”

FS4: “Trigger actions”


FS5: “Gather evolution data”

FS6: “Compare with norms”


Appendix 3: Sound recording code

package com.example.sensanalytics;

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

import android.annotation.TargetApi;
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;
import android.media.audiofx.AcousticEchoCanceler;
import android.media.audiofx.AutomaticGainControl;
import android.media.audiofx.NoiseSuppressor;
import android.os.AsyncTask;
import android.util.Log;

public class SoundRecorder extends AsyncTask<Void, Integer, Void> {

    private File file;
    private Boolean isRecording;
    private int frequency = 8000;
    private int channelConfiguration = AudioFormat.CHANNEL_IN_MONO;
    private int audioEncoding = AudioFormat.ENCODING_PCM_16BIT;
    private AudioRecord audioRecord;

    public File getFile() { return file; }
    public void setFile(File file) { this.file = file; }
    public Boolean getIsRecording() { return isRecording; }
    public void setIsRecording(Boolean isRecording) { this.isRecording = isRecording; }

    @Override
    @TargetApi(16)
    protected Void doInBackground(Void... arg0) {
        setIsRecording(true);
        try {
            DataOutputStream dos = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(getFile())));
            int bufferSize = AudioRecord.getMinBufferSize(frequency, channelConfiguration, audioEncoding);
            audioRecord = new AudioRecord(MediaRecorder.AudioSource.MIC, frequency,
                    channelConfiguration, audioEncoding, bufferSize);

            // Apply the Acoustic Echo Canceler algorithm to the recorded sound
            if (AcousticEchoCanceler.isAvailable()) {
                AcousticEchoCanceler aec = AcousticEchoCanceler.create(audioRecord.getAudioSessionId());
                if (!aec.getEnabled()) aec.setEnabled(true);
            }

            // Apply the Noise Suppression algorithm to the recorded sound
            if (NoiseSuppressor.isAvailable()) {
                NoiseSuppressor ns = NoiseSuppressor.create(audioRecord.getAudioSessionId());
                if (!ns.getEnabled()) ns.setEnabled(true);
            }

            // Normalise the output signal
            if (AutomaticGainControl.isAvailable()) {
                AutomaticGainControl agc = AutomaticGainControl.create(audioRecord.getAudioSessionId());
                if (!agc.getEnabled()) agc.setEnabled(true);
            }

            // Start capturing from the microphone, then write raw samples to disk
            audioRecord.startRecording();
            int r = 0;
            short[] audioBuffer = new short[bufferSize];
            while (isRecording && r < 50) {
                int bufferReadResult = audioRecord.read(audioBuffer, 0, bufferSize);
                for (int i = 0; i < bufferReadResult; i++) {
                    dos.writeShort(audioBuffer[i]);
                    Log.e("info", "Wrote value: " + audioBuffer[i]);
                }
                r++;
            }
            audioRecord.stop();
            audioRecord.release();
            dos.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    public void stopRecording() {
        setIsRecording(false);
    }

    @Override
    protected void onProgressUpdate(Integer... values) {
        super.onProgressUpdate(values);
    }
}

Appendix 4: Matching function source code

package com.example.sensanalytics;

public class matchingFunction {

    private int[] weights = {1, 2, 2, 2, 3, 3};
    private double[] idealValues = {880, 1020, 3340, 4510, 0.05, 0.1};
    private double threshold = 0.8;

    // Getters and setters allowing the weights and threshold to be updated for sensitivity control
    public int[] getWeights() { return weights; }
    public void setWeights(int[] weights) { this.weights = weights; }
    public double getThreshold() { return threshold; }
    public void setThreshold(double threshold) { this.threshold = threshold; }

    // Compute the matching function over all six extracted features
    public double computeFunction(double[] extractedFeaturesValues) {
        double res = 0;
        for (int i = 0; i < weights.length; i++) {
            res += weights[i] * (extractedFeaturesValues[i] - idealValues[i])
                    / Math.max(extractedFeaturesValues[i], idealValues[i]);
        }
        return res;
    }
}
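For illustration, the matching score can be exercised with a self-contained rework of the function above. The sample feature vector is hypothetical, and the match-when-below-threshold comparison is an assumption (the report does not state the comparison direction):

```java
// Self-contained illustration of the matching score: weighted, normalised
// deviation of extracted features from the ideal cry values, compared to a
// threshold. Weights, ideal values and threshold follow the class above; the
// sample feature vector is hypothetical.
public class MatchingFunctionDemo {
    static final int[] WEIGHTS = {1, 2, 2, 2, 3, 3};
    static final double[] IDEAL = {880, 1020, 3340, 4510, 0.05, 0.1};
    static final double THRESHOLD = 0.8;

    // Weighted, normalised deviation of the extracted features from the ideal values
    static double score(double[] extracted) {
        double res = 0;
        for (int i = 0; i < WEIGHTS.length; i++) {
            res += WEIGHTS[i] * (extracted[i] - IDEAL[i]) / Math.max(extracted[i], IDEAL[i]);
        }
        return res;
    }

    public static void main(String[] args) {
        // Hypothetical features extracted from one audio frame
        double[] extracted = {900, 1000, 3300, 4600, 0.05, 0.1};
        double s = score(extracted);
        // Assumption: a frame matches the cry pattern when the score stays below the threshold
        System.out.println(s < THRESHOLD ? "cry pattern matched" : "no match");
    }
}
```

Note that features near their ideal values contribute almost nothing, while the per-term normalisation keeps frequency features (hundreds of Hz) and ratio features (around 0.1) on a comparable scale.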


References

[1] Bao L., Intille S. S. "Activity Recognition from User-Annotated Acceleration Data", In Proceedings

of the Second International Conference in Pervasive Computing (PERVASIVE '04). Vienna, Austria, pp.

1-17, 2004.

[2] Stikic M., Laerhoven K. V., Schiele B., "Exploring semi-supervised and active learning for activity

recognition", 12th IEEE International Symposium on Wearable Computers, 2008, pp. 81-88.

[3] Wasz-Hockert, O., Lind, J., Vuorenkoski, V., Partanen, T. and Valanne, E., “The infant cry: a

spectrographic and auditory analysis”, Clinics in Developmental Medicine No. 29, London: Spastics

International Publications, 1988.

[4] Clarkson B., “Extracting context from environmental audio”, Digest of Papers. Second

International Symposium on Wearable sensors 1998, 1998, pp. 154-155.

[5] Murry, T. “Acoustic and perceptual characteristics of infant cries”. In: Murry, T., Murry, J. (Eds.),

Infant Communication: Cry and Early Speech. TX: College Hill Press, 1980, pp. 251-271.

[6] Wasz-Hockert, O., Michelsson, K. and Lind, J. (1985) Twenty-five years of Scandinavian cry

research. In: Lester, B.M, Boukydis, C.F.Z. (Eds.) Infant Crying: Theoretical and Research Perspectives.

Plenum, New York, pp. 83-104.

[7] Michelsson, K., Sirvio, P., Koivisto, M., Sovijarvi, A. and Wasz-Hockert, O., “Spectrographic analysis

of pain cry in neonates with cleft palate”, Biol. Neonate 26, 1975, pp. 353-358.

[8] Michelsson, K., Sirvio, P. and Wasz-Hockert, O., “Sound spectrographic cry analysis of infants with

bacterial meningitis”, Devel. Med. Child Neurol. 19, 1977, pp. 309-315.

[9] Blinick, G., Travolga, W.N. and Antopol, W. “Variations in birth cries of new-born infants from

narcotic addicted and normal mothers”, Am. J. Obstet. Gynecol. 110, 1971, pp. 48-958.

[10] Cacace, A. T., Robb, M. P., Saxman, J. H., Risemberg, H., Koltai, P., "Acoustic features of normal-

hearing pre-term infant cry", International journal of pediatric otorhinolaryngology, Volume 33, Issue

3, 1995, pp. 213 – 224.

[11] Murray, A.D., Javel, E. and Watson, C.S., "Prognostic validity of auditory brainstem evoked

response screening in new-born infants", Am. J. Otolaryngol. 6, 1985, pp. 120-131.

[12] Oller, D.K., Eilers, R.E., Bull, D.H. and Carney, A.E., "Prespeech vocalizations of a deaf infant: a

comparison with normal metaphonalogical development", J. Speech Hear. Res. 28, 1985, pp. 47-63.

[13] Saraswathy, J., Hariharan, M., Yaacob, S., Khairunizam, W., " Automatic Classification of Infant

Cry: A Review", International Conference on Biomedical Engineering, 2012, pp. 534-549.

[14] Kevin Kuo, “Feature Extraction and Recognition of Infant Cries”, 2010 IEEE International

Conference on Electro/Information Technology (EIT), 2010, pp. 1-5.

[15] Kondoz, A. M., “Digital Speech”, John Wiley & Sons Ltd, West Sussex, England, 2004.


[16] Balducci, M., Ganapathiraju, A., Hamaker, J., Picone, J., "Benchmarking Of FFT Algorithms", IEEE

Proceedings on Southeastcon ‘97. 'Engineering New Century', 1997, pp. 328-330.

[17] Pei-Chen, L., Yun-Yun, L., "Real-Time FFT Algorithm Applied To On-Line Spectral Analysis", Circuit

System Signal Process, Vol. 8, No. 4, 1999, pp. 377-393.

[18] Vempada, R.R., Kumar, B.S.A., Rao, K.S., "Characterization of infant cries using spectral and

prosodic features", National Conference on Communications (NCC), 2012, pp. 1-5.

[19] Sanford Zeskind, P., Parker-Price, S., Barr R.G., "Rhythmic organization of the sound of infant

crying", Developmental Psychobiology Volume 26, Issue 6, 1993, pp. 321–333.

[20] Sadeh, A., Acebo, C., Seifer, R., Aytur, S., Carskadon, M.A., "Activity-Based Assessment of Sleep

Wake Patterns during the 1st Year of Life", Infant Behavior and Development Vol.18, 1995, pp. 329-

337.

[21] Heiss, J.E., Held, C.M., Estévez, P.A., Perez, C.A., Holzmann, C.A., Pérez, J.P., "Classification of

Sleep Stages in Infants: A Neuro Fuzzy Approach", IEEE Engineering in Medicine And Biology, 2003,

pp. 147-151.

[22] Karraker, K., "The Role of Intrinsic and Extrinsic Factors in Infant Night Waking", Journal of Early

& Intensive Behavior Intervention, Vol. 5 Issue 3, 2008, pp. 108-121.

[23] Várallyay Jr., G., Benyó, Z., Illényi, A., Farkas, Z., Kovács, L., "Acoustic analysis of the infant cry:

classical and new methods", Proceedings of the 26th Annual International Conference of the IEEE

EMBS, 2004, pp. 313-316.

[24] Saraswathy, J., Hariharan, M., Yaacob, S., Khairunizam, W., "Automatic Classification of Infant

Cry: A Review", International Conference on Biomedical Engineering (ICoBE), 2012, pp. 543-548.

[25] Garcia, J.O., Reyes García, C.A., "Mel-frequency cepstrum coefficients extraction from infant cry

for classification of normal and pathological cry with feed-forward neural networks”, INAOE, IEEE,

2003.

[26] Mansouri Jam, M., Sadjedi, H., "Identification of hearing disorder by multi-band entropy

cepstrum extraction from infant’s cry", IEEE, 2009.

[27] Martel J. Convolutional Neural Networks - A Short Introduction to Deep Learning. Not published

yet. 2012.

[28] Robb, M.P., Goberman, A.M., Cacace, A.T., "Methodological Issues in the Acoustic Analysis of

Infant Crying",

[29] Aggarwal, J.K., Ryoo, M.S., "Human Activity Analysis: a Review", ACM Computing Surveys (CSUR)

Surveys, Vol. 43, Issue 3, Article 16, 2011.

[30] O.F. Reyes-Galaviz, S. Cano-Ortiz and C. Reyes-García, “Evolutionary-neural system to classify

infant cry units for pathologies identification in recently born babies”, in 8th Mexican International

Conference on Artificial Intelligence, MICAI 2009, Guanajuato, Mexico, pp. 330-335, 2009.


[31] Huang, J., Xu, Q., Tiwana, B., Mao, Z.M., Zhang, M., Bahl, P., "Anatomizing application

performance differences on smartphones", Proceedings of the 8th international conference on

Mobile systems, applications, and services, 2010, pp. 165-178.