Conference of the International Federation of Classification Societies IFCS-2013 Program and Book of Abstracts July 14-17, 2013 Tilburg, the Netherlands

General information

Greetings!

Welcome to Tilburg and the 2013 conference of the International Federation of Classification Societies (IFCS), which is held at Tilburg University, The Netherlands, from July 14 to July 17, 2013. The conference theme is 'United through Ordination and Classification'.

On July 14, preconference workshops will be held. The conference itself will start on July 15 in the morning, and will close on July 17 with a full-day conference program and a conference dinner. The conference includes a president's invited session and a presidential address, plenary invited sessions, and concurrent invited and contributed sessions with oral paper presentations.

The opening session will feature the presentation of the Chikio Hayashi Awards to Anne-Laure Boulesteix and Paul McNicholas, who are the 2013 winners of this prize for young researchers with promising track records in the areas of classification and data analysis, awarded in support of their professional careers. The members of the 2013 Awards Committee are: Michel Wedel (Chair), Edwin Diday, Sylvia Frühwirth-Schnatter and James Ramsay.

IFCS Registration Desk

The IFCS Registration Desk is located in the hall in front of the University auditorium (aula), which can be found in the Cobbenhagen building of Tilburg University. Signs are provided to help you find the registration desk.

Preconference Workshops Registration Hours

Sunday, July 14, 9:30 – 18:30

IFCS Registration Hours

Monday, July 15, 7:30 – 17:00
Tuesday, July 16, 7:30 – 17:00
Wednesday, July 17, 7:30 – 10:30

Badges

Participation in the IFCS conference is limited to registered attendees. The official conference badge is required for admission to all sessions.

Lunches and Coffee Breaks

Free lunches and coffee or tea during coffee breaks are included in the registration fee. Note that the lunches and coffee or tea are free only at the designated conference area in front of the auditorium. Meals and drinks consumed elsewhere are at the attendees' own expense.

Social Events and Conference Dinner

Participation in one of the two IFCS social events is limited to attendees who have registered for either a visit to the National Park Loon and Drunen Dunes or to La Trappe: Beer Brewery at the Koningshoeven Monastery. Buses will be available in the Hogeschoollaan to bring participants to the social event and back to Tilburg University afterwards.

Participation in the conference dinner is limited to attendees who have registered for the conference dinner. The conference dinner will be held at restaurant De Harmonie, Stationsstraat 26, 5038 ED Tilburg, tel: +31(0)13-5425843. The restaurant can be easily reached from the Tilburg Central Railway Station. Walk about 75 metres into the street (Stationsstraat) directly opposite the front entrance of the station. The restaurant is on the left side of this street.

Messages

A message board will be maintained in the registration area during registration hours.

Local Organizing Committee and Scientific Program Committee

Members of the Local Organizing Committee:

Andries van der Ark (Chair), Tilburg University
John Gelissen, Tilburg University
Jeroen Vermunt, Tilburg University
Marieke Timmermans (secretary), Tilburg University
Tom Wilderjans, KU Leuven
Katrijn van Deun, KU Leuven

Members of the Scientific Program Committee:

Jeroen Vermunt (Chair), Tilburg University, The Netherlands
Carlos Cuevas-Covarrubias, Anahuac University, Mexico
Rozenn Dahyot, Trinity College Dublin, Ireland
Anuška Ferligoj, University of Ljubljana, Slovenia
Christian Hennig, UCL, London, UK
Krzysztof Jajuga, Wrocław University, Poland
Tae Rim Lee, Korea National Open University, Korea
Friedrich Leisch, University of Natural Resources and Life Sciences, Vienna, Austria
Niel le Roux, University of Stellenbosch, South Africa
Geoff McLachlan, University of Queensland, Australia
Fred R. McMorris, Chicago University, USA
Angelos Markos, Democritus University of Thrace, Greece
Boris Mirkin, Higher School of Economics, Moscow, Russian Federation
Mohamed Nadif, René Descartes University, Paris, France
Rebecca Nugent, Carnegie Mellon University, USA
Akinori Okada, Tama University, Japan
Fernanda Sousa, University of Porto, Portugal
Iven Van Mechelen (President IFCS), KU Leuven, Belgium
Maurizio Vichi, University of Rome, Italy
Claus Weihs, TU Dortmund, Germany
Junjie Wu, Beihang University, China
Patrick Groenen, Erasmus University, The Netherlands

IFCS Member Societies:

Associação Portuguesa de Classificação e Análise de Dados (CLAD)
British Classification Society (BCS)
The Classification Society (CS)
Gesellschaft für Klassifikation (GfKl)
Greek Society of Data Analysis (GSDA)
Irish Pattern Recognition and Classification Society (IPRCS)
Japanese Classification Society (JCS)
Korean Classification Society (KCS)
Sekcja Klasyfikacji i Analizy Danych PTS (SKAD)
Sociedad Centroamericana y del Caribe de Clasificación y Análisis de Datos (SoCCCAD)
Società Italiana di Statistica (SIS-CLADAG)
Société Francophone de Classification (SFC)
Statistično društvo Slovenije (SdS)
Vereniging voor Ordinatie en Classificatie (VOC)

The 2013 IFCS conference is scientifically sponsored by the International Statistical Institute (ISI) and supported by the IFCS, the VOC, Tilburg University, and the Department of Methodology and Statistics of Tilburg University.

Title | First name | Surname | Institution | City | Country | E-mail address
Dr | Hongshik | Ahn | SUNY Korea / SUNY Stony Brook | Incheon | KR | hahn@sunykorea.ac.kr
Dr | Casper | Albers | University of Groningen | Groningen | NL | c.j.albers@rug.nl
Dr | Laura | Anderlucci | University of Bologna | Bologna |  | laura.anderlucci@unibo.it
Mr | Markos | Angelos | Democritus University of Thrace | Alexandroupoli | GR | amarkos@eled.duth.gr
Mr | Denis | Anuschewski | Heinrich Heine University Düsseldorf | Düsseldorf | DE | denis.anuschewski@hhu.de
Prof | Yasumasa | Baba | The Institute of Statistical Mathematics | Tachikawa, Tokyo | JP | baba@ism.ac.jp
Mrs | Zsuzsa | Bakk | Tilburg University | Tilburg | NL | z.bakk@uvt.nl
Dr | Beata | Bal-Domańska | Wroclaw University of Economics | Jelenia Gora | PL | beata.bal-domanska@ue.wroc.pl
Prof | David | Banks | Duke University | Durham, NC | US | banks@stat.duke.edu
Dr | Tomasz | Bartłomowicz | Wroclaw University of Economics | Jelenia Góra | PL | tomasz.bartlomowicz@ue.wroc.pl
Prof | Francesca | Bassi | University of Padua | Padova | IT | bassi@stat.unipd.it
Dr | Jean-Patrick | Baudry | Université Pierre et Marie Curie | Paris | FR | jean-patrick.baudry@upmc.fr
Mrs | Margot | Bennink | Tilburg University | Tilburg | NL | m.bennink@uvt.nl
Dr | Patrice | Bertrand | Université Paris-Dauphine | Paris | FR | bertrand@ceremade.dauphine.fr
Prof | Tammo | Bijmolt | University of Groningen, Faculty of Economics & Business | Groningen | NL | t.h.a.bijmolt@rug.nl
Mr | Florian | Böing-Messing | Department of Methodology and Statistics, Tilburg University | Tilburg | NL | f.boeing-messing@uvt.nl
Prof | Anne-Laure | Boulesteix | Ludwig-Maximilians-University | Munich | DE | boulesteix@ibe.med.uni-muenchen.de
Dr | Johan | Braeken | Tilburg University | Tilburg | NL | j.braeken@uvt.nl
Dr | Maria del Carmen | Bravo | Universidad Complutense de Madrid | Madrid | ES | mcbravo@ucm.es
Prof | Paula | Brito | FEP & LIAAD INESC TEC; University of Porto | Porto | PT | mpbrito@fep.up.pt
Prof | François | Brucker | Ecole Centrale Marseille | Marseille | FR | francois.brucker@centrale-marseille.fr
Mrs | Justyna | Brzezińska | University of Economics in Katowice | Katowice | PL | justyna.brzezinska@ue.katowice.pl
Mrs | Silvia | Caligaris | University of Milan-Bicocca | Pavia | IT | silviacaligaris85@gmail.com
Dr | Silvia | Caligaris | University of Milan-Bicocca | Pavia | IT | silviacaligaris85@gmail.com
Dr | Massimo | Cannas | University of Cagliari | Cagliari | IT | massimo.cannas@unica.it
Dr | Véronique | Cariou | Nantes-Atlantic National College of Veterinary Medicine, Food Science and Engineering | Nantes | FR | veronique.cariou@oniris-nantes.fr
Dr | Drago | Carlo | University of Napoli | Rome | IT | c.drago@mclink.it
Prof | Andrea | Cerioli | University of Parma | Parma | IT | andrea.cerioli@unipr.it
Prof | Eva | Ceulemans | KU Leuven | Leuven | BE | Eva.Ceulemans@ppw.kuleuven.be
Mr | Holger | Cevallos Valdiviezo | Ghent University | Ghent | BE | holger.cevallosvaldiviezo@ugent.be
Mr | Pascal | Chave | Heinrich-Heine-Universität Düsseldorf | Düsseldorf | DE | pascal.chave@hhu.de
Mr | Marc | Comas | Universitat de Girona | Girona | ES | mcomas@ima.udg.edu
Dr | Pedro | Contreras | University of London | Egham | GB | pedro@cs.rhul.ac.uk
Dr | Claudio | Conversano | University of Cagliari, Dipartimento di Scienze Economiche ed Aziendali | Cagliari | IT | conversa@unica.it
Prof | Franca | Crippa | Department of Psychology, University of Milano-Bicocca | Milan | IT | franca.crippa@unimib.it
Dr | Marc | Csernel | INRIA-Rocquencourt | Le Chesnay | FR | Marc.Csernel@inria.fr
Dr | Carlos | Cuevas Covarrubias | Universidad Anahuac | Mexico | MX | ccuevas@anahuac.mx
Mr | Bruno | Daigle | Université du Québec à Montréal | Montreal | CA | daigle.bruno@courrier.uqam.ca
Dr | Sanjeena | Dang | University of Guelph | Guelph | CA | ssubedi@uoguelph.ca
Mr | Utkarsh | Dang | University of Guelph | Guelph | CA | udang@uoguelph.ca
Mr | Rainer | Dangl | University of Natural Resources and Life Sciences Vienna | Vienna | AT | rainer.dangl@boku.ac.at
Dr | Antoine | de Falguerolles | Université de Toulouse III (retired) | Toulouse | FR | antoine@falguerolles.net
Mr | Johan | De Rooi | Erasmus MC | Rotterdam | NL | Secretariat.Biostatistics@erasmusmc.nl
Dr | Mark | De Rooij | Leiden University | Leiden, The Netherlands |  | rooijm@fsw.leidenuniv.nl
Mrs | Kim | De Roover | KU Leuven | Leuven | BE | Kim.DeRoover@ppw.kuleuven.be
Dr | Nema | Dean | University of Glasgow | Glasgow | GB | nema.dean@gmail.com
Mr | Gudicha | Dereje W. | Tilburg University | Tilburg |  | D.W.Gudicha@uvt.nl
Dr | Christian | Derquenne | Electricité de France - R&D | Clamart Cedex | FR | christian.derquenne@edf.fr
Prof | Abdoulaye Banire | Diallo | Université du Québec à Montréal | Montreal | CA | diallo.abdoulaye@uqam.ca
Prof | Jean | Diatta | Université de la Réunion | Sainte Clotilde | RE | jean.diatta@univ-reunion.fr
Dr | Florent | Domenach | University of Nicosia | Nicosia | CY | domenach.f@unic.ac.cy
Mrs | Lisa | Doove | KU Leuven, VAT-nr. BE 0419 052 173 | Leuven | BE | lisa.doove@ppw.kuleuven.be
Prof | A. Pedro | Duarte Silva | Catholic University of Portugal / CEGE | Porto | PT | psilva@porto.ucp.pt
Dr | Andrzej | Dudek | Wroclaw University of Economics | Jelenia Gora | PL | andrzej.dudek@ue.wroc.pl
Dr | Elise | Dusseldorp | TNO & Katholieke Universiteit Leuven | Leiden | NL | elise.dusseldorp@tno.nl
Dr | Sergey | Dvoenko | State University of Tula | Tula | RU | sergedv@yandex.ru
Mrs | Iris | Eekhout | VU University medical center | Amsterdam | NL | i.eekhout@vumc.nl
Prof | Paul | Eilers | Erasmus University Medical Center | Rotterdam | NL | p.eilers@erasmusmc.nl
Dr | Wilco | Emons | Tilburg University | Tilburg | NL | w.h.m.emons@tilburguniversity.edu
Mrs | Marije | Fagginger Auer | Leiden University | Leiden | NL | m.f.fagginger.auer@fsw.leidenuniv.nl
Mr | Lukasz | Feldman | Wrocław University of Economics | Wrocław | PL | lukasz.feldman@ue.wroc.pl
Mr | Bernard | Fichet | Aix-Marseille University | Marseille | FR | bernard.fichet@lif.univ-mrs.fr
Dr | Silvia | Figini | University of Pavia | Pavia | IT | silvia.figini@unipv.it
Mr | Kamil | Fijorek | Cracow University of Economics | Cracow | PL | kamil.fijorek@uek.krakow.pl
Mrs | Marjolein | Fokkema | Vrije Universiteit | Amsterdam | NL | m.fokkema@vu.nl
Mr | Luca | Frigau | Italia | Selargius | IT | frigau@unica.it
Mr | Hiroki | Furuzumi | University of Hyogo | Kobe | JP | furuzumi@econ.u-hyogo.ac.jp
Prof | Bernhard | Ganter | Technische Uni Dresden | Dresden | DE | bernhard.ganter@tu-dresden.de
Mr | Eugeniusz | Gatnar | National Bank of Poland | Warsaw | PL | sekretariat.gatnarWB@nbp.pl
Dr | John | Gelissen | Department of Methodology & Statistics, Tilburg University | Tilburg | NL | j.p.t.m.gelissen@uvt.nl
Mr | van den Burg | Gertjan | Erasmus University Rotterdam | Rotterdam | NL | burg@ese.eur.nl
Dr | Paolo | Giordani | Sapienza University of Rome | Rome | IT | paolo.giordani@uniroma1.it
Dr | Anna | Giraldo | University of Padova | Padova | IT | anna.giraldo@unipd.it
Dr | Cynthia | Glodeanu | TU Dresden | Dresden | DE | Cynthia-Vera.Glodeanu@tu-dresden.de
Dr | Tomasz | Górecki | Faculty of Mathematics and Computer Science, Adam Mickiewicz University | Poznań | PL | tomasz.gorecki@amu.edu.pl
Mrs | Rosalie | Gorter | VUmc | Amsterdam | NL | r.gorter@vumc.nl
Prof | John | Gower | Open University | Milton Keynes | GB | j.c.gower@open.ac.uk
Prof | Michael | Greenacre | Universitat Pompeu Fabra | Barcelona | ES | michael.greenacre@gmail.com
Prof | Patrick | Groenen | Erasmus University Rotterdam | Rotterdam | NL | groenen@ese.eur.nl
Dr | Isabelle | Guyon | ClopiNet | San Francisco | US | guyon@clopinet.com
Mr | Kuniyoshi | Hayashi | Okayama University | Okayama City | JP | k-hayashi@ems.okayama-u.ac.jp
Mr | Willem | Heiser | Leiden University | Leiden | NL | Heiser@Fsw.Leidenuniv.nl
Dr | Christian | Hennig | University College London, Department of Statistical Science | London | GB | c.hennig@ucl.ac.uk
Mrs | Joke | Heylen | KU Leuven | Leuven | BE | Joke.Heylen@ppw.kuleuven.be
Prof | Tadashi | Imaizumi | Tama University | Tokyo | JP | imaizumi@tama.ac.jp
Prof | Salvatore | Ingrassia | Department of Economics and Business, University of Catania | Catania | IT | s.ingrassia@unict.it
Dr | Alfonso | Iodice D'Enza | Università di Cassino e del Lazio Meridionale | Cassino | IT | iodicede@unicas.it
Mrs | Lianne | Ippel | Tilburg University | Tilburg | NL | g.j.e.ippel@tilburguniversity.edu
Dr | Loredana | Ivan | National School of Political Studies and Public Administration | Bucharest | RO | loredana.ivan@comunicare.ro
Mr | Ruslan | Jabrayilov | Tilburg University | Tilburg | NL | r.jabrayilov@uvt.nl
Prof | Krzysztof | Jajuga | Wroclaw University of Economics | Wroclaw | PL | krzysztof.jajuga@ue.wroc.pl
Mr | Maarten | Kampert | Leiden University | Leiden | NL | mkampert@math.leidenuniv.nl
Mr | Yusuke | Kanazawa | Rikkyo University | Toshima-ku | JP | kanazawa@rikkyo.ac.jp
Dr | Miloš | Kankaraš | Tilburg University | Tilburg | NL | m.kankaras@uvt.nl
Dr | Robert | Kapłon | Wroclaw University of Technology | Wrocław | PL | robert.kaplon@pwr.wroc.pl
Dr | Maurits | Kaptein | Tilburg University | Nijmegen | NL | maurits@mauritskaptein.com
Prof | Henk | Kelderman | Leiden University | Leiden | NL | h.kelderman@umail.leidenuniv.nl
Prof | Martin | Kidd | University of Stellenbosch | Matieland | ZA | mkidd@sun.ac.za
Prof | Henk | Kiers | Groningen University | Groningen | NL | h.a.l.kiers@rug.nl
Dr | Simona | Korenjak-Cerne | University of Ljubljana, Faculty of Economics | Ljubljana | SI | simona.cerne@ef.uni-lj.si
Mrs | Anna | Krol | Wroclaw University of Economy | Wroclaw | PL | anna.krol@ue.wroc.pl
Dr | Takafumi | Kubota | The Institute of Statistical Mathematics | Tokyo | JP | tkubota@ism.ac.jp
Mrs | Renske | Kuijpers | Tilburg University | Tilburg | NL | r.e.kuijpers@tilburguniversity.edu
Dr | Kei | Kurakawa | National Institute of Informatics | Tokyo | JP | kurakawa@nii.ac.jp
Prof | Koji | Kurihara | Okayama University | Okayama | JP | kurihara@ems.okayama-u.ac.jp
Mrs | Olesia | Kushnir | Tula State University | Tula | RU | kushnir-olesya@rambler.ru
Dr | Morne | Lamont | Stellenbosch University | Stellenbosch | ZA | mmcl@sun.ac.za
Prof | Berthold | Lausen | University of Essex | Colchester | GB | blausen@essex.ac.uk
Prof | Niel | Le Roux | University of Stellenbosch | Stellenbosch | ZA | njlr@sun.ac.za
Prof | Tae Rim | Lee | Korea National Open University | Seoul |  | trlee@knou.ac.kr
Prof | Herbie | Lee | University of California, Santa Cruz | Santa Cruz, CA | US | herbie@ams.ucsc.edu
Prof | Friedrich | Leisch | BOKU Vienna | Vienna | AT | Friedrich.Leisch@boku.ac.at
Mr | Etienne | Lord | Université du Québec à Montréal / Dept. Informatique | Montreal | CA | lord.etienne@courrier.uqam.ca
Prof | Sugnet | Lubbe | University of Cape Town | Cape Town | ZA | Sugnet.Lubbe@uct.ac.za
Mrs | Gertraud | Malsiner-Walli | Johannes Kepler University Linz | Linz | AT | gertraud.malsiner_walli@jku.at
Dr | Małgorzata | Markowska | Wrocław University of Economics | Jelenia Góra | PL | malgorzata.markowska@ue.wroc.pl
Mr | Yusuke | Matsui | Hokkaido University | Sapporo | JP | matsui@iic.hokudai.ac.jp
Dr | Marcella | Mazzoleni | Dipartimento di Statistica e Metodi Quantitativi, Università Bicocca | Milano | IT | m.mazzoleni8@campus.unimib.it
Prof | Geoff | McLachlan | University of Queensland | Brisbane | AU | g.mclachlan@uq.edu.au
Prof | Paul | McNicholas | University of Guelph | Guelph | CA | paul.mcnicholas@uoguelph.ca
Mrs | Dhouha | Mejri | Technische University of Dortmund | Dortmund | DE | mejri_dhouha@yahoo.fr
Dr | Giovanna | Menardi | University of Padua | Padova | IT | menardi@stat.unipd.it
Dr | Hiroyuki | Minami | Hokkaido University | Sapporo | JP | min@iic.hokudai.ac.jp
Mrs | Camelia | Minica | Vrije Universiteit Amsterdam | Amsterdam | NL | c.c.minica@vu.nl
Prof | Boris | Mirkin | NRU Higher School of Economics Moscow | Moscow | RU | mirkin@dcs.bbk.ac.uk
Mr | Masaki | Mitsuhiro | Graduate School of Doshisha University | Kyotanabe | JP | dim0009@mail4.doshisha.ac.jp
Mrs | Marie-Anne | Mittelhaeuser | Tilburg University | Tilburg | NL | M.Mittelhaeuser@uvt.nl
Prof | Masahiro | Mizuta | IIC Hokkaido Univ. | Sapporo | JP | mizuta@iic.hokudai.ac.jp
Prof | Francesco | Mola | University of Cagliari | Cagliari | IT | mola@unica.it
Prof | Angela | Montanari | University of Bologna | Bologna |  | angela.montanari@unibo.it
Mrs | Katherine | Morris | University of Guelph, Ontario, Canada | Toronto | CA | kmorri09@uoguelph.ca
Mr | Pavlo | Mozharovskyi | University of Cologne | Cologne | DE | mozharovskyi@statistik.uni-koeln.de
Dr | Joris | Mulder | Tilburg University | Tilburg | NL | j.mulder3@uvt.nl
Prof | Fionn | Murtagh | Royal Holloway, University of London | London | GB | fmurtagh@acm.org
Prof | Mohamed | Nadif | LIPADE - University of Paris Descartes | Paris | FR | mohamed.nadif@parisdescartes.fr
Mr | Erwin | Nagelkerke | Tilburg University | Tilburg | NL | e.nagelkerke@uvt.nl
Prof | Miki | Nakai | Ritsumeikan University | Kyoto | JP | mnakai@ss.ritsumei.ac.jp
Prof | Junji | Nakano | The Institute of Statistical Mathematics | Tokyo | JP | nakanoj@ism.ac.jp
Dr | Atsuho | Nakayama | Tokyo Metropolitan University | Hachioji-shi | JP | atsuho@tmu.ac.jp
Mr | Amedeo | Napoli | LORIA (CNRS -- INRIA NGE -- Université de Lorraine) | Vandoeuvre les Nancy | FR | Amedeo.Napoli@loria.fr
Dr | Federica | Nicolussi | Università degli Studi Milano-Bicocca | Lissone | IT | f.nicolussi@campus.unimib.it
Prof | Rebecca | Nugent | Carnegie Mellon University | Pittsburgh | US | rnugent@stat.cmu.edu
Dr | Daniel | Oberski | Tilburg University | Tilburg | NL | doberski@uvt.nl
Prof | Akinori | Okada | Tama University | Tokyo | JP | okada@rikkyo.ac.jp
Prof | Rodriguez | Oldemar | University of Costa Rica | San Pedro | CR | oldemar.rodriguez@ucr.ac.cr
Mrs | Hannah | Oosterhuis | Tilburg University | Tilburg | NL | h.e.m.oosterhuis@tilburguniversity.edu
Mr | Pieter | Oosterwijk | Til | Tilburg | NL | p.r.oosterwijk@uvt.nl
Mr | Mory | Ouattara | CEDRCI/CSTB | Paris | FR | mory.ouattara@live.fr
Dr | Jan W. | Owsinski | Systems Research Institute, Polish Academy of Sciences | Warszawa | PL | owsinski@ibspan.waw.pl
Mrs | Opeoluwa | Oyedele | University of Cape Town | Cape Town | ZA | OpeoluwaOyedele@gmail.com
Prof | Francesco | Palumbo | University of Naples Federico II | Naples | IT | fpalumbo@unina.it
Dr | Silvia | Pandolfi | University of Perugia | Perugia | IT | pandolfi@stat.unipg.it
Prof | Barbara | Pawelek | Cracow University of Economics, Department of Statistics | Krakow | PL | pawelekb@uek.krakow.pl
Dr | Marcin | Pełka | Wroclaw University of Economics | Jelenia Góra | PL | marcin.pelka@ue.wroc.pl
Mr | Gonzalo | Perez de la Cruz | National University of Mexico | Mexico | MX | acuario_1984@yahoo.com.mx
Dr | Radoslaw | Pietrzyk | Wroclaw University of Economics | Wroclaw | PL | radoslaw.pietrzyk@ue.wroc.pl
Dr | Krzysztof | Piontek | Wroclaw University of Economics | Wroclaw | PL | krzysztof.piontek@ue.wroc.pl
Prof | Jozef | Pociecha | Cracow University of Economics, Department of Statistics | Krakow | PL | pociecha@uek.krakow.pl
Mr | Olivier | Poirion | Laboratoire AMPERE - UMR 5005-CNRS | Écully | FR | olivier.poirion@ec-lyon.fr
Mr | Pascal | Prea | Ecole Centrale Marseille | Marseille Cedex 20 | FR | pascal.prea@lif.univ-mrs.fr
Dr | Klaudia | Przybysz | Wroclaw University of Economics | Wroclaw | PL | klaudia.przybysz@ue.wroc.pl
Dr | Antonio | Punzo | University of Catania | Catania | IT | antonio.punzo@unict.it
Prof | Giancarlo | Ragozini | Federico II University of Naples | Naples | IT | giragoz@unina.it
Dr | Surajit | Ray | University of Glasgow | Glasgow | GB | surajit.ray@glasgow.ac.uk
Mr | Mohamed Amine | Remita | Université du Québec à Montréal | Montréal | CA | remita.mohamed_amine@uqam.ca
Dr | Ralph | Rippe | Leiden University | Leiden | NL | rrippe@fsw.leidenuniv.nl
Dr | Pawel | Rokita | Wroclaw University of Economics | Wroclaw | PL | pawel.rokita@ue.wroc.pl
Mr | Ruan | Rossouw | Sasol Technology R & D | Sasolburg |  | tanya.cerva@sasol.com
Prof | Adam | Sagan | Cracow University of Economics | Krakow | PL | sagana@uek.krakow.pl
Dr | Jan | Schepers | Maastricht University | Maastricht | NL | jan.schepers@maastrichtuniversity.nl
Dr | Verena | Schmittmann | Tilburg University | Tilburg | NL | v.d.schmittmann@uvt.nl
Mr | Pieter | Schoonees | Erasmus University Rotterdam | Rotterdam |  | schoonees@ese.eur.nl
Dr | Malgorzata | Sej-Kolasa | Wroclaw University of Economics | Jelenia Góra | PL | malgorzata.sej-kolasa@ue.wroc.pl
Mr | Andrey | Shestakov | Higher School of Economics | Moscow | RU | shestakoffandrey@gmail.com
Prof | Klaas | Sijtsma | Tilburg School of Social and Behavioral Sciences | Tilburg | NL | k.sijtsma@uvt.nl
Mrs | Cláudia | Silvestre | ISCTE-IUL | Lisboa | PT | csilvestre@escs.ipl.pt
Prof | Age | Smilde | University of Amsterdam | Amsterdam | NL | a.k.smilde@uva.nl
Mr | Niels | Smits | VU University Amsterdam | Amsterdam | NL | n.smits@vu.nl
Prof | Elżbieta | Sobczak | Wrocław University of Economics | Jelenia Góra | PL | elzbieta.sobczak@ue.wroc.pl
Prof | Andrzej | Sokolowski | Cracow University of Economics | Krakow | PL | sokolows@uek.krakow.pl
Mrs | Alette | Spriensma | VU University Medical Center | Amsterdam | NL | a.spriensma@vumc.nl
Dr | Alwin | Stegeman | University of Groningen | Groningen | NL | a.w.stegeman@rug.nl
Prof | Douglas | Steinley | University of Missouri | Columbia | US | steinleyd@missouri.edu
Prof | Xiaogang | Su | University of Alabama at Birmingham | Birmingham | US | xg.su.2012@gmail.com
Dr | Jacques-Henri | Sublemontier | LIFO University of Orléans | Orléans | FR | jhs@univ-orleans.fr
Prof | Yuan | Sun | National Institute of Informatics | Tokyo | JP | yuan@nii.ac.jp
Dr | Miroslawa | Sztemberg-Lewandowska | Wroclaw University of Economics | Jelenia Góra | PL | miroslawa.sztemberg-lewandowska@ue.wroc.pl
Mr | Kensuke | Tanioka | Graduate School of Culture and Information Science, Doshisha University | Kyotanabe City | JP | eim1001@mail4.doshisha.ac.jp
Dr | Shinobu | Tatsunami | St. Marianna University School of Medicine | Kawasaki | JP | s2tatsu@marianna-u.ac.jp
Dr | Fetene | Tekle | Tilburg University | Tilburg | NL | f.b.tekle@uvt.nl
Mr | Yoshikazu | Terada | Graduate School of Engineering Science, Osaka University | Osaka | JP | terada@sigmath.es.osaka-u.ac.jp
Dr | Makoto | Tomita | Tokyo Medical and Dental University | Tokyo | JP | tomita.crc@tmd.ac.jp
Mr | Gen | Tsuchiyama | Graduate School of Culture and Information Science, Doshisha University | Kyotanabe City | JP | eim1002@mail4.doshisha.ac.jp
Prof | Mitsuhiro | Tsuji | Faculty of Informatics / Kansai University / Japan | Takatsuki-shi, Osaka | JP | tsuji@kansai-u.ac.jp
Mr | Takahiko | Ueno | St. Marianna University | Kawasaki | JP | t2ueno@marianna-u.ac.jp
Mr | Takahiro | Umei | Graduate School of Doshisha University | Kyotanabe | JP | dim0015@mail4.doshisha.ac.jp
Mr | Robbie | van Aert | Department of Methodology and Statistics, Tilburg University | Tilburg |  | R.C.M.vanAert@tilburguniversity.edu
Dr | Michel | van de Velden | Erasmus University Rotterdam | Rotterdam | NL | vandevelden@ese.eur.nl
Mr | Gertjan | van den Burg | Erasmus University Rotterdam | Rotterdam | NL | burg@ese.eur.nl
Dr | Andries | Van der Ark | Tilburg University | Tilburg | NL | a.vdark@uvt.nl
Mr | Daniël | van der Palm | Tilburg University | Tilburg | NL | D.W.vdrPalm@uvt.nl
Dr | Katrijn | Van Deun | KU Leuven, VAT: 0419 052 173 | Leuven | BE | katrijn.vandeun@ppw.kuleuven.be
Prof | Fred | van Eeuwijk | Wageningen University | Wageningen | NL | fred.vaneeuwijk@wur.nl
Mrs | Anoukh | van Giessen | UMC Utrecht | Utrecht | NL | a.vangiessen@umcutrecht.nl
Mrs | Loan | van Hoeven | UMC Utrecht | Utrecht | NL | l.r.vanhoeven-3@umcutrecht.nl
Dr | M. Lee | Van Horn | University of South Carolina | Columbia, SC | US | vanhorn@sc.edu
Mr | Geert | van Kollenburg | Tilburg University | Oirschot | NL | g.h.vankollenburg@uvt.nl
Prof | Iven | Van Mechelen | KU Leuven, VAT: 0419 052 173 | Leuven | BE | iven.vanmechelen@ppw.kuleuven.be
Prof | Rosanna | Verde | Second University of Naples | Caserta | IT | rosanna.verde@unina2.it
Prof | Jeroen K. | Vermunt | Tilburg School of Social and Behavioral Sciences | Tilburg | NL | j.k.vermunt@uvt.nl
Mrs | Marlies | Vervloet | KU Leuven | Leuven | BE | marlies.vervloet@ppw.kuleuven.be
Prof | Donatella | Vicari | Dip. Scienze Statistiche - Sapienza Univ. Roma | Roma | IT | donatella.vicari@uniroma1.it
Prof | Maurizio | Vichi | Sapienza University of Rome | Rome | IT | maurizio.vichi@uniroma1.it
Mrs | Maria del Carmen | Villar Patino | Universidad Anahuac | Mexico | MX | maria.villar@anahuac.mx
Mrs | Ingrid | Vriens | Tilburg University | Tilburg | NL | i.vriens@tilburguniversity.edu
Prof | Marek | Walesiak | Wroclaw University of Economics | Jelenia Góra | PL | marek.walesiak@ue.wroc.pl
Dr | Matthijs | Warrens | Leiden University | Leiden |  | warrens@fsw.leidenuniv.nl
Mr | Lukasz | Waszak | Adam Mickiewicz University | Poznan | PL | lwaszak@amu.edu.pl
Dr | Jelte | Wicherts | Tilburg University | Tilburg | NL | j.m.wicherts@uvt.nl
Mr | Tom | Wilderjans | KU Leuven | Leuven | BE | tom.wilderjans@ppw.kuleuven.be
Dr | Justyna | Wilk | Wroclaw University of Economics | Jelenia Góra | PL | justyna.wilk@ue.wroc.pl
Prof | Adilson Elias | Xavier | Federal University of Rio de Janeiro | Rio de Janeiro | BR | adilson@cos.ufrj.br
Prof | Hiroshi | Yadohisa | Doshisha University | Kyotanabe | JP | hyadohis@mail.doshisha.ac.jp
Prof | Kazunori | Yamaguchi | Rikkyo University | Tokyo | JP | kyamagu@rikkyo.ac.jp
Dr | Michio | Yamamoto | Osaka University | Toyonaka | JP | myamamoto@sigmath.es.osaka-u.ac.jp
Mr | Achim | Zeileis | Universität Innsbruck | Innsbruck | AT | Achim.Zeileis@R-project.org
Dr | Mariangela | Zenga | University of Milano-Bicocca | Milano | IT | mariangela.zenga@unimib.it
Mr | Berrie | Zielman | Netherlands Court of Audit | Leidschendam | NL | a.zielman@rekenkamer.nl
Dr | Mario | Ziller | Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health | Greifswald - Insel Riems | DE | Mario.Ziller@fli.bund.de
Mrs | Agata | Zoltaszek | Chair of Spatial Econometrics, University of Lodz | Lodz | PL | zoltaszek@uni.lodz.pl

Scientific Program

Monday, July 15
Plenary Invited Sessions
Time: 09:00-10:30
Room: CZ115
Chair: Groenen

Critical Issues and Developments in High-dimensional Prediction with Biomedical Applications
Anne-Laure Boulesteix

Flexible Model Based Clustering via the Cluster-Weighted Approach
Salvatore Ingrassia

Monday, July 15
Concurrent Session 1a
Topic: Applications in marketing and social policy
Time: 11:00-12:20
Room: CZ6
Chair: Okada

Latent Class Models in Marketing: Trading off Classification Certainty and Costs of Data Collection.
Maurits Kaptein

Market Segmentation based on Stated Preferences using Latent Class Models and R
Andrzej Bak, Aneta Rybicka, and Marcin Pełka

Multi-layer Cluster Analysis of Brand Switching Among Coffee Brands
Akinori Okada and Satoru Yokoyama

Polish Households’ Pharmaceutical Expenditures in Years 2010–2020 – Microsimulation Analysis with FARMMES
Agata Zoltaszek, M.A.

Monday, July 15
Concurrent Invited Session 1b
Topic: Analysis of symbolic data
Time: 11:00-12:20
Room: CZ7
Organizer and chair: Brito

Clustering for Aggregated Symbolic Data
Nobuo Shimizu and Junji Nakano

Factor Analysis of Distributional Data using Quantiles
Rosanna Verde and Antonio Irpino

A Hierarchical Clustering Algorithm applied to Modal Ordinal Symbolic Data
Carmen Bravo and José M. García-Santesmases

Constrained Clustering of Temporal Beanplot Data
Carlo Drago

Monday, July 15
Concurrent Invited Session 1c
Topic: Reconsidering methodologies in inequalities indicators: the case of gender studies (session sponsored by European Association of Methodology)
Time: 11:00-12:20
Room: CZ8
Organizer and chair: Crippa

Gender Gap: towards a Measurement with Chain Graphical Models
Federica Nicolussi and Fulvia Mecatti

Time To Graduation: Does Gender Make A Difference? An Analysis Of A Greek University
Adele H. Marshall, Aglaia Kalamatianou, and Mariangela Zenga

Beyond indicators: a Causal Approach to Gender Statistics
Silvia Caligaris and Fulvia Mecatti

Gender Differentials In Higher Education: Hints From A Fuzzy States Analysis
Franca Crippa, Marcella Mazzoleni and Mariangela Zenga

Monday, July 15
Concurrent Session 1d
Topic: Correspondence analysis
Time: 11:00-12:20
Room: CZ9
Chair: Le Roux

Analysing Categorical Variables With Similar Categories: Constrained Multiple Correspondence Analysis
Véronique Cariou and El Mostafa Qannari

Constrained Dual Scaling of Successive Categories for Detecting Response Styles
Pieter C. Schoonees, Michel van de Velden, and Patrick J.F. Groenen

ORTHOMALS: Orthogonal Projection Of A Multiple Correspondence Solution On A Design Space
Ralph C.A. Rippe and Willem J. Heiser

Squared Covariances Or Chi-Squared Statistics Based Distances
Antoine de Falguerolles

Monday, July 15
Concurrent Session 1e
Topic: Latent class analysis
Time: 11:00-12:20
Room: CZ109
Chair: Oberski

A New Constant Memory Recursion For Hidden Markov Models
Francesco Bartolucci and Silvia Pandolfi

Detecting Local Dependence In Binary Data Latent Class Models: Some Developments
Daniël Oberski

Power and Sample Size Determination for Latent Class Models
Dereje W. Gudicha, Jeroen K. Vermunt, and Fetene B. Tekle

The Bias-Adjusted Three-Step Approach To Latent Class Modeling With External Variables
Zsuzsa Bakk, Daniel Oberski, and Jeroen K. Vermunt

Monday, July 15
Concurrent Invited Session 1f
Topic: Recent clustering techniques and their applications
Time: 11:00-12:20
Room: CZ114
Organizer and chair: Kurihara

Comparative Analysis on LDA-based Classification and Subject Categories of the Japanese Awards Database of Grants-in-Aid for Scientific Research, KAKEN
Kei Kurakawa, Yuan Sun, and Yasumasa Baba

Prototype Identification through Archetypes
Giancarlo Ragozini

Spatial Clustering based on Hierarchical Structure of Multidimensional Lattice Data
Koji Kurihara and Fumio Ishioka

Research Literature Analytics through Mapping Narratives
Fionn Murtagh

Monday, July 15
Plenary Invited Sessions
Time: 13:20-14:50
Room: CZ115
Chair: Jajuga

Effects of Moment-to-moment Likeability Patterns on the Virality of Online Ads
Tammo Bijmolt

Formal Concepts for Classification
Bernhard Ganter

Monday, July 15
Concurrent Invited Session 2a
Topic: Biostatistics & psychometrics
Time: 15:20-16:40
Room: CZ6
Organizer and chair: Lee

Multinomial Logistic Regression Ensembles
Hongshik Ahn

Age-specific Disease Network For The Major Disease In Korea
Taerim Lee and Hongseok Kim

Analysis of Questionnaire Survey with Ordinal-polytomous Using the Binomial Confidence Limits
Ueno, T., Tatsunami, S., Otaki, M., and Kuwabara, R.

Comparison Of Methods For Handling Missing Data In A Multi-Item Instrument
I. Eekhout, H.C.W. de Vet, J.W.R. Twisk, J.P.L. Brand, M.R. de Boer, and M.W. Heymans

Monday, July 15
Concurrent Session 2b
Topic: Reduced rank clustering
Time: 15:20-16:40
Room: CZ7
Chair: Wilderjans

Common and Cluster-specific Simultaneous Component Analysis
Kim De Roover, Marieke E. Timmerman, Batja Mesquita and Eva Ceulemans

Extending Clusterwise non-negative matrix factorization (NMF) to hierarchically organized data
Joke Heylen, Philippe Verduyn, Iven Van Mechelen and Eva Ceulemans

Generalized Reduced Clustering Analysis
Michio Yamamoto

Mixtures Of Factor Analyzers And Unobserved Heterogeneity In Questionnaire Data
Robert Kapłon

Monday, July 15
Concurrent Invited Session 2c
Topic: Research of IOPS Ph.D.-students
Time: 15:20-16:40
Room: CZ8
Organizer and chair: Kuijpers

Estimation Methods for Categorical Marginal Models: Comparing MAEL, GEE, and GSK.
Renske E. Kuijpers, Wicher P. Bergsma, L. Andries van der Ark, and Marcel A. Croon

Applying Multilevel Latent Class Analysis To Large-Scale Educational Assessment Data: Predicting Students’ Mathematical Strategy Choices From Teachers’ Instructional Practice
Marije F. Fagginger Auer, Marian Hickendorff, and Cornelis M. van Putten

A Tuning Strategy for COSA
Maarten M.D. Kampert and Jacqueline J. Meulman

Accuracy Of Reliability Estimates
Pieter R. Oosterwijk, Klaas Sijtsma, and L. Andries van der Ark

Monday, July 15
Concurrent Session 2d
Topic: Symbolic data clustering and regression
Time: 15:20-16:40
Room: CZ9
Chair: Brito

A Big Data Intensive Application System with Symbolic Data Analysis and its Implementation
Hiroyuki Minami and Masahiro Mizuta

A Generalization Of Centre And Range Method For Fitting A Linear Regression Model To Symbolic Interval Data Using Ridge Regression, Lasso And Elastic Net Methods
Oldemar Rodríguez

Symbolic Data Clustering. A Review
Justyna Wilk

The Ensemble Conceptual Clustering of Symbolic Data
Marcin Pełka

Monday, July 15
Concurrent Invited Session 2e
Topic: Applications in economics and business
Time: 15:20-16:40
Room: CZ109
Organizer and chair: Pociecha

The Hierarchy Test Of Geographic Units based on Border Lengths
Andrzej Sokołowski, Danuta Strahl, Małgorzata Markowska, and Marek Sobolewski

Statistical Modeling the Optimal Level of FX Reserves for Poland
Eugeniusz Gatnar

Latent Transitions with Mixture Rasch Model of Bankruptcy Risk in the Classification of Polish Firms
Barbara Pawełek, Józef Pociecha, and Adam Sagan

Automatic Determination The Number Of Clusters In Spectral Clustering
Marek Walesiak and Andrzej Dudek

Monday, July 15
Concurrent Session 3a
Topic: Clustering algorithms
Time: 17:10-18:30
Room: CZ6
Chair: Hennig

A Spectral-Mean Shift Algorithm for Clustering of Symbolic Data
Andrzej Dudek and Marcin Pełka

Asymptotics of Reduced K-means Clustering
Yoshikazu Terada

Non-hierarchical Clustering Algorithm For Mixed Numerical And Categorical Three-Way Three-Mode Data
Takahiro Umei and Hiroshi Yadohisa

Using Simulation Strategies to Test Clustering Algorithm Performances
Marina Marino and Cristina Tortora

Monday, July 15
Concurrent Invited Session 3b
Topic: recursive partitioning and application (session sponsored by IASC)
Time: 17:10-18:30
Room: CZ7
Organizer and chair: Wilhelm

Random Forest Variable Importance Measures: Current Developments
Anne-Laure Boulesteix and Silke Janitza

Detecting Threshold Interactions In Binary Classification: STIMA
Claudio Conversano and Elise Dusseldorp

A Recursive Partitioning-Based Method To Balance Covariates When Estimating Causal Effects
Massimo Cannas, Claudio Conversano and Francesco Mola

Recursive Partitioning for Hybrid Image Classification using Captions and Image Features
Adalbert Wilhelm

Monday, July 15
Concurrent Session 3c
Topic: Applications in economics
Time: 17:10-18:30
Room: CZ8
Chair: Markos

Change of Aspects of Industrial Classification System from Hierarchical Structure to Network Structure
Hiroki Furuzumi, Yoshiro Matsuda, and Yasumasa Baba

Econometric Models of Durable Goods’ Prices: A Hedonic Approach
Anna Król

Smart Growth Versus Economic And Social Cohesion – Econometric Panel Analysis
Beata Bal-Domanska and Elzbieta Sobczak

Workflow Classification Based On The K-Means Partitioning
Etienne Lord, Abdoulaye Baniré Diallo, and Vladimir Makarenkov

Monday, July 15
Concurrent Session 3d
Topic: R packages
Time: 17:10-18:30
Room: CZ9
Chair: Leisch

Functional Principal Component Analysis with R
Malgorzata Sej-Kolasa and Miroslawa Sztemberg-Lewandowska

Implementation of Time Series Methods of Forecasting in TSprediction R Package
Tomasz Bartłomowicz

Latest developments of the RSDA: An R package for Symbolic Data Analysis
Oldemar Rodríguez and Johnny Villalobos

Microeconometrics Multinomial Logit Models and their Implementations in MMLM R Package
Andrzej Bak and Tomasz Bartłomowicz

Monday, July 15
Concurrent Session 3e
Topic: Latent variable & multilevel analysis
Time: 17:10-18:30
Room: CZ109
Chair: Montanari

Latent Spaces of the Product Baskets - A Hybrid Model of On-line Shopping
Adam Sagan and Mariusz Łapczynski

Multilevel Principal Covariates Regression
Marlies Vervloet, Wim Van den Noortgate, Katrijn Van Deun and Eva Ceulemans

Three-step Estimation Method For Discrete Micro-Macro Multilevel Models
M. Bennink, M. A. Croon and J. K. Vermunt

Single-array SNP Genotype Classification With Semi-Parametric Log-Concave Mixtures
Paul H.C. Eilers and Ralph C.A. Rippe

Monday, July 15
Concurrent Invited Session 3f
Topic: Least-squares clustering
Time: 17:10-18:30
Room: CZ114
Organizer and chair: Mirkin

On Featureless K-Means Clustering
Sergey D. Dvoenko

Two Major Least-squares Divisive Clustering Methods: Bisecting K-Means, PDDP and in between
E. Kovaleva and B. Mirkin

Scoring Dissimilarity between Binary Images by Aligning Series of Skeleton Primitives
Olesya A. Kushnir and Oleg S. Seredin

Least-squares Consensus Clustering versus: (a) other Consensus Approaches and (b) K-Means
A. Shestakov and B. Mirkin

Tuesday, July 16
Concurrent Session 4a
Topic: Clustering methods
Time: 08:30-10:10
Room: CZ6
Chair: Bertrand

Combination of Several Control Charts using Dynamic Weighted Majority Algorithm
Dhouha Mejri, Claus Weihs and Mohamed Limam

Multiplicity Within Clustering: Challenges And Unifications
Jacques-Henri Sublemontier

Non-Isometric Transforms in Time Series Classification using DTW
Tomasz Górecki and Maciej Łuczak

Performance of the Accelerated Hyperbolic Smoothing Clustering Method
Adilson Elias Xavier and Vinicius Layter Xavier

STATIS Based Multiblock Clustering
Ndèye Niang and Mory Ouattara

Tuesday, July 16
Concurrent Invited Session 4b
Topic: New trends in analyzing multi-set and three-way data
Time: 08:30-10:10
Room: CZ7
Organizers: Wilderjans and Ceulemans (Chair)

Identifying Common And Distinctive Processes Underlying Multiset Data
Katrijn Van Deun, Age K. Smilde, Henk A.L. Kiers, and Iven Van Mechelen

Fuzzy Clustering of Three-way Proximity Arrays
Paolo Giordani and Henk A.L. Kiers

Principal Covariates Clusterwise Regression
Eva Ceulemans, Eva Vande Gaer, Henk A. L. Kiers, Iven Van Mechelen, and Tom F. Wilderjans

Clusterwise PARAFAC To Identify Heterogeneity In Three-Way Data
Tom F. Wilderjans and Eva Ceulemans

Structure-Revealing Data Fusion Model
Evrim Acar, Anders J. Lawaetz, Morten A. Rasmussen, and Rasmus Bro

Tuesday, July 16
Concurrent Session 4c
Topic: Distances and similarities
Time: 08:30-10:10
Room: CZ8
Chair: Okada

Effects of Resampling Schemes on Stability of Cluster Validation Indices
Rainer Dangl and Friedrich Leisch

Functional Canonical Correlation Analysis
Mirosław Krzysko and Łukasz Waszak

Pearson’s Product-Moment Correlation is a Special Case Of Cohen’s Weighted Kappa
Matthijs J. Warrens

Ternary Diagrams Based On A Probabilistic Ideal Point Model
Mark de Rooij and Paul Eilers

The Matter Of Scale: Perceiving Distances And Proximities In The Bi-Partial Clustering Setting
Jan W. Owsinski

Tuesday, July 16
Concurrent Session 4d
Topic: Algorithms for clustering and classification
Time: 08:30-10:10
Room: CZ9
Chair: Sokołowski

Comparing Direct Estimators of the Mode
Andrzej Sokołowski and Kamil Fijorek

k-NN Algorithm for Instantaneous Classification
Carmen Villar-Patiño and Carlos Cuevas-Covarrubias

Flexible Multiclass Support Vector Machines: An Approach using Iterative Majorization and Huber Hinge Errors
G.J.J. van den Burg and P.J.F. Groenen

Power-Stress for Multidimensional Scaling
Patrick J.F. Groenen and Jan de Leeuw

Variable Selection in Cluster Analysis Using Resampling Techniques: a Proposal
Hans-Joachim Mucha and Hans-Georg Bartel

Tuesday, July 16
Concurrent Session 4e
Topic: Applications in risk analysis and finance
Time: 08:30-10:10
Room: CZ109
Chair: Cuevas

Adversarial Risk Analysis in Auctions
David Banks

Gaussian Process Classification And Duration Models For Credit Risk
Silvia Figini and Aki Vehtari

Model Averaging For Credit Risk Modelling
Silvia Figini and Marika Vezzoli

Multiobjective Optimization Of Financing Household Goals With Multiple Investment Programs
Lukasz Feldman, Radoslaw Pietrzyk, and Pawel Rokita

Power Of Skewness Tests In The Presence Of Fat Tailed Financial Distributions
Krzysztof Piontek

Tuesday, July 16
Concurrent Session 4f
Topic: Applications in social sciences
Time: 08:30-10:10
Room: CZ110
Chair: Palumbo

Robust Clustering for Anti-Fraud Analysis
Andrea Cerioli and Domenico Perrotta

An Extended Gravity Approach To Examining Internal Migrations. The Case Of Poland
Justyna Wilk and Michał Pietrzak

Clustering of US counties based on their demographic structures
Simona Korenjak-Cerne, Vladimir Batagelj, Nataša Kejžar

Strategic, Motivational And Emotional Aspects Of University Study. A Latent Class Approach
Anna Giraldo, Silvia Meggiolaro, and Elisa Visentin

The Comparative Log–Linear Analysis Of Unemployment In Poland In 2004–2011
Justyna Brzezinska

Tuesday, July 16
President’s Invited Session
Time: 10:40-12:10
Room: CZ115
Chair: Dean

Measurement of Quality in Cluster Analysis
Christian Hennig

Resampling Methods for Exploring Cluster Stability
Friedrich Leisch

The Effect Of Data Generation On Our Understanding Of Clustering Algorithms
Doug Steinley

Tuesday, July 16
Concurrent Session 5a
Topic: Clustering and multilevel analysis of symbolic data
Time: 13:10-14:30
Room: CZ6
Chair: McNicholas

Clustering Constrained Symbolic Objects Constrained By Rules
Marc Csernel

Conceptual Clustering with Interval Representation
Paula Brito and Géraldine Polaillon

Hierarchical Symbolic Cluster Analysis with Quantile Function Representation
Yusuke Matsui, Hiroyuki Minami, and Masahiro Mizuta

Multilevel Consumer Preference Model on Symbolic Data
Adam Sagan, Marcin Pełka, and Aneta Rybicka

Tuesday, July 16
Concurrent Invited Session 5b
Topic: advances in clustering and classification
Time: 13:10-14:30
Room: CZ7
Organizer and chair: Nugent

The Variance of the Adjusted Rand Index (and other properties)
Doug Steinley

Identifying Clusters in Bayesian Disease Mapping
Nema Dean, Craig Anderson, and Duncan Lee

Classification Boundary Mapping
Yuning He and Herbert Lee

Deduplicating Text Records by Clustering the Results of Aggregated Conditional Classifiers
Rebecca Nugent and Samuel L. Ventura

Tuesday, July 16
Concurrent Session 5c
Topic: Applications in behavioral sciences
Time: 13:10-14:30
Room: CZ8
Chair: Yamaguchi

Classifications of Baseball Pitching Strategies and Exploring Effects of the New Official Balls in the Japanese Professional Baseball League
Kazunori Yamaguchi

Life Long Learning Idea on Background of Poles’ Needs
Marta Dziechciarz-Duda and Klaudia Przybysz

Migration Of Population - The Analysis With The Use Of Log-Linear Models
Justyna Brzezinska

The Influence of Emotion Recognition and Academic Performance on Group Popularity
Ivan Loredana

Tuesday, July 16
Concurrent Invited Session 5d
Topic: Formal concept analysis
Time: 13:10-14:30
Room: CZ9
Organizer and chair: Ganter

Hierarchical Classes Analysis vs. Formal Concept Analysis
Bernhard Ganter and Cynthia V. Glodeanu

The Diversity of Pattern Structures in Formal Concept Analysis
Aleksey Buzmakov, Sergei O. Kuznetsov, and Amedeo Napoli

Decision Aiding Software And Consensus Theory
Florent Domenach and Ali Tayari

Experimental Comparison of Some Triclustering Algorithms
Dmitry V. Gnatyshak, Dmitry I. Ignatov, and Sergei O. Kuznetsov

Tuesday, July 16
Concurrent Invited Session 5e
Topic: Interactions in bi- and tri-additive models
Time: 13:10-14:30
Room: CZ109
Organizers: Albers and Gower (Chair)

A Framework For Modeling Covariances
Age K. Smilde, M.E. Timmerman, H.C.J. Hoefsloot, J.J. Jansen, and E. Saccenti

Biadditive Models, Alternative Estimation Procedures And Better Biplots
Fred A. van Eeuwijk, Gerrit Gort, Sabine K. Schnabel, and Paul H.C. Eilers

Triadditive Models for Three-way Tables
John C. Gower, Casper J. Albers, and Steffen Unkel

Three-way Candecomp/Parafac And The Diverging Components Problem
Alwin Stegeman


Tuesday, July 16
Concurrent Session 5f
Topic: Cluster-weighted modeling
Time: 13:10-14:30
Room: CZ114
Chair: Ingrassia

Cluster-weighted t-factor Analyzers for Clustering of High-dimensional Data
Sanjeena Dang, Antonio Punzo, Salvatore Ingrassia, and Paul D. McNicholas

Cluster-Weighted Modeling For Time To Event Data
Utkarsh J. Dang and Paul D. McNicholas

Modeling Bivariate Mixed-Type Data with the Generalized Linear Exponential Cluster-Weighted Model
Salvatore Ingrassia and Antonio Punzo

Tuesday, July 16
Plenary Invited Sessions
Time: 15:00-15:45
Room: CZ115
Chair: Vichi

Cluster Inference using Modes
Surajit Ray

Tuesday, July 16
Presidential address
Time: 15:45-16:30
Room: CZ115
Chair: Vichi

IFCS Presidential Address
Classipedia: A Road Map to Help Traverse the Classification Jungle
Iven Van Mechelen

Wednesday, July 17
Concurrent Session 6a
Topic: Clustering, including ultrametric approaches
Time: 08:30-10:10
Room: CZ6
Chair: Diatta

A Restricted ADCLUS Type Model for Transition Matrices
Tadashi Imaizumi

Clustering Of Time Series Via A Segmentation Approach
Christian Derquenne

Looking For A Best Compromise Between The Ultrametric Supremum-Norm Approximations
B. Fichet

Ultrametric Tree Representation For Three-Way Three-Mode Data With Weights Of Variables And Occasions
Kensuke Tanioka and Hiroshi Yadohisa

Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras, Fionn Murtagh, and Javier Pereira

Wednesday, July 17
Concurrent Invited Session 6b
Topic: Personalized medicine by treatment-subgroup interaction
Time: 08:30-10:10
Room: CZ7
Organizer: Elise Dusseldorp
Chair and discussant: Willem Heiser

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Achim Zeileis, Torsten Hothorn, and Kurt Hornik

Predicting Individual Causal Effects (ICE)
Xiaogang Su and Joseph Kang

A New Tool For Identifying Qualitative Treatment-Subgroup Interactions: QUINT
Elise Dusseldorp and Iven Van Mechelen

A Comparison Of Six Sequential Partitioning Methods To Find Subgroups Involved In Treatment-Subgroup Interactions
Lisa Doove, Elise Dusseldorp, Katrijn Van Deun, and Iven Van Mechelen

Wednesday, July 17
Concurrent Session 6c
Topic: Modeling distributions and associations
Time: 08:30-10:10
Room: CZ8
Chair: Kiers

Automatic Bayes Factors for Comparing Variances of Two Independent Normal Distributions
Florian Böing-Messing and Joris Mulder

Bayesian Model Selection For Evaluating Equality And Order Constraints On Correlation Matrices
Joris Mulder

Bivariate Dependence Patterns And Copulas: Model Discrimination And Robustness
Lianne Ippel and Johan Braeken

Posterior Predictive checking as alternative to Asymptotics and Bootstrapping in Latent Class Analysis
Geert H. van Kollenburg, Joris Mulder, and Jeroen K. Vermunt

Statistical Modeling Of The Distribution Of Financial Returns
Cuevas-Covarrubias C., Iñigo-Martínez J. and Rosales-Contreras J.


Wednesday, July 17
Concurrent Session 6d
Topic: Classification trees
Time: 08:30-10:10
Room: CZ9
Chair: Lausen

Combining Decision Trees And Stochastic Curtailment For Assessment Length Reduction Of Test Batteries Used For Classification
Marjolein Fokkema, Niels Smits, and Henk Kelderman

Gaussian Tree Models For Discrimination
Gonzalo Perez–de–la–Cruz and Guillermina Eslava–Gomez

Stochastic Curtailment Of Questionnaires For Three Level Classification: Shortening The Ces-D For Assessing Low, Moderate, And High Risk Of Depression
Niels Smits, Matthew Finkelman, and Henk Kelderman

Tree-Based Prediction with Missing Data
Holger Cevallos Valdiviezo, Stefan Van Aelst

Sparse Classifier Ensembles for Improved Interpretability.
Werner Adler, Zardad Khan, Sergej Potapov and Berthold Lausen

Wednesday, July 17
Concurrent Session 6e
Topic: Classification
Time: 08:30-10:10
Room: CZ109
Chair: Groenen

A ROC-Optimised Multi-Prototype Classifier
Mario Ziller

Classification of Rounded Shapes with Penalized Signal Regression
Johan J. de Rooi and Paul H.C. Eilers

Classification of Topics on Twitter in Consideration of Time Series Variation
Atsuho Nakayamar, Hiroyuki Tsurumi, and Junya Masuda

Classifying Real-World Data With The DDα-Procedure
Pavlo Mozharovskyi, Karl Mosler, and Tatjana Lange

Comparing High-Dimensional Classifiers: Abuse and Dangers of Overall Accuracy
A. Pedro Duarte Silva

Wednesday, July 17
Concurrent Session 6f
Topic: Model-based clustering
Time: 08:30-10:10
Room: CZ114
Chair: McLachlan

Divisive Latent Class Modeling as a Density Estimation Tool: The Estimation Algorithm and an Application to Incomplete Data.
Daniel W. van der Palm, L. Andries van der Ark, and Jeroen K. Vermunt

Determining the Number of Clusters in Categorical Data
Cláudia Silvestre, Margarida Cardoso, and Mário Figueiredo

Identifying Mixtures of Mixtures Using Bayesian Estimation
Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün

Logratio Methodology Applied To Model-Based Clustering
M. Comas-Cufí, G. Mateu-Figueras and J.A. Martín-Fernández

Model-based Clustering Of Multivariate Longitudinal Data
Laura Anderlucci, Angela Montanari, and Cinzia Viroli

Wednesday, July 17
Concurrent Session 7a
Topic: Longitudinal and multilevel analysis
Time: 10:40-12:00
Room: CZ6
Chair: Nugent

A Bayesian Multilevel Modeling of Longitudinal data: Application to Hygroscopic Expansion in Composite Resins
Nasim Vahabi, Mahmood Reza Gohari, and Ali Azarbar

A New Approach To Analyse Longitudinal Epidemiological Data With An Excess Of Zeros
A.S. Spriensma, T.R.S. Hajos, M.R. de Boer, M.W. Heijmans, and J.W.R. Twisk

A Linear Mixed Model with a Mixture of Smooth Random Effects Distributions
Berrie Zielman

Longitudinal IRT Modelling compared with Multilevel Analysis in estimating Development Over Time In Data From Three Likert-Item Questionnaires
R. Gorter, M.R. de Boer, M.W. Heijmans, and J.W.R. Twisk

Wednesday, July 17
Concurrent Invited Session 7b
Topic: Biclustering
Time: 10:40-12:00
Room: CZ7
Organizer: Vichi
Chair: Van Mechelen

Mutual Information, Chi-Squared And Model-Based Clustering For Co-Clustering Of Contingency Tables
Mohamed Nadif and Gérard Govaert

Parsimonious Estimation And Testing Of Two-Way Interaction By Means Of Two-Mode Clustering
Jan Schepers

A general Model for Two-mode Clustering
Maurizio Vichi

Wednesday, July 17
Concurrent Session 7c
Topic: Applications in medicine
Time: 10:40-12:00
Room: CZ8
Chair: Lausen

Comprehensive Calculations of the Sensitivity and Specificity of Diagnosis Using Bile Cytological Data
Tatsunami S., Hayakawa C., Koike J., Hoshikawa M., and Ueno T.

Diagnostics for the Risk Prediction of Each Type of Endoleak Formation after TEVAR Using Statistical Discriminant Analysis
Kuniyoshi Hayashi, Fumio Ishioka, Bhargav Raman, Daniel Y. Sze, Hiroshi Suito, Takuya Ueda, and Koji Kurihara

Extension Of A Multilingual Medical Lexicon By Combined Feature Extraction Methods
Wiebke Petersen, Denis Anuschewski, Pascal Chave, and Philipp F. Zeitz

Wednesday, July 17
Concurrent Invited Session 7d
Topic: Correspondence analysis and related methods
Time: 10:40-12:00
Room: CZ9
Organizers: Groenen (Chair) and Greenacre

The Joy of Fuzzy
Michael Greenacre

Fast Iterative Implementation of Correspondence Analysis
Alfonso Iodice D’Enza, Patrick J. Groenen and Michel van de Velden

Inverse Multiple Correspondence Analysis
Michel van de Velden, Patrick Groenen, and Wilco van den Heuvel

Tracking Association Structures in Categorical Data Flows
Alfonso Iodice D’Enza and Angelos Markos

Wednesday, July 17
Concurrent Invited Session 7e
Topic: finding the number of clusters
Time: 10:40-12:00
Room: CZ109
Organizer and chair: Hennig

Determining the Number of Clusters: a Problem of Definition or Estimation?
Giovanna Menardi

Enhancing The Selection Of A Number Of Clusters In Model-Based Clustering With External Qualitative Variables
J.-P. Baudry, M. Cardoso, G. Celeux, M.J. Amorim, and A.S. Ferreira

Choosing the Number of Clusters after, before, and while Clustering
B. Mirkin

Wednesday, July 17
Plenary Invited Sessions
Time: 13:00-14:30
Room: CZ115
Chair: Nadif

Competitions in Machine Learning: the Fun, the Art, and the Science
Isabelle Guyon

Playing with Data–or How to Discourage Incorrect Data Analysis
Klaas Sijtsma

Wednesday, July 17
Concurrent Session 8a
Topic: Applications
Time: 15:00-16:20
Room: CZ6
Chair: Bassi

A Study on Small-Area Geographical Analysis of Residential Characteristics after the Great Hanshin-Awaji Earthquake by two Individual Differences Model
Mitsuhiro Tsuji, Hiroshi Kageyama and Toshio Shimokawa

Author Identification of Japanese Classical Literature by Quantitative Analysis
Gen Tsuchiyama and Masakatsu Murakami

A Latent Class Approach for Estimating Labour Market Mobility in the Presence of Multiple Indicators and Retrospective Interrogation
Francesca Bassi, Marcel Croon, and Davide Vidotto

Wednesday, July 17
Concurrent Invited Session 8b
Topic: Non-Gaussian model-based classification
Time: 15:00-16:20
Room: CZ7
Organizer and chair: McNicholas

On Finite Mixtures of Skew Distributions
Geoff McLachlan and Sharon Lee

Classification via Mixtures of Shifted Asymmetric Laplace and Mixtures of Generalized Hyperbolic Distributions
Paul D. McNicholas, Ryan P. Browne, and Brian C. Franczak

Gaussian And Distance Based Clustering In High-Dimensional Space: Differences And Common Aspects
Francesco Palumbo, Cristina Tortora, and Paul McNicholas

Clustering and Dimension Reduction using Non-Gaussian Mixtures
Katherine Morris and Paul McNicholas

Wednesday, July 17
Concurrent Session 8c
Topic: Applications in social sciences
Time: 15:00-16:20
Room: CZ8
Chair: Dean

Comparison of Spatial Clusters between Suicide Data and Its Increase-decrease Rates in Japan
Makoto Tomita, Takafumi Kubota, Fumio Ishioka and Toshiharu Fujita

Detection of Spatial Clusters for High and Low Suicidal Risk Areas in Japan
Takafumi Kubota, Makoto Tomita, Fumio Ishioka, Tomokazu Fujino and Hiroe Tsubaki

Patterns of Cultural Practices and Characteristics of the Cultural Omnivore
Miki Nakai

The Structure Of Subjective Social Status In Japan: An Approach Based On Latent Class Model
Yusuke Kanazawa

Wednesday, July 17
Concurrent Invited Session 8d
Topic: Biplot-based visualisations and classification
Time: 15:00-16:20
Room: CZ9
Organizer and chair: Le Roux

Reference Set Selection for Multivariate Statistical Process Monitoring with Biplots
RF Rossouw, RLJ Coetzer, and NJ Le Roux

PLS Biplot: Another Graphical Tool for Multivariate Data
Opeoluwa V.F. Oyedele and Sugnet Lubbe

Variable Selection for Regression and PLS using Generic Algorithms and Particle Swarm Optimization: A Comparison between the Two Methods
Martin Philip Kidd and Martin Kidd

Classification with Hyperspheres
Morné Lamont

Wednesday, July 17
Concurrent Invited Session 8e
Topic: Combinatorial methods for hierarchical and non-hierarchical clustering
Time: 15:00-16:20
Room: CZ109
Organizer and chair: Brucker

Separation And Convexity Properties Of Hierarchical And Non Hierarchical Clustering
Patrice Bertrand and Jean Diatta

Latticial Approach for Perfect Phylogeny Problems
François Brucker and Pascal Préa

Some Aspects of Formal Concept Analysis in Hierarchical Classification and Data Analysis
Mehdi Kaytoue, Sergei O. Kuznetsov, and Amedeo Napoli

Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras, Fionn Murtagh, and Javier Pereira

Wednesday, July 17
Concurrent Session 8f
Topic: Applications in genetics
Time: 15:00-16:20
Room: CZ114
Chair: Van Deun

Automatic Annotation and Classification of new Papillomavirus genomes
Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo

Different Approaches To Modeling Family Data In GWAS: Application To Cannabis Use
Camelia C. Minica, Conor V. Dolan, Jouke-Jan Hottenga, Dorret I. Boomsma and Jacqueline M. Vink

Utilization Of Machine-Learning Methodologies In Order To Understand Complex Evolutionary And Functional Links Among Bacterial Genomes
Olivier Poiron and Benedicte Lafay

Application of a Bayesian Artificial Neural Network to the Breast Cancer Survival Data
Masoud Salehi and Mahmood Reza Gohari

Wednesday, July 17
Plenary Invited Sessions
Time: 16:50-17:50
Room: CZ115
Chair: McLachlan

Achieving Near-perfect Classification for Functional Data
Peter Hall (and Aurore Delaigle)

Critical Issues and Developments in High-dimensional Prediction with Biomedical Applications

Anne-Laure Boulesteix1

Abstract

The construction of prediction rules based on high-dimensional molecular (“omics”) data in small sample settings has been the focus of abundant literature in computational statistics and bioinformatics in the last decade. Such rules may be used in medical practice, e.g., to predict the clinical outcome of patients based on their transcriptomic, proteomic or metabolomic profile. While the technical issues characterizing the construction of prediction rules in this context have been well investigated in the literature, other related crucial aspects remain comparatively underconsidered. In this talk, I will give an overview of four projects addressing some of these problems.

The focus of the first project is on cross-validation and preliminary steps (such as variable selection, normalization or imputation of missing values) that possibly lead to an underestimation of prediction error if performed globally using both training and test sets. The second project addresses the evaluation and improvement of the clinical usefulness of the derived prediction rules in terms of added predictive value compared to simpler models based on classical clinical predictors. The third project is about the random forest algorithm, often used for regression and classification in bioinformatics, and the statistical properties of its associated variable importance measures. The fourth project deals with methodological aspects of comparison studies based on real-life data sets, with emphasis on testing procedures and power issues.
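
The first point can be illustrated with a small simulation. The following is a minimal sketch, not the speaker's procedure: it assumes pure-noise predictors, a simple mean-difference filter and a nearest-centroid classifier, and all function names are hypothetical. It contrasts variable selection performed once on all data with selection repeated inside each training fold.

```python
import random

def select_features(X, y, rows, k):
    """Return indices of the k features with the largest absolute
    difference in class means, computed from the given rows only."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def predict(X, y, train, feats, x_new):
    """Nearest-centroid prediction restricted to the selected features."""
    sq_dist = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in feats]
        sq_dist[label] = sum((x_new[j] - c) ** 2 for j, c in zip(feats, centroid))
    return min(sq_dist, key=sq_dist.get)

def cv_error(X, y, k, n_folds, select_inside):
    """Cross-validated error rate. Selection is redone on each training
    fold only when select_inside is True; otherwise it uses all data."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    global_feats = select_features(X, y, list(range(n)), k)
    errors = 0
    for test in folds:
        train = [i for i in range(n) if i not in test]
        feats = select_features(X, y, train, k) if select_inside else global_feats
        errors += sum(predict(X, y, train, feats, X[i]) != y[i] for i in test)
    return errors / n

def make_noise_data(n, p, seed):
    """Gaussian noise predictors with labels unrelated to the predictors."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

X, y = make_noise_data(n=40, p=200, seed=1)
print("selection on all data :", cv_error(X, y, 10, 5, select_inside=False))
print("selection inside folds:", cv_error(X, y, 10, 5, select_inside=True))
```

Because the labels carry no signal here, an honest error estimate should be close to 0.5; an estimate well below that under global selection would reflect the optimistic bias described above.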

Computational Molecular Medicine Research Group, Department of Medical Informatics, Biometry and Epidemiology (IBE), University of Munich

Flexible Model Based Clustering via the Cluster-Weighted Approach

Salvatore Ingrassia1

Abstract

Cluster-Weighted Models (CWMs) are a flexible family of mixture models for fitting the joint density of a pair (X, Y) of a response variable Y and a vector of covariates X. Statistical properties are investigated from both theoretical and numerical points of view; in particular, it is shown that CWM includes mixtures of regressions as a special case. Some particular models, based on Gaussian and t distributions as well as on generalized linear models, will be introduced, and properties of the maximum likelihood estimates are presented. Extension to high-dimensional data modeling is finally outlined. Theoretical results are illustrated using some empirical studies, considering both simulated and real data.

Department of Economics and Business, University of Catania (Italy)

Latent Class Models in Marketing: Trading off Classification Certainty and Costs of Data Collection.

Maurits Kaptein1

Abstract

Latent class analysis has long been used in marketing for consumer segmentation (Green, 1976). Often, a large feature set, such as the purchase history of individual customers, is used to identify different segments of customers. Class membership, defining the segments, is subsequently used to target customers. Class membership might, for example, be related to customer susceptibility to distinct promotions, in which case the segments can be used to tailor promotions.

While many classification attempts are done post-hoc, after all the relevant individual-level purchase data have been collected, such data are not always available. Consider a new customer for whom only a limited set of purchases is observed: how should we classify such a customer? We could estimate class membership, with large uncertainty, but classifying too early might lead to the use of suboptimal promotions in future interactions. On the other hand, refraining from tailoring in order to obtain more observations can itself be costly.

The above trade-off raises questions about our application of latent class analysis. The assignment of a promotion should not merely be a function of class membership, but also of our associated uncertainty. We will show that using Randomized Probability Matching (Scott, 2010), a means of optimising the explore-exploit trade-off inherent in uncertain classification, outperforms both early and late classification decisions over the lifetime of a customer.
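
As an illustration of the explore-exploit idea, the following minimal sketch applies Randomized Probability Matching to a simplified setting that is an assumption of this example, not the authors' latent class model: each promotion has an unknown Bernoulli success rate with a uniform Beta(1, 1) prior, and the promotion shown is the one whose posterior draw is largest. The function names and the success rates are hypothetical.

```python
import random

def rpm_choose(successes, failures, rng):
    """Draw one value from each promotion's Beta posterior and return
    the index of the largest draw (randomized probability matching)."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def simulate(true_rates, n_rounds, seed=0):
    """Run the allocation rule against simulated customer responses and
    return the per-promotion success and failure counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = rpm_choose(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

# Hypothetical response rates for two promotions.
wins, losses = simulate([0.1, 0.9], n_rounds=2000)
pulls = [w + l for w, l in zip(wins, losses)]
print("times each promotion was shown:", pulls)
```

The rule never commits irrevocably to one promotion: allocation probabilities track the posterior probability that each promotion is best, so uncertainty about a customer translates directly into exploration.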

References
GREEN, P.E., CARMONE F.J. and WACHPRESS, D.P. (1976): Consumer segmentation via latent class analysis. Journal of Consumer Research, 3, 170-174.
SCOTT, L. (2010): A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Ind., 26, 639–658.

Keywords
CLASSIFICATION, EXPLORE-EXPLOIT TRADEOFF, SEGMENTATION, MARKETING

Tilburg University, Tilburg, the Netherlands

Market Segmentation based on Stated Preferences using Latent Class Models and R

Andrzej Bak1, Aneta Rybicka1, and Marcin Pełka1

Abstract

Market segmentation is understood as the division of consumers into relatively homogeneous groups. Segmentation is carried out on the basis of variables describing consumers, variables describing products, or both sets of variables at the same time. Each resulting group contains consumers for whom the offered products or services have similar utility. Classification methods (cluster analysis), which include latent class analysis models, are often used as segmentation tools. The main aim of the paper is to present selected latent class analysis models and their application to market segmentation based on stated preferences. These models can take into account both types of variables: those describing products or services (e.g. brand, price) and characteristics of the consumers (e.g. demographic and socio-economic variables). In the segmentation procedure, consumer preferences may be used as a criterion for separating homogeneous classes of consumers. The latent class models are estimated using R, its packages and scripts.

References
BEANE T.P., ENNIS D.M. (1987): Market Segmentation: A Review. European Journal of Marketing, vol. 21, no. 5, pp. 20-42.
LINZER D.A., LEWIS J.B. (2011): poLCA: Polytomous variable Latent Class Analysis. R package version 1.3, http://userwww.service.emory.edu/~dlinzer/poLCA.
WEDEL M., DESARBO W.S (1994): A Review of Recent Developments in Latent Class Regression Models, In: R.P. Bagozzi (Ed.), Advanced Methods of Marketing Research, Blackwell, Cambridge.
WEDEL M., KAMAKURA W.A. (2000): Market Segmentation. Conceptual and Methodological Foundations, 2nd ed., Kluwer Academic Publishers, Boston-Dordrecht-London.

Keywords
MARKET SEGMENTATION, PREFERENCES, LATENT CLASS MODELS

1Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland

Multi-layer Cluster Analysis of Brand Switching Among Coffee Brands

Akinori Okada1 and Satoru Yokoyama2

Abstract

Multi-layer cluster analysis assumes that a hierarchical cluster structure consists of several layers, where each upper-layer cluster consists of lower-layer clusters, and classifies objects into clusters at each layer (Okada & Yokoyama, 2011, 2013). Multi-layer cluster analysis has been applied to two-mode two-way data (Okada & Yokoyama, 2011). A brand switching matrix, whose (j, k) element represents the frequency of brand switching from the brand corresponding to row j to the brand corresponding to column k, is analyzed. In the present study, a brand switching matrix, which consists of one-mode two-way data, is transposed and regarded as two-mode two-way data. Coffee brands vary on three attributes: (a) type (regular instant, freeze-dried, and already mixed with sugar and cream or packed in a plastic or paper cup), (b) maker (three companies), and (c) with or without end (flier) when the brand was purchased. The analysis discloses the salience of each attribute in brand switching.

References
ARABIE, P. and HUBERT, L. (1994): Cluster Analysis in Marketing Research. In: R.P. Bagozzi (Ed.): Advanced Methods of Marketing Research. Blackwell, Cambridge, MA, 160-189.
OKADA, A. and YOKOYAMA, S. (2011): Cluster Analysis Based on Multi-layer Structure. Collection of Abstracts IFCS Symposium and GfKl/DAGM Conference Talks. 149.
OKADA, A. and YOKOYAMA, S. (2013): Multi-layer Cluster Analysis of Brand Switching. Proceeding of the 31st Meeting of the Japanese Classification Society.

Keywords
BRAND SWITCHING, CLUSTER ANALYSIS, HIERARCHICAL STRUCTURE, MULTI-LAYER

Graduate School of Management and Information Sciences, Tama University, 4-1-1 Hijirigaoka Tama-shi Tokyo Japan · Department of Business Administration, Faculty of Economics, Teikyo University, 359 Otsuka Hachioji City Tokyo Japan

Polish Households’ Pharmaceutical Expenditures in Years 2010-2020 - Microsimulation Analysis with FARMMES

Agata Zoltaszek, M.A.

Abstract

Healthcare is a key sector of every economy and is of great medical, social, and economic importance to all citizens. In Poland healthcare is mostly publicly funded; however, due to the system’s inefficiency, “out-of-pocket” expenditures have been increasing. Pharmaceutical expenditures account for the largest share and are high enough to limit the accessibility of prescribed and over-the-counter medicine. Therefore, analysing data on current and future private direct expenses on medicine is crucial for the evaluation and improvement of the healthcare system in Poland. The main goal of this paper is to analyze Polish households’ pharmaceutical expenditures in the years 2010-2020. Aggregated values at province and state level are obtained by a microsimulation experiment based on a microsimulation model that has been constructed for the purpose of this research: the Microsimulation Socioeconomic Model of Households’ Pharmaceutical Expenditures in Poland (FARMMES). Outcomes of this research can be used to analyze the distribution of these expenditures and determine the losers and winners of current healthcare policy in Poland, especially by pinpointing the most economically disadvantaged social groups. These results might be useful in evaluating the current healthcare policy and offer some guidelines for new policies in the Polish healthcare system.

References
BARONI, E., RICHIARDI, M. (October 2007): Orcutt’s Vision, 50 years on, Working Paper no. 65.
CAMERON, A.C., TRIVEDI, P.K. (2009): Microeconometrics. Methods and Applications. Cambridge University Press, Cambridge.
ORCUTT, G.H., CALDWELL, S., WERTHEIMER, R.F. (1976): Policy exploration through microanalytic simulation. The Urban Institute, Washington D.C.

Keywords
MICROSIMULATION, HEALTH ECONOMICS, PHARMACEUTICAL EXPENDITURES

Chair of Spatial Econometrics, University of Lodz, Lodz, Poland

Clustering for Aggregated Symbolic Data

Nobuo Shimizu and Junji Nakano

Abstract

Symbolic data (SD) can express “concepts”, which include groups of individuals. Typical SD take intervals, histograms or bar charts as variable values (Billard and Diday, 2006). Symbolic data analysis (SDA) provides techniques, including cluster analysis, for handling such SD. Traditional SD use information about the marginal distributions of the variables in each SD. We consider the case where individuals are divided into naturally defined groups and any descriptive statistics of the groups can be easily calculated. We can express each group by some descriptive statistics, and call them aggregated symbolic data (ASD). ASD can represent information about marginal distributions, such as mean and variance, and also information about the joint distribution, such as correlation coefficients. Hierarchical methods based on several definitions of dissimilarity between traditional SD have been studied (Billard and Diday, 2006). We define a dissimilarity between ASD and use it for hierarchical clustering of ASD. The EM algorithm is often used in model-based clustering for classical data (Everitt et al., 2011). We also investigate a clustering method based on the Gaussian mixture model in the ASD framework. We derive a simplified EM algorithm for clustering ASD by using the mean and variance of each variable and the covariance among variables in ASD. We apply our method to artificial and real data examples.
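
To make the notion of aggregated symbolic data concrete, the sketch below reduces a group of bivariate observations to the kind of descriptive statistics mentioned above (size, means, variances, correlation). It is only an illustration under stated assumptions: the function name and the dictionary layout are hypothetical, population (divide-by-n) variances are used, and both variables are assumed to have non-zero variance within the group.

```python
import math

def aggregate(group):
    """Summarise a group of (x, y) observations by its size, means,
    population variances and correlation coefficient."""
    n = len(group)
    mean_x = sum(x for x, _ in group) / n
    mean_y = sum(y for _, y in group) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in group) / n
    var_y = sum((y - mean_y) ** 2 for _, y in group) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in group) / n
    # Assumes var_x and var_y are strictly positive.
    corr = cov / math.sqrt(var_x * var_y)
    return {"n": n, "mean": (mean_x, mean_y), "var": (var_x, var_y), "corr": corr}

# Two hypothetical groups of individuals, each reduced to one ASD record.
groups = {
    "A": [(1.0, 2.0), (3.0, 6.0)],
    "B": [(0.0, 5.0), (2.0, 1.0), (4.0, 3.0)],
}
asd = {name: aggregate(obs) for name, obs in groups.items()}
for name, record in asd.items():
    print(name, record)
```

Each record keeps joint information (the correlation) that a purely marginal description such as a pair of histograms would lose.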

References
BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, West Sussex.
EVERITT, B. S., LANDAU, S., LEESE, M. and STAHL, D. (2011): Cluster Analysis (5th Edition). Wiley, West Sussex.

Keywords
DISSIMILARITY, EM ALGORITHM, GAUSSIAN MIXTURE MODEL, SYMBOLIC DATA ANALYSIS

The Institute of Statistical Mathematics, Tokyo, Japan {nobuo,nakanoj}@ism.ac.jp

Factor Analysis of Distributional Data using Quantiles

Rosanna Verde1 and Antonio Irpino1

Abstract

Distributional data are multi-valued weighted descriptions of a collection of measurements, where each unit is described by an empirical distribution for a particular quantitative attribute. Symbolic Data Analysis (SDA) provides tools for the statistical treatment of multi-valued data. When the number of variables increases, dimension reduction techniques are useful for extracting patterns from data. The best-known dimension reduction technique for quantitative data is Principal Component Analysis (PCA). In the SDA literature, several PCA techniques for data described by histograms of values have been proposed. These PCAs do not directly consider association measures between histogram variables, but rather relationships between some particular features of the histograms (the means, or only the vector of observed empirical frequencies). Starting from a new association measure for distributional variables based on the squared Wasserstein distance, we propose a new PCA for distributional data, which solves the problem of working only on partial information about distributional variables and furnishes new tools for interpreting the results of the proposed technique.
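
The squared Wasserstein distance underlying the association measure compares two distributions through their quantile functions. As a minimal sketch, assuming two empirical samples of equal size (so that the sorted values play the role of matched quantiles), it reduces to the mean squared difference of the sorted samples. The function name is hypothetical and this is not the authors' association measure itself.

```python
def squared_wasserstein(x, y):
    """Squared L2 Wasserstein distance between two equal-size empirical
    samples: the mean squared difference between matched order statistics."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    xs, ys = sorted(x), sorted(y)
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)

# A pure location shift by 2 gives a squared distance of 2**2 = 4.
print(squared_wasserstein([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]))  # 4.0
```

Because the distance is computed on quantiles, it reacts to differences in location, spread and shape, not only to differences in the means.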

References
BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data: Exploratory methods for extracting statistical information from complex data. Springer-Verlag.
MAKOSSO-KALLYTH, S. and DIDAY, E. (2012): Adaptation of interval PCA to symbolic histogram variables. ADAC 6 (2), 147-159.
RÜSCHENDORF, L. (2001): Wasserstein metric, in: M. Hazewinkel, Encyclopedia of Mathematics, Springer.
VERDE, R. and IRPINO, A. (2008): Comparing Histogram Data Using a Mahalanobis–Wasserstein Distance. In: COMPSTAT 2008, Physica-Verlag 77-89.

Keywords
DISTRIBUTIONAL DATA, SYMBOLIC DATA, FACTOR ANALYSIS, WASSERSTEIN DISTANCE

Department of Political Studies “J. Monnet”, Second University of Naples, Viale Ellittico 31, Caserta Italy {rosanna.verde,antonio.irpino}@unina2.it

A Hierarchical Clustering Algorithm applied to Modal Ordinal Symbolic Data

Carmen Bravo1 and José M. García-Santesmases2

Abstract

A general ϕ function that characterizes a consensus measure is defined for probability distributions over a set of ordinal categories. This measure is extended to sets of modal ordinal symbolic data objects. A dissimilarity measure between two of these sets, based on the consensus variability of their centroids, is defined. The Leik measure, being suitable for any ordinal scale, is shown to be a ϕ function.

An ascending hierarchical clustering algorithm is applied to modal ordinal data. The criterion to be minimized in each step is based on the decrease of the variability measure of a partition when two of its members are joined. This decrease is shown to be proportional to the already defined dissimilarity between the two clusters joined. Some criteria to measure the quality of partitions and clusters are given.
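
For orientation, the sketch below computes Leik's ordinal dispersion as it is commonly stated: with cumulative relative frequencies F_i over m ordered categories, d_i = F_i if F_i <= 0.5 and 1 - F_i otherwise, and D = 2 * sum(d_i) / (m - 1); consensus is then taken as 1 - D. This formulation and the function names are assumptions of the example, and the code does not reproduce the authors' general ϕ function.

```python
def leik_dispersion(freqs):
    """Leik's ordinal dispersion D in [0, 1] for category counts given in
    scale order; 0 means full agreement, 1 means maximal polarization."""
    total = sum(freqs)
    m = len(freqs)
    cumulative = 0.0
    d_sum = 0.0
    for f in freqs:
        cumulative += f / total
        d_sum += cumulative if cumulative <= 0.5 else 1.0 - cumulative
    return 2.0 * d_sum / (m - 1)

def leik_consensus(freqs):
    """Consensus as the complement of the dispersion."""
    return 1.0 - leik_dispersion(freqs)

# Ratings on the scale poor, average, good, excellent (hypothetical counts).
print(leik_consensus([10, 0, 0, 0]))  # everyone agrees
print(leik_consensus([5, 0, 0, 5]))   # ratings split between the extremes
```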

To illustrate the proposed method we apply it to a data set composed of 34 teachersthat were rated by their students (1350) on 12 items on the ordinal scale: poor, average,good, excellent. Teachers are described by modal ordinal symbolic data. Interpretationsof clusters regarding relevant issues are shown.

References
BOCK, H.H., DIDAY, E. (Eds.) (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, Heidelberg.
GARCIA-SANTESMASES, J.M., BRAVO, M.C. (2010): Consensus Analysis through Modal Symbolic Objects. In: Compstat 2010 proceedings. Springer, ISBN 978-3-7908-2603-6, 1055–1062.
LEIK, K.R. (1966): A Measure of Ordinal Consensus. The Pacific Sociological Review, 9, 85–90.

Keywords
MODAL ORDINAL SYMBOLIC DATA, SYMBOLIC CONSENSUS MEASURE, SYMBOLIC HIERARCHICAL CLUSTERING

Universidad Complutense de Madrid, Servicio Informático de Apoyo al Usuario - Investigación, Vicerrectorado de Alumnos, 28040 Madrid · Universidad Complutense de Madrid, Facultad de Ciencias Matemáticas, Dpto. Estadística e Investigación Operativa, 28040 Madrid

Constrained Clustering of Temporal Beanplot Data

Carlo Drago1

Abstract

The explosion of Big Data in recent years has created relevant problems in data management and an urgent need for new methods. Data aggregation leads to information loss, so new approaches are needed to handle data in a suitable way. The SDA approach considers symbolic data (i.e. interval, boxplot or histogram data), which take into account the internal data structure without aggregation. In this sense, our proposal is to use beanplots to capture the variation within a specific observation. The beanplots are obtained by means of a kernel density estimate, which makes it possible to represent the original data and show their relevant features. In the temporal framework, we consider beanplot time series, i.e. ordered sequences of beanplots over time. The beanplot data can be parameterized by means of mixture distribution models to retain the relevant structural information. In particular, the obtained parameters can be used in clustering and in forecasting. An important element is the possibility of also taking into account the fit of the different models obtained in the analysis. In this work we present a new clustering approach for beanplot data which takes into account constraints over time. The obtained clusters make it possible to identify homogeneous temporal periods, which can be used in applied contexts.
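
The density trace of a beanplot is a kernel density estimate of the observations falling in one time period. The following minimal sketch assumes a Gaussian kernel and a fixed, user-chosen bandwidth; the function name and the sample values are hypothetical, and bandwidth selection as well as the mixture parameterization mentioned above are left out.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function evaluating the Gaussian kernel density estimate
    of `sample` with the given bandwidth."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Hypothetical intra-period observations for one beanplot.
period_values = [0.0, 1.0, 2.0]
density = gaussian_kde(period_values, bandwidth=0.5)
for x in (0.0, 1.0, 2.0, 3.0):
    print(x, round(density(x), 4))
```

One such density per period, evaluated on a common grid, gives the sequence of curves that a beanplot time series consists of.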

References
DIDAY, E., and NOIRHOMME-FRAITURE, M. (Eds.). (2008). Symbolic Data Analysis and the SODAS software (pp. 1-457). J. Wiley & Sons.
DRAGO, C. and LAURO, C. and SCEPI, G. (2011): Beanplot Data Analysis in a Temporal Framework, presented at CLADAG, September 7-9 2011 Pavia.

Keywords
SYMBOLIC DATA ANALYSIS, BEANPLOT, CONSTRAINED CLUSTERING, TIME SERIES

University of Napoli Federico II, Department of Economic and Statistical [email protected]

Gender Gap: towards a Measurement with Chain Graphical Models

Federica Nicolussi1 and Fulvia Mecatti2

Abstract

Recent gender literature shows a growing demand for sound statistical methods for measuring any gender gap, apt to capture its complexity and to embed the pattern of relationships among a collection of observable variables selected in order to disentangle its latent trait. This paper focuses on parametric Hierarchical Marginal Models (Bartolucci, Colombi and Forcina, 2007), which apply to binary and categorical data, as a particularly useful tool for gender studies. We explore the potential of Chain Graphical Models (Drton, 2009) in the presence of both directed and undirected arcs, while excluding directed and semi-directed cycles. These specific model features allow for representing conditional independence as well as shaping both symmetrical-associational and causal relationships in the dataset. It will be shown how, by comparing the two distinct graphical models referring to each gender, any difference displayed in the conditional independence structure can be interpreted as a gender gap indicator. Preliminary results from a recent survey on the issue of sexual harassment, supported by the Committee on Equal Opportunities of the University of Milano-Bicocca, will be illustrated. The survey, a first-ever attempt to collect primary data on this sensitive matter, was conducted in July 2012 at the university site. It reached a quite high response rate and produced an unexpectedly large participation (sample), including all levels of students, professors and staff.

References
BARTOLUCCI, F., COLOMBI, R., and FORCINA, A. (2007): An extended class of marginal link functions for modelling contingency tables by equality and inequality constraints. Statistica Sinica, 17(2), 691.
DRTON, M. (2009): Discrete chain graph models. Bernoulli, 15(3), 736–753.
MECATTI, F., CRIPPA, F. and FARINA, P. (2012): A Special Gen(d)er of Statistics: Roots, Development and Methodological prospects of a Gender Statistics. International Statistical Review, 80, 452–467.
NICOLUSSI, F. (2013): Marginal Parametrizations for independence models and graphical models for categorical data. PhD Thesis.

Keywords
CONDITIONAL INDEPENDENCIES, GENDER STATISTICS, MARGINAL MODELS, MARKOV PROPERTIES.

University of Milano-Bicocca, Piazza dell’Ateneo Nuovo, 1- 20126, [email protected];[email protected]

Time To Graduation: Does Gender Make A Difference? An Analysis Of A Greek University

Adele H. Marshall1, Aglaia Kalamatianou2 and Mariangela Zenga3

Abstract

In the Greek university system, graduation happens after a time threshold, but students can graduate at any time after this threshold without a time limit. In such cases studies may last for a long time, and the corresponding distribution may have a long right tail that never reaches the time axis, leading to a group of perpetual students (Kalamatianou and McClean, 2003). The aim of this paper is to analyse students’ progression to graduation in order to estimate the influence of various factors on the probability that students with certain characteristics will progress successfully towards their degree or still be enrolled at the end of the observation period. We propose to use Coxian phase-type distributions (Cox and Miller, 1965) to model the time to graduation of students enrolled at a Greek public university, paying attention to the subpopulations of male and female students.
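
A Coxian phase-type distribution describes the time until absorption of a process that moves through ordered transient phases, leaving each phase either for the next one or directly for the absorbing state. The sketch below computes its mean and draws one duration; it is a minimal illustration in which the rates are invented, the function names are hypothetical, and absorption is read as graduation. It does not reproduce the authors' fitted model.

```python
import random

def coxian_mean(lambdas, mus):
    """Mean time to absorption. lambdas[i] is the rate from phase i to
    phase i+1 (one fewer than the number of phases); mus[i] is the rate
    from phase i to the absorbing state."""
    if len(lambdas) != len(mus) - 1:
        raise ValueError("need one fewer transition rate than phases")
    reach_prob, mean = 1.0, 0.0
    for i, mu in enumerate(mus):
        lam = lambdas[i] if i < len(lambdas) else 0.0
        total = lam + mu
        mean += reach_prob / total      # expected sojourn, weighted by reach probability
        reach_prob *= lam / total       # probability of moving on to the next phase
    return mean

def coxian_sample(lambdas, mus, rng):
    """Draw one time to absorption by simulating the phases in order."""
    t = 0.0
    for i, mu in enumerate(mus):
        lam = lambdas[i] if i < len(lambdas) else 0.0
        total = lam + mu
        t += rng.expovariate(total)
        if rng.random() < mu / total:   # absorbed from this phase
            break
    return t

# Hypothetical two-phase example: an early phase and a slower late phase.
print(coxian_mean([1.0], [1.0, 1.0]))  # 1.0
print(coxian_sample([1.0], [1.0, 1.0], random.Random(0)))
```

A late phase with a small absorption rate produces the long right tail associated with perpetual students.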

References
KALAMATIANOU, A. and McCLEAN, S. (2003): The Perpetual Student: Modeling Duration of Undergraduate Studies Based on Lifetime-Type Educational Data. Lifetime Data Analysis, 9, 311–330.
COX, D.R. and MILLER, H.D. (1965): The theory of stochastic processes. Chapman, London.

Keywords
SURVIVAL ANALYSIS, TIME TO GRADUATION, COXIAN PHASE TYPE DISTRIBUTION

Department of Sociology, Panteion University of Athens · Centre for Statistical Science and Operational Research (CenSSOR), Queen’s University of Belfast · Department of Statistics and Quantitative Methods, University of Milano-Bicocca

Beyond indicators: a Causal Approach to Gender Statistics

Silvia Caligaris1 and Fulvia Mecatti2

Abstract

Most of the gender statistical measures proposed in the last decades are composite indicators, i.e. weighted linear combinations of basic statistics such as ratios and percentages. Composite indicators therefore involve several arbitrary choices (for instance the weighting and aggregating system, variable selection, and standardization) that affect both the transparency and the interpretation of the indexes. Furthermore, gender inequality is a complex latent phenomenon, a collection of disparate and inter-linked issues that can hardly be captured in a single indicator. The development of statistical tools and ad hoc models is therefore required. The aim of this work is to explore the potential of graphical models as a language able to represent clearly the complex arrangement of relationships among a collection of variables selected for the statistical assessment of gender disparities. The causal approach, as traditionally applied in genetics and epidemiology, will be adopted. We will focus on causal graphs, which make it possible to examine in depth and interpret the causal mechanism that may have originated a gender gap, as well as to explore the effects of gender-tailored policies. Causal models indeed provide transparent mathematical tools to implement the assumptions underlying any causal inference, translating them into joint distributions and reading off the conditional independences according to the d-separation criterion (Pearl, 2000). The potential of this methodology will be shown in deriving causal effects in non-experimental studies, representing policies’ effects and interventions through the do operator, controlling for confounders and interpreting counterfactuals.
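
Controlling for a confounder through the do operator can be illustrated with the back-door adjustment formula, P(Y = 1 | do(X = x)) = sum over z of P(Y = 1 | X = x, Z = z) P(Z = z). The sketch below assumes a single observed binary confounder Z that satisfies the back-door criterion and that every (x, z) combination occurs in the data; the records are invented and the function names are hypothetical.

```python
from collections import Counter

def backdoor_effect(records, x_value):
    """Estimate P(Y=1 | do(X=x_value)) by averaging the stratum-specific
    rates P(Y=1 | X=x_value, Z=z) over the marginal distribution of Z.
    Each record is a (z, x, y) triple with binary entries."""
    n = len(records)
    z_counts = Counter(z for z, _, _ in records)
    total = 0.0
    for z, n_z in z_counts.items():
        stratum = [y for zz, x, y in records if zz == z and x == x_value]
        total += (sum(stratum) / len(stratum)) * (n_z / n)
    return total

def naive_rate(records, x_value):
    """Unadjusted conditional rate P(Y=1 | X=x_value)."""
    ys = [y for _, x, y in records if x == x_value]
    return sum(ys) / len(ys)

# Invented data in which Z influences both X and Y.
records = (
    [(0, 0, 1)] * 1 + [(0, 0, 0)] * 3 + [(0, 1, 1)] * 1 + [(0, 1, 0)] * 1
    + [(1, 0, 1)] * 1 + [(1, 0, 0)] * 1 + [(1, 1, 1)] * 3 + [(1, 1, 0)] * 1
)
adjusted = backdoor_effect(records, 1) - backdoor_effect(records, 0)
naive = naive_rate(records, 1) - naive_rate(records, 0)
print("adjusted effect:", adjusted)
print("naive difference:", naive)
```

In these invented data the naive difference overstates the adjusted effect, because Z raises both the chance of X = 1 and the chance of Y = 1.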

References
CALIGARIS, S., MECATTI, F., CRIPPA, F. (2013): A Narrower Perspective? From a Global to a Developed-Countries Gender Gap Index: a Gender Statistics Exercise. Statistica, special issue on gender studies, in press.
MECATTI, F., CRIPPA, F., FARINA, P. (2012): A Special Gen(d)er of Statistics: Roots, Development and Methodological prospects of a Gender Statistics. International Statistical Review, 80, 452–467.
PEARL, J. (2000): Causality: Models, Reasoning and Inference. Cambridge University Press, New York.

Keywords
CAUSAL MODELS, CONDITIONAL INDEPENDENCES, d-SEPARATION CRITERION, GENDER GAP INDEXES

University of Milano-Bicocca, Piazza dell’Ateneo Nuovo, 1- 20126, [email protected]; [email protected]

Gender Differentials In Higher Education: Hints From A Fuzzy States Analysis

Franca Crippa1, Marcella Mazzoleni2 and Mariangela Zenga2

Abstract

Higher education (HE) persistence has recently shown a turn in favour of the female population, which graduates more often within the expected timeframe and is less exposed to dropout than males (OECD, 2012). This shift from a past situation of generalized male predominance in HE to the present female outperformance appears as an inversion in differentials by gender, whose intensity varies according to HE choices as well as to the career stage. Whilst individual or institutional determinants have been widely considered, mainly in terms of overall attainments, HE students’ intermediate results and strategies have received less attention.

The paper examines a methodological alternative to the existing measures of gender differentials in coping with undergraduate university requirements. In particular, Markov chains with fuzzy states are applied so as to highlight paths and to derive synthetic indicators of gender gaps, whatever the direction of the latter might be. These indicators can both be repeated over time and give insight into undergraduates’ choices and strategies at specific time points.

References
OECD (2012): Gender Equality in Education, Employment and Entrepreneurship: Final Report to the MCM 2012, Paris, 23-24 May 2012, http://www.oecd.org/social/family/50423364
KALAMATIANOU, A.G. and KOPUGIOUMOUTZAKI, F. (2012): Employment Status and Job-Studies Relevance of Social Science. International Journal of Economic Sciences and Applied Research, 1, 51–75.
SYMEONAKI, M. and STAMOUB, G.B. (2004): Theory of Markov systems with fuzzy states, Fuzzy Sets and Systems 3, 427–445.

Keywords
DIFFERENTIAL, GENDER, MARKOV CHAINS, FUZZY STATES

Department of Psychology, University of Milano-Bicocca, piazza dell’Ateneo Nuovo, 1 Milan, Italy · Department of Statistics and Quantitative Methods, University of Milano-Bicocca, via Bicocca degli Arcimboldi, 8, Milan, Italy

Analysing Categorical Variables With Similar Categories: Constrained Multiple Correspondence Analysis

Véronique Cariou and El Mostafa Qannari

Abstract

Multiple Correspondence Analysis (MCA) aims at analysing a categorical data table by exhibiting a small set of axes (also called scores). These are built in order to maximize the sum of their squared correlation ratios with the different categorical variables. Let us consider $K$ categorical variables, where $\aleph_k$ is the $k$th one. If we represent each variable $\aleph_k$ with its indicator matrix $X_k$, the first MCA component $t$ maximizes $\sum_k \mathrm{corr}^2(t, A_k t)$, where $A_k$ is the projector associated with $X_k$: $A_k = X_k (X_k^T X_k)^{-1} X_k^T$.

Constrained MCA introduces a new constraint on MCA in order to explore and visualize categorical variables having the same set of categories. This kind of data may occur in applications such as sensory analysis and Just About Right data. Constrained MCA assumes that the $K$ dummy data tables (or alternatively indicator matrices) share common loadings. It proceeds step by step by computing at each step the common vector of loadings and the common components. Formally, we seek at each step a component $t$ and a common vector of loadings $u$ which maximize the same criterion above, with $t = Vu$ and where $V = \sum_k \alpha_k X_k$ is the optimal linear combination of the different indicator matrices. The solution of this maximization problem is simple. It consists in an iterative algorithm in the course of which $\alpha$ and $u$ are alternately updated.

The method of analysis is illustrated on the basis of a case study.
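The MCA criterion can be made concrete with a small sketch. Because projecting a score vector onto the column space of an indicator matrix replaces each score by its category mean, the squared correlation between a component and its projection equals the correlation ratio (between-category over total sum of squares). The snippet below only evaluates the criterion for a given component; it does not implement the constrained algorithm, and the function names and toy data are illustrative assumptions.

```python
from collections import defaultdict


def correlation_ratio(t, labels):
    """Squared correlation ratio between scores t and one categorical variable."""
    n = len(t)
    mean = sum(t) / n
    total_ss = sum((x - mean) ** 2 for x in t)
    groups = defaultdict(list)
    for x, g in zip(t, labels):
        groups[g].append(x)
    # Between-category sum of squares: what the projector onto the
    # indicator matrix retains of the centred scores.
    between_ss = sum(
        len(v) * (sum(v) / len(v) - mean) ** 2 for v in groups.values()
    )
    return between_ss / total_ss


def mca_criterion(t, variables):
    """Sum over the categorical variables of the squared correlation ratio."""
    return sum(correlation_ratio(t, labels) for labels in variables)


# Hypothetical toy data: four observations, two categorical variables.
t = [1.0, 2.0, 3.0, 4.0]
variables = [["a", "a", "b", "b"], ["x", "y", "x", "y"]]
criterion = mca_criterion(t, variables)
```

In this toy case the first variable separates low from high scores well and therefore contributes most of the criterion.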

References

GREENACRE, M. and BLASIUS, J. (2006): Multiple Correspondence Analysis and Related Methods. Chapman and Hall. CRC Press.

Keywords

MULTIPLE CORRESPONDENCE ANALYSIS, SENSORY DATA

UNAM University, ONIRIS, USC “Sensometrics and Chemometrics Laboratory”, Nantes, F-44322, France. INRA, Nantes, F-44316, France. [email protected]; [email protected]


Constrained Dual Scaling of Successive Categories for Detecting Response Styles

Pieter C. Schoonees1,2, Michel van de Velden1, and Patrick J.F. Groenen1

Abstract

Dual scaling is a multivariate exploratory method equivalent to correspondence analysis for analyzing contingency tables. However, for Likert-scale data collected from surveys with multiple questions, it is shown here that a peculiarity of dual scaling can be exploited to detect differences in response styles.

Response styles arise in questionnaire research when respondents tend to use rating scales in a manner unrelated to the actual content of the survey questions, often biasing results. Interpreting a response style as a nonlinear mapping of a group of respondents’ latent preferences to a rating scale allows for four main types of response styles to be modeled by quadratic monotone splines. Using this and the link between dual scaling and correspondence analysis, a spline-based constrained version of dual scaling is devised which can detect the presence of the four main types of response styles.

The method is based on an optimality criterion which is subsequently extended to allow for multiple response styles. A computationally intensive alternating nonnegative least squares algorithm is devised for estimating the parameters, which include latent classes for group membership. It is shown how the method can be used to create a data set in which the effects of response styles have been removed. The impact of this purging of response styles on the results from typical analyses of ratings data is illustrated.

Keywords

MONOTONE SPLINES, NONNEGATIVE LEAST SQUARES, CORRESPONDENCE ANALYSIS

Econometric Institute, Erasmus University Rotterdam, PO Box 1738, 3000 DR Rotterdam, The Netherlands · [email protected]


ORTHOMALS: Orthogonal Projection Of A Multiple Correspondence Solution On A Design Space

Ralph C.A. Rippe1 and Willem J. Heiser2

Abstract

Multiple correspondence analysis (MCA or HOMALS) (Gifi, 1990) aims to find homogeneous groups over more than two nominal variables. However, interpretability of the solution suffers strongly when the data matrix has structural omissions, that is, when the missings are by design. An unknown number of primary dimensions are solely determined by the structural incompleteness, instead of delivering substantive information.

ORTHOMALS adapts the original multiple correspondence algorithm by restricting the solution to be orthogonal to the design space in each iteration. The design space can be obtained from e.g. OVERALS. The main focus in this work is on the orthogonality restriction, not on obtaining the design space.

We show through a simulation study with different levels of incompleteness that accurate correspondence recovery is obtained in situations with up to 80% incomplete data. Its recovery behavior is, however, not linear. We observe an initial decrease of recovery with increasing incompleteness, while with further increases of incompleteness we see increasing recovery.

The new algorithm is applied to the assessment of mathematical problem solving skills in primary school children. More specifically, we use the mathematical division strategy data of CITO PPON 2004, resulting in a solution that is similar (in the first two dimensions) to that in the PPON 1997; Realistic and Traditional strategies were still combined with lacking or faulty strategies, whereas the Realistic and Traditional combination of strategies seldom occurs.

References

Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.

Keywords

MULTIPLE CORRESPONDENCE, ORTHOGONALITY, RESTRICTION, PROJECTION, INCOMPLETE

Leiden University, Inst. of Educ. & Child [email protected] · Leiden University, Institute of [email protected]


Squared Covariances Or Chi-Squared Statistics Based Distances

Antoine de Falguerolles

Abstract

In a 2 by 2 contingency table, Pearson’s Chi-squared statistic for independence is equal, up to the sample size, to the squared Pearson’s correlation between two binary quantitative variables obtained by coding the levels with arbitrary numerical values. The squaring achieves here a limited form of invariance which may be of interest for some multivariate analyses. In multidimensional scaling or in clustering, the emphasis may be on distances based on the magnitudes of measures of co-variation regardless of their signs. A related issue is that of positive semi-definiteness of the matrix of these measures of co-variation, a property central to visualization techniques such as PCA or metric MDS.
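The identity for the 2 by 2 table can be checked numerically. The following sketch computes Pearson's Chi-squared statistic from the cell counts and compares it with the sample size times the squared correlation of the two coded binary variables; the cell counts and the numerical codes are arbitrary illustrative choices.

```python
def chi2_2x2(n00, n01, n10, n11):
    """Pearson's Chi-squared statistic for independence in a 2 by 2 table."""
    n = n00 + n01 + n10 + n11
    numerator = n * (n00 * n11 - n01 * n10) ** 2
    denominator = (n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)
    return numerator / denominator


def squared_correlation(x, y):
    """Squared Pearson correlation between two numerical sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical cell counts and arbitrary numerical codes for the two levels.
counts = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 40}
row_codes = (3.0, 7.0)
col_codes = (-1.0, 10.0)

x, y = [], []
for (i, j), c in counts.items():
    x += [row_codes[i]] * c
    y += [col_codes[j]] * c

chi2 = chi2_2x2(10, 20, 30, 40)
n_times_r2 = len(x) * squared_correlation(x, y)
```

Changing `row_codes` or `col_codes` to any other pair of distinct values leaves `n_times_r2` unchanged, which is the invariance the squaring provides.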

In this presentation, I shall advocate the use of squared covariances or squared correlations between any two quantitative variables. It turns out that the non-negative matrix thus formed is positive semi-definite, a property also shared by the matrix of squared conditional (or partial) correlations.
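One standard way to see why entrywise squaring preserves positive semi-definiteness is the Schur product theorem. The sketch below is offered as background and is not necessarily the argument used in the presentation. Writing $R$ for a correlation (or covariance) matrix with spectral decomposition $R = \sum_k \lambda_k v_k v_k^T$, $\lambda_k \ge 0$, and $\circ$ for the entrywise (Hadamard) product:

$$
R \circ R = \left( r_{ij}^2 \right)_{i,j} = \sum_{k,l} \lambda_k \lambda_l \, (v_k \circ v_l)(v_k \circ v_l)^T \succeq 0 .
$$

Each term on the right is a non-negative multiple of a rank-one positive semi-definite matrix, so the sum is positive semi-definite.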

I shall also consider the case of general multi-way tables. In line with the result above, the matrix of Pearson’s Chi-squared statistics of independence of all marginal two-way tables is positive semi-definite. But the case of conditional independence is less straightforward (see references below). I shall advocate here the use of the matrix of Chi-squared statistics of independence between any two variables given all other variables which, in most applications, turns out to be positive semi-definite.

References

SAPORTA, G. (1976): Quelques applications des opérateurs d’Escoufier au traitement des variables qualitatives, Statistique et analyse des données, t.1, 38-46.

DAUDIN, J.-J. (1979): Coefficient de Tschuprow partiel et indépendance conditionnelle, Statistique et analyse des données, t.3, 55-58.

Keywords

COVARIANCE, CONDITIONAL COVARIANCES, PEARSON’S CHI-SQUARED STATISTICS, DISTANCE

Université de Toulouse III (Retired), [email protected]


A New Constant Memory Recursion For Hidden Markov Models

Francesco Bartolucci1 and Silvia Pandolfi2

Abstract

In this work, we develop the recursion for hidden Markov models proposed by Bartolucci and Besag (2002) and we show how it may be employed to implement an estimation algorithm for these models which requires an amount of memory not depending on the length of the observed series of data. This recursion allows us to obtain the conditional distribution of the latent state at every occasion, given the previous state and the observed data. With respect to the estimation algorithm based on the well-known Baum-Welch recursions (Baum et al., 1970; Welch, 2003), which requires an amount of memory that increases with the sample size, the proposed algorithm also has the advantage of not requiring dummy renormalizations to avoid numerical problems. Moreover, it directly allows us to perform global decoding of the latent sequence of states, without the need of a Viterbi method and with a consistent reduction of the memory requirement with respect to the latter.
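For context, the renormalization that the abstract refers to can be illustrated with the textbook scaled forward recursion. This is not the Bartolucci and Besag recursion, whose details are not given here; it is a minimal sketch of the standard approach, with hypothetical parameter values. The forward pass alone needs only the current vector of state probabilities, but a full Baum-Welch step also needs a backward pass over stored quantities, which is where memory grows with the length of the series.

```python
import math


def forward_filter(initial, transition, emission, observations):
    """Scaled forward recursion for a discrete hidden Markov model.

    Returns the log-likelihood of the observations and the filtered state
    distribution at the last time point. Rescaling at every step avoids
    numerical underflow on long series.
    """
    n_states = len(initial)
    loglik = 0.0
    alpha = None
    for t, obs in enumerate(observations):
        if t == 0:
            alpha = [initial[s] * emission[s][obs] for s in range(n_states)]
        else:
            alpha = [
                sum(alpha[r] * transition[r][s] for r in range(n_states))
                * emission[s][obs]
                for s in range(n_states)
            ]
        scale = sum(alpha)          # probability of obs given the past
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik, alpha


# Hypothetical two-state model with binary observations.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.3, 0.7]]
emission = [[0.9, 0.1], [0.2, 0.8]]
loglik, filtered = forward_filter(initial, transition, emission, [0, 1, 1])
```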

References

BARTOLUCCI, F. and BESAG, J. (2002). A recursive algorithm for Markov random fields. Biometrika, 89, 724-730.

BAUM, L. E., PETRIE, T., SOULES, G., and WEISS, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171.

WELCH, L. R. (2003). Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53:1–13.

Keywords

EXPECTATION-MAXIMIZATION ALGORITHM, FORWARD-BACKWARD RECURSIONS, GLOBAL DECODING, HIDDEN MARKOV MODELS, VITERBI ALGORITHM

Department of Economics, Finance and Statistics, University of Perugia (IT) [email protected] · Department of Economics, Finance and Statistics, University of Perugia (IT) [email protected]


Detecting Local Dependence In Binary Data Latent Class Models: Some Developments

Daniël Oberski

Abstract

Binary data latent class models crucially assume local independence, violations of which can seriously bias the results. Monitoring possible local dependencies is therefore vital. I present three tools for detecting local dependence after fitting a latent class model: the bivariate Pearson residual, the score test, and the expected parameter change, and note the relationships between these measures. Some recent work on detecting local dependence is discussed and an application to published data is presented.
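Of the three tools, the bivariate Pearson residual is the simplest to sketch. In one common form it is the Pearson Chi-squared distance between the observed 2 by 2 table for a pair of items and the table expected under the fitted latent class model with local independence. The code below assumes that form (software packages may rescale it, for example by degrees of freedom), and the parameter values are hypothetical.

```python
def expected_pair_table(class_sizes, prob_j, prob_k, n):
    """Expected 2x2 counts for items j and k under local independence.

    class_sizes[c] is the size of class c; prob_j[c] and prob_k[c] are the
    probabilities of a positive response to each item within class c.
    """
    table = [[0.0, 0.0], [0.0, 0.0]]
    for pi_c, pj, pk in zip(class_sizes, prob_j, prob_k):
        for a in (0, 1):
            for b in (0, 1):
                p_a = pj if a == 1 else 1 - pj
                p_b = pk if b == 1 else 1 - pk
                table[a][b] += n * pi_c * p_a * p_b
    return table


def bivariate_residual(pairs, class_sizes, prob_j, prob_k):
    """Pearson Chi-squared distance between observed and expected 2x2 tables."""
    observed = [[0, 0], [0, 0]]
    for a, b in pairs:
        observed[a][b] += 1
    expected = expected_pair_table(class_sizes, prob_j, prob_k, len(pairs))
    return sum(
        (observed[a][b] - expected[a][b]) ** 2 / expected[a][b]
        for a in (0, 1)
        for b in (0, 1)
    )


# Hypothetical data: two items that always agree, evaluated against a
# one-class model, which cannot account for the association.
pairs = [(0, 0)] * 50 + [(1, 1)] * 50
bvr = bivariate_residual(pairs, [1.0], [0.5], [0.5])
```

A large value for a pair of items flags residual association that the latent classes do not explain.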

References

OBERSKI, D., VAN KOLLENBURG, G., AND VERMUNT, J. (submitted). A Monte Carlo evaluation of three methods to detect local dependence in binary data latent class models.

OBERSKI, D. AND VERMUNT, J. (submitted). The Expected Parameter Change (EPC) for local dependence assessment in binary data latent class models.

OBERSKI, D. (submitted). Change in SEM parameters of interest as a criterion for partial measurement invariance: The EPC-interest.

Keywords

LOCAL INDEPENDENCE; FINITE MIXTURE MODEL; SCORE TEST; GENERALIZED SCORE

Department of Methodology and Statistics, Tilburg University, The [email protected]


Power and Sample Size Determination for Latent Class Models

Dereje W. Gudicha, Jeroen K. Vermunt, and Fetene B. Tekle1

Abstract

Latent class (LC) models are most frequently used by social, behavioral, and medical science researchers, for example, to build latent subgroups based on data from multivariate categorical variables, to classify cases to their most likely latent classes, to analyze agreement data from different raters, and to evaluate the sensitivity and specificity of diagnostic tests for which a gold standard is not available. Despite such attractive applications and their increasing popularity in widely diverging research areas, little is known about statistical power and sample size for LC models. The objectives of this paper are twofold. First, a Wald-based power analysis method for parameters that describe a relationship between an indicator and a categorical latent variable is proposed. Second, the design factors that affect the power of statistical tests are studied. We show how the most important design factors of LC models are related via the information matrix, and how this information matrix is affected by the fact that the latent class membership is not observable. The proposed method is illustrated with numerical examples for different scenarios of design factors. A simulation study conducted to assess the performance of the proposed power analysis procedure showed that the procedure will work for many practical applications of LC models.
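The general logic of a Wald-based power calculation can be sketched for the simplest case of a single parameter. This is a generic illustration rather than the procedure of the paper: it assumes a two-sided z-test of a zero parameter value, and it takes as given a hypothetical per-observation asymptotic variance `unit_variance`, which in an LC model would come from the inverse of the information matrix.

```python
from statistics import NormalDist


def wald_power(beta, unit_variance, n, alpha=0.05):
    """Approximate power of a two-sided Wald z-test of H0: beta = 0.

    The standard error is sqrt(unit_variance / n); unit_variance is an
    assumed input, not something computed here.
    """
    std_normal = NormalDist()
    se = (unit_variance / n) ** 0.5
    delta = beta / se                       # non-centrality of the z statistic
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(delta - z_crit) + std_normal.cdf(-delta - z_crit)


def required_sample_size(beta, unit_variance, target_power=0.8, alpha=0.05):
    """Smallest n (by linear search) reaching the target power."""
    n = 2
    while wald_power(beta, unit_variance, n, alpha) < target_power:
        n += 1
    return n


# Hypothetical effect size and per-observation variance.
n_needed = required_sample_size(beta=0.5, unit_variance=4.0)
```

Weaker class separation inflates the variance term, so the same effect size then calls for a larger sample.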

Keywords

LATENT CLASS MODELS; SAMPLE SIZE; STATISTICAL POWER; INFORMATION MATRIX; DESIGN FACTOR

Department of Methodology and Statistics, Tilburg University, Tilburg, The Netherlands


The Bias-Adjusted Three-Step Approach To Latent Class Modeling With External Variables

Zsuzsa Bakk1, Daniel Oberski1, and Jeroen K. Vermunt1

Abstract

A popular way to connect latent class membership to external variables is to relate the external variables to the estimated scores on class membership; this approach is called three-step latent class analysis (LCA). While the three-step LCA is a popular approach, until recently it had the disadvantage that the parameters describing the association of latent class membership and auxiliary variables were underestimated (Bolck, Croon, Hagenaars, 2004). In the current paper we show how unbiased parameter estimates of this association can be obtained by using the known classification error probabilities as fixed value parameters in the third step analysis (Vermunt, 2010; Bakk, Tekle and Vermunt, in press). In addition to correct parameter estimates, we also show how correct standard error (SE) estimates can be obtained. We show the results of a simulation study in which we test the performance of the parameter bias correction and the SE bias correction methods.
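The classification error probabilities used in the third step can be estimated from the posterior class membership probabilities obtained in the first step. The sketch below assumes modal assignment and estimates, for each true class, the probability of being assigned to each class, averaging posteriors over the sample; the function name and the toy posteriors are illustrative assumptions.

```python
def classification_error_matrix(posteriors):
    """Estimate P(assigned class = s | true class = t) under modal assignment.

    posteriors[i][t] is the posterior probability that case i belongs to
    class t. Row t of the result is the distribution of the assigned class
    given true class t.
    """
    k = len(posteriors[0])
    numer = [[0.0] * k for _ in range(k)]
    for post in posteriors:
        assigned = max(range(k), key=lambda s: post[s])
        for t in range(k):
            numer[t][assigned] += post[t]
    return [[v / sum(row) for v in row] for row in numer]


# Hypothetical posteriors for three cases and two classes.
posteriors = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
errors = classification_error_matrix(posteriors)
```

The further the rows are from the identity matrix, the stronger the attenuation that an uncorrected third step would suffer.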

References

Bolck, A., Croon M.A. and Hagenaars J.A. (2004): Estimating Latent Structure Models with Categorical Variables: One-Step versus Three-Step Estimators. Political Analysis, 12, 3-27.

J.K. Vermunt (2010): Latent Class Modeling with Covariates: Two Improved Three-Step Approaches. Political Analysis, 18, 450-469.

Z. Bakk, F.T. Tekle and J.K. Vermunt (in press): Estimating the Association Between Latent Class Membership and External Variables Using Bias-adjusted Three-step Approaches. Sociological Methodology.

Keywords

LATENT CLASS ANALYSIS, THREE STEP APPROACH, COVARIATES, BIAS ADJUSTMENT

Tilburg University, PO BOX 90153, Tilburg, The [email protected]


Comparative Analysis on LDA-based Classification and Subject Categories of the Japanese Awards Database of Grants-in-Aid for Scientific Research, KAKEN

Kei Kurakawa1, Yuan Sun2, and Yasumasa Baba3

Abstract

Since research and development projects are increasingly carried out in a more competitive environment than ever before, it becomes more important to evaluate project results. In the evaluation process, the factual data of project results are aggregated from several kinds of sources among different databases and laid out along the same axes. A set of subject categories is one of the major evaluation axes and is intended to be integrated among different databases. For example, a bibliometrics-based research evaluation tool, InCites by Thomson Reuters, gives subject categories of Web of Science, which are mapped to the fields of science and technology of the OECD Frascati Manual [OECD2007] that represents a standard of research and development evaluation methods. Such a mapping of subject categories among different databases ideally ought to be automated for timely and appropriate research evaluation. So, as a starting point, we compared subject categories of the national grants database KAKEN (http://kaken.nii.ac.jp) and topics derived by a topic model, LDA (Latent Dirichlet Allocation) [BLEI2003], from keywords of projects in KAKEN. The subject categories and the topics assigned to each project are analyzed through the purity index [ZHAO2001].
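The purity index can be sketched in a few lines. In its usual form, each topic is credited with its most frequent subject category, and purity is the share of projects covered by those majority categories. The labels below are hypothetical and do not come from KAKEN.

```python
from collections import Counter, defaultdict


def purity(topics, categories):
    """Share of items whose category is the majority category of their topic."""
    by_topic = defaultdict(Counter)
    for topic, category in zip(topics, categories):
        by_topic[topic][category] += 1
    majority_total = sum(max(counts.values()) for counts in by_topic.values())
    return majority_total / len(topics)


# Hypothetical assignment of four projects to two topics.
topics = [0, 0, 1, 1]
categories = ["physics", "physics", "physics", "law"]
score = purity(topics, categories)
```

A purity of 1 means every topic contains projects from a single subject category; values near the share of the largest category indicate little agreement.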

References

BLEI, D. M., NG, A. Y., and JORDAN, M. I. (2003): Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

OECD. (2007): Revised Field of Science and Technology (FOS) Classification in the Frascati Manual.

ZHAO, Y. and KARYPIS, G. (2001): Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, MN.

Keywords

KAKEN, SUBJECT CATEGORIES, LDA, TOPIC CLASSIFICATION

National Institute of Informatics, Tokyo, [email protected] · National Institute of Informatics, Tokyo, [email protected] · The Institute of Statistical Mathematics, Tokyo, [email protected]


Prototype Identification through Archetypes

Giancarlo Ragozini1

Abstract

A prototype is an element chosen to represent a cluster in order to provide a simplified description of it. Prototypes are usually derived by minimizing some adequacy criterion. The best-known approach to obtain them is the constant radius method (e.g. the k-means algorithm and the related methods). The latter assures good results when dealing with elliptical clusters, but could become unstable and fail to identify the clusters correctly in other cases. Furthermore, the cluster centroids could be too average and, hence, prototypes might not be well distinguished and separated. In the present paper we propose a new method for prototype identification based on archetypes, i.e. few “pure” points lying on the boundary of the data scatter and characterizing the archetypal pattern in the data. Archetypes span a space in which data, both single valued and interval valued, have new coordinates, the so-called barycentric coordinates. We propose to perform the clustering procedure and the prototype identification in this new space, which provides an outward-inward perspective on the data, by using a proper compositional distance. The proposed procedure yields prototypes that are well separated and have clear profiles.
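Barycentric coordinates are non-negative and sum to one, so they can be treated as compositions. The abstract does not name the compositional distance it uses; as an illustration, the sketch below assumes the Aitchison distance (the Euclidean distance between centred log-ratio transforms), which is the distance discussed in the cited Aitchison et al. (2000) reference. It requires strictly positive coordinates, so points with a zero weight on some archetype would need separate handling.

```python
import math


def clr(x):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Hypothetical barycentric coordinates with respect to three archetypes.
near_first = [0.8, 0.1, 0.1]
near_second = [0.1, 0.8, 0.1]
central = [0.4, 0.3, 0.3]

d_extremes = aitchison_distance(near_first, near_second)
d_to_centre = aitchison_distance(near_first, central)
```

Because the distance works on log-ratios, points close to different archetypes are pushed far apart, which is consistent with the aim of obtaining well-separated prototypes.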

References

AITCHISON, J., BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J.A., PAWLOWSKY-GLAHN, V. (2000): Logratio Analysis and Compositional Distance. Mathematical Geology, 32, 271–275.

CUTLER, A., BREIMAN, L. (1994): Archetypal Analysis. Technometrics, 36, 338–347.

D’ESPOSITO, M.R., PALUMBO, F., RAGOZINI, G. (2012): Interval Archetypes: a new tool for interval data analysis. Statistical Analysis and Data Mining, 5, 322–335. DOI:10.1002/sam.11140.

Keywords

BARYCENTRIC COORDINATES, COMPOSITIONAL DATA, SOFT CLUSTERING

Department of Political Sciences, Federico II University of Naples, [email protected]


Spatial Clustering based on Hierarchical Structure of Multidimensional Lattice Data

Koji Kurihara1 and Fumio Ishioka2

Abstract

Spatial data carry the values of surface variables at specified locations or regions. We focus on lattice data over a fixed subset D of d-dimensional Euclidean space. Lattice data are synoptic observations covering an entire spatial region, supplemented with neighborhood information. Such data include spatial epidemiological data, remote sensing data, regional lattice data and so on. There are several clustering approaches for such lattice data. The echelons (Myers et al., 1997; Kurihara, 2004) are useful techniques to study the topological structure of a surface in a systematic and objective manner. The echelons are derived from the changes in topological connectivity with decreasing surface level. The echelon dendrogram represents the surface topology of lattice data, and the hierarchical structure of these data and their regional features are shown in it. In this paper, we apply the zone clustering method based on the peaks of the echelon dendrogram to multidimensional spatial lattice data. We obtain different zones based on a practical definition of the relation between peak and foundation. In addition, we present some illustrations of detecting hotspot areas for multidimensional spatial data.

References

KURIHARA, K. (2004): Classification of Geospatial Lattice Data and Their Graphical Representation. Classification, Clustering and Data Mining Applications (Edited by Banks, D. et al.). Springer, Berlin, Tokyo, 251–258.

MYERS, W.L., PATIL, G.P., JOLY, K. (1997): Echelon Approach to Areas of Concern in Synoptic Regional Monitoring. Environmental and Ecological Statistics, 4, 131–152.

Keywords

MULTIDIMENSIONAL SPATIAL DATA, ECHELON ANALYSIS, CLUSTERING

Graduate School of Environmental and Life Science, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, [email protected] · School of Law, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, [email protected]


Research Literature Analytics through Mapping Narratives

Fionn Murtagh

Abstract

With large volumes of scholarly journal submissions, or conference paper submissions, it is useful and indeed necessary to determine narratives of writing and of research involved. The same issue arises in the narrative of research grant funding proposals.

Some conferences and journals now use matching of submissions with reviewers, based on the content of the submitted paper, and a collection of past work by the reviewers (Charlin et al., 2011). In Murtagh (2010) we looked at discipline themes and subthemes with implications for strategy, and thematic focus and coverage. This was in connection with the work of a national research funding agency. In Murtagh et al. (2011), we looked at narrative within published journal articles.

Our objectives include taking account of subdiscipline differentiation, and mapping the semantics of the content considered. For scalability, this work also involves use of the Solr (Apache Lucene) storage, retrieval and discovery system.

References

CHARLIN, L., ZEMEL, R. and BOUTILIER, C. (2011): A Framework for Optimizing Paper Matching. In Proceedings of 27th Conference on Uncertainty in Artificial Intelligence (UAI), Barcelona.

MURTAGH, F., GANZ, A. and REDDINGTON, J. (2011): Semantics from Narrative: State of the Art and Future Perspectives. In: M. Gettler Summa, L. Bottou, B. Goldfarb, F. Murtagh, C. Pardoux and M. Touati (Eds.): Statistical Learning and Data Science. Chapman & Hall/CRC, 91–102.

MURTAGH, F. (2010): The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information, Sixth Boole Lecture, Computer Journal, 53 (3), 304–315.

Keywords

CLUSTERING, FACTOR ANALYSIS, BIG DATA, ANALYTICS, SEMANTICS

Department of Computer Science, Royal Holloway, University of London, Egham TW20 0EX, [email protected]


Effects of Moment-to-moment Likeability Patterns on the Virality of Online Ads

Tammo Bijmolt1

Abstract

Classification methods have been developed and applied numerous times in the marketing research discipline, most notably cluster analysis and latent class methods in market segmentation studies. More recently, classification methods have been applied to new topics, such as online media and customer databases. The presentation will provide a brief overview of recent applications of classification methods in marketing and then illustrate this using a specific project. In particular, I will discuss a study on consumers’ evaluation of online commercials and their willingness to share content (viral advertising). The analysis captures the dynamics of likeability evaluations by identifying moment-to-moment (MtM) patterns using trajectory finite mixture modelling and then examines the effect of these patterns on ad virality. The model is estimated using unique data consisting of more than 12,000 respondents and 30 ads. The results show, among other things, that high likeability values at the beginning and end of an ad are important, with the end effect being the larger of the two.

Faculty of Economics and Business, University of Groningen


Formal Concepts for Classification

Bernhard Ganter

Abstract

In recent decades, a rich mathematical theory was developed, which can be regarded as a theoretical basis of poly-hierarchical classification. It bears the name “Formal Concept Analysis”. The usual tree hierarchies are replaced by mathematically more interesting structures, namely complete lattices, which are interpreted as hierarchies of formal concepts. This name refers to the fact that extensional and intensional hierarchies are jointly represented. The use of metric approaches is possible, but is of minor importance. Formal Concept Analysis has expressive graphics, an extensive algebraic theory and powerful algorithms. The mathematical setting is both simple and versatile, mathematically rigorous and flexible.
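The joint representation of extensional and intensional hierarchies can be illustrated on a tiny example. A formal concept is a pair of an object set (extent) and an attribute set (intent) such that the intent contains exactly the attributes shared by the extent and the extent contains exactly the objects having all attributes of the intent. The sketch below enumerates all concepts of a hypothetical three-object context by brute force, which is adequate only for very small contexts.

```python
from itertools import combinations

# Hypothetical formal context: each object with its set of attributes.
context = {
    "frog": {"aquatic", "terrestrial"},
    "fish": {"aquatic"},
    "dog": {"terrestrial"},
}
all_attributes = set().union(*context.values())


def intent(objects):
    """Attributes shared by every object in the set (all attributes if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared


def extent(attributes):
    """Objects that have every attribute in the set."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate all (extent, intent) pairs by closing every object subset."""
    concepts = set()
    for size in range(len(context) + 1):
        for objects in combinations(sorted(context), size):
            shared = intent(objects)
            concepts.add((frozenset(extent(shared)), frozenset(shared)))
    return concepts


concepts = formal_concepts()
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice; here it has a top concept containing all objects, a bottom concept containing only the frog, and two incomparable concepts in between.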

The origins and initial inspirations of this research area were within the classification societies. Meanwhile, an independent community with lively publication and conference activities has developed. The lecture describes methodology and applications of Formal Concept Analysis by means of simple examples and informs about recent developments.


Multinomial Logistic Regression Ensembles

Hongshik Ahn1

Abstract

We propose a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model is used as a base classifier in ensembles from random partitions of predictors. The multinomial logit model can be applied to each mutually exclusive subset of the feature space without variable selection. By combining multiple models, the proposed method can handle a huge database without a constraint needed for analyzing high-dimensional data, and the random partition can improve the prediction accuracy by reducing the correlation among base classifiers. The proposed method is implemented using R, and the performance, including overall prediction accuracy, sensitivity, and specificity for each category, is evaluated on two real data sets and simulation data sets. To investigate the quality of prediction in terms of sensitivity and specificity, area under the ROC curve (AUC) is also examined. The performance of the proposed model is compared to a single multinomial logit model and it shows a substantial improvement in overall prediction accuracy. The proposed method is also compared with other classification methods such as Random Forest, Support Vector Machines, and Random Multinomial Logit Model.
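The ensemble structure described above, random partitioning of predictors followed by majority voting, can be sketched independently of the base model. The authors' implementation is in R; the Python below is only a structural illustration in which the base classifiers are hypothetical callables standing in for fitted multinomial logit models, and the model fitting itself is left out.

```python
import random
from collections import Counter


def random_partition(n_features, n_subsets, seed=0):
    """Split feature indices into mutually exclusive subsets at random."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    return [sorted(indices[i::n_subsets]) for i in range(n_subsets)]


def majority_vote(predictions):
    """Most frequent predicted class (first encountered wins a tie)."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(base_classifiers, subsets, x):
    """Let each base classifier vote using only its own feature subset."""
    votes = [
        classify([x[i] for i in subset])
        for classify, subset in zip(base_classifiers, subsets)
    ]
    return majority_vote(votes)


# Hypothetical stand-in for a fitted base model on one feature subset.
def toy_classifier(features):
    return "high" if sum(features) > 0 else "low"


subsets = random_partition(n_features=6, n_subsets=3)
prediction = ensemble_predict([toy_classifier] * 3, subsets, [1, 2, -1, 3, 0, 1])
```

Since each base model sees a disjoint block of predictors, no single model has to cope with the full dimensionality, and the votes are less correlated than they would be with overlapping subsets.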

Keywords

CLASS PREDICTION; ENSEMBLE; LOGISTIC REGRESSION; MAJORITY VOTING; MULTINOMIAL LOGIT; RANDOM PARTITION

Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600


Age-specific Disease Network For The Major Disease In Korea

Taerim Lee1 and Hongseok Kim2

Abstract

Objectives: The purpose of this paper is to analyze the relationship among major diseases in Korea using social network analysis and word cloud based on the literature data. Differences across three age groups are also studied.

Methods: We used social network analysis to draw a network graph for major diseases based on the relationships among the diseases by a literature search, using the prevalence rate and the mortality of such diseases from 2011 Korean National Health Nutrition Examination Survey and causes of death statistics in Korea.

Results: We find that smoking and obesity are the most important factors causing other diseases. Except for obesity, anemia, hepatitis, atopic dermatitis and some other diseases, most diseases become more common and more dangerous in the older age groups. These results can be recognized visually from the graphs made by social network analysis and wordle.

Conclusions: We made age-specific social network graphs of 24 major diseases in Korea across three age groups. We found that most diseases become more prevalent and more severe as people get older.

Keywords

SOCIAL NETWORK ANALYSIS, DISEASE NETWORK, KOREAN DISEASE NETWORK, WORD-CLOUD, WORDLE

Dept. of Information Statistics, KNOU · A Public Health Doctor at Suncheon city health center, Dept. of Information Statistics, KNOU


Analysis of Questionnaire Survey with Ordinal-polytomous Using the Binomial Confidence Limits

Ueno, T.1, Tatsunami, S.1, Otaki, M.2, and Kuwabara, R.2

Abstract

The questionnaire survey is used frequently in investigations of quality of life (QOL) as well as other social problems.

We assume that the response scale of a survey form is ordinal-polytomous and consider data from $n$ responders on a questionnaire instrument consisting of $m$ items. Let $y_{ij}$ be the response to the $j$-th item from the $i$-th responder and $\bar{y}_i$ the average of the $i$-th responder’s responses. Put $z = (z_{ij})$ as follows:

$$
z_{ij} =
\begin{cases}
\text{N.A.} & \text{if } y_{ij} \text{ is N.A.},\\
1 & \text{if } y_{ij} \ge \bar{y}_i,\\
0 & \text{if } y_{ij} < \bar{y}_i.
\end{cases}
$$

Let $p$ be the probability that $z_{ij} = 1$; then the number of ones in each column of $z$ is a variable with the binomial distribution $B(n, p)$. We count the number of ones in the $j$-th column of $z$ and classify the items into a few levels by using the binomial confidence limits. The number of levels should be determined appropriately depending on the number of columns. Then we pick up all items with the same level and apply the above procedure to them repeatedly, until we can no longer separate any group of items into smaller groups. We thus obtain a classification of the items of a questionnaire survey. The correlation of the items in the same level is not necessarily high. We consider that this procedure is useful in the interpretation of a questionnaire survey.
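One pass of this procedure can be sketched as follows. The abstract does not specify how $p$ is estimated, which confidence limits are used, or how many levels are formed, so the sketch makes explicit assumptions: $p$ is pooled over all answered items, the limits are exact central binomial quantiles at the 95% level, and items are split into three levels (below, within, above the limits). N.A. responses are represented by `None`.

```python
import math


def binarize(responses):
    """z_ij = 1 if y_ij >= the responder's mean, 0 if below, None if N.A."""
    z = []
    for row in responses:
        answered = [v for v in row if v is not None]
        mean = sum(answered) / len(answered)
        z.append([None if v is None else int(v >= mean) for v in row])
    return z


def binomial_limits(n, p, alpha=0.05):
    """Central (1 - alpha) range of counts under B(n, p), from exact quantiles."""
    lower, upper, cdf = None, n, 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        if lower is None and cdf >= alpha / 2:
            lower = k
        if cdf >= 1 - alpha / 2:
            upper = k
            break
    return lower, upper


def item_levels(z, alpha=0.05):
    """Label each item -1, 0 or +1 relative to the pooled binomial limits."""
    n_items = len(z[0])
    columns = [[row[j] for row in z if row[j] is not None] for j in range(n_items)]
    pooled_p = sum(sum(col) for col in columns) / sum(len(col) for col in columns)
    levels = []
    for col in columns:
        lower, upper = binomial_limits(len(col), pooled_p, alpha)
        ones = sum(col)
        levels.append(-1 if ones < lower else (1 if ones > upper else 0))
    return levels


# Hypothetical responses of two responders to three items.
z_example = binarize([[1, 2, 3], [5, None, 1]])
```

Repeating `item_levels` within each group of items that share a level, until no group splits further, gives the recursive classification described in the text.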

References

FAYERS, P. M. and MACHIN, D. (2000): Quality of Life, Assessment, Analysis and Interpretation. Wiley, England.

Keywords

QUESTIONNAIRE SURVEY, BINOMIAL CONFIDENCE LIMIT

Medical Statistics, St. Marianna University School of Medicine, Kawasaki, Japan [email protected] · Institute of Radioisotope Research, St. Marianna University Graduate School of Medicine, Kawasaki, Japan 216-8511


Comparison Of Methods For Handling Missing Data In A Multi-Item Instrument

I. Eekhout1,2,3, H.C.W. de Vet1,3, J.W.R. Twisk1,2,3, J.P.L. Brand4, M.R. de Boer2,5, and M.W. Heymans1,2,3

Abstract

Regardless of the proportion of missing values, complete-case analysis is most frequently applied, although advanced techniques such as multiple imputation are available. The objective of this study is to explore the performance of simple and more advanced methods for handling missing data in case some, many, or all item scores are missing in a multi-item instrument.

Real-life missing data situations were simulated in a multi-item variable used as a covariate in a linear regression model. Various missing data mechanisms were simulated with an increasing percentage of missing data. Subsequently, several techniques to handle missing data at the level of the item score and the total score were applied, such as mean imputation, two-way imputation and multiple imputation, to decide on the optimal technique for each scenario. Fitted regression coefficients were compared, using the bias and coverage as performance parameters.
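Two of the simpler item-level techniques can be sketched compactly. Person-mean imputation replaces a missing item score by the respondent's mean over answered items; two-way imputation, in its usual deterministic form, uses person mean plus item mean minus overall mean. The sketch below assumes those standard definitions, omits the random error term that is sometimes added to two-way imputation, and requires every respondent and every item to have at least one observed score. Multiple imputation is not shown.

```python
def person_mean_impute(data):
    """Replace each missing score (None) by the respondent's observed mean."""
    completed = []
    for row in data:
        answered = [v for v in row if v is not None]
        person_mean = sum(answered) / len(answered)
        completed.append([person_mean if v is None else v for v in row])
    return completed


def two_way_impute(data):
    """Replace each missing score by person mean + item mean - overall mean."""
    observed = [v for row in data for v in row if v is not None]
    overall_mean = sum(observed) / len(observed)
    item_means = []
    for j in range(len(data[0])):
        column = [row[j] for row in data if row[j] is not None]
        item_means.append(sum(column) / len(column))
    completed = []
    for row in data:
        answered = [v for v in row if v is not None]
        person_mean = sum(answered) / len(answered)
        completed.append([
            person_mean + item_means[j] - overall_mean if v is None else v
            for j, v in enumerate(row)
        ])
    return completed


# Hypothetical item scores for three respondents; None marks a missing score.
scores = [[1, 2, 3], [4, None, 6], [1, 1, 1]]
by_person_mean = person_mean_impute(scores)
by_two_way = two_way_impute(scores)
```

In this toy example the two methods disagree because the missing item has a lower mean than the other items, which two-way imputation takes into account and person-mean imputation ignores.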

Mean imputation caused biased estimates in every missing data scenario when data are missing for more than 10

We recommend applying multiple imputation to the item scores in order to get the most accurate regression model estimates. Moreover, we advise against using any form of mean imputation to handle missing data, despite the fact that this is often recommended in questionnaire manuals.

Keywords

MISSING DATA, MULTIPLE IMPUTATION, ITEM IMPUTATION, ORDINAL DATA, MULTI-ITEM QUESTIONNAIRE, SIMULATION

Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The [email protected] · Institute for Health Sciences, Faculty of Earth and Life Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Skyline Diagnostics, Rotterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands


Common and Cluster-specific Simultaneous Component Analysis

Kim De Roover1, Marieke E. Timmerman2, Batja Mesquita3 and Eva Ceulemans1

Abstract

In many fields of research, so-called ‘multiblock’ data are collected, i.e., data containing multivariate observations that are nested within higher-level research units (e.g., inhabitants of different countries). Each higher-level unit (e.g., country) then corresponds to a ‘data block’. For such data, it may be interesting to investigate the extent to which the correlation structure of the variables differs between the data blocks. More specifically, when capturing the correlation structure by means of component analysis, one may want to explore which components are common across all data blocks and which components differ across the data blocks. Therefore, we propose a common and cluster-specific simultaneous component method which clusters the data blocks according to their correlation structure and allows for common and cluster-specific components. Model estimation and model selection procedures are presented and the method is applied to data from cross-cultural values research to illustrate its empirical value.

Keywords

SIMULTANEOUS COMPONENT ANALYSIS, CLUSTERWISE SIMULTANEOUS COMPONENT ANALYSIS, MULTIBLOCK DATA, MULTIGROUP DATA, MULTILEVEL DATA

Methodology of Educational Sciences Research Unit, KU Leuven, Andreas Vesaliusstraat 2, box 3762, 3000 Leuven, Belgium. Email: [email protected] · Heymans Institute of Psychology, University of Groningen · Social and Cultural Psychology Research Unit, KU Leuven


Extending Clusterwise non-negative matrix factorization (NMF) to hierarchically organized data

Joke Heylen1, Philippe Verduyn2, Iven Van Mechelen2 and Eva Ceulemans1

Abstract

Researchers are often interested in capturing variability in time profiles. Often these studies yield hierarchically organized time series data, in that the time profiles are nested within higher order units. For instance, when studying the intensity of emotions and how it fluctuates across time, researchers ask subjects to recollect distinct emotional episodes and to draw their intensity course over time (Verduyn et al., 2009). The question then arises how the variability in these time profiles can be captured, taking individual differences into account. To this end, we extend Clusterwise non-negative matrix factorization (NMF) (Heylen et al., 2012) to hierarchically structured data. In this extension, the higher order units (e.g., persons) are clustered according to the different shapes that their time profiles take. To gain insight into which shapes typically occur for the higher order units in specific clusters, we partition the time profiles within each cluster. We propose an algorithm for fitting the hierarchical clusterwise NMF model to data and evaluate it by means of a simulation study. Finally, we fit the model to empirical intensity profiles of emotional episodes nested within subjects.

References

HEYLEN, J., CEULEMANS, E., VAN MECHELEN, I. and VERDUYN, P. (2012, August): Clusterwise Non-negative Matrix Factorization (NMF) for capturing variability in time profiles. Paper presented at the International Conference on Computational Statistics, Limassol, Cyprus.
VERDUYN, P., VAN MECHELEN, I., TUERLINCKX, F., MEERS, K. and VAN COILLIE, H. (2009): Intensity profiles of emotional experience over time. Cognition and Emotion, 23(7), 1427–1443.

Keywords

TIME PROFILES, HIERARCHICALLY ORGANIZED DATA, CLUSTERING, FUNCTIONAL DATA ANALYSIS

Methodology of Educational Sciences, KU Leuven, [email protected] · Quantitative Psychology and Individual Differences, KU Leuven, Belgium.


Generalized Reduced Clustering Analysis

Michio Yamamoto

Abstract

This work develops a new procedure for simultaneously finding an optimal cluster structure of multivariate objects and an optimal subspace for clustering. The proposed method minimizes a distance between the objects and their projections, with clustering penalties, and it can be considered a generalized model that includes some existing cluster analyses with dimension reduction, such as reduced k-means analysis (De Soete and Carroll, 1994) and factorial k-means analysis (Vichi and Kiers, 2001). In addition, even if the data have a structure which is independent of the true cluster structure and affects the performance of clustering, the proposed method finds the optimal subspace to partition the objects by eliminating the effect of the disturbing structure. An efficient alternating least-squares algorithm, consisting of the gradient projection algorithm and the k-means algorithm, is described. Analyses of artificial and real data examples demonstrate that the proposed method can give correct results where existing methods cannot.
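
For orientation, the two special cases named in this abstract are commonly written as least-squares criteria. The notation below is an assumption made for this sketch and is not taken from the abstract: X is the n × p data matrix, U an n × k binary membership matrix with exactly one 1 per row, F a k × q matrix of cluster centroids in the subspace, and A a p × q column-orthonormal loading matrix. The generalized criterion with clustering penalties is not given in the abstract and is not reproduced here.

```latex
% Reduced k-means (De Soete and Carroll, 1994): residual in the full variable space
\min_{U, F, A} \; \lVert X - U F A^{\top} \rVert_F^2
\quad \text{subject to } A^{\top} A = I_q

% Factorial k-means (Vichi and Kiers, 2001): residual in the projected space
\min_{U, F, A} \; \lVert X A - U F \rVert_F^2
\quad \text{subject to } A^{\top} A = I_q
```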

References

De Soete, G. and Carroll, J.D. (1994): K-means clustering in a low-dimensional Euclidean space. In: Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B. (Eds.): New approaches in classification and data analysis. Springer, Heidelberg, 212-219.
Vichi, M. and Kiers, H.A.L. (2001): Factorial k-means analysis for two-way data. Computational Statistics & Data Analysis, 37, 49-64.

Keywords

DIMENSION REDUCTION, CLUSTERING, GRADIENT PROJECTION ALGORITHM, K-MEANS ALGORITHM

Osaka University, [email protected]


Mixtures Of Factor Analyzers And Unobserved Heterogeneity In Questionnaire Data

Robert Kapłon1

Abstract

A model of a mixture of factor analyzers (MFA) was proposed to concurrently perform clustering and reduction of the number of dimensions when the number of dimensions is relatively large in relation to the sample size. Whilst the classification and visualization of high-dimensional data seems to be the primary purpose, one may find MFA useful in accounting for population heterogeneity in data. This suggestion stems from the fact that finite mixture models have been successfully applied to explain heterogeneity among customers in many marketing problems. Thus, in this paper we consider the possibility of applying mixtures of factor analyzers to questionnaire data, so as to capture unobserved heterogeneity. Firstly, we show how a traditional factor analysis model that ignores heterogeneity can lead to misleading inferences. Afterwards, based on these theoretical findings, a simulation experiment is conducted to investigate features of data which may indicate unobserved heterogeneity, thereby justifying the use of a mixture of factor analyzers. These results are then used to propose a procedure that allows us to decide, without parameter estimation for the MFA model, which of these two competing models should be utilized. Finally, we test the proposed model on a real data set.

References

ALLENBY, G.M. and ROSSI, P. (1999): Marketing Models of Consumer Heterogeneity. Journal of Econometrics, 89, 57–78.
DILLON, W.R. and KUMAR, A. (1994): Latent Structure and Other Mixture Models in Marketing: An Integrative Survey and Overview. In: R.P. Bagozzi (Ed.): Advanced Methods of Marketing Research. Blackwell, Oxford, 295–351.
FRÜHWIRTH-SCHNATTER, S. (2006): Finite Mixture and Markov Switching Models. Springer.
MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.

Keywords

FACTOR ANALYZERS, MIXTURE MODELS, HETEROGENEITY

Wrocław University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wrocław,[email protected]


Estimation Methods for Categorical Marginal Models: Comparing MAEL, GEE, and GSK.

Renske E. Kuijpers1, Wicher P. Bergsma2, L. Andries van der Ark1, and Marcel A. Croon1

Abstract

Categorical marginal models can be used for modeling dependent data. For example, marginal models are used to construct hypothesis tests and standard errors for certain coefficients, such as Cronbach’s alpha and scalability coefficients. The most used estimation method for marginal models is maximum likelihood (ML). However, for larger sets of items, problems with memory capacity occur. These problems can be avoided by using maximum augmented empirical likelihood (MAEL; Van der Ark, Bergsma & Croon, 2013). MAEL estimation uses all nonzero cells in a contingency table, plus a number of well-chosen zero cells. MAEL is a rather new method, and further investigation is needed. More common estimation methods for marginal models are generalized estimating equations (GEE) and GSK. GEE (Liang & Zeger, 1986) represents an extension of the generalized linear model (GLM). In contrast to ML estimation, GEE does not assume a certain probability model for the data. The GSK method (Grizzle, Starmer & Koch, 1969) is based on Weighted Least Squares (WLS). Here, the new estimation method MAEL is compared to GEE and GSK, using simulation studies as well as a real-data example.

References

GRIZZLE, J.E., STARMER, C.F. and KOCH, G.G. (1969): Analysis of Categorical Data by Linear Models. Biometrics, 25, 489-504.
LIANG, K-Y. and ZEGER, S.L. (1986): Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73, 13-22.
VAN DER ARK, L.A., BERGSMA, W.P. and CROON, M.A. (2013): Augmented Empirical Likelihood Estimation of Categorical Marginal Models for Large Sparse Contingency Tables. Manuscript submitted for publication.

Keywords

MARGINAL MODELS, MAXIMUM LIKELIHOOD, GENERALIZED ESTIMATING EQUATIONS, ESTIMATION

Tilburg University, [email protected] · London School of Economics


Applying Multilevel Latent Class Analysis To Large-Scale Educational Assessment Data: Predicting Students’ Mathematical Strategy Choices From Teachers’ Instructional Practice

Marije F. Fagginger Auer, Marian Hickendorff, and Cornelis M. van Putten

Abstract

The usefulness of multilevel latent class analysis (LCA) for educational data is demonstrated by applying this technique to data from the 2011 large-scale assessment of Dutch primary schools’ mathematics. The relation between the instructional practice reported by 107 teachers and the mathematical strategy choices of 1619 students was investigated. Multilevel LCA allowed modeling of the often ignored classroom effects, and one of its so far sparsely exploited features, the possibility of including predictors at different hierarchical levels, enabled modeling of the joint influence of teacher and student characteristics on learning outcomes. Four latent strategy choice classes of students were found, and teachers had a strong effect on students’ probability of being in these classes. Effects were found of student characteristics and of teachers’ strategy instruction, instruction formats and instruction differentiation. It is concluded that multilevel (teacher) effects should not be ignored in strategy research, and that multilevel LCA is especially suited for application in educational research.

Keywords

MULTILEVEL LATENT CLASS ANALYSIS, APPLICATION, EDUCATION

Leiden University, Institute of Psychology, Methods & Statistics, Wassenaarseweg 52, 2333 AK Leiden, the [email protected]


A Tuning Strategy for COSA

Maarten M.D. Kampert and Jacqueline J. Meulman

Abstract

It is well known that noise variables can overwhelm the few signals embedded in high-dimensional settings. To overcome this problem for data from these high-dimensional settings, Friedman and Meulman (2004) proposed clustering objects on subsets of attributes (COSA). This technique outputs a dissimilarity matrix that can be used in conjunction with a wide variety of (distance-based) clustering algorithms, including hierarchical methods. In order to avoid distinctly suboptimal solutions, COSA employs a homotopy strategy for which tuning parameters need to be set. However, clear guidance on the different choices for these tuning parameters has not yet been published. We propose a tuning strategy for hierarchical clustering. Furthermore, we compare COSA with other state-of-the-art methods on simulated and real-life data.

References

FRIEDMAN, J.H. and MEULMAN, J.J. (2004): Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, Series B, 66, 815–849.

Keywords

HIGH-DIMENSIONAL DATA, VARIABLE SELECTION, HIERARCHICAL CLUSTERING

Mathematical Institute, Leiden University


Accuracy Of Reliability Estimates

Pieter R. Oosterwijk, Klaas Sijtsma, and L. Andries van der Ark

Abstract

Test-score reliability is one of the most reported measures for assessing the measurement quality of psychological and educational tests. Well-known examples of estimates of test-score reliability are Cronbach’s coefficient alpha, Guttman’s lambda-2, the greatest lower bound, and the Molenaar-Sijtsma estimate. Coefficient alpha has received critical reviews for being incorrectly interpreted and being too conservative, and the greatest lower bound for being biased. However, the inaccuracy of the four reliability estimates has received little attention in the literature, although it is a threat to the practical usefulness of reliability estimates in small and modest samples. The actual extent of the inaccuracy of reliability coefficients due to factors such as sample size and number of items under empirical conditions is unknown. In a simulation study, we investigated the inaccuracy of coefficient alpha, Guttman’s lambda-2, the greatest lower bound, and the Molenaar-Sijtsma estimate. As measures of inaccuracy, we used the spread of the sampling distribution of a reliability estimate for different levels of sample size, number of items, number of answer categories, and value of the test-score reliability. We found that the spread of the sampling distributions (95% interpercentile range) mostly depends on sample size and number of items. For a multitude of conditions in the simulation design, the results show that reliability estimates are too inaccurate to be useful in practice.
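
The kind of simulation described in this abstract can be illustrated with a minimal sketch. It covers coefficient alpha only, and the data-generating model (one common true score plus independent normal noise per item), the sample sizes, the number of items and the number of replications are all assumptions chosen for illustration, not the authors' design.

```python
import random
import statistics


def cronbach_alpha(scores):
    """Coefficient alpha for a list of persons, each a list of item scores."""
    k = len(scores[0])
    item_vars = [statistics.variance([row[j] for row in scores]) for j in range(k)]
    total_var = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def simulate_scores(n_persons, n_items, rng):
    """Assumed toy model: item score = common true score + independent noise."""
    rows = []
    for _ in range(n_persons):
        true_score = rng.gauss(0, 1)
        rows.append([true_score + rng.gauss(0, 1) for _ in range(n_items)])
    return rows


def interpercentile_range(values, lower=0.025, upper=0.975):
    """Width of the central 95% of the simulated estimates (simple order statistics)."""
    ordered = sorted(values)
    low = ordered[int(lower * (len(ordered) - 1))]
    high = ordered[int(upper * (len(ordered) - 1))]
    return high - low


rng = random.Random(2013)
spread = {}
for n_persons in (25, 400):  # hypothetical small and large sample sizes
    alphas = [cronbach_alpha(simulate_scores(n_persons, 6, rng)) for _ in range(200)]
    spread[n_persons] = interpercentile_range(alphas)
```

Comparing `spread[25]` with `spread[400]` shows how the width of the sampling distribution of alpha shrinks as the sample size grows under this toy model.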

Keywords

CRONBACH’S ALPHA, COEFFICIENT ALPHA, GREATEST LOWER BOUND TO THE RELIABILITY, GUTTMAN’S LAMBDA-2, MOLENAAR-SIJTSMA RELIABILITY, RELIABILITY ESTIMATION METHODS

Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, the [email protected]; [email protected]; [email protected]


A Big Data Intensive Application System with Symbolic Data Analysis and its Implementation

Hiroyuki Minami and Masahiro Mizuta

Abstract

Big data analysis has become a prominent topic worldwide. Most reports focus on how to handle such data with database techniques. Some of them note the importance of a statistical approach, but only in passing. It is tough for statisticians to analyze Big data straightforwardly with conventional methods (mainly for n × p or dissimilarity (n × n) matrices). It takes vast time and working memory on a typical computer.

Symbolic Data Analysis (SDA) is a powerful approach that is applicable to most of the characteristics (starting with “V”) of Big data. A “second level” data expression like the Concept in SDA is useful to overcome Variety. It is also somewhat effective for Volume, shrinking the amount of handled data, since the expression can be regarded as data aggregation. However, much computing power and storage capacity are still needed for Big data analysis even if we succeed in shrinking the data. The same holds for the other feature, Velocity.
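
The idea of a second-level expression as data aggregation can be sketched as follows: individual records are collapsed into one interval-valued description per group. The function name, the record layout and the toy values are hypothetical and serve only as an illustration; this is not the authors' system.

```python
from collections import defaultdict


def aggregate_to_intervals(records, key, variables):
    """Collapse individual records into one (min, max) interval per group and variable."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    concepts = {}
    for name, members in groups.items():
        concepts[name] = {
            var: (min(m[var] for m in members), max(m[var] for m in members))
            for var in variables
        }
    return concepts


# Hypothetical toy records (not data from the abstract).
records = [
    {"city": "A", "temp": 12.0, "load": 3.1},
    {"city": "A", "temp": 18.5, "load": 4.0},
    {"city": "B", "temp": 25.0, "load": 7.2},
    {"city": "B", "temp": 21.0, "load": 6.5},
    {"city": "B", "temp": 23.5, "load": 8.0},
]
concepts = aggregate_to_intervals(records, "city", ["temp", "load"])
```

Five first-level records shrink to two symbolic objects here, which is the Volume reduction the abstract refers to; each group can be aggregated independently, which is what makes the step a natural fit for a map-and-reduce style of computation.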

Cloud computing, offered as a distributed computing facility with tools such as Hadoop, MapReduce functions and a distributed file system, might lead us to a feasible solution. We have developed a statistical application system based on SDA to seek and utilize the affinity of cloud computing for SDA.

In the paper, we introduce our system and its implementation from a practical viewpoint. Through some examples, we discuss its performance and utility.

References

DIDAY, E. and NOIRHOMME-FRAITURE, M. (2008): Symbolic Data Analysis and the SODAS Software. Wiley.
MINAMI, H. and MIZUTA, M. (2012): SDA framework is the tool for Big Data Analysis? Book of Abstracts. 3rd Workshop in Symbolic Data Analysis, 21.

Keywords

SYMBOLIC DATA, CLOUD COMPUTING

Information Initiative Center, Hokkaido University, [email protected], [email protected]


A Generalization Of The Centre And Range Method For Fitting A Linear Regression Model To Symbolic Interval Data Using Ridge Regression, Lasso And Elastic Net Methods

Oldemar Rodríguez1

Abstract

In [RODRIGUEZ O. (2000)] we made four proposals for linear regression with interval-type data: simple regression with empirical correlation, linear regression based on the maximum and minimum correlation, linear regression based on the mid-points, and linear regression based on the top-points of the hypercubes. Then, in [BILLARD, L., DIDAY, E., (2000)], the authors presented a method that fits a linear model to an interval-valued data set using the mid-points of the interval values assumed by the variables in the learning data set, and applies this model to the lower and upper boundaries of the interval values of the independent variables to do the prediction. In [LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2008-2010)] the authors proposed a new approach to symbolic interval data that fits the linear regression model on the mid-points and ranges of the interval values assumed by the variables in the learning set.

Ridge Regression shrinks the regression coefficients by imposing a penalty on their size, so that the coefficients minimize a penalized residual sum of squares. In the paper “Regression Shrinkage and Selection via the Lasso” [TIBSHIRANI, R., (1996)] the author proposes a new method for estimation in linear models that minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. The penalties used in the Lasso provide a natural variable selection that encourages sparsity and simplicity in the solution. In the paper [HASTIE, T., AND ZOU H. (2005)] the elastic net method was proposed; this is also a regularization and variable selection method, whose penalty is a convex combination of the lasso and ridge penalties.

In this paper we use Ridge Regression, Lasso and Elastic Net methods in order to improve the Centre and Range method for fitting a linear regression model to symbolic interval data. Finally, the approaches presented are applied to real and simulated data sets and their performance is compared with that of the Centre and Range method.
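
A minimal sketch of how a ridge penalty can be combined with the centre and range idea is given below, assuming a single interval-valued predictor so that the ridge solution has a closed form. The function names, the toy intervals and the clipping of negative predicted half-ranges are assumptions made for illustration; the abstract does not specify the author's implementation.

```python
def ridge_fit_1d(x, y, lam):
    """Ridge fit for one predictor; the intercept is left unpenalized."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    return mean_y - slope * mean_x, slope


def centre_range_ridge(x_intervals, y_intervals, lam):
    """Fit one ridge model on interval mid-points and one on half-ranges."""
    x_centre = [(lo + hi) / 2 for lo, hi in x_intervals]
    x_half = [(hi - lo) / 2 for lo, hi in x_intervals]
    y_centre = [(lo + hi) / 2 for lo, hi in y_intervals]
    y_half = [(hi - lo) / 2 for lo, hi in y_intervals]
    c0, c1 = ridge_fit_1d(x_centre, y_centre, lam)
    r0, r1 = ridge_fit_1d(x_half, y_half, lam)

    def predict(interval):
        lo, hi = interval
        centre = c0 + c1 * (lo + hi) / 2
        # Assumption: clip at zero so that the lower bound never exceeds the upper bound.
        half = max(r0 + r1 * (hi - lo) / 2, 0.0)
        return centre - half, centre + half

    return predict


# Hypothetical toy intervals (not data from the abstract).
x_intervals = [(1, 3), (2, 6), (4, 6), (5, 9)]
y_intervals = [(4, 6), (7.5, 10.5), (10, 12), (13.5, 16.5)]
predict = centre_range_ridge(x_intervals, y_intervals, lam=0.0)
lower, upper = predict((3, 5))
```

With `lam=0.0` the sketch reduces to the unpenalized centre and range fit; increasing `lam` shrinks both slopes towards zero. Lasso and elastic net penalties have no closed form in general and would need an iterative solver.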

References

BILLARD, L., DIDAY, E., (2000). Regression analysis for interval-valued data. In: Data Analysis, Classification and Related Methods, Proceedings of the Seventh Conference of the International Federation of Classification Societies (IFCS'00), Springer, Belgium, pp. 369-374.
BILLARD, L., DIDAY, E., (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98 (462), 470-487.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2008). The Elements of Statistical Learning; Data Mining, Inference and Prediction. New York: Springer.
HASTIE, T., AND ZOU H. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67, Part 2, pp. 301-320.
GIORDANI P., (2011). Linear regression analysis for interval-valued data based on the Lasso technique. Technical Report n. 7, Department of Statistical Sciences, Sapienza University of Rome.

CIMPA, School of Mathematics, University of Costa [email protected] // [email protected]


LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2008). Centre and range method to fitting a linear regression model on symbolic interval data. Computational Statistics and Data Analysis 52, 1500-1515.
LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis 54, 333-347.
TIBSHIRANI, R., (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society - Series B 58, 267-288.
RODRIGUEZ, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.

Keywords

LINEAR REGRESSION, ELASTIC NET, LASSO, RIDGE REGRESSION, SYMBOLIC DATA ANALYSIS.


Symbolic Data Clustering. A Review

Justyna Wilk1

Abstract

Clustering is the unsupervised classification of patterns into relatively homogeneous groups and one of the most important methods of exploratory data analysis. However, clustering is a complex problem. The difficulty deepens when clustering symbolic data, which are more complex than classical data. Symbolic data contribute significantly to data mining by presenting huge data sets in a reduced form and by describing phenomena more completely and naturally.

Although there are numerous studies on symbolic data analysis and its applications, there is a lack of an overview study which would complete and systematize the knowledge of symbolic data clustering. The subject of this paper is to consider methods suitable for symbolic data clustering. We present a taxonomy of clustering techniques and a review of their applications in symbolic data analysis. We discuss the clustering procedure for symbolic data and recommend methods suitable for symbolic data analysis.

References

BOCK, H.-H. and DIDAY, E. (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin Heidelberg.
DIDAY, E. and BRITO, P. (1989): Symbolic Cluster Analysis. In: O. Opitz (Ed.): Conceptual and Numerical Analysis of Data. Springer-Verlag, Berlin Heidelberg, 45–84.
EVERITT, B.S., LANDAU, S. and LEESE, M. (2001): Cluster Analysis. Arnold, London.
WILK, J. (2011): Analiza skupień na podstawie danych symbolicznych [Cluster analysis based on symbolic data]. In: E. Gatnar, M. Walesiak (Eds.): Analiza danych jakościowych i symbolicznych z wykorzystaniem programu R [Symbolic and qualitative data analysis with using of R software]. C.H. Beck, Warsaw, 262–279.

Keywords

CLUSTER ANALYSIS, SYMBOLIC DATA ANALYSIS, CLASSIFICATION

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]


The Ensemble Conceptual Clustering of Symbolic Data

Marcin Pełka1

Abstract

The ensemble approach, based on aggregating information provided by different models, has proved to be a very useful tool in the context of supervised learning. The main goal is to increase the accuracy and stability of the classification. Recently the same techniques have been applied to cluster analysis, where a better solution can be obtained by combining a set of different clusterings.

Since Michalski wrote about conceptual clustering as a new branch of machine learning (Michalski 1980), there has been increasing attention to that task. In conceptual clustering it is not only the inherent structure of the data that drives cluster formation, but also the description language which is available to the learner.

The article proposes to apply conceptual clustering in ensemble learning of symbolic data. The main contribution of the paper is a proposal for solving the theoretical problem of aggregating conceptual clustering results. An adaptation of bagging is proposed. In the empirical part of the paper some simulation experiment results are presented (based on artificial and real symbolic data sets).

References

BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
FRED, A.L.N., JAIN, A.K. (2005): Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, 835–850.
MICHALSKI, R.S. (1980): Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, Vol. 4, 219–243.

Keywords

SYMBOLIC DATA ANALYSIS, ENSEMBLE CLUSTERING, CONCEPTUAL CLUSTERING

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]


The Hierarchy Test Of Geographic Units based on Border Lengths

Andrzej Sokołowski1, Danuta Strahl2, Małgorzata Markowska3, and Marek Sobolewski4

Abstract

Whenever geographical or administrative units are subjects for classification (clustering), one can wonder whether the results are influenced by an upper-level classification. If we cluster districts, prefectures or counties into homogeneous groups, it would be interesting to know whether the partition has anything in common with upper-level regions or countries. In the paper we propose a procedure for testing such influence. Administrative units are neighbors with different common border lengths. The differences in lengths are quite large. It is natural to assume that the relations between units may be somehow statistically proportional to the common border length. A universal test cannot be suggested, since the place, neighborhood and common border lengths are different for each set of analysed units. So we propose a procedure based on a computer-intensive way of finding critical values for a given problem. Results are compared with the test based solely on the number of neighbors.

References

AUDRETSCH, D.B. and FELDMAN, M.P. (1996): R&D Spillovers and the Geography of Innovation and Production. The American Economic Review, vol. 86, No. 3, 630-640.
UNWIN, D.J. (1996): GIS, spatial analysis and spatial statistics. Progress in Human Geography, 20, 4, 540-551.
MOORE, D.A. and CARPENTER, T.E. (1999): Spatial Analytical Methods and Geographic Information Systems: Use in Health Research and Epidemiology. Epidemiologic Reviews, vol. 21, No. 2, 143-161.

Keywords

SPATIAL METHODS, CLUSTERING, INNOVATIONS

Cracow University of [email protected] · Wroclaw University of [email protected] · Wroclaw University of Economics [email protected] · Rzeszow University of [email protected]


Statistical Modeling of the Optimal Level of FX Reserves for Poland

Eugeniusz Gatnar

Abstract

Modeling the optimal level of international reserves is an important issue for central banks, especially for emerging economies such as Poland. FX reserves can be seen as a form of self-insurance against sudden stops in capital flow. Therefore, they can protect economies from crises and mitigate their impact, but, on the other hand, they are costly.

The research on FX reserve adequacy was started in the late sixties by Heller, and since then several models have been developed, e.g. by Frenkel and Jovanovich (1981), Wijnholds and Kaptyen (2001), Aizeman and Lee (2005), and Jeanne and Ranciere (2009).

In this paper we introduce a model that allows estimation of the optimal level of foreign exchange reserves for Poland.

References

AIZEMAN J., LEE J. (2005): International Reserves: Precautionary Versus Mercantilist Views, Theory and Evidence, NBER Working Paper No. 11366, National Bureau of Economic Research, Cambridge, Massachusetts.
FRENKEL J., JOVANOVICH B. (1981): Optimal International Reserves: A Stochastic Framework, Economic Journal, 1981, Vol. 91, pp. 507–514.
JEANNE O., RANCIERE R. (2009): The Optimal Level of International Reserves for Emerging Market Countries: Formulas and Applications, IMF Working Paper, WP/06/229, Washington.
WIJNHOLDS O., KAPTYEN A. (2001): Reserve Adequacy in Emerging Market Economies, IMF Working Paper 01/143, International Monetary Fund, Washington.

Keywords

FX RESERVES, RESERVE ADEQUACY, FINANCE, REGRESSION MODELS, STATISTICS

University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]; National Bank of Poland, Swietokrzyska 11/21, 00–919 Warszawa, Poland, [email protected]


Latent Transitions with Mixture Rasch Model of Bankruptcy Risk in the Classification of Polish Firms

Barbara Pawełek1, Józef Pociecha2, and Adam Sagan3

Abstract

Many types of bankruptcy prediction models have been formulated in business theory and practice. Among the more popular are multidimensional discriminant analysis, logit models, neural networks and classification trees.

The aim of the paper is to present the results of modeling bankruptcy using latent transition models (LTA) with a mixture Rasch measurement model (MRM) of bankruptcy risk and time-invariant and time-varying covariates. The measurement model is based on financial indicators of firms’ economic performance.

Data from Polish industry are used to estimate class prevalences, within-class variability on the latent variable, and transition probabilities across classes that reflect the level of bankruptcy risk.

Finally, a variety of LTA-MRM models with actual bankruptcy as a distal outcome is used for establishing the level of predictive validity.

References

CHO, S.-J., COHEN, A.S., KIM, S.-H. and BOTTGE, B. (2010), Latent Transition Analysis with a Mixture Item Response Theory Measurement Model, Applied Psychological Measurement, 34(7), 483-504.
PAWEŁEK, B. and POCIECHA, J. (2012), General SEM Model in Researching Corporate Bankruptcy and Business Cycles. In: J. Pociecha and R. Decker (Eds.): Data Analysis Methods and Its Applications. C.H. Beck, Warsaw, 215-231.

Keywords

BANKRUPTCY RISK, LATENT TRANSITION ANALYSIS, MIXTURE RASCH MODEL

Cracow University of [email protected] · Cracow University of [email protected] · Cracow University of [email protected]


Automatic Determination Of The Number Of Clusters In Spectral Clustering

Marek Walesiak and Andrzej Dudek

Abstract

This paper tests the usefulness of seven indices (within-group dispersion, Davies-Bouldin index, Calinski and Harabasz index, Hartigan index, Krzanowski and Lai index, Silhouette index, gap index) for assessing the quality of classification when selecting the number of clusters in spectral clustering, taking into account four types of distance (squared Euclidean distance, Euclidean distance, Manhattan distance, GDM1 distance).

The article evaluates twenty-eight clustering procedures (four spectral clustering methods and seven indices) based on simulated data (classic and non-classic). Each clustering result is compared with the known cluster structure using the corrected Rand index.
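
The corrected Rand index used for this comparison can be sketched in a few lines, following the pair-counting formula of Hubert and Arabie (1985). The function name and the convention of returning 1.0 when both partitions are trivial are assumptions made for this sketch.

```python
from collections import Counter
from math import comb


def corrected_rand(labels_a, labels_b):
    """Corrected (adjusted) Rand index between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions (contingency table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately (table margins).
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


known = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]  # same partition with different label names
crossed = [0, 1, 0, 1]     # splits every known cluster in half
```

The index depends only on which objects share a cluster, so `corrected_rand(known, relabelled)` is unaffected by the label names, while partitions that agree no better than chance score around zero or below.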

References

HUBERT, L. and ARABIE, P. (1985): Comparing partitions, Journal of Classification, 2(1), 193–218.
NG, A., JORDAN, M. and WEISS Y. (2002): On spectral clustering: analysis and an algorithm. In: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, 849–856.
WALESIAK, M. (2011): Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R [The Generalized Distance Measure GDM in multivariate statistical analysis with R]. Wydawnictwo UE, Wroclaw.
WALESIAK, M. and DUDEK, A. (2012): clusterSim package. URL http://www.R-project.org.
WANG, J. (2010): Consistent selection of the number of clusters via crossvalidation, Biometrika, 97(4), 893–904.

Keywords

CLUSTER ANALYSIS, SPECTRAL CLUSTERING, NUMBER OF CLUSTERS

Wrocław University of Economics, Department of Econometrics and Computer Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]


A Spectral-Mean Shift Algorithm for Clustering of Symbolic Data

Andrzej Dudek1 and Marcin Pełka1

Abstract

Clustering methods have been applied with success in many different areas. In cluster analysis objects are usually described by single-valued variables. This allows them to be represented as vectors, where each column represents a variable. However, this kind of data representation is too restrictive for more complex data. To take into account uncertainty and/or variability in the data, variables must assume sets of categories or intervals, possibly with weights or frequencies. Data of this kind have mainly been studied in Symbolic Data Analysis (SDA).

The article proposes a new clustering method for symbolic data: spectral mean shift clustering (SMSC). Spectral clustering has been a point of interest in many papers since the end of the twentieth century. It is not a new clustering method, but rather a new method of preparing data for further cluster analysis. The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters and their shape.
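
To make the mean shift step concrete, here is a minimal one-dimensional sketch with a Gaussian kernel. It illustrates plain mean shift on classical numeric points only; the spectral preparation step and the treatment of symbolic data proposed in the article are not reproduced, and the bandwidth, the merging threshold and the toy points are assumptions.

```python
from math import exp


def mean_shift_1d(points, bandwidth, n_iter=100, tol=1e-6):
    """Move every point uphill to a density mode, then group points by shared mode."""
    modes = []
    for start in points:
        x = start
        for _ in range(n_iter):
            weights = [exp(-((p - x) / bandwidth) ** 2 / 2) for p in points]
            shifted = sum(w * p for w, p in zip(weights, points)) / sum(weights)
            if abs(shifted - x) < tol:
                break
            x = shifted
        modes.append(x)

    # Assumption: modes closer than half a bandwidth belong to the same cluster.
    labels, centres = [], []
    for mode in modes:
        for index, centre in enumerate(centres):
            if abs(mode - centre) < bandwidth / 2:
                labels.append(index)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return labels, centres


# Hypothetical toy points forming two well-separated groups.
points = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
labels, centres = mean_shift_1d(points, bandwidth=0.5)
```

The number of clusters is not passed in anywhere; it follows from how many distinct modes the kernel density has at the chosen bandwidth, which is the property the abstract highlights.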

The proposed algorithm is a combination of spectral and mean shift approaches for symbolic data in order to deal better with non-Gaussian clusters with noisy variables and/or outliers.

References

BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
CHENG Y. (1995): Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8, p. 790–799.
NG, A., JORDAN, M., WEISS, Y. (2002): On spectral clustering: analysis and an algorithm. [In:] T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, p. 849–856.

Keywords

SYMBOLIC DATA ANALYSIS, SPECTRAL CLUSTERING, MEAN SHIFT

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]


Asymptotics of Reduced K-means Clustering

Yoshikazu Terada1

Abstract

Reduced k-means clustering, proposed by De Soete and Carroll (1994), is a method for clustering objects in a low-dimensional subspace. The advantage of this method is that both the clustering of objects and a low-dimensional subspace reflecting the cluster structure are obtained simultaneously.

The relationship between conventional k-means clustering and reduced k-means clustering is discussed. Conditions ensuring almost sure convergence of the estimator of reduced k-means clustering as the sample size increases without bound are presented. Results for a more general model that covers both conventional k-means clustering and reduced k-means clustering are provided. The rate of convergence of the empirically optimal clustering scheme is also discussed. Moreover, a new criterion and its consistent estimator are proposed to determine the optimal dimension of the subspace, given the number of clusters. For more details, see Terada (2013).

References

DE SOETE, G. and CARROLL, J.D. (1994): K-means clustering in a low-dimensional Euclidean space. In: Diday, E., Lechevallier, Y., Schader, M., Bertrand, P. and Burtschy, B. (Eds.): New Approaches in Data Analysis. Springer, Heidelberg, 212–219.
TERADA, Y. (2013): Strong consistency of Reduced k-means clustering. arXiv.

Keywords

STRONG CONSISTENCY, DIMENSION REDUCTION, K-MEANS

Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka, [email protected]


Non-hierarchical Clustering Algorithm For Mixed Numerical And Categorical Three-Way Three-Mode Data

Takahiro Umei1 and Hiroshi Yadohisa2

Abstract

Three-way three-mode data are defined as a set of multivariate data for the same objects and variables. Three-way factorial k-means (Vichi et al., 2007) and Tucker 3 clustering (Rocci and Vichi, 2005) have been proposed as algorithms for clustering such data. However, these algorithms can only deal with numerical data. To apply them to categorical data, the data first need to be converted into numerical data using dummy variables. The clustering results are then difficult to interpret because a large number of variables is required. To address this problem, Chan et al. (2004) proposed an approach for multivariate data clustering that does not increase the number of variables and that considers the importance of variables. Its clustering results are therefore easy to interpret.

In order to overcome the problems encountered in previous studies, this paper proposes a new non-hierarchical clustering algorithm that extends the algorithm of Chan et al. (2004) to three-way three-mode data clustering. Specifically, our algorithm enables easy interpretation of the clustering results of three-way three-mode data and considers the importance of variables and occasions for each cluster.

References

CHAN, E.Y., CHING, W.K. and HUANG, J.Z. (2004): An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition, 37(5), 943–952.
ROCCI, R. and VICHI, M. (2005): Three-mode component analysis with crisp or fuzzy partition of units. Psychometrika, 70(4), 716–736.
VICHI, M., ROCCI, R. and KIERS, H.A.L. (2007): Simultaneous Component and Clustering Models for Three-way Data: Within and Between Approaches. Journal of Classification, 24(1), 71–98.

Keywords

SUBSPACE CLUSTERING, VARIABLES AND OCCASIONS WEIGHTS, K-MODE CLUSTERING

Doshisha [email protected] · Doshisha [email protected]


Using Simulation Strategies to Test Clustering Algorithm Performances

Marina Marino1 and Cristina Tortora2

Abstract

A large number of clustering methods exist in the literature. The simplest and most commonly used methods, such as k-means or hierarchical clustering, perform well under the following conditions: 1) a small number of variables (lower than the number of units), 2) orthogonal variables (spherical clusters), 3) clusters having the same variance, 4) absence of outliers. When one or more of these conditions are not satisfied, clustering methods can fail to detect the clustering structure underlying the data. In this work a simulation study is used to test the performance of a recently proposed clustering method, Factor PD-clustering (FPDC), when the optimality conditions are not satisfied. FPDC is a factorial clustering method proposed by Tortora et al. in 2011. It is based on Probabilistic Distance clustering (PD-clustering), proposed by Ben-Israel and Iyigun in 2008. FPDC makes a linear transformation of the original variables into a reduced number of orthogonal ones using a common criterion with PD-clustering. Factor PD-clustering alternates between a Tucker 3 decomposition and a PD-clustering on the transformed data until convergence is reached. This method could significantly improve PD-clustering performance and allows us to work with large datasets. The method gives good results when the optimality conditions are not respected. The simulation design is based on the structure proposed by Maronna and Zamar in 2002.

References

BEN-ISRAEL, A. AND IYIGUN, C. (2008): Probabilistic d-clustering. Journal of Classification, 25(1):5–26.
MARONNA, R.A. AND ZAMAR, R.H. (2002): Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4):307–317.
TORTORA, C., GETTLER SUMMA, M., AND PALUMBO, F. (2011): Factorial pd-clustering. Proceedings of the Joint Conference of the German Classification Society.

Keywords

FACTOR PD-CLUSTERING, SIMULATION STUDY

Università di Napoli Federico II, [email protected] · University of Guelph, [email protected]


Random Forest Variable Importance Measures: Current Developments

Anne-Laure Boulesteix1 and Silke Janitza2

Abstract

The random forest method is an increasingly common supervised learning tool used in various application fields such as bioinformatics and genetics. The variable importance measures (VIMs) that are automatically calculated as a by-product of the algorithm are often used to rank predictors with respect to their ability to predict the investigated response. It is now well known that VIMs may be affected by substantial biases, for instance in favour of categorical predictors with many categories. After a brief survey of these issues, we address further topics related to variable importance measures: the bias affecting the Gini VIM in favour of categorical predictors with approximately balanced categories, a new permutation VIM based on the area under the curve that is more robust against class imbalance in the response variable than the usual permutation VIM, and the development of statistical tests for VIMs.
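
The principle of a permutation importance that is scored by the area under the ROC curve (AUC) rather than by the error rate can be sketched as follows. This is a simplified Python illustration and not the authors' measure: it permutes a predictor for an arbitrary fixed scoring function on a single data set, whereas a random forest VIM is computed tree by tree on out-of-bag observations. The function names and the toy data are assumptions.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive case is scored above a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in AUC after randomly permuting one predictor column."""
    rng = random.Random(seed)
    baseline = auc([predict(row) for row in X], y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        drops.append(baseline - auc([predict(row) for row in X_perm], y))
    return sum(drops) / n_repeats

# Hypothetical imbalanced toy data: 3 positives, 17 negatives.
# Column 0 separates the classes, column 1 is ignored by the scoring function.
X = [[20.0 + i, float(i % 3)] for i in range(3)] + [[float(i), float(i % 3)] for i in range(17)]
y = [1] * 3 + [0] * 17
score = lambda row: row[0]  # stands in for a fitted model's predicted score

importance_informative = auc_permutation_importance(score, X, y, column=0)
importance_ignored = auc_permutation_importance(score, X, y, column=1)
```

Because the AUC compares positives with negatives pairwise, it does not depend on the class proportions in the way the error rate does.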

References

BOULESTEIX, A.-L., JANITZA, S., KRUPPA, J. and KÖNIG, I. (2012): Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2, 493–507.
BOULESTEIX, A.-L., BENDER, A., LORENZO-BERMEJO, J. and STROBL, C. (2012): Random forest Gini importance favors SNPs with large minor allele frequency. Briefings in Bioinformatics, 13, 292–304.
JANITZA, S., STROBL, C. and BOULESTEIX, A.-L. (2013): An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics (accepted).

Keywords

RANDOM FOREST, ENSEMBLE METHOD, SUPERVISED LEARNING, VARIABLE IMPORTANCE

Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, [email protected] · Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, [email protected]


Detecting Threshold Interactions In Binary Classification: STIMA

Claudio Conversano1 and Elise Dusseldorp2

Abstract

Simultaneous Threshold Interaction Modeling Algorithm (STIMA) is a tool enabling us to automatically select interactions in a Generalized Linear Model (GLM) through the estimation of a suitably defined tree structure called “trunk”. STIMA integrates GLM with a classification tree algorithm or a regression tree algorithm, depending on the nature of the response variable (nominal or numeric). Accordingly, it can be based on the Classification Trunk Model or on the Regression Trunk Model. In both cases, interaction terms are expressed as “threshold interactions” instead of traditional cross-products. Compared with standard tree-based algorithms, STIMA is based on a different splitting criterion as well as on the possibility to “force” the first split of the trunk by manually selecting the first splitting predictor. Different specifications of the generalized linear model with threshold interaction effects can be provided by STIMA on the basis of the nature of the response variable. In this paper, we focus on the binary response case and present results on real and synthetic data in order to compare the performance of STIMA with that of alternative methods (e.g., logistic regression, MARS, Support Vector Machines, Random Forests).
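
The difference between a threshold interaction and a cross-product term can be made concrete with a small Python sketch for the binary response case. It is not STIMA itself: the thresholds and coefficients below are simply given, whereas STIMA estimates the trunk (and hence the thresholds) from the data. All names and values are assumptions.

```python
import math

def threshold_interaction(x1, x2, c1, c2):
    """Indicator of the region {x1 <= c1 and x2 > c2}; c1 and c2 are assumed known here."""
    return 1.0 if (x1 <= c1 and x2 > c2) else 0.0

def logistic_probability(x1, x2, beta, c1, c2):
    """Logistic model with two main effects and one threshold interaction term."""
    b0, b1, b2, b12 = beta
    eta = b0 + b1 * x1 + b2 * x2 + b12 * threshold_interaction(x1, x2, c1, c2)
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients: no main effects, interaction effect of 2 on the logit scale
beta = (0.0, 0.0, 0.0, 2.0)
p_inside = logistic_probability(0.5, 1.0, beta, c1=1.0, c2=0.0)   # inside the region
p_outside = logistic_probability(2.0, 1.0, beta, c1=1.0, c2=0.0)  # x1 exceeds its threshold
```

A cross-product term `b12 * x1 * x2` would instead change the logit smoothly over the whole predictor space; the threshold term shifts it by a constant within one rectangular region only.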

References

CONVERSANO, C. and DUSSELDORP, E. (2010): Simultaneous Threshold Interaction Detection in Binary Classification. In: Lauro, C.N., Greenacre, M.J. and Palumbo, F. (eds.) Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Berlin-Heidelberg, 225–232.
DUSSELDORP, E., CONVERSANO, C. and VAN OS, B.J. (2010): Combining an additive and tree-based regression model simultaneously: STIMA, Journal of Computational and Graphical Statistics, 19, 514–530.

Keywords

GENERALIZED LINEAR MODELING, RECURSIVE PARTITIONING, INTERACTION EFFECTS, CLASSIFICATION TRUNK, REGRESSION TRUNK

Department of Business and Economics, University of Cagliari, [email protected] · Netherlands Organisation for Applied Scientific Research TNO, Leiden, The [email protected]


A Recursive Partitioning-Based Method To Balance Covariates When Estimating Causal Effects

Massimo Cannas1, Claudio Conversano1 and Francesco Mola1

Abstract

Estimation of causal effects from observational data may require prior adjustment to balance the distribution of covariates across treated and control units. We present an empirical method for identifying a balanced group of observations. It is implemented in an algorithm that uses a balance measure criterion to recursively split the original dataset based on the values of the covariates. Observations are finally partitioned into subsets characterized by different degrees of homogeneity. The final subset of observations on which causal inference can be carried out is selected according to a suitably defined threshold measure, and the bootstrap is used to assess the stability of the selection method as well as the properties of the average treatment effect estimators. Results on both simulated and real data illustrate the effectiveness of the proposed approach.
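
How a balance measure can guide a split may be illustrated with a short Python sketch. The abstract does not state which balance measure is used; the absolute standardized mean difference below is an assumption chosen because it is common in the matching literature, and the single split on one covariate is a simplification of a recursive procedure. Data and names are hypothetical.

```python
import math
import statistics

def standardized_mean_difference(treated, control):
    """Difference in covariate means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((statistics.variance(treated) + statistics.variance(control)) / 2)
    diff = statistics.mean(treated) - statistics.mean(control)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else math.inf
    return diff / pooled_sd

def balance_after_split(covariate, is_treated, cutoff):
    """Absolute SMD in each subset after splitting at cutoff (None if a group is too small)."""
    report = {}
    sides = {"left": lambda v: v <= cutoff, "right": lambda v: v > cutoff}
    for name, keep in sides.items():
        treated = [v for v, t in zip(covariate, is_treated) if keep(v) and t]
        control = [v for v, t in zip(covariate, is_treated) if keep(v) and not t]
        if len(treated) < 2 or len(control) < 2:
            report[name] = None
        else:
            report[name] = abs(standardized_mean_difference(treated, control))
    return report

# Hypothetical covariate values and treatment indicators
covariate = [1, 2, 3, 1, 2, 3, 7, 8, 9, 10]
is_treated = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

overall = standardized_mean_difference(
    [v for v, t in zip(covariate, is_treated) if t],
    [v for v, t in zip(covariate, is_treated) if not t],
)
report = balance_after_split(covariate, is_treated, cutoff=5)
```

In this toy example the full sample is imbalanced, while the left subset is perfectly balanced and the right subset contains too few controls to be assessed.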

References

CRUMP, R.K., HOTZ, V.J., IMBENS, G.V. and MITNIK, O.A. (2009): Dealing with limited overlap in estimation of average treatment effects, Biometrika, 96(1): 187–199.
DEHEJIA, R. and WAHBA, S. (1999): Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs, Journal of the American Statistical Association, 94(448): 1053–1062.
IACUS, S. M. and PORRO, G. (2009): Random Recursive Partitioning: a Matching Method for the Estimation of Average Treatment Effects, Journal of Applied Econometrics, 24: 363–385.
TRASKIN, M. and SMALL, D.S. (2011): Defining the Study Population for an Observational Study to Ensure Sufficient Overlap: A Tree Approach, Statistics in Biosciences, 3: 94–118.

Keywords

CAUSAL INFERENCE, RECURSIVE PARTITIONING, BOOTSTRAP

University of Cagliari, Department of Business and [email protected], [email protected], [email protected]


Recursive Partitioning for Hybrid Image Classification using Captions and Image Features

Adalbert Wilhelm1

Abstract

Methods for finding groups of similar objects in large data sets with the purpose of facilitating data interpretation play an important role in exploratory data analysis. However, classical cluster analysis methods do not scale well with an increased number of objects and/or dimensions. Recent work in the field has focused on designing algorithms that can overcome these difficulties while providing meaningful solutions. We propose a projection-based hierarchical partitioning method inspired by the OptiGrid algorithm. Given a data sample, the present algorithm searches for low-density points (local minima) in selected-dimensional projections, and partitions the data by a hyperplane passing through the best split point found, if any. Measures such as iterative implementation, sampling of objects and dimensions, and simplified search for projections and local minima ensure the computational efficiency of the algorithm. A comparative evaluation of the algorithm is presented based on synthetic and reference data. Performance of the algorithm is explicated for some image analysis tasks.

References

ILIES, I. and WILHELM, A. (2010): Projection-Based Partitioning for Large, High-Dimensional Datasets. Journal of Computational and Graphical Statistics, 19, 474–492.
SCHOBER, J.-P., HERMES, T. and HERZOG, O. (2005): PictureFinder: Description logics for semantic image retrieval. In: Proceedings of the 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, 1571–1574.
SIVIC, J. and ZISSERMAN, A. (2003): Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, 1470–1477.

Keywords

DIMENSION REDUCTION, HIERARCHICAL PARTITIONING, IMAGE CLASSIFICATION, PERFORMANCE MEASURES

School of Humanities and Social Sciences, Jacobs University Bremen, Campus Ring 1, 28759 Bremen, Germany, [email protected]


Change of Aspects of Industrial Classification System from Hierarchical Structure to Network Structure

Hiroki Furuzumi1, Yoshiro Matsuda2, and Yasumasa Baba3

Abstract

Countries around the world employ their own SIC (Standard Industrial Classification) schemes, such as JSIC in Japan and NAICS (North American SIC System), the common SIC of the USA, Canada, and Mexico. The most common SIC of industries is the ISIC (International SIC) scheme by UNSD. As these SIC schemes are used for encoding the activities of each establishment of a company but not of the company itself, assigning a unique industrial classification code to each company is a hard problem for the statistical officers of every country. In order to classify an establishment by its economic activities and/or amounts of turnover, we have to face the fact that the majority of companies operate plural establishments with different activities. Most SIC schemes are composed of several levels, from the bottom, or most detailed, level to an upper aggregated level, i.e. they are hierarchical classification schemes. The broadest boundary in classifying industries lies between the sphere of goods and services and that of monetary aspects. One way to assign a code to a company which runs plural businesses is to assign only one position, leaving the number of businesses out of account; the other extreme is to assign one position in the upper level of aggregation.

A more adequate way of classification, however, is to abandon the hierarchical classification scheme and to use a network structure instead. Using micro data sets of the Financial Statements Statistics of Corporations by Industry of the Ministry of Finance Japan, we propose to reclassify each company by its first and second turnover. This yields a different scheme from the one obtained using only the first turnover in a hierarchical classification scheme. For example, a certain company may occupy positions both in the real industrial classification sphere and in the monetary one. Expressing those relations in a network structure makes the company's position in the industrial classification scheme clearer. We propose a different classification criterion for companies and establishments.

Keywords

PLURAL ESTABLISHMENT ENTERPRISES, MICRO DATA SETS, JSIC, NAICS

University of Hyogo. Kobe, [email protected] · Aomori Public College. Aomori, [email protected] · The Institute of Statistical Mathematics. Tokyo, [email protected]


Econometric Models of Durable Goods’ Prices: A Hedonic Approach

Anna Król1

Abstract

The classic demand-supply models of commodity prices in principle establish the market equilibrium price of a certain good at the intersection of the curves representing the quantities offered by producers and the quantities demanded by consumers. In contrast to those models, the hedonic approach links the price of a good with the set of its attributes that are valued by buyers and significant for manufacturers. The model representing this relationship, referred to as hedonic regression, makes it possible to price the commodity and to estimate the prices of its respective attributes (so-called implicit prices), including prices which are not directly observable on the market (e.g. the commodity’s brand).
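
The notion of implicit prices can be illustrated with a minimal hedonic regression in Python. The sketch assumes a linear functional form (log-linear forms are also common), fits it by ordinary least squares through the normal equations, and uses made-up laptop offers constructed without noise; none of the attributes, values or names come from the paper's database.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical offers: (RAM in GB, premium brand dummy) and asking price
attributes = [(4, 0), (8, 0), (4, 1), (8, 1), (16, 0), (16, 1)]
prices = [1800.0, 2600.0, 2100.0, 2900.0, 4200.0, 4500.0]

X = [[1.0, float(ram), float(brand)] for ram, brand in attributes]
intercept, price_per_gb, brand_premium = ols(X, prices)
```

The coefficient on the brand dummy is the implicit price of the brand: it is never quoted on the market, but it is recovered from the way offer prices vary with attributes.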

This paper presents a hedonic analysis of prices for two groups of durable goods, used cars and laptop computers, making use of an extensive database of offers gathered by the author. The research provides insights into consumers' preferences toward different varieties of the analysed commodities, and introduces estimates of market valuations of significant characteristics of the goods.

References

NESHEIM, L. (2006): Hedonic Price Functions. CeMMAP working papers CWP18/06. Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
TRIPLETT, J. (1986): The Economic Interpretation of Hedonic Methods. Survey of Current Business, 36(1), 36–40.
WOOLDRIDGE, J.M. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, Cambridge.

Keywords

HEDONIC PRICE METHODS, DURABLE GOODS, IMPLICIT PRICES

Wrocław University of [email protected]


Smart Growth Versus Economic And Social Cohesion – Econometric Panel Analysis

Beata Bal-Domanska1 and Elzbieta Sobczak1

Abstract

Within the framework of the EU Europe 2020 strategy, smart growth is listed as one of the leading policy objectives aimed at improving the situation in such domains as education, research and innovation, as well as digital society. It can be demonstrated that smart growth represents a set of instruments which are supposed to result in dynamic growth and therefore enhance economic and social cohesion, increasing the population's quality of life.

The objective of the paper is to evaluate the relations between smart growth, defined from the perspective of three pillars (smart specialization, creativity and innovation), and economic and social cohesion. Aggregate measures with a common growth pattern were used to measure smart growth and economic and social cohesion, as they represent complex phenomena. They became the basis for the construction of econometric models allowing for the assessment of the influence of smart growth on economic and social cohesion. Estimation techniques for panel data were used to describe mutual relations between these phenomena. The study covered the European Union countries.

References

A strategy for smart, sustainable and inclusive growth, European Commission, Communication from the Commission EUROPE 2020, Brussels, 3.3.2010.
ARELLANO M. (2003): Panel Data Econometrics. Oxford: Oxford University Press.
WALESIAK M. (2011): Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R [General distance measure GDM in statistical multivariate analysis applying R programme]. Wrocław University of Economics Publishing House, Wrocław.
WOOLDRIDGE J.M. (2002): Econometric analysis of cross section and panel data. Massachusetts Institute of Technology.

Keywords

SMART GROWTH, ECONOMIC AND SOCIAL COHESION, AGGREGATE MEASURES, PANEL MODELS

Wrocław University of Economics, Department of Regional Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]


Workflow Classification Based On The K-Means Partitioning

Etienne Lord, Abdoulaye Baniré Diallo, and Vladimir Makarenkov

Abstract

Workflow applications can be described as collections of tasks and the related links defined for being processed in a well-established order. Many complex scientific and business processes can be modeled using workflow pipelines (Van der Aalst, 2011). Usually, workflows are organized to minimize the total cost and duration of the included operations. Classification and effective integration of workflows is a growing concern when interdisciplinary scientific projects are designed or when large organizations merge and need to integrate their business processes. In particular, the issue of clustering the existing workflow pipelines into larger and more effective workflows becomes more and more relevant. We propose to use the weighted version of the k-means partitioning algorithm (Makarenkov and Legendre, 2001) in order to provide a classification of the given set of workflows. Two versions of the optimization criterion will be considered, the first one allowing for clustering workflows with similar topological features (i.e. tasks, links) and the second one allowing for regrouping workflows depending on both the topological features and the execution time. We will present an application of our classification technique on workflows generated by our Armadillo platform (Lord et al. 2012).

References

VAN DER AALST, W.M.P. (2011): Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin.
MAKARENKOV, V. and LEGENDRE, P. (2001): Optimal variable weighting for ultrametric and additive trees and k-means partitioning: methods and software. Journal of Classification, 18, 245–271.
LORD, E. et al. (2012): Armadillo 1.1: An Original Workflow Platform for Designing and Conducting Phylogenetic Analysis and Simulations. PLoS One, 7(1), e29903.

Keywords

BIOINFORMATICS WORKFLOWS, K-MEANS PARTITIONING, WORKFLOW CLASSIFICATION

Département d’Informatique, Université du Québec à Montréal, Montréal, [email protected], [email protected], [email protected]


Functional Principal Component Analysis with R

Malgorzata Sej-Kolasa1 and Miroslawa Sztemberg-Lewandowska2

Abstract

Principal component analysis (PCA) transforms the original set of variables into a new orthogonal set of variables called principal components. Functional principal component analysis (FPCA) has the same advantages as classical principal component analysis. Moreover, it makes it possible to analyze dynamic data. The main difference between them is that PCA is based on multidimensional data, whereas FPCA is based on functional data. Functional data are curves, surfaces or anything else varying over a continuum; they are not single observations. The purpose of this article is to describe the stages of functional principal component analysis and to present selected packages and functions of the R system for implementing these stages. In addition, the authors show the usefulness of functional principal component analysis for analyzing longitudinal data.

References

HALL P., MÜLLER H. G., WANG J. L. (2006): Properties of Principal Component Methods for Functional and Longitudinal Data Analysis. The Annals of Statistics, Vol. 34, No. 3, 1493–1517.
INGRASSIA S., COSTANZO G. D. (2005): Functional principal component analysis of financial time series. In: Vichi M., Monari P., Mignani S., Montanari A. (Eds.) New Developments in Classification and Data Analysis. Springer-Verlag, Berlin, 351–358.
RAMSAY J. O., SILVERMAN B.W. (2005): Functional Data Analysis. Springer.
RAMSAY J.O., HOOKER G., GRAVES S. (2009): Functional Data Analysis with R and MATLAB. Springer.

Keywords

FUNCTIONAL DATA, FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS, R SYSTEM, LONGITUDINAL DATA

Department of Econometrics and Computer Science, Wroclaw University of Economics, [email protected] · Department of Econometrics and Computer Science, Wroclaw University of Economics, [email protected]


Implementation of Time Series Methods of Forecasting in TSprediction R Package

Tomasz Bartłomowicz

Abstract

The paper presents the Time Series Prediction (TSprediction) package developed for the R program. The package contains an implementation of the most popular time series forecasting methods, which include: time series models with trend (e.g. analytical models, Holt model), time series exponential smoothing models (e.g. simple exponential smoothing model, seasonal smoothing model), time series models with seasonal fluctuations (e.g. Winters' seasonal multiplicative model, Winters' seasonal additive model, model with cyclical component), moving average time series models (e.g. simple moving average model) and autoregressive time series models (ARMA and ARIMA models).
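
As an illustration of the simplest of the listed methods, simple exponential smoothing can be written in a few lines. The sketch is in Python and does not reproduce the interface of the TSprediction package, which is not described here; the function name, the initialisation of the level with the first observation and the toy series are assumptions.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecasts for each period and the forecast for the next one."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0]  # assumed initialisation: first observation
    fitted = []
    for y in series:
        fitted.append(level)                     # forecast made before y is observed
        level = alpha * y + (1 - alpha) * level  # update the smoothed level
    return fitted, level

# Hypothetical short series
fitted, next_forecast = simple_exponential_smoothing([10, 12, 14], alpha=0.5)
```

A larger smoothing constant alpha makes the forecast react faster to recent observations; a smaller one gives a smoother, slower response.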

In addition to time series forecasting methods, the TSprediction package contains functions for computing the most important ex post forecast errors: mean error (ME), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), mean percentage error (MPE) and mean absolute percentage error (MAPE).
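
The six ex post error measures follow directly from their definitions, as the Python sketch below shows. It is not the TSprediction implementation; it assumes the sign convention error = actual − forecast, expresses the percentage measures in percent, and requires non-zero actual values. The function name and the numbers are hypothetical.

```python
import math

def forecast_errors(actual, forecast):
    """Ex post forecast accuracy measures for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]  # assumed convention: actual - forecast
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MPE": 100 * sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }

# Hypothetical actual values and forecasts
measures = forecast_errors([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```

ME and MPE reveal systematic bias because positive and negative errors cancel, whereas MAE, MSE, RMSE and MAPE measure the size of the errors regardless of sign.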

Functions of the TSprediction package will be illustrated with examples of applications in empirical time series forecasting.

References

CIESLAK M. (1997), Prognozowanie gospodarcze. Metody i zastosowania. PWN, Warszawa.
COWPERTWAIT P.S.P., METCALFE A.V. (2008), Introductory Time Series with R. Springer, New York.

Keywords

FORECASTING, TIME SERIES, R PROGRAM

Wrocław University of Economics, Department of Econometrics and Computer Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]


Latest developments of the RSDA: An R package for Symbolic Data Analysis

Oldemar Rodríguez1 and Johnny Villalobos2

Abstract

In this new version of the R package RSDA we have integrated the package R2S, which was developed to transform relational data into symbolic data with the R package RSDA. The main feature of this package is the possibility to take into account different types of symbolic variables (continuous, interval, histogram or multi-valued).

Methods like centers interval principal components analysis, histogram principal components analysis, multi-valued correspondence analysis and linear regression models have been implemented in this version. This new version also includes new features to manipulate symbolic data through a new data structure that implements Symbolic Data Frames.

RSDA includes functions to transform relational data into symbolic data. This feature uses a new set of base data types (continuous, interval, histogram or multi-valued), symbolic operators and SQL functions to allow the creation of symbolic tables directly in the database. The new types are implemented in the database management system PostgreSQL, a powerful open source object-relational database system. PostgreSQL is released under the PostgreSQL License, a liberal open source license similar to the BSD or MIT licenses, so we have permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee.

References

BOCK H-H. and DIDAY E. (eds.) (2000). Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer, Germany.
CHAMBERS, J.M. (2008). Software for Data Analysis: Programming with R. Springer, New York.
EVERITT B.S. and HOTHORN T. (2010). A Handbook of Statistical Analysis Using R. Chapman & Hall book, Florida.
RODRIGUEZ O. and VILLALOBOS J. (2011). RSDA: An R package for Symbolic Data Analysis. Workshop In Symbolic Data Analysis Namur, Belgium.
R DEVELOPMENT CORE TEAM (2007). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.
THE POSTGRESQL GLOBAL DEVELOPMENT GROUP (2012). R: PostgreSQL Developer’s Guide. PostgreSQL Development Team. http://www.postgresql.org.

Keywords

INTERVAL DATA, HISTOGRAM DATA, POSTGRESQL, RELATIONAL DATA BASE, SYMBOLIC DATA ANALYSIS.

CIMPA, School of Mathematics, University of Costa [email protected] // [email protected] · School of Computer Science, National University, Costa [email protected]


Microeconometrics Multinomial Logit Models and their Implementations in MMLM R Package

Andrzej Bak1 and Tomasz Bartłomowicz2

Abstract

Microeconometric logit models are useful in the analysis of categorical data (microdata describing individuals), often collected in marketing research based on discrete choices. Among microeconometric models for unordered categories, the most frequently used are the multinomial logit model (MNLM), the conditional logit model (CLM) and the mixed logit model (MLM). The main distinction between those models is as follows: MNLM focuses on the individual as the unit of analysis and uses the individual's characteristics as explanatory variables; CLM focuses on the set of alternatives, and the explanatory variables are characteristics of those alternatives; MLM focuses on both individuals and the choice options (alternatives), and the explanatory variables are characteristics of individuals and alternatives.
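
The distinction between MNLM and CLM shows up in how the utilities entering the choice probabilities are built. The Python sketch below computes choice probabilities for given coefficients; it does not estimate the models and does not reproduce the interface of the MMLM package. Function names, coefficients and data are assumptions.

```python
import math

def softmax(utilities):
    """Logit choice probabilities from a list of utilities (numerically stabilised)."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def mnlm_probabilities(individual, coefs_by_alternative):
    """MNLM: individual characteristics, one coefficient vector per alternative."""
    return softmax([sum(b * x for b, x in zip(beta, individual))
                    for beta in coefs_by_alternative])

def clm_probabilities(alternatives, beta):
    """CLM: characteristics of each alternative, one common coefficient vector."""
    return softmax([sum(b * z for b, z in zip(beta, attrs)) for attrs in alternatives])

# Hypothetical examples
p_mnlm = mnlm_probabilities([1.0, 35.0], [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
p_clm = clm_probabilities([[0.0], [1.0]], beta=[math.log(3)])
```

In the MNLM call the same individual is described once and each alternative has its own coefficients; in the CLM call each alternative has its own attribute values and the coefficients are shared.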

The main aim of this paper is to present the Microeconometrics Multinomial Logit Models (MMLM) package developed for the R program, which can be used to estimate the probability that an individual chooses a given alternative from a set of alternatives. The package contains an implementation of multinomial, conditional and mixed logit models, together with functions which can be used in discrete choice methods to design the research (e.g. to build a fractional factorial design), encode the alternatives, estimate the models, etc. Functions of the MMLM package will be illustrated with examples of applications in the empirical analysis of consumer preferences.

References

AGRESTI A. (2002), Categorical Data Analysis. Second Edition, Wiley, New York.
CAMERON A.C., TRIVEDI P.K. (2005), Microeconometrics. Methods and Applications. Cambridge University Press, New York.
JACKMAN S. (2007), Models for Unordered Outcomes. Political Science 150C/350C. http://jackman.stanford.edu/classes/350C/07/unordered.pdf (12.03.2012).
SO Y., KUHFELD W.F. (1995), Multinomial Logit Models. http://support.sas.com/techsup/technote/mr2010g.pdf (12.03.2012).
WINKELMANN R., BOES S. (2006), Analysis of Microdata. Springer, Berlin.

Keywords

MICROECONOMETRICS, DISCRETE CHOICE MODELS, PREFERENCES, R PROGRAM

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] · Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]


Latent Spaces of the Product Baskets - A Hybrid Model of On-line Shopping

Adam Sagan1 and Mariusz Łapczynski2

Abstract

The aim of the paper is to identify the latent relations between product choices (on-line shopping data) using an integrated hybrid model of market basket and latent space analysis.

A large number of association rules were post-mined (Zhao, Zhang and Cao 2009) by combining them with social network analysis (SNA), which explains the relational properties (with respect to support, confidence and lift indices) of the products network (Raeder and Chawla 2011).
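
The three rule indices mentioned above have simple definitions, shown here as a Python sketch on a handful of made-up baskets. It covers only the computation of support, confidence and lift for one rule, not rule mining or the network analysis; the function name and the data are assumptions.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    n_antecedent = sum(antecedent <= b for b in baskets)  # baskets containing the antecedent
    n_consequent = sum(consequent <= b for b in baskets)
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur in some basket")
    support = n_both / n
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n)
    return support, confidence, lift

# Hypothetical baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
```

A lift below 1, as in this toy example, means the two products appear together less often than would be expected if they were bought independently.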

We propose the model-based Latent Space Analysis for clustering the products network using two-stage maximum likelihood and Bayesian MCMC estimation (Handcock, Raftery and Tantrum 2007). The optimal number of segments was found on the basis of AIC/BIC criteria.

Relational properties of product networks are explained using alternative-specific logit p* models and autocorrelation statistics (Wasserman and Pattison 1996). The R package latentnet, UCINET and Mplus were used for the estimations.

References

HANDCOCK, M. S., RAFTERY, A. E. and TANTRUM, J. M. (2007): Model-Based Clustering for Social Networks, Journal of Royal Statistical Society, 170(2), 301–354.
RAEDER, T. and CHAWLA, N. V. (2011): Market Basket Analysis with Networks, Social Network Analysis and Mining, 1(2), 97–113.
WASSERMAN, S. and PATTISON M. (1996): Logit Models and Logistic Regressions for Social Networks: an Introduction to Markov Graph and p*, Psychometrika, 61(3), 401–425.
ZHAO, Y., ZHANG, Ch. and CAO, L. (2009): Post-mining of Association Rules: Techniques for Effective Knowledge Extraction, Information Science Reference.

Keywords

MARKET BASKET ANALYSIS, SOCIAL NETWORK AUTOCORRELATION, LATENT SPACE MODEL

Cracow University of [email protected] · Cracow University [email protected]


Multilevel Principal Covariates Regression

Marlies Vervloet, Wim Van den Noortgate, Katrijn Van Deun and Eva Ceulemans

Abstract

Principal Covariates Regression (PCovR; De Jong & Kiers, 1991) is a weighted combination of Principal Component Analysis (PCA) and linear regression. Like PCA, PCovR reduces the predictors to a few components and, like regression, it predicts the criteria, but on the basis of the components. The extent to which both aspects play a role when constructing the components is determined by a weighting parameter that has to be specified by the user. In this paper, we extend PCovR to multilevel data (e.g. persons nested in groups). As part of the criterion variance of such data can be attributed to between-group differences while another part is due to within-group differences, the method first splits the data into a between-group part and a within-group part (for a similar approach, see Timmerman, 2006). Subsequently, a separate PCovR analysis is conducted on the between-group part and on the within-group part. Multilevel PCovR involves a few model selection challenges, as for both the between-group and the within-group model, an appropriate number of components and weighting parameter value need to be chosen. To this end, we propose some model selection strategies, based on the work of Vervloet et al. (in press). The use of these strategies and the interpretation of the resulting model are illustrated by means of a real-data application.
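
The first step, splitting the data into a between-group and a within-group part, can be sketched for a single variable in Python. This shows only the decomposition into group means and deviations from them, under the assumption that no further centering or scaling is applied; the subsequent PCovR analyses are not shown, and the names and data are hypothetical.

```python
def split_between_within(values, groups):
    """Split each score into its group mean (between part) and its deviation from it (within part)."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    means = {g: sums[g] / counts[g] for g in sums}
    between = [means[g] for g in groups]
    within = [v - means[g] for v, g in zip(values, groups)]
    return between, within

# Hypothetical scores of four persons nested in two groups
scores = [1.0, 3.0, 10.0, 14.0]
group_ids = ["a", "a", "b", "b"]
between, within = split_between_within(scores, group_ids)
```

The two parts add up to the original scores, and the within part sums to zero in every group, so each part can be analysed separately.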

References

DE JONG, S. and KIERS, H.A.L. (1991): Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155–164.
TIMMERMAN, M.E. (2006): Multilevel component analysis. British Journal of Mathematical and Statistical Psychology, 59, 301–320.
VERVLOET, M., VAN DEUN, K., VAN DEN NOORTGATE, W. and CEULEMANS, E. (in press): On the selection of the weighting parameter value in Principal Covariates Regression. Chemometrics and Intelligent Laboratory Systems.

Keywords

MULTICOLLINEARITY, REGRESSION, MULTILEVEL DATA

KU Leuven, Belgium


Three-step Estimation Method For Discrete Micro-Macro Multilevel Models

M. Bennink1, M. A. Croon1 and J. K. Vermunt1

Abstract

In ‘reversed’ multilevel analysis, a group-level outcome is explained by means of individual- and/or group-level predictors using a latent variable model (Croon and van Veldhoven, 2007). The scores of the individual-level units are treated as indicators of a latent variable defined at the group level and the outcome variable is regressed on this latent group-level variable.

Maximum likelihood estimators can be obtained by estimating the model in one step. This one-step approach is not very practical to apply, especially when one wishes to use more than just a few lower-level predictors.

A solution would be to apply a three-step estimation method with a correction for classification error (Bolck, Croon, Hagenaars 2004; Vermunt 2010; Bakk, Tekle, & Vermunt, in press). The application of this three-step method to discrete micro-macro multilevel models is discussed in the current presentation.
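To make the "correction for classification error" more concrete, the sketch below estimates the classification error matrix that three-step approaches in the cited literature rely on: the probability of being assigned to class s given true class t, computed from the posterior class probabilities of the first step. It is only an illustration under assumptions: modal assignment is assumed, the function name and the toy posteriors are hypothetical, and the final corrected step-three analysis is not shown.

```python
import numpy as np

def classification_error_matrix(posteriors):
    """Estimate P(W = s | X = t) from step-one posterior class probabilities.

    posteriors: (n_units x n_classes) array whose rows sum to one.
    X is the latent class, W the class assigned by the modal rule (an assumption).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    n_units, _ = posteriors.shape
    assigned = np.zeros_like(posteriors)
    assigned[np.arange(n_units), posteriors.argmax(axis=1)] = 1.0  # modal assignment
    joint = posteriors.T @ assigned / n_units   # entry (t, s) estimates P(X = t, W = s)
    class_sizes = posteriors.mean(axis=0)       # estimates P(X = t)
    return joint / class_sizes[:, None]         # rows: true class t, columns: assigned class s

# Hypothetical posteriors for four units (e.g. groups) and two latent classes.
posteriors = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
D = classification_error_matrix(posteriors)
```

Each row of `D` sums to one; off-diagonal entries quantify how often units of one class are assigned to another, which is the information the third step uses to adjust its estimates.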

References

BAKK, Zs., TEKLE, F. and VERMUNT, J. K. (in press): Estimating the association between latent class membership and external variables using bias adjusted three-step approaches. Sociological Methodology.
BOLCK, A., CROON, M. A. and HAGENAARS, J. A. (2004): Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12, 3–27.
CROON, M. A. and VAN VELDHOVEN, M. J. P. M. (2007): Predicting Group-level Outcome Variables from Variables Measured at the Individual Level: A Latent Variable Multilevel Model. Psychological Methods, 12, 45–57.
VERMUNT, J. K. (2010): Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18, 450–469.

Keywords

THREE-STEP APPROACH, MULTILEVEL ANALYSIS, MICRO-MACRO ANALYSIS, GENERALIZED LATENT VARIABLE MODELS

Tilburg University, Tilburg, the [email protected]


Single-array SNP Genotype Classification With Semi-Parametric Log-Concave Mixtures

Paul H.C. Eilers1 and Ralph C.A. Rippe2

Abstract

SNP (pronounced “snip”) stands for single nucleotide polymorphism: a position on the genome (DNA) that differs between individual organisms. For humans, millions of SNPs have been located. Using microarrays, the state of up to a million SNPs can be measured at the same time, using a single drop of blood or a small amount of body tissue. The results are being used on a very large scale in genome-wide association scans, in which observable properties of many individuals are regressed on SNP states.

Each SNP has two alleles, which we indicate here by A and B. Because DNA is organized in chromosomes, and chromosomes form pairs, the state of a SNP, its genotype, can be AA, AB or BB (it is not possible to discriminate between AB and BA). A crucial step is the assignment of genotypes to all SNPs for each person in a study.

Microarray technology is based on chemical fluorescence. Unfortunately, this technique is far from perfect and so clustering methods are needed. Commonly this is implemented by estimating the AA, AB and BB clusters and cluster memberships for each SNP in turn, for a set of microarrays.

We have developed an alternative approach, in which all SNPs on one array are clustered at the same time. It is based on estimating a mixture of three two-dimensional semi-parametric densities, using tensor product P-splines to model their logarithms. The penalties have been chosen in such a way that they force the component densities to be log-concave.

Genotyping whole arrays has large logistical advantages, both in speed and in organization of the workflow. We present the theory behind our proposal and, by applying it to samples from the HapMap archive, we show its excellent performance.

Keywords

SPLINES, PENALTIES, GENOTYPING

Erasmus University Medical Center, Rotterdam, The [email protected] · Leiden University, Leiden, The [email protected]


On Featureless K-Means Clustering

Sergey D. Dvoenko1

Abstract

In a featureless case the set of objects is represented only by results of pairwise comparisons in the form of a distance, similarity or kernel-based matrix. Since the publication by W.S. Torgerson, cluster centers can be represented by their distances to other objects without using the feature space itself, an idea that has recently become popular as “kernel k-means”.

We show how k-means clustering can be executed with no computations related to cluster centers at all. This procedure, referred to as meanless featureless k-means clustering, performs permutations of the square (dis)similarity matrix and results in the same clustering for both the featureless and the feature-based case.

It is shown that some heuristic clustering algorithms for “diagonalization” of similarity matrices, popular in Russia, are suboptimal versions of the meanless featureless k-means procedure if the matrix is positive semidefinite.
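The center representation attributed above to Torgerson can be illustrated with a short sketch that runs k-means from a matrix of squared distances alone. It is not the meanless procedure of the abstract, which avoids center-related computations altogether; the function names, the empty-cluster safeguard and the toy data are assumptions. The identity used is that the squared distance from object i to the center of cluster C equals the mean of d²(i, j) over j in C minus the sum of d²(j, l) over all j, l in C divided by 2|C|².

```python
import numpy as np

def sq_dist_to_centers(D2, labels, k):
    """Squared distances from every object to every cluster center, using D2 only.

    D2: (n x n) matrix of squared pairwise distances; labels: cluster index per object.
    """
    out = np.empty((D2.shape[0], k))
    for c in range(k):
        members = np.flatnonzero(labels == c)
        m = len(members)
        mean_to_members = D2[:, members].mean(axis=1)
        within = D2[np.ix_(members, members)].sum() / (2.0 * m * m)
        out[:, c] = mean_to_members - within
    return out

def featureless_kmeans(D2, labels, k, n_iter=100):
    """Lloyd-style iterations driven by the distance matrix alone."""
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        new_labels = sq_dist_to_centers(D2, labels, k).argmin(axis=1)
        # stop on convergence, or if a cluster would become empty (simple safeguard)
        if np.array_equal(new_labels, labels) or len(np.unique(new_labels)) < k:
            break
        labels = new_labels
    return labels

# Hypothetical data: features are used only to build D2 and to check the result.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(4, 0.2, (10, 2))])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
init = np.array([0, 0, 0] + [1] * 17)      # deliberately unbalanced starting partition
labels = featureless_kmeans(D2, init, k=2)
```

For Euclidean data the distances computed this way coincide with those to the explicit centroids, so the resulting partition matches feature-based k-means started from the same partition.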

References

DHILLON, I., GUAN, Y. and KULIS, B. (2004): Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the 10th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining. ACM New York, NY, USA, 551–556.
SCHÖLKOPF, B. and SMOLA, A. (2002): Learning with kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge.
BRAVERMAN, E.M. and others (1971): Diagonalization of the relation matrix and detecting of hidden factors. Trans. of Institute of Control Sciences. 1st Issue "Problems of increasing of automata possibilities", Moscow, Institute of Control Sciences, 42–79. (in Russian)
TORGERSON, W.S. (1958): Theory and Methods of Scaling. Wiley, N.Y.

Keywords

K-MEANS, FEATURELESS, MEANLESS, DISTANCE, SIMILARITY

State University of Tula, Russia,[email protected]


Two Major Least-squares Divisive Clustering Methods: Bisecting K-Means, PDDP and in between

E. Kovaleva1 and B. Mirkin2

Abstract

We first show that both bisecting k-means and principal direction partitioning (PDDP) are suboptimal methods for the same least-squares criterion with ternary bases corresponding to rooted binary trees. Also, we combine these two by using projection of data to a number of random directions rather than to one principal direction. To specify a divisive algorithm, one is to choose: (a) next cluster to split; (b) rule to stop splitting; (c) cluster splitting method. We choose a representative subset of divisive clustering options and compare them experimentally by using a specially designed Gaussian cluster structure generator. Most options are unstable as noise increases. The PDDP method, recently modified by using the minima of the principal direction density function to specify all of (a), (b), and (c) above, appears to be unequivocally winning in most experiments. Yet in really noisy situations, with both between-cluster overlap and random entities, the winner is bisecting k-means with random directions.
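A single cluster-splitting step, item (c) above, can be sketched for both ingredients: a PDDP-style split by the sign of the projection onto the first principal direction, and a variant that tries several random directions and keeps the split with the lowest least-squares criterion. The second function is only one plausible reading of "projection to a number of random directions"; all names, the number of directions and the selection rule are assumptions.

```python
import numpy as np

def split_by_direction(X, direction):
    """Split rows of X by the sign of their centered projection on a direction."""
    centered = X - X.mean(axis=0)
    return centered @ direction > 0

def pddp_split(X):
    """PDDP-style split: project onto the first principal direction."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return split_by_direction(X, vt[0])

def sse(X, mask):
    """Least-squares criterion: summed squared distances to the two part means."""
    return sum(((part - part.mean(axis=0)) ** 2).sum() for part in (X[mask], X[~mask]))

def random_direction_split(X, n_directions=20, seed=0):
    """Try several random directions and keep the split with the smallest criterion."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_directions):
        mask = split_by_direction(X, rng.normal(size=X.shape[1]))
        if mask.all() or not mask.any():
            continue                      # skip degenerate one-sided splits
        if best is None or sse(X, mask) < sse(X, best):
            best = mask
    return best

# Hypothetical data: two Gaussian clusters along the first axis.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 0.3, (15, 2)), rng.normal([5, 0], 0.3, (15, 2))])
pddp_mask = pddp_split(X)
random_mask = random_direction_split(X)
```

Applying either split recursively, together with rules for (a) and (b), would give a full divisive algorithm.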

References

BOLEY, D. (1998): Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 2(4), 325–344.
MIRKIN, B. (1996): Mathematical Classification and Clustering. Kluwer, Dordrecht, 448.
MIRKIN, B. (2011): Choosing the number of clusters. WIRE Data Mining and Knowledge Discovery, 1, 252–260.
TASOULIS, S.K., TASOULIS, D.K. and PLAGIANAKOS, V.P. (2010): Enhancing Principal Direction Divisive clustering. Pattern Recognition, 43, 3391–3411.

Keywords

CLUSTERING, LEAST-SQUARES APPROACH, PDDP, BISECTING K-MEANS

NRU Higher School of Economics, Moscow, [email protected] · NRU Higher School of Economics, Moscow, [email protected]


Scoring Dissimilarity between Binary Images by Aligning Series of Skeleton Primitives

Olesya A. Kushnir1 and Oleg S. Seredin2

Abstract

We propose a method for matching images by converting information about their skeletons into a series of primitives. To build a series of skeleton primitives, we traverse the skeleton counterclockwise starting from a terminal node. Each edge on the way contributes a primitive to the series as a pair of reals, the first expressing the edge’s length, and the second, the angle between the current edge and the next edge.

To compare two skeletons, we optimally align their series of primitives by using the dynamic programming approach. The alignment score is translated into our dissimilarity function. To improve the accuracy of a classifier built over the similarity, we incorporate a third real into the primitive, which is related to the radial size of the skeleton in the respective node. We apply this to classify medicinal plant leaves.
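The alignment step can be sketched as a standard global alignment by dynamic programming over series of (length, angle) primitives. The abstract does not specify the cost functions, so the matching cost, the gap cost and the example series below are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def primitive_cost(p, q):
    """Cost of matching two primitives (length, angle); absolute differences are assumed."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def gap_cost(p):
    """Cost of leaving a primitive unmatched (assumed: its own magnitude)."""
    return abs(p[0]) + abs(p[1])

def alignment_dissimilarity(a, b):
    """Optimal global alignment score of two primitive series by dynamic programming."""
    n, m = len(a), len(b)
    table = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        table[i, 0] = table[i - 1, 0] + gap_cost(a[i - 1])
    for j in range(1, m + 1):
        table[0, j] = table[0, j - 1] + gap_cost(b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i, j] = min(
                table[i - 1, j - 1] + primitive_cost(a[i - 1], b[j - 1]),  # match
                table[i - 1, j] + gap_cost(a[i - 1]),                      # skip in a
                table[i, j - 1] + gap_cost(b[j - 1]),                      # skip in b
            )
    return float(table[n, m])

# Hypothetical series: the second skeleton has one extra short edge.
series_a = [(1.0, 0.5), (2.0, 1.0), (1.5, 0.2)]
series_b = [(1.0, 0.5), (0.5, 0.1), (2.0, 1.0), (1.5, 0.2)]
score = alignment_dissimilarity(series_a, series_b)
```

With these assumed costs the only penalty is the unmatched extra edge. Adding a third component to each primitive, as the abstract suggests, would only change the two cost functions.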

References

BYSTROV, M. YU. (2011): Structural approach application for recognition of binary image skeleton. In: Proceedings of Petrozavodsk State University, 2 (115), 76–80 (in Russian).
GUSFIELD, D. (1997): Algorithms on Strings, Trees, and Sequences. Cambridge University Press, University of California, Davis.
MESTETSKIY, L. and SEMENOV, A. (2008): Binary image skeleton - continuous approach. In: Proceedings of the Third International conference on computer vision theory and applications (VISAPP 2008), 1, 251–258.
MOTTL, V.V., BLINOV, A.B., KOPYLOV, A.V. and KOSTIN, A.A. (1998): Optimal Product Positioning Based on Paired Comparison Data. In: Graph-Based Representations in Pattern Recognition (J.-M. Jolion and W.G. Kropatsch, ed.) Computing, Supplement 12. Springer-Verlag/Wien, 135–145.

Keywords

SKELETON, PRIMITIVE, ALIGNMENT, DISSIMILARITY

Tula State University, Tula, [email protected] · Tula State University, Tula, [email protected]


Least-squares Consensus Clustering versus: (a) other Consensus Approaches and (b) K-Means

A. Shestakov1 and B. Mirkin2

Abstract

We take on two criteria for consensus clustering proposed by Mirkin and Muchnik (1981, in Russian) and optimize them with similarity clustering approaches described in (Mirkin, 2005, 2012). Given a set of partitions R on the same entity set, one criterion is to find a partition r that is behind those in R, which is akin to current concepts of ensemble consensus clustering. The other criterion is to build a partition r from R. Both can be equivalently reformulated as similarity clustering criteria; the first working over the conventional consensus matrix, the second over the summary projection matrix. We consider a number of recent clustering consensus methods: Voting Scheme (Weingessel, Dimitriadou, Hornik 2002), Borda Voting (Sevillano, Claudi Socoro, Alias 2009), Bayesian (Wang, Shan, Banerjee 2009), Fusion-Transfer consensus (Guenoche 2011), MCLA, CSPA and HGPA (Strehl, Ghosh 2002), and cVote (Ayad, Kamel 2010). For experiments, we take all three types of data: (a) UCI repository datasets, (b) specially drawn two-dimensional “ornaments”, and (c) generated Gaussian cluster datasets. We evaluate found cluster partitions according to their similarity to the partition hidden in data. We address two issues:

1. How do least-squares consensus algorithms fare in comparison with the others? Answer: The least-squares consensus algorithms outperform the others, usually by a large margin.

2. Is it true that the least-squares k-means clustering criterion is a better criterion than consensus? Answer: No. In most situations, the least-squares consensus partition is closer to the hidden partition than that minimizing the k-means criterion. This shows that developing algorithms for reaching deep minima of the k-means criterion may be a wrong idea.
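The conventional consensus matrix mentioned above is easy to compute: entry (i, j) counts the partitions in R that place entities i and j in the same cluster. The sketch below covers only this ingredient; the function name and the three toy partitions are assumptions, and the least-squares optimization over this matrix is not shown.

```python
import numpy as np

def consensus_matrix(partitions):
    """Entry (i, j) counts the partitions that put entities i and j in the same cluster."""
    partitions = np.asarray(partitions)          # shape: (n_partitions, n_entities)
    same = partitions[:, :, None] == partitions[:, None, :]
    return same.sum(axis=0)

# Hypothetical ensemble of three partitions of five entities.
partitions = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
A = consensus_matrix(partitions)
```

Cluster labels only matter within a partition, so relabelling any single partition leaves `A` unchanged.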

Keywords

CONSENSUS CLUSTERING, LEAST SQUARES CONSENSUS, ONE-BY-ONE CLUSTERING, K-MEANS

NRU Higher School of Economics, Moscow, [email protected] · NRU Higher School of Economics,Moscow, [email protected]


Combination of Several Control Charts using Dynamic Weighted Majority Algorithm

Dhouha Mejri1, Claus Weihs2 and Mohamed Limam3

Abstract

In most process control applications, it is assumed that the process output follows a normal distribution with known mean and standard deviation. However, in the real world, data arrive over time and the process concept to be learned is often not stable and may drift over time. Moreover, when monitoring a process with a multivariate normal distribution using Shewhart, CUSUM or EWMA control charts, which are designed to detect large, moderate and small shifts respectively, it has been proposed that good overall performance across different shifts can be obtained by combining control charts. This article presents a new combination of three different control charts using a dynamic ensemble method that copes with concept-drifting data streams, labeled the “Dynamic Weighted Majority” (DWM-WIN) algorithm [MEJ12]. The proposed combination benefits from the online characteristic of the DWM-WIN algorithm in determining the state of the process when a stream of data arrives over time. It consists of two steps. First, the task of determining the state of the process is transformed into a classification problem by treating control charts as classifiers. Second, DWM-WIN is applied as an ensemble method to combine the different control charts. A real dataset with concept drift is used to simulate the combined control chart. The proposed control chart presents an online method for drift detection and improves the overall performance of the individual control charts over the entire process shift range.
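The idea of treating control charts as classifiers and combining them by weighted voting can be sketched in a univariate setting. This is a plain weighted-majority scheme and not DWM-WIN itself, whose windowing and expert management are not described in the abstract; the chart parameters (3-sigma limits, EWMA weight 0.2, CUSUM k = 0.5 and h = 5), the penalty factor `beta` and the availability of true labels after each observation are all assumptions.

```python
import numpy as np

def shewhart_signal(x, mu, sigma, width=3.0):
    """Shewhart chart: signal when a single observation leaves the control limits."""
    return abs(x - mu) > width * sigma

class EwmaChart:
    def __init__(self, mu, sigma, lam=0.2, width=3.0):
        self.mu, self.lam = mu, lam
        self.limit = width * sigma * np.sqrt(lam / (2.0 - lam))  # asymptotic control limit
        self.z = mu
    def signal(self, x):
        self.z = self.lam * x + (1.0 - self.lam) * self.z
        return abs(self.z - self.mu) > self.limit

class CusumChart:
    def __init__(self, mu, sigma, k=0.5, h=5.0):
        self.mu, self.k, self.h = mu, k * sigma, h * sigma
        self.hi = self.lo = 0.0
    def signal(self, x):
        self.hi = max(0.0, self.hi + (x - self.mu) - self.k)
        self.lo = max(0.0, self.lo - (x - self.mu) - self.k)
        return self.hi > self.h or self.lo > self.h

def weighted_majority(stream, truth, mu=0.0, sigma=1.0, beta=0.5):
    """Combine the three charts online; charts that err have their weight multiplied by beta."""
    ewma, cusum = EwmaChart(mu, sigma), CusumChart(mu, sigma)
    weights = np.ones(3)
    decisions = []
    for x, y in zip(stream, truth):
        votes = np.array([shewhart_signal(x, mu, sigma), ewma.signal(x), cusum.signal(x)])
        decisions.append(bool(weights[votes].sum() > weights.sum() / 2.0))
        weights[votes != y] *= beta        # penalise the charts that were wrong
        weights /= weights.max()           # keep the largest weight at one
    return decisions, weights

# Hypothetical noise-free stream: in control for 20 samples, then a shift of 4 sigma.
stream = [0.0] * 20 + [4.0] * 20
truth = [False] * 20 + [True] * 20
decisions, weights = weighted_majority(stream, truth)
```

In this toy stream only the Shewhart chart reacts to the first shifted sample, so the two slower charts lose weight and the ensemble signals from the second shifted sample onwards.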

References

MEJRI, D., KHANCHEL, R. and LIMAM, M. (2012): An ensemble method for concept drift in nonstationary environment. Journal of Statistical Computation and Simulation, 82, 1–14.

Keywords

STATISTICAL PROCESS CONTROL, ONLINE CLASSIFICATION, DYNAMIC WEIGHTED MAJORITY ALGORITHM, CONCEPT DRIFT

ISG Tunis, University of Tunis and Technical University of Dortmund, [email protected] · Technical University of Dortmund, Germany, [email protected] · ISG Tunis, University of Tunis and Dhofar University, Oman, [email protected]


Multiplicity Within Clustering: Challenges And Unifications

Jacques-Henri Sublemontier

Abstract

Data clustering is one of the most important unsupervised learning tasks and remains a challenging one despite the huge number of methods proposed in the literature [?]. The large amounts of data currently generated every month, day or hour have led to the so-called “Big Data” problem and have made clustering one of the main tools for making further analysis applicable. We are now faced with multiple sources of information, massive and heterogeneous, coming from fields ranging from marketing to biology and social network analysis. The present study is concerned with the multiplicity within current clustering problems. Multiplicity can be found both in the data to analyse and in the analyses to provide for demanding users. Thus several learning and mining paradigms have emerged over the last decade, namely multi-view clustering, consensus clustering or clustering ensembles, multiple consensus clustering, and subspace and semi-supervised clustering [?]. We review here several works dedicated to these problems and then propose a flexible framework unifying them all. The proposed framework follows the collaborative clustering principle, where the objective is to find collaborative mechanisms between a set of clusterers in order to achieve different objectives related to the presented problems.

References

KRIEGEL, H.-P. and ZIMEK, A. (2010): Subspace Clustering, Ensemble Clustering, Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other? In: Proceedings of MultiClustKDD.
JAIN, A.K. (2010): Data clustering: 50 years beyond K-means. Pattern Recognition Letters.

Keywords

MULTI-VIEW CLUSTERING, CONSENSUS CLUSTERING, ALTERNATIVE CLUSTERING, SEMI-SUPERVISED CLUSTERING, COLLABORATIVE CLUSTERING

LIFO - Université d’Orléans, ENSI de Bourges, Bâtiment IIIA, rue Léonard de Vinci,F-45067 ORLEANS Cedex [email protected],http://www.univ-orleans.fr/lifo/Members/sublemontier/


Non-Isometric Transforms in Time Series Classification using DTW

Tomasz Górecki1 and Maciej Łuczak2

Abstract

Over recent years the popularity of time series has soared. As a consequence there has been a dramatic increase in the amount of interest in querying and mining such data. In particular, many new distance measures between time series have been introduced. In this paper, we propose a new distance function based on derivatives and transforms of time series. In contrast to well-known measures from the literature, our approach combines three distances: DTW distance between time series, DTW distance between derivatives of time series and DTW distance between transforms of time series. The new distance is used in classification with the nearest neighbor rule. In order to provide a comprehensive comparison, we conducted a set of experiments, testing effectiveness on 47 time series data sets from a wide variety of application domains. Our experiments show that this new method provides a significantly more accurate classification on the examined data sets.
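A combined distance of this kind, used with the nearest neighbor rule, can be sketched as below. The abstract does not say how the three DTW distances are combined, so the weighted sum with weights `a`, `b`, `c` is an assumption, as are the use of first differences as the derivative, the unnormalised cosine transform and the toy series.

```python
import numpy as np

def dtw(x, y):
    """Classic DTW distance with squared pointwise cost and no window constraint."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def cosine_transform(x):
    """DCT-II computed directly from its definition (unnormalised)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    return (np.cos(np.pi * k * (2 * t + 1) / (2 * n)) * x).sum(axis=1)

def combined_distance(x, y, a=1.0, b=1.0, c=1.0):
    """Weighted sum of DTW on the series, their differences and their cosine transforms."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (a * dtw(x, y)
            + b * dtw(np.diff(x), np.diff(y))
            + c * dtw(cosine_transform(x), cosine_transform(y)))

def nearest_neighbor_label(query, train_series, train_labels, **weights):
    """1-NN rule under the combined distance."""
    distances = [combined_distance(query, s, **weights) for s in train_series]
    return train_labels[int(np.argmin(distances))]

# Hypothetical training set with two series and a slightly perturbed query.
t = np.linspace(0.0, 2.0 * np.pi, 40)
train_series = [np.sin(t), np.linspace(2.0, 5.0, 40)]
train_labels = ["sine", "ramp"]
query = np.sin(t) + 0.01
label = nearest_neighbor_label(query, train_series, train_labels)
```

In practice the weights would have to be tuned, for instance by cross-validation on the training set; the equal weights here are only a placeholder.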

References

GÓRECKI, T. and ŁUCZAK, M. (2013): Using derivatives in time series classification. Data Mining and Knowledge Discovery, 26(2), 310–331.
DING, H., TRAJCEVSKI, G., SCHEUERMANN, P., WANG, X. and KEOGH, E. (2008): Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. In: Proc. 34th Int. Conf. on Very Large Data Bases, 1542–1552.
KEOGH, E. and PAZZANI, M. (2001): Dynamic Time Warping with Higher Order Features. In: First SIAM International Conference on Data Mining (SDM’2001), Chicago, USA.

Keywords

DYNAMIC TIME WARPING, DERIVATIVE DYNAMIC TIME WARPING, TIME SERIES, HILBERT TRANSFORM, COSINE TRANSFORM, SINE TRANSFORM

Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected] · Department of Civil and Environmental Engineering, Koszalin University of Technology, Sniadeckich 2, 75-453 Koszalin, [email protected]


Performance of the Accelerated Hyperbolic Smoothing Clustering Method

Adilson Elias Xavier1 and Vinicius Layter Xavier2

Abstract

This paper considers the solution of the minimum sum-of-squares clustering (MSSC) problem by using the Accelerated Hyperbolic Smoothing Clustering Method. The mathematical modelling of this problem leads to a min-sum-min formulation which has the significant characteristic of being strongly non-differentiable. The proposed resolution method adopts the Hyperbolic Smoothing (HS) strategy using a special C∞ differentiable class function. The final solution is obtained by solving a sequence of low-dimensional differentiable unconstrained optimization sub-problems which gradually approach the original problem. The proposed algorithm also applies a partition of the set of observations into two non-overlapping groups: "data in frontier" and "data in gravitational regions". The resulting combination of the HS methodology with the partition scheme for the MSSC problem has interesting properties, which drastically simplify the computational tasks. Computational experiments were performed with very large synthetic instances with 5000000 observations in spaces with up to 10 dimensions. The obtained results show a high level of performance of the algorithm according to the different criteria of consistency, robustness and efficiency. The robustness and consistency can be attributed to the complete differentiability of the approach. The high speed of the algorithm can be attributed to the partition of the set of observations into two non-overlapping parts, which drastically simplifies the computational tasks.
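The smoothing idea can be illustrated with the hyperbolic approximation of max(0, y) commonly used in this line of work, φ(y, τ) = (y + sqrt(y² + τ²)) / 2, which is infinitely differentiable for τ > 0 and approaches max(0, y) as τ goes to zero. The sketch only shows this one building block; that the accelerated method uses exactly this function, and the grid and τ values below, are assumptions.

```python
import numpy as np

def psi(y):
    """Non-differentiable target: max(0, y)."""
    return np.maximum(0.0, y)

def phi(y, tau):
    """Hyperbolic smoothing of max(0, y); smooth for tau > 0."""
    return (y + np.sqrt(y * y + tau * tau)) / 2.0

# Largest gap between the smooth function and max(0, y) on a grid, for shrinking tau.
y = np.linspace(-2.0, 2.0, 401)
gaps = {tau: float(np.max(phi(y, tau) - psi(y))) for tau in (1.0, 0.1, 0.01)}
```

The gap is largest at y = 0, where it equals τ / 2, which is why solving a sequence of sub-problems with decreasing τ gradually approaches the original problem.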

References

XAVIER, A.E. (2010): The Hyperbolic Smoothing Clustering Method. Pattern Recognition, 43, 731–737.
XAVIER, A.E. and XAVIER, V.L. (2011): Solving the Minimum Sum-of-Squares Clustering Problem by Hyperbolic Smoothing and Partition into Boundary and Gravitational Regions. Pattern Recognition, 44, 70–77.

Keywords

CLUSTER ANALYSIS, MIN-SUM-MIN PROBLEMS, NON-DIFFERENTIABLE PROGRAMMING, SMOOTHING

Federal University of Rio de Janeiro - [email protected] · Federal University of Rio de Janeiro - [email protected]


STATIS Based Multiblock Clustering

Ndèye Niang1 and Mory Ouattara12

Abstract

Clustering multiblock data has been addressed by several consensus methods proposed by authors such as Gordon and Vichi (1998), among others. The principal idea of these consensus methods is to agglomerate the separate partitions obtained from each block into a global partition which has to be the most similar to the contributory partitions according to some index, e.g. the Rand index. CSPA (cluster-based similarity partitioning algorithm) consists of clustering a so-called association matrix whose entries are defined as the fraction of partitions in which two individuals are in the same cluster. This association matrix, considered as a similarity matrix, is used to recluster the individuals. Li and Ding (2008) pointed out some limitations of CSPA and proposed a weighted consensus clustering method. We propose a method based on the three-way method STATIS (Lavit et al., 1994) to find the consensus partition: let X_i be the indicator matrix related to the i-th contributory partition. Applying STATIS, each of these matrices is associated with a connectivity matrix W_i. STATIS yields a compromise matrix W, a weighted average of the W_i which is the most similar to the W_i according to the RV index (Lavit et al., 1994). We propose to recluster the individuals using the STATIS compromise matrix. The proposed method is compared to CSPA on data sets from the UCI repository, with labelled individuals in order to have a reference partition.
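The construction of the compromise matrix can be sketched along the lines of the usual STATIS recipe: build W_i = X_i X_i' from each indicator matrix, compute the matrix of RV coefficients between the W_i, and use the normalised first eigenvector of that matrix as weights. The sketch omits any centering or normalisation of the W_i that the authors may apply, and the function names and toy partitions are assumptions.

```python
import numpy as np

def indicator(labels):
    """Indicator matrix X_i of a partition (individuals x clusters)."""
    labels = np.asarray(labels)
    return (labels[:, None] == np.unique(labels)[None, :]).astype(float)

def rv_coefficient(A, B):
    """RV coefficient between two symmetric matrices."""
    return np.trace(A @ B) / np.sqrt(np.trace(A @ A) * np.trace(B @ B))

def statis_compromise(partitions):
    """Weighted average of connectivity matrices, weights from the RV matrix."""
    W = [indicator(p) @ indicator(p).T for p in partitions]   # connectivity matrices W_i
    m = len(W)
    RV = np.array([[rv_coefficient(W[i], W[j]) for j in range(m)] for i in range(m)])
    _, vecs = np.linalg.eigh(RV)
    weights = np.abs(vecs[:, -1])        # first eigenvector, sign made positive
    weights /= weights.sum()
    compromise = sum(w * Wi for w, Wi in zip(weights, W))
    return compromise, weights

# Hypothetical contributory partitions: two agree, one differs.
partitions = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],
    [0, 1, 0, 1, 0, 1],
]
compromise, weights = statis_compromise(partitions)
```

The two agreeing partitions receive equal weights that exceed the weight of the dissenting one, in contrast to the unweighted association matrix of CSPA. Any similarity-based clustering method can then be applied to `compromise`.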

References

GORDON, A.D. and VICHI, M. (1998b): Partition of partitions. Journal of Classification, 15, 265–285.
LAVIT, C., ESCOUFIER, Y., SABATIER, R. and TRAISSAC, P. (1994): The ACT (STATIS method). Computational Statistics and Data Analysis, 18, 97–119.
LI, T. and DING, C. (2008): Weighted Consensus Clustering. In: Proc. SIAM Int. Conf. on Data Mining (SDM), 798–809.

Keywords

STATIS, MULTI BLOCKS, CLUSTERING, CONSENSUS

CEDRIC CNAM 292, rue Saint Martin, 75141 Paris Cedex 03, [email protected] · CSTB, Centre Scientifique et Techniques du Bâtiment, 84 Avenue Jean Jaurès, 77420 [email protected]


Identifying Common And Distinctive Processes Underlying Multiset Data

Katrijn Van Deun1, Age K. Smilde2, Henk A.L. Kiers3, and Iven Van Mechelen1

Abstract

In many research domains it has become common practice to rely on multiple sources of data pertaining to the same set of entities. Examples include a systems biology approach to immunology with collection of both gene expression data and immunological readouts for the same set of subjects, and the use of several high-throughput techniques for the same set of fermentation batches. A major challenge is to find the processes underlying such multiset data and to disentangle therein the common processes from those that are distinctive for a specific source. Several integrative methods have been proposed to address this challenge, including canonical correlation analysis, simultaneous component analysis, OnPLS, generalized singular value decomposition, DISCO-SCA, and ECO-POWER. To get a better understanding of the relations between these methods, this paper brings the methods together and compares them both on a theoretical level and in terms of analyses of high-dimensional micro-array gene expression data obtained from subjects vaccinated against influenza.

References

ALTER, O., BROWN, P.O. and BOTSTEIN, D. (2003): Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences USA, 100, 3351–3356.
LÖFSTEDT, J. and TRYGG, J. (2010): OnPLS - a novel multiblock method for the modelling of predictive and orthogonal variation. Journal of Chemometrics, 25 (2010), 441–455.
SCHOUTEDEN, M., VAN DEUN, K. and VAN MECHELEN, I. (2012): ECO-POWER: A novel method to reveal common mechanisms underlying linked data. In: A. COLUBI, K. FOKIANOS and E.J. KONTOGHIORGHES (Eds.): Proceedings of COMPSTAT’2012. 20th International Conference on Computational Statistics. Physica-Verlag, Heidelberg. PP–PP.
TENENHAUS, A. and TENENHAUS, M. (2011): Regularized generalized canonical correlation analysis. Psychometrika, 76, 257–284.
VAN DEUN, K., VAN MECHELEN, I., THORREZ, L., SCHOUTEDEN, M., DE MOOR, B., VAN DER WERF, M.J., DE LATHAUWER, L., SMILDE, A.K. and KIERS, H.A.L. (2012): DISCO-SCA and properly applied GSVD as swinging methods to find common and distinctive processes. PLoS ONE, 7, e37840, 1–13.

Keywords

MULTISET, COMMON AND DISTINCTIVE, DATA INTEGRATION

KU Leuven, Leuven, [email protected] · University of Amsterdam, Amsterdam, The Netherlands · University of Groningen, Groningen, The Netherlands


Fuzzy Clustering of Three-way Proximity Arrays

Paolo Giordani1 and Henk A.L. Kiers2

Abstract

The ADditive CLUStering (ADCLUS) model is a tool for overlapping clustering of two-way proximity matrices (objects × objects). The Simple Additive Fuzzy Clustering (SAFC) model is a variant of ADCLUS that provides a fuzzy partition of the objects; that is, the objects belong to the clusters with so-called membership degrees ranging from zero (complete non-membership) to one (complete membership). The INdividual Differences CLUStering (INDCLUS) model is a generalization of ADCLUS for handling three-way proximity arrays (objects × objects × subjects). Here, we propose a fuzzified alternative to INDCLUS capable of offering a fuzzy partition of the objects by generalizing the idea behind SAFC to a three-way context. This new model is called Fuzzy INdividual Differences CLUStering (FINDCLUS). An algorithm is provided for fitting the FINDCLUS model to the data. Finally, the results of a simulation experiment and some applications to synthetic and real data are discussed.

References

CARROLL, J.D. and ARABIE, P. (1983): INDCLUS: an Individual Differences Generalization of the ADCLUS Model and the MAPCLUS Algorithm. Psychometrika, 48, 157–169.
GIORDANI, P. and KIERS, H.A.L. (2012): FINDCLUS: Fuzzy INdividual Differences CLUStering. Journal of Classification, 29, 170–198.
SATO, M. and SATO, Y. (1994): An Additive Fuzzy Clustering Model. Japanese Journal of Fuzzy Theory and Systems, 6, 185–204.
SHEPARD, R.N. and ARABIE, P. (1979): Additive Clustering: Representation of Similarities as Combinations of Discrete Overlapping Properties. Psychological Review, 86, 87–123.

Keywords

THREE-WAY ANALYSIS, CLUSTERING, PROXIMITY DATA, INDCLUS, FUZZY APPROACH

Department of Statistical Sciences, Sapienza University of Rome, P.le Aldo Moro, 5, 00185 Rome, [email protected] · Heymans Institute, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The [email protected]


Principal Covariates Clusterwise Regression

Eva Ceulemans1, Eva Vande Gaer1, Henk A. L. Kiers2, Iven Van Mechelen3, and Tom F. Wilderjans1

Abstract

In the behavioral sciences, many research questions pertain to a regression problem in that one wants to predict a criterion on the basis of a number of predictors. Although in many cases ordinary least squares regression will suffice, sometimes the prediction problem is more challenging, for three reasons: First, many predictors can be available, making it difficult to grasp their mutual relations as well as their relations to the criterion. In that case, it may be very useful to reduce the predictors to a few summary variables, on which one regresses the criterion and which at the same time yield insight into the predictor structure. Second, the population under study may consist of a few unknown subgroups that are characterized by different regression models. Third, the obtained data are often hierarchically structured, with for instance observations being nested into persons. Although some methods have been developed that partially meet these challenges (i.e., Principal Covariates Regression -PCovR-, clusterwise regression -CR-, and structural equation models), none of these methods adequately deals with all of them simultaneously. To fill this gap, we propose the PCCR method, which combines the key ideas behind PCovR (De Jong and Kiers, 1992) and CR (Spath, 1979). The PCCR method is validated by means of a simulation study and by applying it to data gathered in daily life on eating disorders.
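The PCovR building block of PCCR can be sketched through the eigenvalue-based solution known from the PCovR literature: the component scores are the leading eigenvectors of a weighted sum of the scaled cross-product matrix of the predictors and that of the fitted criterion values. The sketch does not include the clusterwise part of PCCR; the function name, the scaling by total sums of squares and the toy data are assumptions.

```python
import numpy as np

def pcovr(X, Y, alpha, n_components):
    """Principal covariates regression via an eigendecomposition.

    alpha = 1 reduces to principal component analysis of X;
    smaller alpha puts more weight on predicting Y.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X_pinv = np.linalg.pinv(X)
    fitted = X @ X_pinv @ Y                    # projection of Y on the column space of X
    G = (alpha * (X @ X.T) / (X ** 2).sum()
         + (1.0 - alpha) * (fitted @ fitted.T) / (Y ** 2).sum())
    _, vecs = np.linalg.eigh(G)                # eigenvalues in ascending order
    T = vecs[:, ::-1][:, :n_components]        # orthonormal component scores
    W = X_pinv @ T                             # component weights, so that T = X W
    P_x = T.T @ X                              # loadings reconstructing X
    P_y = T.T @ Y                              # regression weights predicting Y
    return T, W, P_x, P_y

# Hypothetical data: 30 observations, 6 predictors, 2 criteria.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
Y = X[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(30, 2))
T, W, P_x, P_y = pcovr(X, Y, alpha=0.5, n_components=2)
```

In a clusterwise setting, a fit of this kind would be computed per cluster while the cluster memberships are updated, which is where the choice of `alpha` and of the number of components becomes a model selection problem.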

References

DE JONG, S. and KIERS, H. A. L. (1992): Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155–164.
SPATH, H. (1979): Algorithm 39: Clusterwise linear regression. Computing, 22, 367–373.

Keywords

MULTICOLLINEARITY, DIMENSION REDUCTION, CLUSTERWISE REGRESSION, MULTILEVEL DATA

Methodology of Educational Sciences Research Group, KU Leuven. Email: [email protected] · Heymans Institute, Faculty of Behavioural and Social Sciences, University of Groningen. Email: [email protected] · Research Group of Quantitative Psychology and Individual Differences, KU Leuven. Email: [email protected]


Clusterwise PARAFAC To Identify Heterogeneity In Three-Way Data

Tom F. Wilderjans and Eva Ceulemans

Abstract

Three-way data, such as sensory profiling data (e.g., products rated on a set of features by different judges) and EEG data (e.g., the spectrum of multichannel EEG recordings over time for a set of participants), are frequently encountered in practice. When analyzing such three-way data, the PARAFAC model is often adopted to disclose the structure underlying the data (in terms of components). An implicit assumption of the PARAFAC model is that the underlying components are the same for all objects (i.e., products and participants). In many circumstances, however, this assumption is too restrictive in that different groups of objects may exist for which the data can be summarized well by a different set of components. For example, groups of participants may differ in the components that underlie their EEG recordings and other dimensions may be used to evaluate the quality of different groups of products. Therefore, in this presentation, a new clusterwise PARAFAC generic modeling strategy is proposed. The key ingredient of this strategy is that the objects are partitioned into a set of mutually exclusive clusters, and that for each cluster of objects, a separate PARAFAC model is fitted, resulting in (cluster-specific) components that are allowed to vary across object clusters. As a consequence, the data of objects belonging to the same cluster can be summarized well by the same components, whereas different components underlie the data of objects from different clusters. To evaluate the performance of the new clusterwise PARAFAC strategy, the results of an extensive simulation study will be discussed. Finally, an application of the strategy to EEG and/or sensory profiling data will be presented.

Keywords

CANDECOMP/PARAFAC, POPULATION HETEROGENEITY, THREE-WAY DATA, EEG DATA, QUALITATIVE (AND QUANTITATIVE) DIFFERENCES BETWEEN OBJECTS

Methodology of Educational Sciences Research Group, Faculty of Psychology and Educational Sciences, KU Leuven, Andreas Vesaliusstraat 2 box 3762, 3000 Leuven, Belgium. Email: [email protected]


Structure-Revealing Data Fusion Model

Evrim Acar, Anders J. Lawaetz, Morten A. Rasmussen, and Rasmus Bro

Abstract

In many disciplines, data from multiple sources are acquired and jointly analyzed for enhanced knowledge discovery. However, the task of fusing data is challenging since data are often incomplete, heterogeneous, i.e., in the form of higher-order tensors and matrices, and have both common (shared) and individual (unshared) components. With a goal of addressing these challenges, we formulate data fusion as a coupled matrix and tensor factorization problem tailored to automatically reveal common and individual components. In order to solve the coupled factorization problem, we use a gradient-based all-at-once optimization algorithm, which easily extends to coupled analysis of incomplete data sets. We demonstrate that the proposed approach provides promising results in joint analysis of metabolomics data sets consisting of fluorescence and NMR measurements of plasma samples of a group of colorectal cancer patients and controls.
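A basic coupled matrix and tensor factorization objective, and the gradient with respect to the shared factor that an all-at-once method would use, can be sketched as follows. This is the plain coupled loss only: the additional weights and penalties that make the model "structure-revealing", and the handling of missing entries, are not included, and all names and dimensions are assumptions.

```python
import numpy as np

def cmtf_objective(X, Y, A, B, C, V):
    """Coupled loss: tensor X ~ [[A, B, C]] and matrix Y ~ A V^T share the factor A."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    Y_hat = A @ V.T
    return 0.5 * ((X - X_hat) ** 2).sum() + 0.5 * ((Y - Y_hat) ** 2).sum()

def cmtf_gradient_A(X, Y, A, B, C, V):
    """Gradient of the coupled loss with respect to the shared factor A."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    grad_tensor = -np.einsum("ijk,jr,kr->ir", X - X_hat, B, C)
    grad_matrix = -(Y - A @ V.T) @ V
    return grad_tensor + grad_matrix

# Hypothetical noise-free data generated from two components.
rng = np.random.default_rng(0)
A, B, C, V = (rng.normal(size=(n, 2)) for n in (4, 5, 6, 3))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T
loss_at_truth = cmtf_objective(X, Y, A, B, C, V)
```

Gradients for `B`, `C` and `V` follow the same pattern; stacking them into one vector lets a general-purpose gradient-based optimizer update all factors at once. Incomplete data could be handled by multiplying the residuals with a 0/1 mask before squaring.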

References

ACAR, E., KOLDA, T. G. and DUNLAVY, D. M. (2011): All-at-once Optimization for Coupled Matrix and Tensor Factorizations, arXiv:1105.3422.

Keywords

DATA FUSION, COUPLED MATRIX AND TENSOR FACTORIZATIONS, MISSING DATA, GRADIENT-BASED OPTIMIZATION

Faculty of Science, University of Copenhagen, Denmark. {evrim, ajla, mortenr, rb}@life.ku.dk


Effects of Resampling Schemes on Stability of Cluster Validation Indices

Rainer Dangl and Friedrich Leisch

Abstract

Model validation in clustering involves the question of whether the appropriate number of groups was chosen. To investigate this, a wide range of indices has been developed. Examples include the Rand Index, the Jaccard Coefficient, the CH Index, the KL Index, the Gap Statistic, and Prediction Strength. In recent years, increased computational power has made resampling-based validation studies feasible; by repeatedly calculating validity measures on resampled data, they provide a more stable trend towards a particular k. This in turn poses new questions: not only the choice of a particular index may affect the outcome of the validation process, but also the method of resampling. Three main options are available: bootstrapping, splitting and random selection, depending on whether an internal or external index is used. The question now arises whether the resampling scheme has an influence on the index values. The present study investigates exactly this problem. For this purpose, the three schemes and a selected range of cluster validation indices are benchmarked on simulated data.
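Two of the external indices named above, the Rand Index and the Jaccard Coefficient, compare two partitions of the same objects by counting object pairs. The following is a minimal sketch of that pair-counting step; the function names are hypothetical and it is not the benchmarking code of the study. In a resampling study the two label vectors would come from clusterings of two resampled data sets, restricted to the objects they share.

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count object pairs by whether they are co-clustered in each partition."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Share of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    return both / denom if denom else 1.0
```

Both indices depend only on co-membership of pairs, so they are unaffected by how the clusters happen to be numbered.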

References

DOLNICAR, S. and LEISCH, F. (2010): Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21, 83–101.
MILLIGAN, G. and COOPER, M. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50 (2), 159–179.
TIBSHIRANI, R., WALTHER, G. and HASTIE, T. (2000): Estimating the number of clusters in a dataset via the Gap Statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 411–423.

Keywords

RESAMPLING, MODEL VALIDATION, CLUSTERING

Institute for Applied Statistics and Computing, University of Natural Resources and Life Sciences, Vienna, Peter-Jordan-Strasse 82, 1190 Vienna, [email protected]; [email protected]


Functional Canonical Correlation Analysis

Mirosław Krzysko¹ and Łukasz Waszak²

Abstract

In this paper we propose a new method of constructing canonical correlations and canonical variables for the pair of stochastic processes

\[ X(t) = \sum_{k=1}^{p} \alpha_k \varphi_k(t), \qquad Y(t) = \sum_{l=1}^{q} \beta_l \psi_l(t), \]

represented by a finite number of orthonormal basis functions

\[ \varphi(t) = (\varphi_1(t), \ldots, \varphi_p(t))', \qquad \psi(t) = (\psi_1(t), \ldots, \psi_q(t))', \]

where $t \in [0, T]$, and $\alpha_1, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$ are random variables with zero means and finite variances. Canonical correlation analysis for random processes with finite basis expansions is equivalent to multivariate canonical correlation analysis between the two random vectors $\alpha = (\alpha_1, \ldots, \alpha_p)'$ and $\beta = (\beta_1, \ldots, \beta_q)'$.

This problem was first studied by Leurgans et al. (1993) and developed further by Ramsay and Silverman (2005).
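The equivalence between the functional and the multivariate problem can be made explicit in a few lines. This is a sketch under the assumption that the weight functions $u$ and $v$ lie in the spans of the two bases; the symbols $a$, $b$, $u$, $v$ and $\Sigma$ are introduced here for illustration and are not part of the abstract.

```latex
% Orthonormality of the bases turns inner products of functions
% into inner products of coefficient vectors.
u(t) = a'\varphi(t), \quad v(t) = b'\psi(t)
\;\Longrightarrow\;
\langle u, X \rangle = \int_0^T u(t) X(t)\,dt = a'\alpha, \qquad
\langle v, Y \rangle = \int_0^T v(t) Y(t)\,dt = b'\beta .

% The functional problem therefore reduces to classical CCA on (alpha, beta).
\rho = \max_{a, b}
  \frac{a' \Sigma_{\alpha\beta}\, b}
       {\sqrt{a' \Sigma_{\alpha\alpha}\, a}\; \sqrt{b' \Sigma_{\beta\beta}\, b}},
\qquad
\Sigma_{\alpha\alpha}^{-1} \Sigma_{\alpha\beta} \Sigma_{\beta\beta}^{-1} \Sigma_{\beta\alpha}\, a = \rho^2 a .
```

Here $\Sigma_{\alpha\alpha}$, $\Sigma_{\beta\beta}$ and $\Sigma_{\alpha\beta} = \Sigma_{\beta\alpha}'$ are the covariance and cross-covariance matrices of the coefficient vectors, assumed nonsingular where inverted.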

References

LEURGANS, S.E., MOYEED, R.A. and SILVERMAN, B.W. (1993): Canonical correlation analysis when the data are curves. J. R. Statist. Soc. B, 55 (3), 725–740.
RAMSAY, J.O. and SILVERMAN, B.W. (2005): Functional Data Analysis (2nd ed.). Springer.

Keywords

FUNCTIONAL DATA, ORTHONORMAL BASIS, CANONICAL CORRELATION ANALYSIS, REPRODUCING KERNEL HILBERT SPACE, KERNEL

Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected] · Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected]


Pearson’s Product-Moment Correlation is a Special Case Of Cohen’s Weighted Kappa

Matthijs J. Warrens

Abstract

In behavioral and biomedical sciences it is frequently required that two observers each independently rate the same set of targets on an ordinal scale. The raters may be clinicians who classify children on asthma severity, or pathologists who rate the severity of lesions from scans. A widely used descriptive statistic for quantifying the agreement between the two observers is Cohen’s weighted kappa (Cohen 1968, Warrens 2011, 2012).

Weighted kappa was proposed for situations where the disagreements between the observers are not all equally important. For example, when categories are ordered, the seriousness of a disagreement depends on the difference between the ratings. Weighted kappa allows the use of weights to describe the closeness of agreement between categories.

Since the magnitude of weighted kappa is greatly influenced by the relative magnitude of the weights (Warrens 2013), a practical problem since its introduction has been which weights to choose. In this talk we show that if cell weights may be calculated from the data, then the sample estimate of Pearson’s product-moment correlation is a special case of Cohen’s weighted kappa.
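To make the role of the weights concrete, the sketch below computes weighted kappa from a square table of counts using disagreement weights (zero on the diagonal). The function names and the linear and quadratic weight schemes are illustrative choices assumed here; the data-dependent weights that yield Pearson's correlation in the talk are not reproduced.

```python
def weighted_kappa(table, weight):
    """Weighted kappa from a k x k table of counts and disagreement weights.

    table[i][j] counts targets rated i by observer 1 and j by observer 2;
    weight(i, j) is the penalty for that cell (0 when i == j).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    observed = sum(weight(i, j) * table[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row_share[i] * col_share[j]
                   for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("expected disagreement is zero; kappa is undefined")
    return 1 - observed / expected


def linear_weight(i, j):
    """Penalty proportional to the distance between categories."""
    return abs(i - j)


def quadratic_weight(i, j):
    """Penalty proportional to the squared distance between categories."""
    return (i - j) ** 2
```

With only two categories every weight scheme with a nonzero off-diagonal penalty gives the same value as unweighted kappa; the choice of weights starts to matter from three ordered categories onward.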

References

COHEN, J. (1968): Weighted Kappa: Nominal Scale Agreement With Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70, 213–220.
WARRENS, M. J. (2011): Cohen’s Linearly Weighted Kappa is a Weighted Average of 2×2 Kappas. Psychometrika, 76, 471–486.
WARRENS, M. J. (2012): Some Paradoxical Results for the Quadratically Weighted Kappa. Psychometrika, 77, 315–323.
WARRENS, M. J. (2013): Conditional Inequalities Between Cohen’s Kappa and Weighted Kappas. Statistical Methodology, 10, 14–22.

Keywords

COHEN’S KAPPA, WEIGHTED KAPPA, ORDINAL AGREEMENT

Leiden University, Institute of Psychology, Unit Methodology and Statistics, P.O. Box 9555, 2300 RB Leiden, The Netherlands, [email protected]


Ternary Diagrams Based On A Probabilistic Ideal Point Model

Mark de Rooij¹ and Paul Eilers²

Abstract

The ternary diagram is a familiar and useful display of triples of probabilities that sum to one. The scales of the diagram are linear, and so small probabilities lead to dots close to the boundary or even in the corners. Details are then hard to judge.

We propose a transformation inspired by the probabilistic ideal point model of De Rooij (2009). In a plane, an object $i$ with probabilities $(p_{i1}, p_{i2}, p_{i3})$ is represented by a point with coordinates $(x_i, y_i)$ such that $p_{ij} = c \exp(-d_{ij}^2)$, where $d_{ij}^2 = (x_i - u_j)^2 + (y_i - v_j)^2$ is the squared Euclidean distance to an “anchor point” $j$ with coordinates $(u_j, v_j)$. These anchor points are defined by the user and will generally form an equilateral triangle.

The proposed display has several interesting properties. Triples with very small probabilities can be represented well. Equal log-odds of pairs of probabilities correspond to straight lines perpendicular to the line connecting two anchor points. Equal log-odds of a single probability against the two others are given by smooth curves.
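The transformation can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the constant $c$ is taken to be the normalizer that makes the three probabilities sum to one, and all probabilities are assumed strictly positive. The inverse map uses the fact that $\log(p_{i1}/p_{ij}) = d_{ij}^2 - d_{i1}^2$ is linear in $(x_i, y_i)$, which is also why lines of equal log-odds are straight.

```python
import math


def anchors_equilateral(side=2.0):
    """Three anchor points forming an equilateral triangle (a user choice)."""
    height = side * math.sqrt(3) / 2
    return [(0.0, 0.0), (side, 0.0), (side / 2, height)]


def point_to_probs(x, y, anchors):
    """p_j = c * exp(-d_j^2), with c chosen so the probabilities sum to one."""
    kernels = [math.exp(-((x - u) ** 2 + (y - v) ** 2)) for u, v in anchors]
    total = sum(kernels)
    return [k / total for k in kernels]


def probs_to_point(p, anchors):
    """Place a probability triple in the plane by solving a 2 x 2 linear system."""
    u1, v1 = anchors[0]
    rows, rhs = [], []
    for j in (1, 2):
        uj, vj = anchors[j]
        # d_j^2 - d_1^2 = 2x(u1 - uj) + 2y(v1 - vj) + (uj^2 + vj^2 - u1^2 - v1^2)
        rows.append((2 * (u1 - uj), 2 * (v1 - vj)))
        rhs.append(math.log(p[0] / p[j]) - (uj ** 2 + vj ** 2 - u1 ** 2 - v1 ** 2))
    (a11, a12), (a21, a22) = rows
    det = a11 * a22 - a12 * a21
    x = (rhs[0] * a22 - a12 * rhs[1]) / det
    y = (a11 * rhs[1] - a21 * rhs[0]) / det
    return x, y
```

A triple such as (0.98, 0.015, 0.005), which would sit almost in a corner of the linear ternary diagram, is mapped to a finite point whose distances to the anchors still distinguish the two small probabilities.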

References

DE ROOIJ, M. (2009): Ideal point discriminant analysis with a special emphasis on visualization. Psychometrika, 74, 317–330.

Keywords

BIPLOTS, MULTIDIMENSIONAL SCALING, COMPOSITIONAL DATA

Leiden University, The [email protected] · Erasmus University Medical Center, The [email protected]


The Matter Of Scale: Perceiving Distances And Proximities In The Bi-Partial Clustering Setting

Jan W. Owsinski

Abstract

In the analysis of empirical data, the issue of scale is of paramount importance. If there exists clear knowledge of the “actual” space of (feasible) attribute values, then it has an obvious influence on the interpretation of the values that are available for analysis. While in the case of, say, “independent” binary data this might often be trivial, it is far from trivial when we deal with continuous data or with seriously restricted domains of multidimensional discrete, or even binary, data. This is related at the same time to the degree of “filling” of this “feasible space” with the data and to the distance/proximity relations among the available observations.

Yet, the issue of scale appears as important on several levels. First, it intervenes at the level of individual observations and the distance/proximity definitions, in close association with feature/variable importance. In this respect there is a distinct feedback loop in reasoning, for it is the geometric properties that suggest which variables are important, while importance may be held to have an impact on the way the geometry of the data set is treated. Second, it bears direct influence on separation, distances and proximities among groups of observations. Finally, it appears through the assessment of the entire image of the data (do we deal with two or three models? is the proper number of clusters four or six?). The intuitions relative to these three basic levels may not necessarily be consistent.

In many cases it may be of high significance to analyse explicitly the influence of the perception of scale on the results of the respective analysis. The bi-partial approach, proposed by the present author, allows for an explicit consideration of this aspect, at least at two of the previously mentioned levels. The bi-partial approach, which stems mainly from clustering but applies to numerous domains of data analysis at large (see Owsinski, 2011), proposes to use a two-part objective function, namely

\[ \min_P \{ Q^S_D(P) = C^S(P) + C_D(P) \}, \]

where $P$ is a partition of the data set that we look for, $C^S(P)$ corresponds to the overall assessment of similarity ($S$) of the components forming partition $P$ (we want the components to be possibly dissimilar), and $C_D(P)$ corresponds to the overall assessment of the internal “compactness” of the particular groups, but measured through distances ($D$), so that we would like it to be possibly small. This formulation can be replaced by its “dual”, namely

\[ \max_P \{ Q^D_S(P) = C^D(P) + C_S(P) \}, \]

with analogous notation.

Given that in this formulation we deal at the same time with distances and proximities, at least two of the previously mentioned levels of perception are involved. One refers to the basic definitions of distances and proximities for pairs of observations (objects). Actually, one deals in this context with the bidirectional transformation $d \leftrightarrow s$ between distance and proximity definitions. In quite a natural way, this transformation involves the establishment of the respective scale (e.g., for standardised magnitudes of $d$ and $s$, when $s = 1 - d$ and vice versa, meaning we operate within a unit figure), whether done explicitly or implicitly. Further, joint consideration of $C^S(P)$ and $C_D(P)$ (in the “primal” formulation) implies the establishment of a certain scale at the level of groups of observations.

If so, experiments can be carried out on (a) the character of the results of the respective analysis (cluster analysis, first of all) as a function of the scale (transformation) parameters, ranging, quite intuitively, from one cluster to a number of clusters equal to the number of observations (excluding identical ones); and (b) the comparison of the results thus obtained with those indicated either by humans or by the usually applied statistical criteria.

The paper presents the rationale and the purposefulness of the exercise, and illustrates it with simple examples for the basic concrete formulations of the bi-partial objective function.

References

OWSINSKI, J.W. (2011): The bi-partial approach in clustering and ordering: the model and the algorithms. Statistica & Applicazioni, Special Issue, pp. 43-59.

Keywords

DISTANCE, PROXIMITY, SCALE, PERCEPTION LEVELS, BI-PARTIAL OBJECTIVE FUNCTION

Systems Research Institute, Polish Academy of Sciences, [email protected]


Comparing Direct Estimators of the Mode

Andrzej Sokołowski¹ and Kamil Fijorek²

Abstract

Since Karl Pearson’s paper in 1895, many estimators of the mode have been proposed in the statistical literature. They can be grouped into two classes: indirect and direct. The first class involves estimating the density function and then finding its maximum. Direct estimators, of which there are different types, work without prior estimation of the density. In the paper, several direct estimators are compared in simulation studies based on specially designed generating models, for both univariate and multivariate distributions, with single and multiple modes.
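As an illustration of what a direct estimator looks like, the sketch below implements the half-sample mode discussed by Bickel and Frühwirth (2006) for univariate data. It is offered as one example of the class, under the assumption that it is representative; the abstract does not say which estimators are compared, and the function name is hypothetical.

```python
import math


def half_sample_mode(values):
    """Half-sample mode: repeatedly keep the densest half of the sorted data."""
    x = sorted(values)
    if not x:
        raise ValueError("at least one observation is required")
    while len(x) > 3:
        half = math.ceil(len(x) / 2)
        # Start index of the shortest window holding `half` consecutive points.
        start = min(range(len(x) - half + 1),
                    key=lambda i: x[i + half - 1] - x[i])
        x = x[start:start + half]
    if len(x) == 3:
        left_gap, right_gap = x[1] - x[0], x[2] - x[1]
        if left_gap < right_gap:
            x = x[:2]
        elif right_gap < left_gap:
            x = x[1:]
        else:
            return x[1]
    # One or two points remain: return their mean.
    return sum(x) / len(x)
```

No density estimate and no bandwidth are needed; the estimator works directly on the order statistics, which is what distinguishes the direct class from the indirect one.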

References

PEARSON, K. (1895): Contribution to the mathematical theory of evolution – II: Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, A, 186, 343-414.
SOKOŁOWSKI, A. (2013): Bezposrednie estymatory modalnej. Wydawnictwo Uniwersytetu Ekonomicznego, Kraków.
SAGER, T. (1978): Estimation of a Multivariate Mode. The Annals of Statistics, vol. 6, No. 4, 802-812.
BICKEL, D.R. and FRÜHWIRTH, R. (2006): On a fast, robust estimator of the mode: Comparison to other robust estimators with applications. Computational Statistics & Data Analysis, vol. 50, 12, 3500-3530.

Keywords

MODE, MODE ESTIMATION, SIMULATIONS

Cracow University of [email protected] · Cracow University of [email protected]


k-NN Algorithm for Instantaneous Classification

Carmen Villar-Patiño¹ and Carlos Cuevas-Covarrubias²

Abstract

k-NN (k-nearest neighbors) algorithms are standard methods of statistical classification. They are accurate and distribution free. In spite of these convenient features, k-NN implies a high computational cost. How to implement k-NN efficiently is an important question in applied pattern recognition. We describe a new condensation method for k-NN and we explore its performance in instantaneous color identification problems. As in some other solutions reported in the literature, we represent the training data set in terms of a reduced collection of informative prototypes. This is similar to the k-NN model-based approach; nevertheless, our method includes two parameters to be calibrated in order to obtain a convenient exchange of precision for condensation; we call this k-NN “controlled condensation”. We evaluate its performance with a real data set in a computer vision context. The results suggest that this proposal is accurate and efficient. It is a good alternative for implementing efficient applications of k-NN in challenging classification problems.
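The gain from condensation comes from the fact that the cost of a k-NN query grows with the number of stored points. The sketch below shows plain k-NN classification against a small set of prototypes; it does not implement the controlled condensation of the abstract, and the function name, the RGB prototypes and their labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, prototypes, k=3):
    """Label `query` by majority vote among its k nearest prototypes.

    prototypes is a list of (point, label) pairs; points are equal-length
    tuples of numbers, compared by Euclidean distance.
    """
    nearest = sorted(prototypes, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical condensed training set: a few RGB prototypes per color class.
PROTOTYPES = [
    ((255, 0, 0), "red"),
    ((250, 10, 10), "red"),
    ((0, 0, 255), "blue"),
    ((10, 10, 250), "blue"),
    ((0, 255, 0), "green"),
]
```

Each query sorts the stored prototypes by distance, so replacing thousands of training pixels with a handful of prototypes reduces the work per classification accordingly, at the price of some precision.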

References

GUO, G., WANG, H., BELL, D., BI, Y. and GREER, K. (2003): KNN model-based approach in classification. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 2888, 986-996.
JIMENEZ, R. and CUEVAS, C. (2010): Curvas ROC y Vecinos Cercanos, Propuesta de un nuevo algoritmo de Condensación, Revista de Matemática: Teoría y Aplicaciones, 18, 21-32.
MURTY, M. N. and DEVI, V. S. (2012): Pattern Recognition: An Algorithmic Approach. Springer and Universities Press.

Keywords

SUPERVISED CLASSIFICATION, k-NN, CONDENSATION, COMPUTER VISION, COLOR CLASSIFICATION

Universidad Anáhuac, Estado de México, Mé[email protected] · e-mail: [email protected]


Flexible Multiclass Support Vector Machines: An Approach using Iterative Majorization and Huber Hinge Errors

G.J.J. van den Burg¹ and P.J.F. Groenen²

Abstract

A flexible multiclass support vector machine (SVM) is proposed which can be used for classification problems where the number of classes K ≥ 2. Traditional extensions of the binary SVM to multiclass problems, such as the one-vs-all or one-vs-one approach, suffer from unclassifiable regions. This problem is avoided in the proposed method by constructing the class boundaries in a (K − 1)-dimensional simplex space. Nonlinear classification boundaries can be constructed by using either kernels or spline transformations in the method. Similar to earlier work by Groenen et al. (2008), an Iterative Majorization algorithm is derived to minimize the constructed loss function. The performance of the method is measured through comparisons with existing multiclass classification methods on several datasets. From this we find that in most cases the performance of the proposed method is similar to that of existing techniques, but in some cases classification accuracy is higher.
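The title refers to Huber hinge errors, which the abstract does not define. The sketch below uses one common parameterization of the Huber hinge, with a parameter kappa greater than -1; this form and the function name are assumptions made here for illustration and may differ in detail from the loss used in the talk. The function is quadratic near the margin and linear far from it, so it is smooth, which is convenient for majorization.

```python
def huber_hinge(q, kappa=0.0):
    """Huber hinge error for a margin value q (assumed parameterization).

    Zero for q > 1, quadratic for -kappa < q <= 1, linear for q <= -kappa.
    """
    if kappa <= -1:
        raise ValueError("kappa must be greater than -1")
    if q > 1:
        return 0.0
    if q > -kappa:
        return (1 - q) ** 2 / (2 * (kappa + 1))
    return 1 - q - (kappa + 1) / 2
```

The two non-zero pieces meet at q = -kappa with the common value (kappa + 1) / 2, and as kappa approaches -1 the function approaches the absolute hinge max(0, 1 - q).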

References

GROENEN, P.J.F., NALBANTOV, G. and BIOCH, J.C. (2008): SVM-Maj: A Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification, 2, 17–43.

Keywords

MULTICLASS SUPPORT VECTOR MACHINES, ITERATIVE MAJORIZATION, CLASSIFICATION

Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR [email protected] · Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR [email protected]


Power-Stress for Multidimensional Scaling

Patrick J.F. Groenen¹ and Jan de Leeuw²

Abstract

Several loss functions exist for multidimensional scaling. Two important ones are based on the sum of squared differences of distances and dissimilarities (Stress) and on differences of squared distances and squared dissimilarities (S-Stress). The Power-Stress loss function incorporates both, as it takes the sum of squared differences of distances and dissimilarities raised to some power $\lambda$, that is,

\[ \sigma_{\text{power}}(X) = \sum_{i<j} w_{ij} \left( \delta_{ij}^{\lambda} - d_{ij}^{\lambda}(X) \right)^2, \]

where $X$ is the $n \times p$ configuration, the $w_{ij}$ are known nonnegative weights, the $\delta_{ij}$ are known dissimilarities, and $d_{ij}(X)$ is the Euclidean distance between rows $i$ and $j$ of $X$. Thus, we fit distances raised to some power $\lambda \geq 1$ to the dissimilarities raised to the same power. Larger choices of $\lambda$ emphasize the fit of larger dissimilarities; conversely, smaller $\lambda$ spreads the emphasis more evenly over the dissimilarities. In this paper, we propose a new majorization algorithm to minimize the Power-Stress loss function. The core of this algorithm is the majorization of $\sum_{i<j} w_{ij} d_{ij}^{2\lambda}(X)$ by a term of the form $\mathrm{tr}[(X'X)^{\lambda}]$. As with any majorizing algorithm, a monotonically nonincreasing series of Power-Stress values is obtained that in almost all practical situations ends up in a local minimum. We show some of the main steps in the derivation of this algorithm and provide some numerical comparisons.
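The loss function itself is straightforward to evaluate, which is useful for checking that an iterative algorithm really produces nonincreasing values. The sketch below only evaluates Power-Stress for a given configuration; it does not implement the majorization algorithm, and the function and argument names are hypothetical. Setting the power to 1 gives Stress and setting it to 2 gives S-Stress.

```python
import math


def power_stress(X, delta, lam=1.0, weights=None):
    """Evaluate Power-Stress for configuration X and dissimilarities delta.

    X is a list of n points (tuples of length p); delta and weights are
    n x n symmetric nested lists. Missing weights default to 1.
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w = 1.0 if weights is None else weights[i][j]
            d = math.dist(X[i], X[j])
            total += w * (delta[i][j] ** lam - d ** lam) ** 2
    return total
```

For a 3-4-5 right triangle with a single dissimilarity of 4 where the distance is 3, the loss is 1 for a power of 1 and 49 for a power of 2, which shows how a larger power magnifies misfit on larger values.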

Keywords

MULTIDIMENSIONAL SCALING, MAJORIZATION, STRESS, S-STRESS

Econometric Institute, Erasmus University, Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The [email protected] · Department of Statistics, University of California, Los Angeles, CA 90095-1554, [email protected]


Variable Selection in Cluster Analysis Using Resampling Techniques: a Proposal

Hans-Joachim Mucha¹ and Hans-Georg Bartel²

Abstract

Variable selection is a well-known problem in many areas of multivariate statistics such as classification and regression. The hope is that the structure of interest may be contained in only a small subset of variables. In contrast to supervised classification such as discriminant analysis, variable selection in cluster analysis is quite difficult because nothing is known about the true classes. In addition, variable selection in cluster analysis is closely related to the main difficult problem of determining the number of clusters present in the data (Hennig, 2007). The latter is the subject of many investigations and papers that consider resampling techniques as practical tools (Jain and Moreau, 1987). We propose a new and general approach to variable selection using non-parametric resampling techniques. General means that it can be applied to any cluster analysis method. The starting point is an assessment of the evidence of univariate clusterings. Concretely, we look for the most stable univariate clustering (i.e., the best variable) with respect to indexes such as the adjusted Rand index. Here, additionally, one gets a rough idea of the number of clusters K. Subsequently, we look for additional variables as long as the stability of the clustering improves. To be more precise, we then find the most stable bivariate (and, furthermore, multivariate) clustering. We demonstrate the performance of our proposal on both synthetic and real data. Here, different resampling techniques such as nonparametric bootstrapping and subsampling are used (Mucha and Bartel, 2013).
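The search described above has the shape of a greedy forward selection. The skeleton below is a sketch of that control flow only, under the assumption that a stability score is available as a function of the variable subset; the names `forward_select` and `stability` are hypothetical. In practice `stability` would cluster resampled data on the given variables and return, for example, an average adjusted Rand index, which is not implemented here.

```python
def forward_select(variables, stability):
    """Greedy forward selection of variables by clustering stability.

    `stability` maps a list of variable names to a score (higher is better).
    Starts from the best single variable and adds one variable at a time
    for as long as the score strictly improves.
    """
    selected, best = [], float("-inf")
    remaining = list(variables)
    while remaining:
        candidate = max(remaining, key=lambda v: stability(selected + [v]))
        score = stability(selected + [candidate])
        if score <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected, best
```

The first pass through the loop corresponds to finding the most stable univariate clustering, and each later pass to the bivariate and higher-dimensional steps.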

References

HENNIG, C. (2007): Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis 52: 258–271.
JAIN, A. K. and MOREAU, J. V. (1987): Bootstrap technique in cluster analysis. Pattern Recognition 20: 547–568.
MUCHA, H.-J. and BARTEL, H.-G. (2013): Soft Bootstrapping in Cluster Analysis and Its Comparison with Other Resampling Methods. In: M. Spiliopoulou, L. Schmidt-Thieme and R. Janning (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Springer, Berlin, forthcoming.

Keywords

CLUSTERING, VARIABLE SELECTION, RESAMPLING

Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Mohrenstraße 39, Germany, [email protected] · Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]


Adversarial Risk Analysis in Auctions

David Banks

Abstract

Adversarial risk analysis (ARA) is a decision-analytic approach to strategic games. It builds a Bayesian model for the decision process of an opponent, with subjective distributions over all unknown quantities. Then the analyst maximizes his expected utility with respect to the distribution over the action space induced by the model for the opponent and the corresponding uncertainties. This talk applies the ARA perspective to auctions, an important and well-studied class of strategic games. Under some assumptions, the results align with Bayes Nash equilibrium solutions. But the approach also introduces an interesting new class of auction problems, which are both realistic and mathematically challenging.
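The expected-utility step can be illustrated with a deliberately simple case. The sketch below assumes a first-price sealed-bid auction, a risk-neutral bidder with a known private value, and a single opponent whose bid is described by a subjective distribution function supplied by the analyst. All of these are assumptions made here for illustration, as are the function names and the uniform belief; the talk's models are richer.

```python
def best_bid(value, opponent_bid_cdf, grid_size=1000):
    """Bid in [0, value] maximizing (value - bid) * P(opponent bids below bid).

    opponent_bid_cdf encodes the analyst's subjective belief about the
    opponent's bid; the search is a plain grid search.
    """
    best_b, best_u = 0.0, 0.0
    for step in range(grid_size + 1):
        bid = value * step / grid_size
        utility = (value - bid) * opponent_bid_cdf(bid)
        if utility > best_u:
            best_b, best_u = bid, utility
    return best_b, best_u


def uniform_cdf(bid):
    """Subjective belief: the opponent's bid is uniform on [0, 1]."""
    return min(max(bid, 0.0), 1.0)
```

With a value of 0.8 and the uniform belief, the expected utility (0.8 - b) b is maximized at half the value, b = 0.4. Replacing `uniform_cdf` by a belief derived from a model of the opponent's own reasoning is where the adversarial part of the analysis enters.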

Duke University


Gaussian Process Classification And Duration Models For Credit Risk

Silvia Figini¹ and Aki Vehtari²

Abstract

Credit risk models are used to evaluate the insolvency risk caused by credits that enter into default. Many models for credit risk have been developed over the past few decades. In this paper, we focus on those models that can be formulated in terms of the probability of default by using semi-parametric and non-parametric survival analysis models (see e.g. Figini and Fantazzini 2009).

In order to write the default probability in terms of the conditional distribution function of the time to default, in this paper we compare classical survival models with Gaussian Processes (GPs), which are powerful tools for probabilistic modeling. As pointed out in Vehtari et al. (2013), despite their attractive theoretical properties, GPs pose practical challenges in their implementation.

In this contribution we compare, in terms of cross validation (see e.g. Vehtari and Ojanen 2012), the results of the survival model with those of the GP. An empirical study, based on real data, illustrates the performance of each model.

References

FIGINI, S. and FANTAZZINI, D. (2009): Random Survival Forest models for SME Credit Risk Measurement, Methodology and computing in applied probability, 11, 29–45.
VEHTARI, A., VANHATALO, J., RIIHIMAKI, J., HARTIKAINEN, J., JYLANKI, P. and TOLVANEN, V. (2013): GPstuff: A Toolbox for Bayesian Modeling with Gaussian Processes, Journal of Machine Learning Research Machine Learning Open Source Software, in press.
VEHTARI, A. and OJANEN, J. (2012): A survey of Bayesian predictive methods for model assessment, selection and comparison, Statistics Surveys, 6:142-228.

Keywords

SURVIVAL ANALYSIS, GAUSSIAN PROCESS, CROSS VALIDATION, PROBABILITY OF DEFAULT, CREDIT RISK.

University of Pavia [email protected] · University of [email protected]


Model Averaging For Credit Risk Modelling

Silvia Figini¹ and Marika Vezzoli²

Abstract

When many competing models are available for estimation, model averaging represents an alternative to model selection. Although model averaging approaches have been present in statistics for many years, only recently have they started to receive attention in applications, especially in credit risk modelling (see e.g. Figini and Fantazzini 2009). In this paper we investigate model averaging and ensemble learning in order to obtain a well calibrated credit risk model in terms of predictive accuracy. We compare Bayesian (see e.g. Steel, 2011 and the references therein) and classical model averaging approaches, like Random Forest (Breiman, 2001), Boosting (Freund and Schapire, 1996), and CRAGGING (Vezzoli and Zuccolotto, 2011), with the final aim of improving the predictive performance of the models.

References

BREIMAN, L. (2001): Random Forests, Machine Learning, 45, 5–32.
FIGINI, S. and FANTAZZINI, D. (2009): Random Survival Forest models for SME Credit Risk Measurement, Methodology and computing in applied probability, 11, 29–45.
FREUND, Y. and SCHAPIRE, R.E. (1996): Experiments with a new boosting algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156. San Francisco: Morgan Kaufman.
STEEL, M.F.J. (2011): Bayesian Model Averaging and Forecasting, Bulletin of E.U. and U.S. Inflation and Macroeconomic Analysis, 30–41.
VEZZOLI, M. and ZUCCOLOTTO, P. (2011): CRAGGING measures of variable importance for data with hierarchical structure, in S. Ingrassia, R. Rocci, M. Vichi (Eds.), New Perspectives in Statistical Modeling and Data Analysis, 393–400. Springer.

Keywords

MODEL AVERAGING, PREDICTIVE PERFORMANCE, CLASSIFICATION, ENSEMBLE METHODS, WEAK LEARNER

University of Pavia [email protected] · University of [email protected]


Multiobjective Optimization Of Financing Household Goals With Multiple Investment Programs

Lukasz Feldman¹, Radoslaw Pietrzyk², and Pawel Rokita²

Abstract

This article proposes a technique for facilitating life-long financial planning for a household by finding the optimal match between unit-linked products and multiple financial goals with different realization terms and magnitudes. The problem is, moreover, one of multicriteria optimization. The first objective is compliance of the expected term structure of cumulated net cash flow throughout the life cycle of the household with its life-length risk aversion and bequest motive. The second is financial liquidity in all periods under expected values of all stochastic factors. The third is minimization of net cash flow volatility. The fourth is minimization of the costs of the investment plan combination. The result is a set of unit-linked investment programs, accompanied by information on which programs are destined to cover which financial goal. Payoffs of one program may be used to cover more than one goal, and the order may be other than sequential.

References

CAMPBELL, J.Y. (2006): Household finance. Journal of Finance, Vol. 61, No. 4, 1553-1604.
CARROLL, C. (2006): The Method of Endogenous Gridpoints for Solving Dynamic Stochastic Optimization Problems. Economics Letters, Vol. 91, Issue 3, 312-320.
CORRIGAN, J., MATTERSON, W. and NANDI, S. (2009): A Holistic Framework for Life Cycle Financial Planning. Milliman.

Keywords

MULTIOBJECTIVE OPTIMIZATION, PERSONAL FINANCE, ASSET SELECTION, INTERTEMPORAL CHOICE

Wroclaw University of [email protected] · Wroclaw University of [email protected] · Wroclaw University of [email protected]


Power Of Skewness Tests In The Presence Of Fat Tailed Financial Distributions

Krzysztof Piontek

Abstract

The best known and most widely used test of skewness is the Jarque-Bera approach. However, this test is not reliable for discriminating between symmetric and asymmetric return distributions in the presence of the leptokurtosis that is usually observed in financial data. Testing skewness is still an open and significant issue.

The goal of this paper is to investigate the power of some skewness tests when applied to fat-tailed return distributions, which are typical in finance. Four approaches to testing skewness in the whole return distribution are briefly reviewed and discussed: the classical Jarque-Bera test, the adjusted Jarque-Bera test (taking fatter tails into consideration), a test based on the Pearson type IV distribution, and the Peiro test, which makes no assumption about the type of distribution.

In the empirical part, the power of each test is estimated by using Monte Carlo simulations. Different asymmetric and fat-tailed distributions are used for data generation. The frequency of rejecting the null hypothesis of symmetry of the distribution, when it is false, is used as an approximate value of the power of the test. Data series with different numbers of observations and different skewness values are simulated.

The last part summarizes the results, compares the values obtained with the different test methods and gives hints for risk managers.
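The Monte Carlo estimate of power described above amounts to counting rejections over repeated simulated series. The sketch below does this for the skewness component of the classical Jarque-Bera statistic, using only the standard library. The function names, the 1.96 critical value (two-sided 5% level under the normal approximation) and the choice of generator are assumptions made here for illustration; none of the adjusted tests is implemented.

```python
import math
import random


def sample_skewness(x):
    """Moment-based sample skewness m3 / m2^(3/2)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def rejects_symmetry(x, critical=1.96):
    """Skewness part of Jarque-Bera: sqrt(n/6) * S is roughly N(0, 1) under normality."""
    z = sample_skewness(x) * math.sqrt(len(x) / 6)
    return abs(z) > critical


def estimate_power(draw, n_obs, n_rep=2000, seed=0):
    """Share of simulated series of length n_obs for which symmetry is rejected.

    draw(rng) returns one observation from the data-generating distribution.
    """
    rng = random.Random(seed)
    rejections = sum(
        rejects_symmetry([draw(rng) for _ in range(n_obs)])
        for _ in range(n_rep)
    )
    return rejections / n_rep
```

Calling `estimate_power(lambda rng: rng.expovariate(1.0), n_obs=100)` estimates the power against a strongly right-skewed alternative. Passing a symmetric fat-tailed generator to the same function gives the empirical size of the test rather than its power, which is the quantity that becomes unreliable under leptokurtosis.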

References

ASAI, M. and DASHZEVEG, U. (2008): Distribution-Free Test for Symmetry with an Applic. to S&P Index Returns. Applied Economics Letters, 15(6), 461–464.
BERA, A. and PREMARATNE, G. (2001): Adjusting the Tests for Skewness and Kurtosis for Distributional Misspecifications, UIUC-CBA Research WP No. 01-0116.
BRYS, G., HUBERT, M. and STRUYF, A. (2003): A comparison of some new measures of skewness. Developments in Robust Statistics, 98–113.

Keywords

TESTS OF SYMMETRY, RETURN DISTRIBUTION, FAT TAILS

Department of Financial Investments and Risk Management, Wrocław University of Economics, ul. Komandorska 118/120, Wroclaw, [email protected]


Robust Clustering for Anti-Fraud Analysis

Andrea Cerioli¹ and Domenico Perrotta²

Abstract

We address the problem of clustering the transactions that arise in international trade markets, from the point of view of anti-fraud analysis. These observations typically follow a mixture of regression lines, corresponding to different market conditions. Outliers and high leverage points are also present and may provide information about anomalies such as fraudulent transactions. In order to eliminate the effect of outliers on the classification of “regular” trade, and in order to properly highlight them, robust methods are needed (Riani et al., 2008; García-Escudero et al., 2009, 2010). However, robust clustering techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is another likely occurrence in international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.

References

GARCÍA-ESCUDERO, L. A., GORDALIZA, A., SAN MARTÍN, R., VAN AELST, S. and ZAMAR, R. (2009): Robust linear clustering. Journal of the Royal Statistical Society B, 71, 301–319.
GARCÍA-ESCUDERO, L. A., GORDALIZA, A., SAN MARTÍN, R. and MAYO-ISCAR, A. (2010): Robust clusterwise linear regression through trimming. Computational Statistics and Data Analysis, 54, 3057–3069.
RIANI, M., CERIOLI, A., ATKINSON, A. C., PERROTTA, D. and TORTI, F. (2008): Fitting mixtures of regression lines with the forward search. In: Fogelman-Soulié, F. et al. (Eds.): Mining Massive Data Sets for Security. IOS Press, Amsterdam, 271-286.

Keywords

INTERNATIONAL TRADE, OUTLIERS, RLGA, TCLUST

Dipartimento di Economia, Università di Parma, [email protected] · European Commission, Joint Research Centre, Ispra, [email protected]


An Extended Gravity Approach To Examining Internal Migrations. The Case Of Poland

Justyna Wilk¹ and Michał Pietrzak²

Abstract

Internal migrations play a significant role in regional development. They determine the size and structure of human resources and stimulate regional labour markets, among other effects. The aim of this paper is to formulate an approach that uses an econometric gravity model and multivariate data analysis methods to determine dependencies between socio-economic aspects and migration phenomena. An attempt is also made to apply it to the analysis of internal migrations in Poland in 2004-2011.

The gravity model considers migration flows from origin to destination and explains their conditions. The economic, household and labour market situation, innovativeness and living conditions are examined in this paper. These potential push and pull factors of population flows, complex in their nature, are defined using taxonomic synthetic measures. The significance, intensity and direction of impact of the socio-economic aspects, and also of geographical distance, on population inflows and outflows are examined. Two periods of time are distinguished to identify relationships between the economic cycle and the intensity and conditions of domestic migrations.
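Gravity models of this kind are often written in log-linear form. The specification below is a generic sketch assumed here for illustration; the symbols $M_{ij}$, $s_{ik}$, $d_{ij}$ and the coefficients are hypothetical, and the authors' exact specification may differ.

```latex
% Flow from origin i to destination j explained by K synthetic measures
% at the origin (push), at the destination (pull), and by distance.
\ln M_{ij} = \beta_0
  + \sum_{k=1}^{K} \beta_k \ln s_{ik}
  + \sum_{k=1}^{K} \gamma_k \ln s_{jk}
  + \theta \ln d_{ij}
  + \varepsilon_{ij}
```

Here $M_{ij}$ is the migration flow from region $i$ to region $j$, $s_{ik}$ is the synthetic measure of socio-economic aspect $k$ in region $i$, $d_{ij}$ is the geographical distance between the regions, and $\varepsilon_{ij}$ is an error term. The signs and significance of $\beta_k$, $\gamma_k$ and $\theta$ then describe the direction and strength of the push, pull and distance effects.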

References

HWANG, C.L. and YOON, K. (1981): Multiple Attribute Decision Making Methods and Applications. Springer, Berlin Heidelberg.
LEE, E.S. (1966): A Theory of Migration. Demography, Vol. 3, No. 1, 47–57.
LeSAGE, J.P. and PACE, R.K. (2008): Spatial Economic Modeling of Origin-Destination Flows. Journal of Regional Science, Vol. 48(5), 941–967.
WHITE, M.J. and LINDSTROM, D.P. (2006): Internal Migration. In: D.L. Poston, M. Micklin (Eds.): Handbook of Population. Springer, Berlin-Heidelberg, 311–345.

Keywords

INTERNAL MIGRATION, REGIONAL DEVELOPMENT, GRAVITY MODEL, SYNTHETIC MEASURE

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] · Nicolaus Copernicus University in Torun, Department of Econometrics and Statistics, Gagarina 11, 87-100 Torun, Poland, [email protected]


Clustering of US counties based on their demographic structures

Simona Korenjak-Cerne1, Vladimir Batagelj2, Nataša Kejžar3

Abstract

The population pyramid is a very informative graphical representation of the demographic structure of a particular region. In this paper we present the use of a symbolic hierarchical clustering method, implemented in the R package clamix, in the study of the demographic structure of U.S. counties. The presented approach offers additional insight into the data and, as such, is especially important for expert demographers. The analysis will be presented on data from the latest US census (2010), and changes over time between the demographic structures of 2000 and 2010 will also be observed. Another analysis, which also considers the distributions by ethnicity, will be carried out and compared with the results of the age-sex only analysis.

References

BATAGELJ, V. (1988): Generalized Ward and Related Clustering Problems. In: H.H. Bock (Ed.): Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 67–74.
BATAGELJ, V. and KEJŽAR, N. (2010): clamix - Clustering Symbolic Objects. Program in R, Available from: https://r-forge.r-project.org/projects/clamix/
BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis. Conceptual statistics and data mining. Wiley, New York.
KORENJAK-CERNE, S. and BATAGELJ, V. (2002): Symbolic Clustering of Large Datasets. In: K. Jajuga, A. Sokołowski and H.H. Bock (Eds.): Classification, Clustering, and Data Analysis. Springer, Berlin, 319–327.
U.S. Census Bureau, Census 2000 and Census 2010: http://www.census.gov/population/age/data/decennial.html

Keywords

SYMBOLIC DATA ANALYSIS, HIERARCHICAL CLUSTERING, POPULATION PYRAMID, SYMBOLIC OBJECT, R-PACKAGE CLAMIX

University of Ljubljana, Faculty of Economics, [email protected] · University of Ljubljana, Faculty of Mathematics and Physics, [email protected] · University of Ljubljana, Faculty of Medicine, [email protected]


Strategic, Motivational And Emotional Aspects Of University Study. A Latent Class Approach

Anna Giraldo1, Silvia Meggiolaro2, and Elisa Visentin3

Abstract

University outcomes are strictly related to students’ attitudes, motivations and emotions towards university study. These factors, as well as the personal and household characteristics of the students, deeply influence the study path. In this work we use a latent class approach (McCutcheon, 1987) to find the underlying latent factors that summarize a series of items investigating students’ position as regards four domains: strategic skills, emotions, motivations, and resilience. Data come from a CAWI survey conducted in 2012 on a cohort of students enrolled in academic year 2006/07 at Padova University (Clerici et al., 2012). Results show that the underlying latent factors are in line with the psychological literature and that they can be used in regression models as explanatory variables, along with the personal and household characteristics of the students, to explain students’ university outcomes in more depth.

References

CLERICI, R., DA RE, L., GIRALDO, A., MEGA, C., VISENTIN, E. (2012): Aspetti strategici, motivazionali ed emotivi e successo accademico. Progettazione e conduzione di un’indagine sugli studenti dell’Università di Padova. Technical Report Series, 1, Department of Statistical Sciences, University of Padova.
McCUTCHEON, A.L. (1987): Latent Class Analysis. Sage, Newbury Park.

Keywords

LATENT CLASS FACTOR ANALYSIS, UNIVERSITY STUDY

Department of Statistical Science, via C. Battisti 241, [email protected] · Department of Statistical Science, via C. Battisti 241, [email protected] · Department of Philosophy, Sociology, Education and Applied Psychology, Via Beato Pellegrino 8, [email protected]


The Comparative Log–Linear Analysis Of Unemployment In Poland In 2004–2011

Justyna Brzezinska

Abstract

In categorical data analysis we can analyze categorical variables simultaneously in multi–way tables. Such tables present a special problem of analysis and interpretation, which is usually connected with the number of variables. This paper presents the use of log–linear models, which make it possible to analyze the independence and the path of association between any number of categorical variables. Different types of independence can be analyzed, such as conditional independence or homogeneous association. There are several criteria for testing the goodness–of–fit of the model: the chi-square statistic, the likelihood ratio, and information criteria (AIC, BIC).

With the rising unemployment rate in recent years, unemployment is one of the most important economic and social problems in Poland. A strong differentiation is observed in the unemployment rates for the various regions of Poland, especially for young people and university graduates, as well as between males and females. The log–linear analysis will be presented on an example from the Central Statistical Office of Poland. The comparative log–linear analysis will be conducted for multi–way tables on unemployment in 2004–2011. All calculations will be conducted in R with the use of the loglm function in the MASS library.
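The goodness-of-fit criteria mentioned above can be illustrated for the simplest case, the independence log-linear model for a two-way table, whose fitted counts have the closed form (row total × column total) / n. The sketch below is in Python rather than R, does not reproduce the `loglm` interface, and uses a made-up table; the function name `independence_fit` is hypothetical.

```python
import numpy as np

def independence_fit(table):
    """Fit the independence model to a two-way contingency table.

    Returns the expected counts, the Pearson chi-square statistic,
    the likelihood-ratio statistic G^2, and the degrees of freedom.
    Assumes every row and column total is positive."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    pearson = ((obs - expected) ** 2 / expected).sum()
    nonzero = obs > 0  # empty cells contribute 0 to G^2
    g2 = 2.0 * (obs[nonzero] * np.log(obs[nonzero] / expected[nonzero])).sum()
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return expected, pearson, g2, df

# Made-up 2 x 2 table, for illustration only
expected, pearson, g2, df = independence_fit([[10, 20], [30, 40]])
```

Both statistics are compared with a chi-square distribution on `df` degrees of freedom; models for higher-way tables generally need iterative fitting, which this sketch does not attempt.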

References

CHRISTENSEN, R. (1997): Log-linear Models and Logistic Regression. Springer–Verlag, New York.
KNOKE, D., BURKE, P. (1997): Log-linear Models. Sage University Paper Series on Quantitative Applications in the Social Science, series no. 07-020, Beverly Hills and London: Sage.

Keywords

LOG–LINEAR MODELS, MULTI-WAY CONTINGENCY TABLES, COMPARATIVE LOG–LINEAR ANALYSIS, UNEMPLOYMENT IN POLAND

Faculty of Management, University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]


Measurement of Quality in Cluster Analysis

Christian Hennig1

Abstract

There is much work on benchmarking in supervised classification, where “quality” can generally be measured as a function of misclassification probabilities. In unsupervised classification (cluster analysis), the measurement of quality is much more problematic, because in reality there is no true class label which can be used for cross-validation and the like. Furthermore, there is no guarantee that in situations where there is a true classification (for example, where benchmark data sets from supervised classification are used to assess clustering methods, or where data is simulated from a mixture distribution), this classification is unique. There can be a number of different reasonable clusterings of the same data, depending on the research aim.

I will discuss the use of statistics for the assessment of clustering quality that can be computed from classified data without making reference to “the true clusters”. Such statistics have traditionally been called “cluster validation indexes” (such as the average silhouette width), and have sometimes been used for estimating the number of clusters. Most of the traditional statistics try to balance various aspects of a clustering against each other (such as within-cluster homogeneity and between-cluster separation), but in order to characterize what advantages and disadvantages a clustering has, it is useful to formalize different aspects of cluster quality separately. This can also be used to explain misclassification rates, in cases where “true” clusterings exist, as a function of the features of these clusterings.
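The average silhouette width named above is an example of an index that balances homogeneity against separation in a single number. A minimal sketch of its computation follows, assuming Euclidean distances and at least two clusters; the function name is hypothetical and the four data points are made up.

```python
import numpy as np

def average_silhouette_width(points, labels):
    """Average silhouette width under Euclidean distance.

    For each point, a is the mean distance to the other members of its
    own cluster and b is the smallest mean distance to another cluster;
    its silhouette is (b - a) / max(a, b). Singleton clusters get width 0.
    Assumes at least two clusters and no cluster made only of duplicates."""
    X = np.asarray(points, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    widths = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: width stays 0 by convention
        a = dist[i, own].sum() / (own.sum() - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        widths[i] = (b - a) / max(a, b)
    return widths.mean()

# Two well-separated groups on a line (made-up data)
asw = average_silhouette_width([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
```

Because the index folds homogeneity (a) and separation (b) into one ratio, two clusterings with the same value can differ in which aspect they do well on, which is the motivation for measuring the aspects separately.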

Keywords

BENCHMARKING, CLUSTER VALIDITY, MISCLASSIFICATION RATE, HOMOGENEITY, SEPARATION, STABILITY

Department of Statistical Science, University College London, [email protected]


Resampling Methods for Exploring Cluster Stability

Friedrich Leisch

Abstract

Model diagnostics for cluster analysis is still a developing field because of its exploratory nature. Numerous indices have been proposed in the literature to evaluate goodness-of-fit, but no clear winner that works in all situations has been found yet. Derivation of (asymptotic) distribution properties is not possible in most cases. Over the last decade, several resampling schemes which cluster repeatedly on bootstrap samples or random splits of the data and compare the resulting partitions have been proposed in the literature. These resampling schemes provide an elegant framework to computationally derive the distribution of interesting quantities describing the quality of a partition. Due to the increasing availability of parallel processing even on standard laptops and desktops, these simulation-based approaches can now be used in everyday cluster analysis applications. We give an overview of existing methods, show how they can be represented in a unifying framework including an implementation in the R package flexclust, and compare them on simulated and real-world data. Special emphasis will be given to the stability of a partition, i.e., given a new sample from the same population, how likely is it to obtain a similar clustering?

Keywords

CLUSTER ANALYSIS, RESAMPLING METHODS, BOOTSTRAP, R

Institute for Applied Statistics and Computing, University of Natural Resources and Life Sciences, Vienna, Peter-Jordan-Strasse 82, 1190 Vienna, [email protected]


The Effect Of Data Generation On Our Understanding Of Clustering Algorithms

Doug Steinley1

Abstract

Often, benchmarking in clustering and classification is conducted by comparing and contrasting various algorithms and procedures on data sets with known structure via simulation. These comparisons take place at both a broad level (e.g., a full experimental design) and a narrow level (e.g., a couple of generated examples). Regardless of the approach, it is found that the evaluation of the performance of methods is closely linked to the nature of the generation. Results are provided that quantify the “robustness” of various performance critiques based on how stable the assessment of clustering algorithms is across generation schemes.

Keywords

CLUSTERING ALGORITHMS, BENCHMARKING

University of Missouri, Columbia


Clustering Constrained Symbolic Objects Constrained By Rules

Marc Csernel1

Abstract

Clustering is one of the most common operations in data analysis, while constrained clustering is not so common. We present here a clustering method in the framework of Symbolic Data Analysis (S.D.A.) which makes it possible to cluster symbolic data. Such data can be constrained by relations between the variables, expressed by rules which encode the domain knowledge. But such rules can induce a combinatorial increase in computation time according to their number. We will present a way to cluster such data in quadratic time. This method is based first on the decomposition of the data according to the rules, called Normal Symbolic Form; we then apply to the data a clustering algorithm based on dissimilarities.

References

BOCK, H.-H. and E. DIDAY (2000). Analysis of Symbolic Data: Explanatory Methods for Extracting Statistical Information from Complex Data. Heidelberg: Springer.
CSERNEL, M. and F. A. T. de CARVALHO (1999). Usual operations with symbolic data under normal symbolic form. Applied Stochastic Models in Business and Industry 15(4), 241–257.
GORDON, A. D. (1999). Classification. Boca Raton, Florida: Chapman and Hall/CRC.
LECHEVALIER, Y. (1974). Optimisation de quelques criteres en classification automatique et application a l’etude des modifications des proteines seriques en pathologie clinique. Ph. D. thesis, Universite Paris-VI.

Keywords

CLUSTERING, RULES, SYMBOLIC OBJECTS, NORMAL FORM

Inria-Rocquencourt, BP-105-78180 Le Chesnay, [email protected]


Conceptual Clustering with Interval Representation

Paula Brito1 and Géraldine Polaillon2

Abstract

In this work, we propose a hierarchical conceptual clustering method, where each formed cluster corresponds to a concept, i.e., a pair (extent, intent), based on the principles of the methods in (Brito, 1995). The method makes it possible to consider simultaneously data presenting real or interval-valued numerical values, categorical ordered values and/or probability/frequency distributions on a set of categories. Concepts are obtained by a Galois connection with generalisation by intervals, which allows dealing with different variable types in a common framework (see Brito and Polaillon, 2011). In the case of distributional data, the obtained concepts are more homogeneous and more easily interpretable than those obtained by using the maximum and minimum operators previously proposed (Brito and Polaillon, 2005). A measure of generality of a concept is defined similarly for all these variable types, as a weighted mean of variable-wise values. An example illustrates the proposed method.

References

BRITO, P. (1995). Symbolic Objects: Order Structure and Pyramidal Clustering. Annals of Operations Research, 55, 277–297.
BRITO, P. and POLAILLON, G. (2005). Structuring Probabilistic Data by Galois Lattices. Mathématiques et Sciences Humaines - Mathematics and Social Sciences, (43ème année) 169, (1), 77–104.
BRITO, P. and POLAILLON, G. (2011). Homogeneity and Stability in Conceptual Analysis. In: A. Napoli and V. Vychodil (Eds.): Proc. of the 8th International Conference on Concept Lattices and Their Applications. INRIA, Nancy, France, 251–263.

Keywords

CONCEPTUAL CLUSTERING, INTERVAL DATA, DISTRIBUTIONAL DATA, SYMBOLIC DATA

Faculdade de Economia & LIAAD-INESC Porto LA, Universidade do Porto, Portugal [email protected] · SUPELEC Science des Systèmes (E3S) - Département Informatique, [email protected]


Hierarchical Symbolic Cluster Analysis with Quantile Function Representation

Yusuke Matsui1, Hiroyuki Minami2, and Masahiro Mizuta2

Abstract

In symbolic data analysis, we can use various types of variables, e.g., interval valued variables, categorical multi valued variables, and distribution valued variables. Brito et al. (2010) proposed that the quantile (or quartile) is a powerful tool for those variables.

In this paper, we focus on its extended representation, the quantile function, and propose a hierarchical symbolic cluster analysis.

We assume each object is represented by a p-dimensional distribution. We derive a distribution valued dissimilarity between distributions. We exploit the quantile function for the distribution valued dissimilarity and develop a clustering method with the function, based on Mizuta (2011). We also demonstrate it with some example data.
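To make the idea of comparing distributions through their quantile functions concrete, the sketch below computes a scalar distance between two empirical quantile functions on a grid of probabilities. This is only an illustration for one variable: the abstract derives a distribution valued dissimilarity, which is not reproduced here, and the L2 form, the grid size and the function name are assumptions.

```python
import numpy as np

def quantile_dissimilarity(sample_a, sample_b, n_grid=99):
    """Root mean squared difference between the empirical quantile
    functions of two samples, evaluated at n_grid equally spaced
    probabilities strictly inside (0, 1)."""
    probs = np.arange(1, n_grid + 1) / (n_grid + 1)
    qa = np.quantile(np.asarray(sample_a, dtype=float), probs)
    qb = np.quantile(np.asarray(sample_b, dtype=float), probs)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

# Made-up samples: the second is the first shifted by 2
base = np.linspace(0.0, 1.0, 101)
same = quantile_dissimilarity(base, base)
shifted = quantile_dissimilarity(base, base + 2.0)
```

A pure location shift moves every quantile by the same amount, so the distance equals the size of the shift. A matrix of such pairwise values could then be passed to any dissimilarity-based hierarchical clustering routine.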

References

BRITO, P. and ICHINO, M. (2010): Symbolic clustering based on quantile representation. Proceedings of COMPSTAT2010, Paris, France.
MIZUTA, M. (2011): Hierarchical clustering for distribution valued dissimilarity data. Proceedings of Joint Conference of GfKl, DAGM and IFCS.

Keywords

SYMBOLIC DATA ANALYSIS, DATA MINING, DISTRIBUTION VALUED DISSIMILARITY

Graduate School of Information Science and Technology, Hokkaido University, [email protected] · Information Initiative Center, Hokkaido University, [email protected], [email protected]


Multilevel Consumer Preference Model on Symbolic Data

Adam Sagan1, Marcin Pełka2, and Aneta Rybicka2

Abstract

Multilevel data arise from a hierarchical and contextual data structure that comes from heterogeneous populations and complex sampling. This type of data is very popular in international, educational and marketing research (pupils nested in schools, individuals nested in households, etc.). Multilevel modeling usually involves classical types of data and is based on the decomposition of the covariance matrix into “within” and “between” submatrices.
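The within/between decomposition mentioned above can be illustrated on scatter (sum-of-squares and cross-products) matrices, for which the total equals the within-group part plus the between-group part exactly. The sketch uses classical numeric data, not symbolic data, and the function name and simulated values are assumptions for the example.

```python
import numpy as np

def scatter_decomposition(X, groups):
    """Split the total scatter matrix of X into within-group and
    between-group parts, so that total = within + between."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    within = np.zeros((p, p))
    between = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        group_mean = Xg.mean(axis=0)
        centered = Xg - group_mean
        within += centered.T @ centered
        diff = (group_mean - grand_mean)[:, None]
        between += len(Xg) * (diff @ diff.T)
    total = (X - grand_mean).T @ (X - grand_mean)
    return total, within, between

# Simulated individuals (rows) nested in four households (groups)
rng = np.random.default_rng(1)
households = np.repeat(np.arange(4), 5)
data = rng.normal(size=(20, 3)) + households[:, None]
total, within, between = scatter_decomposition(data, households)
```

Dividing the matrices by suitable degrees of freedom gives the corresponding covariance estimates.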

The main aim of the paper is to propose a new way of analyzing structured data (multilevel-like data) and to use symbolic data in multilevel modeling of family members’ consumer preferences. Symbolic data analysis makes it possible to represent and model two types of objects: single individuals (first-level objects) and aggregate objects (super-individuals, second-level objects). This makes it possible to analyze dependencies, clusters, etc. not only at the individual level of the data but also at the aggregate level. Moreover, symbolic data analysis makes it possible to represent data in a more detailed way and to keep all the information from the individual level at aggregate levels.

References

BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Explanatory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
NAKANO, J. (2012): Regression Analysis for Aggregated Symbolic Data. In: J. Arroyo, C. Maté, P. Brito and M. Noirhomme-Fraiture (Eds.): 3rd Workshop in Symbolic Data Analysis, 33.
STEENBERGEN, M.R., JONES, B.S. (2002): Modeling Multilevel Data Structures. American Journal of Political Science, Vol. 46, No. 1, pp. 218–237.

Keywords

PREFERENCES, SYMBOLIC DATA, MULTILEVEL DATA ANALYSIS

Cracow University of Economics, Department of Market Analysis and Marketing Research, Cracow, Poland, [email protected] · Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]


The Variance of the Adjusted Rand Index (and other properties)

Doug Steinley1

Abstract

The variance of the adjusted Rand index (Hubert & Arabie, 1985) is provided and its properties are explored. The variance is then used to highlight the differences between two formulations of the expected value of the Rand index (Hubert & Arabie, 1985; Morey & Agresti, 1984), showing that the latter is asymptotically under-biased and its associated variance is consistently underestimated.
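For readers unfamiliar with the index itself, the sketch below computes the adjusted Rand index in the Hubert and Arabie form, (index − expected index) / (maximum index − expected index), from pair counts. It does not compute the variance discussed in the abstract; the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects,
    given as equal-length sequences of cluster labels."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumed convention: nothing to compare
    # Pairs placed together in both partitions, in A only, in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions trivial (one cluster or all singletons)
    return (sum_cells - expected) / (max_index - expected)

identical = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index depends only on which objects are grouped together, so relabelling the clusters leaves it unchanged.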

Keywords

CLUSTERING ALGORITHMS, BENCHMARKING

University of Missouri, Columbia


Identifying Clusters in Bayesian Disease Mapping

Nema Dean1, Craig Anderson1, and Duncan Lee1

Abstract

In spatial modelling, it is often the case that, instead of individual point data, only aggregate data is available for each of a set of sub-areas for a given period. In disease modelling, the most common type of data available is a count of disease cases for particular subdivisions of the area of interest for a year. This results in population level count data rather than individual level binary outcomes. This type of data is known as areal data. In addition to the counts for each sub-area, neighbourhood information about which areas border each other is also available. One common assumption about areal data is that there is a global level of correlation across bordering areas and that the disease risk surface varies smoothly. In practice this is often not the case, with rich areas neighbouring poor areas with drastically different disease risks. This talk will discuss an adaptation of hierarchical clustering to enforce spatial contiguity when clustering log standardised incidence ratios (the ratio of observed to expected counts) in areal data. The candidate clusterings produced by the adapted hierarchical clustering will be modelled with a piecewise constant (across clusters) conditional autoregressive (CAR) hierarchical Poisson log-linear model. The best clustering model is selected using the Deviance Information Criterion. Results of the proposed approach on simulations and real data will be presented and discussed.

Keywords

AREAL DATA, CLUSTERING, SPATIAL MODELLING

University of Glasgow, 15 University Gardens, Glasgow G12 8QQ, [email protected]


Classification Boundary Mapping

Yuning He1 and Herbert Lee2

Abstract

In some problems, such as a computer simulation experiment, it may be of interest to map the boundary between two classes. Having a physical understanding of the classification boundary can lead to insights about the underlying problem. Our motivating example of a flight controller simulator leads us to the use of a shape library for parameterizing the boundary, which lets us better understand when the controller will be able to stabilize an aircraft and when it could lead to catastrophic failure.

Modeling of classification is done via tree models. Taking a sequential design ap-proach, the tree models can be updated via particle learning. The shape library is usedto best model the classification boundary, with a shape set chosen that provides the bestsummarization, completeness, and minimality.

Keywords

COMPUTER MODEL, CLASSIFICATION TREE, ACTIVE LEARNING, MODEL SELECTION

National Aeronautics and Space Administration, Ames Research Center, Moffett Field,CA, USA [email protected] · University of California, Santa Cruz, CA, [email protected]


Deduplicating Text Records by Clustering the Results of Aggregated Conditional Classifiers

Rebecca Nugent1 and Samuel L. Ventura2

Abstract

Deduplication, or the process of linking records corresponding to unique entities within a single database, is an atypical record linkage problem. Traditional record linkage methods (e.g. Fellegi and Sunter, 1969) assume a one-to-one matching across two databases and thus cannot be trivially applied to deduplication, where each unique entity may be duplicated any number of times. Recent alternatives extend the Fellegi-Sunter approach to work with three or more databases, but these approaches are not computationally feasible for the deduplication of large databases. We explore the use of clustering approaches to identify (typically singleton or very small) clusters of records corresponding to unique entities. We calculate pairwise distances between records using a novel classification technique that conditions on informative features of record-pairs. We apply our methodology to the identification of unique inventors in the United States Patent and Trademark Office patent-inventor database and demonstrate its efficacy over alternative, more heuristic approaches.

References

VENTURA, NUGENT, FUCHS: Methods Matter: Rethinking Inventor Disambiguation Algorithms with Classification Models and Labeled Records. Submitted to Management Science, March 2013.

Keywords

RECORD LINKAGE, DEDUPLICATION, AGGREGATION, RANDOM FORESTS, CLUSTERING

Department of Statistics, Carnegie Mellon University, Pittsburgh, [email protected] · Department of Statistics, Carnegie Mellon University,Pittsburgh, [email protected]


Classifications of Baseball Pitching Strategies and Exploring Effects of the New Official Balls in the Japanese Professional Baseball League

Kazunori Yamaguchi1

Abstract

Baseball is one of the favorite sports in the US and Japan. Pitching is one of the most important parts of the game. Much research has been done on baseball statistics for pitching or team offence strategies using MLB data in the US (e.g. see Tango et al. 2007, Albert and Bennett 2003, Thorn and Palmer 1985).

The NPB (Nippon Professional Baseball) organization changed the official balls in 2011. They said that the new balls were similar to the balls used in the World Baseball Classic and that they were less resilient than the balls used before 2011.

We recognized that pitchers have had an advantage since the introduction of the new balls, but some pitchers’ results in 2011 were not better than those in 2010. We classify all pitchers using pitching data from 2010. As pitching data we use the number of games and pitches, the average speed of fastballs, maximum speed, variety of pitches, courses of pitches, and so on. After classification, we look for groups in which pitchers got much better results in 2011 than in 2010, and groups in which pitchers got worse results in 2011 than in 2010, in order to explore the effects of the new balls on pitching strategies.

All data sets for this research are provided by Data Stadium Inc.

References

ALBERT, J. and BENNETT, J. (2003): Curve Ball: Baseball, Statistics, and the Role of Chance in the Game. Springer.
TANGO, T., LICHTMAN, M. and DOLPHIN, A. (2007): The Book: Playing the Percentages in Baseball. Potomac Books.
THORN, J. and PALMER, P. (1985): The Hidden Game of Baseball. Doubleday, New York.

Keywords

BASEBALL, CLUSTER ANALYSIS, PITCHING STRATEGIES

College of Business, Rikkyo University, Tokyo 171-8501 [email protected]


Life Long Learning Idea on Background of Poles’ Needs

Marta Dziechciarz-Duda1 and Klaudia Przybysz2

Abstract

The rate of economic change and the aging of the population have made it necessary to give priority to lifelong learning (see for example the Lisbon Strategy). The study concerned the demand for courses, training and certifications in Poland.

This article aims to analyze the educational needs reported by respondents who are of working age. Our study contains a classification based on sex, education level and type of occupation in relation to whether respondents declare such needs. The research also covered the type of courses undertaken and respondents’ assessment of their usefulness for further professional life. The proposed approach may serve as the basis of a multidimensional analysis of the situation on the labor market and help to identify the factors determining the attractiveness of potential employees from the point of view of employers’ needs.

References

GATNAR, E. and WALESIAK, M. (2004): Metody Statystycznej Analizy Wielowymiarowej w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej im. Oskara Langego we Wrocławiu, Wrocław.
GROSSMAN, M. (2005): Education and Nonmarket Outcomes. NBER Working Paper Series, Working Paper 11582. http://www.nber.org/papers/w11582.
PSACHAROPOULOS, G. and PATRINOS, H. (2004): Returns to Investment in Higher Education. A Further Update. Education Economics, 12(2), 111–134.

Keywords

LIFE LONG LEARNING, MULTIDIMENSIONAL ANALYSIS, LABOR MARKET

Wrocław University of [email protected] ·Wrocław University of [email protected]


Migration Of Population - The Analysis With The Use Of Log-Linear Models

Justyna Brzezinska

Abstract

Log-linear analysis is a widely used tool for the independence analysis of qualitative data in multi–way contingency tables. Cell counts are Poisson distributed and all variables are treated as responses. Log-linear models, where interaction terms are included, make it possible to examine various types of association (conditional independence, partial association, complete independence, homogeneous association). In log-linear analysis we model cell counts in terms of associations among variables and marginal frequencies. For testing the goodness of fit, the likelihood ratio test and the information criteria AIC [Akaike 1973] and BIC [Raftery 1986] are used. The advantages of this method are that we can use several plots for visualizing a contingency table, we can analyze any number of categorical variables, and we include interactions in the model equation. The use of log-linear analysis will be presented on data on the migration of population in Poland in 2011 reported by the Central Statistical Office. All calculations will be conducted in R with the use of the loglm function in the MASS library.

References

CHRISTENSEN, R. (1997): Log-linear Models and Logistic Regression. Springer–Verlag, New York.
KNOKE, D., BURKE, P. (1997): Log-linear Models. Sage University Paper Series on Quantitative Applications in the Social Science, series no. 07-020, Beverly Hills and London: Sage.

Keywords

LOG–LINEAR MODELS, MULTI-WAY CONTINGENCY TABLES, MIGRATION OF POPULATION IN POLAND

Faculty of Management, University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]


The Influence of Emotion Recognition and Academic Performance on Group Popularity

Ivan Loredana

Abstract

This study analyzed the influence of academic grades and emotion recognition on the way social relations are structured within a relatively large group of college students (N = 154). Using DANVA-2 to assess individual differences in emotion accuracy, together with peer nomination procedures, we investigated the relative contribution of positive and negative emotions to three popularity dimensions: visibility, social interaction, and friend nominations. Compared with studies on children and adolescents, grades had a marginal effect on popularity only when friendship ties are factored in. Furthermore, the accuracy in decoding facial expressions of emotions was negatively correlated with the number of friendship nominations, particularly for sadness. In the case of happiness we found a positive relation between the accuracy of decoding using body items and students’ level of interaction. The results are discussed in the light of the functionalist emotion theories.

References

BAUMEISTER, R.F. AND LEARY, M.R. (1995): The need to belong: Desire for interpersonal attachments as a fundamental human motivation. Psychological Bulletin, 117(3), 497-529.
BOYATZIS, C. J. AND SATYAPRASAD, C. (1994): Children’s facial and gestural decoding and encoding: Relations between skills and with popularity. Journal of Nonverbal Behavior, 18(1), 37-58.
DE BRUYN AND VAN DEN BOOM (2005): Interpersonal behavior, peer popularity, and self-esteem in the early adolescence. Social Development, 14 (4), 555-573.

Keywords

EMOTION DECODING, GROUP POPULARITY, ACADEMIC PERFORMANCE

National School of Political and Administrative Studies, Povernei 6, Bucharest [email protected]


Hierarchical Classes Analysis vs. Formal Concept Analysis

Bernhard Ganter and Cynthia V. Glodeanu

Abstract

Hierarchical Classes Analysis (HCA) is a discrete, categorical data analysis method developed for applications in personality organisation and implicit belief systems. The technique as well as its generalisations for three-way, numerical and ordinal data were successfully applied in different clinical studies.

Formal Concept Analysis (FCA) is an instrument for data analysis based on lattice theory. Amongst other things, FCA represents the whole information contained in a data set by means of so-called formal concepts. These are understood as units with a conceptual extent and a conceptual intent. The extent contains all the objects that share the attributes from its intent. The dual holds for the intent. Recently, an approach using formal concepts, Boolean factorisations, was discussed that produces the smallest possible number of factors in a sense similar to Factor Analysis.
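The notion of a formal concept can be made concrete with a small sketch that enumerates all (extent, intent) pairs of a binary context by closing every attribute subset. This brute-force enumeration is exponential in the number of attributes and is meant only as an illustration on a toy context; the function name and the example data are made up, and it is not the algorithm used by the authors.

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate the formal concepts of a binary context.

    `context` maps each object to the set of attributes it has.
    Returns a set of (extent, intent) pairs of frozensets."""
    objects = list(context)
    attributes = sorted(set().union(*context.values()))

    def extent(attrs):
        # all objects that have every attribute in attrs
        return frozenset(g for g in objects if attrs <= context[g])

    def intent(objs):
        # all attributes shared by every object in objs
        return frozenset(m for m in attributes
                         if all(m in context[g] for g in objs))

    concepts = set()
    for size in range(len(attributes) + 1):
        for attrs in combinations(attributes, size):
            ext = extent(set(attrs))
            concepts.add((ext, intent(ext)))
    return concepts

# Toy context, for illustration only
toy = {"duck": {"flies", "swims"}, "eagle": {"flies"}, "trout": {"swims"}}
concepts = formal_concepts(toy)
```

For the toy context this yields four concepts, from the most general one (all objects, no shared attribute) down to the most specific one (the duck, with both attributes).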

We show that HCA and Boolean factorisations coincide for binary and three-way data. Moreover, we discuss how this connection allows the two methods to benefit from each other. New doors for the application of Boolean factorisations are opened by HCA. The latter gains structural explanations, graphical representations and algorithmic issues. Further, we propose the modelling of fuzzy, i.e., vague data, within the framework of HCA.

References

J. SCHEPERS AND I. VAN MECHELEN (2010): Uniqueness of real-valued hierarchical classes models. Journal of Mathematical Psychology, 54, 215–221.
B. GANTER AND R. WILLE (1996): Formale Begriffsanalyse: Mathematische Grundlagen. Springer, Berlin, Heidelberg.
R. BELOHLÁVEK AND V. VYCHODIL (2010): Discovery of optimal factors in binary data via a novel method of matrix decomposition. Journal of Computer and System Sciences, 76, 3–20.

Keywords

FACTOR ANALYSIS, NON-METRIC ANALYSIS, DATA REDUCTION

Institute of Algebra, TU Dresden, 01062 Dresden, Germany {Bernhard.Ganter,Cynthia.Glodeanu}@tu-dresden.de


The Diversity of Pattern Structures in Formal Concept Analysis

Aleksey Buzmakov1, Sergei O. Kuznetsov2, and Amedeo Napoli3

Abstract

Pattern structures [3] provide an extension of Formal Concept Analysis (FCA [1]) for dealing with complex data. They are based on a triple (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a semi-lattice of descriptions, and δ is a mapping associating an object with a description. The similarity operation ⊓ induces a subsumption relation in (D, ⊓) such that c ⊓ d = c iff c ⊑ d.
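As a concrete instance of a semi-lattice of descriptions, the sketch below uses vectors of numeric intervals, taking the similarity operation to be the componentwise convex hull, which is a common choice for interval patterns. The choice of description space, the function names and the example values are assumptions made for illustration.

```python
from functools import reduce

def meet(c, d):
    """Similarity operation on interval vectors: the componentwise
    smallest interval covering both arguments (convex hull)."""
    return tuple((min(lo1, lo2), max(hi1, hi2))
                 for (lo1, hi1), (lo2, hi2) in zip(c, d))

def subsumed(c, d):
    """c ⊑ d holds exactly when c ⊓ d = c."""
    return meet(c, d) == c

def common_description(descriptions):
    """Description shared by a set of objects: the meet of all δ(g)."""
    return reduce(meet, descriptions)

# Hypothetical descriptions δ(g) of three objects over two numeric attributes
delta = {
    "g1": ((1, 1), (4, 4)),
    "g2": ((2, 2), (6, 6)),
    "g3": ((5, 5), (5, 5)),
}
shared = common_description([delta["g1"], delta["g2"]])
```

With this operation the meet is idempotent, commutative and associative, and a wider interval vector is subsumed by every narrower one it contains, which matches the relation c ⊓ d = c iff c ⊑ d stated above.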

In this presentation, we would like to discuss the diversity and the capabilities of pattern structures in various applications. Pattern structures are used in many forms, e.g. numbers and intervals [3], graphs [3], strings and sequences [1], and ontology elements [2]. Moreover, the so-called projections are mathematical functions respecting some properties and reducing the computational costs and the volume of resulting patterns. Accordingly, the pattern concept lattice can then be navigated and more easily interpreted by domain experts.

References

1. A. BUZMAKOV, E. EGHO, S.O. KUZNETSOV, A. NAPOLI, AND C. RAÏSSI. String Pattern Structures in FCA – An Application to Sequential Data Analysis, 2013. (Submitted.)

2. A. COULET, F. DOMENACH, M. KAYTOUE, AND A. NAPOLI. Using pattern structures for analyzing ontology-based annotations. In Proceedings of ICFCA 2013, Springer LNCS, 2013.

3. B. GANTER AND S.O. KUZNETSOV. Pattern structures and their projections. In Proceedings of ICCS, LNCS 2120, pages 129–142, 2001.

4. B. GANTER AND R. WILLE. Formal Concept Analysis. Springer, 1999.

5. M. KAYTOUE, S.O. KUZNETSOV, AND A. NAPOLI. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In Proceedings of IJCAI, pages 1342–1347, 2011.

Keywords

FORMAL CONCEPT ANALYSIS, PATTERN STRUCTURES, PROJECTION, CLASSIFICATION

LORIA (CNRS – Inria Nancy – U. de Lorraine) [email protected] · HSE [email protected] · LORIA (CNRS – Inria Nancy – U. de Lorraine) [email protected]

Decision Aiding Software And Consensus Theory

Florent Domenach1 and Ali Tayari

Abstract

There is a variety of approaches, solutions, and methods for constructing and deriving a consensus from a selection of phylogenetic trees, whether the trees share the same set of taxa or supertree methods are needed. Despite the number of existing consensus functions, practitioners often use a selected few, either because they are not aware of other existing functions or because they do not know which one(s) would be suitable. To tackle this problem, DASACT (Decision Aiding Software for Axiomatic Consensus Theory) has been developed to guide users in their choice depending on a series of axiomatic properties.

DASACT is based on a previously written paper (Domenach and Tayari 2013), which uses an exhaustive approach to examine the structural relationship (a concept lattice) among a series of axiomatic properties and consensus functions. This lattice is used to determine the relevance (measured using a variety of distance functions) of consensus functions with respect to the desired constraints (axiomatic properties) set by the user. DASACT then provides the consensus trees for users to compare, so that they can choose the most appropriate one.

References

DAY, W.H.E. and MCMORRIS, F.R. (2003): Axiomatic Consensus Theory in Group Choice and Biomathematics. SIAM, Philadelphia.

DOMENACH, F. and TAYARI, A. (2013): Implications of Axiomatic Consensus Properties. In: Lausen, B., van den Poel, D., and A. Ultsch (Eds.): Algorithms from & for Nature and Life, Studies in Classification, Data Analysis, and Knowledge Organization, Springer-Verlag GmbH, Heidelberg (to appear).

GANTER, B. and WILLE, R. (1999): Formal Concept Analysis: Mathematical Foundations. Springer.

Keywords

CONSENSUS GENERATION, CONSENSUS THEORY, CONCEPT LATTICE, PHYLOGENETIC TREE

Computer Science Department, University of Nicosia, 46 Makedonitissas Ave., PO Box 24005, 1700 Nicosia, [email protected]

Experimental Comparison of Some Triclustering Algorithms

Dmitry V. Gnatyshak, Dmitry I. Ignatov, and Sergei O. Kuznetsov

Abstract

In this talk we show the results of the experimental comparison of five triclustering algorithms on real-world and synthetic data by resource efficiency and four quality measures. We also provide an interpretation of the results for the analyses of real-world datasets.

The talk is organised as follows. In part 1 we give the main definitions and describe the triclustering methods selected for comparison. Part 2 describes all the experiments and their results along with specially introduced quality measures. Part 3 concludes the talk and indicates some directions for further research.

References

IGNATOV, D.I., KUZNETSOV S.O., MAGIZOV R.A., and ZHUKOV, L.E. (2011): From Triconcepts to Triclusters. In: 13-th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC-2011). 257–264.

JASCHKE R., HOTHO A., SCHMITZ C., GANTER B., and STUMME G. (2006): TRIAS - An Algorithm for Mining Iceberg Tri-Lattices. In: ICDM. 907–911.

MIRKIN B. and KRAMARENKO A. (2011): Approximate Bicluster and Tricluster Boxes in the Analysis of Binary Data. In: 13-th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC-2011). 248–256.

S. KROLAK-SCHWERDT, P. ORLIK, and B. GANTER (1994): TRIPAT: a model for analyzing three-mode binary data. Studies in Classification, Data Analysis, and Knowledge Organization, volume 4 of Information systems and data analysis. Springer, Berlin.

Keywords

FORMAL CONCEPT ANALYSIS, TRICLUSTERING, TRIADIC DATA, DATA MINING

National Research University Higher School of [email protected]

A Framework For Modeling Covariances

Age K. Smilde, M.E. Timmerman, H.C.J. Hoefsloot, J.J. Jansen, and E. Saccenti

Abstract

In modern functional genomics it is more rule than exception that multiple data tables (groups) are collected in a study pertaining to the same organism. In such cases it is worthwhile to analyze all data tables simultaneously to have a global view of the biological system. This is the area of “data fusion”, which is a lively research topic in bioinformatics. Most methods of analyzing such complex data focus on group means, treatment effects or time courses. However, considerable information may be present in the covariances within a group, since these relate directly to individual differences and heterogeneity of responses of the biological system to a perturbation. Hence, the methodology to study such covariances, and their changes upon treatment or in time, deserves attention in computational biology.

We will present a framework for modeling such covariances encompassing several already existing methods. Moreover, we will present a new method coined Combination Simultaneous Component Analysis (COMSCA) which also fits in this framework. COMSCA aims to model differences in covariance matrices in terms of a few low rank prototypical component matrices. The method is illustrated with real-life examples from time-resolved metabolomics data.

Keywords

INDSCAL, COMSCA, IDIOSCAL, Common Principal Component, Covariances

Age K. Smilde · M.E. Timmerman
Heymans Institute, University of Groningen, The Netherlands

H.C.J. Hoefsloot · E. Saccenti
Biosystems Data Analysis, University of Amsterdam, The Netherlands

J.J. Jansen
Institute for Molecules and Materials, Radboud University Nijmegen, The Netherlands

Biadditive Models, Alternative Estimation Procedures And Better Biplots

Fred A. van Eeuwijk1, Gerrit Gort1, Sabine K. Schnabel1, and Paul H.C. Eilers1,2

Abstract

Biadditive models are a useful model class for investigating interactions in two-way tables. An area where biadditive models are popular is plant breeding and genetics, where sets of genotypes are evaluated across a range of environmental conditions, with the results being summarized in two-way tables of genotype by environment (GxE) means. For GxE tables, various biadditive models have been proposed, like the Finlay-Wilkinson model ($Y_{ij} = \mu + G_i + \beta_i E_j + \varepsilon_{ij}$), the additive main effects and multiplicative interactions model ($Y_{ij} = \mu + G_i + E_j + \sum_k \gamma_{ki}\delta_{kj} + \varepsilon_{ij}$), and the GGE or PCA model ($Y_{ij} = \mu + E_j + \sum_k \gamma_{ki}\delta_{kj} + \varepsilon_{ij}$).
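
The additive main effects and multiplicative interactions model can be illustrated with a short least squares sketch: main effects are estimated from row and column means, and the multiplicative terms from a truncated singular value decomposition of the double-centred residuals. This assumes a complete table with one mean per cell; the function name `ammi_fit` and the choice to absorb the singular values into the genotype scores are illustrative conventions, not taken from the abstract.

```python
import numpy as np

def ammi_fit(Y, n_terms=1):
    """Least squares fit of Y_ij = mu + G_i + E_j + sum_k gamma_ki * delta_kj.

    Y is a complete genotype-by-environment table of means.
    """
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean()
    G = Y.mean(axis=1) - mu               # genotype main effects
    E = Y.mean(axis=0) - mu               # environment main effects
    resid = Y - mu - G[:, None] - E[None, :]   # double-centred interaction table
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    gamma = U[:, :n_terms] * s[:n_terms]  # genotype scores (singular values absorbed here)
    delta = Vt[:n_terms].T                # environment scores
    fitted = mu + G[:, None] + E[None, :] + gamma @ delta.T
    return mu, G, E, gamma, delta, fitted
```

The first two columns of `gamma` and `delta` are the coordinates that would typically be drawn in a biplot of the interaction.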

For the estimation of parameters in biadditive models, least squares procedures are a common choice. However, inference in a least squares framework offers limited possibilities. We investigate Bayesian and penalized regression methods and discuss their possibilities.

For the interpretation of bilinear model fits, biplots, in which genotypes and environments are assigned coordinates on the basis of their bilinear parameters, are an important tool. Surprisingly, biplots often lack clarity and attractiveness. We propose a number of cosmetic improvements.

Genotypes in the centre of biplots are less interesting, in contrast to those further away. The convex hull has been used to identify the most extreme genotypes. So-called alpha-bags are a generalization; they aim at a hull that contains a chosen percentage of the genotypes. They are hard to compute and visually not very attractive. As a quick and pleasing alternative we present expectile hulls, based on asymmetric least squares.

The convex hull is useful for the identification of groups of environments, mega-environments, which elicit comparable adaptations in genotypes. We extend this idea to expectile hulls.
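
The scalar building block behind asymmetric least squares can be sketched as follows: the expectile at level tau minimises a squared-error loss in which observations above the current value get weight tau and those below get weight 1 - tau, which leads to an iteratively reweighted mean. This is only the one-dimensional ingredient under that standard definition; how the authors combine it into a two-dimensional expectile hull is not described in the abstract and is not attempted here. The function name `expectile` is illustrative.

```python
import numpy as np

def expectile(y, tau=0.5, n_iter=100, tol=1e-12):
    """Expectile of a sample by iteratively reweighted least squares (0 < tau < 1)."""
    y = np.asarray(y, dtype=float)
    m = y.mean()
    for _ in range(n_iter):
        # Asymmetric weights: tau above the current value, 1 - tau below it.
        w = np.where(y > m, tau, 1.0 - tau)
        m_new = np.sum(w * y) / np.sum(w)
        if abs(m_new - m) < tol:
            return float(m_new)
        m = m_new
    return float(m)
```

With `tau=0.5` the weights are symmetric and the result is the ordinary mean; values of `tau` closer to 1 move the expectile towards the upper tail of the sample.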

Keywords

SVD, EXPECTILES, ASYMMETRIC LEAST SQUARES

Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands fred.vaneeuwijk|gerrit.gort|[email protected] · Erasmus University Medical Center, Rotterdam, The [email protected]

Triadditive Models for Three-way Tables

John C. Gower1, Casper J. Albers2, and Steffen Unkel1

Abstract

In this presentation we are concerned with three-way tables. Essentially, our approach is to adopt the usual linear models for representing main effects, two factor interactions and three factor interactions. Just as two factor interactions may be approximated by multiplicative bilinear terms, three factor interactions may be approximated by multiplicative trilinear terms. In the bilinear case the approximations have standard least-squares estimates based on singular value decompositions, but in the trilinear case, we propose that the estimates be conditioned on the residuals from the bilinear model. In principle, it would be possible to do a full unconditional least-squares solution but the conditional approach is easier and avoids difficulties with constraints. In the bilinear case identification constraints are not substantive but in the full trilinear case there is a troubling substantive interaction between the bilinear and trilinear parameter constraints. This problem is avoided when using the conditional method of analysis and the CANDECOMP algorithm may be applied directly to the conditioned residuals.

A special virtue of bilinear models is the way that they lend themselves to simple biplots for visualising the interactions between rows and columns of the two classifying factors. This is particularly useful when bilinear interactions are adequately approximated in two dimensions. It would be helpful if similar visualisations were available for triadditive interactions. We have made some progress in deriving triplots for rank-two tridimensional interaction tables. For each factor, this gives points in two dimensions displayed on three orthogonal surfaces. Each of the three faces may be shown separately and attempts can be made to show the full three dimensional visualisation.

We give some preliminary results for the ranks of trilinear interaction tables. These are special tables, as all their main and two-way margins are null. However, it is not clear to us that, apart from mathematical interest, trilinear rank has any particular use from the point of view of data analysis. As with bilinear approximation, degree of approximation is more important than rank per se.
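
To make the remark about null margins concrete, the sketch below computes the three factor interaction table of a complete three-way array under the usual saturated linear model (one observation per cell, effects defined through unweighted means) and lets one check that summing over any single index gives zero. The function name is illustrative, and this table is the residual from the full two factor interactions, which is a simplification of the conditional approach described above, where residuals are taken from the fitted bilinear model.

```python
import numpy as np

def three_factor_interaction(X):
    """Three factor interaction table of a complete three-way array.

    r_ijk = x_ijk - x_ij. - x_i.k - x_.jk + x_i.. + x_.j. + x_..k - x_...
    where a dot denotes averaging over that index.
    """
    X = np.asarray(X, dtype=float)
    m = X.mean
    return (X
            - m(axis=2, keepdims=True)       # x_ij.
            - m(axis=1, keepdims=True)       # x_i.k
            - m(axis=0, keepdims=True)       # x_.jk
            + m(axis=(1, 2), keepdims=True)  # x_i..
            + m(axis=(0, 2), keepdims=True)  # x_.j.
            + m(axis=(0, 1), keepdims=True)  # x_..k
            - m())                           # x_...
```

Because every two-way margin of the result vanishes, the one-way margins and the grand total vanish as well; a trilinear (CANDECOMP-type) approximation would then be fitted to such a table.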

Keywords

INTERPRETING INTERACTION, VISUALISING INTERACTION

Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United [email protected]/[email protected] · Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands, [email protected]

Three-way Candecomp/Parafac And The Diverging Components Problem

Alwin Stegeman1

Abstract

Three-way Candecomp/Parafac (CP, also known as Canonical Polyadic Decomposition) can be viewed as a three-way generalization of the matrix SVD (or PCA). Finding a best-fitting CP decomposition with R components to a given three-way array Z is equivalent to finding a best rank-R approximation to Z. The CP decomposition consists of R rank-1 arrays, where each rank-1 array is the outer vector product of three vectors. Contrary to PCA, a CP decomposition is rotationally unique under mild conditions. However, in many cases a best-fitting CP decomposition may not exist (when R ≥ 2). Trying to compute a best-fitting CP decomposition then results in diverging components: some (groups of) rank-1 terms become nearly identical up to sign and arbitrarily large in magnitude. To avoid this problem, several constraints can be imposed (orthogonality, nonnegativity). A different approach is to obtain the limit point of the CP-sequence featuring diverging components (Stegeman, 2012). The decomposition of the limit point is more general than CP and its form can be inferred from the diverging CP-sequence. This decomposition form is then fitted to the data Z using initial values computed from the diverging CP-sequence. For a well-studied three-way dataset of ratings of TV shows (15 TV shows by 16 rating scales by 30 raters) it is shown that the decomposition of the limit point has a clear and intuitive interpretation.
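
A minimal NumPy sketch of the objects named above: a CP model as a sum of R rank-1 outer products, a plain alternating least squares (ALS) fit, and a simple congruence check between rank-1 terms. ALS is assumed here as the fitting routine because it is the most common one; the abstract does not say which algorithm the author uses. All function names are illustrative, and `term_congruence` is only a heuristic diagnostic: an off-diagonal value close to -1 together with growing column norms is the usual symptom of diverging components.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product; row index is x * Y.shape[0] + y."""
    return (X[:, None, :] * Y[None, :, :]).reshape(-1, X.shape[1])

def cp_reconstruct(A, B, C):
    """Sum over r of the rank-1 arrays a_r (outer) b_r (outer) c_r."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cp_als(Z, R, n_iter=200, seed=0):
    """Alternating least squares for a three-way CP model with R components."""
    rng = np.random.default_rng(seed)
    I, J, K = Z.shape
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    Z1 = Z.reshape(I, J * K)                     # mode-1 unfolding
    Z2 = Z.transpose(1, 0, 2).reshape(J, I * K)  # mode-2 unfolding
    Z3 = Z.transpose(2, 0, 1).reshape(K, I * J)  # mode-3 unfolding
    losses = []
    for _ in range(n_iter):
        # Each update solves a linear least squares problem with the other two modes fixed.
        A = np.linalg.lstsq(khatri_rao(B, C), Z1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), Z2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), Z3.T, rcond=None)[0].T
        losses.append(float(np.sum((Z - cp_reconstruct(A, B, C)) ** 2)))
    return A, B, C, losses

def term_congruence(A, B, C):
    """Cosine between every pair of rank-1 terms (product of the three mode-wise cosines)."""
    def cos(M):
        Mn = M / np.linalg.norm(M, axis=0, keepdims=True)
        return Mn.T @ Mn
    return cos(A) * cos(B) * cos(C)
```

Since every block update is an exact least squares minimiser, the recorded losses cannot increase from one sweep to the next, but this says nothing about whether a best-fitting solution exists.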

References

STEGEMAN, A. (2012): Candecomp/Parafac: from diverging components to a decomposition in block terms. SIAM Journal on Matrix Analysis and Applications, 33, 291–316.

Keywords

CANDECOMP, PARAFAC, TENSOR DECOMPOSITION, LOW RANK APPROXIMATION, DIVERGING COMPONENTS

Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The [email protected]

Cluster-weighted t-factor Analyzers for Clustering of High-dimensional Data

Sanjeena Dang1, Antonio Punzo2, Salvatore Ingrassia3, and Paul D. McNicholas4

Abstract

Cluster-weighted modelling (CWM) is a flexible statistical framework for modelling local relationships in heterogeneous populations on the basis of weighted combinations of local models. We will extend cluster-weighted models to include an underlying latent factor structure of the independent variable, resulting in a family of parsimonious cluster-weighted t-factor analyzers (CWtFA). This gives the model the flexibility to cluster high-dimensional data. An expectation-maximization framework along with the Bayesian information criterion (BIC) will be used for parameter estimation and model selection. The approach is illustrated on simulated data sets as well as a real data set.

Keywords

CLUSTER-WEIGHTED MODELS, FACTOR ANALYZERS, HIGH-DIMENSIONAL DATA

Department of Mathematics and Statistics, University of Guelph, ON, [email protected] · Department of Economics and Business, University of Catania, Catania, [email protected] · Department of Economics and Business, University of Catania, Catania, [email protected] · Department of Mathematics and Statistics, University of Guelph, ON, [email protected]

Cluster-Weighted Modeling For Time To Event Data

Utkarsh J. Dang1 and Paul D. McNicholas2

Abstract

We implement a mixture of accelerated failure time models for a competing risks situation in a cluster-weighted modeling (CWM) framework. CWM models the joint probability of data arising from a population of sub-populations using combinations of local models. Both reliability and survival models analyze data on time to some event of interest, in the engineering and medical fields respectively. Here, we present a novel approach to mixture group estimation and classification for time to event data. Finally, we present our results on some simulated and real data where the time to failure and cause of failure were recorded only for some of the observations.

Keywords

CLUSTER-WEIGHTED MODELING, ACCELERATED FAILURE TIME MODEL, COMPETING RISKS, EM-ALGORITHM

University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, [email protected] · University of Guelph, 50 Stone Road East, Guelph, Ontario,N1G 2W1, [email protected]

Modeling Bivariate Mixed-Type Data with the Generalized Linear Exponential Cluster-Weighted Model

Salvatore Ingrassia1 and Antonio Punzo2

Abstract

In the mixture with random covariates modeling frame, the recently proposed generalized linear Gaussian cluster-weighted model (CWM) allows for flexible clustering and density estimation of a random vector composed of a response variable and a set of covariates. In each mixture component, while the covariates are assumed to have a real-valued support and are modeled by a Gaussian density, various supports are allowed for the response variable as conceived in the exponential family. For bivariate data, this paper presents the generalized linear exponential CWM. It extends the generalized linear Gaussian CWM by applying an exponential family distribution to the response variable too. This gives the possibility of modeling bivariate data of mixed-type. The natural counterparts, in the frames of mixture models with fixed covariates and latent class models, are also defined and compared with the generalized linear exponential CWM. Maximum likelihood parameter estimates are derived using the EM algorithm and model selection is carried out using the Bayesian information criterion (BIC). Artificial and real data are finally considered to exemplify and appreciate the proposed model.
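
For orientation, the structure described above can be written down as a worked formula. The notation below ($G$ components, mixing weights $\pi_g$, parameter symbols $\xi_g$, $\mu_g$, $\sigma_g^2$, $\psi_g$) is assumed for illustration and is not taken from the abstract; it follows the usual way a cluster-weighted model factorises the joint density of a response $y$ and a covariate $x$ within each component.

```latex
% Generalized linear Gaussian CWM for bivariate data (x scalar):
% conditional exponential-family density for y given x, Gaussian density for x.
\[
  p(x, y; \theta) \;=\; \sum_{g=1}^{G} \pi_g \, q(y \mid x; \xi_g)\, \phi(x; \mu_g, \sigma_g^2),
  \qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 .
\]
% Generalized linear exponential CWM (sketch): the Gaussian factor is replaced
% by a general exponential-family density for the covariate.
\[
  p(x, y; \theta) \;=\; \sum_{g=1}^{G} \pi_g \, q(y \mid x; \xi_g)\, p(x; \psi_g).
\]
```

In both cases the conditional density $q(y \mid x; \xi_g)$ is of generalized linear model type, so that the component-specific mean of $y$ depends on $x$ through a link function.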

References

INGRASSIA, S., MINOTTI, S. C., and VITTADINI, G. (2012): Local statistical modeling via the cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3), 363–401.

INGRASSIA, S., MINOTTI, S. C., PUNZO, A., and VITTADINI, G. (2012): Generalized linear Gaussian cluster-weighted modeling. arXiv.org e-print 1211.1171, available at: http://arxiv.org/abs/1211.1171.

Keywords

CLUSTER-WEIGHTED MODELS, LATENT CLASS MODELS, MIXED-TYPE DATA, EXPONENTIAL FAMILY DISTRIBUTIONS

Dipartimento di Economia e Impresa - Università di Catania (Italy) [email protected] · Dipartimento di Economia e Impresa - Università di Catania (Italy) [email protected]

Cluster Inference using Modes

Surajit Ray

Abstract

Li, Ray and Lindsay (2007) proposed the method of modal clustering, which identifies a local mode of a kernel density estimate by starting an ascent at any data point and then clusters together the data points that converge to the same mode. How to assess the number of clusters after modal clustering has received little consideration. Ray and Lindsay (2005) introduced the ridgeline manifold. It can capture the ridgeline between two modes and find the antimode between them, defined as the point on the ridgeline with the lowest density. In this work, we propose two tests of modal significance based on the ridgeline manifold. The first is a paired test: each point contributes to the kernel density heights of both the mode and the antimode, and we take the paired t-test statistic as the test statistic. The second considers the ratio of the density height of the antimode to that of the mode with the lower density. We choose the uniform distribution as the reference distribution and simulate the empirical distributions of both the paired test statistic and the ratio statistic. We also find that the empirical distribution of the paired test statistic is close to the t distribution when the sample size is large.
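
The ingredients of this abstract can be sketched in one dimension, where the ridgeline between two modes is simply the segment joining them. The code below assumes a Gaussian kernel with a fixed bandwidth `h` and uses the fixed-point (mean-shift type) ascent that a Gaussian kernel leads to; the function names, the grid search for the antimode, and the rounding used to group converged points are illustrative simplifications, and the significance tests themselves are not reproduced.

```python
import numpy as np

def kde(x, data, h):
    """Gaussian kernel density estimate at a single point x."""
    return float(np.mean(np.exp(-0.5 * ((data - x) / h) ** 2)) / (h * np.sqrt(2 * np.pi)))

def ascend_to_mode(x0, data, h, n_iter=500, tol=1e-10):
    """Fixed-point ascent from x0 to a local mode of the kernel density estimate."""
    x = float(x0)
    for _ in range(n_iter):
        w = np.exp(-0.5 * ((data - x) / h) ** 2)
        x_new = float(np.sum(w * data) / np.sum(w))
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clusters(data, h, decimals=3):
    """Group points by the (rounded) mode that their ascent converges to."""
    modes = np.array([ascend_to_mode(x, data, h) for x in data])
    uniq, labels = np.unique(np.round(modes, decimals), return_inverse=True)
    return uniq, labels

def antimode_ratio(m1, m2, data, h, n_grid=201):
    """Density at the antimode between two modes, relative to the lower of the two modes."""
    grid = np.linspace(m1, m2, n_grid)
    antimode_height = min(kde(g, data, h) for g in grid)
    return antimode_height / min(kde(m1, data, h), kde(m2, data, h))
```

A ratio close to 1 suggests that the two modes are barely separated, while a ratio close to 0 points to two well-separated clusters.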

IFCS Presidential Address

Classipedia: A Road Map to Help Traverse the Classification Jungle

Iven Van Mechelen

Abstract

As a research domain, clustering and classification is alive and kicking. Yet, the available clustering models, algorithms, and data analysis techniques, in their entirety, form an inconvenient and intricate jungle. This is a most problematic obstacle for researchers in classification, for students who want to familiarize themselves with the domain, and for applied researchers who are on the lookout for suitable clustering methods to address substantive problems at hand. Within the IFCS, we want to work out a way to overcome this obstacle. The proposed solution takes the form of a road map for the clustering domain: Classipedia. In this talk, I will introduce the aim of the Classipedia project, the guiding questions and conceptual distinctions that constitute its conceptual backbone, and a first blueprint of its architecture. I will conclude by clarifying how the further development of this blueprint constitutes a thrilling challenge for the IFCS community as a whole.

Keywords

CLUSTERING, CONCEPTUAL FRAMEWORK, CLASSIPEDIA

University of Leuven, Tiensestraat 102 box 3713, 3000 Leuven, [email protected]

A Restricted ADCLUS Type Model for Transition Matrices

Tadashi Imaizumi1

Abstract

The product positioning analysis of given brands is very useful for understanding a consumer market. However, what one needs to know is often not the positioning of each product but the positioning of the categories to which the products belong. The ADCLUS type model (Arabie, Carroll and DeSarbo) is a useful model and method in product positioning and market segmentation. Let F(t) be a given transition frequency matrix with n(t−1) rows and n(t) columns,

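\[
  f_{ij}(t) \approx \hat{f}_{ij} = \sum_{k=1}^{m} w_k \, p_{ik}(t) \, q_{jk}(t),
\]

where $w_k$ represents the salience of the property or the category $k$, and

\[
  p_{ik} =
  \begin{cases}
    1, & \text{if object } o_i \text{ has property or belongs to the category } k, \\
    0, & \text{otherwise.}
  \end{cases}
\]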

We extend the above ADCLUS type model to T successive transition matrices in order to find common categories. Let F(1), F(2), ..., F(T) be T transition matrices. We assume that the columns of F(t) are the same as the rows of F(t+1). This means that $q_{ik}(t) = p_{ik}(t+1)$. We will estimate the T sets {P(t), W(t), m} and Q(T) under this restriction. We adopt an optimization procedure which minimizes

\[
  \sum_{t=1}^{T} \Bigl[ f_{ij}(t) - \sum_{k=1}^{m_t} w_k \, p_{ik}(t) \, q_{jk}(t) \Bigr]^2
  + \sum_{t=1}^{T-1} \lambda_t \bigl[ q_{ik}(t) - p_{ik}(t+1) \bigr]^2 .
\]
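
A small numerical sketch of this criterion may help fix the notation. It assumes that the squared terms are summed over all cells (i, j) and all entries (i, k), that a single weight vector `w` and a common number of categories m are used for every t, and that membership matrices are stored as 0/1 arrays; these are simplifying assumptions, and the function names are hypothetical. The estimation procedure itself is not sketched.

```python
import numpy as np

def fitted(P, w, Q):
    """Model matrix with entries sum_k w_k * p_ik * q_jk, i.e. P diag(w) Q^T."""
    return (P * w) @ Q.T

def penalized_loss(F, P, Q, w, lam):
    """Fit term over T transition matrices plus the linking penalty between Q(t) and P(t+1).

    F, P, Q are lists of length T; lam holds the T - 1 penalty weights lambda_t.
    """
    T = len(F)
    fit = sum(np.sum((F[t] - fitted(P[t], w, Q[t])) ** 2) for t in range(T))
    link = sum(lam[t] * np.sum((Q[t] - P[t + 1]) ** 2) for t in range(T - 1))
    return float(fit + link)
```

The penalty is zero exactly when the column memberships at time t coincide with the row memberships at time t+1, which is the restriction stated above.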

References

ARABIE, P., CARROLL, J., and DESARBO, W. (1981): Overlapping clustering: A new method for product positioning. Journal of Marketing Research, 18, 310-317.

Keywords

PRODUCT POSITIONING, CATEGORY

Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo, JAPAN, [email protected]

Clustering Of Time Series Via A Segmentation Approach

Christian Derquenne1

Abstract

The similarity between time series can be seen under two main aspects: shape and level of the curve. But it can also be interesting to discover similar behaviors over a period of time, to detect a break point at a given instant, or to analyze the links (linear, polynomial, ...) between two curves. Segmentation is a potential aid for summarizing a time series in segments. Each segment has a homogeneous behavior that can be compared with segments of another time series. Many segmentation methods have been developed (Lavielle et al., 2006) and they can be used for clustering curves (Hébrail et al., 2010). We have developed a segmentation method based on an exploratory approach (Derquenne, 2011, 2012) which has given very good results on simulated data and applications. In this paper, we introduce new similarity indexes based on our segmentation method, and then use them to cluster time series. Furthermore, this clustering makes it possible to identify the characteristics of the time series belonging to each cluster with respect to the properties of similarity or dissimilarity. Lastly, we propose some potential research directions in the domain of structural equation models for multivariate time series and forecasting models.

References

DERQUENNE, C. (2011): An Explanatory Segmentation Method for Time Series, in Proc. of Compstat 2010, Y. Lechevallier & G. Saporta (eds.), 935–942.

DERQUENNE, C. (2012): Meta-segmentation of time series for searching a better segmentation, in Proc. of Compstat 2012, Limassol, Cyprus, 191–204.

HÉBRAIL G., HUGUENEY B., LECHEVALLIER Y. and ROSSI F. (2010): Exploratory analysis of functional data via clustering and optimal segmentation. Neurocomputing 73(7-9): 1125–1141.

LAVIELLE, M. and TEYSSIÈRE, G. (2006): Détection de ruptures multiples dans des séries temporelles multivariées. Lietuvos Matematikos Rinikinys, Vol 46.

Keywords

TIME SERIES, SIMILARITY, CLUSTERING, SEGMENTATION

Electricité de France - Research and Development - 1, av. du Général de Gaulle - 92141 Clamart Cedex - France [email protected]

Looking For A Best Compromise Between The Ultrametric Supremum-Norm Approximations

B. Fichet

Abstract

All ultrametric Lp-norm approximations are well-known to be NP-hard, except the L∞-norm (supremum-norm) one, as shown by Farach et al (1995). The authors provided an algorithm to get a solution. Later on, Chepoi et al (2000) established, in a general context, the link with the subdominant ultrametric approximation, showing that the greatest L∞-norm solution derives from a simple translation of the subdominant.
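
The translation result can be illustrated with a short sketch. It computes the subdominant ultrametric of a dissimilarity matrix through minimax paths (equivalent to single linkage) and then adds half of the largest gap between the dissimilarity and its subdominant to every off-diagonal entry. The shift by half the largest gap is the standard construction and is stated here as an assumption about what "simple translation" means; the function names are illustrative, and the input is assumed to be a symmetric matrix with zero diagonal and at least two objects.

```python
import numpy as np

def subdominant_ultrametric(D):
    """Largest ultrametric below D, via minimax path distances (Floyd-Warshall style)."""
    U = np.array(D, dtype=float)
    for k in range(len(U)):
        # Best path through k: its cost is the larger of the two legs.
        U = np.minimum(U, np.maximum(U[:, k][:, None], U[k, :][None, :]))
    return U

def linf_ultrametric(D):
    """Translate the subdominant upwards by half its largest deviation from D."""
    D = np.asarray(D, dtype=float)
    U = subdominant_ultrametric(D)
    off = ~np.eye(len(D), dtype=bool)
    eps = 0.5 * float(np.max((D - U)[off]))
    return U + eps * off, eps

def is_ultrametric(V, tol=1e-12):
    """Check V[i, j] <= max(V[i, k], V[k, j]) for all triples."""
    n = len(V)
    return all(V[i, j] <= max(V[i, k], V[k, j]) + tol
               for i in range(n) for j in range(n) for k in range(n))
```

Since the subdominant never exceeds the dissimilarity and their gap is at most twice the shift, the translated ultrametric deviates from the data by at most the shift in every cell.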

Similar results hold for some upperminimal ultrametrics, but not all of them. Fichet (2012) gave an algorithm to build those appropriate approximations, yielding minimal L∞-norm solutions by translation, hence interval solutions given by them and the subdominant. Then, following Chepoi et al (2000), an optimal consensus may summarize any interval.

In this talk, we try to improve such a compromise. We focus on the choice of the upperminimal ultrametric approximation, with the aim of getting a structure similar to that of the subdominant, and hence of the compromise, for instance having similar preordonnances (linear interpoint distance preorders), similar tree-representations or a common compatible (Robinsonian) order. We will discuss and justify those approaches, through the existence of solutions and our ability to compute them.

References

FARACH, M., KANNAN, S. and WARNOW, T. (1995): A Robust Model for Finding Optimal Evolutionary Trees. Algorithmica, 13, 155-179.

CHEPOI, V. and FICHET, B. (2000): l∞-Approximation via Subdominants. Journal of Mathematical Psychology, 44, 600-616.

FICHET, B. (2012): Intervals as Ultrametric Approximations According to the Supremum Norm. In: M. Deza, M. Petitjean and K. Markov (Eds.): Mathematics of Distances and Applications. ITHEA, Sofia, 147.

Keywords

ULTRAMETRIC, SUBDOMINANT ULTRAMETRIC, UPPERMINIMAL ULTRAMETRICS, SUPREMUM-NORM APPROXIMATIONS.

LIF. Aix-Marseille University. 163 Avenue de Luminy. Case 901. F-13288 Marseille cedex [email protected]

Ultrametric Tree Representation For Three-Way Three-Mode Data With Weights Of Variables And Occasions

Kensuke Tanioka1 and Hiroshi Yadohisa2

Abstract

Three-way three-mode data is defined as $\mathbf{X} \in \mathbb{R}^{|I| \times |J| \times |K|}$, where I, J, and K represent a set of objects, variables, and occasions, respectively, and | · | is the cardinality of a set. Here, we focus on three-way three-mode data which consist of a set of multivariate data among various occasions for the same objects and variables. Such data, not to be confused with three-way three-mode proximity data, are commonly observed in panel or psychological research. In panel research, I, J, and K represent a set of participants, questions, and years, respectively. When a classification structure of I is calculated from the three-way three-mode data, the masking variables and occasions, which possess no classification structure, affect the clustering structure. Milligan (1980) showed the effects of masking variables in two-way two-mode data by conducting Monte Carlo simulations. These effects are also expected to occur in three-way three-mode data.

In this paper, we propose three-way three-mode hierarchical clustering on the basis of the least squares criterion for weighting variables and occasions. Specifically, we extend the method of De Soete (1985) to three-way three-mode data. The method can account for the effects of masking variables and occasions by adding weights to variables and occasions.

References

DE SOETE, G., DESARBO, W.S., and CARROLL, J.D. (1985): Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm, Journal of Classification, 2, 173–192.

MILLIGAN, G. W. (1980): An Examination of the effect of six types of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325–342.

Keywords

ALS, MASKING VARIABLES, MASKING OCCASIONS

Graduate school of Culture and Information Science, Doshisha [email protected] · Department of Culture and Information Science,Doshisha [email protected]

Which Movie Shall I Watch? Ultrametric Based Recommendation System

Pedro Contreras1, Fionn Murtagh1, and Javier Pereira2

Abstract

In previous work we have shown how an ultrametric (Murtagh et al, 2008; Pereira et al, 2010; Contreras et al, 2012) can be used to create hierarchical clusters in constant algorithmic time. In particular we make use of the Baire metric, or the longest common prefix, to construct our classification trees. Sometimes, when a technique to reduce the data dimensionality was needed, we opted to project the data randomly to one dimension (Murtagh et al, 2008).

Our aim in this work is to show how the Baire metric can be used to classify, match and retrieve categorical data. We demonstrate this by creating a movie recommendation system based on the Baire metric and using the MovieLens dataset (http://www.grouplens.org/node/73).
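
The longest common prefix idea can be sketched in a few lines. The code assumes that every item has already been encoded as a digit string of fixed length (how movies are encoded is not described in the abstract, so the codes below are hypothetical) and uses the convention that two strings sharing their first k digits are at distance base^(-k).

```python
from collections import defaultdict

def common_prefix_len(a, b):
    """Length of the longest common prefix of two digit strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def baire_distance(a, b, base=10):
    """Baire (longest common prefix) distance: base ** (-k) for k shared leading digits."""
    return float(base) ** (-common_prefix_len(a, b))

def prefix_clusters(items, depth):
    """One pass over the data: items sharing their first `depth` digits form a cluster."""
    buckets = defaultdict(list)
    for name, code in items.items():
        buckets[code[:depth]].append(name)
    return dict(buckets)

# Hypothetical item codes, for illustration only.
items = {"movie_a": "4257", "movie_b": "4261", "movie_c": "4299", "movie_d": "5257"}
```

Calling `prefix_clusters(items, depth)` for increasing `depth` yields nested partitions, that is, a hierarchy, without computing any pairwise distances.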

References

CONTRERAS, P. and MURTAGH, F. (2012): Fast, Linear Time Hierarchical Clustering Using the Baire Metric. In: Journal of Classification, 29(2):118–143.

MURTAGH, F., DOWNS, G. and CONTRERAS P. (2008): Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding. In: SIAM Journal on Scientific Computing, 30(2):707–730.

PEREIRA, J., SCHMIDT, F., CONTRERAS, P., MURTAGH, F. and H. ASTUDILLO (2010): Clustering and Semantics Preservation in Cultural Heritage Information Spaces. In: RIAO’2010, 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information, 100–105. Paris, France.

Keywords

ULTRAMETRIC, BAIRE METRIC, CLUSTERING, RECOMMENDATION SYSTEMS, INFORMATION RETRIEVAL.

Royal Holloway, University of London. Egham Hill, Egham. England. TW20 [email protected], [email protected] · Universidad Diego Portales. Avenida Ejército 441. Santiago, [email protected]

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Achim Zeileis1, Torsten Hothorn2, and Kurt Hornik3

Abstract

Recursive partitioning (also known as decision trees) is a standard approach for “learning” a nonlinear regression relationship between some response variable and a set of explanatory variables. The result is a partition of the data that can be easily visualized and interpreted. However, classical decision trees typically lack a concept of “significance” and cannot be combined easily with classical parametric models. Therefore, a general framework for model-based recursive partitioning is suggested by Zeileis et al. (2008) that provides a synthesis between parametric models and the algorithmic tree approach.

More formally, an algorithm for model-based recursive partitioning is suggested with the following basic steps: (1) fit a parametric model to a dataset (e.g., via least squares or maximum likelihood), (2) test for parameter instability over a set of partitioning variables, (3) if there is some overall parameter instability, split the model with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the daughter nodes. The algorithm yields a partitioned (or segmented) parametric model that can be effectively visualized and that subject-matter scientists are used to analyzing and interpreting. It enables data-driven detection and modeling of subgroup interactions in parametric regression models. The approach is illustrated using two logistic regression trees for the risk of diabetes in Pima Indian women and the size of treatment effects for a chronic disease, respectively.
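
The four steps can be sketched as a small recursive function. Note the substitutions: the parametric model is ordinary least squares, and the formal parameter instability tests of step (2) are replaced by a much cruder stand-in, namely the relative reduction in residual sum of squares obtained by refitting the model on both sides of a median split of each partitioning variable. The function names and the thresholds `min_size`, `min_gain` and `max_depth` are hypothetical, so this shows the control flow of the algorithm rather than its statistical machinery.

```python
import numpy as np

def fit_ols(X, y):
    """Step 1: least squares fit; returns coefficients and residual sum of squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, float(np.sum((y - X @ beta) ** 2))

def mob_sketch(X, y, Z, min_size=10, min_gain=0.2, max_depth=3, depth=0):
    """X: regressors (including an intercept column); Z: partitioning variables."""
    beta, rss = fit_ols(X, y)
    node = {"beta": beta, "n": len(y)}
    if depth >= max_depth or rss < 1e-12:
        return node
    best = None
    for j in range(Z.shape[1]):
        # Step 2 (stand-in): how much does refitting on both sides of a median split help?
        cut = np.median(Z[:, j])
        left = Z[:, j] <= cut
        if min(left.sum(), (~left).sum()) < min_size:
            continue
        rss_split = fit_ols(X[left], y[left])[1] + fit_ols(X[~left], y[~left])[1]
        gain = (rss - rss_split) / rss
        if best is None or gain > best[0]:
            best = (gain, j, cut, left)
    # Step 3: split only if the most unstable variable clears the threshold.
    if best is None or best[0] < min_gain:
        return node
    _, j, cut, left = best
    node["split_var"], node["cut"] = j, float(cut)
    kw = dict(min_size=min_size, min_gain=min_gain, max_depth=max_depth, depth=depth + 1)
    # Step 4: repeat the procedure in each daughter node.
    node["left"] = mob_sketch(X[left], y[left], Z[left], **kw)
    node["right"] = mob_sketch(X[~left], y[~left], Z[~left], **kw)
    return node
```

Each terminal node of the returned dictionary carries its own coefficient vector, which is what makes the result a partitioned parametric model rather than a tree of constants.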

References

ZEILEIS, A., HOTHORN, T., and HORNIK, K. (2008): Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 17, 492–514.

Keywords

CHANGE POINTS, MAXIMUM LIKELIHOOD, MODEL TREES, PARAMETER INSTABILITY, RECURSIVE PARTITIONING

Universität Innsbruck, [email protected] · Universität Zürich, [email protected] · WU Wirtschaftsuniversität Wien, [email protected]

Predicting Individual Causal Effects (ICE)

Xiaogang Su1 and Joseph Kang2

Abstract

Within Rubin’s causal model, the individual causal effect (ICE) is defined as $E(Y_1 - Y_0 \mid X = x)$ for a subject with $X = x$, where $Y_1$ and $Y_0$ are potential outcomes. Knowledge of ICE implies that of the average causal effect (ACE) and sub-population causal effects, but not vice versa. Moreover, ICE plays a critical role in advancing personalized or stratified medicine. According to this formulation, estimation of ICE is essentially a predictive modeling problem. In this project, two machine learning methods are proposed for predicting ICE with observational data, where the key is how to tease out the confounding and moderating effects of other covariates on causal inference. The first method is based on the causal inference tree (Su et al., 2012, JMLR), while the second is based on k-nearest neighbor and kernel smoothing. We compare the proposed methods with available approaches via simulation and illustrate their use with the NSW data in Dehejia and Wahba (1999, JASA), where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings.
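To make the "prediction problem" view concrete, the following Python sketch estimates the ICE at a point by a plain nearest-neighbor contrast. It is an illustration only and not the authors' estimator: it assumes that all confounders are contained in `X`, uses Euclidean distance, and omits the kernel smoothing mentioned in the abstract. The function name `knn_ice` and its arguments are hypothetical.

```python
import numpy as np

def knn_ice(x0, X, y, treated, k=10):
    """Estimate E(Y1 - Y0 | X = x0) as the mean outcome of the k nearest
    treated subjects minus the mean outcome of the k nearest controls."""
    dist = np.linalg.norm(X - x0, axis=1)   # distance of every subject to x0

    def local_mean(mask):
        nearest = np.argsort(dist[mask])[:k]  # k closest subjects within the group
        return y[mask][nearest].mean()

    return local_mean(treated) - local_mean(~treated)
```

Here `treated` is a boolean array marking the subjects who received the treatment; averaging `knn_ice` over the sample would give a crude estimate of the average causal effect.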

References

DEHEJIA, R. H. and WAHBA, S. (1999): Causal Effects in Nonexperimental Studies: Re-evaluating the Evaluation of Training Programs. Journal of the American Statistical Association, 94, 1053–1062.

SU, X. G., KANG, J., FAN, J. J., LEVINE, R. A., and YAN, X. (2012): Facilitating Score and Causal Inference Trees for Large Observational Studies. Journal of Machine Learning Research, 13, 2955–2994.

Keywords

CAUSAL INFERENCE, CONFOUNDING AND INTERACTING, OBSERVATIONAL DATA, RECURSIVE PARTITIONING, KERNEL SMOOTHING

University of Alabama at Birmingham, [email protected] · Northwestern University, [email protected]


A New Tool For Identifying Qualitative Treatment-Subgroup Interactions: QUINT

Elise Dusseldorp1 and Iven Van Mechelen2

Abstract

When two alternative treatments, A and B, are available for some disease, one subgroup of patients may display a better outcome with treatment A than with B, whereas for another subgroup the reverse may be true. If this is the case, a qualitative (i.e., disordinal) treatment-subgroup interaction is present. Such interactions imply that some subgroups of patients should be treated differently, and are therefore most relevant for personalized medicine. In case of data from randomized clinical trials with many patient characteristics that could interact with treatment in a complex way, a few statistical approaches exist to detect treatment-subgroup interactions; examples include STIMA (Dusseldorp et al., 2010) and Interaction Trees (Su et al., 2009). However, a suitable approach to detect qualitative interactions is not yet available. In the present paper, we propose a new method for this purpose, called QUalitative INteraction Trees (QUINT). QUINT results in a binary tree that subdivides the patients into terminal nodes on the basis of patient characteristics; these nodes are further assigned to one of three classes: a first for which A is better than B, a second for which B is better than A, and an optional third for which type of treatment makes no difference. Results with regard to the optimization and recovery performance of QUINT will be presented.

References

DUSSELDORP, E., CONVERSANO, C., and VAN OS, B.J. (2010): Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19, 514–530.

SU, X., TSAI, C.-L., WANG, H., NICKERSON, D.M., and LI, B. (2009): Subgroup analysis via recursive partitioning. The Journal of Machine Learning Research, 10, 141–158.

Keywords

INTERACTION, MODERATOR, SUBGROUP ANALYSIS, PARTITIONING, CLUSTER

KU Leuven, Netherlands Organisation for Applied Scientific Research TNO [email protected] · KU [email protected]


A Comparison Of Six Sequential Partitioning Methods To Find Subgroups Involved In Treatment-Subgroup Interactions

Lisa Doove1, Elise Dusseldorp2, Katrijn Van Deun3, and Iven Van Mechelen4

Abstract

In case multiple treatment alternatives are available for some medical problem, the detection of treatment-subgroup interactions is of key importance for personalized medicine and the development of optimal treatment assignment strategies. Randomized Clinical Trials (RCT) often go without clear a priori hypotheses on the subgroups involved in treatment-subgroup interactions, and with a large number of pre-treatment characteristics in the data. In situations like this, relevant subgroups (defined in terms of pre-treatment characteristics) are to be induced during the actual data analysis. For such an analysis, six different methods have recently been proposed, all of a sequential partitioning type: model-based recursive partitioning, Interaction Trees, STIMA, SIDES, Virtual Twins, and QUINT. However, they have been developed almost independently, and the relations between them are not yet understood. This presentation closes this gap. Using an illustrative RCT data set, a systematic comparison of the methods is presented, focusing on major similarities and differences.

References

DOOVE, L.L., DUSSELDORP, E., VAN DEUN, K. and VAN MECHELEN, I. (2013): A comparison of five sequential partitioning methods to find person subgroups involved in meaningful treatment-subgroup interactions. Manuscript submitted for publication.

DUSSELDORP, E. and VAN MECHELEN, I. (2013): Qualitative interaction trees: A tool to identify qualitative treatment-subgroup interactions. Manuscript submitted for publication.

Keywords

TREATMENT HETEROGENEITY, SEQUENTIAL PARTITIONING, SUBGROUP ANALYSIS, TREATMENT-SUBGROUP INTERACTION

KU [email protected] · KU Leuven, [email protected] · KU [email protected] · KU [email protected]


Automatic Bayes Factors for Comparing Variances of Two Independent Normal Distributions

Florian Böing-Messing1 and Joris Mulder2

Abstract

When analyzing differences between two independent populations, researchers commonly focus on comparing means. However, it is equally important to investigate differences in the populations’ variances. We often would like to know whether two populations are equally heterogeneous, whether population 1 is more heterogeneous than population 2, or whether population 2 is more heterogeneous than population 1. To answer this question we shall perform a multiple hypothesis test on the variances of two independent normal distributions using the Bayes factor, a Bayesian testing criterion. The Bayes factor has two important properties which are not shared by classical p-values. First, Bayes factors can straightforwardly be used for simultaneously testing multiple hypotheses. Second, the Bayes factor has an intuitive interpretation as the relative evidence in the data in favor of a hypothesis against another hypothesis. However, when using Bayes factors for testing equality constrained hypotheses the choice of the prior plays an important role due to the Jeffreys-Lindley paradox. In this paper different automatic priors will be compared when using Bayes factors in the above multiple testing problem. We investigate the performance of these priors by looking at important properties such as consistency, the information paradox, balancedness, and the similarity with classical p-values. Our results can be used as a guideline for choosing a prior when testing hypotheses on the variances of two independent normal distributions.

Keywords

HOMOGENEITY OF VARIANCE, MULTIPLE HYPOTHESIS TESTING, DEFAULT BAYES FACTOR, JEFFREYS-LINDLEY PARADOX, INFORMATION PARADOX

Department of Methodology and Statistics, Tilburg University, Postbus 90153, 5000 LE Tilburg, the Netherlands, [email protected] · Department of Methodology and Statistics, Tilburg University, Postbus 90153, 5000 LE Tilburg, the Netherlands, [email protected]


Bayesian Model Selection For Evaluating Equality And Order Constraints On Correlation Matrices

Joris Mulder

Abstract

Researchers often formulate their expectations using equality constraints and order constraints on correlation coefficients. When translating these expectations into a set of competing equality-constrained and order-constrained models on the zero-level correlations in an unstructured correlation matrix, the goal is to determine which model receives most evidence from the data. For this purpose, Bayes factors shall be developed. The Bayes factor is a Bayesian model selection criterion that can be used for quantifying the relative evidence in the data in favor of a model in comparison to another model. Particular attention is paid to proper prior specification, which plays a crucial role when computing Bayes factors. Priors will be considered that (i) result in positive definite correlation matrices, (ii) are ‘balanced’ in the sense that every possible ordering is equally likely a priori, and (iii) result in Bayes factors that are consistent when evaluating equality constraints and order constraints on correlations.

Keywords

BAYES FACTOR, CORRELATION MATRIX, PRIOR, COMPLEXITY

Department of Methodology and Statistics, Tilburg University, the [email protected]


Bivariate Dependence Patterns And Copulas: Model Discrimination And Robustness

Lianne Ippel1 and Johan Braeken2

Abstract

Different dependence patterns can be hiding behind the same value of a general dependence measure. In this landscaping Monte Carlo experiment, we investigate the distinguishability of qualitatively different bivariate dependence structures that have equivalent rank-order correlation and fixed univariate distributions with the same means and standard deviations. Hence, the difference in structure only shows graphically and not in any of the summary statistics.

A conceptual and graphical introduction will be given to copula functions, a multivariate modeling approach that allows for construction of such varying dependence structures. Model fit, general and local dependence measures are considered to study the informativeness of the data in discriminating between four of these copula models.

Results stress the importance of focus when assessing differences between models. Although the models discriminate fairly well based upon fit statistics, model misspecification hardly affected general dependence measures. This robustness might make model selection a seemingly non-issue in practice. In contrast, when focus is on local dependence measures and prediction, model misspecification can be rather harmful.

Keywords

COPULA FUNCTIONS, DEPENDENCE, MODEL DISCRIMINATION, MODEL SELECTION

Tilburg School of Social and Behavioral Sciences (TSB), Tilburg [email protected] · Department of Methodology and Statistics, Tilburg [email protected]


Posterior Predictive checking as alternative to Asymptotics and Bootstrapping in Latent Class Analysis

Geert H. van Kollenburg1, Joris Mulder2, and Jeroen K. Vermunt3

Abstract

As the use of latent class analysis becomes more widespread, the importance of correct interpretation and availability of reliable fit statistics increases. Most methods for assessing model fit involve using asymptotic reference distributions, which may not always be appropriate. Using asymptotic p-values on sparse frequency tables can lead to a dramatic increase in Type-I-error levels (Reiser & Lin, 1999).

Resampling techniques can provide empirical p-values that have good properties, even under sparseness. We apply posterior predictive checks (Gelman, Meng & Stern, 1996) to obtain empirical p-values for a number of commonly used fit statistics within latent class analysis.

In a Monte Carlo simulation study we compared the posterior predictive check to the use of asymptotics and to the parametric bootstrap method. Results show that the posterior predictive check is a sound alternative to the use of asymptotics and that it works as well as the parametric bootstrap.

References

GELMAN, A., MENG, X. L. and STERN, H. (1996): Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–759.

REISER, M. and LIN, Y. (1999): Goodness-Of-Fit Test for the Latent Class Model When Expected Frequencies Are Small. Sociological Methodology, 29(1), 81–111.

Keywords

LATENT CLASS ANALYSIS, BAYESIAN MODEL CHECKING, POSTERIOR PREDICTIVE CHECK, BOOTSTRAP

Tilburg University, [email protected] · Tilburg University, [email protected] · Tilburg University, [email protected]


Statistical Modeling Of The Distribution Of Financial Returns

Cuevas-Covarrubias C.1, Iñigo-Martínez J.2 and Rosales-Contreras J.3

Abstract

Most of the models applied in Finance assume that daily financial returns are normally distributed; however, this fundamental assumption is not always satisfied in practice. Financial returns frequently show leptokurtic distributions: does this mean that the Normal Distribution is not useful in Financial Modeling? Estimating the distribution function of financial returns is an important task in Actuarial Mathematics and Risk Theory. This article is a practical discussion on finite Gaussian Mixtures and their potential in Financial Risk Modeling. It is based on the analysis of different financial series from several markets in Latin America. Our discussion considers the estimation of Marginal and Joint Distributions and compares the results with those obtained with other models proposed in the literature. The empirical evidence shows that Gaussian Mixture models have an interesting potential in financial modeling for risk assessment. Our conclusion is that financial returns may not be normally distributed, but they frequently behave as a mixture of Gaussians.
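A short calculation shows why a Gaussian mixture can reproduce leptokurtic returns. The Python sketch below assumes a two-component mixture in which both components have mean zero, so that the moments have closed forms; the function name and the parameter values are hypothetical and are not taken from the financial series analysed by the authors.

```python
def mixture_kurtosis(p, s1, s2):
    """Kurtosis of the mixture p*N(0, s1^2) + (1-p)*N(0, s2^2).

    A single normal distribution has kurtosis 3; larger values mean heavier tails.
    """
    variance = p * s1**2 + (1 - p) * s2**2
    fourth_moment = 3 * (p * s1**4 + (1 - p) * s2**4)  # E[X^4] of N(0, s^2) is 3*s^4
    return fourth_moment / variance**2

# 90% "calm" days with standard deviation 1 and 10% "turbulent" days with standard deviation 3
print(mixture_kurtosis(0.9, 1.0, 3.0))
```

When the two standard deviations coincide the function returns 3, the normal value; any difference between them pushes the kurtosis above 3.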

References

KLUGMAN, S.A. and PARSA, R. (1999): Fitting bivariate distributions with copulas. Insurance: Mathematics and Economics, 24, 139–148.

BEHR, A. and POETTER, U. (2009): Modeling Marginal Distributions of Ten European Stock Market Index Returns. International Research Journal of Finance and Economics, 28, 104–119.

McLACHLAN, G. and PEEL, D. (2000): Finite Mixture Models. Wiley Series in Probability and Statistics, Wiley-Interscience.

Keywords

FINANCIAL RETURNS, RISK THEORY, GAUSSIAN MIXTURES, EXPECTATION-MAXIMIZATION, COPULA MODELING

Universidad Anáhuac, Estado de México, México, [email protected], · Universidad Anáhuac, Estado de México, México. · Instituto Tecnológico de Estudios Superiores de Monterrey, México.


Combining Decision Trees And Stochastic Curtailment For Assessment Length Reduction Of Test Batteries Used For Classification

Marjolein Fokkema1, Niels Smits2, and Henk Kelderman3

Abstract

For classification problems in psychology (e.g., clinical diagnosis), batteries of tests are often administered. However, not every test or item may be necessary for accurate classification. In this paper, we introduce a combination of classification and regression trees (CART; Breiman, Friedman, Olshen & Stone, 1984) and stochastic curtailment (SC; Finkelman, He, Kim & Lai, 2011) to reduce assessment length of questionnaire batteries. First, the CART algorithm provides relevant subscales and cutoffs needed for accurate classification, in the form of a decision tree. Second, for every subscale and cutoff appearing in the decision tree, SC reduces the number of items needed for accurate classification. This procedure is illustrated by post-hoc simulation on a dataset of 3579 patients, to whom the Mood and Anxiety Symptoms Questionnaire (MASQ) was administered. Subscales of the MASQ are used for predicting diagnoses of depression. Results show that CART-SC provided an assessment length reduction of 56%, without loss of accuracy, compared to the more traditional prediction method of performing linear discriminant analysis (LDA) on subscale scores. CART-SC appears to be an efficient and accurate algorithm for shortening test batteries.

References

BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984): Classification and Regression Trees. Wadsworth, New York.

FINKELMAN, M.D., HE, Y., KIM, W. and LAI, A.M. (2011): Stochastic curtailment of health questionnaires: A method to reduce respondent burden. Statistics in Medicine, 30, 1989–2004.

Keywords

TEST BATTERIES, COMPUTERIZED TESTING, SEQUENTIAL TESTING, CLASSIFICATION AND REGRESSION TREES, STOCHASTIC CURTAILMENT, EFFICIENCY

Vrije Universiteit, Amsterdam, [email protected] · Vrije Universiteit, Amsterdam · Vrije Universiteit, Amsterdam


Gaussian Tree Models For Discrimination

Gonzalo Perez–de–la–Cruz1 and Guillermina Eslava–Gomez2

Abstract

We consider Graphical Gaussian models with tree structure in discriminant analysis for two populations. We restrict attention to the case where each model has the same tree structure, though not necessarily the same concentration matrix. By considering a tree structure, the maximum likelihood estimator (MLE) for the concentration matrices can be expressed analytically. Moreover, by considering the same tree structure for each of the two concentration matrices, the estimation of the unknown structure is solvable by finding the minimum weight spanning tree (MWST).

In this work, we propose to use the J-divergence as a measure of discrimination between two populations, and as the criterion to be optimized efficiently by finding the MWST. By using the MLE of each concentration matrix and the MWST we get an estimated discriminant function.

We illustrate the empirical performance of the proposed method and of other existing methods using some data. This example shows similar performance for the methods using tree structure on the models, and better performance than linear and quadratic discriminant analysis for small sample sizes.

References

CHOW, C. and LIU, C. (1968): Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, 14, 462–467.

FRIEDMAN, N., GEIGER, D. and GOLDSZMIDT, M. (1997): Bayesian Network Classifiers. Machine Learning, 29, 131–163.

LAURITZEN, S. L. (1996): Graphical Models. Clarendon Press, Oxford.

TAN, V., SANGHAVI, V., FISHER, J. and WILLSKY, A. (2010): Learning Graphical Models for Hypothesis Testing and Classification. IEEE Transactions on Signal Processing, 58, 5481–5495.

Keywords

DISCRIMINANT ANALYSIS, GRAPHICAL GAUSSIAN MODELS, TREES, J-DIVERGENCE, STRUCTURE ESTIMATION

Postgraduate Studies in Mathematics, National University of Mexico, UNAM. Mexico, D.F. [email protected] · Department of Mathematics, Faculty of Sciences, National University of Mexico, UNAM. Mexico, D.F. [email protected]


Stochastic Curtailment Of Questionnaires For Three Level Classification: Shortening The CES-D For Assessing Low, Moderate, And High Risk Of Depression

Niels Smits1, Matthew Finkelman2, and Henk Kelderman3

Abstract

Health questionnaires are often built up from sets of questions which are totaled to obtain a sum score; often, this score is subsequently used to classify respondents. An important consideration in designing questionnaires is to minimize respondent burden. Finkelman et al. (2011, 2012) introduced stochastic curtailment (SC) as an efficient method of questionnaire administration aimed at classification into two categories, such as ‘at risk’ and ‘not at risk’. SC uses a prediction model for forecasting observed class membership; the strategy is to stop testing when not yet administered items are unlikely to change the respondent’s classification. The current paper adjusts SC for classification into three categories such as ‘low risk’, ‘moderate risk’, and ‘high risk’. It is shown that this adjustment is not trivial. The outcomes of a post hoc simulation study are presented in which real responses on the Center for Epidemiologic Studies Depression scale were used by several versions of SC for classification into three categories. SC substantially reduced the respondent burden while maintaining a high classification quality. Benefits and limitations of this new methodology are discussed.
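The stopping idea is easiest to see in the deterministic special case for two categories, sketched below in Python. This is not the stochastic procedure of Finkelman et al.: it stops only when the remaining items cannot change the classification, whereas SC stops when they are unlikely to, according to a prediction model. The function name, the item scoring and the cutoff are hypothetical, and item scores are assumed to lie between 0 and `max_item_score`.

```python
def curtailed_classification(responses, max_item_score, cutoff):
    """Administer items in order and stop once the sum-score classification is certain.

    Returns (label, number_of_items_administered).
    """
    total = 0
    n_items = len(responses)
    for i, score in enumerate(responses, start=1):
        total += score
        remaining_max = (n_items - i) * max_item_score  # best case for the unseen items
        if total >= cutoff:                    # cutoff already reached
            return "at risk", i
        if total + remaining_max < cutoff:     # cutoff can no longer be reached
            return "not at risk", i
    return ("at risk" if total >= cutoff else "not at risk"), n_items
```

For a 20-item scale scored 0 to 3 with a cutoff of 16, a respondent who answers 3 on every item is classified after 6 items, and one who answers 0 throughout is classified after 15. A three-category version needs two cutoffs and a stopping rule that rules out two of the three classes, which gives some intuition for why the adjustment described in the abstract is not trivial.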

References

FINKELMAN, M. D., HE, Y., KIM, W., and LAI, A. M. (2011): Stochastic curtailment of health questionnaires: A method to reduce respondent burden. Statistics in Medicine, 30, 1989–2004.

FINKELMAN, M. D., SMITS, N., KIM, W. and RILEY, B. (2012): Curtailment and stochastic curtailment to shorten the CES-D. Applied Psychological Measurement, 36, 632–658.

Keywords

COMPUTERIZED TESTING, RESPONDENT BURDEN, CURTAILMENT, ORDINAL REGRESSION

VU University [email protected] · Tufts University School of Dental Medicine, Boston · VU University Amsterdam


Tree-Based Prediction with Missing Data

Holger Cevallos Valdiviezo, Stefan Van Aelst

Abstract

In prediction problems missing data are frequently encountered. Misleading predictions may be obtained if the missing data issue is not addressed correctly. Thus, it is crucial to find an appropriate prediction rule, with low bias and high precision, which takes the uncertainty caused by missing values into account. To handle this problem, we investigated the performance of ten prediction methods based on trees. Some methods handle incomplete data by themselves (e.g. via surrogates) while others use a preliminary imputation step. The methods in question are: CART (surrogate splits), Random Forest (RF) with imputation by either median or proximity matrix, Bagging (surrogate splits), Multiple Imputation via Sequential Trees (MIST) followed by either CART or RF, bootstrap samples imputed by conditional means followed by either CART or RF, bootstrap samples imputed by draws from the conditional distributions followed by either CART or RF.

We studied the performance of these methods on real and simulated high-dimensional datasets with 5%, 10% and 25% of missing data generated completely at random, at random and not at random. We considered both linear and nonlinear data generating models in the simulations. The performance is evaluated on a large test set using mean squared prediction error for regression and misclassification rate for classification. Overall, MIST followed by RF showed very good performance in all scenarios for both regression and classification, with stable predictions across missing data fractions and missingness mechanisms. A computationally less intensive alternative is RF with imputation by proximity matrix, which performs well for lower fractions of missing data. Finally, we compare our findings to related results on the use of surrogates versus multiple imputation that have been published recently.

References

BURGETTE, L.F. and REITER, J. (2010): Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172, 1070–1076.

Keywords

TREE METHODS, PREDICTION, MISSING DATA, IMPUTATION

Ghent [email protected];[email protected]


Sparse Classifier Ensembles for Improved Interpretability.

Werner Adler1, Zardad Khan2, Sergej Potapov1 and Berthold Lausen2

Abstract

Classification tree ensembles like bagged classification trees or random forests (Breiman, 2001) often show improved classification performance in comparison to single trees. This comes at the cost of reduced interpretability, which is an important aspect e.g. in medical applications, where black box methods are unwanted when it comes to decisions regarding future treatment of patients. Several methods exist to combine both improved performance and greater interpretability. For example, Node Harvest, proposed by Meinshausen (2010), is characterized by its interpretability and competitive performance in various situations.

A high diversity between individual base classifiers is deemed to be important for the performance of an ensemble. Hence, our approach to improving the interpretability of classifier ensembles is based on a dramatic reduction of the number of trees constituting the ensemble, depending on their diversity. To achieve this goal, we examine several diversity measurements (Tang et al., 2006) and create sparse classifier ensembles by weighting the individual trees based on these measurements. We report and discuss the results obtained using simulated data as well as a clinical example data set.

References

Breiman, L. (2001): Random forests. Machine Learning, 45, 5–32.

Meinshausen, N. (2010): Node Harvest. The Annals of Applied Statistics, 4(4), 2049–2072.

Tang, E.K., Suganthan, P.N., Yao, X. (2006): An analysis of diversity measures. Machine Learning, 65, 247–271.

Keywords

CLASSIFICATION TREES, ENSEMBLES, DIVERSITY, INTERPRETABILITY

Department of Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany · Department of Mathematical Sciences, University of Essex,[email protected]


A ROC-Optimised Multi-Prototype Classifier

Mario Ziller

Abstract

In many diagnostic problems in medicine, biology, and far beyond that, there has been the desire for detecting typical reference objects. They should act as proof-samples for future ruling in practice. In comparable problems, global distance-based classification turned out to be a useful mathematical vehicle. The application of its results to large data sets moreover operates much faster than applying a local nearest-neighbour-like procedure. In this context, we report on a new multi-prototype classifier which reliably works in many respects, even in multi-class diagnostics.

For a short mathematical outline, let all objects be considered as points in a metric space. Any class to be investigated is modelled as an overlap of potentially different-sized hyperspheres, the centres of which represent the sought reference objects, henceforth referred to as prototypes. The radii of the hyperspheres are individually optimised by a generalised ROC-analysis in which all other hyperspheres are held fixed. For the approximate solution of the entire discrete optimisation problem, a greedy algorithm has been developed. It runs in $O(n^2 k^2)$ time, where $n$ is the number of training objects and $k$ is the number of prototypes to be selected.

In case of multi-class problems, prototypes and related cutoffs are determined for each single class, separately. The diagnostic decision is finalised for that class of maximum specificity when in doubt. Objects not recognised as a member of any of the classes are assigned to an additional remainder-class.

The performance of the classification system presented is demonstrated on various practical examples, and in comparison to other methods.

Keywords

PROTOTYPE CLASSIFIER, MULTI CLASS DIAGNOSTICS, ROC ANALYSIS, GREEDY ALGORITHM

Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Biomathematics Working Group, Greifswald - Insel Riems, [email protected]


Classification of Rounded Shapes with Penalized Signal Regression

Johan J. de Rooi1 and Paul H.C. Eilers1

Abstract

Various medical and biological applications require the classification of two-dimensional (rounded) outlines. In addition to classification, proper preprocessing is needed. We propose a scheme with several steps: 1) rectangular coordinates are converted to polar; 2) scaling and rotation are applied; 3) the radius is lightly smoothed, using (circular) P-splines as a function of the angle; 4) the spline coefficients are used as explanatory variables in logistic penalized signal regression. This set-up has several advantages. Using P-splines makes the signals of equal length, while unsupported regions can be corrected using a difference penalty. The penalty prevents overfitting of the data and makes the problem well-posed. Because the model is a member of the class of generalized linear models, we are not limited to a binomial outcome. Applications show excellent classification performance.

Keywords

SHAPE ANALYSIS; SIGNAL REGRESSION; P-SPLINES

Department of Biostatistics, Erasmus Medical Center, Rotterdam, The [email protected],[email protected]


Classification of Topics on Twitter in Consideration of Time Series Variation

Atsuho Nakayamar1, Hiroyuki Tsurumi2, and Junya Masuda3

Abstract

We address the task of classifying topics in tweet data from Twitter. Twitter is a microblog service that enables its users to post and read text-based messages of up to 140 characters. Twitter has spread rapidly in Japan in recent years thanks to the use of Chinese ideograms. Since Chinese ideograms are symbols representing meanings, the meaning is easy to discern from even a few characters. The 140 characters in Japanese are enough to express a lot of ideas. However, we have to select appropriate words, which represent the keywords of the meaningful topics, from a large number of words. It is important to set criteria for the choice of candidate words. We have used the complementary similarity measure (Sawaki & Hagita, 1996) in order to find appropriate words which represent time series variation of topics and gain more understanding of those characteristics. The complementary similarity measure is a classification method widely applied in the area of character recognition. Then, we will classify the words extracted from the tweet data by using non-negative matrix factorization (NMF) (Lee & Seung, 2000). NMF has advantages for applications involving large and sparse matrices. We empirically show that our method generates a good summary on a dataset of microblog documents about a new line of beverages.
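For readers unfamiliar with NMF, the following Python sketch implements the multiplicative update rules of Lee and Seung for the squared-error objective. It is a generic illustration, not the authors' pipeline: the small constant `eps`, the random initialisation and the fixed number of iterations are assumptions, and a real word-by-document matrix would normally be stored in sparse form.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

With words as rows and tweets (or time periods) as columns, each column of `W` can be read as a topic, and the largest entries in that column point to the words that characterise it.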

References

Lee, D.D. and Seung, H.S. (2000): Algorithms for Non-Negative Matrix Factorization. In K. T. Leen, T. G. Dietterich and V. Tresp (Eds.): Advances in Neural Information Processing Systems, Vol. 13. MIT Press, 556–562.

Sawaki, M. and Hagita, N. (1996): Recognition of Degraded Machine-Printed Characters Using a Complementary Similarity Measure and Error-Correction Learning. IEICE Transactions on Information and Systems, Vol. E79-D, No. 5, 491–497.

Keywords

COMPLEMENTARY SIMILARITY MEASURE, MICROBLOG DATA, NMF

Graduate School of Social Sciences, Tokyo Metropolitan University, 1-1 Minami-Ohsawa, Hachioji-shi, Tokyo 192-0397 Japan, [email protected] · College of Business Administration, Yokohama National University · Dentsu Marketing Insight INC


Classifying Real-World Data With The DDα-Procedure

Pavlo Mozharovskyi1, Karl Mosler1, and Tatjana Lange2

Abstract

The DDα-classifier, a nonparametric, fast and very robust procedure introduced by Lange et al. (201x), is applied to fifty classification problems regarding a broad spectrum of real-world data. The procedure first transforms the data from their original property space into a depth space (Li et al., 2012), which is a low-dimensional unit cube, and then separates them by a projective invariant procedure, called α-procedure (Vasil’ev and Lange, 1998). To each data point the transformation assigns its depth values with respect to the given classes. Here the random Tukey depth (Cuesta-Albertos and Nieto-Reyes, 2008) is employed, which approximates the Tukey depth by minimizing univariate Tukey depths over a finite number of directions. ‘Outsiders’, that is data points having zero depth in all classes, need an additional treatment for classification. Several such treatments are introduced and evaluated. The DDα-procedure has been implemented as an R-package.
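The random Tukey depth described above can be sketched in Python as follows. This is an independent illustration and not the code of the R-package mentioned in the abstract: the number of directions, the Gaussian sampling of directions and the function name are assumptions.

```python
import numpy as np

def random_tukey_depth(x, data, n_directions=100, seed=0):
    """Approximate the Tukey (halfspace) depth of point x with respect to data.

    For each random direction, the univariate depth is the smaller of the two
    fractions of projected data points lying on either side of the projected x;
    the result is the minimum of these depths over all directions.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_directions, data.shape[1]))
    proj_data = data @ directions.T          # shape (n_points, n_directions)
    proj_x = directions @ x                  # shape (n_directions,)
    below = (proj_data <= proj_x).mean(axis=0)
    above = (proj_data >= proj_x).mean(axis=0)
    return float(np.minimum(below, above).min())
```

Computing this depth once per class maps every data point to a vector in the unit cube, which is the depth space in which the separation takes place. A point far outside all classes receives depth zero with respect to each of them, which is what makes it an 'outsider'.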

References

LANGE, T., MOSLER, K. and MOZHAROVSKYI, P. (201x): Fast nonparametric classification based on data depth. Statistical Papers, in press.

LI, J., CUESTA-ALBERTOS, J.A. and LIU, R.Y. (2012): DD-classifier: nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107, 737–753.

VASIL’EV, V.I. and LANGE, T. (1998): The duality principle in learning for pattern recognition (in Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7–16.

CUESTA-ALBERTOS, J.A. and NIETO-REYES, A. (2008): The random Tukey depth. Computational Statistics and Data Analysis, 52, 4979–4988.

Keywords

CLASSIFICATION, SUPERVISED LEARNING, DATA DEPTH, TUKEY DEPTH, OUTSIDERS

Universität zu Köln, Albertus-Magnus-Platz, 50923 Köln, Germany. {mozharovskyi,mosler}@statistik.uni-koeln.de · Hochschule Merseburg, Geusaer Straße, 06217 Merseburg, [email protected]


Comparing High-Dimensional Classifiers: Abuse and Dangers of Overall Accuracy

A. Pedro Duarte Silva

Abstract

Statistical classification has a respected tradition in the support of medical diagnosis. Early applications relied on classical methodologies that assumed training samples with more patients than disease predictors and understood that simple performance measures that do not take into account disease prevalence and the different costs of negative and positive predictions have serious limitations.

More recently, new classification methodologies have been applied to large genomicdata bases where thousands of genes are measured on a few dozen patients. However,many of the studies that have evaluated these proposals employed only overall accuracymeasures. This practice is potentially misleading, as it isknown that changing priorprobabilities and/or cost assumptions can strongly affectthe relative standing of tradi-tional classification rules.

This presentation describes a study on the consequences of comparing high-dimensio-nal classification rules by different performance measures. It will be argued that mea-sures based on expected utilities or decision curves, that focus on the precision of riskestimates near the optimal threshold, should be preferred to overall accuracy. Further-more, it will be shown that when samples proportions are not close to true disease prob-abilities corrected by misclassification costs, the use of overall accuracy can indeed leadto incorrect rankings of high-dimensional classifiers.
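A small numerical sketch can make the ranking reversal concrete. The two classification rules, their sensitivities and specificities, the prevalence and the cost ratio below are all invented for illustration and are not taken from the study; the code only shows that accuracy at the raw prevalence and expected misclassification cost can disagree, while accuracy at the cost-corrected prevalence agrees with the cost ranking.

```python
def overall_accuracy(sens, spec, prevalence):
    """Proportion of correct decisions at a given disease prevalence."""
    return prevalence * sens + (1.0 - prevalence) * spec


def expected_cost(sens, spec, prevalence, cost_fn, cost_fp):
    """Expected misclassification cost per patient."""
    return (prevalence * (1.0 - sens) * cost_fn
            + (1.0 - prevalence) * (1.0 - spec) * cost_fp)


def cost_adjusted_prevalence(prevalence, cost_fn, cost_fp):
    """Disease probability reweighted by the misclassification costs."""
    num = prevalence * cost_fn
    return num / (num + (1.0 - prevalence) * cost_fp)


# Hypothetical rules: name -> (sensitivity, specificity).
rules = {"A": (0.60, 0.95), "B": (0.90, 0.80)}
prevalence = 0.10             # assumed true disease prevalence
cost_fn, cost_fp = 20.0, 1.0  # assumed: a missed case costs 20x a false alarm

best_by_accuracy = max(
    rules, key=lambda r: overall_accuracy(rules[r][0], rules[r][1], prevalence))
best_by_cost = min(
    rules, key=lambda r: expected_cost(rules[r][0], rules[r][1],
                                       prevalence, cost_fn, cost_fp))
adjusted = cost_adjusted_prevalence(prevalence, cost_fn, cost_fp)
best_by_adjusted_accuracy = max(
    rules, key=lambda r: overall_accuracy(rules[r][0], rules[r][1], adjusted))

print(best_by_accuracy, best_by_cost, best_by_adjusted_accuracy)
```

The agreement of the last two rankings is not a coincidence of the numbers: expected cost equals a positive constant times one minus the accuracy evaluated at the cost-adjusted prevalence.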

References

BAKER, S.G., COOK, N.R., VICKERS, A. and KRAMER, B.S. (2009): Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society A, 172, 729–748.
DUARTE SILVA, A.P., STAM, A. and NETER, J. (2002): The Effects of Misclassification Costs and Skewed Distributions in Two-Group Classification. Communications in Statistics: Simulation and Computation, 31, 401–423.

Keywords

CLASSIFIER EVALUATION, DECISION CURVES, HIGH DIMENSIONAL CLASSIFICATION, MISCLASSIFICATION COSTS

Catholic University of Portugal, Faculdade de Economia e Gestão and CEGE, RuaDiogo Botelho 1327, 4169-005 Porto, [email protected]

Divisive Latent Class Modeling as a Density Estimation Tool: The Estimation Algorithm and an Application to Incomplete Data

Daniel W. van der Palm1, L. Andries van der Ark1, and Jeroen K. Vermunt1

Abstract

Traditionally, latent class (LC) analysis is used as a statistical method to identify substantively meaningful groups from multivariate data. More recently, the LC model has also been used as a tool for density estimation. However, the performance of the LC model as a density estimation tool depends on how well the model fits the data. Thus, the optimal number of latent classes must be determined.

A typical model-fit strategy is to estimate a 1-class model, then a 2-class model, and so on, until the best-fitting model has been found according to a certain criterion. However, such a model-fit strategy may require an excessive amount of computation time, especially for datasets containing a large number of variables. Furthermore, during the search for the best-fitting LC model, numerous LC models may have to be estimated and compared manually, which may be an obstacle to researchers and practitioners. Van der Palm, Van der Ark, and Vermunt (2013) have developed a divisive latent class (DLC) model that addresses these two problems. A DLC model is a top-down clustering of respondents into latent classes. It is obtained by estimating a series of one-class and two-class models. Because a DLC model is estimated sequentially, the computation time is greatly reduced in comparison to a standard LC model. In addition to faster results, a DLC model produces the best-fitting latent class model in a single run, without the need for human intervention during the estimation process. In this presentation, we discuss the estimation algorithm of the DLC model and an application to the problem of missing data.

References

Van der Palm, D. W., Van der Ark, L. A., and Vermunt, J. K. (2013). Divisive Latent Class Modeling as a Density Estimation Tool. Submitted.

Tilburg University, Tilburg, The [email protected]

Determining the Number of Clusters in Categorical Data

Cláudia Silvestre1, Margarida Cardoso2, and Mário Figueiredo3

Abstract

Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data.

In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008).

Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the criteria mentioned above, is the speed of execution, which is especially relevant when dealing with large data sets.

References

FIGUEIREDO, M. and JAIN, A. (2002): Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 381-396.
SILVESTRE, C., FIGUEIREDO, M., and CARDOSO, M. (2008): Clustering with Finite Mixture Models and Categorical Variables. In: P. Brito, Physica-Verlag: Proceedings in Computational Statistics 2008. Porto, Portugal, 213.
WALLACE, C. and BOULTON, D. (1968): An information measure for classification. The Computer Journal, 11, 195-209.

Keywords

CLUSTER ANALYSIS, MODEL SELECTION, CATEGORICAL VARIABLES

ESCS, Portugal. [email protected] · BRU-UNIDE, ISCTE-IUL, Portugal. [email protected] · IT, IST, Portugal.mario.a.t.figueiredo@gmail

Identifying Mixtures of Mixtures Using Bayesian Estimation

Gertraud Malsiner-Walli1, Sylvia Frühwirth-Schnatter2, and Bettina Grün1

Abstract

In a mixture of mixtures model the cluster distributions are approximated by a mixture distribution. However, identifying the components forming one cluster is in general not straightforward. To identify the cluster distributions, previous approaches combined mixture components to form clusters after first having selected the total number of components of a suitably fitting model. In our approach the number of clusters and their corresponding cluster distributions are directly estimated during MCMC sampling by imposing suitable priors. In particular, we use informative hierarchical priors for the mixture parameters to encourage the components assigned to the same cluster to have overlapping distributions and to approximate a connected and dense cluster distribution. Using a mixture of mixtures of Gaussian distributions we apply a Bayesian estimation scheme based on MCMC methods and Gibbs sampling to automatically fit a suitable mixture model to each cluster and determine the mixture model on the cluster level. We evaluate our proposed approach in a simulation setup with artificial data and by applying it to benchmark data sets.

References

BAUDRY, J.-P., RAFTERY, A., CELEUX, G., LO, K. and GOTTARDO, R. (2010): Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, 19(2), 332–353.
FRÜHWIRTH-SCHNATTER, S. (2011): Label Switching Under Model Uncertainty. In: K.L. Mengersen, C.R. Robert and D.M. Titterington (Eds.): Mixtures: Estimation and Application. Wiley, 213–239.
HENNIG, C. (2010): Methods for Mixing Gaussian Mixture Components. Advances in Data Analysis and Classification, 4(1), 3–34.

Keywords

BAYESIAN FINITE MIXTURE MODEL, MULTIVARIATE NORMAL DISTRIBUTION, HIERARCHICAL PRIOR, NUMBER OF COMPONENTS

Johannes Kepler University Linz, Department of Applied Statistics, Austria, [email protected], [email protected] · WU Wirtschaftsuniversität Wien, Institute for Statistics and Mathematics, Wien, Austria, [email protected]

Logratio Methodology Applied To Model-Based Clustering

M. Comas-Cufí1, G. Mateu-Figueras1 and J.A. Martín-Fernández1

Abstract

According to Martín et al. (1998) and Palarea-Albaladejo et al. (2012), logratio methodology is appropriate when the data to be clustered are vectors of proportions, i.e., compositional data (CoDa). Model-based clustering of CoDa is common in many fields, such as chemotaxonomy, archaeometry, forensic sciences and geochemistry (e.g., Varmuza and Filzmoser, 2009). This work focuses on finite Gaussian mixture models defined on the simplex, the sample space of CoDa, i.e., each cluster is assumed to be represented by one or several multivariate logratio-normal distributions. In addition, we show that any model-based cluster analysis applied to any type of data, not necessarily CoDa, is enriched when the vector of mixing proportions and the vectors of individuals’ conditional or posterior probabilities (group memberships) are considered elements of the simplex.
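As a rough illustration of the logratio step, the sketch below maps a composition to centred (clr) and isometric (ilr) logratio coordinates in plain Python. The choice of ilr basis (pivot coordinates) and all function names are assumptions made for this example; the abstract does not say which basis the authors use. A Gaussian mixture would then be fitted to the ilr coordinates, which is not shown.

```python
import math


def closure(x):
    """Rescale positive parts so that they sum to one."""
    total = sum(x)
    return [v / total for v in x]


def clr(x):
    """Centred logratio: log parts minus their mean (sums to zero)."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def ilr(x):
    """Isometric logratio in pivot coordinates (one assumed basis choice).

    Coordinate i contrasts part i with the mean log of the parts after it,
    giving D - 1 unconstrained real coordinates for a D-part composition.
    """
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(x) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - sum(rest) / k))
    return coords


def norm(v):
    return math.sqrt(sum(c * c for c in v))


composition = closure([2.0, 3.0, 5.0])  # toy 3-part composition
print(clr(composition))
print(ilr(composition))
```

Because the transform depends only on ratios of parts, the same coordinates apply to mixing proportions or posterior membership vectors, as long as no part is exactly zero.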

References

MARTÍN, J.A., BARCELÓ, C. and PAWLOWSKY, V. (1998): Critical Approach to Non-Parametric Classification of Compositional Data. In: A. Rizzi, M. Vichi, and H.H. Bock (Eds.): Advances in Data Science and Classification. Springer-Verlag, Berlin, 49–56.
PALAREA-ALBALADEJO, J., MARTÍN-FERNÁNDEZ, J.A. and SOTO, J.A. (2012): Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. Journal of Classification, 29(2), 144–169.
VARMUZA, K. and FILZMOSER, P. (2009): Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press, Boca Raton, FL, USA, 321 pp.

Acknowledgments

Projects MTM2012-33236 (MSI) and 2009SGR424 (AGAUR).

Keywords

COMPOSITIONAL DATA, ISOMETRIC LOGRATIO TRANSFORMATION

Department of Computer Science, Applied Mathematics and Statistics, Univ. of Girona, Campus de Montilivi, 17071 Girona, [email protected]

Model-based Clustering Of Multivariate Longitudinal Data

Laura Anderlucci, Angela Montanari, and Cinzia Viroli

Abstract

Multivariate longitudinal data arise when different individual characteristics are investigated over time. When modeling this kind of data, correlation between measurements on each individual should be taken into account. In this work, we consider the problem of clustering longitudinal data on multiple response variables. The issue can be addressed by means of matrix-normal distributions (Viroli, 2011). An explicit assumption of this approach is that the total variability can be decomposed into a ‘within multiple attributes’ and a ‘between times’ component. This yields a separability condition for the total covariance matrix, which factors into two covariance matrices, one referring to the attributes and the other to the times. Following McNicholas and Murphy (2010), we parameterize the class-conditional ‘between’ matrices through the modified Cholesky decomposition (Newton, 1988). This mixture model can be fitted using an expectation-maximization (EM) algorithm, and model selection can be performed with the BIC and AIC information criteria. The effectiveness of the proposed approach has been tested through a large simulation study and an application to a sample of data from the Health and Retirement Study (HRS) survey.
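The separability condition and the modified Cholesky parameterization can be written out as follows. The notation is an assumption introduced here for illustration (it is not taken from the abstract): X_i is the J × T matrix of J attributes observed at T times for individual i, k indexes the mixture component, Φ_k is the ‘within attributes’ covariance and Ω_k the ‘between times’ covariance.

```latex
% Separable (Kronecker) covariance implied by a matrix-normal component:
\operatorname{vec}(X_i) \mid k \;\sim\;
  N\bigl(\operatorname{vec}(M_k),\; \Omega_k \otimes \Phi_k\bigr),
\qquad \Phi_k \in \mathbb{R}^{J \times J},\quad
       \Omega_k \in \mathbb{R}^{T \times T}.

% Modified Cholesky decomposition of the 'between times' matrix, with
% L_k unit lower triangular and D_k diagonal with positive entries:
L_k \, \Omega_k \, L_k^{\top} = D_k
\quad\Longleftrightarrow\quad
\Omega_k^{-1} = L_k^{\top} D_k^{-1} L_k .
```

Under this reading, constraints on L_k and D_k across components would give a family of more or less parsimonious models among which BIC or AIC can choose.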

References

McNICHOLAS, P. and MURPHY, B. (2010): Model-based clustering of longitudinal data. The Canadian Journal of Statistics, 38, 153-168.
NEWTON, H.J. (1988): TIMESLAB: A Time Series Analysis Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.
VIROLI, C. (2011): Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21, 511-522.

Keywords

MULTIVARIATE LONGITUDINAL DATA, MIXTURE MODELS, THREE-WAY DATA

Department of Statistical Sciences ‘P.Fortunati’ - University of [email protected],[email protected],[email protected]

A Bayesian Multilevel Modeling of Longitudinal Data: Application to Hygroscopic Expansion in Composite Resins

Nasim Vahabi1, Mahmood Reza Gohari2, and Ali Azarbar3

Abstract

Hierarchically structured and correlated data, particularly longitudinal data, are widely used in many areas of scientific research. Multilevel models (ML), also known as hierarchical linear mixed models or random coefficient models, are used to analyze clustered data; they contain subject-specific random effects defined over the clusters, as well as covariate effects at any level. For at least the past 10 years, Bayesian multilevel models (BML), which are the focus of this paper, have been considered in many studies; they have the benefit of ensuring that all sources of uncertainty are reflected in the posterior inferences. The application of BML is illustrated with laboratory dental data from the Endodontics Research Center of Shahid Beheshti University of Medical Sciences.

References

Browne, W.J. and Draper, D. (2006): A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1(3), 473-514.
Diggle, P.J., Liang, K.Y. and Zeger, S.L. (2000): Analysis of Longitudinal Data. Oxford University Press, London.
Goldstein, H. (2003): Multilevel statistical models. Thousand Oaks, CA.
Verbeke, G. and Molenberghs, G. (2000): Linear Mixed Models for Longitudinal Data. Springer, New York.

Keywords

BAYESIAN MULTILEVEL MODEL, LONGITUDINAL DATA, HYGROSCOPIC EXPANSION

Tehran University of Medical Sciences, [email protected] · Tehran University of Medical Sciences, [email protected] · Alborz University, [email protected]

A New Approach To Analyse Longitudinal Epidemiological Data With An Excess Of Zeros

A.S. Spriensma123, T.R.S. Hajos34, M.R. de Boer25, M.W. Heijmans123, and J.W.R. Twisk123

Abstract

Within longitudinal epidemiological research, ‘count’ outcome variables with an excess of zeros frequently occur. Although these outcomes are frequently analysed with a linear mixed model or a Poisson mixed model, a two-part mixed model would be better suited to outcome variables with an excess of zeros. Therefore, the objective of this research was to introduce the relatively ‘new’ method of two-part joint regression modelling in longitudinal data analysis for outcome variables with an excess of zeros, and to compare the performance of this method to current approaches.

Within an observational longitudinal dataset, we compared three techniques: two ‘standard’ approaches (a linear mixed model and a Poisson mixed model) and a two-part joint mixed model (a binomial/Poisson mixed distribution model), including random intercepts and random slopes. Model fit indicators and differences between predicted and observed values were used for comparisons. The analyses were performed with STATA using the GLLAMM procedure.

Regarding the random intercept models, the two-part joint mixed model (binomial/Poisson) performed best. Adding random slopes for time to the models changed the sign of the regression coefficient for both the Poisson mixed model and the two-part joint mixed model (binomial/Poisson) and resulted in a much better fit.

This research showed that a two-part joint mixed model is a more appropriate method to analyse longitudinal data with an excess of zeros than a linear mixed model or a Poisson mixed model. However, in a model with random slopes for time, a Poisson mixed model also performed remarkably well.

Keywords

TWO-PART JOINT MODEL, EXCESS OF ZEROS, COUNT, MIXED MODELLING, LONGITUDINAL, STATISTICAL METHODS

Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The [email protected] · Department of Methodology and Applied Biostatistics, Faculty of Earth and Life Sciences, Institute for Health Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Department of Medical Psychology, VU University Medical Centre, Amsterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands

A Linear Mixed Model with a Mixture of Smooth Random Effects Distributions

Berrie Zielman

Abstract

Longitudinal data, in which measurements are recorded at a series of time points, are often collected in medicine, microeconomics, biology, pharmacokinetics and other fields. The linear mixed effects model is a popular model for the analysis of longitudinal data. These models incorporate fixed effects and random effects. The random effects are drawn from a distribution, which is usually the normal distribution. The assumption of a normal distribution is not always realistic, and is sometimes replaced by a mixture distribution (Verbeke and Lesaffre, 1996).

A smooth random effects distribution in a linear mixed model is proposed that is similar to the one in Ghidey, Lesaffre and Eilers (2004). Our approach differs from theirs in that penalized estimation is not used and that the parameters of the grid in the random effects distribution are estimated from the data and not specified by the user. The random effects distribution in the model is built up from a mixture of normal distributions with equally spaced means. The model contains the linear mixed model as a special case when the estimated distances between the means are zero. By imposing constraints on the mixture probabilities and the means of the normal distributions we obtain a mixture of smooth distributions.

References

GHIDEY, W., LESAFFRE, E. and EILERS, P. (2004): Smooth Random Effects Distribution in a Linear Mixed Model. Biometrics, 60, 945–953.
VERBEKE, G. and LESAFFRE, E. (1996): A Linear Mixed Model with Heterogeneity in the Random-effects Population. Journal of the American Statistical Association, 91, 217–221.

Keywords

LONGITUDINAL DATA, SKEWED DISTRIBUTION, LINEAR MIXED MODEL, MIXTURE OF SMOOTH RANDOM EFFECTS DISTRIBUTIONS

Netherlands Court of Audit, Lange Voorhout 8, Den Haag, [email protected]

Longitudinal IRT Modelling compared with Multilevel Analysis in estimating Development Over Time In Data From Three Likert-Item Questionnaires

R. Gorter13, M.R. de Boer234, M.W. Heijmans123, and J.W.R. Twisk123

Abstract

The objective was to compare the outcomes of Multilevel (ML) modelling with Multilevel Item Response Theory (ML IRT) modelling when estimating development over time in ordinal questionnaire data from a longitudinal cohort study. Data were obtained from the Longitudinal Aging Study Amsterdam (LASA), an observational cohort study among the elderly (n=2987). The two models were fitted to the data and compared in their performance in analysing development over time by means of parameter estimates and observed-predicted plots. We found that the ML IRT model gives a more accurate prediction of the data than the ML model in all three questionnaires. We also found differences in the estimated time effects. The ML IRT and the ML model give different results in terms of predicted values and time effect estimates when applied to the LASA questionnaire data. The difference between the two models is most evident in the HADS questionnaire, which is heavily skewed to the right. The differences in results between the models may lead to incorrect conclusions with respect to the development over time when using the ML model.

Keywords

ML IRT, LONGITUDINAL, ORDINAL DATA

Department of Epidemiology and Biostatistics, VU University Medical Centre, Amsterdam, The [email protected] · Institute for Health Sciences, Faculty of Earth and Life Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands.

Mutual Information, Chi-Squared And Model-Based Clustering For Co-Clustering Of Contingency Tables

Mohamed Nadif1 and Gérard Govaert2

Abstract

Given a data matrix defined on two sets I and J, co-clustering considers the two sets simultaneously and organizes the data into homogeneous blocks. Different approaches and algorithms have been proposed. For co-occurrence data matrices, Dhillon et al. (2003) proposed an information-theoretic co-clustering algorithm that treats a non-negative matrix as an empirical joint probability distribution of two discrete random variables. They cast the co-clustering problem as an optimization problem in information theory and developed a popular algorithm, termed ITCC. The latter consists in maximizing the mutual information associated with a pair of partitions.

In this work, we embed the co-clustering problem in model-based clustering. Two approaches are considered: the first one, called block model, assumes that the partitions are unknown parameters, and the second one, called latent block model, assumes that the partitions are latent variables (Govaert and Nadif, 2008, 2010). We develop the two approaches, propose models and algorithms, and establish the connections with ITCC and other algorithms.
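The mutual-information criterion can be illustrated on a toy table. The sketch below is not the ITCC algorithm itself (there is no optimization loop); it only evaluates, for given row and column partitions, how much of the table's mutual information survives aggregation into blocks. The table, the partitions and the function names are invented for the example.

```python
import math


def mutual_information(table):
    """Mutual information (in nats) of the joint distribution table / total."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:
                ratio = n_ij * total / (row_sums[i] * col_sums[j])
                mi += (n_ij / total) * math.log(ratio)
    return mi


def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of each (row cluster, column cluster) block."""
    blocks = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            blocks[row_labels[i]][col_labels[j]] += n_ij
    return blocks


# Toy co-occurrence table with two obvious diagonal blocks.
table = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
good = aggregate(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = aggregate(table, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)

print(mutual_information(table))  # information in the full table
print(mutual_information(good))   # retained by the block-respecting partition
print(mutual_information(bad))    # retained by a partition that mixes blocks
```

A co-clustering algorithm of this type would search over the label vectors to keep the retained information as large as possible.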

References

DHILLON, I.S., MALLELA, S. and MODHA, D.S. (2003): Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD, 89–98.
GOVAERT, G. and NADIF, M. (2008): Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics & Data Analysis, 52, 6, 3233–3245.
GOVAERT, G. and NADIF, M. (2010): Latent Block Model for Contingency Table. Communications in Statistics–Theory and Methods, 39, 3, 416–425.

Keywords

CO-CLUSTERING, BLOCK MODEL, LATENT BLOCK MODEL

LIPADE, University of Paris Descartes, 75006 Paris, France,[email protected] · HEUDIASYC, CNRS 7253,University of Technology of Compiègne, 60205 Compiègne, France,[email protected]

Parsimonious Estimation And Testing Of Two-Way Interaction By Means Of Two-Mode Clustering

Jan Schepers

Abstract

We consider the problem of estimating and testing two-way interaction in between-subjects factorial designs involving two factors and a continuous response variable. Except for 2×2 designs, the classical (ANOVA) omnibus F-test for two-way interaction may imply too many parameters being estimated. A new method is therefore proposed in which two-mode clustering is applied to capture only the most salient interactions. An F-like test statistic calculated from the parameter estimates of this two-mode clustering, and a resampling approach to estimate its null distribution, are discussed. Simulations suggest that this new method has an empirical Type I error rate close to the nominal level, and an empirical Type II error rate lower than that of the classical omnibus F-test if the numbers of clusters for the two factors are large enough. After interaction has been detected, the method can also be used to interpret it because the two-mode clustering indicates which tetrad differences (i.e., $\mu_{ij} - \mu_{i'j} - \mu_{ij'} + \mu_{i'j'}$) differ from zero, and which do not.
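The link between a two-mode clustering and the tetrad differences can be checked on a small constructed example. In the sketch below the cell means are built as main effects plus an interaction term that is constant within each (row cluster, column cluster) block; the effect values and the partitions are invented, and the estimation and testing steps of the method are not shown.

```python
def tetrad(mu, i, i2, j, j2):
    """Tetrad difference mu[i][j] - mu[i2][j] - mu[i][j2] + mu[i2][j2]."""
    return mu[i][j] - mu[i2][j] - mu[i][j2] + mu[i2][j2]


# Hypothetical design: 4 levels of factor A, 3 levels of factor B.
row_effect = [0, 1, 2, 3]
col_effect = [0, 10, 20]
row_cluster = [0, 0, 1, 1]   # levels 0,1 and 2,3 of A cluster together
col_cluster = [0, 0, 1]      # levels 0,1 of B cluster together
block_interaction = [[0, 0], [0, 5]]  # one interaction value per block

mu = [
    [row_effect[i] + col_effect[j]
     + block_interaction[row_cluster[i]][col_cluster[j]]
     for j in range(3)]
    for i in range(4)
]

# Rows from the same row cluster: the tetrad difference vanishes.
print(tetrad(mu, 0, 1, 0, 2))
# Rows and columns from different clusters: the interaction shows up.
print(tetrad(mu, 0, 2, 0, 2))
```

Any tetrad whose two rows, or whose two columns, fall in the same cluster is zero by construction, which is what makes the clustering usable for interpreting a detected interaction.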

References

GOLLOB, H.F. (1968): A Statistical Model Which Combines Features of Factor Analytic and Analysis of Variance Techniques. Psychometrika, 33, 73–115.
CALINSKI, T. and CORSTEN, L.C.A. (1985): Clustering Means in ANOVA by Simultaneous Testing. Biometrics, 41, 39–48.
VAN ROSMALEN, J., GROENEN, P.J.F., TREJOS, J. and CASTILLO, W. (2009): Optimization Strategies for Two-Mode Partitioning. Journal of Classification, 26, 155–181.

Keywords

TWO-WAY INTERACTION, ANOVA, PARSIMONIOUS ESTIMATION, TWO-MODE CLUSTERING, TYPE I AND II ERROR RATE

Faculty of Psychology and Neuroscience, Maastricht [email protected]

A General Model for Two-mode Clustering

Maurizio Vichi1

Abstract

For big data, represented by matrices with a huge number of rows and columns, the main analysis is frequently a two-mode clustering (co-clustering), which tries to mine and synthesize the relevant information by reducing the data to a matrix of compact dimensions formed by prototype objects and variables. This is achieved by simultaneously grouping rows and columns so that results are informative and easy to interpret, yielding a compressed but relevant representation of the big data while trying to preserve most of the original information. The reduction is generally soft, giving a light compression of the multivariate data that allows the subsequent application of other multivariate statistical methods that are computationally prohibitive for large data sets.

A general two-mode clustering technique is proposed. A coordinate descent algorithm is developed. Applications to both synthetic and real datasets validate the performance and applicability of the new algorithm.

Keywords

TWO-MODE CLUSTERING, DOUBLE K-MEANS, DISJOINT PRINCIPAL COMPONENT ANALYSIS, ROBUSTNESS

Department of Statistics, Sapienza University of [email protected]

Comprehensive Calculations of the Sensitivity and Specificity of Diagnosis Using Bile Cytological Data

Tatsunami S.1, Hayakawa C.2, Koike J.2, Hoshikawa, M.2, and Ueno T.1

Abstract

Some pathological data used for clinical diagnosis are composed of dichotomous variables only. Values of 1/0 represent positivity/negativity of specific characteristics for the biological item of interest. Because the number of items in a row of such data is not very large, the variety of possible combinations of candidate items that could be used for diagnosis is not very large either. We tried to compute the sensitivity and specificity of diagnosis for all possible combination patterns.

We used bile cytological data that are used for the diagnosis of cholangiocarcinoma. The number of items in a data row was 21, from item 1 to item 21, for each patient. The sensitivity and specificity of tentative diagnostic criteria were computed for all of the possible patterns of positive item combinations used for the diagnosis. Results were compared to those from multivariate analyses.

When we used positivity of item 16 and item 18 as the diagnostic criteria for cholangiocarcinoma, the sensitivity and specificity were 0.78 and 0.97. Combinations of any other two items’ positivity, and combinations of three or more items, gave worse results. The sensitivity and specificity from the logistic regression method were 0.96 and 0.97. Both the logistic regression method and Hayashi’s Q2 method showed clear-cut discriminant ability when using more than six items.

Although the final diagnosis for a patient is obtained by also referring to other data, improvement of the accuracy of diagnosis using cytological data alone is expected in the clinical field. The present results showed the potential of cytological diagnosis. However, at the same time, they clearly suggested that diagnosis by simple combinations of two or three items’ positivity will not provide reliable diagnostic criteria in the case of bile data.
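The exhaustive evaluation described above can be sketched as a loop over item subsets. The example assumes that a tentative criterion calls a patient positive when every item in the subset is positive (a conjunctive rule, which matches the "item 16 and item 18" wording but is otherwise an assumption), and it uses a tiny invented data set with three items rather than the 21 cytological items.

```python
from itertools import combinations


def sens_spec(rows, labels, items):
    """Sensitivity and specificity of the rule 'all listed items positive'.

    Assumes both diseased and non-diseased patients are present.
    """
    tp = fn = tn = fp = 0
    for row, diseased in zip(rows, labels):
        predicted = all(row[i] == 1 for i in items)
        if diseased:
            if predicted:
                tp += 1
            else:
                fn += 1
        else:
            if predicted:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def evaluate_all_rules(rows, labels):
    """Evaluate every non-empty subset of items as a diagnostic criterion."""
    n_items = len(rows[0])
    results = {}
    for size in range(1, n_items + 1):
        for items in combinations(range(n_items), size):
            results[items] = sens_spec(rows, labels, items)
    return results


# Invented toy data: 1 = item positive; label 1 = disease present.
rows = [
    [1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 0],   # diseased
    [0, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 1],   # not diseased
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for items, (sens, spec) in sorted(evaluate_all_rules(rows, labels).items()):
    print(items, sens, spec)
```

With 21 items there are 2^21 − 1 (about 2.1 million) such subsets, which is why a complete enumeration remains feasible for data of this kind.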

Keywords

DICHOTOMOUS DATA, DIAGNOSIS, CYTOLOGY, SENSITIVITY

Unit of Medical Statistics, Faculty of Medical Education and Culture, St. Marianna University School of Medicine, Kawasaki, Japan [email protected] · Department of Pathology, Kawasaki Municipal Tama Hospital, St. Marianna University School of Medicine, Kawasaki, Japan 214-8525

Diagnostics for the Risk Prediction of Each Type of Endoleak Formation after TEVAR Using Statistical Discriminant Analysis

Kuniyoshi Hayashi1,5, Fumio Ishioka2,5, Bhargav Raman3, Daniel Y. Sze3, Hiroshi Suito1,5, Takuya Ueda4,5, and Koji Kurihara1,5

Abstract

A quantitative assessment of results obtained by statistical analysis has the potential to generate findings that will enable better therapy planning by doctors. However, we feel that the usefulness and impact of its application in real data analyses in the field of medicine have not been widely and clearly shown. In this study, we focus on thoracic endovascular aortic repair (TEVAR), a minimally invasive technique involving stent-graft placement. Based on Nakatamari et al. (2011), we use linear discriminant analysis to evaluate the risk of formation of each type of endoleak, which is a clinical side effect of TEVAR. Next, we use sensitivity analysis with influence functions to identify influential patients for risk prediction. Finally, we investigate the findings obtained on the basis of an analysis of their characteristics.

References

NAKATAMARI, H., UEDA, T., ISHIOKA, F., RAMAN, B., KURIHARA, K., RUBIN, G.D., ITO, H. and SZE, D.Y. (2011): Discriminant analysis of native thoracic aortic curvature: risk prediction for endoleak formation after thoracic endovascular aortic repair. Journal of Vascular and Interventional Radiology, 22, 974–979.

Keywords

INFLUENCE FUNCTIONS, LINEAR DISCRIMINANT ANALYSIS, QUANTITATIVE ANALYSIS OF AORTIC MORPHOLOGY

Graduate School of Environmental and Life Science, [email protected], [email protected], [email protected] · School of Law, Okayama [email protected] · Department of Radiology, Stanford University School of [email protected], [email protected] · Department of Radiology, St. Luke’s International Hospital [email protected] · CREST, Japan Science and Technology Agency

Extension Of A Multilingual Medical Lexicon By Combined Feature Extraction Methods

Wiebke Petersen1, Denis Anuschewski1, Pascal Chave1, and Philipp F. Zeitz2

Abstract

The 2011 digital, multilingual dictionary of ophthalmology by the practicing ophthalmologist Zeitz (http://zeitzfrankozeitz.de/index.php/Dictionary_of_Ophthalmology.html), with its more than 24,000 medical terms in 13 languages arranged by synonymy, was developed to support the practicing physician who, in a time of increased mobility of people, often has to translate international medical reports. These translations can involve severe translation errors (cf. Zeitz & Petersen 2013). Alas, not all languages are covered to the same extent (cf. German: 6584 terms, Russian: 252 terms). In our talk we introduce an approach to semi-automatically enrich the dictionary’s structure and content. We tag medical terms with attributes corresponding to shared features and calculate the corresponding concept lattice (cf. Ganter & Wille 1999). Following Janssen’s ideas, we use concept lattices as an interlingua in our dictionary in order to facilitate browsing through its content and to fill the lexical gaps by paraphrases based on the attribute tags (cf. Janssen 2004). We aim at tagging the terms semi-automatically by extracting attributes from: (a) the linguistic contexts in which they occur, (b) the morphemes of which they are composed, and (c) their position in given hierarchical categorizations. Terms that are similar with respect to these aspects share a common attribute, and the automatically extracted attributes can be translated and checked by human experts in a second step. To this end, several sources for automatic extraction of attributes were explored: classifications such as the International Classification of Diseases ICD-10 and English-language Wikipedia articles on ophthalmology.
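How a concept lattice arises from attribute tags can be shown on a tiny formal context. The terms and attribute tags below are invented for illustration and are not entries of the dictionary; the enumeration is a brute-force sketch that closes every subset of terms, which is only practical for very small contexts.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a small context.

    `context` maps each object (term) to its set of attributes. Every
    subset of objects is closed: its shared attributes form the intent,
    and all objects carrying that intent form the extent.
    """
    objects = list(context)
    all_attributes = set().union(*context.values())
    found = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = set(all_attributes)
            for obj in subset:
                intent &= context[obj]
            extent = frozenset(o for o in objects if intent <= context[o])
            found.add((extent, frozenset(intent)))
    return found


# Hypothetical attribute tags for three ophthalmological terms.
context = {
    "cataract": {"lens", "disease"},
    "glaucoma": {"optic nerve", "disease"},
    "crystalline lens": {"lens", "anatomy"},
}

for extent, intent in sorted(formal_concepts(context),
                             key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

Each concept groups the terms that share a set of tags; a term missing in one language could then be paraphrased from the intent of the concept it belongs to.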

References

GANTER, B. and WILLE, R. (1999): Formal concept analysis: mathematical foundations. Berlin: Springer.
JANSSEN, M. (2004): Multilingual Lexical Databases, Lexical Gaps, and SIMuLLDA. International Journal of Lexicography, 17, 136-154.
ZEITZ, P.F. and PETERSEN, W. (2013): Übersetzungsfehler in der Augenheilkunde. Klinische Monatsblätter für Augenheilkunde, 230(3), 275-277.

Keywords

MULTILINGUAL LEXICON, FORMAL CONCEPT ANALYSIS, FEATURE EXTRACTION

Institute of Linguistic and Information, University of Düsseldorf · Praxis Zeitz Franko Zeitz, Praxis für Augenheilkunde, Düsseldorf

The Joy of Fuzzy

Michael Greenacre1

Abstract

Canonical correspondence analysis and redundancy analysis are two methods of constrained ordination regularly used in the analysis of ecological data when several response variables (for example, species abundances) are related linearly to several explanatory variables (for example, environmental variables, spatial positions of samples). In this talk I demonstrate the advantages of the fuzzy coding of explanatory variables: first, nonlinear relationships can be diagnosed; second, more variance in the responses can be explained; and third, in the presence of categorical explanatory variables (for example, years, regions) the interpretation of the resulting triplot ordination is unified because all explanatory variables are measured at a categorical level.
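Fuzzy coding of a continuous explanatory variable can be sketched with triangular membership functions. The hinge points (for instance the minimum, median and maximum of the variable) and the function name are assumptions chosen for this example; the talk may use a different number of categories or a different way of choosing hinges.

```python
def fuzzy_code(x, hinges):
    """Triangular fuzzy coding of x on increasing hinge points.

    Returns one membership value per hinge; the values are non-negative,
    sum to one, and at most two neighbouring categories are non-zero.
    Values outside the hinge range are assigned to the end category.
    """
    memberships = [0.0] * len(hinges)
    if x <= hinges[0]:
        memberships[0] = 1.0
    elif x >= hinges[-1]:
        memberships[-1] = 1.0
    else:
        for m in range(len(hinges) - 1):
            low, high = hinges[m], hinges[m + 1]
            if low <= x <= high:
                weight = (x - low) / (high - low)
                memberships[m] = 1.0 - weight
                memberships[m + 1] = weight
                break
    return memberships


# Hypothetical variable (e.g. temperature) with hinges at 0, 10 and 20.
hinges = [0.0, 10.0, 20.0]
for value in (5.0, 10.0, 15.0):
    print(value, fuzzy_code(value, hinges))
```

Inside the hinge range the original value can be recovered as the membership-weighted average of the hinges, so the coding loses no information there, while the resulting columns can be treated like (fuzzy) categories alongside genuinely categorical predictors.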

Background material and references for the topic of this talk can be found in Asan and Greenacre (2011) and Greenacre (2013).

References

ASAN, Z. and GREENACRE, M. (2011): Biplots of fuzzy coded data. Fuzzy Sets and Systems, 183, 57–71.
GREENACRE, M. (2013): Fuzzy coding in constrained ordinations. Ecology, 94(2), 280–286.

Keywords

FUZZY CODING, CANONICAL CORRESPONDENCE ANALYSIS, MIXED-SCALE PREDICTORS IN CONSTRAINED ORDINATIONS

Universitat Pompeu [email protected]


Fast Iterative Implementation of Correspondence Analysis

Alfonso Iodice D’Enza1, Patrick J. Groenen2 and Michel van de Velden2

Abstract

The eigenvalue decomposition (EVD) and the related singular value decomposition (SVD) are the core step of several dimension reduction methods, such as principal components analysis (PCA; Jolliffe, 2002) and multiple correspondence analysis (MCA; Greenacre, 2007), that apply to quantitative and qualitative data, respectively. Most modern applications have in common the large amount of data to be analyzed, and the application of standard implementations of both EVD and SVD becomes unfeasible due to their high computational cost. To address this, several algorithms have been proposed in the literature that aim to increase the computational speed and efficiency of EVD and SVD. The majority of the proposed procedures focus on the case of quantitative variables. When dealing with binary variables, however, a further peculiarity arises and has to be taken into account, namely data sparsity. In the analysis of data flows or in interactive data visualization, repeated analyses are needed to keep the solution up to date when new data come in or when the user interacts with the display, respectively. A further case is the assessment of the significance of an MCA solution, which requires repeated analyses of bootstrap replicates of the data. In a common setting, these methods are unfeasible for large data sets. In the present paper an efficient implementation of MCA is proposed that addresses the sparsity of the data and the need for computational speed in the case of repeated analyses, exploiting both enhanced sparse matrix computations and fast iterative methods for matrix decompositions.
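The keyword "power method" suggests the kind of iterative scheme involved. The sketch below is a generic power iteration for the leading singular triplet of a matrix; it is a textbook version written for illustration, with a hypothetical function name and toy indicator matrix, and should not be read as the authors' implementation.

```python
import numpy as np

def leading_singular_triplet(A, n_iter=200, seed=0):
    """Power iteration for the largest singular value of A and its vectors.

    Only products with A and A.T are needed, so A could equally be stored
    in a sparse format; a dense array is used here to keep the sketch short.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = A @ v                      # left vector (unnormalized)
        u /= np.linalg.norm(u)
        v = A.T @ u                    # right vector (unnormalized)
        sigma = np.linalg.norm(v)      # current singular value estimate
        v /= sigma
    return u, sigma, v

# Hypothetical indicator matrix: 5 individuals, two variables with 2 categories each
Z = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
u, sigma, v = leading_singular_triplet(Z)
print(sigma)
```

Because each iteration costs two matrix-vector products, the work is proportional to the number of nonzero entries when the matrix is sparse, and a previous solution can serve as the starting vector when the analysis is repeated on slightly changed data.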

References

GREENACRE, M.J. (2007): Correspondence Analysis in Practice, 2nd edition. Chapman & Hall/CRC.
JOLLIFFE, I.T. (2002): Principal Component Analysis, 2nd edition. Springer.

Keywords

CORRESPONDENCE ANALYSIS, EIGENVALUE DECOMPOSITION, POWER METHOD

Università di Cassino e del Lazio Meridionale, Cassino, [email protected] · Erasmus University of Rotterdam, Rotterdam, [email protected], [email protected]


Inverse Multiple Correspondence Analysis

Michel van de Velden1, Patrick Groenen2, and Wilco van den Heuvel3

Abstract

The inverse correspondence analysis (CA) problem can be described as follows. Given a k-dimensional CA solution, find the set of (nonnegative) data matrices that yields this k-dimensional CA solution. Groenen and van de Velden (2004) showed that the set of permissible data matrices is characterized by a set of vertices corresponding to a set of inequalities. Furthermore, it was shown that any convex combination of the obtained vertices produces a candidate data matrix that, when CA is applied to it, contains in its solution the original k-dimensional CA solution. An algorithm was proposed to obtain all vertices, as well as a heuristic to quickly obtain a (sub)set of vertices. A popular extension of CA concerns multiple correspondence analysis (MCA). In MCA, the data matrix is a so-called indicator or super-indicator matrix: a concatenation of dummy variables where for each individual the observed category is indicated by a one in the column corresponding to that category, and zeros in the other columns. The rows of such a matrix correspond to individuals, and the columns to categories. Such an indicator matrix is typically much larger than a CA contingency matrix. Consequently, the exact inverse CA approach of Groenen and van de Velden (2004) cannot be used to find the set of vertices rendering the MCA solution. The heuristic can perhaps be applied; however, it is not clear whether the set of vertices thus obtained is complete. Using specific properties of MCA, we explore new ways of obtaining a meaningful set of vertices in the context of MCA.
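The structure of the indicator matrix described above can be illustrated with a short sketch. The function name and the toy data are hypothetical; the point is only to show the row and column layout that makes such matrices large and highly structured.

```python
import numpy as np

def indicator_matrix(data):
    """Build the (super-)indicator matrix for categorical data.

    data: list of rows, one per individual, each a list of category
    labels (one label per variable). Returns the 0/1 matrix and the
    (variable index, category) label of each column.
    """
    n_vars = len(data[0])
    blocks, labels = [], []
    for j in range(n_vars):
        cats = sorted({row[j] for row in data})
        # One dummy column per category of variable j
        block = np.array([[1 if row[j] == c else 0 for c in cats] for row in data])
        blocks.append(block)
        labels += [(j, c) for c in cats]
    return np.hstack(blocks), labels

# Hypothetical data: 3 individuals, 2 categorical variables
data = [["a", "x"], ["b", "y"], ["a", "y"]]
Z, labels = indicator_matrix(data)
print(labels)
print(Z)
```

Every row contains exactly one 1 per variable, so all row sums equal the number of variables. Constraints of this kind are examples of the specific properties of MCA that distinguish the problem from the general inverse CA setting.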

References

GROENEN, P.J.F. and VAN DE VELDEN, M. (2004): Inverse Correspondence Analysis. Linear Algebra and its Applications, 388, 221–238.

Keywords

CORRESPONDENCE ANALYSIS, MULTIPLE CORRESPONDENCE ANALYSIS, INVERSE PROBLEMS

Erasmus University [email protected] · Erasmus University Rotterdam [email protected] · Erasmus University [email protected]


Tracking Association Structures in Categorical Data Flows

Alfonso Iodice D’Enza1 and Angelos Markos2

Abstract

In modern applications, such as in signal processing and social network analysis, data are produced at a high rate and the association structures change over time. Multiple Correspondence Analysis (MCA) is a well-established dimension reduction method aiming to explore the underlying structure of categorical data sets (Greenacre, 2007). A critical step of the MCA algorithm is the singular value decomposition (SVD) or eigenvalue decomposition (EVD) of a suitably transformed matrix. The high computational and memory requirements of ordinary SVD and EVD make their application impractical on massive or sequential data sets. Several enhanced SVD/EVD approaches have recently been introduced in an effort to overcome these issues. The aim of the present contribution is to extend MCA to allow for incremental updates (downdates) of existing MCA solutions, which lead to an approximate yet highly accurate solution. For this purpose, two incremental EVD and SVD approaches with desirable properties (Hall et al., 2002; Ross et al., 2008) are revised and embedded in the context of MCA. The proposed method is evaluated in terms of discrepancy from a classic MCA solution and applied to a real dataset.
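The idea of updating an existing decomposition instead of recomputing it can be sketched as follows. This is a simplified, exact row-append update of a thin SVD, written for illustration under the assumption that the number of new rows does not exceed the number of columns; it omits the mean updating, truncation and forgetting mechanisms of the cited methods, and the function name is hypothetical.

```python
import numpy as np

def svd_append_rows(U, s, Vt, B):
    """Update the thin SVD A = U @ diag(s) @ Vt after appending rows B to A."""
    k, c = s.size, B.shape[0]
    L = B @ Vt.T                    # coordinates of the new rows in the current row space
    H = B - L @ Vt                  # part of the new rows outside that space
    Q, R = np.linalg.qr(H.T)        # H.T = Q @ R, with Q of size n x c and R of size c x c
    # Small (k + c) x (k + c) core matrix whose SVD gives the update
    K = np.block([[np.diag(s), np.zeros((k, c))],
                  [L,          R.T]])
    Uk, sk, Vkt = np.linalg.svd(K)
    U_ext = np.block([[U, np.zeros((U.shape[0], c))],
                      [np.zeros((c, k)), np.eye(c)]])
    V_ext = np.hstack([Vt.T, Q])
    return U_ext @ Uk, sk, (V_ext @ Vkt.T).T

# Hypothetical data: an initial block of 5 rows and a new chunk of 2 rows
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
B = rng.standard_normal((2, 8))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U2, s2, Vt2 = svd_append_rows(U, s, Vt, B)
```

Only the small core matrix is decomposed at each update. In a streaming setting one would typically keep just the leading components after each step, which is where the approximation mentioned in the abstract enters.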

References

GREENACRE, M.J. (2007): Correspondence Analysis in Practice, 2nd edition, Chapman & Hall/CRC.
HALL, P., MARSHALL, D. and MARTIN, R. (2002): Adding and subtracting eigenspaces with eigenvalue decomposition and singular value decomposition. Image and Vision Computing, 20, 1009–1016.
ROSS, D., LIM, J., LIN, R.S. and YANG, M.H. (2008): Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, 77, 125–141.

Keywords

MULTIPLE CORRESPONDENCE ANALYSIS, SINGULAR VALUE DECOMPOSITION, INCREMENTAL METHODS

Università di Cassino e del Lazio Meridionale, [email protected] · Democritus University of Thrace, [email protected]


Determining the Number of Clusters: a Problem of Definition or Estimation?

Giovanna Menardi1

Abstract

The problem of determining the optimal number of clusters in a set of data has been addressed from several perspectives, ranging from the naive “elbow” rule of thumb to its formalized version, the gap statistic, or to more refined criteria such as those based on evaluating the stability of a partition. Nonetheless, the question is far from having an unambiguous answer. Wondering what configuration is optimal becomes a wild-goose chase if a true (albeit unknown) population structure, representing the ideal partition that clustering methods should try to approximate, is not specified. Therefore, before addressing the problem of determining the right number of clusters, we need to define properly what a cluster is.

A precise statistical notion, not shared by most clustering methods, is provided by the density-based approach, which assumes that clusters are associated with some specific characteristic of the probability distribution underlying the data. Parametric methods usually associate clusters with homogeneous distributions which are combined in a mixture model, while methods following a nonparametric approach draw a correspondence between the groups and the modes of the density underlying the data. An appealing implication of the density-based formulation is that the number of clusters is conceptually well defined. Moreover, it follows that the ill-specified tasks of cluster detection and evaluation of cluster quality can be regarded as more circumscribed problems of estimation and goodness of fit.
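Under the nonparametric view, the number of clusters becomes the number of modes of an estimated density. The one-dimensional sketch below counts the modes of a Gaussian kernel density estimate on a grid. The function name, the fixed bandwidth and the toy samples are assumptions of this illustration, and the result clearly depends on the bandwidth, which is one of the operational difficulties the abstract alludes to.

```python
import numpy as np

def count_modes(x, bandwidth, grid_size=512):
    """Count local maxima of a Gaussian kernel density estimate of 1-D data."""
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min() - 3 * bandwidth, x.max() + 3 * bandwidth, grid_size)
    # Kernel density estimate on the grid (up to a constant factor,
    # which does not affect the location or number of modes)
    dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2).sum(axis=1)
    slope = np.sign(np.diff(dens))
    slope = slope[slope != 0]          # ignore exactly flat steps
    # A mode is a change from increasing to decreasing density
    return int(np.sum((slope[:-1] > 0) & (slope[1:] < 0)))

# Hypothetical samples: two well-separated groups, and one compact group
two_groups = [-5.2, -5.0, -4.8, 4.8, 5.0, 5.2]
one_group = [-1.0, 0.0, 0.0, 0.0, 1.0]
print(count_modes(two_groups, bandwidth=0.5))
print(count_modes(one_group, bandwidth=1.0))
```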

However, the density-based framework is far from being an easy answer to the clustering problem. In this work the approach is critically reviewed from both a conceptual and an operational point of view, focusing in particular on the nonparametric perspective. Some connections with alternative formulations of the problem are highlighted, as well as the main challenges and directions for further research.

Keywords

CLUSTER, DENSITY ESTIMATION, MIXTURE MODELS, MODE SEEKING

Department of Statistical Sciences, University of Padua, via C. Battisti 241, [email protected]


Enhancing The Selection Of A Number Of Clusters In Model-Based Clustering With External Qualitative Variables

J.-P. Baudry, M. Cardoso, G. Celeux, M.J. Amorim, and A.S. Ferreira

Abstract

Usual criteria for selecting a number of clusters in model-based clustering, such as BIC, can sometimes lead to an unclear, uncertain choice. We propose a criterion that takes into account a classification of the data which is known a priori and may shed new light on the data, help to drive the selection of a number of clusters and make it clearer, without involving it in the design of the clustering itself. The variables used to build the clustering and the (qualitative) variables used as an a priori clustering have to be chosen carefully and with respect to the modeling purpose. This is an illustration of how the modeling purpose is directly involved in what a “good” number of clusters should be and to what extent the latter should be thought of as dependent on the context.


Choosing the Number of Clusters after, before, and while Clustering

B. Mirkin

Abstract

Methods for determining the number of clusters in data can be categorized into the following three types: (1) post-processing; (2) pre-processing; and (3) at-processing methods.

The “post-clustering” type methods are most popular: a number of partitions are generated, after which a procedure is run to determine those most suitable; the latter would usually be based on either: (i) a “tightness” criterion, or (ii) a “stability” criterion, or, more recently, (iii) a consensus approach. In a series of experiments with synthetic data of Gaussian clusters with varying spread, intermix and “elongation”, Chiang and Mirkin (2010) came up with a winner among seven or eight methods: the “rule of thumb” by Hartigan (1975), involving a relative change in the value of the square error clustering criterion. Unfortunately, the rule has been much less successful in cluster recovery.
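Hartigan's rule is usually stated in terms of the within-cluster sum of squares W_K of a K-means solution: compute H(K) = (W_K / W_{K+1} - 1)(N - K - 1) and take the first K at which H(K) drops to 10 or below. The sketch below implements this commonly cited form; the function name and the example values are hypothetical, and the threshold of 10 is the conventional default rather than something specified in the abstract.

```python
def hartigan_number_of_clusters(wss, n, threshold=10.0):
    """Pick K by Hartigan's rule of thumb.

    wss[k - 1] is the within-cluster sum of squares of the K-means
    solution with k clusters (k = 1, 2, ...); n is the number of objects.
    Returns the smallest K whose H(K) is at or below the threshold,
    or the largest K examined if no such K exists.
    """
    for k in range(1, len(wss)):
        h = (wss[k - 1] / wss[k] - 1.0) * (n - k - 1)
        if h <= threshold:
            return k
    return len(wss)

# Hypothetical square-error values for K = 1..5 on N = 100 objects
wss = [1000.0, 400.0, 100.0, 95.0, 92.0]
print(hartigan_number_of_clusters(wss, n=100))
```

In this example the criterion stops improving substantially after three clusters, so the rule returns 3.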

The “pre-processing” type methods explore the structure of data by putting hypothetical centers of clusters relatively far from each other in the data set, after which a clustering procedure is applied. Of a number of heuristics, the most promising, in the author’s view, is the “anomalous pattern” method by Mirkin (1987, 2005), which positions the hypothetical center points relative to a “reference” point in such a way that they are central to some dense parts of the data. This method involves a granularity parameter, a threshold on the cluster cardinality used to decide whether the center should be discarded. When the threshold is 1 (discarding singletons only), this leads to overestimating the number of clusters on synthetic data (Chiang, Mirkin, 2010), yet on real-world data of moderate sizes it works well.

For the “at-processing” approach, divisive clustering is probably the only option in which further clustering can be stopped at any step (of sequential divisions). Kovaleva and Mirkin (2013) show that the rule by Tasoulis, Tasoulis and Plagianakos (2010) is superior to many other popular options. The rule involves projecting all the data in a cluster onto the first principal component, building a Parzen-type density function and testing whether it has any minima at all. If not, the cluster is not split any further. The rule is by far superior to the popular statistical test of a single Gaussian against a mixture of Gaussians. It survives the introduction of noise in data, except for the insertion of random objects. In the latter case, the rule should be applied to random projections of clusters as well (Kovaleva, Mirkin 2013).
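The stopping rule just described can be sketched in a few lines: project the cluster onto its first principal direction, estimate the density of the projections with a Gaussian (Parzen) kernel, and check for an interior minimum. The function name, the fixed bandwidth and the toy clusters below are assumptions of this illustration; the bandwidth selection and other details of the cited rule are not reproduced here.

```python
import numpy as np

def should_split(X, bandwidth, grid_size=512):
    """Return True if the Parzen density of the projections of X onto its
    first principal direction has at least one interior minimum."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # First principal direction = first right singular vector of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    z = Xc @ Vt[0]
    grid = np.linspace(z.min() - 3 * bandwidth, z.max() + 3 * bandwidth, grid_size)
    dens = np.exp(-0.5 * ((grid[:, None] - z[None, :]) / bandwidth) ** 2).sum(axis=1)
    slope = np.sign(np.diff(dens))
    slope = slope[slope != 0]          # ignore exactly flat steps
    # A minimum is a change from decreasing to increasing density
    return bool(np.any((slope[:-1] < 0) & (slope[1:] > 0)))

# Hypothetical clusters: one made of two separated blobs, one compact
two_blobs = [[-5.0, 0.0], [-5.2, 0.1], [-4.8, -0.1],
             [5.0, 0.0], [5.2, 0.1], [4.8, -0.1]]
one_blob = [[-1.0, 0.0], [0.0, 0.1], [0.0, 0.0], [0.0, -0.1], [1.0, 0.0]]
print(should_split(two_blobs, bandwidth=0.5))
print(should_split(one_blob, bandwidth=1.0))
```

In a divisive procedure this check would be applied to each current cluster, and only clusters for which it returns True would be divided further.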

Keywords

CLUSTERING, NUMBER OF CLUSTERS, HARTIGAN’S RULE

NRU Higher School of Economics, Moscow, RF and Birkbeck University of London, UK [email protected]


Competitions in Machine Learning: the Fun, the Art, and the Science

Isabelle Guyon1

Abstract

Challenges have recently proved a great stimulus for research in machine learning, pattern recognition, and robotics. Robotics contests seem to be particularly popular, the most visible ones probably being the DARPA grand challenges of autonomous ground vehicle navigation and RoboCup, featuring several challenges for robots including playing soccer or rescuing people. The European network of excellence PASCAL has actively sponsored a number of challenges around hot themes in machine learning, held in conjunction with workshops at major international conferences, including KDD, ICML, and NIPS. These contests are oriented towards scientific research and the main reward for the winners is to disseminate the product of their research and obtain recognition. In that respect, they play a different role than challenges like the Netflix prize, which offer large monetary rewards for solving a task of value to industry (movie referral in that particular case), but are narrower in scope. Attracting hundreds of participants and the attention of a broad audience of specialists as well as sometimes the general public, these events have been important in several respects: (1) pushing the state of the art, (2) identifying techniques which really work, (3) attracting new researchers, (4) raising the standards of research, (5) giving the opportunity to non-established researchers to make themselves rapidly known. Since 2003, we have been organizing challenges in machine learning. We addressed problems of both fundamental and practical interest in machine learning, data mining or statistics, illustrated with data from various domains. For instance,

• in 2003 we organized a challenge on feature selection and in 2009 on sample selection (active learning),

• in 2006, 2007 and 2010, we organized a series of challenges on model construction and selection, including agnostic methods and methods using prior knowledge or knowledge transfer,

• between 2008 and 2013, we organized three challenges on causality.

Our challenge platforms, which remain open for post-challenge submissions, are constantly in use by students and have been used in practical work in our own classes and those of other professors throughout the world. We take great care to give the participants opportunities to publish in reputable conference proceedings or journals like JMLR. We think of challenges as a means of carrying out research in machine learning by focusing the mental energy of brilliant researchers around the world. But what makes a good challenge that provides conclusive results having an important impact? This presentation will review the main findings of our past challenges and look upon them with a critical eye to identify strengths and weaknesses and new directions.

ChaLearn, Berkeley, California


Playing with Data–or How to Discourage Incorrect Data Analysis

Klaas Sijtsma1

Abstract

Recent fraud cases in psychological and medical research have emphasized the need to pay attention to Questionable Research Practices (QRPs). Deliberate or not, QRPs usually have a deteriorating effect on the quality and the credibility of research results. QRPs must be revealed but prevention of QRPs is more important than detection. I suggest two policy measures that I expect to be effective in improving the quality of psychological research. First, the research data and the research materials should be made publicly available so as to allow verification. Second, researchers should more readily consider consulting a methodologist or a statistician. These two measures are simple but run against the common practice of keeping data to oneself and overestimating one’s methodological and statistical skills, thus allowing secrecy and errors to enter research practice.

Tilburg School of Social and Behavioral Sciences, Tilburg University


A Study on Small-Area Geographical Analysis of Residential Characteristics after the Great Hanshin-Awaji Earthquake by Two Individual Differences Models

Mitsuhiro Tsuji, Hiroshi Kageyama1 and Toshio Shimokawa2

Abstract

We discuss several approaches to realize geographical small-area statistics by using multidimensional scaling (the INDSCAL model) and clustering (the INDCLUS model), which assume that the objects (geographical areas) are embedded in a continuous or discrete space common to all data, including individual differences obtained by weighting each dimension.
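The phrase "individual differences obtained by weighting each dimension" refers to the weighted Euclidean distance of the INDSCAL model, in which all data sources share one configuration of objects and each source stretches or shrinks its dimensions by its own nonnegative weights. The sketch below computes such distances; the function name, the coordinates and the weights are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def indscal_distances(X, w):
    """Weighted Euclidean distances d_ij = sqrt(sum_r w_r * (x_ir - x_jr)^2).

    X: common configuration (objects x dimensions);
    w: nonnegative dimension weights of one data source.
    """
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((w * diff ** 2).sum(axis=-1))

# Hypothetical common space for two areas, seen through two sets of weights
X = np.array([[0.0, 0.0], [3.0, 4.0]])
d_equal = indscal_distances(X, np.array([1.0, 1.0]))   # ordinary Euclidean distance
d_first = indscal_distances(X, np.array([4.0, 0.0]))   # only dimension 1 matters
print(d_equal[0, 1], d_first[0, 1])
```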

We apply some effective geographical approaches using the two methods to perform some structural analysis for some residential characteristics (damage, population changes and so on) after the Great Hanshin-Awaji Earthquake.

The scaling and clustering of geographical space consider: 1) the characteristics of the feature space (continuous); 2) the spatial nature of the objects to be clustered geometrically (discrete); 3) the latent structure between earthquake damages and residential characteristics.

Keywords

SMALL-AREA STATISTICS, GREAT HANSHIN-AWAJI EARTHQUAKE, INDSCAL MODEL, INDCLUS MODEL

Kansai University, Takatsuki, Osaka, [email protected] · University of Yamanashi, Kofu, Yamanashi, [email protected]


Author Identification of Japanese Classical Literature by Quantitative Analysis

Gen Tsuchiyama1 and Masakatsu Murakami2

Abstract

Singular authorship of The Tale of Genji, the most famous and greatest accomplishment in Japanese classical literature of the Heian period (between 794 and 1185), is contentious. While literary scholars have long debated the authorship of this work, the issue has been largely ignored by Japanese statisticians. Therefore, in this study, we statistically analyze whether the author of the last ten chapters of The Tale of Genji, collectively titled Uji Jugo, also wrote the previous chapters.

In quantitative analyses of texts composed in the Japanese language, when the frequency of function words in certain documents substantially differs from that in other documents, the difference is generally attributed to varying author style. Thus, our analysis is based on word frequency.

Word frequency throughout The Tale of Genji was analyzed by principal component analysis and random forests. No obvious difference in word usage was observed between Uji Jugo and the other chapters. Thus, we conclude that The Tale of Genji was likely composed by a single author.

References

BREIMAN, L. (2001): Random Forests. Machine Learning, 45, 5–32.
JIN, M. and MURAKAMI, M. (2007): Authorship Identification Using Random Forests. Proceedings of the Institute of Statistical Mathematics, 55, 255–268.

Keywords

QUANTITATIVE THEORY OF VOCABULARY, PRINCIPAL COMPONENT ANALYSIS, RANDOM FORESTS, JAPANESE LITERATURE

Graduate School of Culture and Information Science, Doshisha University, Kyoto, [email protected] · Faculty of Culture and Information Science, Doshisha University, Kyoto, [email protected]


A Latent Class Approach for Estimating Labour Market Mobility in the Presence of Multiple Indicators and Retrospective Interrogation

Francesca Bassi1, Marcel Croon2, and Davide Vidotto1

Abstract

Measurement errors can induce bias in the estimation of transitions, leading to erroneous conclusions about labour market dynamics. A large body of literature on gross flows estimation is based on the assumption that errors are uncorrelated over time. This assumption is not realistic in many contexts, because of survey design and data collection strategies. We use a model-based approach to adjusting observed gross flows for classification errors, which may be correlated. A convenient framework is provided by latent class Markov models (Biemer and Bushery, 2000). We refer to data collected with the Italian Continuous Labour Force Survey, which is cross-sectional, quarterly, with a 2-2-2 rotating design. The questionnaire provides multiple indicators of labour force condition for each quarter: two collected in the same interview and a third one collected after one year. Our approach provides a means to estimate labour market mobility taking into account correlated errors and the rotating design of the survey. Specifically, the best-fitting model is a mover-stayer latent class Markov model with covariates affecting latent transitions and correlated errors among indicators. A secondary result of our research is that the mover-stayer model and the latent class Markov model estimate the same amount of measurement error in the data. The better fit of the mixture specification is entirely due to more accurately estimated latent transitions. This evidence contradicts results in previous literature (see, for example, Magidson et al., 2007).
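The first sentence of the abstract can be illustrated numerically. Under the simplest assumption of classification errors that are independent across two waves (the assumption the authors relax), the observed transition matrix follows from the true initial distribution, the true transition matrix and the misclassification matrix. All numbers and names in this sketch are hypothetical.

```python
import numpy as np

def observed_transitions(pi, T, E):
    """Transition matrix among observed states under independent errors.

    pi: true initial state probabilities; T: true transition matrix;
    E[a, i] = P(observed state i | true state a), same at both waves.
    """
    joint_true = pi[:, None] * T        # P(true a at wave 1, true b at wave 2)
    joint_obs = E.T @ joint_true @ E    # P(observed i at wave 1, observed j at wave 2)
    return joint_obs / joint_obs.sum(axis=1, keepdims=True)

# Hypothetical two-state example: employed / not employed
pi = np.array([0.6, 0.4])
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
E = np.array([[0.97, 0.03],
              [0.05, 0.95]])
T_obs = observed_transitions(pi, T, E)
print(T_obs)
```

Even these small error rates inflate both off-diagonal entries, that is, they create spurious mobility. This is the bias that latent class Markov models aim to remove.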

References

BIEMER, P.P. and BUSHERY, J.M. (2000): On the validity of Markov latent class analysis for estimating classification errors in labour force data. Survey Methodology, 26, 139-152.
MAGIDSON, J., VERMUNT, J.K. and TRAN, B. (2007): Using a mixture of latent Markov model to analyze longitudinal U.S. employment data involving measurement error. In: K. Shigemasu, A. Okada, T. Imaizumi and T. Hoshino (Eds.): New trends in Psychometrics. Universal Academy Press, 235-242.

Keywords

MIXTURE LATENT CLASS MODEL, GROSS FLOWS, CORRELATED ERRORS

Department of Statistical Sciences, University of [email protected] · Methodology Department, University of Tilburg, NL


On Finite Mixtures of Skew Distributions

Geoff McLachlan and Sharon Lee

Abstract

Non-normal mixture distributions have received increasing attention in recent years. Finite mixtures of multivariate skew symmetric distributions, in particular, the skew normal and skew t-mixture models, are emerging as a promising extension to the traditional normal and t-mixture modelling. Most of these parametric families of skew symmetric distributions are closely related. In this talk, we give a brief overview of various existing proposals for multivariate skew distributions. We consider a classification of them into four forms, namely, the restricted, unrestricted, extended, and generalised forms, based on their characterizations. We compare the relative performance of restricted and unrestricted skew mixture models in clustering and density estimation on four real datasets. We also compare their performance with mixtures having other non-normal component distributions.

Geoff McLachlan · Sharon Lee, University of Queensland


Classification via Mixtures of Shifted Asymmetric Laplace and Mixtures of Generalized Hyperbolic Distributions

Paul D. McNicholas1, Ryan P. Browne1, and Brian C. Franczak1

Abstract

The recent burgeoning of non-Gaussian approaches to model-based classification includes work on the multivariate t-distribution, the skew-normal distribution, and the skew-t distribution, as well as other approaches. We add to the richness of the palette of non-Gaussian mixture model-based approaches to classification by introducing a mixture of shifted asymmetric Laplace distributions and a mixture of generalized hyperbolic distributions. The mathematical development of each mixture model relies on its relationship with the generalized inverse Gaussian distribution. Parameter estimation is outlined within the expectation-maximization framework before the performance of our mixture models is illustrated on simulated and real data. We conclude with a discussion of the anticipated impact of these models and details of some ongoing work.
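A useful way to picture the shifted asymmetric Laplace (SAL) distribution is through its normal variance-mean mixture representation, X = mu + W * alpha + sqrt(W) * N, with W standard exponential and N multivariate normal with mean zero and covariance Sigma. The sampler below uses this standard representation (following Kotz et al., 2001) as an assumption; the function name and parameter values are hypothetical and the sketch covers a single component only, not the mixture model or its EM estimation.

```python
import numpy as np

def sample_sal(n, mu, alpha, Sigma, seed=0):
    """Draw n points from a shifted asymmetric Laplace distribution via
    X = mu + W * alpha + sqrt(W) * N, W ~ Exp(1), N ~ Normal(0, Sigma)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    w = rng.exponential(1.0, size=n)                              # latent mixing variable
    z = rng.multivariate_normal(np.zeros(mu.size), Sigma, size=n)
    return mu + w[:, None] * alpha + np.sqrt(w)[:, None] * z

# Hypothetical parameters: shift at the origin, skewness direction (2, -1)
x = sample_sal(200_000, mu=[0.0, 0.0], alpha=[2.0, -1.0], Sigma=np.eye(2))
print(x.mean(axis=0))   # close to mu + alpha, since E[W] = 1
```

The latent variable W plays the role of missing data in an expectation-maximization scheme, and its conditional distribution given X is what connects the model to the generalized inverse Gaussian distribution mentioned in the abstract.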

References

BARNDORFF-NIELSEN, O. (1978): Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics, 5, 151–157.
FRANCZAK, B.C., BROWNE, R.P. and McNICHOLAS, P.D. (2012): Mixtures of shifted asymmetric Laplace distributions. arXiv preprint arXiv:1207.1727v3.
KOTZ, S., KOZUBOWSKI, T.J. and PODGORSKI, K. (2001): The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Birkhauser, Boston.
JØRGENSEN, B. (1982): Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer-Verlag, New York.

Keywords

ASYMMETRIC LAPLACE, GENERALIZED HYPERBOLIC, GENERALIZED INVERSE GAUSSIAN, MIXTURE MODELS

Department of Mathematics and Statistics, University of Guelph, Ontario, N1G 2W1, Canada. {pmcnicho,rbrowne,bfrancza}@uoguelph.ca


Gaussian And Distance Based Clustering In High-Dimensional Space: Differences And Common Aspects

Francesco Palumbo1, Cristina Tortora2, and Paul McNicholas2

Abstract

Non-hierarchical cluster analysis aims at identifying the optimal partition of a multivariate data set into k groups. The most recent contributions in the field focus on the probabilistic (Celeux and Govaert, 1995) and distance-based (Ben-Israel and Iyigun, 2008) mixture model approaches, which ensure good performance under a wide range of hypotheses. The former assumes that clusters are derived from the same probability function (generally Gaussian) with different parameters for each group; the latter is distribution free, and units are assigned to the groups according to a distance function. This talk aims at presenting and discussing two extensions of the above-mentioned approaches for cases in which the high dimensionality of the space requires feature reduction. In particular, we focus on the advantages and drawbacks of the following clustering approaches that integrate clustering and dimensionality reduction: Mixture of Factor Analyzers, Mixture of Parsimonious Gaussian Mixture models (McNicholas and Murphy, 2008), Mixture of High-Dimensional mixture models, Discriminative latent mixture models (Bouveyron and Brunet-Saumard, 2012) and Factor PD-clustering (Tortora et al., 2011). Overall performances are compared using simulated and real data.
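For the distance-based side, the working principle of probabilistic d-clustering (Ben-Israel and Iyigun, 2008) is that, for each unit, the product of its membership probability and its distance to a cluster center is the same for all clusters, so memberships are proportional to inverse distances. The sketch below computes memberships for fixed centers; the function name and the toy data are hypothetical, and the iterative center updates and the factorial step of Factor PD-clustering are not shown.

```python
import numpy as np

def pd_memberships(X, centers):
    """Membership probabilities with p_k(x) * d_k(x) constant over clusters k."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)          # guard against a unit sitting exactly on a center
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True)

# Hypothetical unit at distance 1 from the first center and 3 from the second
X = np.array([[1.0, 0.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
p = pd_memberships(X, centers)
print(p)
```

No distributional assumption is involved: the memberships depend on the data only through the distances, which is what "distribution free" means in the abstract.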

References

Ben-Israel, A. and Iyigun, C. (2008): Probabilistic d-clustering. Journal of Classification, 25(1):5–26.
Bouveyron, C. and Brunet-Saumard, C. (2012): Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis.
Celeux, G. and Govaert, G. (1995): Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793.
McNicholas, P.D. and Murphy, D. (2008): Parsimonious Gaussian Mixture models. Statistics and Computing, 18(3):285–296.
Tortora, C., Gettler Summa, M., and Palumbo, F. (2011). Factor PD-clustering. Proceedings of the Joint Conference of the German Classification Society.

Keywords

MODEL BASED CLUSTERING, DISTANCE BASED CLUSTERING, SIMULATION STUDY

Università di Napoli Federico II, [email protected] · University of Guelph, [email protected], [email protected]


Clustering and Dimension Reduction using Non-Gaussian Mixtures

Katherine Morris and Paul McNicholas

Abstract

We introduce a dimension reduction method for model-based clustering using non-Gaussian distributions, specifically t, shifted asymmetric Laplace, and generalized hyperbolic distributions. The approach is analogous to existing work within the Gaussian paradigm. By employing sliced inverse regression, the method relies on identifying a reduced subspace of the data by considering the extent to which group means and group covariances vary. This subspace contains linear combinations of the original data, which are ordered by importance via the associated eigenvalues. Observations can be projected onto the subspace and the resulting set of variables captures most of the clustering structure available in the data. Our clustering approaches are illustrated on simulated and real data, and compared to each other as well as their Gaussian counterpart.

References

ANDREWS, J. L. and MCNICHOLAS, P. D. (2012): Model-based Clustering, Classification, and Discriminant Analysis via Mixtures of Multivariate t-distributions: The tEIGEN Family. Statistics and Computing, 22(5), 1021–1029.
FRANCZAK, B., BROWNE, R. P. and MCNICHOLAS, P. D. (2012): Mixtures of Shifted Asymmetric Laplace Distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 263–286.
LI, K. C. (1991): Sliced Inverse Regression for Dimension Reduction (with discussion). Journal of the American Statistical Association, 86, 316–342.
SCRUCCA, L. (2010): Dimension Reduction for Model-based Clustering. Statistics and Computing, 20(4), 471–484.

Keywords

DIMENSION REDUCTION, MIXTURE MODELS, MODEL-BASED CLUSTERING

Department of Mathematics & Statistics, University of Guelph, Ontario, Canada {kmorri09, pmcnicho}@uoguelph.ca


Comparison of Spatial Clusters between Suicide Data and Its Increase-decrease Rates in Japan

Makoto Tomita1, Takafumi Kubota2, Fumio Ishioka3 and Toshiharu Fujita2

Abstract

Our data are the numbers of suicides in Japan between 1973 and 2007, divided into 6 periods of 5 years each (only the 1st period has 10 years) and aggregated from municipality units into 348 secondary medical care zones. These data were formed as part of medical planning. There are several approaches to detecting hotspots in different kinds of spatial data. A spatial scan statistical method for finding hotspot areas based on a likelihood ratio has been a very common and useful method. However, this method tends to detect hotspots much larger than the true hotspot. Therefore it does not always detect hotspots with high relative risk. Echelon analysis is a useful technique for systematically and objectively investigating the phase-structure of spatial lattice data. In this paper, we have studied space-time clusters as well as space clusters of each increase-decrease rate from one period to the next to evaluate these data using Echelon analysis.
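The likelihood ratio behind the spatial scan statistic can be written down compactly for the Poisson model. The sketch below uses the widely cited form of Kulldorff's log-likelihood ratio for a single candidate zone, assuming expected counts are scaled so that they sum to the total number of cases; the function name and the counts are hypothetical, and the search over candidate zones (for example along an echelon hierarchy) and the Monte Carlo significance test are not shown.

```python
import math

def poisson_scan_llr(c, e, C):
    """Log-likelihood ratio of a candidate zone under the Poisson scan model.

    c: observed cases inside the zone; e: expected cases inside the zone
    under the null hypothesis; C: total observed cases in the study region.
    Zones without an excess of cases get a value of 0.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = (C - c) * math.log((C - c) / (C - e)) if C > c else 0.0
    return inside + outside

# Hypothetical zone: 30 observed versus 10 expected cases out of 100 in total
print(poisson_scan_llr(30, 10, 100))
```

The zone with the largest value is the most likely cluster. Because the statistic rewards large excess counts, a window that absorbs low-risk neighbours of a true hotspot can still score highly, which is the tendency toward overly large hotspots noted above.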

Acknowledgement

This was a part of funded research from National Institute of Mental Health, National Center of Neurology and Psychiatry and was partially supported by KAKENHI 24500337, KAKENHI 21700317 and KAKENHI 21700305.

References

Ishioka, F. and Kurihara, K. (2012): Hotspot Detection Using Scan Method Based on Echelon Analysis. Proceedings of the Institute of Statistical Mathematics, 60(1):93–108.

Keywords

SPATIAL DATA, SUICIDE DATA, SPACE CLUSTERS, SPACE-TIME CLUSTERS

Tokyo Medical and Dental University, Tokyo, [email protected] · The Institute of Statistical [email protected] · School of Law, Okayama [email protected]


Detection of Spatial Clusters for High and Low Suicidal Risk Areas in Japan

Takafumi Kubota1, Makoto Tomita2, Fumio Ishioka3, Tomokazu Fujino4 and HiroeTsubaki5

Abstract

This study detected spatial clusters with both high and low suicidal risks. Small areal data in Kanto district of Japan from "Statistics of Community for the Death from Suicide" were used to calculate the standardized mortality ratio (SMR) of suicide and non-suicide. Spatial scan statistics were then applied to both SMRs to scan for candidate areas with statistically high values. Finally, the detected high and low suicide areas were compared with the previous study of Kubota et al. (2011) to discuss the risks of suicide in these areas and to present interpretations of them.
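For readers unfamiliar with the SMR, the sketch below shows the usual indirect standardization: the observed number of deaths in an area is divided by the number expected if reference rates applied to the area's own population structure. The function name, age strata and numbers are hypothetical, and the exact standardization used in the study is not stated in the abstract.

```python
def smr(observed, population_by_stratum, reference_rates):
    """Standardized mortality ratio = observed deaths / expected deaths,
    where expected deaths apply reference rates to the local population."""
    expected = sum(p * r for p, r in zip(population_by_stratum, reference_rates))
    return observed / expected

# Hypothetical area with two age strata
ratio = smr(observed=12,
            population_by_stratum=[1000, 2000],
            reference_rates=[0.002, 0.004])
print(ratio)   # values above 1 indicate more deaths than expected
```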

Acknowledgement

This is a part of funded research from National Institute of Mental Health, National Center of Neurology and Psychiatry and is also partially supported by KAKENHI 21700305, KAKENHI 24500337 and KAKENHI 23500358.

References

Fujita, T. (2009): Statistics of Community for the Death from Suicide. National Institute of Mental Health, National Center of Neurology and Psychiatry, Japan.
Kubota, T., Tomita, M., Ishioka, F. and Fujita, T. (2011): Spatial Autocorrelation Statistics and Spatial Clustering in the Areas in Japan with Low Suicide Rates. Joint Meeting of 7th Conference of the Asian Regional Section of the IASC and 2011 Taipei International Statistical Symposium, 99-100.

Keywords

SPATIAL CLUSTERING, SUICIDE DATA, SMALL AREA DATA

The Institute of Statistical [email protected] · Tokyo Medical and Dental University · Okayama University · Fukuoka Women’s University · The Institute of Statistical Mathematics

Patterns of Cultural Practices and Characteristics of the Cultural Omnivore

Miki Nakai

Abstract

In this paper, we examine how styles of cultural consumption are classified and characterized. Sociological work on cultural stratification has theorized that participation in highbrow culture is a feature of the social elite (Bourdieu 1979). Other researchers, however, have argued that an omnivorous taste pattern appears in numerous countries (Peterson and Simkus 1992), and this hypothesis has become prevalent. Yet the characteristics of omnivorousness, and how omnivore and univore types of cultural clusters are associated with socioeconomic status, have received less consideration in Japan (Nakai 2011). Using data from a national sample in Japan in 2005 (N=2915), we examine the patterns and determinants of cultural consumption. Our latent class analysis (Vermunt 1997) seems to reveal a small number of notable groups in terms of cultural practices. These include the omnivore class as well as the inactive class. The resultant cultural clusters seem consistent with the omnivore-univore hypothesis.

References

BOURDIEU, P. (1979): La Distinction: Critique Sociale du Jugement. Paris: Minuit.
NAKAI, M. (2011): Social Stratification and Consumption Patterns: Cultural Practices and lifestyles in Japan. In: S. Ingrassia, R. Rocci, and M. Vichi (Eds.): New Perspectives in Statistical Modeling and Data Analysis. Springer, Berlin, 211–218.
PETERSON, R. A. and SIMKUS, A. (1992): How Musical Taste Groups Mark Occupational Status Groups. In: M. Lamont and M. Fournier (Eds.): Cultivating Differences. Chicago, IL: Univ. of Chicago Press.
VERMUNT, J.K. (1997): LEM: A General Program for the Analysis of Categorical Data. Department of Methodology and Statistics, Tilburg University.

Keywords

CULTURAL PRACTICES, OMNIVORE, SOCIAL STRATIFICATION

Department of Social Sciences, College of Social Sciences, Ritsumeikan University, 56-1 Toji-in Kitamachi, Kyoto 603-8577 [email protected]

The Structure Of Subjective Social Status In Japan: An Approach Based On Latent Class Model

Yusuke Kanazawa1

Abstract

Previous studies have examined people’s subjective social status using two different approaches. The first is the social psychological approach, which explores the relationship between subjective social status and other kinds of social consciousness (e.g. Nakao 2002). The second is the social class approach, which explains subjective social status in terms of people’s objective social status (Hodge and Treiman 1968; Hout 2008). This study integrates these two approaches by using latent class models.

First, I analyzed the relationship between subjective social status and other kinds of social consciousness, such as life satisfaction, satisfaction with one’s own socio-economic status (SES) and change in living standard, by latent class analysis (McCutcheon 1987), using a nationally representative dataset (the 2010 Stratification and Social Psychology Interview Survey). As a result, I extracted four latent classes: (a) a subjective middle class (32.7%), who identify themselves as “middle” in society and are satisfied with their life but think their life has been unchanged in recent years; (b) a subjective upper class (28.3%), who identify themselves as “upper” in society, are satisfied with their life and own SES, and think their life has changed for the better; (c) a subjective lower class (21.4%), who identify themselves as “lower” in society, are dissatisfied with their life and own SES, and think their life has changed for the worse; and (d) a neutral response group (17.6%), who answer “middle” to the question on subjective social status and choose the neutral response (i.e. the center of the response categories) in other questions.

Next, I analyzed the relationship between the four classes and respondents’ objective social status by multinomial logit latent-class regression analysis (Yamaguchi 2000). The results were as follows. (A) Compared to the subjective middle class, the subjective upper class attains higher levels of education and income. (B) Compared to the subjective middle class, the subjective lower class attains lower levels of education, income and occupational prestige. (C) There is no difference in objective social status between the neutral response group and the subjective middle class. However, the neutral response group shows lower levels of cooperation toward the survey than the subjective middle class.

References

Hodge, R. W. and Treiman, D. J. (1968): Class Identification in the United States. American Journal of Sociology, 73: 535-47.
Hout, M. (2008): How Class Works: Objective and Subjective Aspects of Class Since the 1970s. In: A. Lareau and D. Conley (Eds.): Social Class: How Does It Work? Russell Sage Foundation, New York, 25-89.
McCutcheon, A. L. (1987): Latent Class Analysis. Sage, Thousand Oaks.
Nakao, K. (2002): Status Identification and Perception about Standard of Living. Sociological Theory and Methods, 17, 135-149. [in Japanese]

Center for Statistics and Information, Rikkyo [email protected]

Yamaguchi, K. (2000): Multinomial Logit Latent-Class Regression Models: An Analysis of the Predictors of Gender Role Attitude Among Japanese Women. American Journal of Sociology, 105: 1702-40.

Keywords

SUBJECTIVE SOCIAL STATUS, SOCIAL SURVEY, LATENT CLASS ANALYSIS, MULTINOMIAL LOGIT LATENT-CLASS REGRESSION MODEL

Reference Set Selection for Multivariate Statistical Process Monitoring with Biplots

RF Rossouw1, RLJ Coetzer1, and NJ Le Roux2

Abstract

The fundamental approach of almost all multivariate process monitoring procedures is to first specify a historical reference set that is within statistical control. However, the current literature focuses on monitoring many process variables simultaneously for a single process. The selection of a reference set that is within statistical control, or conforms to some specified accepted performance measure(s), for multiple production processes simultaneously has received very little attention. The selection of the optimal reference set for a monitoring biplot for multiple processes has, to our knowledge, not been discussed previously. Therefore, in this paper we present a methodology for selecting a reference set for multivariate process monitoring of many process variables using the biplot (Gower et al., 2011), which allows for efficient monitoring of multiple production processes. It will be demonstrated how a combination of Generalized Orthogonal Procrustes Analysis (Gower and Dijksterhuis, 2004) and biplot methodology (Arnold et al., 2007) can be used to find both the optimal production process and the optimal period for the reference set.

References

Arnold, G. M., Gower, J. C., Gardner-Lubbe, S., and le Roux, N. J. (2007). Biplots of free-choice profile data in generalized orthogonal Procrustes analysis. Applied Statistics, 56, 445-458.
Gower, J. C. and Dijksterhuis, G. B. (2004). Procrustes Problems. Oxford, UK: Oxford University Press.
Gower, J.C., Lubbe, S. and Le Roux, N.J. (2011). Understanding Biplots. Chichester, UK: John Wiley & Sons.

Keywords

BIPLOTS, PROCESS MONITORING, PROCRUSTES ANALYSIS

Sasol Technology Research and Development, Sasol, Private Bag 1, Sasolburg, 1947, South Africa [email protected], [email protected] · Department of Statistics and Actuarial Science, Stellenbosch University, Private Bag X1, Matieland, 7602, South [email protected]

PLS Biplot: Another Graphical Tool for Multivariate Data

Opeoluwa V.F. Oyedele1 and Sugnet Lubbe2

Abstract

In multivariate analysis, data matrices are often very large, which makes it difficult to describe their structure and to visually inspect the relationship between their rows (samples) and columns (variables). For this reason, biplots, the joint graphical display of the rows and columns of a data matrix, can be a useful tool for analysis. Biplots have been employed as a graphical display of data in a number of multivariate methods, such as Correspondence Analysis, Principal Component Analysis, Canonical Variate Analysis, and Discriminant Analysis.

Another popular multivariate method is Partial Least Squares (PLS). Introduced by Wold (1966) as a regression method, PLS is more flexible than multivariate regression and better suited than Principal Component Regression for predicting a set of response variables from a large set of predictors. Different iterative algorithms have been proposed for estimating the PLS regression coefficients. The most popular are NIPALS (Nonlinear Iterative PArtial Least Squares), Kernel and SIMPLS (Statistically Inspired Modification of Partial Least Squares).
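
To make the estimation of PLS regression coefficients concrete, the following is a minimal sketch of the first NIPALS-style component for a single response (often called PLS1), where no inner iteration is needed. It is written for illustration only: the function name and the toy data are hypothetical, only one component is extracted, and it is not the implementation used by the authors.

```python
import math


def pls1_first_component(X, y):
    """One-component PLS1 fit; returns coefficients for the centered data."""
    n, p = len(X), len(X[0])
    # Center predictors and response.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: normalized X'y, the direction of maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("predictors are uncorrelated with the response")
    w = [v / norm for v in w]
    # Scores t = Xc w, then regress the response on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    # With a single component the coefficient vector reduces to w * q.
    return [wj * q for wj in w]


# Hypothetical toy data: the response depends only on the first predictor.
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2, -2, 0, 0]
coefficients = pls1_first_component(X, y)
```

Further components would be obtained by deflating X and y with the scores and repeating the same steps; that part is omitted to keep the sketch short.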

In this paper the biplot is employed in the form of the PLS biplot, a new addition to the biplot family. Sharing the advantages of biplots, the PLS biplot shows in graphical form the association between samples and/or variables, and provides a single graphical representation of the results of a PLS regression analysis. Two forms of the PLS biplot are discussed. The first follows the typical Gower and Hand (1996) biplot style with calibrated biplot axes. The second uses the area biplot introduced by Gower, Groenen and Van de Velden (2010) to ease representation of the matrix of PLS regression coefficients.

References

GOWER, J.C., GROENEN, P.J.F. and VAN DE VELDEN, M. (2010): Area Biplots. Journal of Computational and Graphical Statistics, 19, 46–61.
GOWER, J.C. and HAND, D.J. (1996): Biplots. Monographs on Statistics and Applied Probability. Chapman & Hall, London.
WOLD, H. (1966): Estimation of Principal Components and Related Models by Iterative Least Squares. In P.R. Krishnaiah (Ed.): Multivariate Analysis. Academic Press, New York, 391–420.

Keywords

AREA BIPLOT, BIPLOT, PARTIAL LEAST SQUARES REGRESSION

University of Cape Town, Cape Town, South [email protected] · University of Cape Town, Cape Town, South [email protected]

Variable Selection for Regression and PLS using Genetic Algorithms and Particle Swarm Optimization: A Comparison between the Two Methods

Martin Philip Kidd1 and Martin Kidd2

Abstract

Genetic Algorithms (GA) and Particle Swarm Optimization (PSO) (Moraglio et al. 2008) have previously been shown to be successful for variable selection in a regression setting (Talbi et al. 2008). In this presentation we share some of our experiences in applying these techniques to simulated and actual data for multiple regression and Partial Least Squares (PLS). For PLS, the choice of the optimal number of components was implemented as part of the optimization algorithm, and for both methods the choice of the optimal number of variables was also implemented as part of the optimization.

A further adaptation was made to the optimization algorithm, called hybrid GA (PSO). Each member of the population (outer algorithm) is used as input to another GA (PSO) algorithm (inner algorithm). The outer algorithm focuses on diversification while the inner algorithm focuses on intensification.

For multiple regression and PLS, simulated data sets were constructed with only a small number of significant predictors from a "large" pool of more than 500 predictor variables. The time taken for the algorithms to find the significant variables was recorded. Results will be shown for various selections of population (swarm) sizes and other tuning parameters.

In similar fashion, comparisons of the algorithms on real data will also be reported.

References

Moraglio, A., Di Chio, C., Togelius, J. and Poli, R. (2008): Geometric particle swarm optimization. Journal of Artificial Evolution and Applications, Vol 2008, doi:10.1155/2008/143624.
Talbi, E-G., Jourdan, L., Garcia-Nieto, J. and Alba, E. (2008): Comparison of population based metaheuristics for feature selection: Application to microarray data classification. 2008 IEEE/ACS International Conference on Computer Systems and Applications, Vols 1-3. Book Series: International Conference on Computer Systems and Applications, Pages: 45-52.

Keywords

GENETIC ALGORITHMS, PARTICLE SWARM OPTIMIZATION, PLS, REGRESSION, VARIABLE SELECTION

Operations Research Group, Dipartimento di Elettronica, Informatica e Sistemistica (DEIS), Universita degli Studi di Bologna, Bologna, [email protected] · Centre for Statistical Consultation (CSC), Stellenbosch University, Stellenbosch, South [email protected]

Classification with Hyperspheres

Morné Lamont

Abstract

The classification of observations plays a very important role in many applied research areas. The best-known classification (discriminant) technique was proposed by Fisher (1936) and is called Fisher’s linear discriminant analysis. Many traditional statistical techniques, such as Fisher’s linear discriminant analysis, have been kernelized (Mika et al., 1999). Other kernelized methods include kernel principal component analysis, kernel ridge regression and kernel clustering (Cristianini and Shawe-Taylor, 2004). Kernel-based multivariate techniques have gained popularity in statistics over the past few decades. The best-known kernel-based technique is probably the support vector machine (Boser et al., 1992), which is known for its state-of-the-art performance. In this paper, another kernel-based technique, the smallest enclosing hypersphere, is reviewed. This technique was used by Tax and Duin (1999) to develop an outlier detector. We use the smallest enclosing hypersphere for statistical classification (called nearest hypersphere classification, or NHC) and explain how NHC is performed. NHC is compared to other popular statistical classification methods in a simulation study and on two real-world datasets. The properties and advantages of NHC are also highlighted. NHC is a non-parametric approach to classification and offers more advantages and flexibility than the traditional classification methods.
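
The idea of classifying with one enclosing hypersphere per class can be sketched as follows. This is a simplified illustration under several assumptions, not the method of the paper: it works in the input space (no kernel), it approximates each smallest enclosing ball with the Badoiu-Clarkson iteration instead of the quadratic program of Tax and Duin (1999), and the decision rule (distance to the center divided by the radius) as well as all names and data are hypothetical.

```python
import math


def enclosing_ball(points, iterations=200):
    """Approximate the smallest ball enclosing `points` (Badoiu-Clarkson)."""
    center = list(points[0])
    for k in range(1, iterations + 1):
        farthest = max(points, key=lambda p: math.dist(p, center))
        # Move the center a shrinking step toward the farthest point.
        center = [c + (f - c) / (k + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius


def fit_nhc(samples_by_class):
    """Fit one enclosing ball per class label."""
    return {label: enclosing_ball(pts) for label, pts in samples_by_class.items()}


def predict_nhc(balls, x):
    """Assign x to the class with the smallest radius-scaled center distance.

    Assumption: this scaling rule is illustrative only.
    """
    def scaled(label):
        center, radius = balls[label]
        d = math.dist(x, center)
        return d / radius if radius > 0 else d
    return min(balls, key=scaled)


# Hypothetical training data: two well-separated classes in the plane.
train = {
    "a": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "b": [(5, 5), (6, 5), (5, 6), (6, 6)],
}
balls = fit_nhc(train)
label = predict_nhc(balls, (0.4, 0.6))
```

A kernelized version would replace the Euclidean distances by distances in the feature space induced by the kernel function.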

References

BOSER, B.E., GUYON, I.M. and VAPNIK, V.N. (1992): A training algorithm for optimal margin classifiers. In: D. Haussler (Ed.): Proceedings of the 5th annual ACM workshop on Computational Learning Theory, 144–152.
CRISTIANINI, N. and SHAWE-TAYLOR, J. (2004): Kernel Methods for Pattern Analysis. Cambridge University Press, New York.
FISHER, R.A. (1936): The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
MIKA, S., RÄTSCH, G., WESTON, J., SCHÖLKOPF, B. and MÜLLER, K.-R. (1999): Fisher discriminant analysis with kernels. In: Y.-H. Hu, J. Larsen, E. Wilson and S. Douglas (Eds.): Neural Networks for Signal Processing IX. IEEE, 41–48.
TAX, D.M.J. and DUIN, R.P.W. (1999): Support vector domain description. Pattern Recognition Letters, 20, 11–13.

Keywords

DISCRIMINANT ANALYSIS, HYPERSPHERE, KERNEL FUNCTION, SUPPORT VECTORS

Department of Statistics and Actuarial Science, Stellenbosch University, Private BagX1, 7602, South Africa,[email protected]

Separation And Convexity Properties Of Hierarchical And Non Hierarchical Clustering

Patrice Bertrand1 and Jean Diatta2

Abstract

Weak hierarchies and paired hierarchies both extend the well-known hierarchical clustering structure. Weak hierarchies are collections of clusters such that the intersection of any three clusters is the intersection of some two of them. They play a central role in the study of theoretical properties of arbitrary cluster structures. Paired hierarchies are a type of weak hierarchy, and they are represented by planar graphs which are very similar to dendrograms. As in a hierarchy, each cluster of a paired hierarchy is displayed as an interval of some linear ordering of the data set, the only difference being the possible existence of cluster overlaps, at most one for each cluster. The purpose of this presentation is to characterize these cluster structures, namely hierarchies, weak hierarchies and paired hierarchies, both in terms of a ternary separation relation and in terms of an abstract convexity that depends on the type of cluster structure considered.
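
The defining conditions can be checked directly on a small collection of clusters. The sketch below tests the usual hierarchy condition (any two clusters are disjoint or nested) and the weak hierarchy condition stated above; the function names and the example collections are hypothetical, and the paired hierarchy condition is not covered.

```python
from itertools import combinations


def is_hierarchy(clusters):
    """True if any two clusters are disjoint or nested."""
    return all(
        not (a & b) or a <= b or b <= a
        for a, b in combinations(clusters, 2)
    )


def is_weak_hierarchy(clusters):
    """True if every triple intersection equals some pairwise intersection."""
    return all(
        (a & b & c) in (a & b, a & c, b & c)
        for a, b, c in combinations(clusters, 3)
    )


# Hypothetical cluster collections on the elements 1, 2, 3.
nested = [frozenset(s) for s in ({1}, {2}, {1, 2}, {1, 2, 3})]
overlapping = [frozenset(s) for s in ({1, 2}, {2, 3}, {1, 2, 3})]
triangle = [frozenset(s) for s in ({1, 2}, {2, 3}, {1, 3})]
```

Here `nested` satisfies both conditions, `overlapping` is a weak hierarchy but not a hierarchy, and `triangle` is neither, because its three clusters have an empty common intersection while every pair overlaps.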

References

BANDELT, H.J. and DRESS, A.W.M. (1989): Weak hierarchies associated with similarity measures: an additive clustering technique. Bull. Math. Biology 51, 113–166.
BERTRAND, P. (2008): Set systems for which each set properly intersects at most one other set - Application to cluster analysis. Discrete Applied Mathematics 156(8), 1220–1236.
DIATTA, J. and FICHET, B. (1998): Quasi-ultrametrics and their 2-ball hypergraphs. Discrete Mathematics 192, 87–102.
POWERS, R.C. (2007): Hierarchies and ternary separation. Applied Mathematics Letters 20(3), 279–283.

Keywords

TERNARY SEPARATION, ABSTRACT CONVEXITY, HIERARCHY, WEAK HIERARCHY, PAIRED HIERARCHY

CEREMADE, Université Paris Dauphine, Paris, [email protected] · LIM-EA2525, Université de la Réunion, Saint-Denis, [email protected]

Latticial Approach for Perfect Phylogeny Problems

François Brucker and Pascal Préa

Abstract

We present a combinatorial model which generalizes phylogenetic trees. This model links together a graph model (strongly chordal graphs), a lattice model (crown-free lattices) and a clustering model (chordal quasi-ultrametrics). This structure makes it possible to model phylogenetic networks and to associate attributes with a phylogenetic tree.

In classification, this kind of approximation yields a global visualization of the clusters and their relationships through dedicated 2-dimensional or 3-dimensional representations. It can be seen as a compromise between hierarchies (simple structure; easy to interpret) and general lattices (rich interactions between elements; hard to interpret).

References

BRUCKER, F. and GÉLY, A. Crown-free Lattices and Their Related Graphs. Order, 28:443–454, 2010.
FARBER, M. Characterizations of strong chordal graphs. Discrete Mathematics, 43:173–189, 1983.
KELLY, D. and RIVAL, I. Crowns, fences, and dismantlable lattices. Canadian Journal of Mathematics, 26:1257–1271, 1974.
SPINRAD, J. P. Efficient Graph Representations. American Mathematical Society, Providence, Rhode Island, 2003.

Keywords

PERFECT PHYLOGENY, CROWN-FREE LATTICES, DISSIMILARITY, STRONGLY CHORDAL GRAPHS

Laboratoire LIF, UMR 7279, École Centrale Marseille, 38 rue Joliot-Curie - F-13451 Marseille [email protected]; [email protected]

Some Aspects of Formal Concept Analysis in Hierarchical Classification and Data Analysis

Mehdi Kaytoue1, Sergei O. Kuznetsov2, and Amedeo Napoli3

Abstract

In Formal Concept Analysis (FCA [1]), the formalization of a classification problem relies on a formal context K = (G, M, I), where G is a set of objects, M a set of attributes and I ⊆ G × M a binary relation describing links between objects and attributes. A formal concept then corresponds to a maximal set of objects (the extent) associated with a maximal set of attributes (the intent). Formal concepts are ordered within a complete lattice by a subsumption relation based on extent inclusion. The standard FCA formalism can be extended to deal with complex data such as numbers, intervals, strings, and even graphs, within the so-called pattern structures [3]. In addition, a similarity between objects based on the closeness of attribute values can be considered and formalized as a tolerance relation, i.e. a reflexive and symmetric relation [2].
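
For a small binary context, the extent and intent operators and the resulting formal concepts can be sketched as follows. This is a naive illustration that closes every subset of objects, so it is only usable on tiny contexts; the context itself and the function names are hypothetical, and pattern structures and tolerance relations are not covered.

```python
from itertools import combinations


def intent(context, objects):
    """Attributes shared by all given objects (all attributes if none given)."""
    all_attributes = set().union(*context.values())
    return frozenset(
        m for m in all_attributes if all(m in context[g] for g in objects)
    )


def extent(context, attributes):
    """Objects that have every given attribute."""
    return frozenset(g for g, ms in context.items() if attributes <= ms)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects; meant for tiny contexts only.
    """
    concepts = set()
    objects = list(context)
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            b = intent(context, subset)
            concepts.add((extent(context, b), b))
    return concepts


# Hypothetical context: objects g1..g3 described by attributes a, b, c.
context = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"b"}}
concepts = formal_concepts(context)
```

Ordering the resulting pairs by inclusion of their extents gives the concept lattice of this toy context.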

In our presentation, we would like to emphasize the links existing between FCA variations and (hierarchical) clustering methods in data analysis. The framework of FCA offers many possibilities w.r.t. classification and data analysis, e.g. a powerful and diverse algorithmic machinery of FCA for dealing with large and complex data. Moreover, the joint use of pattern structures and similarities materializes a convergence between symbolic classification (e.g. FCA) and numerical classification methods.

References

1. B. Ganter and R. Wille. Formal Concept Analysis. Springer, 1999.
2. M. Kaytoue, Z. Assaghir, A. Napoli, and S.O. Kuznetsov. Embedding Tolerance Relations in Formal Concept Analysis – An Application in Information Fusion. In Proceedings of CIKM, pages 1689–1692. ACM, 2010.
3. M. Kaytoue, S.O. Kuznetsov, and A. Napoli. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In Proceedings of IJCAI, pages 1342–1347, 2011.

Keywords

FORMAL CONCEPT ANALYSIS, CLASSIFICATION, PATTERN STRUCTURES, SIMILARITY

LIRIS/INSA Lyon [email protected] · HSE [email protected] · LORIA (CNRS – Inria Nancy – U. de Lorraine)[email protected]

Which Movie Shall I Watch? Ultrametric Based Recommendation System

Pedro Contreras1, Fionn Murtagh1, and Javier Pereira2

Abstract

In previous work we have shown how an ultrametric (Murtagh et al., 2008; Pereira et al., 2010; Contreras and Murtagh, 2012) can be used to create hierarchical clusters in constant algorithmic time. In particular, we use the Baire metric, or longest common prefix, to construct our classification trees. When a technique to reduce the data dimensionality was needed, we opted to project the data randomly onto one dimension (Murtagh et al., 2008).

Our aim in this work is to show how the Baire metric can be used to classify, match and retrieve categorical data. We demonstrate this by creating a movie recommendation system based on the Baire metric and using the MovieLens dataset (http://www.grouplens.org/node/73).
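
A longest-common-prefix distance of this kind can be sketched in a few lines. The sketch assumes that items are encoded as equal-length strings and that the distance is the base raised to minus the common prefix length; the base of 2, the function names and the example codes are assumptions made for illustration and are not taken from the abstract.

```python
from collections import defaultdict


def common_prefix_length(a, b):
    """Number of leading characters shared by strings a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def baire_distance(a, b, base=2.0):
    """Baire-style distance: base ** -(longest common prefix length)."""
    if a == b:
        return 0.0
    return base ** -common_prefix_length(a, b)


def prefix_clusters(codes, depth):
    """Group codes by their first `depth` characters in a single pass."""
    clusters = defaultdict(list)
    for code in codes:
        clusters[code[:depth]].append(code)
    return dict(clusters)


# Hypothetical digit strings standing in for encoded movie profiles.
codes = ["4213", "4217", "4250", "1305"]
clusters = prefix_clusters(codes, depth=2)
```

Each prefix depth yields one level of a hierarchy, and every level is built in a single pass over the data, which is what makes prefix-based clustering cheap compared with pairwise agglomerative methods.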

References

CONTRERAS, P. and MURTAGH, F. (2012): Fast, Linear Time Hierarchical Clustering Using the Baire Metric. In: Journal of Classification, 29(2):118–143.
MURTAGH, F., DOWNS, G. and CONTRERAS, P. (2008): Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding. In: SIAM Journal on Scientific Computing, 30(2):707–730.
PEREIRA, J., SCHMIDT, F., CONTRERAS, P., MURTAGH, F. and ASTUDILLO, H. (2010): Clustering and Semantics Preservation in Cultural Heritage Information Spaces. In: RIAO’2010, 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information, 100–105. Paris, France.

Keywords

ULTRAMETRIC, BAIRE METRIC, CLUSTERING, RECOMMENDATION SYSTEMS, INFORMATION RETRIEVAL

Royal Holloway, University of London. Egham Hill, Egham. England. TW20 [email protected], [email protected] · Universidad Diego Portales.Avenida Ejército 441. Santiago, [email protected]

Automatic Annotation and Classification of new Papillomavirus genomes

Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo

Abstract

Papillomaviruses (PVs) are a group of viruses harboring a circular dsDNA genome and causing cutaneous and mucosal epithelial lesions in several vertebrate species. Over the last two decades, improvements in cloning and sequencing technologies have permitted the massive sequencing of new complete PV genomes (about 8 kb, with a conserved structure and a complex set of protein-coding genes [1]). The annotation of these genomes and their classification within known PV types [2] (genotyping) constitute an important asset for knowledge discovery in the mechanisms of disease diagnosis as well as the whole PV classification. However, public PV genotyping and annotation tools still lack accuracy, being derived only from sequence similarity searches. Here, we propose a method that exploits both statistical and similarity-based methods to automatically annotate and genotype genomes. Our approach is composed of two main modules. The annotation module can detect protein-coding regions based on conservation patterns among aligned genomes and accurately identify complex features such as overlapping genes and ribosomal frameshifts. The PV genotyping module is derived from a supervised machine learning approach relying on a decision tree learned from multiple features, such as the genome annotation data from the first module, statistical evidence (nucleotide frequencies at each codon position, GC content, etc.), physical and chemical characteristics of proteins, and other genome features like restriction fragment length polymorphism.

References

1. Zheng, Z. M. and Baker, C. C. (2006): Papillomavirus genome structure, expression, and post-transcriptional regulation. Front. Biosci., 11, 2286−2302.
2. Bernard, H.U., Burk, R.D., Chen, Z., van Doorslaer, K., Hausen, H.Z., and de Villiers, E.M. (2010): Classification of papillomaviruses (PVs) based on 189 PV types and proposal of taxonomic amendments. Virology 401, 70−79.

Keywords

BIOINFORMATICS, PAPILLOMAVIRUS, ANNOTATION, CLASSIFICATION, KNOWLEDGE DISCOVERY, MACHINE LEARNING

Department of Computer Science, Université du Québec à Montréal, P.O. Box 8888 Downtown Station, Montreal, Quebec, H3C 3P8, [email protected]

Different Approaches To Modeling Family Data In GWAS: Application To Cannabis Use

Camelia C. Minica1, Conor V. Dolan1,2, Jouke-Jan Hottenga1, Dorret I. Boomsma1 and Jacqueline M. Vink1

Abstract

Power in genome-wide association studies (GWAS) of complex traits has gained much importance lately, given that the causal genes are commonly assumed to have small effects, requiring large samples for detection. Despite their potential to increase power, the cohorts followed longitudinally in twin registries remain largely unexploited, as the incorporation of many genetic variants in complex models that explicitly account for kinship among individuals poses computational challenges. Hence, one strategy is to limit the association analysis to unrelated individuals. Given the availability of family data, it is of interest to determine which analytic strategy is most efficient in the context of GWAS, where power and computational tractability are both important. We compared the performance of three approaches: (a) analysis limited to unrelated individuals, versus analysis of family data (b) by using a robust estimator (Huber, 1967), or (c) by employing a mixed-effects approach (Guo and Wang, 2002). We evaluated these approaches by considering features of samples typically collected in twin registries: a large number of clusters that are small (i.e., clusters of sibs, or of sibs and monozygotic and dizygotic twins, with or without parents), vary in size and may include a wide range of phenotypic correlations (from .1 to .7). In addition, we expect the individuals within a cluster to display sex, age and generation effects. The performance of the three approaches was assessed first in simulated data. Next, the most efficient strategy was applied in a GWAS using genotypes and lifetime cannabis use data collected in 2619 families with up to 4 siblings from the Adult Netherlands Twin Register.

References

GUO, G. and WANG, J. (2002): The mixed or multilevel model for behavior genetics analysis. Behavior Genetics, 32, 37–49.
HUBER, P.J. (1967): The behaviour of maximum likelihood estimates under non-standard conditions. The 5th Berkeley Symp on Math Stat and Prob, I, 221–233.

Keywords

POWER, ROBUST ESTIMATOR, MIXED MODEL

Vrije Universiteit Amsterdam, Department of Biological Psychology, Van der Boechorststraat 1, 1081 [email protected] · Universiteit van Amsterdam, Department of Psychology, Weesperplein 4, 1018 XA

Utilization Of Machine-Learning Methodologies In Order To Understand Complex Evolutionary And Functional Links Among Bacterial Genomes

Olivier Poiron1 and Benedicte Lafay2

Abstract

We are searching for evolutionary trends among genome maintenance-related genes present on the replicon sets (i.e., chromosomes and plasmids) of bacterial genomes. Traditional bioinformatic and phylogenetic methods are not suited to large-scale, high-dimensional studies. We thus developed a semi-supervised analytical pipeline relying on data-mining methodologies. Generic unsupervised (SOM, K-means, SUBCLU, Bayesian networks) and supervised (SVM, decision trees) classification methods were combined with specific bioinformatic algorithms based on sequence homology search (BLAST). Through this approach, important evolutionary processes could be characterized among genome-integrated plasmids and chromosomes. We here report on the inherent difficulties (input data bias, high-dimensional analysis, noise) and the applied methodology, and conclude on the significance of the data-mining methodology in knowledge discovery.

Keywords

COMPARATIVE GENOMICS, HOMOLOGY SEARCH, CLASSIFICATION, ANALYTICAL PIPELINE

Laboratoire AMPERE Ecole Centrale de Lyon, [email protected] · Laboratoire AMPERE Ecole Centrale de Lyon, [email protected]

Application of a Bayesian Artificial Neural Network to the Breast Cancer Survival Data

Masoud Salehi1 and Mahmood Reza Gohari2

Abstract

Artificial neural networks (ANNs) were first developed in the 1940s to imitate the function of the brain. They have become more popular as prediction tools in various fields over the last two decades because of the development of new techniques and increases in computational power. ANNs are mathematical models that contain a number of processing units called nodes, which perform limited and simple computations. Moreover, ANNs are considered nonparametric, distribution-free models, which can be used for prediction and treated as linear or nonlinear regression models. Among the different types of ANN, which are distinguished by structure and type of operation, multi-layer perceptrons (MLPs) are the most popular and widely used. A Bayesian framework for training ANNs and selecting their complexity, based on Markov chain Monte Carlo (MCMC) techniques, has the benefit of ensuring that uncertainty in the ANN is reflected in the posterior information. Real breast cancer data were used to illustrate the application of Bayesian artificial neural networks.

References

Bishop, C.M. (2006): Pattern Recognition and Machine Learning. Springer, New York.
MacKay, D.J.C. (1995): Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469-505.
McCulloch, W.S. and Pitts, W. (1943): A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.

Keywords

BAYESIAN ARTIFICIAL NEURAL NETWORKS, MCMC, BREAST CANCER

Tehran University of Medical Sciences, [email protected] · Tehran University of Medical Sciences, [email protected]

Achieving Near-perfect Classification for Functional Data

Peter Hall (and Aurore Delaigle)1

Abstract

It can be shown that, in supervised classification problems involving functional data, asymptotically perfect classification is possible, making use of the intrinsic very high dimensional nature of functional data. This performance is often achieved by linear methods, which are optimal in important cases. The results point to a marked difference between classification for functional data and its counterpart in conventional multivariate analysis, where dimension is kept fixed as sample size diverges. In the latter setting, linear methods can sometimes be quite inefficient, and there are no prospects for asymptotically perfect classification, except in pathological cases where, for example, a variance vanishes. By way of contrast, in finite samples of functional data, good performance can be achieved by truncated versions of linear methods. Truncation can be implemented by partial least-squares or projection onto a finite number of principal components, using, in both cases, cross-validation to determine the truncation point.

Department of Mathematics and Statistics, The University of [email protected]
