Conference of the International Federation of Classification Societies
IFCS-2013
Program and Book of Abstracts
July 14-17, 2013
Tilburg, the Netherlands
General information
Greetings!
Welcome to Tilburg and the 2013 conference of the International Federation of Classification Societies (IFCS), which is held at Tilburg University, The Netherlands, from July 14 to July 17, 2013. The conference theme is ‘United through Ordination and Classification’.
On July 14, preconference workshops will be held. The conference itself will start on the morning of July 15 and will close on July 17 with a full-day conference program and a conference dinner. The conference includes a president’s invited session and a presidential address, plenary invited sessions, and concurrent invited and contributed sessions with oral paper presentations.
The opening session will feature the presentation of the Chikio Hayashi Awards to Anne-Laure Boulesteix and Paul McNicholas, the 2013 winners of this prize for young researchers with promising track records in the areas of classification and data analysis, which is awarded in support of their professional careers. The members of the 2013 Awards Committee are Michel Wedel (Chair), Edwin Diday, Sylvia Frühwirth-Schnatter, and James Ramsay.
IFCS Registration Desk
The IFCS Registration Desk is located in the hall in front of the University auditorium (aula), which can be found in the Cobbenhagen building of Tilburg University. Signs are provided to help you find the registration desk.
Preconference Workshops Registration Hours
Sunday, July 14, 9:30 – 18:30
IFCS Registration Hours
Monday, July 15, 7:30 – 17:00
Tuesday, July 16, 7:30 – 17:00
Wednesday, July 17, 7:30 – 10:30
Badges
Participation in the IFCS conference is limited to registered attendees. The official conference badge is required for admission to all sessions.
Lunches and Coffee Breaks
Free lunches and coffee or tea during coffee breaks are included in the registration fee. Note that the lunches and coffee or tea are only free at the designated conference area in front of the auditorium. Meals and drinks consumed elsewhere are at attendees' own expense.
Social Events and Conference Dinner
Participation in one of the two IFCS social events is limited to attendees who have registered for either a visit to the National Park Loon and Drunen Dunes or a visit to La Trappe: Beer Brewery at the Koningshoeven Monastery. Buses will be available in the Hogeschoollaan to bring participants to the social event and back to Tilburg University afterwards.
Participation in the conference dinner is limited to attendees who have registered for it. The conference dinner will be held at restaurant De Harmonie, Stationsstraat 26, 5038 ED Tilburg, tel: +31(0)13-5425843. The restaurant is easy to reach from Tilburg Central Railway Station: walk about 75 metres into the street (Stationsstraat) directly opposite the front entrance of the station. The restaurant is on the left side of this street.
Messages
A message board will be maintained in the registration area during registration hours.
Local Organizing Committee and Scientific Program Committee
Members of the Local Organizing Committee:
Andries van der Ark (Chair), Tilburg University
John Gelissen, Tilburg University
Jeroen Vermunt, Tilburg University
Marieke Timmermans (secretary), Tilburg University
Tom Wilderjans, KU Leuven
Katrijn van Deun, KU Leuven
Members of the Scientific Program Committee:
Jeroen Vermunt (Chair), Tilburg University, The Netherlands
Carlos Cuevas-Covarrubias, Anahuac University, Mexico
Rozenn Dahyot, Trinity College Dublin, Ireland
Anuška Ferligoj, University of Ljubljana, Slovenia
Christian Hennig, UCL, London, UK
Krzysztof Jajuga, Wrocław University, Poland
Tae Rim Lee, Korea National Open University, Korea
Friedrich Leisch, University of Natural Resources and Life Sciences, Vienna, Austria
Niel le Roux, University of Stellenbosch, South Africa
Geoff McLachlan, University of Queensland, Australia
Fred R. McMorris, Chicago University, USA
Angelos Markos, Democritus University of Thrace, Greece
Boris Mirkin, Higher School of Economics, Moscow, Russian Federation
Mohamed Nadif, René Descartes University, Paris, France
Rebecca Nugent, Carnegie Mellon University, USA
Akinori Okada, Tama University, Japan
Fernanda Sousa, University of Porto, Portugal
Iven Van Mechelen (President IFCS), KU Leuven, Belgium
Maurizio Vichi, University of Rome, Italy
Claus Weihs, TU Dortmund, Germany
Junjie Wu, Beihang University, China
Patrick Groenen, Erasmus University, The Netherlands
IFCS Member Societies:
Associação Portuguesa de Classificação e Análise de Dados (CLAD)
British Classification Society (BCS)
The Classification Society (CS)
Gesellschaft für Klassifikation (GfKl)
Greek Society of Data Analysis (GSDA)
Irish Pattern Recognition and Classification Society (IPRCS)
Japanese Classification Society (JCS)
Korean Classification Society (KCS)
Sekcja Klasyfikacji i Analizy Danych PTS (SKAD)
Sociedad Centroamericana y del Caribe de Clasificación y Análisis de Datos (SoCCCAD)
Società Italiana di Statistica (SIS-CLADAG)
Société Francophone de Classification (SFC)
Statistično društvo Slovenije (SdS)
Vereniging voor Ordinatie en Classificatie (VOC)
The 2013 IFCS conference is scientifically sponsored by the International Statistical Institute (ISI) and supported by the IFCS, the VOC, Tilburg University, and the Department of Methodology and Statistics of Tilburg University.
TITLE | FIRST_NAME | SURNAME | INSTITUTION | CITY | COUNTRY | EMAIL_ADDRESS
Dr | Hongshik | Ahn | SUNY Korea / SUNY Stony Brook | Incheon | KR | hahn@sunykorea.ac.kr
Dr | Casper | Albers | University of Groningen | Groningen | NL | c.j.albers@rug.nl
Dr | Laura | Anderlucci | University of Bologna | Bologna | | laura.anderlucci@unibo.it
Mr | Markos | Angelos | Democritus University of Thrace | Alexandroupoli | GR | amarkos@eled.duth.gr
Mr | Denis | Anuschewski | Heinrich Heine University Düsseldorf | Düsseldorf | DE | denis.anuschewski@hhu.de
Prof | Yasumasa | Baba | The Institute of Statistical Mathematics | Tachikawa, Tokyo | JP | baba@ism.ac.jp
Mrs | Zsuzsa | Bakk | Tilburg University | Tilburg | NL | z.bakk@uvt.nl
Dr | Beata | Bal-Domańska | Wroclaw University of Economics | Jelenia Gora | PL | beata.bal-domanska@ue.wroc.pl
Prof | David | Banks | Duke University | Durham, NC | US | banks@stat.duke.edu
Dr | Tomasz | Bartłomowicz | Wroclaw University of Economics | Jelenia Góra | PL | tomasz.bartlomowicz@ue.wroc.pl
Prof | Francesca | Bassi | University of Padua | Padova | IT | bassi@stat.unipd.it
Dr | Jean-Patrick | Baudry | Université Pierre et Marie Curie | Paris | FR | jean-patrick.baudry@upmc.fr
Mrs | Margot | Bennink | Tilburg University | Tilburg | NL | m.bennink@uvt.nl
Dr | Patrice | Bertrand | Universite Paris-Dauphine | Paris | FR | bertrand@ceremade.dauphine.fr
Prof | Tammo | Bijmolt | University of Groningen, Faculty of Economics & Business | Groningen | NL | t.h.a.bijmolt@rug.nl
Mr | Florian | Böing-Messing | Department of Methodology and Statistics, Tilburg University | Tilburg | NL | f.boeing-messing@uvt.nl
Prof | Anne-Laure | Boulesteix | Ludwig-Maximilians-University | Munich | DE | boulesteix@ibe.med.uni-muenchen.de
Dr | Johan | Braeken | Tilburg University | Tilburg | NL | j.braeken@uvt.nl
Dr | Maria del Carmen | Bravo | Universidad Complutense de Madrid | Madrid | ES | mcbravo@ucm.es
Prof | PAULA | BRITO | FEP & LIAAD INESC TEC; University of Porto | PORTO | PT | mpbrito@fep.up.pt
Prof | François | Brucker | Ecole Centrale Marseille | Marseille | FR | francois.brucker@centrale-marseille.fr
Mrs | Justyna | Brzezińska | University of Economics in Katowice | Katowice | PL | justyna.brzezinska@ue.katowice.pl
Mrs | Silvia | Caligaris | UNIVERSITY OF MILAN-BICOCCA | PAVIA | IT | silviacaligaris85@gmail.com
Dr | SILVIA | CALIGARIS | UNIVERSITY OF MILAN-BICOCCA | PAVIA | IT | silviacaligaris85@gmail.com
Dr | MASSIMO | CANNAS | UNIVERSITY OF CAGLIARI | CAGLIARI | IT | massimo.cannas@unica.it
Dr | Véronique | CARIOU | Nantes-Atlantic of Veterinary Medicine, Food Science and Engineering National College | Nantes | FR | veronique.cariou@oniris-nantes.fr
Dr | Drago | Carlo | University of Napoli | Rome | IT | c.drago@mclink.it
Prof | Andrea | Cerioli | University of Parma | Parma | IT | andrea.cerioli@unipr.it
Prof | Eva | Ceulemans | KU Leuven | Leuven | BE | Eva.Ceulemans@ppw.kuleuven.be
Mr | Holger | Cevallos Valdiviezo | Ghent University | Ghent | BE | holger.cevallosvaldiviezo@ugent.be
Mr | Pascal | Chave | Heinrich-Heine-Universität Düsseldorf | Düsseldorf | DE | pascal.chave@hhu.de
Mr | MARC | COMAS | UNIVERSITAT DE GIRONA | GIRONA | ES | mcomas@ima.udg.edu
Dr | Pedro | Contreras | University of London | Egham | GB | pedro@cs.rhul.ac.uk
Dr | Claudio | Conversano | University of Cagliari, Dipartimento di Scienze Economiche ed Aziendali | Cagliari | IT | conversa@unica.it
Prof | Franca | Crippa | Department of Psychology, University of Milano-Bicocca | Milan | IT | franca.crippa@unimib.it
Dr | Marc | Csernel | INRIA-Rocquencourt | Le Chesnay | FR | Marc.Csernel@inria.fr
Dr | Carlos | Cuevas Covarrubias | Universidad Anahuac | Mexico | MX | ccuevas@anahuac.mx
Mr | Bruno | Daigle | Université du Québec à Montréal | Montreal | CA | daigle.bruno@courrier.uqam.ca
Dr | Sanjeena | Dang | University of Guelph | Guelph | CA | ssubedi@uoguelph.ca
Mr | Utkarsh | Dang | University of Guelph | Guelph | CA | udang@uoguelph.ca
Mr | Rainer | Dangl | University of Natural Resources and Life Sciences Vienna | Vienna | AT | rainer.dangl@boku.ac.at
Dr | Antoine | de Falguerolles | Université de Toulouse III (retired) | Toulouse | FR | antoine@falguerolles.net
Mr | Johan | De Rooi | Erasmus MC | Rotterdam | NL | Secretariat.Biostatistics@erasmusmc.nl
Dr | Mark | De Rooij | Leiden University | Leiden, The Netherlands | | rooijm@fsw.leidenuniv.nl
Mrs | Kim | De Roover | KU Leuven | Leuven | BE | Kim.DeRoover@ppw.kuleuven.be
Dr | Nema | Dean | University of Glasgow | Glasgow | GB | nema.dean@gmail.com
Mr | Gudicha | Dereje W. | Tilburg University | Tilburg | | D.W.Gudicha@uvt.nl
Dr | Christian | Derquenne | Electricité de France - R&D | Clamart Cedex | FR | christian.derquenne@edf.fr
Prof | Abdoulaye Banire | Diallo | Université du Québec à Montréal | Montreal | CA | diallo.abdoulaye@uqam.ca
Prof | Jean | Diatta | Université de la Réunion | Sainte Clotilde | RE | jean.diatta@univ-reunion.fr
Dr | Florent | Domenach | University of Nicosia | Nicosia | CY | domenach.f@unic.ac.cy
Mrs | Lisa | Doove | KU Leuven, VAT-nr. BE 0419 052 173 | Leuven | BE | lisa.doove@ppw.kuleuven.be
Prof | A. Pedro | Duarte Silva | Catholic University of Portugal / CEGE | Porto | PT | psilva@porto.ucp.pt
Dr | Andrzej | Dudek | Wroclaw University of Economics | Jelenia Gora | PL | andrzej.dudek@ue.wroc.pl
Dr | Elise | Dusseldorp | TNO & Katholieke Universiteit Leuven | Leiden | NL | elise.dusseldorp@tno.nl
Dr | Sergey | Dvoenko | State University of Tula | Tula | RU | sergedv@yandex.ru
Mrs | Iris | Eekhout | VU University medical center | Amsterdam | NL | i.eekhout@vumc.nl
Prof | Paul | Eilers | Erasmus University Medical Center | Rotterdam | NL | p.eilers@erasmusmc.nl
Dr | Wilco | Emons | Tilburg University | Tilburg | AN | w.h.m.emons@tilburguniversity.edu
Mrs | Marije | Fagginger Auer | Leiden University | Leiden | NL | m.f.fagginger.auer@fsw.leidenuniv.nl
Mr | Lukasz | Feldman | Wrocław University of Economics | Wrocław | PL | lukasz.feldman@ue.wroc.pl
Mr | Bernard | Fichet | Aix-Marseille University | Marseille | FR | bernard.fichet@lif.univ-mrs.fr
Dr | Silvia | Figini | University of Pavia | Pavia | IT | silvia.figini@unipv.it
Mr | Kamil | Fijorek | Cracow University of Economics | Cracow | PL | kamil.fijorek@uek.krakow.pl
Mrs | Marjolein | Fokkema | Vrije Universiteit | Amsterdam | NL | m.fokkema@vu.nl
Mr | Luca | Frigau | Italia | Selargius | IT | frigau@unica.it
Mr | Hiroki | Furuzumi | University of Hyogo | Kobe | JP | furuzumi@econ.u-hyogo.ac.jp
Prof | Bernhard | Ganter | Technische Uni Dresden | Dresden | DE | bernhard.ganter@tu-dresden.de
Mr | EUGENIUSZ | GATNAR | National Bank of Poland | Warsaw | PL | sekretariat.gatnarWB@nbp.pl
Dr | John | Gelissen | Department of Methodology & Statistics, Tilburg University | Tilburg | NL | j.p.t.m.gelissen@uvt.nl
Mr | van den Burg | Gertjan | Erasmus University Rotterdam | Rotterdam | NL | burg@ese.eur.nl
Dr | Paolo | Giordani | Sapienza University of Rome | Rome | IT | paolo.giordani@uniroma1.it
Dr | Anna | Giraldo | University of Padova | Padova | IT | anna.giraldo@unipd.it
Dr | Cynthia | Glodeanu | TU Dresden | Dresden | DE | Cynthia-Vera.Glodeanu@tu-dresden.de
Dr | Tomasz | Górecki | Faculty of Mathematics and Computer Science, Adam Mickiewicz University | Poznań | PL | tomasz.gorecki@amu.edu.pl
Mrs | Rosalie | Gorter | VUmc | Amsterdam | NL | r.gorter@vumc.nl
Prof | John | Gower | Open University | Milton Keynes | GB | j.c.gower@open.ac.uk
Prof | Michael | Greenacre | Universitat Pompeu Fabra | Barcelona | ES | michael.greenacre@gmail.com
Prof | Patrick | Groenen | Erasmus University Rotterdam | Rotterdam | NL | groenen@ese.eur.nl
Dr | Isabelle | Guyon | ClopiNet | San Francisco | US | guyon@clopinet.com
Mr | Kuniyoshi | Hayashi | Okayama University | Okayama City | JP | k-hayashi@ems.okayama-u.ac.jp
Mr | Willem | Heiser | Leiden University | Leiden | NL | Heiser@Fsw.Leidenuniv.nl
Dr | Christian | Hennig | University College London, Department of Statistical Science | London | GB | c.hennig@ucl.ac.uk
Mrs | Joke | Heylen | KU Leuven | Leuven | BE | Joke.Heylen@ppw.kuleuven.be
Prof | Tadashi | Imaizumi | Tama University | Tokyo | JP | imaizumi@tama.ac.jp
Prof | Salvatore | Ingrassia | Department of Economics and Business, University of Catania | Catania | IT | s.ingrassia@unict.it
Dr | Alfonso | Iodice D'Enza | Università di Cassino e del Lazio Meridionale | Cassino | IT | iodicede@unicas.it
Mrs | Lianne | Ippel | Tilburg University | Tilburg | NL | g.j.e.ippel@tilburguniversity.edu
Dr | Loredana | Ivan | National School of Political Studies and Public Administration | Bucharest | RO | loredana.ivan@comunicare.ro
Mr | Ruslan | Jabrayilov | Tilburg University | Tilburg | NL | r.jabrayilov@uvt.nl
Prof | Krzysztof | Jajuga | Wroclaw University of Economics | Wroclaw | PL | krzysztof.jajuga@ue.wroc.pl
Mr | Maarten | Kampert | Leiden University | Leiden | NL | mkampert@math.leidenuniv.nl
Mr | Yusuke | Kanazawa | Rikkyo University | Toshima-ku | JP | kanazawa@rikkyo.ac.jp
Dr | Miloš | Kankaraš | Tilburg University | Tilburg | NL | m.kankaras@uvt.nl
Dr | Robert | Kapłon | Wroclaw University of Technology | Wrocław | PL | robert.kaplon@pwr.wroc.pl
Dr | Maurits | Kaptein | Tilburg University | Nijmegen | NL | maurits@mauritskaptein.com
Prof | Henk | Kelderman | Leiden University | Leiden | NL | h.kelderman@umail.leidenuniv.nl
Prof | Martin | Kidd | University of Stellenbosch | Matieland | ZA | mkidd@sun.ac.za
Prof | Henk | Kiers | Groningen University | Groningen | NL | h.a.l.kiers@rug.nl
Dr | Simona | Korenjak-Cerne | University of Ljubljana, Faculty of Economics | Ljubljana | SI | simona.cerne@ef.uni-lj.si
Mrs | Anna | Krol | Wroclaw University of Economy | Wroclaw | PL | anna.krol@ue.wroc.pl
Dr | Takafumi | Kubota | The Institute of Statistical Mathematics | Tokyo | JP | tkubota@ism.ac.jp
Mrs | Renske | Kuijpers | Tilburg University | Tilburg | NL | r.e.kuijpers@tilburguniversity.edu
Dr | Kei | Kurakawa | National Institute of Informatics | Tokyo | JP | kurakawa@nii.ac.jp
Prof | Koji | Kurihara | Okayama University | Okayama | JP | kurihara@ems.okayama-u.ac.jp
Mrs | Olesia | Kushnir | Tula State University | Tula | RU | kushnir-olesya@rambler.ru
Dr | Morne | Lamont | Stellenbosch University | Stellenbosch | ZA | mmcl@sun.ac.za
Prof | Berthold | Lausen | University of Essex | Colchester | GB | blausen@essex.ac.uk
Prof | Niel | Le Roux | University of Stellenbosch | Stellenbosch | ZA | njlr@sun.ac.za
Prof | TAE RIM | LEE | Korea National Open University | Seoul | | trlee@knou.ac.kr
Prof | Herbie | Lee | University of California, Santa Cruz | Santa Cruz, CA | US | herbie@ams.ucsc.edu
Prof | Friedrich | Leisch | BOKU Vienna | Vienna | AT | Friedrich.Leisch@boku.ac.at
Mr | Etienne | Lord | Universite du Quebec a Montreal / Dept. Informatique | Montreal | CA | lord.etienne@courrier.uqam.ca
Prof | Sugnet | Lubbe | University of Cape Town | Cape Town | ZA | Sugnet.Lubbe@uct.ac.za
Mrs | Gertraud | Malsiner-Walli | Johannes Kepler University Linz | Linz | AT | gertraud.malsiner_walli@jku.at
Dr | Małgorzata | Markowska | Wrocław University of Economics | Jelenia Góra | PL | malgorzata.markowska@ue.wroc.pl
Mr | Yusuke | Matsui | Hokkaido University | Sapporo | JP | matsui@iic.hokudai.ac.jp
Dr | Marcella | Mazzoleni | Dipartimento di Statistica e Metodi Quantitativi Università Bicocca | Milano | IT | m.mazzoleni8@campus.unimib.it
Prof | Geoff | McLachlan | University of Queensland | Brisbane | AU | g.mclachlan@uq.edu.au
Prof | Paul | McNicholas | University of Guelph | Guelph | CA | paul.mcnicholas@uoguelph.ca
Mrs | Dhouha | Mejri | Technische University of Dortmund | Dortmund | DE | mejri_dhouha@yahoo.fr
Dr | Giovanna | Menardi | University of Padua | PADOVA | IT | menardi@stat.unipd.it
Dr | Hiroyuki | MINAMI | Hokkaido University | Sapporo | JP | min@iic.hokudai.ac.jp
Mrs | Camelia | Minica | Vrije Universiteit Amsterdam | Amsterdam | NL | c.c.minica@vu.nl
Prof | Boris | Mirkin | NRU Higher School of Economics Moscow | Moscow | RU | mirkin@dcs.bbk.ac.uk
Mr | Masaki | Mitsuhiro | Graduate School of Doshisha University | Kyotanabe | JP | dim0009@mail4.doshisha.ac.jp
Mrs | Marie-Anne | Mittelhaeuser | Tilburg University | Tilburg | NL | M.Mittelhaeuser@uvt.nl
Prof | MASAHIRO | MIZUTA | IIC Hokkaido Univ. | Sapporo | JP | mizuta@iic.hokudai.ac.jp
Prof | Francesco | Mola | University of Cagliari | Cagliari | IT | mola@unica.it
Prof | Angela | Montanari | University of Bologna | Bologna | | angela.montanari@unibo.it
Mrs | Katherine | Morris | University of Guelph, Ontario, Canada | Toronto | CA | kmorri09@uoguelph.ca
Mr | Pavlo | Mozharovskyi | University of Cologne | Cologne | DE | mozharovskyi@statistik.uni-koeln.de
Dr | Joris | Mulder | Tilburg University | Tilburg | NL | j.mulder3@uvt.nl
Prof | Fionn | Murtagh | Royal Holloway, University of London | London | GB | fmurtagh@acm.org
Prof | Mohamed | Nadif | LIPADE - University of Paris Descartes | Paris | FR | mohamed.nadif@parisdescartes.fr
Mr | Erwin | Nagelkerke | Tilburg University | Tilburg | NL | e.nagelkerke@uvt.nl
Prof | Miki | Nakai | Ritsumeikan University | Kyoto | JP | mnakai@ss.ritsumei.ac.jp
Prof | Junji | Nakano | The Institute of Statistical Mathematics | Tokyo | JP | nakanoj@ism.ac.jp
Dr | Atsuho | Nakayama | Tokyo Metropolitan University | Hachioji-shi | JP | atsuho@tmu.ac.jp
Mr | Amedeo | Napoli | LORIA (CNRS -- INRIA NGE -- Université de Lorraine) | Vandoeuvre les Nancy | FR | Amedeo.Napoli@loria.fr
Dr | Federica | Nicolussi | Università degli Studi Milano-Bicocca | Lissone | IT | f.nicolussi@campus.unimib.it
Prof | Rebecca | Nugent | Carnegie Mellon University | Pittsburgh | US | rnugent@stat.cmu.edu
Dr | Daniel | Oberski | Tilburg University | Tilburg | NL | doberski@uvt.nl
Prof | Akinori | Okada | Tama University | Tokyo | JP | okada@rikkyo.ac.jp
Prof | Rodriguez | Oldemar | University of Costa Rica | San Pedro | CR | oldemar.rodriguez@ucr.ac.cr
Mrs | Hannah | Oosterhuis | Tilburg University | Tilburg | NL | h.e.m.oosterhuis@tilburguniversity.edu
Mr | Pieter | Oosterwijk | Til | Tilburg | NL | p.r.oosterwijk@uvt.nl
Mr | Mory | Ouattara | CEDRCI/CSTB | Paris | FR | mory.ouattara@live.fr
Dr | Jan W. | Owsinski | Systems Research Institute, Polish Academy of Sciences | Warszawa | PL | owsinski@ibspan.waw.pl
Mrs | Opeoluwa | Oyedele | University of Cape Town | Cape Town | ZA | OpeoluwaOyedele@gmail.com
Prof | Francesco | Palumbo | University of Naples Federico II | Naples | IT | fpalumbo@unina.it
Dr | Silvia | Pandolfi | University of Perugia | Perugia | IT | pandolfi@stat.unipg.it
Prof | Barbara | Pawelek | Cracow University of Economics, Department of Statistics | Krakow | PL | pawelekb@uek.krakow.pl
Dr | Marcin | Pełka | Wroclaw University of Economics | Jelenia Góra | PL | marcin.pelka@ue.wroc.pl
Mr | GONZALO | PEREZ DE LA CRUZ | NATIONAL UNIVERSITY OF MEXICO | MEXICO | MX | acuario_1984@yahoo.com.mx
Dr | Radoslaw | Pietrzyk | Wroclaw University of Economics | Wroclaw | PL | radoslaw.pietrzyk@ue.wroc.pl
Dr | Krzysztof | Piontek | Wroclaw University of Economics | Wroclaw | PL | krzysztof.piontek@ue.wroc.pl
Prof | Jozef | Pociecha | Cracow University of Economics, Department of Statistics | Krakow | PL | pociecha@uek.krakow.pl
Mr | Olivier | POIRION | Laboratoire AMPERE - UMR 5005-CNRS | Écully | FR | olivier.poirion@ec-lyon.fr
Mr | Pascal | Prea | Ecole Centrale Marseille | Marseille Cedex 20 | FR | pascal.prea@lif.univ-mrs.fr
Dr | Klaudia | Przybysz | Wroclaw University of Economics | Wroclaw | PL | klaudia.przybysz@ue.wroc.pl
Dr | Antonio | Punzo | University of Catania | Catania | IT | antonio.punzo@unict.it
Prof | Giancarlo | Ragozini | Federico II University of Naples | Naples | IT | giragoz@unina.it
Dr | Surajit | Ray | University of Glasgow | GLASGOW | GB | surajit.ray@glasgow.ac.uk
Mr | Mohamed Amine | Remita | Université du Québec à Montréal | Montréal | CA | remita.mohamed_amine@uqam.ca
Dr | Ralph | Rippe | Leiden University | Leiden | NL | rrippe@fsw.leidenuniv.nl
Dr | Pawel | Rokita | Wroclaw University of Economics | Wroclaw | PL | pawel.rokita@ue.wroc.pl
Mr | Ruan | Rossouw | Sasol Technology R & D | Sasolburg | | tanya.cerva@sasol.com
Prof | Adam | Sagan | Cracow University of Economics | Krakow | PL | sagana@uek.krakow.pl
Dr | Jan | Schepers | Maastricht University | Maastricht | NL | jan.schepers@maastrichtuniversity.nl
Dr | Verena | Schmittmann | Tilburg University | Tilburg | NL | v.d.schmittmann@uvt.nl
Mr | Pieter | Schoonees | Erasmus University Rotterdam | Rotterdam | | schoonees@ese.eur.nl
Dr | Malgorzata | Sej-Kolasa | Wroclaw University of Economics | Jelenia Góra | PL | malgorzata.sej-kolasa@ue.wroc.pl
Mr | Andrey | Shestakov | Higher School of Economics | Moscow | RU | shestakoffandrey@gmail.com
Prof | Klaas | Sijtsma | Tilburg School of Social and Behavioral Sciences | Tilburg | NL | k.sijtsma@uvt.nl
Mrs | Cláudia | Silvestre | ISCTE-IUL | LISBOA | PT | csilvestre@escs.ipl.pt
Prof | Age | Smilde | University of Amsterdam | Amsterdam | NL | a.k.smilde@uva.nl
Mr | Niels | Smits | VU University Amsterdam | Amsterdam | NL | n.smits@vu.nl
Prof | Elżbieta | Sobczak | Wrocław University of Economics | Jelenia Góra | PL | elzbieta.sobczak@ue.wroc.pl
Prof | Andrzej | Sokolowski | Cracow University of Economics | Krakow | PL | sokolows@uek.krakow.pl
Mrs | Alette | Spriensma | VU University Medical Center | Amsterdam | NL | a.spriensma@vumc.nl
Dr | Alwin | Stegeman | University of Groningen | Groningen | NL | a.w.stegeman@rug.nl
Prof | Douglas | Steinley | University of Missouri | Columbia | US | steinleyd@missouri.edu
Prof | Xiaogang | Su | University of Alabama at Birmingham | Birmingham | US | xg.su.2012@gmail.com
Dr | Jacques-Henri | Sublemontier | LIFO University of Orléans | Orléans | FR | jhs@univ-orleans.fr
Prof | Yuan | Sun | National Institute of Informatics | Tokyo | JP | yuan@nii.ac.jp
Dr | Miroslawa | Sztemberg-Lewandowska | Wroclaw University of Economics | Jelenia Góra | PL | miroslawa.sztemberg-lewandowska@ue.wroc.pl
Mr | Kensuke | Tanioka | Graduate school of Culture and Information Science, Doshisha University | Kyotanabe City | JP | eim1001@mail4.doshisha.ac.jp
Dr | Shinobu | Tatsunami | St. Marianna University School of Medicine | Kawasaki | JP | s2tatsu@marianna-u.ac.jp
Dr | Fetene | Tekle | Tilburg University | Tilburg | NL | f.b.tekle@uvt.nl
Mr | Yoshikazu | Terada | Graduate School of Engineering Science, Osaka University | Osaka | JP | terada@sigmath.es.osaka-u.ac.jp
Dr | Makoto | Tomita | Tokyo Medical and Dental University | Tokyo | JP | tomita.crc@tmd.ac.jp
Mr | Gen | Tsuchiyama | Graduate School of Culture and Information Science, Doshisha University | Kyotanabe City | JP | eim1002@mail4.doshisha.ac.jp
Prof | Mitsuhiro | Tsuji | Faculty of Informatics / Kansai University / Japan | Takatsuki-shi, OSAKA | JP | tsuji@kansai-u.ac.jp
Mr | Takahiko | Ueno | St. Marianna University | Kawasaki | JP | t2ueno@marianna-u.ac.jp
Mr | Takahiro | Umei | Graduate School of Doshisha University | Kyotanabe | JP | dim0015@mail4.doshisha.ac.jp
Mr | Robbie | van Aert | Department of Methodology and Statistics Tilburg University | Tilburg | | R.C.M.vanAert@tilburguniversity.edu
Dr | Michel | van de Velden | Erasmus University Rotterdam | Rotterdam | NL | vandevelden@ese.eur.nl
Mr | Gertjan | van den Burg | Erasmus University Rotterdam | Rotterdam | NL | burg@ese.eur.nl
Dr | Andries | Van der Ark | Tilburg University | Tilburg | NL | a.vdark@uvt.nl
Mr | Daniël | van der Palm | Tilburg University | Tilburg | NL | D.W.vdrPalm@uvt.nl
Dr | Katrijn | Van Deun | KU Leuven, VAT: 0419 052 173 | Leuven | BE | katrijn.vandeun@ppw.kuleuven.be
Prof | fred | van eeuwijk | wageningen university | wageningen | NL | fred.vaneeuwijk@wur.nl
Mrs | Anoukh | van Giessen | UMC Utrecht | Utrecht | NL | a.vangiessen@umcutrecht.nl
Mrs | Loan | van Hoeven | UMC Utrecht | Utrecht | NL | l.r.vanhoeven-3@umcutrecht.nl
Dr | M. Lee | Van Horn | University of South Carolina | Columbia, SC | US | vanhorn@sc.edu
Mr | Geert | van Kollenburg | Tilburg University | Oirschot | NL | g.h.vankollenburg@uvt.nl
Prof | Iven | Van Mechelen | KU Leuven, VAT: 0419 052 173 | Leuven | BE | iven.vanmechelen@ppw.kuleuven.be
Prof | Rosanna | VERDE | Second University of Naples | Caserta | IT | rosanna.verde@unina2.it
Prof | Rosanna | Verde | Second University of Naples | Caserta | IT | rosanna.verde@unina2.it
Prof | Jeroen K. | Vermunt | Tilburg School of Social and Behavioral Sciences | Tilburg | NL | j.k.vermunt@uvt.nl
Mrs | Marlies | Vervloet | KU Leuven | Leuven | BE | marlies.vervloet@ppw.kuleuven.be
Prof | DONATELLA | VICARI | DIP. SCIENZE STATISTICHE - SAPIENZA UNIV. ROMA | ROMA | IT | donatella.vicari@uniroma1.it
Prof | Maurizio | Vichi | Sapienza University of Rome | Rome | IT | maurizio.vichi@uniroma1.it
Mrs | Maria del Carmen | Villar Patino | Universidad Anahuac | Mexico | MX | maria.villar@anahuac.mx
Mrs | Ingrid | Vriens | Tilburg University | Tilburg | NL | i.vriens@tilburguniversity.edu
Prof | Marek | Walesiak | Wroclaw University of Economics | Jelenia Góra | PL | marek.walesiak@ue.wroc.pl
Dr | Matthijs | Warrens | Leiden University | Leiden | | warrens@fsw.leidenuniv.nl
Mr | Lukasz | Waszak | Adam Mickiewicz University | Poznan | PL | lwaszak@amu.edu.pl
Dr | Jelte | Wicherts | Tilburg University | Tilburg | NL | j.m.wicherts@uvt.nl
Mr | Tom | Wilderjans | KU Leuven | Leuven | BE | tom.wilderjans@ppw.kuleuven.be
Dr | Justyna | Wilk | Wroclaw University of Economics | Jelenia Góra | PL | justyna.wilk@ue.wroc.pl
Prof | Adilson Elias | Xavier | Federal University of Rio de Janeiro | Rio de Janeiro | BR | adilson@cos.ufrj.br
Prof | Hiroshi | Yadohisa | Doshisha University | Kyotanabe | JP | hyadohis@mail.doshisha.ac.jp
Prof | Kazunori | Yamaguchi | Rikkyo University | Tokyo | JP | kyamagu@rikkyo.ac.jp
Dr | Michio | Yamamoto | Osaka University | Toyonaka | JP | myamamoto@sigmath.es.osaka-u.ac.jp
Mr | Achim | Zeileis | Universität Innsbruck | Innsbruck | AT | Achim.Zeileis@R-project.org
Dr | Mariangela | Zenga | University of Milano-Bicocca | Milano | IT | mariangela.zenga@unimib.it
Mr | Berrie | Zielman | Netherlands Court of Audit | Leidschendam | NL | a.zielman@rekenkamer.nl
Dr | Mario | Ziller | Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health | Greifswald - Insel Riems | DE | Mario.Ziller@fli.bund.de
Mrs | Agata | Zoltaszek | Chair of Spatial Econometrics, University of Lodz | Lodz | PL | zoltaszek@uni.lodz.pl
Scientific Program
Monday, July 15
Plenary Invited Sessions
Time: 09:00-10:30
Room: CZ115
Chair: Groenen

Critical Issues and Developments in High-dimensional Prediction with Biomedical Applications
Anne-Laure Boulesteix

Flexible Model Based Clustering via the Cluster-Weighted Approach
Salvatore Ingrassia
Monday, July 15
Concurrent Session 1a
Topic: Applications in marketing and social policy
Time: 11:00-12:20
Room: CZ6
Chair: Okada

Latent Class Models in Marketing: Trading off Classification Certainty and Costs of Data Collection
Maurits Kaptein

Market Segmentation based on Stated Preferences using Latent Class Models and R
Andrzej Bak, Aneta Rybicka, and Marcin Pełka

Multi-layer Cluster Analysis of Brand Switching Among Coffee Brands
Akinori Okada and Satoru Yokoyama

Polish Households’ Pharmaceutical Expenditures in Years 2010–2020 – Microsimulation Analysis with FARMMES
Agata Zoltaszek, M.A.
Monday, July 15
Concurrent Invited Session 1b
Topic: Analysis of symbolic data
Time: 11:00-12:20
Room: CZ7
Organizer and chair: Brito

Clustering for Aggregated Symbolic Data
Nobuo Shimizu and Junji Nakano

Factor Analysis of Distributional Data using Quantiles
Rosanna Verde and Antonio Irpino

A Hierarchical Clustering Algorithm applied to Modal Ordinal Symbolic Data
Carmen Bravo and José M. García-Santesmases

Constrained Clustering of Temporal Beanplot Data
Carlo Drago
Monday, July 15
Concurrent Invited Session 1c
Topic: Reconsidering methodologies in inequalities indicators: the case of gender studies (session sponsored by European Association of Methodology)
Time: 11:00-12:20
Room: CZ8
Organizer and chair: Crippa

Gender Gap: towards a Measurement with Chain Graphical Models
Federica Nicolussi and Fulvia Mecatti

Time To Graduation: Does Gender Make A Difference? An Analysis Of A Greek University
Adele H. Marshall, Aglaia Kalamatianou, and Mariangela Zenga

Beyond indicators: a Causal Approach to Gender Statistics
Silvia Caligaris and Fulvia Mecatti

Gender Differentials In Higher Education: Hints From A Fuzzy States Analysis
Franca Crippa, Marcella Mazzoleni and Mariangela Zenga
Monday, July 15
Concurrent Session 1d
Topic: Correspondence analysis
Time: 11:00-12:20
Room: CZ9
Chair: Le Roux

Analysing Categorical Variables With Similar Categories: Constrained Multiple Correspondence Analysis
Véronique Cariou and El Mostafa Qannari

Constrained Dual Scaling of Successive Categories for Detecting Response Styles
Pieter C. Schoonees, Michel van de Velden, and Patrick J.F. Groenen

ORTHOMALS: Orthogonal Projection Of A Multiple Correspondence Solution On A Design Space
Ralph C.A. Rippe and Willem J. Heiser

Squared Covariances Or Chi-Squared Statistics Based Distances
Antoine de Falguerolles
Monday, July 15
Concurrent Session 1e
Topic: Latent class analysis
Time: 11:00-12:20
Room: CZ109
Chair: Oberski

A New Constant Memory Recursion For Hidden Markov Models
Francesco Bartolucci and Silvia Pandolfi

Detecting Local Dependence In Binary Data Latent Class Models: Some Developments
Daniël Oberski

Power and Sample Size Determination for Latent Class Models
Dereje W. Gudicha, Jeroen K. Vermunt, and Fetene B. Tekle

The Bias-Adjusted Three-Step Approach To Latent Class Modeling With External Variables
Zsuzsa Bakk, Daniel Oberski, and Jeroen K. Vermunt
Monday, July 15
Concurrent Invited Session 1f
Topic: Recent clustering techniques and their applications
Time: 11:00-12:20
Room: CZ114
Organizer and chair: Kurihara

Comparative Analysis on LDA-based Classification and Subject Categories of the Japanese Awards Database of Grants-in-Aid for Scientific Research, KAKEN
Kei Kurakawa, Yuan Sun, and Yasumasa Baba

Prototype Identification through Archetypes
Giancarlo Ragozini

Spatial Clustering based on Hierarchical Structure of Multidimensional Lattice Data
Koji Kurihara and Fumio Ishioka

Research Literature Analytics through Mapping Narratives
Fionn Murtagh
Monday, July 15
Plenary Invited Sessions
Time: 13:20-14:50
Room: CZ115
Chair: Jajuga

Effects of Moment-to-moment Likeability Patterns on the Virality of Online Ads
Tammo Bijmolt

Formal Concepts for Classification
Bernhard Ganter
Monday, July 15
Concurrent Invited Session 2a
Topic: Biostatistics & psychometrics
Time: 15:20-16:40
Room: CZ6
Organizer and chair: Lee

Multinomial Logistic Regression Ensembles
Hongshik Ahn

Age-specific Disease Network For The Major Disease In Korea
Taerim Lee and Hongseok Kim

Analysis of Questionnaire Survey with Ordinal-polytomous Using the Binomial Confidence Limits
Ueno, T., Tatsunami, S., Otaki, M., and Kuwabara, R.

Comparison Of Methods For Handling Missing Data In A Multi-Item Instrument
I. Eekhout, H.C.W. de Vet, J.W.R. Twisk, J.P.L. Brand, M.R. de Boer, and M.W. Heymans
Monday, July 15
Concurrent Session 2b
Topic: Reduced rank clustering
Time: 15:20-16:40
Room: CZ7
Chair: Wilderjans

Common and Cluster-specific Simultaneous Component Analysis
Kim De Roover, Marieke E. Timmerman, Batja Mesquita and Eva Ceulemans

Extending Clusterwise non-negative matrix factorization (NMF) to hierarchically organized data
Joke Heylen, Philippe Verduyn, Iven Van Mechelen and Eva Ceulemans

Generalized Reduced Clustering Analysis
Michio Yamamoto

Mixtures Of Factor Analyzers And Unobserved Heterogeneity In Questionnaire Data
Robert Kapłon
Monday, July 15
Concurrent Invited Session 2c
Topic: Research of IOPS Ph.D.-students
Time: 15:20-16:40
Room: CZ8
Organizer and chair: Kuijpers

Estimation Methods for Categorical Marginal Models: Comparing MAEL, GEE, and GSK.
Renske E. Kuijpers, Wicher P. Bergsma, L. Andries van der Ark, and Marcel A. Croon

Applying Multilevel Latent Class Analysis To Large-Scale Educational Assessment Data: Predicting Students’ Mathematical Strategy Choices From Teachers’ Instructional Practice
Marije F. Fagginger Auer, Marian Hickendorff, and Cornelis M. van Putten

A Tuning Strategy for COSA
Maarten M.D. Kampert and Jacqueline J. Meulman

Accuracy Of Reliability Estimates
Pieter R. Oosterwijk, Klaas Sijtsma, and L. Andries van der Ark
Monday, July 15
Concurrent Session 2d
Topic: Symbolic data clustering and regression
Time: 15:20-16:40
Room: CZ9
Chair: Brito

A Big Data Intensive Application System with Symbolic Data Analysis and its Implementation
Hiroyuki Minami and Masahiro Mizuta

A Generalization Of Centre And Range Method For Fitting A Linear Regression Model To Symbolic Interval Data Using Ridge Regression, Lasso And Elastic Net Methods
Oldemar Rodríguez

Symbolic Data Clustering. A Review
Justyna Wilk

The Ensemble Conceptual Clustering of Symbolic Data
Marcin Pełka
Monday, July 15
Concurrent Invited Session 2e
Topic: Applications in economics and business
Time: 15:20-16:40
Room: CZ109
Organizer and chair: Pociecha

The Hierarchy Test Of Geographic Units based on Border Lengths
Andrzej Sokołowski, Danuta Strahl, Małgorzata Markowska, and Marek Sobolewski

Statistical Modeling the Optimal Level of FX Reserves for Poland
Eugeniusz Gatnar

Latent Transitions with Mixture Rasch Model of Bankruptcy Risk in the Classification of Polish Firms
Barbara Pawełek, Józef Pociecha, and Adam Sagan

Automatic Determination The Number Of Clusters In Spectral Clustering
Marek Walesiak and Andrzej Dudek
Monday, July 15
Concurrent Session 3a
Topic: Clustering algorithms
Time: 17:10-18:30
Room: CZ6
Chair: Hennig

A Spectral-Mean Shift Algorithm for Clustering of Symbolic Data
Andrzej Dudek and Marcin Pełka

Asymptotics of Reduced K-means Clustering
Yoshikazu Terada

Non-hierarchical Clustering Algorithm For Mixed Numerical And Categorical Three-Way Three-Mode Data
Takahiro Umei and Hiroshi Yadohisa

Using Simulation Strategies to Test Clustering Algorithm Performances
Marina Marino and Cristina Tortora
Monday, July 15
Concurrent Invited Session 3b
Topic: Recursive partitioning and application (session sponsored by IASC)
Time: 17:10-18:30
Room: CZ7
Organizer and chair: Wilhelm

Random Forest Variable Importance Measures: Current Developments
Anne-Laure Boulesteix and Silke Janitza

Detecting Threshold Interactions In Binary Classification: STIMA
Claudio Conversano and Elise Dusseldorp

A Recursive Partitioning-Based Method To Balance Covariates When Estimating Causal Effects
Massimo Cannas, Claudio Conversano and Francesco Mola

Recursive Partitioning for Hybrid Image Classification using Captions and Image Features
Adalbert Wilhelm
Monday, July 15
Concurrent Session 3c
Topic: Applications in economics
Time: 17:10-18:30
Room: CZ8
Chair: Markos

Change of Aspects of Industrial Classification System from Hierarchical Structure to Network Structure
Hiroki Furuzumi, Yoshiro Matsuda, and Yasumasa Baba

Econometric Models of Durable Goods’ Prices: A Hedonic Approach
Anna Król

Smart Growth Versus Economic And Social Cohesion – Econometric Panel Analysis
Beata Bal-Domanska and Elzbieta Sobczak

Workflow Classification Based On The K-Means Partitioning
Etienne Lord, Abdoulaye Baniré Diallo, and Vladimir Makarenkov
Monday, July 15
Concurrent Session 3d
Topic: R packages
Time: 17:10-18:30
Room: CZ9
Chair: Leisch

Functional Principal Component Analysis with R
Malgorzata Sej-Kolasa and Miroslawa Sztemberg-Lewandowska

Implementation of Time Series Methods of Forecasting in TSprediction R Package
Tomasz Bartłomowicz

Latest developments of the RSDA: An R package for Symbolic Data Analysis
Oldemar Rodríguez and Johnny Villalobos

Microeconometrics Multinomial Logit Models and their Implementations in MMLM R Package
Andrzej Bak and Tomasz Bartłomowicz
Monday, July 15
Concurrent Session 3e
Topic: Latent variable & multilevel analysis
Time: 17:10-18:30
Room: CZ109
Chair: Montanari

Latent Spaces of the Product Baskets - A Hybrid Model of On-line Shopping
Adam Sagan and Mariusz Łapczynski

Multilevel Principal Covariates Regression
Marlies Vervloet, Wim Van den Noortgate, Katrijn Van Deun and Eva Ceulemans

Three-step Estimation Method For Discrete Micro-Macro Multilevel Models
M. Bennink, M. A. Croon and J. K. Vermunt

Single-array SNP Genotype Classification With Semi-Parametric Log-Concave Mixtures
Paul H.C. Eilers and Ralph C.A. Rippe
Monday, July 15
Concurrent Invited Session 3f
Topic: Least-squares clustering
Time: 17:10-18:30
Room: CZ114
Organizer and chair: Mirkin

On Featureless K-Means Clustering
Sergey D. Dvoenko

Two Major Least-squares Divisive Clustering Methods: Bisecting K-Means, PDDP and in between
E. Kovaleva and B. Mirkin

Scoring Dissimilarity between Binary Images by Aligning Series of Skeleton Primitives
Olesya A. Kushnir and Oleg S. Seredin

Least-squares Consensus Clustering versus: (a) other Consensus Approaches and (b) K-Means
A. Shestakov and B. Mirkin
Tuesday, July 16
Concurrent Session 4a
Topic: Clustering methods
Time: 08:30-10:10
Room: CZ6
Chair: Bertrand

Combination of Several Control Charts using Dynamic Weighted Majority Algorithm
Dhouha Mejri, Claus Weihs and Mohamed Limam

Multiplicity Within Clustering: Challenges And Unifications
Jacques-Henri Sublemontier

Non-Isometric Transforms in Time Series Classification using DTW
Tomasz Górecki and Maciej Łuczak

Performance of the Accelerated Hyperbolic Smoothing Clustering Method
Adilson Elias Xavier and Vinicius Layter Xavier

STATIS Based Multiblock Clustering
Ndèye Niang and Mory Ouattara
Tuesday, July 16
Concurrent Invited Session 4b
Topic: New trends in analyzing multi-set and three-way data
Time: 08:30-10:10
Room: CZ7
Organizers: Wilderjans and Ceulemans (Chair)

Identifying Common And Distinctive Processes Underlying Multiset Data
Katrijn Van Deun, Age K. Smilde, Henk A.L. Kiers, and Iven Van Mechelen

Fuzzy Clustering of Three-way Proximity Arrays
Paolo Giordani and Henk A.L. Kiers

Principal Covariates Clusterwise Regression
Eva Ceulemans, Eva Vande Gaer, Henk A. L. Kiers, Iven Van Mechelen, and Tom F. Wilderjans

Clusterwise PARAFAC To Identify Heterogeneity In Three-Way Data
Tom F. Wilderjans and Eva Ceulemans

Structure-Revealing Data Fusion Model
Evrim Acar, Anders J. Lawaetz, Morten A. Rasmussen, and Rasmus Bro
Tuesday, July 16
Concurrent Session 4c
Topic: Distances and similarities
Time: 08:30-10:10
Room: CZ8
Chair: Okada

Effects of Resampling Schemes on Stability of Cluster Validation Indices
Rainer Dangl and Friedrich Leisch

Functional Canonical Correlation Analysis
Mirosław Krzysko and Łukasz Waszak

Pearson’s Product-Moment Correlation is a Special Case Of Cohen’s Weighted Kappa
Matthijs J. Warrens

Ternary Diagrams Based On A Probabilistic Ideal Point Model
Mark de Rooij and Paul Eilers

The Matter Of Scale: Perceiving Distances And Proximities In The Bi-Partial Clustering Setting
Jan W. Owsinski
Tuesday, July 16
Concurrent Session 4d
Topic: Algorithms for clustering and classification
Time: 08:30-10:10
Room: CZ9
Chair: Sokołowski

Comparing Direct Estimators of the Mode
Andrzej Sokołowski and Kamil Fijorek

k-NN Algorithm for Instantaneous Classification
Carmen Villar-Patiño and Carlos Cuevas-Covarrubias

Flexible Multiclass Support Vector Machines: An Approach using Iterative Majorization and Huber Hinge Errors
G.J.J. van den Burg and P.J.F. Groenen

Power-Stress for Multidimensional Scaling
Patrick J.F. Groenen and Jan de Leeuw

Variable Selection in Cluster Analysis Using Resampling Techniques: a Proposal
Hans-Joachim Mucha and Hans-Georg Bartel
Tuesday, July 16
Concurrent Session 4e
Topic: Applications in risk analysis and finance
Time: 08:30-10:10
Room: CZ109
Chair: Cuevas

Adversarial Risk Analysis in Auctions
David Banks

Gaussian Process Classification And Duration Models For Credit Risk
Silvia Figini and Aki Vehtari

Model Averaging For Credit Risk Modelling
Silvia Figini and Marika Vezzoli

Multiobjective Optimization Of Financing Household Goals With Multiple Investment Programs
Lukasz Feldman, Radoslaw Pietrzyk, and Pawel Rokita

Power Of Skewness Tests In The Presence Of Fat Tailed Financial Distributions
Krzysztof Piontek
Tuesday, July 16
Concurrent Session 4f
Topic: Applications in social sciences
Time: 08:30-10:10
Room: CZ110
Chair: Palumbo

Robust Clustering for Anti-Fraud Analysis
Andrea Cerioli and Domenico Perrotta

An Extended Gravity Approach To Examining Internal Migrations. The Case Of Poland
Justyna Wilk and Michał Pietrzak

Clustering of US counties based on their demographic structures
Simona Korenjak-Cerne, Vladimir Batagelj, Nataša Kejžar

Strategic, Motivational And Emotional Aspects Of University Study. A Latent Class Approach
Anna Giraldo, Silvia Meggiolaro, and Elisa Visentin

The Comparative Log–Linear Analysis Of Unemployment In Poland In 2004–2011
Justyna Brzezinska
Tuesday, July 16
President’s Invited Session
Time: 10:40-12:10
Room: CZ115
Chair: Dean

Measurement of Quality in Cluster Analysis
Christian Hennig

Resampling Methods for Exploring Cluster Stability
Friedrich Leisch

The Effect Of Data Generation On Our Understanding Of Clustering Algorithms
Doug Steinley
Tuesday, July 16
Concurrent Session 5a
Topic: Clustering and multilevel analysis of symbolic data
Time: 13:10-14:30
Room: CZ6
Chair: McNicholas

Clustering Constrained Symbolic Objects Constrained By Rules
Marc Csernel

Conceptual Clustering with Interval Representation
Paula Brito and Géraldine Polaillon

Hierarchical Symbolic Cluster Analysis with Quantile Function Representation
Yusuke Matsui, Hiroyuki Minami, and Masahiro Mizuta

Multilevel Consumer Preference Model on Symbolic Data
Adam Sagan, Marcin Pełka, and Aneta Rybicka
Tuesday, July 16
Concurrent Invited Session 5b
Topic: Advances in clustering and classification
Time: 13:10-14:30
Room: CZ7
Organizer and chair: Nugent

The Variance of the Adjusted Rand Index (and other properties)
Doug Steinley

Identifying Clusters in Bayesian Disease Mapping
Nema Dean, Craig Anderson, and Duncan Lee

Classification Boundary Mapping
Yuning He and Herbert Lee

Deduplicating Text Records by Clustering the Results of Aggregated Conditional Classifiers
Rebecca Nugent and Samuel L. Ventura
Tuesday, July 16
Concurrent Session 5c
Topic: Applications in behavioral sciences
Time: 13:10-14:30
Room: CZ8
Chair: Yamaguchi

Classifications of Baseball Pitching Strategies and Exploring Effects of the New Official Balls in the Japanese Professional Baseball League
Kazunori Yamaguchi

Life Long Learning Idea on Background of Poles’ Needs
Marta Dziechciarz-Duda and Klaudia Przybysz

Migration Of Population - The Analysis With The Use Of Log-Linear Models
Justyna Brzezinska

The Influence of Emotion Recognition and Academic Performance on Group Popularity
Ivan Loredana
Tuesday, July 16
Concurrent Invited Session 5d
Topic: Formal concept analysis
Time: 13:10-14:30
Room: CZ9
Organizer and chair: Ganter

Hierarchical Classes Analysis vs. Formal Concept Analysis
Bernhard Ganter and Cynthia V. Glodeanu

The Diversity of Pattern Structures in Formal Concept Analysis
Aleksey Buzmakov, Sergei O. Kuznetsov, and Amedeo Napoli

Decision Aiding Software And Consensus Theory
Florent Domenach and Ali Tayari

Experimental Comparison of Some Triclustering Algorithms
Dmitry V. Gnatyshak, Dmitry I. Ignatov, and Sergei O. Kuznetsov
Tuesday, July 16
Concurrent Invited Session 5e
Topic: Interactions in bi- and tri-additive models
Time: 13:10-14:30
Room: CZ109
Organizers: Albers and Gower (Chair)

A Framework For Modeling Covariances
Age K. Smilde, M.E. Timmerman, H.C.J. Hoefsloot, J.J. Jansen, and E. Saccenti

Biadditive Models, Alternative Estimation Procedures And Better Biplots
Fred A. van Eeuwijk, Gerrit Gort, Sabine K. Schnabel, and Paul H.C. Eilers

Triadditive Models for Three-way Tables
John C. Gower, Casper J. Albers, and Steffen Unkel

Three-way Candecomp/Parafac And The Diverging Components Problem
Alwin Stegeman
Tuesday, July 16
Concurrent Session 5f
Topic: Cluster-weighted modeling
Time: 13:10-14:30
Room: CZ114
Chair: Ingrassia

Cluster-weighted t-factor Analyzers for Clustering of High-dimensional Data
Sanjeena Dang, Antonio Punzo, Salvatore Ingrassia, and Paul D. McNicholas

Cluster-Weighted Modeling For Time To Event Data
Utkarsh J. Dang and Paul D. McNicholas

Modeling Bivariate Mixed-Type Data with the Generalized Linear Exponential Cluster-Weighted Model
Salvatore Ingrassia and Antonio Punzo
Tuesday, July 16
Plenary Invited Sessions
Time: 15:00-15:45
Room: CZ115
Chair: Vichi

Cluster Inference using Modes
Surajit Ray
Tuesday, July 16
Presidential address
Time: 15:45-16:30
Room: CZ115
Chair: Vichi

IFCS Presidential Address
Classipedia: A Road Map to Help Traverse the Classification Jungle
Iven Van Mechelen
Wednesday, July 17
Concurrent Session 6a
Topic: Clustering, including ultrametric approaches
Time: 08:30-10:10
Room: CZ6
Chair: Diatta

A Restricted ADCLUS Type Model for Transition Matrices
Tadashi Imaizumi

Clustering Of Time Series Via A Segmentation Approach
Christian Derquenne

Looking For A Best Compromise Between The Ultrametric Supremum-Norm Approximations
B. Fichet

Ultrametric Tree Representation For Three-Way Three-Mode Data With Weights Of Variables And Occasions
Kensuke Tanioka and Hiroshi Yadohisa

Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras, Fionn Murtagh, and Javier Pereira
Wednesday, July 17
Concurrent Invited Session 6b
Topic: Personalized medicine by treatment-subgroup interaction
Time: 08:30-10:10
Room: CZ7
Organizer: Elise Dusseldorp
Chair and discussant: Willem Heiser

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Achim Zeileis, Torsten Hothorn, and Kurt Hornik

Predicting Individual Causal Effects (ICE)
Xiaogang Su and Joseph Kang

A New Tool For Identifying Qualitative Treatment-Subgroup Interactions: QUINT
Elise Dusseldorp and Iven Van Mechelen

A Comparison Of Six Sequential Partitioning Methods To Find Subgroups Involved In Treatment-Subgroup Interactions
Lisa Doove, Elise Dusseldorp, Katrijn Van Deun, and Iven Van Mechelen
Wednesday, July 17
Concurrent Session 6c
Topic: Modeling distributions and associations
Time: 08:30-10:10
Room: CZ8
Chair: Kiers

Automatic Bayes Factors for Comparing Variances of Two Independent Normal Distributions
Florian Böing-Messing and Joris Mulder

Bayesian Model Selection For Evaluating Equality And Order Constraints On Correlation Matrices
Joris Mulder

Bivariate Dependence Patterns And Copulas: Model Discrimination And Robustness
Lianne Ippel and Johan Braeken

Posterior Predictive checking as alternative to Asymptotics and Bootstrapping in Latent Class Analysis
Geert H. van Kollenburg, Joris Mulder, and Jeroen K. Vermunt

Statistical Modeling Of The Distribution Of Financial Returns
Cuevas-Covarrubias C., Iñigo-Martínez J. and Rosales-Contreras J.
Wednesday, July 17
Concurrent Session 6d
Topic: Classification trees
Time: 08:30-10:10
Room: CZ9
Chair: Lausen

Combining Decision Trees And Stochastic Curtailment For Assessment Length Reduction Of Test Batteries Used For Classification
Marjolein Fokkema, Niels Smits, and Henk Kelderman

Gaussian Tree Models For Discrimination
Gonzalo Perez–de–la–Cruz and Guillermina Eslava–Gomez

Stochastic Curtailment Of Questionnaires For Three Level Classification: Shortening The Ces-D For Assessing Low, Moderate, And High Risk Of Depression
Niels Smits, Matthew Finkelman, and Henk Kelderman

Tree-Based Prediction with Missing Data
Holger Cevallos Valdiviezo, Stefan Van Aelst

Sparse Classifier Ensembles for Improved Interpretability.
Werner Adler, Zardad Khan, Sergej Potapov and Berthold Lausen
Wednesday, July 17
Concurrent Session 6e
Topic: Classification
Time: 08:30-10:10
Room: CZ109
Chair: Groenen

A ROC-Optimised Multi-Prototype Classifier
Mario Ziller

Classification of Rounded Shapes with Penalized Signal Regression
Johan J. de Rooi and Paul H.C. Eilers

Classification of Topics on Twitter in Consideration of Time Series Variation
Atsuho Nakayamar, Hiroyuki Tsurumi, and Junya Masuda

Classifying Real-World Data With The DDα-Procedure
Pavlo Mozharovskyi, Karl Mosler, and Tatjana Lange

Comparing High-Dimensional Classifiers: Abuse and Dangers of Overall Accuracy
A. Pedro Duarte Silva
Wednesday, July 17
Concurrent Session 6f
Topic: Model-based clustering
Time: 08:30-10:10
Room: CZ114
Chair: McLachlan

Divisive Latent Class Modeling as a Density Estimation Tool: The Estimation Algorithm and an Application to Incomplete Data.
Daniel W. van der Palm, L. Andries van der Ark, and Jeroen K. Vermunt

Determining the Number of Clusters in Categorical Data
Cláudia Silvestre, Margarida Cardoso, and Mário Figueiredo

Identifying Mixtures of Mixtures Using Bayesian Estimation
Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, and Bettina Grün

Logratio Methodology Applied To Model-Based Clustering
M. Comas-Cufí, G. Mateu-Figueras and J.A. Martín-Fernández

Model-based Clustering Of Multivariate Longitudinal Data
Laura Anderlucci, Angela Montanari, and Cinzia Viroli
Wednesday, July 17
Concurrent Session 7a
Topic: Longitudinal and multilevel analysis
Time: 10:40-12:00
Room: CZ6
Chair: Nugent

A Bayesian Multilevel Modeling of Longitudinal data: Application to Hygroscopic Expansion in Composite Resins
Nasim Vahabi, Mahmood Reza Gohari, and Ali Azarbar

A New Approach To Analyse Longitudinal Epidemiological Data With An Excess Of Zeros
A.S. Spriensma, T.R.S. Hajos, M.R. de Boer, M.W. Heijmans, and J.W.R. Twisk

A Linear Mixed Model with a Mixture of Smooth Random Effects Distributions
Berrie Zielman

Longitudinal IRT Modelling compared with Multilevel Analysis in estimating Development Over Time In Data From Three Likert-Item Questionnaires
R. Gorter, M.R. de Boer, M.W. Heijmans, and J.W.R. Twisk
Wednesday, July 17
Concurrent Invited Session 7b
Topic: Biclustering
Time: 10:40-12:00
Room: CZ7
Organizer: Vichi
Chair: Van Mechelen

Mutual Information, Chi-Squared And Model-Based Clustering For Co-Clustering Of Contingency Tables
Mohamed Nadif and Gérard Govaert

Parsimonious Estimation And Testing Of Two-Way Interaction By Means Of Two-Mode Clustering
Jan Schepers

A general Model for Two-mode Clustering
Maurizio Vichi
Wednesday, July 17
Concurrent Session 7c
Topic: Applications in medicine
Time: 10:40-12:00
Room: CZ8
Chair: Lausen

Comprehensive Calculations of the Sensitivity and Specificity of Diagnosis Using Bile Cytological Data
Tatsunami S., Hayakawa C., Koike J., Hoshikawa, M., and Ueno T.

Diagnostics for the Risk Prediction of Each Type of Endoleak Formation after TEVAR Using Statistical Discriminant Analysis
Kuniyoshi Hayashi, Fumio Ishioka, Bhargav Raman, Daniel Y. Sze, Hiroshi Suito, Takuya Ueda, and Koji Kurihara

Extension Of A Multilingual Medical Lexicon By Combined Feature Extraction Methods
Wiebke Petersen, Denis Anuschewski, Pascal Chave, and Philipp F. Zeitz
Wednesday, July 17
Concurrent Invited Session 7d
Topic: Correspondence analysis and related methods
Time: 10:40-12:00
Room: CZ9
Organizers: Groenen (Chair) and Greenacre

The Joy of Fuzzy
Michael Greenacre

Fast Iterative Implementation of Correspondence Analysis
Alfonso Iodice D’Enza, Patrick J. Groenen and Michel van de Velden

Inverse Multiple Correspondence Analysis
Michel van de Velden, Patrick Groenen, and Wilco van den Heuvel

Tracking Association Structures in Categorical Data Flows
Alfonso Iodice D’Enza and Angelos Markos
Wednesday, July 17
Concurrent Invited Session 7e
Topic: Finding the number of clusters
Time: 10:40-12:00
Room: CZ109
Organizer and chair: Hennig

Determining the Number of Clusters: a Problem of Definition or Estimation?
Giovanna Menardi

Enhancing The Selection Of A Number Of Clusters In Model-Based Clustering With External Qualitative Variables
J.-P. Baudry, M. Cardoso, G. Celeux, M.J. Amorim, and A.S. Ferreira

Choosing the Number of Clusters after, before, and while Clustering
B. Mirkin
Wednesday, July 17
Plenary Invited Sessions
Time: 13:00-14:30
Room: CZ115
Chair: Nadif

Competitions in Machine Learning: the Fun, the Art, and the Science
Isabelle Guyon

Playing with Data–or How to Discourage Incorrect Data Analysis
Klaas Sijtsma
Wednesday, July 17
Concurrent Session 8a
Topic: Applications
Time: 15:00-16:20
Room: CZ6
Chair: Bassi

A Study on Small-Area Geographical Analysis of Residential Characteristics after the Great Hanshin-Awaji Earthquake by two Individual Differences Model
Mitsuhiro Tsuji, Hiroshi Kageyama and Toshio Shimokawa

Author Identification of Japanese Classical Literature by Quantitative Analysis
Gen Tsuchiyama and Masakatsu Murakami

A Latent Class Approach for Estimating Labour Market Mobility in the Presence of Multiple Indicators and Retrospective Interrogation
Francesca Bassi, Marcel Croon, and Davide Vidotto
Wednesday, July 17
Concurrent Invited Session 8b
Topic: Non-Gaussian model-based classification
Time: 15:00-16:20
Room: CZ7
Organizer and chair: McNicholas

On Finite Mixtures of Skew Distributions
Geoff McLachlan and Sharon Lee

Classification via Mixtures of Shifted Asymmetric Laplace and Mixtures of Generalized Hyperbolic Distributions
Paul D. McNicholas, Ryan P. Browne, and Brian C. Franczak

Gaussian And Distance Based Clustering In High-Dimensional Space: Differences And Common Aspects
Francesco Palumbo, Cristina Tortora, and Paul McNicholas

Clustering and Dimension Reduction using Non-Gaussian Mixtures
Katherine Morris and Paul McNicholas
Wednesday, July 17
Concurrent Session 8c
Topic: Applications in social sciences
Time: 15:00-16:20
Room: CZ8
Chair: Dean

Comparison of Spatial Clusters between Suicide Data and Its Increase-decrease Rates in Japan
Makoto Tomita, Takafumi Kubota, Fumio Ishioka and Toshiharu Fujita

Detection of Spatial Clusters for High and Low Suicidal Risk Areas in Japan
Takafumi Kubota, Makoto Tomita, Fumio Ishioka, Tomokazu Fujino and Hiroe Tsubaki

Patterns of Cultural Practices and Characteristics of the Cultural Omnivore
Miki Nakai

The Structure Of Subjective Social Status In Japan: An Approach Based On Latent Class Model
Yusuke Kanazawa
Wednesday, July 17
Concurrent Invited Session 8d
Topic: Biplot-based visualisations and classification
Time: 15:00-16:20
Room: CZ9
Organizer and chair: Le Roux

Reference Set Selection for Multivariate Statistical Process Monitoring with Biplots
RF Rossouw, RLJ Coetzer, and NJ Le Roux

PLS Biplot: Another Graphical Tool for Multivariate Data
Opeoluwa V.F. Oyedele and Sugnet Lubbe

Variable Selection for Regression and PLS using Generic Algorithms and Particle Swarm Optimization: A Comparison between the Two Methods
Martin Philip Kidd and Martin Kidd

Classification with Hyperspheres
Morné Lamont
Wednesday, July 17
Concurrent Invited Session 8e
Topic: Combinatorial methods for hierarchical and non-hierarchical clustering
Time: 15:00-16:20
Room: CZ109
Organizer and chair: Brucker

Separation And Convexity Properties Of Hierarchical And Non Hierarchical Clustering
Patrice Bertrand and Jean Diatta

Latticial Approach for Perfect Phylogeny Problems
François Brucker and Pascal Préa

Some Aspects of Formal Concept Analysis in Hierarchical Classification and Data Analysis
Mehdi Kaytoue, Sergei O. Kuznetsov, and Amedeo Napoli

Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras, Fionn Murtagh, and Javier Pereira
Wednesday, July 17
Concurrent Session 8f
Topic: Applications in genetics
Time: 15:00-16:20
Room: CZ114
Chair: Van Deun

Automatic Annotation and Classification of new Papillomavirus genomes
Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo

Different Approaches To Modeling Family Data In GWAS: Application To Cannabis Use
Camelia C. Minica, Conor V. Dolan, Jouke-Jan Hottenga, Dorret I. Boomsma and Jacqueline M. Vink

Utilization Of Machine-Learning Methodologies In Order To Understand Complex Evolutionary And Functional Links Among Bacterial Genomes
Olivier Poiron and Benedicte Lafay

Application of a Bayesian Artificial Neural Network to the Breast Cancer Survival Data
Masoud Salehi and Mahmood Reza Gohari
Wednesday, July 17
Plenary Invited Sessions
Time: 16:50-17:50
Room: CZ115
Chair: McLachlan

Achieving Near-perfect Classification for Functional Data
Peter Hall (and Aurore Delaigle)
Critical Issues and Developments in High-dimensional Prediction with Biomedical Applications
Anne-Laure Boulesteix1
Abstract
The construction of prediction rules based on high-dimensional molecular (“omics”) data in small sample settings has been the focus of abundant literature in computational statistics and bioinformatics in the last decade. Such rules may be used in medical practice, e.g., to predict the clinical outcome of patients based on their transcriptomic, proteomic or metabolomic profile. While the technical issues characterizing the construction of prediction rules in this context have been well investigated in the literature, other related crucial aspects remain comparatively underconsidered. In this talk, I will give an overview of four projects addressing some of these problems.
The focus of the first project is on cross-validation and preliminary steps (such as variable selection, normalization or imputation of missing values) that possibly lead to an underestimation of prediction error if performed globally using both training and test sets. The second project addresses the evaluation and improvement of the clinical usefulness of the derived prediction rules in terms of added predictive value compared to simpler models based on classical clinical predictors. The third project is about the random forest algorithm often used for regression and classification in bioinformatics and the statistical properties of its associated variable importance measures. The fourth project deals with methodological aspects of comparison studies based on real-life data sets with emphasis on testing procedures and power issues.
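The issue raised by the first project can be illustrated with a minimal sketch. It is not the speaker's method: the simulated pure-noise data, the mean-gap filter, the nearest-centroid classifier and all function names are assumptions chosen only to keep the example self-contained. It contrasts variable selection done once on all samples with selection repeated inside each training fold.

```python
import random


def make_noise_data(n, p, seed):
    """Simulate n samples with p pure-noise features and alternating labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def select_features(X, y, k):
    """Return the indices of the k features with the largest class-mean gap."""
    n0, n1 = y.count(0), y.count(1)
    scores = []
    for j in range(len(X[0])):
        s0 = sum(row[j] for row, lab in zip(X, y) if lab == 0)
        s1 = sum(row[j] for row, lab in zip(X, y) if lab == 1)
        scores.append((abs(s1 / n1 - s0 / n0), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def centroids(X, y, features):
    """Class centroids restricted to the selected features."""
    result = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        result[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    return result


def predict(cents, x, features):
    """Nearest-centroid prediction for a single sample."""
    dist = {
        label: sum((x[j] - c) ** 2 for j, c in zip(features, cents[label]))
        for label in (0, 1)
    }
    return 0 if dist[0] <= dist[1] else 1


def cv_error(X, y, k, n_folds, select_inside):
    """Cross-validated error rate.

    select_inside=False: features are chosen once using ALL samples, so the
    held-out samples have already influenced the model (the problematic case).
    select_inside=True: features are re-selected on each training fold only.
    """
    n = len(X)
    global_features = select_features(X, y, k)
    errors = 0
    for fold in range(n_folds):
        test_idx = list(range(fold, n, n_folds))
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(n) if i not in held_out]
        y_tr = [y[i] for i in range(n) if i not in held_out]
        features = select_features(X_tr, y_tr, k) if select_inside else global_features
        cents = centroids(X_tr, y_tr, features)
        errors += sum(predict(cents, X[i], features) != y[i] for i in test_idx)
    return errors / n
```

Because the simulated features carry no signal, an honest error estimate should sit near 50%; calling `cv_error` with `select_inside=False` will typically report a much lower and therefore over-optimistic figure.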
Computational Molecular Medicine Research Group, Department of Medical Informatics, Biometry and Epidemiology (IBE), University of Munich
Flexible Model Based Clustering via the Cluster-Weighted Approach
Salvatore Ingrassia1
Abstract
Cluster-Weighted Models (CWMs) are a flexible family of mixture models for fitting the joint density of a pair (X, Y) of a response variable Y and a vector of covariates X. Statistical properties are investigated from both theoretical and numerical points of view; in particular, it is shown that the CWM includes mixtures of regressions as a special case. Some particular models, based on Gaussian and t distributions as well as on generalized linear models, will be introduced and properties of the maximum likelihood estimates are presented. An extension to high-dimensional data modeling is finally outlined. Theoretical results are illustrated using some empirical studies, considering both simulated and real data.
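As a reading aid, the joint density that a cluster-weighted model fits is usually written as a mixture in which each component factorizes into a conditional density for Y given X and a marginal density for X. The notation below (number of components $G$, weights $\pi_g$, parameter symbols $\theta_g$ and $\psi_g$) is assumed for illustration and is not taken from the abstract.

```latex
p(x, y) \;=\; \sum_{g=1}^{G} \pi_g \, p(y \mid x; \theta_g) \, p(x; \psi_g),
\qquad \pi_g > 0, \quad \sum_{g=1}^{G} \pi_g = 1 .
```

Under this form, one way to see the special case mentioned in the abstract is that when the covariate density $p(x; \psi_g)$ does not depend on $g$, the conditional density of Y given X reduces to a mixture of regressions.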
Department of Economics and Business, University of Catania (Italy)
Latent Class Models in Marketing: Trading off Classification Certainty and Costs of Data Collection
Maurits Kaptein1
Abstract
Latent class analysis has long been used in marketing for consumer segmentation (Green, 1976). Often, a large feature set, such as the purchase history of individual customers, is used to identify different segments of customers. Class membership, defining the segments, is subsequently used to target customers. Class membership might, for example, be related to customer susceptibility to distinct promotions, in which case the segments can be used to tailor promotions.
While many classification attempts are made post hoc, after all the relevant individual-level purchase data have been collected, such data are not always available. Consider a new customer of whom only a limited set of purchases is observed: how should we classify such a customer? We could estimate class membership, with large uncertainty, but classifying too early might lead to the use of suboptimal promotions in future interactions. On the other hand, refraining from tailoring in order to obtain more observations can itself be costly.
The above trade-off raises questions about our application of latent class analysis. The assignment of a promotion should not merely be a function of class membership, but also of our associated uncertainty. We will show that using Randomized Probability Matching (Scott, 2010), a means of optimising the explore-exploit trade-off inherent in uncertain classification, outperforms both early and late classification decisions over the lifetime of a customer.
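To make the explore-exploit idea concrete, here is a minimal sketch of randomized probability matching in its simplest form: a Bernoulli bandit with independent Beta(1, 1) priors. It does not reproduce the latent class setting of the abstract; the function names, the two hypothetical promotions and their response rates are all assumptions for illustration.

```python
import random


def choose_promotion(successes, failures, rng):
    """Pick a promotion by randomized probability matching.

    One plausible response rate is drawn per promotion from its
    Beta(successes + 1, failures + 1) posterior, and the promotion with
    the largest draw is shown. Uncertain promotions still get explored
    because their draws vary widely.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(true_rates, n_rounds, seed=0):
    """Run the policy against hypothetical true response rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = choose_promotion(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures
```

Calling `simulate([0.05, 0.15], n_rounds=2000)` returns the per-promotion success and failure counts; summing them per promotion shows how often each one was offered.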
References
GREEN, P.E., CARMONE F.J. and WACHPRESS, D.P. (1976): Consumer segmentation via latent class analysis. Journal of Consumer Research, 3, 170-174.
SCOTT, L. (2010): A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Ind., 26, 639–658.
Keywords
CLASSIFICATION, EXPLORE-EXPLOIT TRADEOFF, SEGMENTATION, MARKETING
Tilburg University, Tilburg, the Netherlands [email protected]
Market Segmentation based on Stated Preferences using Latent Class Models and R
Andrzej Bak1, Aneta Rybicka1, and Marcin Pełka1
Abstract
Market segmentation is understood as a division of consumers into relatively homogeneous groups. Segmentation is carried out on the basis of variables describing consumers or products, or both sets of variables at the same time. The resulting groups contain consumers for whom the offered products or services have similar utility. Classification methods (cluster analysis), to which latent class analysis models also belong, are often used as tools in segmentation research. The main aim of the paper is to present selected latent class analysis models and their application to market segmentation based on stated preferences. These models can take into account both types of variables: those describing products or services (e.g. brand, price) and characteristics of the consumers (e.g. demographic and socio-economic variables). In the segmentation procedure, a category of consumer preferences may be used as a criterion for separating homogeneous classes of consumers. The latent class models are estimated using the R program, its packages and scripts.
References
BEANE T.P., ENNIS D.M. (1987): Market Segmentation: A Review. European Journal of Marketing, vol. 21, no. 5, pp. 20-42.
LINZER D.A., LEWIS J.B. (2011): poLCA: Polytomous variable Latent Class Analysis. R package version 1.3, http://userwww.service.emory.edu/ dlinzer/poLCA.
WEDEL M., DESARBO W.S. (1994): A Review of Recent Developments in Latent Class Regression Models. In: R.P. Bagozzi (Ed.), Advanced Methods of Marketing Research, Blackwell, Cambridge.
WEDEL M., KAMAKURA W.A. (2000): Market Segmentation. Conceptual and Methodological Foundations, 2nd ed., Kluwer Academic Publishers, Boston-Dordrecht-London.
Keywords
MARKET SEGMENTATION, PREFERENCES, LATENT CLASS MODELS
1 Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected], [email protected]
Multi-layer Cluster Analysis of Brand Switching Among Coffee Brands
Akinori Okada1 and Satoru Yokoyama2
Abstract
Multi-layer cluster analysis assumes that a hierarchical cluster structure consists of several layers, where each upper-layer cluster consists of lower-layer clusters, and classifies objects into clusters at each layer (Okada & Yokoyama, 2011, 2013). Multi-layer cluster analysis has been applied to two-mode two-way data (Okada & Yokoyama, 2011). A brand switching matrix, whose (j,k) element represents the frequency of brand switching from the brand corresponding to row j to the brand corresponding to column k, is analyzed. In the present study, a brand switching matrix, which consists of one-mode two-way data, is transposed, and is regarded as two-mode two-way data. Coffee brands vary on three attributes: (a) type (regular instant, freeze-dried, and already mixed with sugar and cream or packed in a plastic or paper cup), (b) maker (three companies), and (c) with or without end (flier) when the brand was purchased. The analysis discloses the salience of each attribute in brand switching.
References
ARABIE, P. and HUBERT, L. (1994): Cluster Analysis in Marketing Research. In: R.P. Bagozzi (Ed.): Advanced Methods of Marketing Research. Blackwell, Cambridge, MA, 160-189.
OKADA, A. and YOKOYAMA, S. (2011): Cluster Analysis Based on Multi-layer Structure. Collection of Abstracts IFCS Symposium and GfKl/DAGM Conference Talks. 149.
OKADA, A. and YOKOYAMA, S. (2013): Multi-layer Cluster Analysis of Brand Switching. Proceeding of the 31st Meeting of the Japanese Classification Society.
Keywords
BRAND SWITCHING, CLUSTER ANALYSIS, HIERARCHICAL STRUCTURE, MULTI-LAYER
Graduate School of Management and Information Sciences, Tama University, 4-1-1 Hijirigaoka Tama-shi Tokyo Japan [email protected] · Department of Business Administration, Faculty of Economics, Teikyo University, 359 Otsuka Hachioji City Tokyo Japan [email protected]
Polish Households’ Pharmaceutical Expenditures in Years 2010-2020 - Microsimulation Analysis with FARMMES
Agata Zoltaszek, M.A.
Abstract
Healthcare is a key sector of every economy and is of great medical, social, and economic importance to all citizens. In Poland healthcare is mostly publicly funded; however, due to the system’s inefficiency, “out-of-pocket” expenditures have been increasing. Pharmaceutical expenditures make up the largest share, which is high enough to limit the accessibility of prescribed and over-the-counter medicine. Therefore, analysing data on current and future private direct expenses on medicine is crucial for the evaluation and improvement of the healthcare system in Poland. The main goal of this paper is to analyze Polish households’ pharmaceutical expenditures in the years 2010-2020. Aggregated values at province and state level are obtained by a microsimulation experiment based on a microsimulation model that has been constructed for the purpose of this research: the Microsimulation Socioeconomic Model of Households’ Pharmaceutical Expenditures in Poland (FARMMES). Outcomes of this research can be used to analyze the distribution of these expenditures and determine the losers and winners of current healthcare policy in Poland, especially by pinpointing the most economically disadvantaged social groups. These results might be useful in evaluating the current healthcare policy and offer some guidelines for new policies in the Polish healthcare system.
References
BARONI, E., RICHIARDI, M. (October 2007): Orcutt’s Vision, 50 years on, Working Paper no. 65.
CAMERON, A.C., TRIVEDI, P.K. (2009): Microeconometrics. Methods and Applications. Cambridge University Press, Cambridge.
ORCUTT, G.H., CALDWELL, S., WERTHEIMER, R.F. (1976): Policy exploration through microanalytic simulation. The Urban Institute, Washington D.C.
Keywords
MICROSIMULATION, HEALTH ECONOMICS, PHARMACEUTICAL EXPENDITURES
Chair of Spatial Econometrics, University of Lodz, Lodz, Poland, [email protected]
Clustering for Aggregated Symbolic Data
Nobuo Shimizu and Junji Nakano
Abstract
Symbolic data (SD) can express “concepts”, which include groups of individuals. Typical SD take intervals, histograms or bar charts as variable values (Billard and Diday, 2006). Symbolic data analysis (SDA) provides techniques, including cluster analysis, for handling such SD. Traditional SD use information about the marginal distributions of variables in each SD. We consider the case where individuals are divided into some naturally defined groups and any descriptive statistics of the groups can be easily calculated. We can express each group by some descriptive statistics, and call them aggregated symbolic data (ASD). ASD can represent information about marginal distributions, such as mean and variance, and also information about the joint distribution, such as correlation coefficients. Hierarchical methods based on several definitions of dissimilarity between traditional SD have been studied (Billard and Diday, 2006). We define a dissimilarity between ASD and use it for hierarchical clustering of ASD. The EM algorithm is often used in model-based clustering for classical data (Everitt et al., 2011). We also investigate a clustering method based on a Gaussian mixture model in the ASD framework. We derive a simplified EM algorithm for clustering ASD by using the mean and variance of each variable and the covariance among variables in ASD. We apply our method to artificial and real data examples.
References
BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, West Sussex.
EVERITT, B. S., LANDAU, S., LEESE, M. and STAHL, D. (2011): Cluster Analysis (5th Edition). Wiley, West Sussex.
Keywords
DISSIMILARITY, EM ALGORITHM, GAUSSIAN MIXTURE MODEL, SYMBOLIC DATA ANALYSIS
The Institute of Statistical Mathematics, Tokyo, Japan {nobuo,nakanoj}@ism.ac.jp
Factor Analysis of Distributional Data using Quantiles
Rosanna Verde1 and Antonio Irpino1
Abstract
Distributional data are multi-valued weighted descriptions of a collection of measurements, where each unit is described by an empirical distribution for a particular quantitative attribute. Symbolic Data Analysis (SDA) provides tools for the statistical treatment of multi-valued data. When the number of variables increases, dimension reduction techniques are useful for extracting patterns from data. The best-known dimension reduction technique for quantitative data is Principal Component Analysis (PCA). In the SDA literature, several PCA techniques for data described by histograms of values have been proposed. These PCAs do not directly consider association measures between histogram variables, but relationships between some particular features of the histograms (the means or only the vector of observed empirical frequencies). Starting from a new association measure for distributional variables based on the squared Wasserstein distance, we propose a new PCA for distributional data, solving the problem of working only on partial information on distributional variables and furnishing new tools for interpreting the results of the proposed technique.
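The role of quantiles can be illustrated with the simplest case of the squared Wasserstein distance: for two one-dimensional empirical distributions with the same number of equally weighted observations, the squared 2-Wasserstein distance is the mean squared difference between matched order statistics, i.e. between the two empirical quantile functions. The sketch below covers only that case; the function name is hypothetical, and the association measure and PCA proposed in the abstract are not reproduced.

```python
def squared_wasserstein(sample_a, sample_b):
    """Squared 2-Wasserstein distance between two empirical distributions.

    Assumes both samples have the same size and equal weights, so that
    sorting each sample gives its empirical quantiles at the same levels.
    """
    if len(sample_a) != len(sample_b) or not sample_a:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    quantiles_a = sorted(sample_a)
    quantiles_b = sorted(sample_b)
    squared_gaps = [(qa - qb) ** 2 for qa, qb in zip(quantiles_a, quantiles_b)]
    return sum(squared_gaps) / len(squared_gaps)
```

For example, shifting every observation of a sample by one unit, as in `squared_wasserstein([3, 1, 2], [4, 2, 3])`, gives a squared distance of 1.0, because every quantile moves by exactly one.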
References
BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data: Exploratory methods for extracting statistical information from complex data. Springer-Verlag.
MAKOSSO-KALLYTH, S. and DIDAY, E. (2012): Adaptation of interval PCA to symbolic histogram variables. ADAC 6 (2), 147-159.
RÜSCHENDORF, L. (2001): Wasserstein metric, in: M. Hazewinkel, Encyclopedia of Mathematics, Springer.
VERDE, R. and IRPINO, A. (2008): Comparing Histogram Data Using a Mahalanobis–Wasserstein Distance. In: COMPSTAT 2008, Physica-Verlag, 77-89.
Keywords
DISTRIBUTIONAL DATA, SYMBOLIC DATA, FACTOR ANALYSIS, WASSERSTEIN DISTANCE
Department of Political Studies “J. Monnet”, Second University of Naples, Viale Ellittico 31, Caserta Italy {rosanna.verde,antonio.irpino}@unina2.it
A Hierarchical Clustering Algorithm applied to Modal Ordinal Symbolic Data
Carmen Bravo1 and José M. García-Santesmases2
Abstract
A general ϕ function that characterizes a consensus measure is defined for probability distributions over a set of ordinal categories. This measure is extended to sets of modal ordinal symbolic data objects. A dissimilarity measure between two of these sets, based on the consensus variability of their centroids, is defined. The Leik measure, being suitable for any ordinal scale, is shown to be a ϕ function.
An ascending hierarchical clustering algorithm is applied to modal ordinal data. The criterion to be minimized in each step is based on the decrease of the variability measure of one partition when two of its members are joined. This decrease is shown to be proportional to the already defined dissimilarity between the two clusters joined. Some criteria to measure the quality of partitions and clusters are given.
To illustrate the proposed method we apply it to a data set composed of 34 teachers who were rated by their students (1350) on 12 items on the ordinal scale: poor, average, good, excellent. Teachers are described by modal ordinal symbolic data. Interpretations of clusters regarding relevant issues are shown.
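For readers unfamiliar with the Leik measure, the sketch below computes Leik's ordinal dispersion in the form it is commonly quoted: twice the sum of the distances of the cumulative relative frequencies from the nearer of 0 and 1, divided by the number of categories minus one. This is an assumption about the formula in general use, not the ϕ function of the abstract, which is not given here; the function name is hypothetical.

```python
def leik_dispersion(counts):
    """Leik's ordinal dispersion for counts over ordered categories.

    Returns 0.0 when all ratings fall in one category and 1.0 when they
    are split evenly between the two extreme categories. A consensus
    score can be read as 1 minus this value (a convention assumed here).
    """
    m = len(counts)
    total = sum(counts)
    if m < 2 or total <= 0:
        raise ValueError("need at least two categories and one rating")
    running = 0
    distance_sum = 0.0
    for count in counts:
        running += count
        cumulative = running / total  # cumulative relative frequency
        distance_sum += cumulative if cumulative <= 0.5 else 1.0 - cumulative
    return 2.0 * distance_sum / (m - 1)
```

With the four-point scale of the example (poor, average, good, excellent), `leik_dispersion([0, 10, 0, 0])` is 0.0 and `leik_dispersion([5, 0, 0, 5])` is 1.0.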
References
BOCK, H.H., DIDAY, E. (Eds.) (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, Heidelberg.
GARCIA-SANTESMASES, J.M., BRAVO, M.C. (2010): Consensus Analysis through Modal Symbolic Objects. In: Compstat 2010 proceedings. Springer, ISBN 978-3-7908-2603-6, 1055–1062.
LEIK, K.R. (1966): A Measure of Ordinal Consensus. The Pacific Sociological Review, 9, 85–90.
Keywords
MODAL ORDINAL SYMBOLIC DATA, SYMBOLIC CONSENSUS MEASURE, SYMBOLIC HIERARCHICAL CLUSTERING
Universidad Complutense de Madrid, Servicio Informático de Apoyo al Usuario - Investigación, Vicerrectorado de Alumnos, 28040 Madrid, [email protected] · Universidad Complutense de Madrid, Facultad de Ciencias Matemáticas, Dpto. Estadística e Investigación Operativa, 28040 Madrid, [email protected]
Constrained Clustering of Temporal Beanplot Data
Carlo Drago1
Abstract
The explosion of Big Data in recent years has raised significant problems in data management and an urgent need for new methods. In fact, data aggregation leads to information loss, so new approaches are needed in order to handle data in a suitable way. The SDA approach considers symbolic data (i.e. interval, boxplot or histogram data), which take into account the internal data structure without aggregation. In this sense our proposal is to use beanplots to capture the variation in a specific observation. The beanplots are obtained by means of a kernel density estimate, which makes it possible to represent the original data and show their relevant features. In the temporal framework, we consider beanplot time series, ordered sequences of beanplots over time. The beanplot data can be parameterized by means of mixture distribution models to retain the relevant structural information. In particular, the obtained parameters can be used in clustering and in forecasting. An important element is the possibility of also taking into account the fit of the different models obtained in the analysis. In this work we will present a new clustering approach for beanplot data which takes into account constraints over time. The resulting clusters make it possible to identify homogeneous temporal periods, which can be used in applications.
References
DIDAY, E., and NOIRHOMME-FRAITURE, M. (Eds.). (2008). Symbolic Data Analysis and the SODAS software (pp. 1-457). J. Wiley & Sons.
DRAGO, C. and LAURO, C. and SCEPI, G. (2011): Beanplot Data Analysis in a Temporal Framework, presented at CLADAG, September 7-9 2011 Pavia.
Keywords
SYMBOLIC DATA ANALYSIS, BEANPLOT, CONSTRAINED CLUSTERING, TIME SERIES
University of Napoli Federico II, Department of Economic and Statistical [email protected]
Gender Gap: towards a Measurement with Chain Graphical Models
Federica Nicolussi1 and Fulvia Mecatti2
Abstract
Recent gender literature shows a growing demand for sound statistical methods for measuring any gender gap, apt to capture its complexity and to embed the pattern of relationships among a collection of observable variables selected in order to disentangle its latent trait. This paper focuses on parametric Hierarchical Marginal Models (Bartolucci, Colombi and Forcina, 2007), which apply to binary and categorical data, as a particularly useful tool for gender studies. We explore the potential of Chain Graphical Models (Drton, 2009) in the presence of both directed and undirected arcs while excluding directed/semi-directed cycles. These specific model features allow for representing conditional independence as well as shaping both symmetrical-associational and causal relationships in the dataset. It will be shown how, by comparing the two distinct graphical models referring to each gender, any difference displayed in the conditional independence structure can be interpreted as a gender gap indicator. Preliminary results will be illustrated from a recent survey on the issue of sexual harassment, supported by the Committee on Equal Opportunities of the University of Milano-Bicocca. The survey, a first-ever attempt to collect primary data on this sensitive matter, was conducted in July 2012 at the university site. It reached a quite high response rate and produced an unexpectedly large sample, including all levels of students, professors and staff.
References
BARTOLUCCI, F., COLOMBI, R., and FORCINA, A. (2007): An extended class of marginal link functions for modelling contingency tables by equality and inequality constraints. Statistica Sinica, 17(2), 691.
DRTON, M. (2009): Discrete chain graph models. Bernoulli, 15(3), 736–753.
MECATTI, F., CRIPPA, F. and FARINA, P. (2012): A Special Gen(d)er of Statistics: Roots, Development and Methodological prospects of a Gender Statistics. International Statistical Review, 80, 452–467.
NICOLUSSI, F. (2013): Marginal Parametrizations for independence models and graphical models for categorical data. PhD Thesis.
Keywords
CONDITIONAL INDEPENDENCIES, GENDER STATISTICS, MARGINAL MODELS, MARKOV PROPERTIES.
University of Milano-Bicocca, Piazza dell’Ateneo Nuovo, 1- 20126, [email protected];[email protected]
Time To Graduation: Does Gender Make A Difference? An Analysis Of A Greek University
Adele H. Marshall1, Aglaia Kalamatianou2 and Mariangela Zenga3
Abstract
In the Greek university system graduation takes place after a time threshold, but students can graduate at any time after this threshold, without a time limit. In such cases studies may last for a long time and the corresponding duration distribution may have a long right tail that never reaches the time axis, leading to a group of perpetual students (Kalamatianou and McClean, 2003). The aim of this paper is to analyse students’ progression to graduation and to estimate the influence of various factors on the probability that students with certain characteristics will progress successfully towards their degree or still be enrolled at the end of the observation period. We propose to use Coxian phase-type distributions (Cox and Miller, 1965) for modelling the time to graduation of the students enrolled at a Greek public university, paying attention to the subpopulations of male and female students.
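A Coxian phase-type distribution describes the time to absorption of a Markov process that moves through a sequence of transient phases in order and may exit to the absorbing state (here, graduation) from any of them. As a minimal sketch, the function below computes the mean time to absorption from hypothetical rates; the function and parameter names are assumptions, and no rates from the study are implied.

```python
def coxian_mean(progress_rates, exit_rates):
    """Mean time to absorption of a Coxian phase-type distribution.

    Phase i is left at rate progress_rates[i] towards phase i + 1 and at
    rate exit_rates[i] towards the absorbing state. The process starts in
    the first phase, and the last phase can only exit.
    """
    if not progress_rates or len(progress_rates) != len(exit_rates):
        raise ValueError("need one progress rate and one exit rate per phase")
    if progress_rates[-1] != 0:
        raise ValueError("the last phase must have progress rate 0")
    reach_probability = 1.0  # probability of ever entering the current phase
    mean_time = 0.0
    for progress, exit_ in zip(progress_rates, exit_rates):
        total_rate = progress + exit_
        if total_rate <= 0:
            raise ValueError("each phase needs a positive total rate")
        # Expected time spent in this phase, weighted by the chance of reaching it.
        mean_time += reach_probability / total_rate
        reach_probability *= progress / total_rate
    return mean_time
```

With a single phase the distribution is exponential, so `coxian_mean([0.0], [2.0])` is 0.5; adding later phases with small exit rates is what produces the long right tail associated with perpetual students.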
References
KALAMATIANOU, A. and McCLEAN, S. (2003): The Perpetual Student: Modeling Duration of Undergraduate Studies Based on Lifetime-Type Educational Data. Lifetime Data Analysis, 9, 311–330.
COX, D.R. and MILLER, H.D. (1965): The theory of stochastic processes. Chapman, London.
Keywords
SURVIVAL ANALYSIS, TIME TO GRADUATION, COXIAN PHASE TYPE DISTRIBUTION
Department of Sociology, Panteion University of Athens, [email protected] · Centre for Statistical Science and Operational Research (CenSSOR) Queen’s University of Belfast, [email protected] · Department of Statistics and Quantitative Methods, University of Milano-Bicocca, [email protected]
Beyond indicators: a Causal Approach to Gender Statistics
Silvia Caligaris1 and Fulvia Mecatti2
Abstract
Most gender statistical measures proposed in the last decades are composite indicators, i.e. weighted linear combinations of basic statistics such as ratios and percentages. Composite indicators thus involve several arbitrary choices (for instance the weighting/aggregating system, variable selection, standardization) affecting both the transparency and the interpretation of the indexes. Furthermore, gender inequality is a complex latent phenomenon, a collection of disparate and inter-linked issues that can hardly be captured in a single indicator. The development of statistical tools and ad hoc models is therefore required. The aim of this work is to explore the potential of graphical models as a language able to clearly represent the complex arrangement of relationships among a collection of variables selected for the statistical assessment of gender disparities. The causal approach, as traditionally applied in genetics and epidemiology, will be adopted. We will focus on causal graphs, which allow for deepening and interpreting the causal mechanism that may have originated a gender gap as well as for exploring the effects of gender-tailored policies. Causal models indeed provide transparent mathematical tools to implement the assumptions underlying any causal inference, translating them into joint distributions and reading off the conditional independences according to the d-separation criterion (Pearl, 2000). The potential of this methodology will be shown in deriving causal effects in non-experimental studies, representing policies’ effects and interventions through the do operator, controlling for confounders and interpreting counterfactuals.
References
CALIGARIS, S., MECATTI, F., CRIPPA, F. (2013): A Narrower Perspective? From a Global to a Developed-Countries Gender Gap Index: a Gender Statistics Exercise. Statistica, special issue on gender studies in press.
MECATTI, F., CRIPPA, F., FARINA, P. (2012): A Special Gen(d)er of Statistics: Roots, Development and Methodological prospects of a Gender Statistics. International Statistical Review, 80, 452–467.
PEARL, J. (2000): Causality: Models, Reasoning and Inference. Cambridge University Press, New York.
Keywords
CAUSAL MODELS, CONDITIONAL INDEPENDENCES, d-SEPARATION CRITERION, GENDER GAP INDEXES
University of Milano-Bicocca, Piazza dell’Ateneo Nuovo, 1- 20126, [email protected]; [email protected]
Gender Differentials In Higher Education: Hints From A Fuzzy States Analysis
Franca Crippa1, Marcella Mazzoleni2 and Mariangela Zenga2
Abstract
Higher education (HE) persistence has recently shown a turn in favour of the female population, which graduates more often within the expected timeframe and is less exposed to drop-out than males (OECD, 2012). This shift from a past situation of generalized HE male predominance to the present female outperformance appears as an inversion in differentials by gender, whose intensity varies according to HE choices as well as to the career stage. Whilst individual or institutional determinants have been widely considered, mainly in terms of overall attainments, HE students’ intermediate results and strategies have received less attention.
The paper examines a methodological alternative to the existing measures of gender differentials in coping with undergraduate university requirements. In particular, Markov chains with fuzzy states are applied so as to highlight paths and to derive synthetic indicators of gender gaps, whatever the direction of the latter might be, apt both to be repeated in time and to give insight into undergraduates’ choices and strategies at specific time points.
References
OECD (2012): Gender Equality in Education, Employment and Entrepreneurship: Final Report to the MCM 2012, Paris, 23-24 May 2012, http://www.oecd.org/social/family/50423364
KALAMATIANOU, A.G. and KOPUGIOUMOUTZAKI, F. (2012): Employment Status and Job-Studies Relevance of Social Science. International Journal of Economic Sciences and Applied Research, 1, 51–75.
SYMEONAKI, M. and STAMOUB, G.B. (2004): Theory of Markov systems with fuzzy states, Fuzzy Sets and Systems 3, 427–445.
Keywords
DIFFERENTIAL, GENDER, MARKOV CHAINS, FUZZY STATES
Department of Psychology, University of Milano-Bicocca, piazza dell’Ateneo Nuovo, 1 Milan, Italy [email protected] · Department of Statistics and Quantitative Methods, University of Milano-Bicocca, via Bicocca degli Arcimboldi, 8, Milan, Italy [email protected], [email protected]
Analysing Categorical Variables With Similar Categories: Constrained Multiple Correspondence Analysis
Véronique Cariou and El Mostafa Qannari
Abstract
Multiple Correspondence Analysis (MCA) aims at analysing a categorical data table by exhibiting a small set of axes (also called scores). These are built in order to maximize the sum of their squared correlation ratios with the different categorical variables. Let us consider $K$ categorical variables, where $\aleph_k$ is the $k$th one. If we represent each variable $\aleph_k$ by its indicator matrix $X_k$, the first MCA component $t$ maximizes $\sum_k \mathrm{corr}^2(t, A_k t)$, where $A_k$ is the projector associated with $X_k$: $A_k = X_k (X_k^T X_k)^{-1} X_k^T$.
Constrained MCA introduces a new constraint on MCA in order to explore and visualize categorical variables having the same set of categories. This kind of data may occur in applications such as sensory analysis and Just About Right data. Constrained MCA assumes that the $K$ dummy data tables (or alternatively indicator matrices) share common loadings. It proceeds step by step by computing at each step the common vector of loadings and the common components. Formally, we seek at each step a component $t$ and a common vector of loadings $u$ which maximize the same criterion as above, with $t = Vu$, where $V = \sum_k \alpha_k X_k$ is the optimal linear combination of the different indicator matrices. The solution of this maximization problem is simple. It consists of an iterative algorithm in the course of which $\alpha$ and $u$ are alternately updated.
The method of analysis is illustrated on the basis of a case study.
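The criterion can be evaluated without forming any projector: applying $A_k$ to a score vector replaces each score by the mean of its category, so $\mathrm{corr}^2(t, A_k t)$ equals the ratio of between-category to total sum of squares (the squared correlation ratio). The sketch below computes the criterion for a given score vector under that identity; it does not perform the maximization or the alternating updates of the constrained method, and the function names are hypothetical.

```python
def squared_correlation_ratio(scores, labels):
    """Between-category share of the variance of scores.

    Equals corr^2(t, A_k t) when A_k projects onto the indicator matrix
    of the categorical variable given by labels.
    """
    n = len(scores)
    grand_mean = sum(scores) / n
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    total = sum((score - grand_mean) ** 2 for score in scores)
    if total == 0:
        raise ValueError("scores must not be constant")
    return between / total


def mca_criterion(scores, variables):
    """Sum of squared correlation ratios over all categorical variables."""
    return sum(squared_correlation_ratio(scores, labels) for labels in variables)
```

For instance, scores `[1, 1, 3, 3]` are perfectly explained by labels `["a", "a", "b", "b"]` (ratio 1.0) and not at all by `["a", "b", "a", "b"]` (ratio 0.0), so the criterion over both variables is 1.0.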
References
GREENACRE, M. and BLASIUS, J. (2006): Multiple Correspondence Analysis and Related Methods. Chapman and Hall. CRC Press.
Keywords
MULTIPLE CORRESPONDENCE ANALYSIS, SENSORY DATA
UNAM University, ONIRIS, USC “Sensometrics and Chemometrics Laboratory”, Nantes, F-44322, France. INRA, Nantes, F-44316, France. [email protected]; [email protected]
Constrained Dual Scaling of Successive Categories for Detecting Response Styles
Pieter C. Schoonees1,2, Michel van de Velden1, and Patrick J.F. Groenen1
Abstract
Dual scaling is a multivariate exploratory method equivalent to correspondence analysis for analyzing contingency tables. However, for Likert-scale data collected from surveys with multiple questions, it is shown here that a peculiarity of dual scaling can be exploited to detect differences in response styles.
Response styles arise in questionnaire research when respondents tend to use rating scales in a manner unrelated to the actual content of the survey questions, often biasing results. Interpreting a response style as a nonlinear mapping of a group of respondents’ latent preferences to a rating scale allows for four main types of response styles to be modeled by quadratic monotone splines. Using this and the link between dual scaling and correspondence analysis, a spline-based constrained version of dual scaling is devised which can detect the presence of the four main types of response styles.
The method is based on an optimality criterion which is subsequently extended to allow for multiple response styles. A computationally intensive alternating nonnegative least squares algorithm is devised for estimating the parameters, which include latent classes for group membership. It is shown how the method can be used to create a data set in which the effects of response styles have been removed. The impact of this purging of response styles on the results from typical analyses of ratings data is illustrated.
Keywords
MONOTONE SPLINES, NONNEGATIVE LEAST SQUARES, CORRESPONDENCE ANALYSIS
Econometric Institute, Erasmus University Rotterdam, PO Box 1738, 3000 DR Rotterdam, The Netherlands · [email protected]
ORTHOMALS: Orthogonal Projection Of A Multiple Correspondence Solution On A Design Space
Ralph C.A. Rippe1 and Willem J. Heiser2
Abstract
Multiple correspondence analysis (MCA or HOMALS) (Gifi, 1990) aims to find homogeneous groups over more than two nominal variables. However, interpretability of the solution suffers strongly when the data matrix has structural omissions, that is, when the missing values are there by design. An unknown number of primary dimensions are then solely determined by the structural incompleteness, instead of delivering substantive information.
ORTHOMALS adapts the original multiple correspondence algorithm by restricting the solution to be orthogonal to the design space in each iteration. The design space can be obtained from e.g. OVERALS. The main focus in this work is on the orthogonality restriction, not on obtaining the design space.
We show through a simulation study with different levels of incompleteness that accurate correspondence recovery is obtained in situations with up to 80% incomplete data. Its recovery behavior is however not linear. We observe an initial decrease of recovery with increasing incompleteness, while with further increases of incompleteness we see increasing recovery.
The new algorithm is applied to the assessment of mathematical problem solving skills in primary school children. More specifically we use the mathematical division strategy data of CITO PPON 2004, resulting in a solution that is similar (in the first two dimensions) to that in the PPON 1997; Realistic and Traditional strategies were still combined with lacking or faulty strategies, whereas the combination of Realistic and Traditional strategies seldom occurs.
References
Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: Wiley.
Keywords
MULTIPLE CORRESPONDENCE, ORTHOGONALITY, RESTRICTION, PROJECTION, INCOMPLETE
Leiden University, Inst. of Educ. & Child [email protected] ·Leiden University, Institute of [email protected]
Squared Covariances Or Chi-Squared Statistics Based Distances
Antoine de Falguerolles
Abstract
In a 2 by 2 contingency table, Pearson’s Chi-squared statistics for independence is equal,up to the sample size, to the squared Pearson’s correlation between two binary quantita-tive variables obtained by coding the levels with arbitrarynumerical values. The squar-ing achieves here a limited form of invariance which may be ofinterest for some multi-variate analyses. In multidimensional scaling or in clustering, the emphasis may be ondistances based on the magnitudes of measures of co-variation regardless of their signs.A related issue is that of positive semi-definiteness of the matrix of these measures ofco-variation, a property central to visualization techniques such as PCA or metric MDS.
In this presentation, I shall advocate the use of squared covariances or squared corre-lations between any two quantitative variables. It turns out that the non-negative matrixthus formed is positive semi-definite, a property also shared by the matrix of squaredconditional (or partial) correlations.
I shall also consider the case of general multi-way tables. In line with the result above, the matrix of Pearson's Chi-squared statistics of independence of all marginal two-way tables is positive semi-definite. But the case of conditional independence is less straightforward (see references below). I shall advocate here the use of the matrix of Chi-squared statistics of independence between any two variables given all other variables which, in most applications, turns out to be positive semi-definite.
References
SAPORTA, G. (1976): Quelques applications des opérateurs d'Escoufier au traitement des variables qualitatives, Statistique et analyse des données, t.1, 38-46.
DAUDIN, J.-J. (1979): Coefficient de Tschuprow partiel et indépendance conditionnelle, Statistique et analyse des données, t.3, 55-58.
Keywords
COVARIANCE, CONDITIONAL COVARIANCES, PEARSON'S CHI-SQUARED STATISTICS, DISTANCE
Université de Toulouse III (Retired),[email protected]
A New Constant Memory Recursion For Hidden Markov Models
Francesco Bartolucci1 and Silvia Pandolfi2
Abstract
In this work, we develop the recursion for hidden Markov models proposed by Bartolucci and Besag (2002) and we show how it may be employed to implement an estimation algorithm for these models which requires an amount of memory that does not depend on the length of the observed series of data. This recursion allows us to obtain the conditional distribution of the latent state at every occasion, given the previous state and the observed data. Compared with the estimation algorithm based on the well-known Baum-Welch recursions (Baum et al., 1970; Welch, 2003), which requires an amount of memory that increases with the sample size, the proposed algorithm also has the advantage of not requiring dummy renormalizations to avoid numerical problems. Moreover, it directly allows us to perform global decoding of the latent sequence of states, without the need for a Viterbi method and with a substantial reduction of the memory requirement with respect to the latter.
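For orientation, the sketch below shows the textbook forward recursion with per-step rescaling, i.e. the kind of renormalization that the abstract says the proposed algorithm avoids. It is not the Bartolucci-Besag recursion; the function name and the parameter layout are assumptions made for this illustration.

```python
import math


def forward_loglik(init, trans, emis, obs):
    """Log-likelihood of obs under a discrete HMM (standard scaled forward pass).

    init[k]     : initial probability of state k
    trans[j][k] : probability of moving from state j to state k
    emis[k][y]  : probability of observing symbol y in state k
    """
    k_states = len(init)
    alpha = [init[k] * emis[k][obs[0]] for k in range(k_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step and weight by the emission probability.
            alpha = [
                sum(alpha[j] * trans[j][k] for j in range(k_states)) * emis[k][obs[t]]
                for k in range(k_states)
            ]
        # Rescale so that alpha sums to one; accumulate the log of the scale.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik
```

Evaluating the likelihood this way keeps only the current `alpha` in memory; the Baum-Welch E-step additionally stores the forward quantities for all occasions so that they can be combined with the backward pass, which is where the memory cost grows with the series length.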
References
BARTOLUCCI, F. and BESAG, J. (2002). A recursive algorithm for Markov random fields. Biometrika, 89, 724-730.
BAUM, L. E., PETRIE, T., SOULES, G., and WEISS, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41:164–171.
WELCH, L. R. (2003). Hidden Markov models and the Baum-Welch algorithm. IEEE Information Theory Society Newsletter, 53:1–13.
Keywords
EXPECTATION-MAXIMIZATION ALGORITHM, FORWARD-BACKWARD RECURSIONS, GLOBAL DECODING, HIDDEN MARKOV MODELS, VITERBI ALGORITHM
Department of Economics, Finance and Statistics University of Perugia (IT)[email protected] · Department of Economics, Finance and Statistics University of Perugia (IT)[email protected]
Detecting Local Dependence In Binary Data Latent Class Models: Some Developments
Daniël Oberski
Abstract
Binary data latent class models crucially assume local independence, violations of which can seriously bias the results. Monitoring possible local dependencies is therefore vital. I present three tools for detecting local dependence after fitting a latent class model: the bivariate Pearson residual, the score test, and the expected parameter change, and note the relationships between these measures. Some recent work on detecting local dependence is discussed, and an application to published data is presented.
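As an illustration of the first of these tools, the sketch below computes a Pearson-type residual on the two-way table of an item pair, comparing observed counts with the counts expected under local independence. The class sizes and conditional response probabilities are hypothetical, and the exact scaling used in the talk or in particular software is not stated in the abstract, so this should be read as one plausible form only.

```python
def expected_pair_table(n, class_probs, p_j, p_k):
    """Expected 2x2 counts for binary items j and k under local independence.

    class_probs[c] : estimated size of latent class c
    p_j[c], p_k[c] : estimated P(item = 1 | class c) for items j and k
    """
    table = [[0.0, 0.0], [0.0, 0.0]]
    for pi_c, pj, pk in zip(class_probs, p_j, p_k):
        for a in (0, 1):
            for b in (0, 1):
                prob_a = pj if a == 1 else 1.0 - pj
                prob_b = pk if b == 1 else 1.0 - pk
                table[a][b] += n * pi_c * prob_a * prob_b
    return table


def bivariate_residual(observed, expected):
    """Pearson chi-squared discrepancy summed over the four cells of the pair table."""
    return sum(
        (observed[a][b] - expected[a][b]) ** 2 / expected[a][b]
        for a in (0, 1)
        for b in (0, 1)
    )
```

A large value for a given item pair suggests that the fitted classes do not fully account for the association between those two items.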
References
OBERSKI, D., VAN KOLLENBURG, G., AND VERMUNT, J. (submitted). A Monte Carlo evaluation of three methods to detect local dependence in binary data latent class models.
OBERSKI, D. AND VERMUNT, J. (submitted). The Expected Parameter Change (EPC) for local dependence assessment in binary data latent class models.
OBERSKI, D. (submitted). Change in SEM parameters of interest as a criterion for partial measurement invariance: The EPC-interest.
Keywords
LOCAL INDEPENDENCE; FINITE MIXTURE MODEL; SCORE TEST; GENERALIZED SCORE
Department of Methodology and Statistics, Tilburg University, The [email protected]
Power and Sample Size Determination for Latent Class Models
Dereje W. Gudicha, Jeroen K. Vermunt, and Fetene B. Tekle1
Abstract
Latent class (LC) models are most frequently used by social, behavioral, and medical science researchers, for example, to build latent subgroups based on data from multivariate categorical variables, to classify cases to their most likely latent classes, to analyze agreement data from different raters, and to evaluate the sensitivity and specificity of diagnostic tests for which a gold standard is not available. Despite such attractive applications and their increasing popularity in widely diverging research areas, little is known about statistical power and sample size for LC models. The objectives of this paper are twofold. First, a Wald based power analysis method for parameters that describe a relationship between an indicator and a categorical latent variable is proposed. Second, the design factors that affect the power of statistical tests are studied. We show how the most important design factors of LC models are related via the information matrix, and how this information matrix is affected by the fact that the latent class membership is not observable. The proposed method is illustrated with numerical examples for different scenarios of design factors. A simulation study conducted to assess the performance of the proposed power analysis procedure showed that the procedure will work for many practical applications of LC models.
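The general logic of a Wald based power calculation can be sketched for a single parameter as follows. This is a generic normal-approximation sketch, not the authors' procedure: the LC-specific step, obtaining the per-observation standard error from the information matrix of a model with unobserved class membership, is assumed to have been done already and enters only through the hypothetical argument `se_unit`.

```python
from statistics import NormalDist


def wald_power(effect, se_unit, n, alpha=0.05):
    """Approximate power of a two-sided Wald z-test of H0: parameter = 0.

    effect  : assumed true parameter value under the alternative
    se_unit : standard error at n = 1, i.e. the square root of the inverse
              per-observation information (assumed given)
    """
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * n ** 0.5 / se_unit  # noncentrality on the z scale
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


def required_n(effect, se_unit, target_power=0.8, alpha=0.05):
    """Smallest sample size (searched upward from 2) reaching the target power."""
    n = 2
    while wald_power(effect, se_unit, n, alpha) < target_power:
        n += 1
    return n
```

Because `se_unit` grows when the classes are poorly separated, the same effect size requires a larger `n` in that situation, which is one way the design factors mentioned above enter the calculation.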
Keywords
LATENT CLASS MODELS; SAMPLE SIZE; STATISTICAL POWER; INFORMATION MATRIX; DESIGN FACTOR
Department of Methodology and Statistics, Tilburg University, Tilburg, The Netherlands
The Bias-Adjusted Three-Step Approach To Latent Class Modeling With External Variables
Zsuzsa Bakk1, Daniel Oberski1, and Jeroen K. Vermunt1
Abstract
A popular way to connect latent class membership to external variables is to relate the external variables to the estimated scores on class membership; this approach is called three-step latent class analysis (LCA). While the three-step LCA is a popular approach, until recently it had the disadvantage that the parameters describing the association of latent class membership and auxiliary variables were underestimated (Bolck, Croon and Hagenaars, 2004). In the current paper we present how unbiased parameter estimates of this association can be obtained, by using the known classification error probabilities as fixed value parameters in the third step analysis (Vermunt, 2010; Bakk, Tekle and Vermunt, in press). In addition to correct parameter estimates, we also show how correct standard error (SE) estimates can be obtained. We show the results of a simulation study in which we test the performance of the parameter bias correction and the SE bias correction methods.
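The classification error probabilities mentioned above can be estimated from the posterior class membership probabilities obtained in the first step. The sketch below assumes modal assignment in the second step and a hypothetical list-of-lists layout for the posteriors; it illustrates only how the error matrix is formed, not the third-step estimation itself.

```python
def classification_error_matrix(posteriors):
    """Estimate D[t][s] = P(assigned class = s | true class = t).

    posteriors[i][t] : estimated posterior probability that case i belongs
                       to latent class t (each row sums to one)
    Cases are assigned to their most likely class (modal assignment).
    """
    n_classes = len(posteriors[0])
    counts = [[0.0] * n_classes for _ in range(n_classes)]
    for post in posteriors:
        assigned = max(range(n_classes), key=lambda t: post[t])
        for t in range(n_classes):
            # Case i contributes its posterior weight for class t
            # to the column of the class it was assigned to.
            counts[t][assigned] += post[t]
    return [[c / sum(row) for c in row] for row in counts]
```

The closer this matrix is to the identity, the less the association estimated in the third step is attenuated by classification error.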
References
Bolck, A., Croon, M.A. and Hagenaars, J.A. (2004): Estimating Latent Structure Models with Categorical Variables: One-Step versus Three-Step Estimators. Political Analysis, 12, 3-27.
Vermunt, J.K. (2010): Latent Class Modeling with Covariates: Two Improved Three-Step Approaches. Political Analysis, 18, 450-469.
Bakk, Z., Tekle, F.T. and Vermunt, J.K. (in press): Estimating the Association Between Latent Class Membership and External Variables Using Bias-adjusted Three-step Approaches. Sociological Methodology.
Keywords
LATENT CLASS ANALYSIS, THREE STEP APPROACH, COVARIATES, BIAS ADJUSTMENT
Tilburg University, PO BOX 90153, Tilburg, The [email protected]
Comparative Analysis on LDA-based Classification and Subject Categories of the Japanese Awards Database of Grants-in-Aid for Scientific Research, KAKEN
Kei Kurakawa1, Yuan Sun2, and Yasumasa Baba3
Abstract
Since research and development projects are increasingly carried out in a more competitive environment than ever before, it becomes more important to evaluate project results. In the evaluation process, factual data on project results are aggregated from several kinds of sources across different databases and arranged along the same axes. A set of subject categories is one of the major evaluation axes and is meant to be integrated across different databases. For example, a bibliometrics-based research evaluation tool, InCites by Thomson Reuters, provides the subject categories of Web of Science, which are mapped to the fields of science and technology of the OECD Frascati Manual [OECD2007], which represents a standard for research and development evaluation methods. Such a mapping of subject categories among different databases ideally ought to be automated for timely and appropriate research evaluation. As a starting point, we compared the subject categories of the national grants database KAKEN (http://kaken.nii.ac.jp) with topics derived by the topic model LDA (Latent Dirichlet Allocation) [BLEI2003] from the keywords of projects in KAKEN. The subject categories and the topics assigned to each project are analyzed through the purity index [ZHAO2001].
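A purity index of the kind referred to above can be sketched as follows. The definition used here is the common one (the share of items that belong to the majority category of their cluster); the abstract does not spell out which variant was used, and the labels in the usage example are invented.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Fraction of items that fall in the majority class of their cluster."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    # For each cluster, count the items in its most frequent class.
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(cluster_labels)


# Hypothetical example: LDA topic per project vs. subject category per project.
topics = [0, 0, 0, 1, 1, 1]
categories = ["math", "math", "bio", "bio", "bio", "cs"]
score = purity(topics, categories)
```

A purity of 1 would mean that every topic contains projects from a single subject category; values near the share of the largest category indicate little agreement.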
References
BLEI, D. M., NG, A. Y., and JORDAN, M. I. (2003): Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.
OECD. (2007): Revised Field of Science and Technology (FOS) Classification in the Frascati Manual.
ZHAO, Y. and KARYPIS, G. (2001): Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, MN.
Keywords
KAKEN, SUBJECT CATEGORIES, LDA, TOPIC CLASSIFICATION
National Institute of Informatics, Tokyo, [email protected] · National Institute of Informatics, Tokyo, [email protected] · The Institute of Statistical Mathematics, Tokyo, [email protected]
Prototype Identification through Archetypes
Giancarlo Ragozini1
Abstract
A prototype is an element chosen to represent a cluster in order to provide a simplified description of it. Prototypes are usually derived by minimizing some adequacy criterion. The best-known approach to obtaining them is the constant radius method (e.g. the k-means algorithm and the related methods). This approach gives good results when dealing with elliptical clusters, but it can become unstable and may fail to identify the clusters correctly in other cases. Furthermore, the cluster centroids can be too average and, hence, the prototypes may not be well distinguished and separated. In the present paper we propose a new method for prototype identification based on archetypes, i.e. a few "pure" points lying on the boundary of the data scatter and characterizing the archetypal pattern in the data. Archetypes span a space in which data, whether single valued or interval valued, have new coordinates, the so-called barycentric coordinates. We propose to perform the clustering procedure and the prototype identification in this new space, which provides an outward-inward perspective on the data, by using a proper compositional distance. The proposed procedure yields prototypes that are well separated and have clear profiles.
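Barycentric coordinates are non-negative and sum to one, so they can be treated as compositions. One widely used compositional distance is the Aitchison distance, sketched below via the centred log-ratio transform. Whether this is the distance used in the paper is an assumption; the abstract only speaks of "a proper compositional distance". The sketch also assumes strictly positive parts, so zeros would need separate treatment.

```python
import math


def clr(composition):
    """Centred log-ratio transform of a composition with strictly positive parts."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))
```

The distance depends only on the ratios between parts, so rescaling a composition by a positive constant leaves it unchanged.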
References
AITCHISON, J., BARCELÓ-VIDAL, C., MARTÍN-FERNÁNDEZ, J.A., PAWLOWSKY-GLAHN, V. (2000): Logratio Analysis and Compositional Distance. Mathematical Geology, 32, 271–275.
CUTLER, A., BREIMAN, L. (1994): Archetypal Analysis. Technometrics, 36, 338–347.
D'ESPOSITO, M.R., PALUMBO, F., RAGOZINI, G. (2012): Interval Archetypes: a new tool for interval data analysis. Statistical Analysis and Data Mining, 5, 322–335. DOI:10.1002/sam.11140.
Keywords
BARYCENTRIC COORDINATES, COMPOSITIONAL DATA, SOFT CLUSTERING
Department of Political Sciences, Federico II University of Naples, [email protected]
Spatial Clustering based on Hierarchical Structure of Multidimensional Lattice Data
Koji Kurihara1 and Fumio Ishioka2
Abstract
Spatial data carry the values of surface variables at specified locations or regions. We focus on lattice data over a fixed subset D of d-dimensional Euclidean space. Lattice data are synoptic observations covering an entire spatial region, supplemented with neighborhood information. Such data arise as spatial epidemiological data, remote sensing data, regional lattice data and so on. There are several clustering approaches for such lattice data. Echelons (Myers et al., 1997; Kurihara, 2004) are a useful technique for studying the topological structure of a surface in a systematic and objective manner. The echelons are derived from the changes in topological connectivity with decreasing surface level. The echelon dendrogram represents the surface topology of lattice data, and the hierarchical structure of these data and their regional features are shown in it. In this paper, we apply the zone clustering method based on the peaks of the echelon dendrogram to multidimensional spatial lattice data. We obtain different zones based on a practical definition of the relation between peak and foundation. In addition, we give some illustrations of detecting hotspot areas for multidimensional spatial data.
References
KURIHARA, K. (2004): Classification of Geospatial Lattice Data and Their Graphical Representation. Classification, Clustering and Data Mining Applications (Edited by Banks, D. et al.). Springer, Berlin, Tokyo, 251–258.
MYERS, W.L., PATIL, G.P., JOLY, K. (1997): Echelon Approach to Areas of Concern in Synoptic Regional Monitoring. Environmental and Ecological Statistics, 4, 131–152.
Keywords
MULTIDIMENSIONAL SPATIAL DATA, ECHELON ANALYSIS, CLUSTERING
Graduate School of Environmental and Life Science, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, [email protected] · School of Law, Okayama University, 3-1-1 Tsushima-naka Okayama 700-8530, [email protected]
Research Literature Analytics through Mapping Narratives
Fionn Murtagh
Abstract
With large volumes of scholarly journal submissions, or conference paper submissions, it is useful and indeed necessary to determine narratives of writing and of research involved. The same issue arises in the narrative of research grant funding proposals.
Some conferences and journals now use matching of submissions with reviewers, based on the content of the submitted paper, and a collection of past work by the reviewers (Charlin et al., 2011). In Murtagh (2010) we looked at discipline themes and subthemes with implications for strategy, and thematic focus and coverage. This was in connection with the work of a national research funding agency. In Murtagh et al. (2011), we looked at narrative within published journal articles.
Our objectives include taking account of subdiscipline differentiation, and mapping the semantics of the content considered. For scalability, this work also involves use of the Solr (Apache Lucene) storage, retrieval and discovery system.
References
CHARLIN, L., ZEMEL, R. and BOUTILIER, C. (2011): A Framework for Optimizing Paper Matching. In Proceedings of 27th Conference on Uncertainty in Artificial Intelligence (UAI), Barcelona.
MURTAGH, F., GANZ, A. and REDDINGTON, J. (2011): Semantics from Narrative: State of the Art and Future Perspectives. In: M. Gettler Summa, L. Bottou, B. Goldfarb, F. Murtagh, C. Pardoux and M. Touati (Eds.): Statistical Learning and Data Science. Chapman & Hall/CRC, 91–102.
MURTAGH, F. (2010): The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information, Sixth Boole Lecture, Computer Journal, 53 (3), 304–315.
Keywords
CLUSTERING, FACTOR ANALYSIS, BIG DATA, ANALYTICS, SEMANTICS
Department of Computer Science, Royal Holloway, University of London, Egham TW20 0EX, [email protected]
Effects of Moment-to-moment Likeability Patterns on the Virality of Online Ads
Tammo Bijmolt1
Abstract
Classification methods have been developed and applied numerous times in the marketing research discipline, most notably cluster analysis and latent class methods in market segmentation studies. More recently, classification methods have been applied to new topics, such as online media and customer databases. The presentation will provide a brief overview of recent applications of classification methods in marketing and then illustrate this using a specific project. In particular, I will discuss a study on consumers' evaluation of online commercials and their willingness to share content (viral advertising). The analysis captures the dynamics of likeability evaluations by identifying moment-to-moment (MtM) patterns using trajectory finite mixture modelling and then examines the effect of these patterns on ad virality. The model is estimated using unique data consisting of more than 12,000 respondents and 30 ads. The results show, among other things, that high likeability values at the beginning and at the end of an ad are important, with the end effect being the larger of the two.
Faculty of Economics and Business, University of Groningen
Formal Concepts for Classification
Bernhard Ganter
Abstract
In recent decades, a rich mathematical theory was developed, which can be regarded as a theoretical basis of poly-hierarchical classification. It bears the name "Formal Concept Analysis". The usual tree hierarchies are replaced by mathematically more interesting structures, namely complete lattices, which are interpreted as hierarchies of formal concepts. This name refers to the fact that extensional and intensional hierarchies are jointly represented. The use of metric approaches is possible, but is of minor importance. Formal Concept Analysis has expressive graphics, an extensive algebraic theory and powerful algorithms. The mathematical setting is both simple and versatile, mathematically rigorous and flexible.
The origins and initial inspirations of this research area were within the classification societies. Meanwhile, an independent community with lively publication and conference activities has developed. The lecture describes methodology and applications of Formal Concept Analysis by means of simple examples and informs about recent developments.
Multinomial Logistic Regression Ensembles
Hongshik Ahn1
Abstract
We propose a method for multiclass classification problems using ensembles of multinomial logistic regression models. A multinomial logit model is used as a base classifier in ensembles from random partitions of predictors. The multinomial logit model can be applied to each mutually exclusive subset of the feature space without variable selection. By combining multiple models the proposed method can handle a huge database without a constraint needed for analyzing high-dimensional data, and the random partition can improve the prediction accuracy by reducing the correlation among base classifiers. The proposed method is implemented using R and the performance including overall prediction accuracy, sensitivity, and specificity for each category is evaluated on two real data sets and simulation data sets. To investigate the quality of prediction in terms of sensitivity and specificity, area under the ROC curve (AUC) is also examined. The performance of the proposed model is compared to a single multinomial logit model and it shows a substantial improvement in overall prediction accuracy. The proposed method is also compared with other classification methods such as Random Forest, Support Vector Machines, and Random Multinomial Logit Model.
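The two ensemble ingredients named above, a random partition of the predictors into mutually exclusive subsets and the combination of base-classifier predictions by majority voting (listed in the keywords), can be sketched as follows. The authors' implementation is in R and is not reproduced here; fitting the multinomial logit models is omitted, and all names below are hypothetical.

```python
import random
from collections import Counter


def random_partition(n_features, n_subsets, seed=0):
    """Split feature indices 0..n_features-1 into mutually exclusive random subsets."""
    indices = list(range(n_features))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so subset sizes differ by at most one.
    return [sorted(indices[i::n_subsets]) for i in range(n_subsets)]


def majority_vote(predictions):
    """Combine one predicted label per base classifier into a single label.

    Ties are resolved in favour of the label encountered first.
    """
    return Counter(predictions).most_common(1)[0][0]
```

In a full ensemble, one base model would be fitted per subset returned by `random_partition`, the partitioning would be repeated with different seeds, and `majority_vote` would be applied to the resulting predictions for each case.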
Keywords
CLASS PREDICTION; ENSEMBLE; LOGISTIC REGRESSION; MAJORITY VOTING; MULTINOMIAL LOGIT; RANDOM PARTITION
Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600
Age-specific Disease Network For The Major Disease In Korea
Taerim Lee1 and Hongseok Kim2
Abstract
Objectives: The purpose of this paper is to analyze the relationship among major diseases in Korea using social network analysis and word clouds based on literature data. Differences across three age groups are also studied.
Methods: We used social network analysis to draw a network graph for major diseases based on the relationships among the diseases found by a literature search, using the prevalence rate and the mortality of these diseases from the 2011 Korean National Health and Nutrition Examination Survey and the causes of death statistics in Korea.
Results: We find that smoking and obesity are the most important factors causing other diseases. Except for obesity, anemia, hepatitis, atopic dermatitis and some other diseases, most diseases become more common and more dangerous in the older age groups. These results can be recognized visually in the graphs produced by social network analysis and Wordle.
Conclusions: We constructed age-specific social network graphs of 24 major diseases in Korea across three age groups. We found that most diseases become more prevalent and more severe as people get older.
Keywords
SOCIAL NETWORK ANALYSIS, DISEASE NETWORK, KOREAN DISEASE NETWORK, WORD-CLOUD, WORDLE
Dept. of Information Statistics, KNOU · A Public Health Doctor at Suncheon city health center, Dept. of Information Statistics, KNOU
Analysis of Questionnaire Survey with Ordinal-polytomous Using the Binomial Confidence Limits
Ueno, T.1, Tatsunami, S.1, Otaki, M.2, and Kuwabara, R.2
Abstract
The questionnaire survey is used frequently in investigations of quality of life (QOL) as well as other social problems.
We assume that the response scale of the survey form is ordinal-polytomous and consider data from $n$ responders on a questionnaire instrument consisting of $m$ items. Let $y_{ij}$ be the response to the $j$-th item from the $i$-th responder and $\bar{y}_i$ the average of the $i$-th responder's responses. Put $z = (z_{ij})$ as follows:
$$
z_{ij} =
\begin{cases}
\text{N.A.} & \text{if } y_{ij} \text{ is N.A.,} \\
1 & \text{if } y_{ij} \ge \bar{y}_i, \\
0 & \text{if } y_{ij} < \bar{y}_i.
\end{cases}
$$
Let $p$ be the probability that $z_{ij} = 1$; then the number of 1s in each column of $z$ is a variable with the binomial distribution $B(n, p)$. We count the number of 1s in the $j$-th column of $z$ and classify the items into a few levels by using the binomial confidence limits. The number of levels should be determined appropriately depending on the number of columns. Then we pick up all items with the same level and apply the above procedure to them repeatedly. We apply this procedure until we can no longer separate any group of items into smaller groups. In this way we obtain a classification of the items of a questionnaire survey. The correlation of the items in the same level is not necessarily high. We consider that this procedure is useful in the interpretation of a questionnaire survey.
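The first two ingredients of the procedure, the dichotomization into z and a binomial confidence interval for an item's count of 1s, can be sketched as follows. The abstract does not say which binomial confidence limits are used, so the Wilson score interval below is a stand-in chosen for this illustration; missing responses are represented by `None`, and every responder is assumed to have answered at least one item.

```python
from statistics import NormalDist


def dichotomize(responses):
    """Map responses[i][j] to 1 if it is at least responder i's own mean, else 0.

    Missing responses (None) stay None and are excluded from the mean.
    """
    z = []
    for row in responses:
        answered = [v for v in row if v is not None]
        mean = sum(answered) / len(answered)
        z.append([None if v is None else int(v >= mean) for v in row])
    return z


def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    p_hat = successes / n
    denom = 1.0 + q * q / n
    centre = (p_hat + q * q / (2 * n)) / denom
    half = q * (p_hat * (1 - p_hat) / n + q * q / (4 * n * n)) ** 0.5 / denom
    return centre - half, centre + half
```

Items whose column proportions of 1s fall outside the limits computed for the reference value of p would then be placed in a higher or lower level, and the procedure repeated within each level as described in the text.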
References
FAYERS, P. M. and MACHIN, D. (2000): Quality of Life, Assessment, Analysis and Interpretation. Wiley, England.
Keywords
QUESTIONNAIRE SURVEY, BINOMIAL CONFIDENCE LIMIT
Medical Statistics, St. Marianna University School of Medicine, Kawasaki, Japan [email protected] · Institute of Radioisotope Research, St. Marianna University Graduate School of Medicine, Kawasaki, Japan 216-8511
Comparison Of Methods For Handling Missing Data In A Multi-Item Instrument
I. Eekhout123, H.C.W. de Vet13, J.W.R. Twisk123, J.P.L. Brand4, M.R. de Boer25, andM.W. Heymans123
Abstract
Regardless of the proportion of missing values, complete-case analysis is most frequently applied, although advanced techniques such as multiple imputation are available. The objective of this study is to explore the performance of simple and more advanced methods for handling missing data in case some, many, or all item scores are missing in a multi-item instrument.
Real-life missing data situations were simulated in a multi-item variable used as a covariate in a linear regression model. Various missing data mechanisms were simulated with an increasing percentage of missing data. Subsequently, several techniques to handle missing data at the level of the item score and the total score were applied, such as mean imputation, two-way imputation and multiple imputation, to decide on the best technique for each scenario. Fitted regression coefficients were compared, using the bias and coverage as performance parameters.
Mean imputation caused biased estimates in every missing data scenario when data are missing for more than 10
We recommend applying multiple imputation to the item scores in order to get the most accurate regression model estimates. Moreover, we advise not to use any form of mean imputation to handle missing data, despite the fact that this is often recommended in questionnaire manuals.
Keywords
MISSING DATA, MULTIPLE IMPUTATION, ITEM IMPUTATION, ORDINAL DATA, MULTI-ITEM QUESTIONNAIRE, SIMULATION
Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The [email protected] · Institute for Health Sciences, Faculty of Earth and Life Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Skyline Diagnostics, Rotterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands
Common and Cluster-specific Simultaneous Component Analysis
Kim De Roover1, Marieke E. Timmerman2, Batja Mesquita3 and Eva Ceulemans1
Abstract
In many fields of research, so-called 'multiblock' data are collected, i.e., data containing multivariate observations that are nested within higher-level research units (e.g., inhabitants of different countries). Each higher-level unit (e.g., country) then corresponds to a 'data block'. For such data, it may be interesting to investigate the extent to which the correlation structure of the variables differs between the data blocks. More specifically, when capturing the correlation structure by means of component analysis, one may want to explore which components are common across all data blocks and which components differ across the data blocks. Therefore, we propose a common and cluster-specific simultaneous component method which clusters the data blocks according to their correlation structure and allows for common and cluster-specific components. Model estimation and model selection procedures are presented and the method is applied to data from cross-cultural values research to illustrate its empirical value.
Keywords
SIMULTANEOUS COMPONENT ANALYSIS, CLUSTERWISE SIMULTANEOUS COMPONENT ANALYSIS, MULTIBLOCK DATA, MULTIGROUP DATA, MULTILEVEL DATA
Methodology of Educational Sciences Research Unit, KU Leuven, Andreas Vesaliusstraat 2, box 3762, 3000 Leuven, Belgium. Email:[email protected] · Heymans Institute of Psychology, University of Groningen · Social and Cultural Psychology Research Unit, KU Leuven
Extending Clusterwise non-negative matrix factorization (NMF) to hierarchically organized data
Joke Heylen1, Philippe Verduyn2, Iven Van Mechelen2 and Eva Ceulemans1
Abstract
Researchers are often interested in capturing variability in time profiles. Often these studies yield hierarchically organized time series data, in that the time profiles are nested within higher order units. For instance, when studying the intensity of emotions and how this fluctuates across time, researchers ask subjects to recollect distinct emotional episodes and to draw their intensity course over time (Verduyn et al., 2009). The question then arises how the variability in these time profiles can be captured, taking individual differences into account. To this end, we extend Clusterwise non-negative matrix factorization (NMF) (Heylen et al., 2012) to hierarchically structured data. In this extension, the higher order units (e.g., persons) are clustered according to the different shapes that their time profiles take. To gain insight into which shapes typically occur for the higher order units in specific clusters, we partition the time profiles within each cluster. We propose an algorithm for fitting the hierarchical clusterwise NMF model to data and evaluate it by means of a simulation study. Finally, we fit the model to empirical intensity profiles of emotional episodes nested within subjects.
References
HEYLEN, J., CEULEMANS, E., VAN MECHELEN, I. and VERDUYN, P. (2012, August): Clusterwise Non-negative Matrix Factorization (NMF) for capturing variability in time profiles. Paper presented at the International Conference on Computational Statistics, Limassol, Cyprus.
VERDUYN, P., VAN MECHELEN, I., TUERLINCKX, F., MEERS, K. and VAN COILLIE, H. (2009): Intensity profiles of emotional experience over time. Cognition and Emotion, 23(7), 1427–1443.
Keywords
TIME PROFILES, HIERARCHICALLY ORGANIZED DATA, CLUSTERING, FUNCTIONAL DATA ANALYSIS
Methodology of Educational Sciences, KU Leuven, [email protected] · Quantitative Psychology and Individual Differences, KU Leuven, Belgium.
Generalized Reduced Clustering Analysis
Michio Yamamoto
Abstract
This work develops a new procedure for simultaneously finding an optimal cluster structure of multivariate objects and an optimal subspace for clustering. The proposed method minimizes a distance between the objects and their projections, subject to clustering penalties, and it can be considered a generalized model that includes some existing cluster analyses with dimension reduction, such as the reduced k-means analysis (De Soete and Carroll, 1994) and the factorial k-means analysis (Vichi and Kiers, 2001). In addition, even if the data have a structure which is independent of the true cluster structure and affects the performance of clustering, the proposed method finds the optimal subspace to partition the objects by eliminating the effect of the disturbing structure. An efficient alternating least-squares algorithm, consisting of the gradient projection algorithm and the k-means algorithm, is described. Analyses of artificial and real data examples demonstrate that the proposed method can give correct results where existing methods cannot.
References
De Soete, G. and Carroll, J.D. (1994): K-means clustering in a low-dimensional Euclidean space. In: Diday, E., Lechevallier, Y., Schader, M., Bertrand, P., and Burtschy, B. (Eds.): New approaches in classification and data analysis. Springer, Heidelberg, 212-219.
Vichi, M. and Kiers, H.A.L. (2001): Factorial k-means analysis for two-way data. Computational Statistics & Data Analysis, 37, 49-64.
Keywords
DIMENSION REDUCTION, CLUSTERING, GRADIENT PROJECTION ALGORITHM, K-MEANS ALGORITHM
Osaka University, [email protected]
Mixtures Of Factor Analyzers And Unobserved Heterogeneity In Questionnaire Data
Robert Kapłon1
Abstract
A model of a mixture of factor analyzers (MFA) was proposed to concurrently perform clustering and reduction of the number of dimensions when the number of dimensions was relatively large in relation to the sample size. Whilst the classification and visualization of high-dimensional data seems to be the primary purpose, one may find MFA useful in accounting for population heterogeneity in data. This suggestion stems from the fact that finite mixture models have been successfully applied to explain heterogeneity among customers in many marketing problems. Thus, in this paper we consider the possibility of applying mixtures of factor analyzers to questionnaire data, so as to capture unobserved heterogeneity. Firstly, we show how a traditional factor analysis model that ignores heterogeneity can lead to misleading inferences. Afterwards, based on these theoretical findings, a simulation experiment is conducted to investigate features of data which may indicate unobserved heterogeneity, thereby justifying the use of a mixture of factor analyzers. These results are then used to propose a procedure that allows us to decide, without parameter estimation for the MFA model, which of these two competing models should be utilized. Finally, we test the proposed model on a real data set.
References
ALLENBY, G.M. and ROSSI, P. (1999): Marketing Models of Consumer Heterogeneity. Journal of Econometrics, 89, 57–78.
DILLON, W.R. and KUMAR, A. (1994): Latent Structure and Other Mixture Models in Marketing: An Integrative Survey and Overview. In: R.P. Bagozzi (Ed.): Advanced Methods of Marketing Research. Blackwell, Oxford, 295–351.
FRÜHWIRTH-SCHNATTER, S. (2006): Finite Mixture and Markov Switching Models. Springer.
MCLACHLAN, G.J. and PEEL, D. (2000): Finite Mixture Models. Wiley, New York.
Keywords
FACTOR ANALYZERS, MIXTURE MODELS, HETEROGENEITY
Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław,[email protected]
Estimation Methods for Categorical Marginal Models: Comparing MAEL, GEE, and GSK.
Renske E. Kuijpers1, Wicher P. Bergsma2, L. Andries van der Ark1, and Marcel A. Croon1
Abstract
Categorical marginal models can be used for modeling dependent data. For example, marginal models are used to construct hypothesis tests and standard errors for certain coefficients, such as Cronbach's alpha and scalability coefficients. The most used estimation method for marginal models is maximum likelihood (ML). However, for larger sets of items, problems with memory capacity occur. These problems can be avoided by using maximum augmented empirical likelihood (MAEL; Van der Ark, Bergsma & Croon, 2013). MAEL estimation uses all nonzero cells in a contingency table, plus a number of well-chosen zero cells. MAEL is a rather new method, and further investigation is needed. More common estimation methods for marginal models are generalized estimating equations (GEE) and GSK. GEE (Liang & Zeger, 1986) represents an extension of the generalized linear model (GLM). In contrast to ML estimation, GEE does not assume a certain probability model for the data. The GSK method (Grizzle, Starmer & Koch, 1969) is based on Weighted Least Squares (WLS). Here, the new estimation method MAEL is compared to GEE and GSK, using simulation studies as well as a real-data example.
References
GRIZZLE, J.E., STARMER, C.F. and KOCH, G.G. (1969): Analysis of Categorical Data by Linear Models. Biometrics, 25, 489-504.
LIANG, K-Y. and ZEGER, S.L. (1986): Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73, 13-22.
VAN DER ARK, L.A., BERGSMA, W.P. and CROON, M.A. (2013): Augmented Empirical Likelihood Estimation of Categorical Marginal Models for Large Sparse Contingency Tables. Manuscript submitted for publication.
Keywords
MARGINAL MODELS, MAXIMUM LIKELIHOOD, GENERALIZED ESTIMATING EQUATIONS, ESTIMATION
Tilburg University, [email protected] · London School of Economics
Applying Multilevel Latent Class Analysis To Large-Scale Educational Assessment Data: Predicting Students’ Mathematical Strategy Choices From Teachers’ Instructional Practice
Marije F. Fagginger Auer, Marian Hickendorff, and Cornelis M. van Putten
Abstract
The usefulness of multilevel latent class analysis (LCA) for educational data is demonstrated by applying this technique to data from the 2011 large-scale assessment of Dutch primary schools’ mathematics. The relation between the instructional practice reported by 107 teachers and the mathematical strategy choices of 1619 students was investigated. Multilevel LCA allowed modeling of the often ignored classroom effects, and one of its so far sparsely exploited features, the possibility of including predictors at different hierarchical levels, enabled modeling of the joint influence of teacher and student characteristics on learning outcomes. Four latent strategy choice classes of students were found, and teachers had a strong effect on students’ probability of being in these classes. Effects were found of student characteristics and of teachers’ strategy instruction, instruction formats and instruction differentiation. It is concluded that multilevel (teacher) effects should not be ignored in strategy research, and that multilevel LCA is especially suited for application in educational research.
Keywords
MULTILEVEL LATENT CLASS ANALYSIS, APPLICATION, EDUCATION
Leiden University, Institute of Psychology, Methods & Statistics, Wassenaarseweg 52, 2333 AK Leiden, the [email protected]
A Tuning Strategy for COSA
Maarten M.D. Kampert and Jacqueline J. Meulman
Abstract
It is well known that noise variables can overwhelm the few signals embedded in high-dimensional settings. To overcome this problem, Friedman and Meulman (2004) proposed clustering objects on subsets of attributes (COSA). This technique outputs a dissimilarity matrix that can be used in conjunction with a wide variety of (distance-based) clustering algorithms, including hierarchical methods. In order to avoid distinctly suboptimal solutions, COSA employs a homotopy strategy for which tuning parameters need to be set. However, clear guidance on the different choices for these tuning parameters has not yet been published. We propose a tuning strategy for hierarchical clustering. Furthermore, we compare COSA with other state-of-the-art methods on simulated and real-life data.
References
FRIEDMAN, J.H. and MEULMAN, J.J. (2004): Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, Series B, 66, 815–849.
Keywords
HIGH-DIMENSIONAL DATA, VARIABLE SELECTION, HIERARCHICAL CLUSTERING
Mathematical Institute, Leiden University
Accuracy Of Reliability Estimates
Pieter R. Oosterwijk, Klaas Sijtsma, and L. Andries van der Ark
Abstract
Test-score reliability is one of the most reported measures for assessing the measurement quality of psychological and educational tests. Well-known examples of estimates of test-score reliability are Cronbach’s coefficient alpha, Guttman’s lambda-2, the greatest lower bound, and the Molenaar-Sijtsma estimate. Coefficient alpha has received critical reviews for being incorrectly interpreted and being too conservative, and the greatest lower bound for being biased. However, the inaccuracy of the four reliability estimates has received little attention in the literature, although it is a threat to the practical usefulness of reliability estimates in small and modest samples. The actual extent of the inaccuracy of reliability coefficients due to factors such as sample size and number of items under empirical conditions is unknown. In a simulation study, we investigated the inaccuracy of coefficient alpha, Guttman’s lambda-2, the greatest lower bound, and the Molenaar-Sijtsma estimate. As measures of inaccuracy, we used the spread of the sampling distribution of a reliability estimate for different levels of sample size, number of items, number of answer categories, and value of the test-score reliability. We found that the spread of the sampling distributions (95% interpercentile range) mostly depends on sample size and number of items. For a multitude of conditions in the simulation design, the results show that reliability estimates are too inaccurate to be useful in practice.
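Two of the four estimates named in the abstract, coefficient alpha and Guttman's lambda-2, can be computed directly from the item covariance matrix. The following is a minimal sketch, assuming complete data with persons in rows and items in columns; the function names and the layout of the input are illustrative assumptions, not taken from the study.

```python
from math import sqrt


def covariance_matrix(scores):
    """Sample covariance matrix of item scores (rows = persons, columns = items)."""
    n = len(scores)
    k = len(scores[0])
    means = [sum(row[j] for row in scores) / n for j in range(k)]
    return [
        [
            sum((row[a] - means[a]) * (row[b] - means[b]) for row in scores) / (n - 1)
            for b in range(k)
        ]
        for a in range(k)
    ]


def cronbach_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    cov = covariance_matrix(scores)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)      # variance of the sum score
    item_var = sum(cov[j][j] for j in range(k))   # sum of the item variances
    return k / (k - 1) * (1 - item_var / total_var)


def guttman_lambda2(scores):
    """lambda-2, computed from the off-diagonal covariances."""
    cov = covariance_matrix(scores)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)
    off = [cov[a][b] for a in range(k) for b in range(k) if a != b]
    return (sum(off) + sqrt(k / (k - 1) * sum(c * c for c in off))) / total_var
```

Repeating such a computation over many simulated samples and taking the 2.5th and 97.5th percentiles of the resulting estimates would give the kind of interpercentile range the abstract uses as a measure of spread.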
Keywords
CRONBACH’S ALPHA, COEFFICIENT ALPHA, GREATEST LOWER BOUND TO THE RELIABILITY, GUTTMAN’S LAMBDA-2, MOLENAAR-SIJTSMA RELIABILITY, RELIABILITY ESTIMATION METHODS
Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, the [email protected]; [email protected]; [email protected]
A Big Data Intensive Application System with Symbolic Data Analysis and its Implementation
Hiroyuki Minami and Masahiro Mizuta
Abstract
Big Data analysis has become a prominent topic worldwide. Most reports focus on how to handle Big Data with database techniques. Some of them note the importance of a statistical approach, but only in passing. It is difficult for statisticians to analyze Big Data directly with conventional methods (mainly for n × p or n × n dissimilarity matrices), since doing so takes vast amounts of time and working memory on a typical computer.
Symbolic Data Analysis (SDA) is a powerful approach that is applicable to most of the characteristics of Big Data (those starting with “V”). A “second level” data expression such as a Concept in SDA is useful for overcoming Variety. It is also somewhat effective for Volume, since the expression can be regarded as data aggregation and thus shrinks the amount of data handled. However, Big Data analysis still needs considerable computing power and storage capacity even if the shrinking succeeds. The same holds for the other characteristic, Velocity.
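The aggregation idea can be illustrated with a small example. This is a minimal sketch, assuming that a concept is described by the [min, max] interval of each numeric variable over the records that belong to it; the record layout and the function name are hypothetical and do not reflect the authors' system.

```python
from collections import defaultdict


def aggregate_to_intervals(records, key, variables):
    """Collapse individual records into one interval-valued description per concept.

    records   : list of dicts, one per first-level observation
    key       : name of the field that identifies the concept (assumed layout)
    variables : names of the numeric fields to summarise as (min, max) intervals
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return {
        concept: {
            v: (min(r[v] for r in rows), max(r[v] for r in rows)) for v in variables
        }
        for concept, rows in groups.items()
    }
```

However many records a concept contains, its description keeps a fixed size, which is the sense in which aggregation shrinks Volume.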
Cloud computing, offered as a distributed computing facility with Hadoop, MapReduce functions, and a distributed file system, might lead us to a feasible solution. We have developed a statistical application system based on SDA to explore and exploit the affinity between cloud computing and SDA.
In the paper, we introduce our system and its implementation from a practical viewpoint. Through some examples, we discuss its performance and utility.
References
DIDAY, E. and NOIRHOMME-FRAITURE, M. (2008): Symbolic Data Analysis and the SODAS Software. Wiley.
MINAMI, H. and MIZUTA, M. (2012): SDA framework is the tool for Big Data Analysis? Book of Abstracts. 3rd Workshop in Symbolic Data Analysis, 21.
Keywords
SYMBOLIC DATA, CLOUD COMPUTING
Information Initiative Center, Hokkaido University, [email protected], [email protected]
A Generalization Of The Centre And Range Method For Fitting A Linear Regression Model To Symbolic Interval Data Using Ridge Regression, Lasso And Elastic Net Methods
Oldemar Rodríguez1
Abstract
In [RODRIGUEZ, O. (2000)] we made four proposals for linear regression with interval data: simple regression with empirical correlation, linear regression based on the maximum and minimum correlation, linear regression based on the mid-points, and linear regression based on the top-points of the hypercubes. Then, in [BILLARD, L., DIDAY, E., (2000)] the authors presented a linear model for an interval-valued data set that fits the mid-points of the interval values assumed by the variables in the learning data set and applies this model to the lower and upper boundaries of the interval values of the independent variables to do the prediction. In [LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2008-2010)] the authors proposed a new approach to symbolic interval data that fits the linear regression model on the mid-points and ranges of the interval values assumed by the variables in the learning set.
Ridge Regression shrinks the regression coefficients by imposing a penalty on their size, so that the coefficients minimize a penalized residual sum of squares. In the paper “Regression Shrinkage and Selection via the Lasso” [TIBSHIRANI, R., (1996)] the author proposes a new method for estimation in linear models that minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. The penalty used in the Lasso provides a natural variable selection that encourages sparsity and simplicity in the solution. In the paper [HASTIE, T., AND ZOU H. (2005)] the elastic net method was proposed; it is also a regularization and variable selection method, whose penalty is a convex combination of the lasso and ridge penalties.
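For reference, the three criteria described in this paragraph can be written out as follows. The notation is an assumption introduced here, following common textbook conventions: n observations, p predictors, intercept β_0, penalty weight λ ≥ 0, lasso bound t ≥ 0, and mixing weight α in [0, 1].

```latex
\begin{align*}
\hat\beta^{\mathrm{ridge}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2},\\
\hat\beta^{\mathrm{lasso}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2} \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j|\le t,\\
\hat\beta^{\mathrm{enet}} &= \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j\Bigr)^{2} + \lambda\sum_{j=1}^{p}\bigl(\alpha\,\beta_j^{2}+(1-\alpha)\,|\beta_j|\bigr).
\end{align*}
```

Setting α = 1 recovers the ridge penalty and α = 0 the lasso penalty in its Lagrangian form.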
In this paper we use Ridge Regression, Lasso and Elastic Net methods in order to improve the Centre and Range method for fitting a linear regression model to symbolic interval data. Finally, the approaches presented are applied to real and simulated data sets and their performance is compared with that of the Centre and Range method.
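The combination can be sketched for a single interval-valued predictor. This is an illustrative sketch only, assuming that one ridge-penalised line is fitted to the interval mid-points and a second to the half-ranges, and that the penalty applies to the slope but not to the intercept. The function names, the use of half-ranges, and the restriction to one predictor are simplifying assumptions, not the author's implementation.

```python
def ridge_line(x, y, lam=0.0):
    """Fit y = a + b*x with a ridge penalty lam on the slope only.

    Assumes x is not constant when lam is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    return mean_y - slope * mean_x, slope


def centre_and_half_range(intervals):
    """Split (lower, upper) intervals into mid-points and half-ranges."""
    centres = [(lo + hi) / 2 for lo, hi in intervals]
    half_ranges = [(hi - lo) / 2 for lo, hi in intervals]
    return centres, half_ranges


def fit_centre_range(x_intervals, y_intervals, lam=0.0):
    """Fit one line to the mid-points and a second line to the half-ranges."""
    xc, xr = centre_and_half_range(x_intervals)
    yc, yr = centre_and_half_range(y_intervals)
    return ridge_line(xc, yc, lam), ridge_line(xr, yr, lam)


def predict_interval(model, x_interval):
    """Predict a (lower, upper) interval from the two fitted lines."""
    (a_c, b_c), (a_r, b_r) = model
    lo, hi = x_interval
    centre = a_c + b_c * (lo + hi) / 2
    half = a_r + b_r * (hi - lo) / 2  # nothing here forces half >= 0
    return centre - half, centre + half
```

With `lam=0` both fits reduce to ordinary least squares; increasing `lam` shrinks both slopes towards zero.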
References
BILLARD, L., DIDAY, E., (2000). Regression analysis for interval-valued data. In: Data Analysis, Classification and Related Methods, Proceedings of the Seventh Conference of the International Federation of Classification Societies (IFCS’00), Springer, Belgium, pp. 369-374.
BILLARD, L., DIDAY, E., (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. J. Amer. Statist. Assoc. 98 (462), 470-487.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2008). The Elements of Statistical Learning; Data Mining, Inference and Prediction. New York: Springer.
HASTIE, T., AND ZOU H. (2005). Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67, Part 2, pp. 301-320.
GIORDANI P., (2011). Linear regression analysis for interval-valued data based on the Lasso technique. Technical Report n. 7, Department of Statistical Sciences, Sapienza University of Rome.
LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2008). Centre and range method to fitting a linear regression model on symbolic interval data. Computational Statistics and Data Analysis 52, 1500-1515.
LIMA-NETO, E.A., DE CARVALHO, F.A.T., (2010). Constrained linear regression models for symbolic interval-valued variables. Computational Statistics and Data Analysis 54, 333-347.
TIBSHIRANI, R., (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society - Series B 58, 267-288.
RODRIGUEZ, O. (2000). Classification et Modèles Linéaires en Analyse des Données Symboliques. Ph.D. Thesis, Paris IX-Dauphine University.
Keywords
LINEAR REGRESSION, ELASTIC NET, LASSO, RIDGE REGRESSION, SYMBOLIC DATA ANALYSIS.
CIMPA, School of Mathematics, University of Costa [email protected] // [email protected]
Symbolic Data Clustering. A Review
Justyna Wilk1
Abstract
Clustering is the unsupervised classification of patterns into relatively homogeneous groups and one of the most important methods of exploratory data analysis. However, clustering is a complex problem. The difficulty increases when clustering symbolic data, which are more complex than classical data. Symbolic data contribute significantly to data mining by presenting huge data sets in a reduced form and by describing phenomena more completely and naturally.
Although there are numerous studies on symbolic data analysis and its applications, there is no overview study that completes and systematizes the knowledge of symbolic data clustering. The subject of this paper is to consider methods suitable for symbolic data clustering. We present a taxonomy of clustering techniques and a review of their applications in symbolic data analysis. We discuss the clustering procedure for symbolic data and recommend methods suitable for symbolic data analysis.
References
BOCK, H.-H. and DIDAY, E. (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin Heidelberg.
DIDAY, E. and BRITO, P. (1989): Symbolic Cluster Analysis. In: O. Opitz (Ed.): Conceptual and Numerical Analysis of Data. Springer-Verlag, Berlin Heidelberg, 45–84.
EVERITT, B.S., LANDAU, S. and LEESE, M. (2001): Cluster Analysis. Arnold, London.
WILK, J. (2011): Analiza skupień na podstawie danych symbolicznych [Cluster analysis based on symbolic data]. In: E. Gatnar, M. Walesiak (Eds.): Analiza danych jakościowych i symbolicznych z wykorzystaniem programu R [Symbolic and qualitative data analysis using R software]. C.H. Beck, Warsaw, 262–279.
Keywords
CLUSTER ANALYSIS, SYMBOLIC DATA ANALYSIS, CLASSIFICATION
1Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]
The Ensemble Conceptual Clustering of Symbolic Data
Marcin Pełka1
Abstract
The ensemble approach, based on aggregating information provided by different models, has proved to be a very useful tool in the context of supervised learning. The main goal is to increase the accuracy and stability of the classification. Recently the same techniques have been applied to cluster analysis, where a better solution can be obtained by combining a set of different clusterings.
Since Michalski wrote about conceptual clustering as a new branch of machine learning (Michalski 1980), this task has received increasing attention. In conceptual clustering, cluster formation is driven not only by the inherent structure of the data but also by the description language available to the learner.
The article proposes to apply conceptual clustering in ensemble learning of symbolic data. The main contribution of the paper is a proposal for solving the theoretical problem of aggregating conceptual clustering results. An adaptation of bagging is proposed. In the empirical part of the paper, results of simulation experiments are presented (based on artificial and real symbolic data sets).
References
BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
FRED, A.L.N., JAIN, A.K. (2005): Combining multiple clustering using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, 835–850.
MICHALSKI, R.S. (1980): Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data into conjunctive concepts. International Journal of Policy Analysis and Information Systems, Vol. 4, 219–243.
Keywords
SYMBOLIC DATA ANALYSIS, ENSEMBLE CLUSTERING, CONCEPTUAL CLUSTERING
Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]
The Hierarchy Test Of Geographic Units based on Border Lengths
Andrzej Sokołowski1, Danuta Strahl2, Małgorzata Markowska3, and Marek Sobolewski4
Abstract
Whenever geographical or administrative units are subject to classification (clustering), one can wonder whether the results are influenced by an upper-level classification. If we cluster districts, prefectures, or counties into homogeneous groups, it would be interesting to know whether the partition has anything in common with upper-level regions or countries. In the paper we propose a procedure for testing such influence. Administrative units are neighbors with different common border lengths, and the differences in lengths are quite large. It is natural to assume that the relations between units may be, in a statistical sense, proportional to the common border length. A universal test cannot be suggested, since the location, neighborhood and common border lengths are different for each set of analysed units. We therefore propose a procedure based on a computer-intensive way of finding critical values for a given problem. Results are compared with a test based solely on the number of neighbors.
References
AUDRETSCH, D.B. and FELDMAN, M.P. (1996): R&D Spillovers and the Geography of Innovation and Production. The American Economic Review, vol. 86, No. 3, 630-640.
UNWIN, D.J. (1996): GIS, spatial analysis and spatial statistics. Progress in Human Geography, 20, 4, 540-551.
MOORE, D.A. and CARPENTER, T.E. (1999): Spatial Analytical Methods and Geographic Information Systems: Use in Health Research and Epidemiology. Epidemiologic Reviews, vol. 21, No. 2, 143-161.
Keywords
SPATIAL METHODS, CLUSTERING, INNOVATIONS
Cracow University of [email protected] · Wroclaw University of [email protected] · Wroclaw University of Economics [email protected] · Rzeszow University of [email protected]
Statistical Modeling of the Optimal Level of FX Reserves for Poland
Eugeniusz Gatnar
Abstract
Modeling the optimal level of international reserves is an important issue for central banks, especially in emerging economies such as Poland. FX reserves can be seen as a form of self-insurance against sudden stops in capital flows. Therefore, they can protect economies from crises and mitigate their impact but, on the other hand, they are costly.
Research on FX reserve adequacy was started in the late 1960s by Heller, and since then several models have been developed, e.g. by Frenkel and Jovanovic (1981), Wijnholds and Kapteyn (2001), Aizenman and Lee (2005), and Jeanne and Ranciere (2009).
In this paper we introduce a model that allows estimation of the optimal level of foreign exchange reserves for Poland.
References
AIZENMAN J., LEE J. (2005): International Reserves: Precautionary Versus Mercantilist Views, Theory and Evidence, NBER Working Paper No. 11366, National Bureau of Economic Research, Cambridge, Massachusetts.
FRENKEL J., JOVANOVIC B. (1981): Optimal International Reserves: A Stochastic Framework, Economic Journal, 1981, Vol. 91, pp. 507–514.
JEANNE O., RANCIERE R. (2009): The Optimal Level of International Reserves for Emerging Market Countries: Formulas and Applications, IMF Working Paper, WP/06/229, Washington.
WIJNHOLDS O., KAPTEYN A. (2001): Reserve Adequacy in Emerging Market Economies, IMF Working Paper 01/143, International Monetary Fund, Washington.
Keywords
FX RESERVES, RESERVE ADEQUACY, FINANCE, REGRESSION MODELS, STATISTICS
University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]; National Bank of Poland, Swietokrzyska 11/21, 00–919 Warszawa, Poland, [email protected]
Latent Transitions with Mixture Rasch Model of Bankruptcy Risk in the Classification of Polish Firms
Barbara Pawełek1, Józef Pociecha2, and Adam Sagan3
Abstract
Many types of bankruptcy prediction models have been formulated in business theory and practice. The most popular among them are multidimensional discriminant analysis, logit models, neural networks and classification trees.
The aim of the paper is to present the results of modeling bankruptcy using latent transition analysis (LTA) models with a mixture Rasch measurement model (MRM) of bankruptcy risk and with time-invariant and time-varying covariates. The measurement model is based on financial indicators of firms’ economic performance.
Data from Polish industry are used to estimate class prevalences, within-class variability on the latent variable, and transition probabilities across classes that reflect the level of bankruptcy risk.
Finally, a variety of LTA-MRM models with actual bankruptcy as a distal outcome are used for establishing the level of predictive validity.
References
CHO, S.-J., COHEN, A.S., KIM, S.-H. and BOTTGE, B. (2010), Latent Transition Analysis with a Mixture Item Response Theory Measurement Model, Applied Psychological Measurement, 34(7), 483-504.
PAWEŁEK, B. and POCIECHA, J. (2012), General SEM Model in Researching Corporate Bankruptcy and Business Cycles. In: J. Pociecha and R. Decker (Eds.): Data Analysis Methods and Its Applications. C.H. Beck, Warsaw, 215-231.
Keywords
BANKRUPTCY RISK, LATENT TRANSITION ANALYSIS, MIXTURE RASCH MODEL
Cracow University of [email protected] · Cracow University of [email protected] · Cracow University of [email protected]
Automatic Determination Of The Number Of Clusters In Spectral Clustering
Marek Walesiak and Andrzej Dudek
Abstract
This paper tests the usefulness of seven indices for assessing the quality of a classification (within-group dispersion, Davies-Bouldin index, Calinski and Harabasz index, Hartigan index, Krzanowski and Lai index, Silhouette index, gap index) in selecting the number of clusters in spectral clustering, taking into account four types of distance (squared Euclidean distance, Euclidean distance, Manhattan distance, GDM1 distance).
The article evaluates twenty-eight clustering procedures (four spectral clustering methods and seven indices) based on simulated data (classic and non-classic). Each clustering result is compared with the known cluster structure using the corrected Rand index.
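The corrected Rand index used for this comparison (Hubert and Arabie, 1985) can be computed from the contingency table of two partitions. The following is a minimal sketch, assuming both partitions are given as equally long lists of labels; the function name and the convention of returning 1.0 for the degenerate case are illustrative choices made here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Corrected-for-chance Rand index between two partitions of the same objects."""
    n = len(labels_a)
    # Pairs of objects placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions put everything in one cluster);
        # returning 1.0 is a convention assumed here.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index equals 1 when the two partitions agree up to a relabelling of the clusters and is close to 0 for partitions that agree no more than expected by chance.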
References
HUBERT, L. and ARABIE, P. (1985): Comparing partitions. Journal of Classification, 2(1), 193–218.
NG, A., JORDAN, M. and WEISS, Y. (2002): On spectral clustering: analysis and an algorithm. In: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, 849–856.
WALESIAK, M. (2011): Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R [The Generalized Distance Measure GDM in multivariate statistical analysis with R]. Wydawnictwo UE, Wrocław.
WALESIAK, M. and DUDEK, A. (2012): clusterSim package. URL http://www.R-project.org.
WANG, J. (2010): Consistent selection of the number of clusters via crossvalidation. Biometrika, 97(4), 893–904.
Keywords
CLUSTER ANALYSIS, SPECTRAL CLUSTERING, NUMBER OF CLUSTERS
Wrocław University of Economics, Department of Econometrics and Computer Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
A Spectral-Mean Shift Algorithm for Clustering of Symbolic Data
Andrzej Dudek1 and Marcin Pełka1
Abstract
Clustering methods have been applied successfully in many different areas. In cluster analysis, objects are usually described by single-valued variables. This allows them to be represented as vectors in which each component represents a variable. However, this kind of data representation is too restrictive for more complex data. To take into account uncertainty and/or variability in the data, variables must assume sets of categories or intervals, possibly with weights or frequencies. Such data have been mainly studied in Symbolic Data Analysis (SDA).
The article proposes a new clustering method for symbolic data: spectral mean shift clustering (SMSC). Spectral clustering has been a point of interest in many papers since the end of the 20th century. It is not a new clustering method, but rather a new method of preparing data for further cluster analysis. The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters or their shape.
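The mean shift idea can be illustrated on ordinary one-dimensional data. This is a minimal sketch of classical mean shift with a Gaussian kernel, not the symbolic-data algorithm proposed in the paper; the function name, the bandwidth parameter and the rule for merging nearby modes are assumptions made for illustration.

```python
from math import exp


def mean_shift_1d(points, bandwidth, max_iter=200, tol=1e-8):
    """Move every point uphill to a density mode, then group points by mode."""
    modes = []
    for x in points:
        for _ in range(max_iter):
            # Gaussian kernel weights of all data points seen from position x.
            weights = [exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points]
            shifted = sum(w * p for w, p in zip(weights, points)) / sum(weights)
            converged = abs(shifted - x) < tol
            x = shifted
            if converged:
                break
        modes.append(x)

    # Points whose modes lie within one bandwidth of each other share a cluster
    # (an assumed merging rule).
    centres, labels = [], []
    for m in modes:
        for idx, c in enumerate(centres):
            if abs(m - c) < bandwidth:
                labels.append(idx)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels, centres
```

The number of clusters is not supplied in advance; it follows from the number of distinct modes found for the chosen bandwidth.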
The proposed algorithm is a combination of the spectral and mean shift approaches for symbolic data, intended to deal better with non-Gaussian clusters with noisy variables and/or outliers.
References
BOCK, H.-H., DIDAY, E. (Eds.) (2000): Analysis of symbolic data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Berlin-Heidelberg.
CHENG, Y. (1995): Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8, p. 790–799.
NG, A., JORDAN, M., WEISS, Y. (2002): On spectral clustering: analysis and an algorithm. [In:] T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, p. 849–856.
Keywords
SYMBOLIC DATA ANALYSIS, SPECTRAL CLUSTERING, MEAN SHIFT
Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
Asymptotics of Reduced K-means Clustering
Yoshikazu Terada1
Abstract
Reduced k-means clustering, proposed by De Soete and Carroll (1994), is a method for clustering objects in a low-dimensional subspace. The advantage of this method is that both the clustering of objects and a low-dimensional subspace reflecting the cluster structure are obtained simultaneously.
The relationship between conventional k-means clustering and reduced k-means clustering is discussed. Conditions ensuring almost sure convergence of the reduced k-means estimator as the sample size increases without bound are presented. Results for a more general model that covers both conventional k-means clustering and reduced k-means clustering are provided. The rate of convergence of the empirically optimal clustering scheme is also discussed. Moreover, a new criterion and its consistent estimator are proposed to determine the optimal dimension of the subspace, given the number of clusters. For more details, see Terada (2013).
References
DE SOETE, G. and CARROLL, J.D. (1994): K-means clustering in a low-dimensional Euclidean space. In: Diday, E., Lechevallier, Y., Schader, M., Bertrand, P. and Burtschy, B. (Eds.): New Approaches in Data Analysis. Springer, Heidelberg, 212–219.
TERADA, Y. (2013): Strong consistency of Reduced k-means clustering. arXiv.
Keywords
STRONG CONSISTENCY, DIMENSION REDUCTION, K-MEANS
Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka, [email protected]
Non-hierarchical Clustering Algorithm For Mixed Numerical And Categorical Three-Way Three-Mode Data
Takahiro Umei1 and Hiroshi Yadohisa2
Abstract
Three-way three-mode data are defined as a set of multivariate data for the same objects and variables. Three-way factorial k-means (Vichi et al., 2007) and Tucker 3 clustering (Rocci and Vichi, 2005) have been proposed as algorithms for clustering such data. However, these algorithms can only deal with numerical data. To apply them to categorical data, the data first need to be converted into numerical data by using dummy variables. The clustering results are then difficult to interpret because a large number of variables is required. To address this problem, Chan et al. (2004) proposed an approach for multivariate data clustering that does not increase the number of variables and that considers the importance of variables. Therefore, its clustering results are easy to interpret.
In order to overcome the problems encountered in previous studies, this paper proposes a new non-hierarchical clustering algorithm that extends Chan et al.’s (2004) algorithm to three-way three-mode data clustering. Concretely, our algorithm enables easy interpretation of the clustering results of three-way three-mode data and considers the importance of variables and occasions for each cluster.
References
CHAN, E.Y., CHING, W.K. and HUANG, J.Z. (2004): An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition, 37(5), 943-952.
ROCCI, R. and VICHI, M. (2005): Three-mode component analysis with crisp or fuzzy partition of units. Psychometrika, 70(4), 716-736.
VICHI, M., ROCCI, R. and KIERS, H.A.L. (2007): Simultaneous Component and Clustering Models for Three-way Data: Within and Between Approaches. Journal of Classification, 24(1), 71-98.
Keywords
SUBSPACE CLUSTERING, VARIABLES AND OCCASIONS WEIGHTS, K-MODE CLUSTERING
Doshisha [email protected] · Doshisha [email protected]
Using Simulation Strategies to Test Clustering Algorithm Performances
Marina Marino1 and Cristina Tortora2
Abstract
A large number of clustering methods exist in the literature. The simplest and most commonly used methods, such as k-means or hierarchical clustering, perform well under the following conditions: 1) a small number of variables (lower than the number of units), 2) orthogonal variables (spherical clusters), 3) clusters having the same variance, 4) absence of outliers. When one or more of these conditions are not satisfied, clustering methods can fail to detect the clustering structure underlying the data. In this work a simulation study is used to test the performance of a recently proposed clustering method, Factor PD-clustering (FPDC), when the optimality conditions are not satisfied. FPDC is a factorial clustering method proposed by Tortora et al. in 2011. It is based on Probabilistic Distance clustering (PD-clustering), proposed by Ben-Israel and Iyigun in 2008. FPDC linearly transforms the original variables into a reduced number of orthogonal ones using a criterion shared with PD-clustering. Factor PD-clustering alternates between a Tucker 3 decomposition and a PD-clustering of the transformed data until convergence is reached. This method can significantly improve the performance of PD-clustering and allows us to work with large datasets. The method gives good results when the optimality conditions are not satisfied. The simulation design is based on the structure proposed by Maronna and Zamar in 2002.
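The PD-clustering step on which FPDC builds can be sketched as follows. This is a simplified illustration of the basic probabilistic distance iteration, assuming Euclidean distances, user-supplied starting centres and a fixed number of iterations; it omits the Tucker 3 step of FPDC entirely, and the function names are hypothetical.

```python
from math import dist


def membership(point, centres, eps=1e-12):
    """Cluster probabilities chosen so that p_k * d_k is the same for every k."""
    d = [max(dist(point, c), eps) for c in centres]  # eps guards against d = 0
    inv = [1.0 / dk for dk in d]
    total = sum(inv)
    return [v / total for v in inv], d


def pd_clustering(points, centres, n_iter=50):
    """Alternate between membership probabilities and weighted centre updates."""
    centres = [tuple(c) for c in centres]
    dim = len(centres[0])
    for _ in range(n_iter):
        num = [[0.0] * dim for _ in centres]
        den = [0.0] * len(centres)
        for x in points:
            p, d = membership(x, centres)
            for k in range(len(centres)):
                u = p[k] ** 2 / d[k]  # weight of point x for centre k
                den[k] += u
                for j in range(dim):
                    num[k][j] += u * x[j]
        centres = [tuple(v / den[k] for v in num[k]) for k in range(len(centres))]
    probs = [membership(x, centres)[0] for x in points]
    return centres, probs
```

A hard partition can be read off by assigning each point to the cluster with the largest probability.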
References
BEN-ISRAEL, A. and IYIGUN, C. (2008): Probabilistic d-clustering. Journal of Classification, 25(1):5–26.
MARONNA, R.A. and ZAMAR, R.H. (2002): Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44(4):307–317.
TORTORA, C., GETTLER SUMMA, M. and PALUMBO, F. (2011): Factorial pd-clustering. Proceedings of the Joint Conference of the German Classification Society.
Keywords
FACTOR PD-CLUSTERING, SIMULATION STUDY
Università di Napoli Federico II, [email protected] · University of Guelph,[email protected]
Random Forest Variable Importance Measures: Current Developments
Anne-Laure Boulesteix1 and Silke Janitza2
Abstract
The random forest method is an increasingly common supervised learning tool used in various application fields such as bioinformatics and genetics. The variable importance measures (VIMs) that are automatically calculated as a by-product of the algorithm are often used to rank predictors with respect to their ability to predict the investigated response. It is now well known that VIMs may be affected by substantial biases, for instance in favor of categorical predictors with many categories. After a brief survey of these issues, we address further topics related to variable importance measures: the bias affecting the Gini VIM in favor of categorical predictors with approximately balanced categories, a new permutation VIM based on the area under the curve that is more robust against class imbalance in the response variable than the usual permutation VIM, and the development of statistical tests for VIMs.
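The idea of a permutation VIM measured on the AUC scale can be illustrated independently of any particular forest implementation. The sketch below is a simplified, model-agnostic version: it permutes one predictor in a fixed data set and records the mean drop in AUC of an arbitrary scoring function. A random forest VIM would instead be computed tree by tree on out-of-bag observations; the function names and the list-of-lists data layout are assumptions made here.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(score_fn, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in AUC after randomly permuting one feature column.

    score_fn maps a row (list of feature values) to a score for class 1.
    """
    rng = random.Random(seed)
    baseline = auc(labels, [score_fn(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - auc(labels, [score_fn(r) for r in permuted]))
    return sum(drops) / n_repeats
```

A predictor that the scoring function ignores receives an importance of exactly zero, because permuting it leaves every score unchanged.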
References
BOULESTEIX, A.-L., JANITZA, S., KRUPPA, J. and KÖNIG, I. (2012): Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2, 493–507.
BOULESTEIX, A.-L., BENDER, A., LORENZO-BERMEJO, J. and STROBL, C. (2012): Random forest Gini importance favors SNPs with large minor allele frequency. Briefings in Bioinformatics, 13, 292–304.
JANITZA, S., STROBL, C. and BOULESTEIX, A.-L. (2013): An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics (accepted).
Keywords
RANDOM FOREST, ENSEMBLE METHOD, SUPERVISED LEARNING, VARIABLE IMPORTANCE
Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, [email protected] · Department of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University of Munich, [email protected]
Detecting Threshold Interactions In Binary Classification: STIMA
Claudio Conversano1 and Elise Dusseldorp2
Abstract
The Simultaneous Threshold Interaction Modeling Algorithm (STIMA) is a tool that automatically selects interactions in a Generalized Linear Model (GLM) through the estimation of a suitably defined tree structure called a “trunk”. STIMA integrates a GLM with a classification tree algorithm or a regression tree algorithm, depending on the nature of the response variable (nominal or numeric). Accordingly, it can be based on the Classification Trunk Model or on the Regression Trunk Model. In both cases, interaction terms are expressed as “threshold interactions” instead of traditional cross-products. Compared with standard tree-based algorithms, STIMA uses a different splitting criterion and offers the possibility to “force” the first split of the trunk by manually selecting the first splitting predictor. STIMA can provide different specifications of the generalized linear model with threshold interaction effects, depending on the nature of the response variable. In this paper, we focus on the binary response case and present results on real and synthetic data in order to compare the performance of STIMA with that of alternative methods (e.g., logistic regression, MARS, Support Vector Machines, Random Forests).
References
CONVERSANO, C. and DUSSELDORP, E. (2010): Simultaneous Threshold Interaction Detection in Binary Classification. In Lauro, C.N., Greenacre, M.J. and Palumbo, F. (eds.) Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Berlin-Heidelberg, 225–232.
DUSSELDORP, E., CONVERSANO, C. and VAN OS, B.J. (2010): Combining an additive and tree-based regression model simultaneously: STIMA, Journal of Computational and Graphical Statistics, 19, 514–530.
Keywords
GENERALIZED LINEAR MODELING, RECURSIVE PARTITIONING, INTERACTION EFFECTS, CLASSIFICATION TRUNK, REGRESSION TRUNK
Department of Business and Economics, University of Cagliari, [email protected] · Netherlands Organisation for Applied Scientific Research TNO, Leiden, The [email protected]
A Recursive Partitioning-Based Method To Balance Covariates When Estimating Causal Effects
Massimo Cannas1, Claudio Conversano1 and Francesco Mola1
Abstract
Estimation of causal effects from observational data may require prior adjustment to balance the distribution of covariates across treated and control units. We present an empirical method for identifying a balanced group of observations. It is implemented in an algorithm that uses a balance measure criterion to recursively split the original dataset based on the values of the covariates. Observations are finally partitioned into subsets characterized by different degrees of homogeneity. The final subset of observations on which causal inference can be carried out is selected according to a suitably defined threshold measure, and the bootstrap is used to assess the stability of the selection method as well as the properties of the average treatment effect estimators. Results on both simulated and real data illustrate the effectiveness of the proposed approach.
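The abstract does not state which balance measure the algorithm uses. As an illustration only, the sketch below computes the standardized mean difference of a single covariate between treated and control units, a commonly used balance measure; the function name and the example data are assumptions.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control):
    """Difference in covariate means, scaled by the pooled standard deviation.

    Values near zero suggest that the covariate is balanced across groups.
    """
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    if pooled_sd == 0.0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical covariate values (e.g. age) in a candidate subset.
smd = standardized_mean_difference([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
```

A recursive partitioning scheme could evaluate such a measure in each candidate subset and keep the subsets whose value falls below a chosen threshold.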
References
CRUMP, R.K., HOTZ, V.J., IMBENS, G.V. and MITNIK, O.A. (2009): Dealing with limited overlap in estimation of average treatment effects, Biometrika, 96(1): 187–199.
DEHEJIA, R. and WAHBA, S. (1999): Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs, Journal of the American Statistical Association, 94(448): 1053–1062.
IACUS, S. M. and PORRO, G. (2009): Random Recursive Partitioning: a Matching Method for the Estimation of Average Treatment Effects, Journal of Applied Econometrics, 24: 363–385.
TRASKIN, M. and SMALL, D.S. (2011): Defining the Study Population for an Observational Study to Ensure Sufficient Overlap: A Tree Approach, Statistics in Biosciences, 3: 94–118.
Keywords
CAUSAL INFERENCE, RECURSIVE PARTITIONING, BOOTSTRAP
University of Cagliari, Department of Business and [email protected], [email protected], [email protected]
Recursive Partitioning for Hybrid Image Classification using Captions and Image Features
Adalbert Wilhelm1
Abstract
Methods for finding groups of similar objects in large data sets with the purpose of facilitating data interpretation play an important role in exploratory data analysis. However, classical cluster analysis methods do not scale well with an increased number of objects and/or dimensions. Recent work in the field has focused on designing algorithms that can overcome these difficulties while providing meaningful solutions. We propose a projection-based hierarchical partitioning method inspired by the OptiGrid algorithm. Given a data sample, the present algorithm searches for low-density points (local minima) in selected-dimensional projections, and partitions the data by a hyperplane passing through the best split point found, if any. Measures such as iterative implementation, objects and dimensions sampling, and simplified search for projections and local minima, ensure the computational efficiency of the algorithm. A comparative evaluation of the algorithm is presented based on synthetic and reference data. Performance of the algorithm is explicated for some image analysis tasks.
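The search for a low-density split point can be illustrated on a single one-dimensional projection. The sketch below is a simplified stand-in for the method described above: it bins the projected values into a histogram and returns the center of the emptiest interior bin that has denser bins on both sides. The function name, the histogram approach and the bin count are assumptions.

```python
def best_split_point(values, bins=10):
    """Return a low-density split point of a 1-D projection, or None.

    A bin qualifies if some bin to its left and some bin to its right
    both hold more points; the qualifying bin with the fewest points wins.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return None
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    best = None
    for i in range(1, bins - 1):
        is_valley = max(counts[:i]) > counts[i] and max(counts[i + 1:]) > counts[i]
        if is_valley and (best is None or counts[i] < counts[best]):
            best = i
    if best is None:
        return None
    return lo + (best + 0.5) * width

# Two well-separated groups of projected values (made-up data).
split = best_split_point([0.0, 0.5, 1.0, 1.5, 8.5, 9.0, 9.5, 10.0])
```

When no valley exists, as for evenly spread values, the function returns None, which corresponds to the "if any" case in which the data are not split.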
References
ILIES, I. and WILHELM, A. (2010): Projection-Based Partitioning for Large, High-Dimensional Datasets. Journal of Computational and Graphical Statistics, 19, 474–492.
SCHOBER, J.-P., HERMES, T. and HERZOG, O. (2005): PictureFinder: Description logics for semantic image retrieval. In: Proceedings of the 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, 1571–1574.
SIVIC, J. and ZISSERMAN, A. (2003): Video Google: A text retrieval approach to object matching in videos. In: Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, 1470–1477.
Keywords
DIMENSION REDUCTION, HIERARCHICAL PARTITIONING, IMAGE CLASSIFICATION, PERFORMANCE MEASURES
School of Humanities and Social Sciences, Jacobs University Bremen, Campus Ring 1, 28759 Bremen, Germany, [email protected]
Change of Aspects of Industrial Classification System from Hierarchical Structure to Network Structure
Hiroki Furuzumi1, Yoshiro Matsuda2, and Yasumasa Baba3
Abstract
Countries around the world employ their own SIC (Standard Industrial Classification) schemes, such as JSIC in Japan and NAICS (North American SIC System), which the USA, Canada and Mexico use as their common SIC. The most common SIC of industries is the ISIC (International SIC) scheme by UNSD. Because these SIC schemes encode the activities of each establishment of a company but not of the company itself, assigning a unique industrial classification code to each company is a hard problem for the statistical officers of every country. In order to classify an establishment by its economic activities and/or amounts of turnover, we must face the fact that the majority of companies operate plural establishments with different activities. Most SIC schemes are composed of several levels, from the bottom or minute level to an upper aggregated level, i.e. they are organized as a hierarchical classification scheme. The broadest boundary in classifying industries lies between the sphere of goods and services and that of monetary aspects. One way to assign a code to a company running plural businesses is to assign only one position, leaving the number of businesses out of account; the other extreme is to assign one position at the upper level of aggregation.
A more adequate way of classification, however, is to abandon the hierarchical classification scheme and to use a network structure instead. Using micro data sets of the Financial Statements Statistics of Corporations by Industry of the Ministry of Finance, Japan, we propose to reclassify each company by its first and second turnover. This yields a different scheme from the one obtained using only the first turnover in a hierarchical classification scheme. For example, a certain company may occupy positions both in the real industrial classification sphere and in the monetary one. Expressing those relations in a network structure therefore makes the company's position in the industrial classification scheme clearer. We propose a different classification criterion for companies and establishments.
Keywords
PLURAL ESTABLISHMENT ENTERPRISES, MICRO DATA SETS, JSIC, NAICS
University of Hyogo. Kobe, [email protected] · Aomori Public College. Aomori, [email protected] · The Institute of Statistical Mathematics. Tokyo, [email protected]
Econometric Models of Durable Goods’ Prices: A Hedonic Approach
Anna Król1
Abstract
The classic demand-supply models of commodity prices in principle establish the market equilibrium price of a certain good at the intersection of the curves representing the quantities offered by producers and the quantities demanded by consumers. In contrast to those models, the hedonic approach links the price of a good with the set of its attributes that are valued by buyers and significant for manufacturers. The model representing this relationship, referred to as hedonic regression, makes it possible to price the commodity and to estimate the prices of its respective attributes (so-called implicit prices), including prices that are not directly observable on the market (e.g. the commodity’s brand).
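As a minimal illustration of an implicit price, the sketch below fits a one-attribute linear hedonic regression by ordinary least squares: the slope is the estimated price of one unit of the attribute. The linear functional form, the single attribute and the data (laptop prices against memory size) are assumptions made for this example and do not come from the author's database.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up offers: memory in GB and asking price.
ram_gb = [4, 8, 16, 32]
price = [1200, 1400, 1800, 2600]

base_price, implicit_price_per_gb = fit_line(ram_gb, price)
```

In practice a hedonic regression has many attributes, including categorical ones such as brand, and is often specified in logarithms; the one-variable case only shows how a coefficient is read as an implicit price.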
This paper presents a hedonic analysis of prices for two groups of durable goods, used cars and laptop computers, making use of an extensive database of offers gathered by the author. The research provides insights into consumers’ preferences toward different varieties of the analysed commodities and introduces estimates of market valuations of significant characteristics of the goods.
References
NESHEIM, L. (2006): Hedonic Price Functions. CeMMAP working papers CWP18/06. Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
TRIPLETT, J. (1986): The Economic Interpretation of Hedonic Methods. Survey of Current Business, 36(1), 36–40.
WOOLDRIDGE, J.M. (2002): Econometric Analysis of Cross Section and Panel Data. The MIT Press, Cambridge.
Keywords
HEDONIC PRICE METHODS, DURABLE GOODS, IMPLICIT PRICES
Wrocław University of [email protected]
Smart Growth Versus Economic And Social Cohesion – Econometric Panel Analysis
Beata Bal-Domanska1 and Elzbieta Sobczak1
Abstract
Within the framework of the EU Europe 2020 strategy, smart growth is listed as one of the leading policy objectives aimed at improving the situation in such domains as education, research and innovation, as well as digital society. It can be demonstrated that smart growth represents a set of instruments that are supposed to result in dynamic growth and therefore enhance economic and social cohesion, thereby improving the population's quality of life.
The objective of the paper is to evaluate the relations between smart growth, defined from the perspective of three pillars (smart specialization, creativity and innovation), and economic and social cohesion. Because both are complex phenomena, aggregate measures with a common growth pattern were used to measure smart growth and economic and social cohesion. These measures became the basis for the construction of econometric models allowing for the assessment of the impact of smart growth on economic and social cohesion. Estimation techniques for panel data were used to describe the mutual relations between these phenomena. The study was performed for the European Union countries.
References
A strategy for smart, sustainable and inclusive growth, European Commission, Communication from the Commission EUROPE 2020, Brussels, 3.3.2010.
ARELLANO M. (2003): Panel Data Econometrics. Oxford: Oxford University Press.
WALESIAK M. (2011): Uogólniona miara odległości GDM w statystycznej analizie wielowymiarowej z wykorzystaniem programu R [General distance measure GDM in statistical multivariate analysis applying R programme]. Wrocław University of Economics Publishing House, Wrocław.
WOOLDRIDGE J.M. (2002): Econometric analysis of cross section and panel data. Massachusetts Institute of Technology.
Keywords
SMART GROWTH, ECONOMIC AND SOCIAL COHESION, AGGREGATE MEASURES, PANEL MODELS
Wrocław University of Economics, Department of Regional Economics, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
Workflow Classification Based On The K-Means Partitioning
Etienne Lord, Abdoulaye Baniré Diallo, and Vladimir Makarenkov
Abstract
Workflow applications can be described as collections of tasks and the related links, defined so that they are processed in a well-established order. Many complex scientific and business processes can be modeled using workflow pipelines (Van der Aalst, 2011). Usually, workflows are organized to minimize the total cost and duration of the included operations. Classification and effective integration of workflows is a growing concern when interdisciplinary scientific projects are designed or when large organizations merge and need to integrate their business processes. In particular, the issue of clustering the existing workflow pipelines into larger and more effective workflows becomes more and more relevant. We propose to use the weighted version of the k-means partitioning algorithm (Makarenkov and Legendre, 2001) in order to provide a classification of the given set of workflows. Two versions of the optimization criterion will be considered, the first one allowing for clustering workflows with similar topological features (i.e. tasks, links) and the second one allowing for regrouping workflows depending on both the topological features and the execution time. We will present an application of our classification technique on workflows generated by our Armadillo platform (Lord et al. 2012).
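A minimal sketch of k-means with feature weights is given below, assuming each workflow is encoded as a numeric vector (here two 0/1 task indicators and an execution time). The weights are fixed by the user in this sketch, whereas the method of Makarenkov and Legendre (2001) optimizes them; the encoding, function names and data are assumptions.

```python
def weighted_sq_distance(x, center, weights):
    """Squared Euclidean distance with one non-negative weight per feature."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center))

def weighted_kmeans(points, centers, weights, iterations=20):
    """Alternate assignment and center update for a fixed number of rounds."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)),
                key=lambda k: weighted_sq_distance(p, centers[k], weights))
            for p in points
        ]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical workflows: [uses task A, uses task B, execution time].
workflows = [[1, 0, 10.0], [1, 0, 12.0], [0, 1, 11.0], [0, 1, 9.0]]

# Weight 0 on execution time: cluster by topological features only.
labels, centers = weighted_kmeans(
    workflows, [workflows[0], workflows[2]], weights=[1.0, 1.0, 0.0]
)
```

Giving the execution-time feature a positive weight would correspond to the second criterion, which regroups workflows by both topology and execution time.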
References
VAN DER AALST, W.M.P. (2011): Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag, Berlin.
MAKARENKOV, V. and LEGENDRE, P. (2001): Optimal variable weighting for ultrametric and additive trees and k-means partitioning: methods and software. Journal of Classification, 18, 245–271.
LORD, E. et al. (2012): Armadillo 1.1: An Original Workflow Platform for Designing and Conducting Phylogenetic Analysis and Simulations. PLoS One, 7(1), e29903.
Keywords
BIOINFORMATICS WORKFLOWS, K-MEANS PARTITIONING, WORKFLOW CLASSIFICATION
Département d’Informatique, Université du Québec à Montréal, Montréal, [email protected], [email protected],[email protected]
Functional Principal Component Analysis with R
Malgorzata Sej-Kolasa1 and Miroslawa Sztemberg-Lewandowska2
Abstract
Principal component analysis (PCA) transforms the original set of variables into a new set of orthogonal variables called principal components. Functional principal component analysis (FPCA) has the same advantages as classical principal component analysis. Moreover, it allows the analysis of dynamic data. The main difference between them is that PCA is based on multidimensional data, whereas FPCA is based on functional data. Functional data are curves, surfaces or anything else varying over a continuum, rather than single observations. The purpose of this article is to describe the stages of functional principal component analysis and to present selected packages and functions in the R system for implementing these steps. In addition, the authors show the usefulness of applying functional principal component analysis to longitudinal data.
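The stages of FPCA can be sketched in a simplified, discretized form: curves observed on a common grid are centered by the mean curve, their covariance over grid points is computed, and the leading eigenvector approximates the first principal component function. The Python sketch below uses power iteration for the eigenvector; it omits basis expansion and smoothing, does not reproduce any R package, and its function name and data are assumptions.

```python
import math

def first_principal_function(curves, iterations=200):
    """Return the mean curve and the leading eigenvector of the sample covariance.

    curves: list of equally long lists, each a curve sampled on a common grid.
    Power iteration starts from a vector of ones, so it assumes that this
    start is not orthogonal to the leading eigenvector.
    """
    n, m = len(curves), len(curves[0])
    mean_curve = [sum(c[t] for c in curves) / n for t in range(m)]
    centered = [[c[t] - mean_curve[t] for t in range(m)] for c in curves]
    cov = [
        [sum(row[s] * row[t] for row in centered) / (n - 1) for t in range(m)]
        for s in range(m)
    ]
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[s][t] * v[t] for t in range(m)) for s in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean_curve, v

# Made-up curves: multiples of one common shape observed at three grid points.
shape = [1.0, 2.0, 3.0]
curves = [[a * s for s in shape] for a in (-1.0, 0.0, 1.0, 2.0)]
mean_curve, component = first_principal_function(curves)
```

Because every curve here is a multiple of the same shape, the recovered component is proportional to that shape.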
References
HALL P., MÜLLER H. G., WANG J. L. (2006): Properties of Principal Component Methods for Functional and Longitudinal Data Analysis. The Annals of Statistics, Vol. 34, No. 3, 1493–1517.
INGRASSIA S., COSTANZO G. D. (2005): Functional principal component analysis of financial time series. In: Vichi M., Monari P., Mignani S., Montanari A. (Eds.) New Developments in Classification and Data Analysis. Springer-Verlag, Berlin, 351–358.
RAMSAY J. O., SILVERMAN B.W. (2005): Functional Data Analysis. Springer.
RAMSAY J.O., HOOKER G., GRAVES S. (2009): Functional Data Analysis with R and MATLAB. Springer.
Keywords
FUNCTIONAL DATA, FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS, R SYSTEM, LONGITUDINAL DATA
Department of Econometrics and Computer Science, Wroclaw University of Economics, [email protected] · Department of Econometrics and Computer Science, Wroclaw University of Economics, [email protected]
Implementation of Time Series Methods of Forecasting in TSprediction R Package
Tomasz Bartłomowicz
Abstract
The paper presents the Time Series Prediction (TSprediction) package developed for the R program. The package contains an implementation of the most popular time series forecasting methods, which include: time series models with trend (e.g. analytical models, Holt model), time series exponential smoothing models (e.g. simple exponential smoothing model, seasonal smoothing model), time series models with seasonal fluctuations (e.g. Winters’ seasonal multiplicative model, Winters’ seasonal additive model, model with cyclical component), moving average time series models (e.g. simple moving average model) and autoregressive time series models (ARMA and ARIMA models).
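Of the methods listed, simple exponential smoothing is the easiest to show in a few lines. The sketch below is written in Python and does not reproduce the TSprediction interface, which the abstract does not describe; the function name, the choice of the first observation as the initial level and the data are assumptions.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast.

    Each level is alpha * observation + (1 - alpha) * previous level.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    level = series[0]  # assumption: initialize with the first observation
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed

levels = simple_exponential_smoothing([10.0, 20.0, 20.0], alpha=0.5)
forecast = levels[-1]
```

Larger values of alpha make the forecast react faster to recent observations; smaller values smooth more strongly.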
In addition to time series forecasting methods, the TSprediction package contains functions for computing the most important ex post forecast errors: mean error (ME), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), mean percentage error (MPE) and mean absolute percentage error (MAPE).
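The six ex post error measures have standard definitions, sketched below in Python with the error taken as actual minus forecast. The function name and the dictionary layout are assumptions and are not taken from the TSprediction package; the percentage measures require non-zero actual values.

```python
import math

def forecast_errors(actual, forecast):
    """Compute ME, MAE, MSE, RMSE, MPE and MAPE for paired observations."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # Percentage errors are relative to the actual values.
        "MPE": 100.0 / n * sum(e / a for e, a in zip(errors, actual)),
        "MAPE": 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual)),
    }

measures = forecast_errors([100.0, 200.0], [110.0, 190.0])
```

ME and MPE keep the sign of the errors and so reveal systematic over- or under-forecasting, while the other four measure the size of the errors.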
Functions of the TSprediction package will be illustrated with examples of applications in empirical time series forecasting.
References
CIESLAK M. (1997), Prognozowanie gospodarcze. Metody i zastosowania. PWN, Warszawa.
COWPERTWAIT P.S.P., METCALFE A.V. (2008), Introductory Time Series with R. Springer, New York.
Keywords
FORECASTING, TIME SERIES, R PROGRAM
Wrocław University of Economics, Department of Econometrics and Computer Science, ul. Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]
Latest developments of the RSDA: An R package for Symbolic Data Analysis
Oldemar Rodríguez1 and Johnny Villalobos2
Abstract
In this new version of the R package RSDA we have integrated the package R2S, which was developed to transform relational data into symbolic data for use with the R package RSDA. The main feature of this package is the possibility of taking into account different types of symbolic variables (continuous, interval, histogram or multi-valued).
Methods like centers interval principal components analysis, histogram principal components analysis, multi-valued correspondence analysis and linear regression models have been implemented in this version. This new version also includes new features to manipulate symbolic data through a new data structure that implements Symbolic Data Frames.
RSDA includes functions to transform relational data into symbolic data. This feature uses a new set of base data types (continuous, interval, histogram or multi-valued), symbolic operators and SQL functions to allow the creation of symbolic tables directly in the database. The new types are implemented in the database management system PostgreSQL, a powerful open source object-relational database system. PostgreSQL is released under the PostgreSQL License, a liberal open source license similar to the BSD or MIT licenses, so we have permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee.
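The transformation from relational to symbolic data can be illustrated for the interval type: rows of a relational table are grouped by a key, and each group is summarized by the minimum and maximum of a numeric column. The sketch below is plain Python and does not use or reproduce the RSDA, R2S or PostgreSQL interfaces; the function name, column names and data are assumptions.

```python
def to_interval_table(rows, group_key, value_key):
    """Aggregate relational rows into one [min, max] interval per group."""
    intervals = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        low, high = intervals.get(group, (value, value))
        intervals[group] = (min(low, value), max(high, value))
    return intervals

# Made-up relational data: one row per purchase.
purchases = [
    {"customer": "A", "amount": 5.0},
    {"customer": "A", "amount": 1.0},
    {"customer": "B", "amount": 2.0},
    {"customer": "A", "amount": 3.0},
]
symbolic = to_interval_table(purchases, "customer", "amount")
```

Each customer becomes one symbolic object whose "amount" variable is an interval rather than a single number; histogram or multi-valued variables would be built by other aggregations of the same groups.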
References
BOCK H-H. and DIDAY E. (eds.) (2000). Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer, Germany.
CHAMBERS, J.M. (2008). Software for Data Analysis: Programming with R. Springer, New York.
EVERITT B.S. and HOTHORN T. (2010). A Handbook of Statistical Analysis Using R. Chapman & Hall book, Florida.
RODRIGUEZ R. and VILLALOBOS J. (2011). RSDA: An R package for Symbolic Data Analysis. Workshop In Symbolic Data Analysis, Namur, Belgium.
R DEVELOPMENT CORE TEAM (2007). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.
THE POSTGRESQL GLOBAL DEVELOPMENT GROUP (2012). R: PostgreSQL Developer’s Guide. PostgreSQL Development Team. http://www.postgresql.org.
Keywords
INTERVAL DATA, HISTOGRAM DATA, POSTGRESQL, RELATIONAL DATA BASE, SYMBOLIC DATA ANALYSIS.
CIMPA, School of Mathematics, University of Costa [email protected] // [email protected] · School of Computer Science, National University, Costa [email protected]
Microeconometrics Multinomial Logit Models and their Implementations in MMLM R Package
Andrzej Bak1 and Tomasz Bartłomowicz2
Abstract
Microeconometric logit models are useful in the analysis of categorical data (microdata describing individuals), often collected in marketing research based on discrete choices. Among microeconometric models for unordered categories, the most frequently used are the multinomial logit model (MNLM), the conditional logit model (CLM) and the mixed logit model (MLM). The main distinction between these models is the following: MNLM focuses on the individual as the unit of analysis and uses the individual’s characteristics as explanatory variables; CLM focuses on the set of alternatives, and the explanatory variables are characteristics of those alternatives; MLM focuses on both individuals and the choice options (alternatives), and the explanatory variables are characteristics of both individuals and alternatives.
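For the conditional logit case, in which the explanatory variables are characteristics of the alternatives, the choice probabilities are the exponentiated utilities normalized over the choice set. The sketch below is written in Python and does not reproduce the MMLM package interface; the function name, the attributes (price and quality) and the coefficient values are assumptions.

```python
import math

def choice_probabilities(alternatives, coefficients):
    """Conditional logit: P(j) = exp(u_j) / sum_k exp(u_k), with u_j = x_j . beta."""
    utilities = [
        sum(b * x for b, x in zip(coefficients, attrs)) for attrs in alternatives
    ]
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up choice set: each alternative is described by [price, quality].
alternatives = [[10.0, 3.0], [12.0, 3.0], [10.0, 4.0]]
beta = [-0.5, 1.0]  # hypothetical: higher price lowers utility, quality raises it
probs = choice_probabilities(alternatives, beta)
```

In a multinomial logit model the same formula applies, but the utilities are built from the individual's characteristics with a separate coefficient vector per alternative.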
The main aim of this paper is to present the Microeconometrics Multinomial Logit Models (MMLM) package developed for the R program, which can be used to estimate the probability that an individual chooses each of a set of alternatives. The package contains an implementation of multinomial, conditional and mixed logit models, together with functions that can be used in the discrete choice method to design the research (e.g. to build a fractional factorial design), encode the alternatives, estimate the models, etc. Functions of the MMLM package will be illustrated with examples of applications in the empirical analysis of consumer preferences.
References
AGRESTI A. (2002), Categorical Data Analysis. Second Edition, Wiley, New York.
CAMERON A.C., TRIVEDI P.K. (2005), Microeconometrics. Methods and Applications. Cambridge University Press, New York.
JACKMAN S. (2007), Models for Unordered Outcomes. Political Science 150C/350C. http://jackman.stanford.edu/classes/350C/07/unordered.pdf (12.03.2012).
SO Y., KUHFELD W.F. (1995), Multinomial Logit Models. http://support.sas.com/techsup/technote/mr2010g.pdf (12.03.2012).
WINKELMANN R., BOES S. (2006), Analysis of Microdata. Springer, Berlin.
Keywords
MICROECONOMETRICS, DISCRETE CHOICE MODELS, PREFERENCES, R PROGRAM
Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] · Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected]
Latent Spaces of the Product Baskets - A Hybrid Model of On-line Shopping
Adam Sagan1 and Mariusz Łapczynski2
Abstract
The aim of the paper is to identify the latent relations between product choices (on-line shopping data) using an integrated hybrid model of market basket and latent space analysis.
A large number of association rules were post-mined (Zhao, Zhang and Cao 2009) by combining them with social network analysis (SNA), which explains the relational properties (with respect to support, confidence and lift indices) of the product network (Raeder and Chawla 2011).
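The three indices mentioned above have standard definitions for a rule "antecedent implies consequent": support is the share of baskets containing both item sets, confidence is the share of baskets with the antecedent that also contain the consequent, and lift is confidence divided by the share of baskets containing the consequent. A minimal Python sketch follows; the function name and the basket data are assumptions.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(baskets)
    count_a = sum(1 for b in baskets if a <= set(b))
    count_c = sum(1 for b in baskets if c <= set(b))
    count_both = sum(1 for b in baskets if (a | c) <= set(b))
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift

# Made-up on-line shopping baskets.
baskets = [{"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk"}]
support, confidence, lift = rule_metrics(baskets, ["bread"], ["milk"])
```

In a product network, such indices can serve as edge weights between products, which is what makes the rules amenable to network analysis.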
We propose model-based Latent Space Analysis for clustering the product network using two-stage maximum likelihood and Bayesian MCMC estimation (Handcock, Raftery and Tantrum 2007). The optimal number of segments was found on the basis of the AIC/BIC criteria.
Relational properties of product networks are explained using alternative-specific logit p* models and autocorrelation statistics (Wasserman and Pattison 1996). The R package latentnet, UCINET and Mplus were used for the estimations.
References
HANDCOCK, M., S., RAFTERY, A. E. and TANTRUM, J., M. (2007): Model-Based Clustering for Social Networks, Journal of Royal Statistical Society, 170(2), 301–354.
RAEDER, T. and CHAWLA, N., V. (2011): Market Basket Analysis with Networks, Social Network Analysis and Mining, 2011, 1(2), 97–113.
WASSERMAN, S. and PATTISON M. (1996), Logit Models and Logistic Regressions for Social Networks: an Introduction to Markov Graph and p*, Psychometrika, 61(3), 401–425.
ZHAO, Y., ZHANG, Ch. and CAO, L. (2009): Post-mining of Association Rules: Techniques for Effective Knowledge Extraction, Information Science Reference.
Keywords
MARKET BASKET ANALYSIS, SOCIAL NETWORK AUTOCORRELATION, LATENT SPACE MODEL
Cracow University of [email protected] · Cracow University [email protected]
Multilevel Principal Covariates Regression
Marlies Vervloet, Wim Van den Noortgate, Katrijn Van Deun and Eva Ceulemans
Abstract
Principal Covariates Regression (PCovR; De Jong & Kiers, 1991) is a weighted combination of Principal Component Analysis (PCA) and linear regression. Like PCA, PCovR reduces the predictors to a few components and, like regression, it predicts the criteria, but on the basis of the components. The extent to which both aspects play a role when constructing the components is determined by a weighting parameter that has to be specified by the user. In this paper, we extend PCovR to multilevel data (e.g. persons nested in groups). As part of the criterion variance of such data can be attributed to between-group differences while another part is due to within-group differences, the method first splits the data into a between-group part and a within-group part (for a similar approach, see Timmerman, 2006). Subsequently, a separate PCovR analysis is conducted on the between-group part and on the within-group part. Multilevel PCovR involves a few model selection challenges, as for both the between-group and the within-group model, an appropriate number of components and weighting parameter value need to be chosen. To this end, we propose some model selection strategies, based on the work of Vervloet et al. (in press). The use of these strategies and the interpretation of the resulting model are illustrated by means of a real-data application.
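The first step, splitting the data into a between-group and a within-group part, can be sketched for a single variable: each score is written as the grand mean, plus the deviation of its group mean from the grand mean, plus its own deviation from the group mean. The Python sketch below shows only this decomposition, not the subsequent PCovR analyses; the function name and data are assumptions.

```python
def split_between_within(values, groups):
    """Decompose scores into grand mean, between-group and within-group parts."""
    grand_mean = sum(values) / len(values)
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    group_mean = {g: sums[g] / counts[g] for g in sums}
    between = [group_mean[g] - grand_mean for g in groups]
    within = [v - group_mean[g] for v, g in zip(values, groups)]
    return grand_mean, between, within

# Made-up scores of four persons nested in two groups.
scores = [1.0, 3.0, 5.0, 7.0]
group_of = ["a", "a", "b", "b"]
grand_mean, between, within = split_between_within(scores, group_of)
```

Applying this to every predictor and criterion variable yields a between-group data set and a within-group data set, each of which can then be analysed separately.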
References
DE JONG, S. and KIERS, H.A.L. (1991): Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155–164.
TIMMERMAN, M.E. (2006): Multilevel component analysis. British Journal of Mathematical and Statistical Psychology, 59, 301–320.
VERVLOET, M., VAN DEUN, K., VAN DEN NOORTGATE, W., and CEULEMANS, E. (in press): On the selection of the weighting parameter value in Principal Covariates Regression. Chemometrics and Intelligent Laboratory Systems.
Keywords
MULTICOLLINEARITY, REGRESSION, MULTILEVEL DATA
KU Leuven, Belgium
Three-step Estimation Method For Discrete Micro-Macro Multilevel Models
M. Bennink1, M. A. Croon1 and J. K. Vermunt1
Abstract
In ‘reversed’ multilevel analysis, a group-level outcome is explained by means of individual- and/or group-level predictors using a latent variable model (Croon and van Veldhoven, 2007). The scores of the individual-level units are treated as indicators of a latent variable defined at the group level, and the outcome variable is regressed on this latent group-level variable.
Maximum likelihood estimators can be obtained by estimating the model in one step. This one-step approach is not very practical to apply, especially when one wishes to use more than just a few lower-level predictors.
A solution would be to apply a three-step estimation method with a correction for classification error (Bolck, Croon, Hagenaars 2004; Vermunt 2010; Bakk, Tekle, & Vermunt, in press). The application of this three-step method to discrete micro-macro multilevel models is discussed in the current presentation.
References
BAKK, Zs., TEKLE, F. and VERMUNT, J. K. (in press): Estimating the association between latent class membership and external variables using bias adjusted three-step approaches. Sociological Methodology.
BOLCK, A., CROON, M. A. and HAGENAARS, J. A. (2004). Estimating latent structure models with categorical variables: One-step versus three-step estimators. Political Analysis, 12, 3–27.
CROON, M. A., and van VELDHOVEN, M. J. P. M. (2007). Predicting Group-level Outcome Variables from Variables Measured at the Individual Level: A Latent Variable Multilevel Model. Psychological Methods, 12, 45–57.
VERMUNT, J. K. (2010). Latent class modeling with covariates: Two improved three-step approaches. Political Analysis, 18, 450–469.
Keywords
THREE-STEP APPROACH, MULTILEVEL ANALYSIS, MICRO-MACRO ANALYSIS, GENERALIZED LATENT VARIABLE MODELS
Tilburg University, Tilburg, the [email protected]
Single-array SNP Genotype Classification With Semi-Parametric Log-Concave Mixtures
Paul H.C. Eilers1 and Ralph C.A. Rippe2
Abstract
SNP (pronounced “snip”) stands for single nucleotide polymorphism: a position on the genome (DNA) that differs between individual organisms. For humans, millions of SNPs have been located. Using microarrays, the state of up to a million SNPs can be measured at the same time, using a single drop of blood or a small amount of body tissue. The results are being used on a very large scale in genome-wide association scans, in which observable properties of many individuals are regressed on SNP states.
Each SNP has two alleles, which we indicate here by A and B. Because DNA is organized in chromosomes, and chromosomes form pairs, the state of a SNP, its genotype, can be AA, AB or BB (it is not possible to discriminate between AB and BA). A crucial step is the assignment of genotypes to all SNPs for each person in a study.
Microarray technology is based on chemical fluorescence. Unfortunately this technique is far from perfect, so clustering methods are needed. Commonly this is implemented by estimating the AA, AB and BB clusters and cluster memberships for each SNP in turn, for a set of microarrays.
We have developed an alternative approach, in which all SNPs on one array are clustered at the same time. It is based on estimating a mixture of three two-dimensional semi-parametric densities, using tensor product P-splines to model their logarithms. The penalties have been chosen in such a way that they force the component densities to be log-concave.
Genotyping whole arrays has large logistical advantages, both in speed and in organization of the workflow. We present the theory behind our proposal and, by applying it to samples from the HapMap archive, we show its excellent performance.
Keywords
SPLINES, PENALTIES, GENOTYPING
Erasmus University Medical Center, Rotterdam, The [email protected] · Leiden University, Leiden, The [email protected]
On Featureless K-Means Clustering
Sergey D. Dvoenko1
Abstract
In the featureless case, the set of objects is represented only by the results of pairwise comparisons, in the form of a distance, similarity or kernel-based matrix. Since the publication by W.S. Torgerson, it has been known that cluster centers can be represented by their distances to other objects without using the feature space itself, an approach that has recently become popular as “kernel k-means”.
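The representation of a cluster center by distances alone can be sketched as follows, assuming the matrix holds squared Euclidean distances: the squared distance from an object to the mean of a cluster equals the average squared distance to the cluster members minus half the average squared distance among the members. The Python function name and the example are assumptions; the check uses points on a line only to build the matrix.

```python
def sq_distance_to_center(sq_dist, obj, cluster):
    """Squared distance from object `obj` to the mean of `cluster`.

    sq_dist: symmetric matrix of squared Euclidean distances between objects.
    cluster: list of object indices; no feature vectors are needed.
    """
    n = len(cluster)
    mean_to_members = sum(sq_dist[obj][j] for j in cluster) / n
    within = sum(sq_dist[j][k] for j in cluster for k in cluster) / (2.0 * n * n)
    return mean_to_members - within

# Check against known coordinates: points 0, 2 and 5 on a line.
positions = [0.0, 2.0, 5.0]
sq_dist = [[(a - b) ** 2 for b in positions] for a in positions]

# The cluster {0, 1} has its mean at 1.0, so object 2 is at squared distance 16.
d2 = sq_distance_to_center(sq_dist, 2, [0, 1])
```

With this identity the assignment step of k-means can be carried out directly on the distance matrix.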
We show how k-means clustering can be executed with no computations related to cluster centers at all. This procedure, referred to as meanless featureless k-means clustering, permutes the square (dis)similarity matrix and results in the same clustering for both the featureless and the feature-based case.
It is shown that some heuristic clustering algorithms for “diagonalization” of similarity matrices, popular in Russia, are suboptimal versions of the meanless featureless k-means procedure if the matrix is positive semidefinite.
References
DHILLON, I., GUAN, Y. and KULIS, B. (2004): Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the 10th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining. ACM New York, NY, USA, 551–556.
SCHOLKOPF, B. and SMOLA, A. (2002): Learning with kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge.
BRAVERMAN, E.M. and others. (1971): Diagonalization of the relation matrix and detecting of hidden factors. Trans. of Institute of Control Sciences. 1st Issue "Problems of increasing of automata possibilities", Moscow, Institute of Control Sciences, 42–79. (in Russian)
TORGERSON, W.S. (1958): Theory and Methods of Scaling. Wiley. N.Y.
Keywords
K-MEANS, FEATURELESS, MEANLESS, DISTANCE, SIMILARITY
State University of Tula, Russia, [email protected]
Two Major Least-squares Divisive Clustering Methods: Bisecting K-Means, PDDP and in between
E. Kovaleva1 and B. Mirkin2
Abstract
We first show that both bisecting k-means and principal direction partitioning (PDDP) are suboptimal methods for the same least-squares criterion with ternary bases corresponding to rooted binary trees. We also combine the two by projecting the data onto a number of random directions rather than onto one principal direction. To specify a divisive algorithm, one must choose: (a) the next cluster to split; (b) a rule to stop splitting; (c) a cluster splitting method. We choose a representative subset of divisive clustering options and compare them experimentally by using a specially designed Gaussian cluster structure generator. Most options are unstable as noise increases. The PDDP method, recently modified by using the minima of the principal direction density function to specify all of (a), (b), and (c) above, appears to be the unequivocal winner in most experiments. Yet in really noisy situations, with both between-cluster overlap and random entities, the winner is bisecting k-means with random directions.
References
BOLEY D. Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery, 1998, 2(4), 325-344.
MIRKIN B. Mathematical Classification and Clustering, Kluwer, Dordrecht, 1996, 448.
MIRKIN B. Choosing the number of clusters, WIRE Data Mining and Knowledge Discovery, 2011, 1, 252-260.
TASOULIS S.K., TASOULIS D.K., PLAGIANAKOS V.P. Enhancing Principal Direction Divisive clustering, Pattern Recognition 43, 2010, 3391-3411.
Keywords
CLUSTERING, LEAST-SQUARES APPROACH, PDDP, BISECTING K-MEANS
NRU Higher School of Economics, Moscow, [email protected] · NRU Higher School of Economics, Moscow, [email protected]
Scoring Dissimilarity between Binary Images by Aligning Series of Skeleton Primitives
Olesya A. Kushnir1 and Oleg S. Seredin2
Abstract
We propose a method for matching images by converting the information in their skeletons into a series of primitives. To build a series of skeleton primitives, we traverse the skeleton counterclockwise starting from a terminal node. Each edge on the way adds a primitive to the series as a pair of reals: the first expresses the edge’s length, and the second the angle between the current edge and the next edge.
To compare two skeletons, we optimally align their series of primitives by using the dynamic programming approach. The alignment score is translated into our dissimilarity function. To improve the accuracy of a classifier built over the similarity, we incorporate a third real into the primitive, related to the radial size of the skeleton in the respective node. We apply this to classify medicinal plant leaves.
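The alignment step can be sketched as a standard global alignment by dynamic programming over two series of (length, angle) primitives. The substitution cost and the constant gap penalty below are hypothetical choices made only for illustration; the abstract does not state the authors' cost functions, nor how the alignment score is converted into the final dissimilarity.

```python
def primitive_cost(p, q):
    """Hypothetical cost of matching two primitives (length, angle)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def alignment_dissimilarity(a, b, gap=1.0):
    """Minimum total cost of a global alignment of two primitive series.

    table[i][j] holds the cheapest alignment of the first i primitives of `a`
    with the first j primitives of `b`; `gap` is an assumed constant penalty
    for leaving a primitive unmatched.
    """
    n, m = len(a), len(b)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = min(
                table[i - 1][j - 1] + primitive_cost(a[i - 1], b[j - 1]),  # match
                table[i - 1][j] + gap,  # primitive of `a` left unmatched
                table[i][j - 1] + gap,  # primitive of `b` left unmatched
            )
    return table[n][m]

skeleton_a = [(2.0, 0.5), (1.0, 1.2), (3.0, 0.3)]
skeleton_b = [(2.0, 0.5), (3.0, 0.3)]
print(alignment_dissimilarity(skeleton_a, skeleton_b))
```

Adding the third real (the radial size) would only require extending `primitive_cost` to three components.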
References
BYSTROV, M. YU. (2011): Structural approach application for recognition of binary image skeleton. In: Proceedings of Petrozavodsk State University, 2 (115), 76–80 (in Russian).
GUSFIELD, D. (1997): Algorithms on Strings, Trees, and Sequences. Cambridge University Press, University of California, Davis.
MESTETSKIY, L. and SEMENOV, A. (2008): Binary image skeleton - continuous approach. In: Proceedings of the Third International Conference on Computer Vision Theory and Applications (VISAPP 2008), 1, 251–258.
MOTTL, V.V., BLINOV, A.B., KOPYLOV, A.V., KOSTIN, A.A. (1998): Optimal Product Positioning Based on Paired Comparison Data. In: Graph-Based Representations in Pattern Recognition (J.-M. Jolion and W.G. Kropatsch, ed.) Computing, Supplement 12. Springer-Verlag/Wien, 135–145.
Keywords
SKELETON, PRIMITIVE, ALIGNMENT, DISSIMILARITY
Tula State University, Tula, [email protected] · Tula State University, Tula, [email protected]
Least-squares Consensus Clustering versus: (a) other Consensus Approaches and (b) K-Means
A. Shestakov1 and B. Mirkin2
Abstract
We take on two criteria for consensus clustering proposed by Mirkin and Muchnik (1981, in Russian) and optimize them with similarity clustering approaches described in (Mirkin, 2005, 2012). Given a set of partitions R on the same entity set, one criterion is to find a partition r that is behind those in R, which is akin to current concepts of ensemble consensus clustering. The other criterion is to build a partition r from R. Both can be equivalently reformulated as similarity clustering criteria, the first working over the conventional consensus matrix, the second over the summary projection matrix. We consider a number of recent clustering consensus methods: Voting Scheme (Weingessel, Dimitriadou, Hornik 2002), Borda Voting (Sevillano, Claudi Socoro, Alias 2009), Bayesian (Wang, Shan, Banerjee 2009), Fusion-Transfer consensus (Guenoche 2011), MCLA, CSPA and HGPA (Strehl, Ghosh 2002), and cVote (Ayad, Kamel 2010). For experiments, we take all three types of data: (a) UCI repository datasets, (b) specially drawn two-dimensional “ornaments”, and (c) generated Gaussian cluster datasets. We evaluate found cluster partitions according to their similarity to the partition hidden in data. We address two issues:
1. How do least-squares consensus algorithms fare in comparison with the others? Answer: The least-squares consensus algorithms outperform the others, usually by a large margin.
2. Is it true that the least-squares k-means clustering criterion is a better criterion than consensus? Answer: No. In most situations, the least-squares consensus partition is closer to the hidden partition than the one minimizing the k-means criterion. This shows that developing algorithms for reaching deep minima of the k-means criterion may be a wrong idea.
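For readers unfamiliar with the consensus matrix mentioned above, a minimal sketch follows. It assumes the common definition in which entry (i, j) counts the partitions that put entities i and j in the same cluster; the function name and the toy partitions are illustrative only, and the summary projection matrix is not covered.

```python
def consensus_matrix(partitions):
    """Count, for each pair of entities, the partitions that cluster them together.

    `partitions` is a list of label lists, one label per entity, all over the
    same entity set.
    """
    n = len(partitions[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix

# Three toy partitions of four entities (hypothetical data).
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
for row in consensus_matrix(partitions):
    print(row)
```

A similarity clustering method can then be run directly on this matrix instead of on the original features.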
Keywords
CONSENSUS CLUSTERING, LEAST SQUARES CONSENSUS, ONE-BY-ONE CLUSTERING, K-MEANS
NRU Higher School of Economics, Moscow, [email protected] · NRU Higher School of Economics, Moscow, [email protected]
Combination of Several Control Charts using Dynamic Weighted Majority Algorithm
Dhouha Mejri1, Claus Weihs2 and Mohamed Limam3
Abstract
In most process control applications, it is assumed that the process output follows a normal distribution with known mean and standard deviation. However, in the real world, data arrive over time, and the process concept to be learned is often not stable and may drift over time. Moreover, when monitoring a process with a multivariate normal distribution using Shewhart, CUSUM or EWMA control charts, which are designed to detect large, moderate and small shifts respectively, it has been proposed that good overall performance across different shifts can be obtained by combining control charts. This article presents a new combination of three different control charts using a dynamic ensemble method that copes with concept-drifting data streams, the “Dynamic Weighted Majority” (DWM-WIN) algorithm [MEJ12]. The proposed combination benefits from the online character of the DWM-WIN algorithm in detecting the state of the process when a stream of data arrives over time. It consists of two steps. First, the task of determining the state of the process is transformed into a classification problem by treating control charts as classifiers. Second, DWM-WIN is applied as an ensemble method to combine the different control charts. A real dataset with concept drift is used to simulate the combined control chart. The proposed control chart provides an online method for drift detection and improves on the overall performance of the individual control charts over the entire process shift range.
References
MEJRI, D., KHANCHEL, R., LIMAM, M. (2012): An ensemble method for concept drift in nonstationary environment, Journal of Statistical Computation and Simulation, 82, 1–14.
Keywords
STATISTICAL PROCESS CONTROL, ONLINE CLASSIFICATION, DYNAMIC WEIGHTED MAJORITY ALGORITHM, CONCEPT DRIFT.
ISG Tunis, University of Tunis and Technical University of Dortmund, [email protected] · Technical University of Dortmund, Germany, [email protected] · ISG Tunis, University of Tunis and Dhofar University, Oman, [email protected]
Multiplicity Within Clustering: Challenges And Unifications
Jacques-Henri Sublemontier
Abstract
Data clustering is one of the most important unsupervised learning tasks and remains a challenging one despite the huge number of methods proposed in the literature [?]. The large amounts of data now generated each month, day or hour have led to the so-called “Big Data” problem and have made clustering one of the main tools for making further analysis applicable. We are now faced with multiple sources of information, massive and heterogeneous, coming from marketing, biology or social network analysis. The present study is concerned with the multiplicity within current clustering problems. Multiplicity can be found both in the data to analyse and in the analyses to provide for demanding users. Thus several learning and mining paradigms have emerged over the last decade, namely multi-view clustering, consensus clustering or clustering ensembles, multiple consensus clustering, and subspace and semi-supervised clustering [?]. We survey several works dedicated to these problems, then propose a flexible framework unifying them all. The proposed framework follows the collaborative clustering principle, where the objective is to find collaborative mechanisms between a set of clusterers in order to achieve different objectives related to the presented problems.
References
HANS-PETER KRIEGEL AND ARTHUR ZIMEK. Subspace Clustering, Ensemble Clustering, Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other? In Proceedings of MultiClustKDD, 2010.
ANIL K. JAIN. Data clustering: 50 years beyond K-means. In Pattern Recognition Letters, 2010.
Keywords
MULTI-VIEW CLUSTERING, CONSENSUS CLUSTERING, ALTERNATIVE CLUSTERING, SEMI-SUPERVISED CLUSTERING, COLLABORATIVE CLUSTERING
LIFO - Université d’Orléans, ENSI de Bourges, Bâtiment IIIA, rue Léonard de Vinci,F-45067 ORLEANS Cedex [email protected],http://www.univ-orleans.fr/lifo/Members/sublemontier/
Non-Isometric Transforms in Time Series Classification using DTW
Tomasz Górecki1 and Maciej Łuczak2
Abstract
Over recent years the popularity of time series has soared. As a consequence there has been a dramatic increase in the amount of interest in querying and mining such data. In particular, many new distance measures between time series have been introduced. In this paper, we propose a new distance function based on derivatives and transforms of time series. In contrast to well-known measures from the literature, our approach combines three distances: the DTW distance between time series, the DTW distance between derivatives of time series and the DTW distance between transforms of time series. The new distance is used in classification with the nearest neighbor rule. In order to provide a comprehensive comparison, we conducted a set of experiments, testing effectiveness on 47 time series data sets from a wide variety of application domains. Our experiments show that this new method provides a significantly more accurate classification on the examined data sets.
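The combination of the three DTW distances can be sketched as follows. The weighted sum with weights a, b, c, the first-difference derivative, the unnormalised cosine transform and the absolute-difference local cost are all assumptions made for this illustration; the abstract does not state how the three distances are combined or how the parameters are chosen.

```python
from math import cos, pi

def dtw(x, y):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(x), len(y)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def derivative(x):
    """First differences, used here as a simple discrete derivative (assumption)."""
    return [x[t + 1] - x[t] for t in range(len(x) - 1)]

def cosine_transform(x):
    """Unnormalised discrete cosine transform (type II)."""
    n = len(x)
    return [sum(x[t] * cos(pi * (t + 0.5) * k / n) for t in range(n)) for k in range(n)]

def combined_distance(x, y, a=1/3, b=1/3, c=1/3):
    """Assumed weighted sum of DTW on the raw, derivative and transformed series."""
    return (a * dtw(x, y)
            + b * dtw(derivative(x), derivative(y))
            + c * dtw(cosine_transform(x), cosine_transform(y)))

def nearest_neighbour(query, training):
    """Label of the training series closest to `query`; training = [(series, label)]."""
    return min(training, key=lambda item: combined_distance(query, item[0]))[1]

training = [([0, 1, 2, 3], "up"), ([3, 2, 1, 0], "down")]
print(nearest_neighbour([0, 0, 1, 2, 3], training))
```

In practice the weights would be tuned on the training set, for example by cross-validation, rather than fixed at equal values.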
References
GÓRECKI, T. and ŁUCZAK, M. (2013): Using derivatives in time series classification. Data Mining and Knowledge Discovery 26(2), 310–331.
DING, H., TRAJCEVSKI, G., SCHEUERMANN, P., WANG, X. and KEOGH, E. (2008): Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. In: Proc. 34th Int. Conf. on Very Large Data Bases, 1542–1552.
KEOGH, E. and PAZZANI, M. (2001): Dynamic Time Warping with Higher Order Features. In: First SIAM International Conference on Data Mining (SDM’2001), Chicago, USA.
Keywords
DYNAMIC TIME WARPING, DERIVATIVE DYNAMIC TIME WARPING, TIME SERIES, HILBERT TRANSFORM, COSINE TRANSFORM, SINE TRANSFORM
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected] · Department of Civil and Environmental Engineering, Koszalin University of Technology, Sniadeckich 2, 75-453 Koszalin, [email protected]
Performance of the Accelerated Hyperbolic Smoothing Clustering Method
Adilson Elias Xavier1 and Vinicius Layter Xavier2
Abstract
This paper considers the solution of the minimum sum-of-squares clustering problem by using the Accelerated Hyperbolic Smoothing Clustering Method. The mathematical modelling of this problem leads to a min-sum-min formulation which has the significant characteristic of being strongly non-differentiable. The proposed resolution method adopts the Hyperbolic Smoothing (HS) strategy using a special $C^\infty$ differentiable class function. The final solution is obtained by solving a sequence of low-dimension differentiable unconstrained optimization sub-problems which gradually approach the original problem. The proposed algorithm also applies a partition of the set of observations into two non-overlapping groups: "data in frontier" and "data in gravitational regions". The resulting combination of the HS methodology with the partition scheme for the MSSC problem has interesting properties, which drastically simplify the computational tasks. Computational experiments were performed with very large synthetic instances with 5,000,000 observations in spaces with up to 10 dimensions. The obtained results show a high level of performance of the algorithm according to the different criteria of consistency, robustness and efficiency. The robustness and consistency performances can be attributed to the complete differentiability of the approach. The high speed of the algorithm can be attributed to the partition of the set of observations into two non-overlapping parts, which drastically simplifies the computational tasks.
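The smoothing idea can be illustrated with the function commonly used in the hyperbolic smoothing literature to replace the non-differentiable term max(0, y), namely phi(y, tau) = (y + sqrt(y^2 + tau^2)) / 2. The abstract does not give the function, so this particular form and the name `hyperbolic_smooth` are assumptions for illustration; the sketch only shows that the smooth surrogate approaches max(0, y) as tau decreases.

```python
from math import sqrt

def hyperbolic_smooth(y, tau):
    """Smooth approximation of max(0, y); infinitely differentiable for tau > 0."""
    return (y + sqrt(y * y + tau * tau)) / 2.0

# The gap to max(0, y) shrinks as the smoothing parameter tau goes to zero.
for tau in (1.0, 0.1, 0.01):
    gaps = [hyperbolic_smooth(y, tau) - max(0.0, y) for y in (-2.0, -0.5, 0.0, 0.5, 2.0)]
    print(tau, [round(g, 5) for g in gaps])
```

Solving a sequence of smooth problems with decreasing tau is the usual way such a surrogate is used to approach the original non-differentiable problem.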
References
XAVIER, A.E. (2010): The Hyperbolic Smoothing Clustering Method. Pattern Recognition, 43, 731-737.
XAVIER, A.E. and XAVIER, V.L. (2011): Solving the Minimum Sum-of-Squares Clustering Problem by Hyperbolic Smoothing and Partition into Boundary and Gravitational Regions. Pattern Recognition, 44, 70-77.
Keywords
CLUSTER ANALYSIS, MIN-SUM-MIN PROBLEMS, NON-DIFFERENTIABLE PROGRAMMING, SMOOTHING
Federal University of Rio de Janeiro - [email protected] · Federal University of Rio de Janeiro - [email protected]
STATIS Based Multiblock Clustering
Ndèye Niang1 and Mory Ouattara12
Abstract
Clustering multiblock data has been addressed by several consensus methods proposed by authors such as Gordon, A.D. and Vichi, M. (1998), among others. The principal idea of these consensus methods is to agglomerate the separate partitions obtained from each block into a global partition which has to be the most similar to the contributory partitions according to some index, e.g. the Rand index. CSPA (cluster-based similarity partitioning algorithm) consists of clustering a so-called association matrix whose entries are defined as the fraction of partitions in which two individuals are in the same cluster. This association matrix, considered as a similarity matrix, is used to recluster the individuals. Li et al. (2008) pointed out some limitations of CSPA and proposed a weighted consensus clustering method. We propose a method based on the three-way method STATIS (Lavit et al., 1994) to find the consensus partition: let $X_i$ be the indicator matrix related to the $i$th contributory partition. Applying STATIS, each of these matrices is associated with a connectivity matrix $W_i$. STATIS yields a compromise matrix $W$, a weighted average of the $W_i$ which is the most similar to the $W_i$ according to the RV index (Lavit et al., 1994). We propose to recluster the individuals using the STATIS compromise matrix. The proposed method is compared to CSPA on data sets from the UCI repository, with labelled individuals in order to have a reference partition.
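A minimal sketch of the compromise step follows, under stated assumptions: each connectivity matrix is taken as W_i = X_i X_i' without centring, the RV index is the normalised matrix inner product, and the compromise weights are the entries of the leading eigenvector of the RV matrix rescaled to sum to one (the usual STATIS construction). The function names, the power-iteration routine and the toy partitions are illustrative only.

```python
from math import sqrt

def connectivity(labels):
    """W = X X' for the indicator matrix X of a partition given as labels."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)] for i in range(n)]

def inner(a, b):
    """Matrix inner product trace(A'B)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def rv(a, b):
    """RV index between two matrices."""
    return inner(a, b) / sqrt(inner(a, a) * inner(b, b))

def statis_weights(matrices, iterations=200):
    """Leading eigenvector of the RV matrix (power iteration), rescaled to sum to 1."""
    m = len(matrices)
    r = [[rv(a, b) for b in matrices] for a in matrices]
    w = [1.0] * m
    for _ in range(iterations):
        w = [sum(r[i][j] * w[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    total = sum(w)
    return [x / total for x in w]

def compromise(matrices, weights):
    """Weighted average of the connectivity matrices."""
    n = len(matrices[0])
    return [[sum(wt * mat[i][j] for wt, mat in zip(weights, matrices)) for j in range(n)]
            for i in range(n)]

# Two agreeing partitions and one dissenting partition (hypothetical data).
ws = [connectivity(p) for p in ([0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1])]
weights = statis_weights(ws)
print([round(w, 3) for w in weights])
print([round(v, 3) for v in compromise(ws, weights)[0]])
```

With equal weights the compromise reduces to the CSPA association matrix; the eigenvector weights instead down-weight a partition that disagrees with the others.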
References
GORDON, A.D. and VICHI, M. (1998 b): Partition of partitions. Journal of Classification 15, 265-285.
LAVIT, C., ESCOUFIER, Y., SABATIER, R. and TRAISSAC, P. (1994): The ACT (STATIS method). Computational Statistics and Data Analysis, 18: 97-119.
LI, T. and DING, C. (2008): Weighted Consensus Clustering. In Proc. SIAM Int. Conf. on Data Mining (SDM), 798-809.
Keywords
STATIS, MULTI BLOCKS, CLUSTERING, CONSENSUS
CEDRIC CNAM 292, rue Saint Martin, 75141 Paris Cedex 03, [email protected] · CSTB, Centre Scientifique et Techniques du Bâtiment, 84 Avenue Jean Jaurès, 77420 [email protected]
Identifying Common And Distinctive Processes Underlying Multiset Data
Katrijn Van Deun1, Age K. Smilde2, Henk A.L. Kiers3, and Iven Van Mechelen1
Abstract
In many research domains it has become common practice to rely on multiple sources of data pertaining to the same set of entities. Examples include a systems biology approach to immunology with collection of both gene expression data and immunological read-outs for the same set of subjects, and the use of several high-throughput techniques for the same set of fermentation batches. A major challenge is to find the processes underlying such multiset data and to disentangle therein the common processes from those that are distinctive for a specific source. Several integrative methods have been proposed to address this challenge, including canonical correlation analysis, simultaneous component analysis, OnPLS, generalized singular value decomposition, DISCO-SCA, and ECO-POWER. To get a better understanding of the relations between these methods, this paper brings the methods together and compares them both on a theoretical level and in terms of analyses of high-dimensional micro-array gene expression data obtained from subjects vaccinated against influenza.
References
ALTER, O., BROWN, P.O., and BOTSTEIN, D. (2003): Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proceedings of the National Academy of Sciences USA 100, 3351-3356.
LÖFSTEDT, J., and TRYGG, J. (2010): OnPLS - a novel multiblock method for the modelling of predictive and orthogonal variation. Journal of Chemometrics 25 (2010) 441-455.
SCHOUTEDEN, M., VAN DEUN, K., and VAN MECHELEN, I. (2012): ECO-POWER: A novel method to reveal common mechanisms underlying linked data. In: A. COLUBI, K. FOKIANOS, and E.J. KONTOGHIORGHES (Eds.): Proceedings of COMPSTAT’2012. 20th International Conference on Computational Statistics. Physica-Verlag, Heidelberg. PP–PP.
TENENHAUS, A., and TENENHAUS, M. (2011): Regularized generalized canonical correlation analysis. Psychometrika, 76, 257-284.
VAN DEUN, K., VAN MECHELEN, I., THORREZ, L., SCHOUTEDEN, M., DE MOOR, B., VAN DER WERF, M.J., DE LATHAUWER, L., SMILDE, A.K., and KIERS, H.A.L. (2012): DISCO-SCA and properly applied GSVD as swinging methods to find common and distinctive processes. PLoS ONE, 7, e37840, 1-13.
Keywords
MULTISET, COMMON AND DISTINCTIVE, DATA INTEGRATION
KU Leuven, Leuven, [email protected] · University of Amsterdam, Amsterdam, The Netherlands · University of Groningen, Groningen, The Netherlands
Fuzzy Clustering of Three-way Proximity Arrays
Paolo Giordani1 and Henk A.L. Kiers2
Abstract
The ADditive CLUStering (ADCLUS) model is a tool for overlapping clustering of two-way proximity matrices (objects × objects). The Simple Additive Fuzzy Clustering (SAFC) model is a variant of ADCLUS that provides a fuzzy partition of the objects; that is, the objects belong to the clusters with so-called membership degrees ranging from zero (complete non-membership) to one (complete membership). The INdividual Differences CLUStering (INDCLUS) model is a generalization of ADCLUS for handling three-way proximity arrays (objects × objects × subjects). Here, we propose a fuzzified alternative to INDCLUS capable of offering a fuzzy partition of the objects by generalizing the idea behind SAFC to a three-way context. This new model is called Fuzzy INdividual Differences CLUStering (FINDCLUS). An algorithm is provided for fitting the FINDCLUS model to the data. Finally, the results of a simulation experiment and some applications to synthetic and real data are discussed.
References
CARROLL, J.D. and ARABIE, P. (1983): INDCLUS: an Individual Differences Generalization of the ADCLUS Model and the MAPCLUS Algorithm. Psychometrika, 48, 157–169.
GIORDANI, P. and KIERS, H.A.L. (2012): FINDCLUS: Fuzzy INdividual Differences CLUStering. Journal of Classification, 29, 170–198.
SATO, M. and SATO, Y. (1994): An Additive Fuzzy Clustering Model. Japanese Journal of Fuzzy Theory and Systems, 6, 185–204.
SHEPARD, R.N. and ARABIE, P. (1979): Additive Clustering: Representation of Similarities as Combinations of Discrete Overlapping Properties. Psychological Review, 86, 87–123.
Keywords
THREE-WAY ANALYSIS, CLUSTERING, PROXIMITY DATA, INDCLUS, FUZZY APPROACH
Department of Statistical Sciences, Sapienza University of Rome, P.le Aldo Moro, 5, 00185 Rome, [email protected] · Heymans Institute, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The [email protected]
Principal Covariates Clusterwise Regression
Eva Ceulemans1, Eva Vande Gaer1, Henk A. L. Kiers2, Iven Van Mechelen3, and Tom F. Wilderjans1
Abstract
In the behavioral sciences, many research questions pertain to a regression problem in that one wants to predict a criterion on the basis of a number of predictors. Although in many cases ordinary least squares regression will suffice, sometimes the prediction problem is more challenging, for three reasons: First, many predictors can be available, making it difficult to grasp their mutual relations as well as their relations to the criterion. In that case, it may be very useful to reduce the predictors to a few summary variables, on which one regresses the criterion and which at the same time yield insight into the predictor structure. Second, the population under study may consist of a few unknown subgroups that are characterized by different regression models. Third, the obtained data are often hierarchically structured, with for instance observations being nested into persons. Although some methods have been developed that partially meet these challenges (i.e., Principal Covariates Regression -PCovR-, clusterwise regression -CR-, and structural equation models), none of these methods adequately deals with all of them simultaneously. To fill this gap, we propose the PCCR method, which combines the key ideas behind PCovR (De Jong and Kiers, 1992) and CR (Spath, 1979). The PCCR method is validated by means of a simulation study and by applying it to data gathered in daily life on eating disorders.
References
DE JONG, S. and KIERS, H. A. L. (1992): Principal covariates regression. Part I. Theory. Chemometrics and Intelligent Laboratory Systems, 14, 155-164.
SPATH, H. (1979): Algorithm 39: Clusterwise linear regression. Computing, 22, 367-373.
Keywords
MULTICOLLINEARITY, DIMENSION REDUCTION, CLUSTERWISE REGRESSION, MULTILEVEL DATA
Methodology of Educational Sciences Research Group, KU Leuven. Email: [email protected] · Heymans Institute, Faculty of Behavioural and Social Sciences, University of Groningen. Email: [email protected] · Research Group of Quantitative Psychology and Individual Differences, KU Leuven. Email: [email protected]
Clusterwise PARAFAC To Identify Heterogeneity In Three-Way Data
Tom F. Wilderjans and Eva Ceulemans
Abstract
Three-way data, such as sensory profiling data (e.g., products rated on a set of features by different judges) and EEG data (e.g., the spectrum of multichannel EEG recordings over time for a set of participants), are frequently encountered in practice. When analyzing such three-way data, the PARAFAC model is often adopted to disclose the structure underlying the data (in terms of components). An implicit assumption of the PARAFAC model is that the underlying components are the same for all objects (i.e., products and participants). In many circumstances, however, this assumption is too restrictive in that different groups of objects may exist for which the data can be summarized well by a different set of components. For example, groups of participants may differ in the components that underlie their EEG recordings and other dimensions may be used to evaluate the quality of different groups of products. Therefore, in this presentation, a new clusterwise PARAFAC generic modeling strategy is proposed. The key ingredient of this strategy is that the objects are partitioned into a set of mutually exclusive clusters, and that for each cluster of objects, a separate PARAFAC model is fitted, resulting in (cluster-specific) components that are allowed to vary across object clusters. As a consequence, the data of objects belonging to the same cluster can be summarized well by the same components, whereas different components underlie the data of objects from different clusters. To evaluate the performance of the new clusterwise PARAFAC strategy, the results of an extensive simulation study will be discussed. Finally, an application of the strategy to EEG and/or sensory profiling data will be presented.
Keywords
CANDECOMP/PARAFAC, POPULATION HETEROGENEITY, THREE-WAY DATA, EEG DATA, QUALITATIVE (AND QUANTITATIVE) DIFFERENCES BETWEEN OBJECTS
Methodology of Educational Sciences Research Group, Faculty of Psychology and Educational Sciences, KU Leuven, Andreas Vesaliusstraat 2 box 3762, 3000 Leuven, Belgium. Email: [email protected]
Structure-Revealing Data Fusion Model
Evrim Acar, Anders J. Lawaetz, Morten A. Rasmussen, and Rasmus Bro
Abstract
In many disciplines, data from multiple sources are acquired and jointly analyzed for enhanced knowledge discovery. However, the task of fusing data is challenging since data are often incomplete, heterogeneous, i.e., in the form of higher-order tensors and matrices, and have both common (shared) and individual (unshared) components. With a goal of addressing these challenges, we formulate data fusion as a coupled matrix and tensor factorization problem tailored to automatically reveal common and individual components. In order to solve the coupled factorization problem, we use a gradient-based all-at-once optimization algorithm, which easily extends to coupled analysis of incomplete data sets. We demonstrate that the proposed approach provides promising results in joint analysis of metabolomics data sets consisting of fluorescence and NMR measurements of plasma samples of a group of colorectal cancer patients and controls.
References
ACAR, E., KOLDA, T. G. and DUNLAVY, D. M. (2011): All-at-once Optimization for Coupled Matrix and Tensor Factorizations, arXiv:1105.3422.
Keywords
DATA FUSION, COUPLED MATRIX AND TENSOR FACTORIZATIONS, MISSING DATA, GRADIENT-BASED OPTIMIZATION
Faculty of Science, University of Copenhagen, Denmark {evrim, ajla, mortenr, rb}@life.ku.dk
Effects of Resampling Schemes on Stability of Cluster Validation Indices
Rainer Dangl and Friedrich Leisch
Abstract
Model validation in clustering involves the question whether the appropriate number of groups was chosen. In order to investigate this, a wide range of indices has been developed so far. Examples include the Rand Index, Jaccard Coefficient, CH Index, KL Index, Gap Statistic, Prediction Strength, etc. In recent years, increased computational power has made resampling-based validation studies feasible; through repeated calculation of validity measures on resampled data, these provide a more stable trend towards a particular k. This in turn poses new questions: not only the choice of a particular index may affect the outcome of the validation process, but also the method of resampling. Three main options are available: bootstrapping, splitting and random selection, depending on whether an internal or external index is used. The question now arises whether the resampling scheme has an influence on the index values. The present study investigates exactly this problem. For this purpose, the three schemes and a selected range of cluster validation indices are benchmarked on simulated data.
References
DOLNICAR, S. and LEISCH, F. (2010): Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters, 21, 83–101.
MILLIGAN, G. and COOPER, M. (1985): An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50 (2), 159–179.
TIBSHIRANI, R., WALTHER, G. and HASTIE, T. (2000): Estimating the number of clusters in a dataset via the Gap Statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 411–423.
Keywords
RESAMPLING, MODEL VALIDATION, CLUSTERING
Institute for Applied Statistics and Computing, University of Natural Resources and Life Sciences, Vienna, Peter-Jordan-Strasse 82, 1190 Vienna, [email protected]; [email protected]
Functional Canonical Correlation Analysis
Mirosław Krzysko1 and Łukasz Waszak2
Abstract
In this paper we propose a new method of constructing canonical correlations and canonical variables for the pair of stochastic processes
$$X(t) = \sum_{k=1}^{p} \alpha_k \varphi_k(t), \qquad Y(t) = \sum_{l=1}^{q} \beta_l \psi_l(t)$$
represented by a finite number of orthonormal basis functions
$$\varphi(t) = (\varphi_1(t), \ldots, \varphi_p(t))', \qquad \psi(t) = (\psi_1(t), \ldots, \psi_q(t))',$$
where $t \in [0,T]$, and $\alpha_1, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$ are random variables with zero means and finite variances. Canonical correlation analysis for a random process with finite basis expansion is equivalent to multivariate canonical correlation analysis between two random vectors $\alpha = (\alpha_1, \ldots, \alpha_p)$ and $\beta = (\beta_1, \ldots, \beta_q)$.
This problem has been initiated by Leurgans et al. (1993) and developed by Ramsay and Silverman (2005).
References
LEURGANS, S.E., MOYEED, R.A. and SILVERMAN, B.W. (1993): Canonical correlation analysis when the data are curves, J.R. Statist. Soc. B 55, No 3, 725-740.
RAMSAY, J.O., SILVERMAN, B.W. (2005): Functional Data Analysis (2nd ed). Springer.
Keywords
FUNCTIONAL DATA, ORTHONORMAL BASIS, CANONICAL CORRELATION ANALYSIS, REPRODUCING KERNEL HILBERT SPACE, KERNEL
Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected] · Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Umultowska 87, 61-614 Poznan, [email protected]
Pearson’s Product-Moment Correlation is a Special Case of Cohen’s Weighted Kappa
Matthijs J. Warrens
Abstract
In behavioral and biomedical sciences it is frequently required that two observers each independently rate the same set of targets on an ordinal scale. The raters may be clinicians who classify children on asthma severity, or pathologists that rate the severity of lesions from scans. A widely used descriptive statistic for quantifying the agreement between the two observers is Cohen’s weighted kappa (Cohen 1968, Warrens 2011, 2012).
Weighted kappa was proposed for situations where the disagreements between the observers are not all equally important. For example, when categories are ordered, the seriousness of a disagreement depends on the difference between the ratings. Weighted kappa allows the use of weights to describe the closeness of agreement between categories.
Since the magnitude of weighted kappa is greatly influenced by the relative magnitude of the weights (Warrens 2013), a practical problem since its introduction has been: what weights should be chosen? In this talk we show that if cell weights may be calculated from the data, then the sample estimate of Pearson’s product-moment correlation is a special case of Cohen’s weighted kappa.
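One way to see how such an identity can arise is sketched below. It uses weighted kappa in its disagreement-weight form, 1 minus the ratio of observed to chance-expected weighted disagreement, with the weight of cell (i, j) set to the squared difference between the two raters' standardized category scores. These data-dependent weights are an assumption chosen for illustration and are not necessarily the construction used in the talk; with them the observed disagreement equals 2 - 2r and the expected disagreement equals 2, so kappa equals r. The table and the integer scores are hypothetical.

```python
from math import sqrt

def proportions(table):
    """Cell proportions with row and column marginals of a square count table."""
    n = sum(sum(row) for row in table)
    p = [[c / n for c in row] for row in table]
    rows = [sum(r) for r in p]
    cols = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    return p, rows, cols

def moments(scores, marginal):
    """Mean and (population) standard deviation of scores under a marginal."""
    mean = sum(s * m for s, m in zip(scores, marginal))
    sd = sqrt(sum(m * (s - mean) ** 2 for s, m in zip(scores, marginal)))
    return mean, sd

def pearson(table, scores):
    """Pearson correlation of the two raters' scores computed from the table."""
    p, rows, cols = proportions(table)
    (mx, sx), (my, sy) = moments(scores, rows), moments(scores, cols)
    k = len(scores)
    cov = sum(p[i][j] * (scores[i] - mx) * (scores[j] - my) for i in range(k) for j in range(k))
    return cov / (sx * sy)

def standardized_weights(table, scores):
    """Assumed disagreement weights: squared difference of standardized scores."""
    _, rows, cols = proportions(table)
    (mx, sx), (my, sy) = moments(scores, rows), moments(scores, cols)
    return [[((a - mx) / sx - (b - my) / sy) ** 2 for b in scores] for a in scores]

def weighted_kappa(table, weights):
    """1 - observed / chance-expected weighted disagreement."""
    p, rows, cols = proportions(table)
    k = len(table)
    observed = sum(weights[i][j] * p[i][j] for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * rows[i] * cols[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected

table = [[10, 3, 1], [4, 12, 5], [1, 2, 9]]  # hypothetical 3x3 rating table
scores = [1, 2, 3]
print(weighted_kappa(table, standardized_weights(table, scores)), pearson(table, scores))
```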
References
COHEN, J. (1968): Weighted Kappa: Nominal Scale Agreement With Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70, 213–220.
WARRENS, M. J. (2011): Cohen’s Linearly Weighted Kappa is a Weighted Average of 2×2 Kappas. Psychometrika, 76, 471–486.
WARRENS, M. J. (2012): Some Paradoxical Results for the Quadratically Weighted Kappa. Psychometrika, 77, 315–323.
WARRENS, M. J. (2013): Conditional Inequalities Between Cohen’s Kappa and Weighted Kappas. Statistical Methodology, 10, 14–22.
Keywords
COHEN’S KAPPA, WEIGHTED KAPPA, ORDINAL AGREEMENT
Leiden University, Institute of Psychology, Unit Methodology and Statistics, P.O. Box9555, 2300 RB Leiden, The Netherlands,[email protected]
Ternary Diagrams Based On A Probabilistic Ideal Point Model
Mark de Rooij1 and Paul Eilers2
Abstract
The ternary diagram is a familiar and useful display of triples of probabilities that sumto one. The scales of the diagram are linear, and so small probabilities lead to dots closeto the boundary or even in the corners. Details are hard to judge then.
We propose a transformation, inspired by the probabilistic ideal point model of De Rooij (2009). In a plane an object $i$ with probabilities $(p_{i1}, p_{i2}, p_{i3})$ is represented by a point with coordinates $(x_i, y_i)$, such that $p_{ij} = c \exp(-d_{ij}^2)$, where $d_{ij}^2 = (x_i - u_j)^2 + (y_i - v_j)^2$ is the squared Euclidean distance to an “anchor point” $j$, with coordinates $(u_j, v_j)$. These anchor points are defined by the user, and will generally form an equilateral triangle.
The proposed display has several interesting properties. Triples with very small probabilities can be represented well. Equal log-odds of pairs of probabilities correspond to straight lines perpendicular to the line connecting two anchor points. Equal log-odds of a single probability against the two others are given by smooth curves.
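The straight-line property can be checked numerically with a small sketch. It assumes that the constant c normalizes the three probabilities of an object to sum to one, and it uses an equilateral triangle with unit sides as anchor points; the function names and the `place` routine, which recovers the plotting position of a probability triple by solving the two linear log-odds equations, are illustrative and not taken from the abstract.

```python
from math import exp, log, sqrt

# Assumed anchor points: an equilateral triangle with unit side length.
ANCHORS = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3.0) / 2.0)]

def probabilities(x, y, anchors=ANCHORS):
    """p_j proportional to exp(-d_j^2), normalized to sum to one."""
    kernels = [exp(-((x - u) ** 2 + (y - v) ** 2)) for u, v in anchors]
    total = sum(kernels)
    return [k / total for k in kernels]

def log_odds(x, y, j, k, anchors=ANCHORS):
    """log(p_j / p_k) at the point (x, y); equals d_k^2 - d_j^2."""
    p = probabilities(x, y, anchors)
    return log(p[j] / p[k])

def place(p, anchors=ANCHORS):
    """Coordinates of a strictly positive probability triple.

    log(p_0 / p_k) = d_k^2 - d_0^2 is linear in (x, y), so two such
    equations (k = 1, 2) give a 2x2 linear system, solved by Cramer's rule.
    """
    (u0, v0) = anchors[0]
    rows = []
    for k in (1, 2):
        uk, vk = anchors[k]
        rows.append((2 * (uk - u0), 2 * (vk - v0),
                     (uk ** 2 + vk ** 2) - (u0 ** 2 + v0 ** 2) - log(p[0] / p[k])))
    (a1, b1, c1), (a2, b2, c2) = rows
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# Two points on a line perpendicular to the segment joining anchors 0 and 1
# share the same log-odds of p_0 against p_1.
print(log_odds(0.3, 0.2, 0, 1), log_odds(0.3, 0.9, 0, 1))
print(place([0.98, 0.015, 0.005]))
```

Because the position depends on log-odds rather than on the probabilities themselves, triples with very small probabilities are spread out instead of being pressed against the boundary.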
References
DE ROOIJ, M. (2009): Ideal point discriminant analysis with a special emphasis on visualization. Psychometrika, 74, 317–330.
Keywords
BIPLOTS, MULTIDIMENSIONAL SCALING, COMPOSITIONAL DATA
Leiden University, The [email protected] · Erasmus University Medical Center, The [email protected]
The Matter Of Scale: Perceiving Distances And Proximities In The Bi-Partial Clustering Setting
Jan W. Owsinski
Abstract
In the analysis of empirical data the issue of scale is of paramount importance. If there is clear knowledge of the “actual” space of (feasible) attribute values, then it has an obvious influence on the interpretation of the values that are available for analysis. While in the case of, say, “independent” binary data this might often be trivial, it is far from trivial when we deal with continuous data or with seriously restricted domains of multidimensional discrete, or even binary, data. This is related both to the degree of “filling” of this “feasible space” with the data and to the distance/proximity relations among the available observations.
Yet the issue of scale is important on several levels. First, it intervenes at the level of individual observations and the distance/proximity definitions, in close association with feature/variable importance. In this respect there is a distinct feedback loop in the reasoning: the geometric properties suggest which variables are important, while importance may in turn affect the way the geometry of the data set is treated. Second, it bears a direct influence on separation, distances and proximities among groups of observations. Finally, it appears through the assessment of the entire image of the data (do we deal with two or three models? is the proper number of clusters four or six?). The intuitions relative to these three basic levels may not necessarily be consistent.
In many cases it may be highly significant to analyse explicitly the influence of the perception of scale on the results of the respective analysis. The bi-partial approach, proposed by the present author, allows for an explicit consideration of this aspect, at least at two of the previously mentioned levels. The bi-partial approach, which stems mainly from clustering but applies to numerous domains of data analysis at large (see Owsinski, 2011), proposes to use a two-part objective function, namely
$$\min_{P} \{ Q_{SD}(P) = C_S(P) + C_D(P) \},$$
where $P$ is the partition of the data set that we look for, $C_S(P)$ corresponds to the overall assessment of similarity ($S$) of the components forming partition $P$ (we want the components to be as dissimilar as possible), and $C_D(P)$ corresponds to the overall assessment of the internal “compactness” of the particular groups, measured through distances ($D$), so that we would like it to be as small as possible. This formulation can be replaced by its “dual”, namely
$$\max_{P} \{ Q_{DS}(P) = C_D(P) + C_S(P) \},$$
with analogous notation.
Given that in this formulation we deal at the same time with distances and proximities, at least two of the previously mentioned levels of perception are involved. One refers to the basic definitions of distances and proximities for pairs of observations (objects). Actually, one deals in this context with the bidirectional transformation $d \leftrightarrow s$ between distance and proximity definitions. Quite naturally, this transformation involves establishing a respective scale (e.g., for standardised magnitudes of $d$ and $s$, when $s = 1 - d$ and vice versa, meaning we operate within a unit figure), whether done explicitly or implicitly. Further, joint consideration of $C_S(P)$ and $C_D(P)$ (in the “primal” formulation) implies establishing a certain scale at the level of groups of observations.
Systems Research Institute Polish Academy of Sciences,[email protected]
If so, experiments can be carried out on (a) the character of the results of the respective analysis (cluster analysis, first of all) as a function of the scale (transformation) parameters, ranging, quite intuitively, from one cluster to a number of clusters equal to the number of observations (excluding identical ones); and (b) comparison of the results thus obtained with those indicated either by humans or by the usually applied statistical criteria.
The paper presents the rationale and the purposefulness of the exercise, and illustrates it with simple examples for the basic concrete formulations of the bi-partial objective function.
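One possible concrete form of the primal two-part objective can be sketched as follows. The abstract does not fix the components, so the choices here are assumptions made only for illustration: distances are standardised to $[0, 1]$, similarity is $s = 1 - d$, $C_D(P)$ is taken as the sum of within-group distances and $C_S(P)$ as the sum of between-group similarities. The function name is hypothetical.

```python
import numpy as np

def bipartial_objective(d, labels):
    """Illustrative primal objective C_S(P) + C_D(P), to be minimised.

    d      : (n, n) symmetric matrix of distances scaled to [0, 1]
    labels : group label of each observation (the partition P)
    """
    d = np.asarray(d, float)
    labels = np.asarray(labels)
    s = 1.0 - d                               # assumed transformation d <-> s
    iu = np.triu_indices(len(labels), k=1)    # each pair counted once
    same = (labels[:, None] == labels[None, :])[iu]
    c_d = d[iu][same].sum()                   # within-group distances
    c_s = s[iu][~same].sum()                  # between-group similarities
    return float(c_s + c_d)

# Four points on the unit interval forming two obvious groups.
x = np.array([0.0, 0.1, 0.9, 1.0])
d = np.abs(x[:, None] - x[None, :])
two_groups = bipartial_objective(d, [0, 0, 1, 1])
one_group = bipartial_objective(d, [0, 0, 0, 0])
singletons = bipartial_objective(d, [0, 1, 2, 3])
```

Under these assumptions neither extreme partition is optimal: putting everything in one group inflates the distance part, while all singletons inflate the similarity part, so the number of groups follows from the balance of the two terms and hence from the scale of $d$ and $s$.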
References
OWSINSKI, J.W. (2011): The bi-partial approach in clustering and ordering: the model and the algorithms. Statistica & Applicazioni, Special Issue, pp. 43–59.
Keywords
DISTANCE, PROXIMITY, SCALE, PERCEPTION LEVELS, BI-PARTIAL OBJECTIVE FUNCTION
Comparing Direct Estimators of the Mode
Andrzej Sokołowski1 and Kamil Fijorek2
Abstract
Since Karl Pearson’s paper of 1895, many estimators of the mode have been proposed in the statistical literature. They can be grouped into two classes: indirect and direct. The first class involves estimating the density function and then finding its maximum. There are different types of direct estimators, which work without prior estimation of the density. In the paper several direct estimators are compared by means of simulation studies based on specially designed generating models, both for univariate and multivariate distributions, with single and multiple modes.
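As an illustration of what a direct estimator looks like, the sketch below implements a simplified version of the half-sample idea discussed by Bickel and Frühwirth (2006): repeatedly keep the half of the sorted sample with the smallest range, without ever estimating a density. It is not claimed to be one of the estimators compared in the paper, and tie-breaking details are simplified.

```python
import math
import numpy as np

def half_sample_mode(x):
    """Simplified half-sample mode for a univariate sample."""
    x = np.sort(np.asarray(x, float))
    while x.size > 2:
        h = math.ceil(x.size / 2)                  # size of the half-sample
        widths = x[h - 1:] - x[:x.size - h + 1]    # range of every window of h points
        i = int(np.argmin(widths))                 # densest window (first one on ties)
        x = x[i:i + h]
    return float(x.mean())                         # one or two points remain

sample = [1.0, 2.0, 4.9, 5.0, 5.05, 5.1, 9.0, 12.0]
mode_estimate = half_sample_mode(sample)
```

On this small sample the windows shrink around the cluster near 5, so the estimate is unaffected by the scattered values at both ends.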
References
PEARSON, K. (1895): Contribution to the mathematical theory of evolution – II: Skew variation in homogeneous material. Philosophical Transactions of the Royal Society of London, A, 186, 343–414.
SOKOŁOWSKI, A. (2013): Bezposrednie estymatory modalnej. Wydawnictwo Uniwersytetu Ekonomicznego, Kraków.
SAGER, T. (1978): Estimation of a Multivariate Mode. The Annals of Statistics, vol. 6, No. 4, 802–812.
BICKEL, D.R. and FRÜHWIRTH, R. (2006): On a fast, robust estimator of the mode: Comparison to other robust estimators with applications. Computational Statistics & Data Analysis, vol. 50, 12, 3500–3530.
Keywords
MODE, MODE ESTIMATION, SIMULATIONS
Cracow University of [email protected] ·Cracow University of [email protected]
k-NN Algorithm for Instantaneous Classification
Carmen Villar-Patiño1 and Carlos Cuevas-Covarrubias2
Abstract
k-NN (k-nearest neighbors) algorithms are standard methods of statistical classification. They are accurate and distribution free. In spite of these convenient features, k-NN implies a high computational cost. How to implement k-NN efficiently is an important question in applied pattern recognition. We describe a new condensation method for k-NN and we explore its performance in instantaneous color identification problems. As in some other solutions reported in the literature, we represent the training data set in terms of a reduced collection of informative prototypes. This is similar to the k-NN model based approach; nevertheless, our method includes two parameters to be calibrated in order to obtain a convenient exchange of precision for condensation; we call this k-NN “controlled condensation”. We evaluate its performance with a real data set in a computer vision context. The results suggest that this proposal is accurate and efficient. It is a good alternative for implementing efficient applications of k-NN in challenging classification problems.
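The general idea of replacing the training set by a reduced collection of prototypes can be illustrated with Hart's classic condensed nearest neighbour rule for 1-NN. This is a stand-in technique and not the authors' “controlled condensation”, whose two calibration parameters are not described in the abstract; the function names and the synthetic two-class data are assumptions.

```python
import numpy as np

def nearest_label(store_X, store_y, x):
    """Label of the stored prototype closest to x (1-NN)."""
    d2 = ((store_X - x) ** 2).sum(axis=1)
    return store_y[int(np.argmin(d2))]

def condense(X, y):
    """Indices of a prototype subset that classifies all of X correctly by 1-NN."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    keep = [0]                       # start from an arbitrary first prototype
    changed = True
    while changed:                   # repeat passes until nothing is added
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            if nearest_label(X[keep], y[keep], X[i]) != y[i]:
                keep.append(i)       # misclassified point becomes a prototype
                changed = True
    return np.array(keep)

# Synthetic two-class data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
prototypes = condense(X, y)
```

Classification then only needs distances to the retained prototypes, which is where the computational saving comes from; how much precision is traded for condensation is exactly what the method in the abstract aims to control.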
References
GUO, G.; WANG, H.; BELL, D.; BI, Y.; and GREER, K. (2003): KNN model-based approach in classification. On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 2888, 986–996.
JIMENEZ, R. and CUEVAS, C. (2010): Curvas ROC y Vecinos Cercanos, Propuesta de un nuevo algoritmo de Condensación, Revista de Matemática: Teoría y Aplicaciones, 18, 21–32.
MURTY, M. N., and DEVI, V. S. (2012): Pattern Recognition: An Algorithmic Approach. Springer and Universities Press.
Keywords
SUPERVISED CLASSIFICATION, k-NN, CONDENSATION, COMPUTER VISION, COLOR CLASSIFICATION
Universidad Anáhuac, Estado de México, Mé[email protected] · e-mail:[email protected]
Flexible Multiclass Support Vector Machines: An Approach using Iterative Majorization and Huber Hinge Errors
G.J.J. van den Burg1 and P.J.F. Groenen2
Abstract
A flexible multiclass support vector machine (SVM) is proposed which can be used for classification problems where the number of classes is $K \geq 2$. Traditional extensions of the binary SVM to multiclass problems, such as the one-vs-all or one-vs-one approach, suffer from unclassifiable regions. This problem is avoided in the proposed method by constructing the class boundaries in a $(K-1)$-dimensional simplex space. Nonlinear classification boundaries can be constructed by using either kernels or spline transformations in the method. Similar to earlier work by Groenen et al. (2008), an Iterative Majorization algorithm is derived to minimize the constructed loss function. The performance of the method is measured through comparisons with existing multiclass classification methods on several datasets. From this we find that in most cases the performance of the proposed method is similar to that of existing techniques, but in some cases classification accuracy is higher.
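To make the $(K-1)$-dimensional simplex space more concrete, the sketch below builds the $K$ vertices of a regular simplex with unit edge length in $K-1$ dimensions and assigns a point to its nearest vertex. The particular coordinates and the nearest-vertex rule are assumptions for illustration; the abstract does not state which simplex encoding or decision rule the authors use.

```python
import numpy as np

def simplex_vertices(K):
    """K equidistant vertices (unit edge length) in K - 1 dimensions."""
    E = np.eye(K)
    C = E - E.mean(axis=0)           # centred unit vectors span a (K-1)-dim subspace
    _, _, Vt = np.linalg.svd(C)      # first K-1 right singular vectors: basis of it
    V = C @ Vt[:K - 1].T             # coordinates within that subspace, K x (K-1)
    return V / np.sqrt(2.0)          # pairwise distance sqrt(2) rescaled to 1

def nearest_vertex(z, V):
    """Class index whose simplex vertex is closest to the point z."""
    return int(np.argmin(((V - np.asarray(z, float)) ** 2).sum(axis=1)))

V = simplex_vertices(4)              # four classes live in three dimensions
```

Because every point of the simplex space has a nearest vertex (up to ties on the boundaries), a rule of this kind leaves no unclassifiable regions, in contrast to combining separate one-vs-all or one-vs-one classifiers.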
References
GROENEN, P.J.F., NALBANTOV, G. and BIOCH, J.C. (2008): SVM-Maj: A Majorization Approach to Linear Support Vector Machines with Different Hinge Errors. Advances in Data Analysis and Classification, 2, 17–43.
Keywords
MULTICLASS SUPPORT VECTOR MACHINES, ITERATIVE MAJORIZATION, CLASSIFICATION
Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR [email protected] · Econometric Institute, Erasmus University Rotterdam,P.O. Box 1738, 3000 DR [email protected]
Power-Stress for Multidimensional Scaling
Patrick J.F. Groenen1 and Jan de Leeuw2
Abstract
Several loss functions exist for multidimensional scaling. Two important ones are based on the sum of squared differences of distances and dissimilarities (Stress) and on differences of squared distances and squared dissimilarities (S-Stress). The Power-Stress loss function incorporates these loss functions as it takes the sum of squared differences of distances and dissimilarities, both raised to some power $\lambda \geq 1$, that is,
$$\sigma_{\text{power}}(X) = \sum_{i<j} w_{ij} \left( \delta_{ij}^{\lambda} - d_{ij}^{\lambda}(X) \right)^2,$$
where $X$ is the $n \times p$ configuration, the $w_{ij}$ are known nonnegative weights, the $\delta_{ij}$ are known dissimilarities, and $d_{ij}(X)$ is the Euclidean distance between rows $i$ and $j$ of $X$. Thus, we fit distances raised to some power $\lambda \geq 1$ to the dissimilarities raised to the same power. Larger choices of $\lambda$ emphasize the fit of larger dissimilarities; conversely, smaller $\lambda$ spreads the emphasis over the dissimilarities. In this paper, we propose a new majorization algorithm to minimize the Power-Stress loss function. The core of this algorithm is the majorization of $\sum_{i<j} w_{ij} d_{ij}^{2\lambda}(X)$ by a term of the form $\mathrm{tr}[(X'X)^{\lambda}]$. As with any majorizing algorithm, a monotonically nonincreasing series of Power-Stress values is obtained that in almost all practical situations ends up in a local minimum. We show some of the main steps in the derivation of this algorithm and provide some numerical comparisons.
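The loss function itself is straightforward to evaluate. The sketch below only computes $\sigma_{\text{power}}(X)$ for a given configuration; it does not implement the majorization algorithm, and the function name and the default of unit weights are assumptions.

```python
import numpy as np

def power_stress(X, delta, lam=1.0, w=None):
    """Power-Stress of configuration X (n x p) for dissimilarities delta (n x n)."""
    X = np.asarray(X, float)
    delta = np.asarray(delta, float)
    n = X.shape[0]
    w = np.ones((n, n)) if w is None else np.asarray(w, float)
    # Euclidean distances between all pairs of rows of X
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    iu = np.triu_indices(n, k=1)                  # pairs with i < j
    return float((w[iu] * (delta[iu] ** lam - d[iu] ** lam) ** 2).sum())

# Three collinear points with distances 5, 10 and 5.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
delta = np.array([[0.0, 4.0, 10.0],
                  [4.0, 0.0, 5.0],
                  [10.0, 5.0, 0.0]])
raw_stress = power_stress(X, delta, lam=1.0)      # (4 - 5)^2 = 1
s_stress = power_stress(X, delta, lam=2.0)        # (16 - 25)^2 = 81
```

Setting `lam=1.0` gives raw Stress and `lam=2.0` gives S-Stress, which is the sense in which Power-Stress contains both as special cases.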
Keywords
MULTIDIMENSIONAL SCALING, MAJORIZATION, STRESS, S-STRESS
Econometric Institute, Erasmus University, Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The [email protected] · Department of Statistics, University of California, Los Angeles, CA 90095-1554, [email protected]
Variable Selection in Cluster Analysis Using Resampling Techniques: a Proposal
Hans-Joachim Mucha1 and Hans-Georg Bartel2
Abstract
Variable selection is a well-known problem in many areas of multivariate statistics such as classification and regression. The hope is that the structure of interest may be contained in only a small subset of variables. In contrast to supervised classification such as discriminant analysis, variable selection in cluster analysis is quite a difficult problem because nothing is known about the true classes. In addition, variable selection in cluster analysis is closely related to the major problem of determining the number of clusters present in the data (Hennig, 2007). The latter is the subject of many investigations and papers considering resampling techniques as practical tools (Jain and Moreau, 1987). We propose a new and general approach to variable selection using non-parametric resampling techniques. General means that it can be applied to any cluster analysis method. The starting point is an assessment of the evidence of univariate clusterings. Concretely, we look for the most stable univariate clustering (i.e., the best variable) with respect to indexes such as the adjusted Rand index. Here, additionally, one gets a rough idea of what the number of clusters $K$ is. Subsequently we look for additional variables as long as an improvement in the stability of the clustering is realized. To be more precise, we look for the most stable bivariate (and, further, multivariate) clustering. We demonstrate the performance of our proposal on both synthetic and real data. Here, different resampling techniques such as nonparametric bootstrapping and subsampling are used (Mucha and Bartel, 2013).
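The adjusted Rand index mentioned above is the basic building block of such stability assessments: the clustering of the full data is compared with clusterings of resampled data, and high average agreement indicates a stable variable or variable subset. The sketch below computes only the index itself from two label vectors; the resampling loop and the choice of clustering method are left out, and the function name is an assumption.

```python
import numpy as np
from math import comb

def adjusted_rand(a, b):
    """Adjusted Rand index between two partitions given as label vectors."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)                 # contingency table of the two partitions
    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_a * sum_b / comb(len(a), 2)    # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                     # degenerate case, e.g. two trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])      # relabelled but identical partition
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])   # partitions that cut across each other
```

The index equals 1 for identical partitions regardless of how the groups are labelled and can fall below 0 when two partitions agree less than expected by chance.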
References
HENNIG, C. (2007): Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis 52: 258–271.
JAIN, A. K. and MOREAU, J. V. (1987): Bootstrap technique in cluster analysis. Pattern Recognition 20: 547–568.
MUCHA, H.-J. and BARTEL, H.-G. (2013): Soft Bootstrapping in Cluster Analysis and Its Comparison with Other Resampling Methods. In: M. Spiliopoulou, L. Schmidt-Thieme and R. Janning (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Springer, Berlin, forthcoming.
Keywords
CLUSTERING, VARIABLE SELECTION, RESAMPLING
Weierstrass Institute for Applied Analysis and Stochastics (WIAS), 10117 Berlin, Mohrenstraße 39, Germany, [email protected] · Department of Chemistry at Humboldt University, Berlin, Brook-Taylor-Straße 2, 12489 Berlin, Germany, [email protected]
Adversarial Risk Analysis in Auctions
David Banks
Abstract
Adversarial risk analysis (ARA) is a decision-analytic approach to strategic games. It builds a Bayesian model for the decision process of an opponent, with subjective distributions over all unknown quantities. Then the analyst maximizes his expected utility with respect to the distribution over the action space induced by the model for the opponent and the corresponding uncertainties. This talk applies the ARA perspective to auctions, an important and well-studied class of strategic games. Under some assumptions, the results align with Bayes Nash equilibrium solutions. But the approach also introduces an interesting new class of auction problems, which are both realistic and mathematically challenging.
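The expected-utility step can be illustrated with a deliberately simple setting that is assumed here and not taken from the talk: a first-price sealed-bid auction with a single opponent, a risk-neutral analyst who values the item at `value`, and a subjective belief that the opponent's bid is uniform on $[0, 1]$. The analyst's belief is represented by random draws, and the bid maximising (value minus bid) times the estimated probability of winning is found on a grid.

```python
import numpy as np

def best_bid(value, opponent_bids, grid):
    """Bid on the grid that maximises expected utility against sampled opponent bids."""
    opponent_bids = np.sort(np.asarray(opponent_bids, float))
    # Estimated P(opponent bid < b) for each candidate bid b
    win_prob = np.searchsorted(opponent_bids, grid, side="left") / opponent_bids.size
    expected_utility = (value - grid) * win_prob
    i = int(np.argmax(expected_utility))
    return float(grid[i]), float(expected_utility[i])

rng = np.random.default_rng(0)
draws = rng.uniform(0.0, 1.0, size=100_000)   # assumed subjective belief about the opponent
grid = np.linspace(0.0, 1.0, 101)
bid, utility = best_bid(1.0, draws, grid)
```

With a uniform belief and a value of 1, the expected utility is $(1 - b)\,b$, so the selected bid should be close to 0.5; in a fuller ARA treatment the draws would come from a model of the opponent's own decision process instead of a fixed distribution.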
Duke University
Gaussian Process Classification And Duration Models For Credit Risk
Silvia Figini1 and Aki Vehtari2
Abstract
Credit risk models are used to evaluate the insolvency risk caused by credits that enter into default. Many models for credit risk have been developed over the past few decades. In this paper, we focus on those models that can be formulated in terms of the probability of default by using semi-parametric and non-parametric survival analysis models (see e.g. Figini and Fantazzini 2009).
In order to write the default probability in terms of the conditional distribution function of the time to default, in this paper we compare classical survival models with Gaussian Processes (GPs), which are powerful tools for probabilistic modeling purposes. As pointed out in Vehtari et al. 2013, despite their attractive theoretical properties GPs pose practical challenges in their implementation.
In this contribution we compare, in terms of cross validation (see e.g. Vehtari and Ojanen 2012), the results of the survival model with those of the GP. An empirical study, based on real data, illustrates the performance of each model.
References
FIGINI, S. and FANTAZZINI, D. (2009): Random Survival Forest models for SME Credit Risk Measurement, Methodology and computing in applied probability, 11, 29–45.
VEHTARI, A., VANHATALO, J., RIIHIMAKI, J., HARTIKAINEN, J., JYLANKI, P. and TOLVANEN, V. (2013): GPstuff: A Toolbox for Bayesian Modeling with Gaussian Processes, Journal of Machine Learning Research, Machine Learning Open Source Software, in press.
VEHTARI, A. and OJANEN, J. (2012): A survey of Bayesian predictive methods for model assessment, selection and comparison, Statistics Surveys, 6: 142–228.
Keywords
SURVIVAL ANALYSIS, GAUSSIAN PROCESS, CROSS VALIDATION, PROBABILITY OF DEFAULT, CREDIT RISK.
University of Pavia [email protected] · University of [email protected]
Model Averaging For Credit Risk Modelling
Silvia Figini1 and Marika Vezzoli2
Abstract
When many competing models are available for estimation, model averaging represents an alternative to model selection. Although model averaging approaches have been present in statistics for many years, only recently have they started to receive attention in applications, especially in credit risk modelling (see e.g. Figini and Fantazzini 2009). In this paper we investigate model averaging and ensemble learning in order to obtain a well calibrated credit risk model in terms of predictive accuracy. We compare Bayesian (see e.g. Steel, 2011 and the references therein) and classical model averaging approaches, like Random Forest (Breiman, 2001), Boosting (Freund and Schapire, 1996), and CRAGGING (Vezzoli and Zuccolotto, 2011), with the final aim of improving the predictive performance of the models.
References
BREIMAN, L. (2001): Random Forests, Machine Learning, 45, 5–32.
FIGINI, S. and FANTAZZINI, D. (2009): Random Survival Forest models for SME Credit Risk Measurement, Methodology and computing in applied probability, 11, 29–45.
FREUND, Y. and SCHAPIRE, R.E. (1996): Experiments with a new boosting algorithm, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156. San Francisco: Morgan Kaufmann.
STEEL, M.F.J. (2011): Bayesian Model Averaging and Forecasting, Bulletin of E.U. and U.S. Inflation and Macroeconomic Analysis, 30–41.
VEZZOLI, M. and ZUCCOLOTTO, P. (2011): CRAGGING measures of variable importance for data with hierarchical structure, in S. Ingrassia, R. Rocci, M. Vichi (Eds.), New Perspectives in Statistical Modeling and Data Analysis, 393–400. Springer.
Keywords
MODEL AVERAGING, PREDICTIVE PERFORMANCE, CLASSIFICATION, ENSEMBLE METHODS, WEAK LEARNER
University of Pavia [email protected] · University of [email protected]
Multiobjective Optimization Of Financing Household Goals With Multiple Investment Programs
Lukasz Feldman1, Radoslaw Pietrzyk2, and Pawel Rokita2
Abstract
This article proposes a technique that facilitates lifelong financial planning for a household by finding the optimal match between unit-linked products and multiple financial goals with different realization terms and magnitudes. The problem is, moreover, a multicriteria optimization. The first objective is compliance of the expected term structure of cumulated net cash flow throughout the life cycle of the household with its life-length risk aversion and bequest motive. The second is financial liquidity in all periods under expected values of all stochastic factors. The third is minimization of net cash flow volatility. The fourth is minimization of the costs of the investment plan combination. The result is a set of unit-linked investment programs with accompanying information on which programs are destined to cover which financial goal. Payoffs of one program may be used to cover more than one goal, and the order may be other than sequential.
References
CAMPBELL, J.Y. (2006): Household finance. Journal of Finance, Vol. 61, No. 4, 1553–1604.
CARROLL, C. (2006): The Method of Endogenous Gridpoints for Solving Dynamic Stochastic Optimization Problems. Economics Letters, Vol. 91, Issue 3, 312–320.
CORRIGAN, J., MATTERSON, W., NANDI, S. (2009): A Holistic Framework for Life Cycle Financial Planning. Milliman.
Keywords
MULTIOBJECTIVE OPTIMIZATION, PERSONAL FINANCE, ASSET SELECTION, INTERTEMPORAL CHOICE
Wroclaw University of [email protected] · Wroclaw University of [email protected] · Wroclaw University of [email protected]
Power Of Skewness Tests In The Presence Of Fat Tailed Financial Distributions
Krzysztof Piontek
Abstract
The best known and most widely used test of skewness is the Jarque-Bera approach. However, this test is not reliable for discriminating between symmetric and asymmetric return distributions in the presence of the leptokurtosis that is usually observed in financial data. Testing skewness is still an open and significant issue.
The goal of this paper is to investigate the power of some skewness tests when applied to fat-tailed (typical for finance) return distributions. Four approaches are briefly reviewed and discussed with respect to testing skewness in the whole return distribution: the classical Jarque-Bera test, the adjusted Jarque-Bera test (taking fatter tails into consideration), a test based on the Pearson type IV distribution, and the Peiro test, which makes no assumption about the type of distribution.
In the empirical part, the power of each test is estimated by using Monte Carlo simulations. Different asymmetric and fat-tailed distributions are used for data generation. The frequency of rejecting the null hypothesis of symmetry of the distribution, when it is false, is used as an approximate value of the power of the test. Data series with different numbers of observations and different skewness values are simulated.
The last part summarizes the results, compares the values obtained by using different test methods and gives hints for risk managers.
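The Monte Carlo rejection-frequency computation can be sketched as follows. The sketch uses only the classical skewness component that underlies the Jarque-Bera statistic (sample skewness scaled by $\sqrt{n/6}$ and compared with a normal critical value); the sample size, number of replications and the Student-t generator with 5 degrees of freedom are assumptions chosen for illustration, not the paper's design. Applied to symmetric data, the rejection frequency estimates the size of the test; applied to asymmetric data, it estimates the power.

```python
import numpy as np

def skewness_z(x):
    """Classical skewness statistic sqrt(n / 6) * g1, approximately N(0, 1) under normality."""
    x = np.asarray(x, float)
    c = x - x.mean()
    g1 = (c ** 3).mean() / (c ** 2).mean() ** 1.5
    return np.sqrt(x.size / 6.0) * g1

def rejection_rate(sampler, n, reps, rng, crit=1.96):
    """Share of simulated samples in which symmetry is rejected at the 5% level."""
    rejections = sum(abs(skewness_z(sampler(rng, n))) > crit for _ in range(reps))
    return rejections / reps

rng = np.random.default_rng(1)
# Symmetric thin-tailed data: the rate should be close to the nominal 5%.
rate_normal = rejection_rate(lambda r, n: r.standard_normal(n), 500, 2000, rng)
# Symmetric fat-tailed data: any rejection here is a false alarm.
rate_t5 = rejection_rate(lambda r, n: r.standard_t(5, size=n), 500, 2000, rng)
```

Comparing the two rates shows the problem stated in the abstract: for symmetric but fat-tailed data the classical statistic rejects symmetry far more often than the nominal level, so rejections cannot be read as evidence of asymmetry.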
References
ASAI, M. and DASHZEVEG, U. (2008): Distribution-Free Test for Symmetry with an Applic. to S&P Index Returns. Applied Economics Letters, 15(6), 461–464.
BERA, A., PREMARATNE, G. (2001): Adjusting the Tests for Skewness and Kurtosis for Distributional Misspecifications, UIUC-CBA Research WP No. 01-0116.
BRYS, G., HUBERT, M., STRUYF, A. (2003): A comparison of some new measures of skewness. Developments in Robust Statistics, 98–113.
Keywords
TESTS OF SYMMETRY, RETURN DISTRIBUTION, FAT TAILS
Department of Financial Investments and Risk Management, Wrocław University of Economics, ul. Komandorska 118/120, Wroclaw, [email protected]
Robust Clustering for Anti-Fraud Analysis
Andrea Cerioli1 and Domenico Perrotta2
Abstract
We address the problem of clustering the transactions that arise in international trade markets, from the point of view of anti-fraud analysis. These observations typically follow a mixture of regression lines, corresponding to different market conditions. Outliers and high leverage points are also present and may provide information about anomalies like fraudulent transactions. In order to eliminate the effect of outliers on the classification of “regular” trade, and in order to properly highlight them, robust methods are needed (Riani et al. 2008; García-Escudero et al., 2009, 2010). However, robust clustering techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is another likely occurrence in international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
References
GARCÍA-ESCUDERO, L. A., GORDALIZA, A., SAN MARTÍN, R., VAN AELST, S. and ZAMAR, R. (2009): Robust linear clustering. Journal of the Royal Statistical Society B, 71, 301–319.
GARCÍA-ESCUDERO, L. A., GORDALIZA, A., SAN MARTÍN, R. and MAYO-ISCAR, A. (2010): Robust clusterwise linear regression through trimming. Computational Statistics and Data Analysis, 54, 3057–3069.
RIANI, M., CERIOLI, A., ATKINSON, A. C., PERROTTA, D. and TORTI, F. (2008): Fitting mixtures of regression lines with the forward search. In: Fogelman-Soulié F. et al. (Eds.): Mining Massive Data Sets for Security. IOS Press, Amsterdam, 271–286.
Keywords
INTERNATIONAL TRADE, OUTLIERS, RLGA, TCLUST
Dipartimento di Economia, Università di Parma, [email protected] · European Commission, Joint Research Centre, Ispra, [email protected]
An Extended Gravity Approach To Examining Internal Migrations. The Case Of Poland
Justyna Wilk1 and Michał Pietrzak2
Abstract
Internal migrations play a significant role in regional development. They determine the size and structure of human resources and stimulate regional labour markets, among other effects. The aim of this paper is to formulate an approach that uses an econometric gravity model and multivariate data analysis methods to determine dependencies between socio-economic aspects and migration phenomena. An attempt is also made to apply it to the analysis of internal migrations in Poland in 2004–2011.
The gravity model considers migration flows from origin to destination and explains their conditions. The economic, household and labour market situation, innovativeness and living conditions are examined in this paper. These potential push and pull factors of population flows, which are complex in nature, are defined using taxonomic synthetic measures. The significance, intensity and direction of impact of socio-economic aspects, and also of geographical distance, on population inflows and outflows are examined. Two periods of time are distinguished to identify relationships between the economic cycle and the intensity and conditions of domestic migrations.
References
HWANG, C.L. and YOON, K. (1981): Multiple Attribute Decision Making Methods and Applications. Springer, Berlin Heidelberg.
LEE, E.S. (1966): A Theory of Migration. Demography, Vol. 3, No. 1, 47–57.
LeSAGE, J.P. and PACE, R.K. (2008): Spatial Economic Modeling of Origin-Destination Flows. Journal of Regional Science, Vol. 48(5), 941–967.
WHITE, M.J. and LINDSTROM, D.P. (2006): Internal Migration. In: D.L. Poston, M. Micklin (Eds.): Handbook of Population. Springer, Berlin-Heidelberg, 311–345.
Keywords
INTERNAL MIGRATION, REGIONAL DEVELOPMENT, GRAVITY MODEL, SYNTHETIC MEASURE
Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected] · Nicolaus Copernicus University in Torun, Department of Econometrics and Statistics, Gagarina 11, 87-100 Torun, Poland, [email protected]
Clustering of US counties based on their demographic structures
Simona Korenjak-Cerne1, Vladimir Batagelj2, Nataša Kejžar3
Abstract
A population pyramid is a very informative graphical representation of the demographic structure of a particular region. In the paper we will present the use of a symbolic hierarchical clustering method, implemented in the R package clamix, in the study of the demographic structure of U.S. counties. The presented approach offers additional insight into the data and, as such, is important especially for experts (demographers). The analysis will be presented on data from the latest US census of 2010, where changes over time between the demographic structures of 2000 and 2010 will also be observed. Another analysis that also considers the distributions by ethnicity will be carried out and compared with the results of the age-sex only analysis.
References
BATAGELJ, V. (1988): Generalized Ward and Related Clustering Problems. In: H.H. Bock (Ed.): Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, 67–74.
BATAGELJ, V. and KEJŽAR, N. (2010): clamix - Clustering Symbolic Objects. Program in R. Available from: https://r-forge.r-project.org/projects/clamix/
BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis. Conceptual statistics and data mining. Wiley, New York.
KORENJAK-CERNE, S. and BATAGELJ, V. (2002): Symbolic Clustering of Large Datasets. In: K. Jajuga, A. Sokołowski and H.H. Bock (Eds.): Classification, Clustering, and Data Analysis. Springer, Berlin, 319–327.
U.S. Census Bureau, Census 2000 and Census 2010. http://www.census.gov/population/age/data/decennial.html
Keywords
SYMBOLIC DATA ANALYSIS, HIERARCHICAL CLUSTERING, POPULATION PYRAMID, SYMBOLIC OBJECT, R-PACKAGE CLAMIX
University of Ljubljana, Faculty of Economics, [email protected] · University of Ljubljana, Faculty of Mathematics and Physics, [email protected] · University of Ljubljana, Faculty of Medicine, [email protected]
Strategic, Motivational And Emotional Aspects Of University Study. A Latent Class Approach
Anna Giraldo1, Silvia Meggiolaro2, and Elisa Visentin3
Abstract
University outcomes are strictly related to students’ attitudes, motivations and emotions towards university study. These factors, as well as the personal and household characteristics of the students, deeply influence the study path. In this work we use a latent class approach (McCutcheon, 1987) to find the underlying latent factors that summarize a series of items investigating students’ position as regards four domains: strategic skills, emotions, motivations, and resilience. Data come from a CAWI survey conducted in 2012 on a cohort of students enrolled in academic year 2006/07 at Padova University (Clerici et al., 2012). Results show that the underlying latent factors are in line with the psychological literature and that they can be used in regression models as explanatory variables, along with the personal and household characteristics of the students, to explain students’ university outcomes in more depth.
References
CLERICI, R., DA RE, L., GIRALDO, A., MEGA, C., VISENTIN, E. (2012): Aspetti strategici, motivazionali ed emotivi e successo accademico. Progettazione e conduzione di un’indagine sugli studenti dell’Università di Padova, Technical Report Series, 1, Department of Statistical Sciences, University of Padova.
McCUTCHEON, A.L. (1987): Latent Class Analysis. Sage, Newbury Park.
Keywords
LATENT CLASS FACTOR ANALYSIS, UNIVERSITY STUDY
Department of Statistical Science, via C. Battisti 241, [email protected] · Department of Statistical Science, via C. Battisti 241, [email protected] · Department of Philosophy, Sociology, Education and Applied Psychology, Via Beato Pellegrino 8, [email protected]
The Comparative Log–Linear Analysis Of Unemployment In Poland In 2004–2011
Justyna Brzezinska
Abstract
In categorical data analysis we can analyze categorical variables simultaneously in multi–way tables. Such tables present special problems of analysis and interpretation, which are usually connected with the number of variables. This paper presents the use of log–linear models, which make it possible to analyze the independence and the pattern of association between any number of categorical variables. Different types of independence can be analyzed, such as conditional independence or homogeneous association. There are several criteria for testing the goodness–of–fit of the model: the chi-square statistic, the likelihood ratio, and information criteria (AIC, BIC).
With the rising unemployment rate in recent years, unemployment is one of the most important economic and social problems in Poland. A strong differentiation is observed in the unemployment rates for various regions of Poland, especially for young people and university graduates, as well as for males and females. The log–linear analysis will be presented on an example using data from the Central Statistical Office of Poland. The comparative log–linear analysis will be conducted for multi–way tables on unemployment in 2004–2011. All calculations will be conducted in R using the loglm function from the MASS library.
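The goodness-of-fit criteria named above can be illustrated on the simplest log–linear model, independence in a two–way table, where the fitted counts have a closed form. The sketch is written in Python and is not a substitute for the loglm analysis of multi–way tables; the function name and the small table of counts are made up purely for illustration and all observed counts are assumed to be positive.

```python
import numpy as np

def independence_fit(table):
    """Fitted counts, Pearson X^2, likelihood-ratio G^2 and df for the independence model."""
    obs = np.asarray(table, float)
    n = obs.sum()
    # Under independence: expected count = row total * column total / n
    fitted = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    x2 = ((obs - fitted) ** 2 / fitted).sum()
    g2 = 2.0 * (obs * np.log(obs / fitted)).sum()   # assumes no zero cells
    df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
    return fitted, float(x2), float(g2), df

# Made-up 2 x 2 table of counts, for illustration only.
fitted, x2, g2, df = independence_fit([[10, 20], [30, 40]])
```

Both statistics are compared with a chi-square distribution on `df` degrees of freedom; small values indicate that the independence model fits, and information criteria such as AIC and BIC are obtained from the same likelihood by penalising the number of fitted parameters.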
References
CHRISTENSEN, R. (1997): Log-linear Models and Logistic Regression. Springer–Verlag, New York.
KNOKE, D., BURKE, P. (1997): Log-linear Models. Sage University Paper Series on Quantitative Applications in the Social Science, series no. 07-020, Beverly Hills and London Sage.
Keywords
LOG–LINEAR MODELS, MULTI-WAY CONTINGENCY TABLES, COMPARATIVE LOG–LINEAR ANALYSIS, UNEMPLOYMENT IN POLAND
Faculty of Management, University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]
Measurement of Quality in Cluster Analysis
Christian Hennig1
Abstract
There is much work on benchmarking in supervised classification, where “quality” can generally be measured as a function of misclassification probabilities. In unsupervised classification (cluster analysis), the measurement of quality is much more problematic, because in reality there is no true class label which can be used for cross-validation and the like. Furthermore, there is no guarantee that in situations where there is a true classification (for example, where benchmark data sets from supervised classification are used to assess clustering methods, or where data is simulated from a mixture distribution), this classification is unique. There can be a number of different reasonable clusterings of the same data, depending on the research aim.
I will discuss the use of statistics for the assessment of clustering quality that can be computed from classified data without making reference to “the true clusters”. Such statistics have traditionally been called “cluster validation indexes” (such as the average silhouette width), and have sometimes been used for estimating the number of clusters. Most of the traditional statistics try to balance various aspects of a clustering against each other (such as within-cluster homogeneity and between-cluster separation), but in order to characterize what advantages and disadvantages a clustering has, it is useful to formalize different aspects of cluster quality separately. This can also be used to explain misclassification rates in cases where “true” clusterings exist as a function of the features of these clusterings.
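The average silhouette width named above is one example of an index that needs no true labels. The sketch below is a plain-Python illustration, assuming Euclidean distances; the function name and the toy points are hypothetical and not taken from the talk.

```python
import math

def average_silhouette_width(points, labels):
    """Average silhouette width of a clustering, using Euclidean distances."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    n = len(points)
    total = 0.0
    for i in range(n):
        # Mean distance from point i to each cluster, leaving i itself out.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(math.dist(points[i], points[j])
                                   for j in members) / len(members)
        own = labels[i]
        if own not in mean_dist:
            continue  # singleton cluster: its silhouette is taken as 0
        a = mean_dist[own]  # within-cluster homogeneity
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Toy data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1])
print(good, bad)
```

The index folds homogeneity (a) and separation (b) into a single number, which is the kind of balancing that the abstract proposes to replace with separate measurements.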
Keywords
BENCHMARKING, CLUSTER VALIDITY, MISCLASSIFICATION RATE, HOMOGENEITY, SEPARATION, STABILITY
Department of Statistical Science, University College London, [email protected]
Resampling Methods for Exploring Cluster Stability
Friedrich Leisch
Abstract
Model diagnostics for cluster analysis is still a developing field because of its exploratory nature. Numerous indices have been proposed in the literature to evaluate goodness-of-fit, but no clear winner that works in all situations has been found yet. Derivation of (asymptotic) distribution properties is not possible in most cases. Over the last decade, several resampling schemes have been proposed in the literature which cluster repeatedly on bootstrap samples or random splits of the data and compare the resulting partitions. These resampling schemes provide an elegant framework to computationally derive the distribution of interesting quantities describing the quality of a partition. Due to the increasing availability of parallel processing even on standard laptops and desktops, these simulation-based approaches can now be used in everyday cluster analysis applications. We give an overview of existing methods, show how they can be represented in a unifying framework including an implementation in the R package flexclust, and compare them on simulated and real-world data. Special emphasis will be given to the stability of a partition, i.e., given a new sample from the same population, how likely is it to obtain a similar clustering?
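One resampling scheme of the kind described above can be sketched as follows: cluster two bootstrap samples, use each fit to partition the original data, and compare the two partitions. This is a plain-Python sketch under simplifying assumptions (one-dimensional data, Lloyd-type k-means, the unadjusted Rand index); it does not reproduce the flexclust interface, and all function names are hypothetical.

```python
import random

def assign(x, centroids):
    """Index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))

def kmeans_1d(xs, k, iters=50):
    """Lloyd iterations on 1-D data; centroids start at spread-out order statistics."""
    s = sorted(xs)
    centroids = [s[(2 * c + 1) * len(s) // (2 * k)] for c in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[assign(x, centroids)].append(x)
        # An empty group keeps its previous centroid.
        new = [sum(g) / len(g) if g else centroids[c] for c, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids

def rand_index(a, b):
    """Proportion of object pairs on which two partitions agree."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

def bootstrap_stability(xs, k, n_pairs=20, seed=0):
    """Rand indices between partitions of xs induced by pairs of bootstrap fits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        partitions = []
        for _ in range(2):
            sample = [rng.choice(xs) for _ in xs]  # bootstrap sample
            centroids = kmeans_1d(sample, k)
            partitions.append([assign(x, centroids) for x in xs])
        scores.append(rand_index(partitions[0], partitions[1]))
    return scores

# Toy data: two well-separated groups of 20 points each.
data = [0.1 * i for i in range(20)] + [10 + 0.1 * i for i in range(20)]
scores = bootstrap_stability(data, k=2)
print(min(scores), sum(scores) / len(scores))
```

The empirical distribution of the scores, rather than a single value, is what such schemes use to describe the stability of a partition.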
Keywords
CLUSTER ANALYSIS, RESAMPLING METHODS, BOOTSTRAP, R
Institute for Applied Statistics and Computing, University of Natural Resources and Life Sciences, Vienna, Peter-Jordan-Strasse 82, 1190 Vienna, [email protected]
The Effect Of Data Generation On Our Understanding Of Clustering Algorithms
Doug Steinley1
Abstract
Often, benchmarking in clustering and classification is conducted by comparing and contrasting various algorithms and procedures on data sets with known structure via simulation. These comparisons take place at both a broad level (e.g., a full experimental design) and a narrow level (e.g., a couple of generated examples). Regardless of the approach, it is found that the evaluation of the performance of methods is closely linked to the nature of the data generation. Results are provided that quantify the “robustness” of various performance critiques based on how stable the assessment of clustering algorithms is across generation schemes.
Keywords
CLUSTERING ALGORITHMS, BENCHMARKING
University of Missouri, Columbia
Clustering Constrained Symbolic Objects Constrained By Rules
Marc Csernel1
Abstract
Clustering is one of the most common operations in data analysis, while constrained clustering is not so common. We present here a clustering method in the framework of Symbolic Data Analysis (S.D.A.) which allows Symbolic Data to be clustered. Such data can be constrained by relations between the variables, expressed by rules which encode the domain knowledge. But such rules can induce a combinatorial increase of the computation time according to their number. We will present a way to cluster such data in quadratic time. This method is based first on a decomposition of the data according to the rules, called Normal Symbolic Form; then we apply to the data a clustering algorithm based on dissimilarities.
Keywords
CLUSTERING, RULES, SYMBOLIC OBJECTS, NORMAL FORM
Inria-Rocquencourt, BP-105-78180 Le Chesnay, [email protected]
Conceptual Clustering with Interval Representation
Paula Brito1 and Géraldine Polaillon2
Abstract
In this work, we propose a hierarchical conceptual clustering method, where each formed cluster corresponds to a concept, i.e., a pair (extent, intent), based on the principles of the methods in (Brito, 1995). The method allows data presenting real or interval-valued numerical values, categorical ordered values and/or probability/frequency distributions on a set of categories to be considered simultaneously. Concepts are obtained by a Galois connection with generalisation by intervals, which allows dealing with different variable types in a common framework (see Brito and Polaillon, 2011). In the case of distributional data, the obtained concepts are more homogeneous and more easily interpretable than those obtained by using the maximum and minimum operators previously proposed (Brito and Polaillon, 2005). A measure of generality of a concept is defined similarly for all these variable types, as a weighted mean of variable-wise values. An example illustrates the proposed method.
References
BRITO, P. (1995). Symbolic Objects: Order Structure and Pyramidal Clustering. Annals of Operations Research, 55, 277–297.
BRITO, P. and POLAILLON, G. (2005). Structuring Probabilistic Data by Galois Lattices. Mathématiques et Sciences Humaines - Mathematics and Social Sciences, (43ème année) 169 (1), 77–104.
BRITO, P. and POLAILLON, G. (2011). Homogeneity and Stability in Conceptual Analysis. In: A. Napoli and V. Vychodil (Eds.): Proc. of the 8th International Conference on Concept Lattices and Their Applications. INRIA, Nancy, France, 251–263.
Keywords
CONCEPTUAL CLUSTERING, INTERVAL DATA, DISTRIBUTIONAL DATA, SYMBOLIC DATA
Faculdade de Economia & LIAAD-INESC Porto LA, Universidade do Porto, Portugal [email protected] · SUPELEC Science des Systèmes (E3S) - Département Informatique, [email protected]
Hierarchical Symbolic Cluster Analysis with Quantile Function Representation
Yusuke Matsui1, Hiroyuki Minami2, and Masahiro Mizuta2
Abstract
In symbolic data analysis, we can use various types of variables, e.g., interval valued variables, categorical multi valued variables, and distribution valued variables. Brito and Ichino (2010) proposed quantiles (or quartiles) as a powerful tool for those variables.
In this paper, we focus on their extended representation, the quantile function, and propose a hierarchical symbolic cluster analysis.
We assume each object is represented by a p-dimensional distribution. We derive a distribution valued dissimilarity between distributions. We exploit the quantile function for the distribution valued dissimilarity and develop a clustering method with the function, based on Mizuta (2011). We also demonstrate it with some example data.
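To make the quantile function representation concrete, the sketch below builds the empirical quantile function of a one-dimensional sample and compares two samples through the L2 distance between their quantile functions. This scalar distance is only an assumed stand-in for illustration; it is not the distribution valued dissimilarity of the abstract, whose definition is not given here, and the function names are hypothetical.

```python
import math

def quantile_function(sample):
    """Empirical quantile function Q(t) of a 1-D sample, for t in (0, 1]."""
    s = sorted(sample)
    n = len(s)

    def q(t):
        if not 0.0 < t <= 1.0:
            raise ValueError("t must lie in (0, 1]")
        # Smallest order statistic whose empirical CDF value reaches t.
        return s[max(math.ceil(n * t) - 1, 0)]

    return q

def quantile_distance(sample_a, sample_b, grid=1000):
    """Approximate L2 distance between two quantile functions (midpoint rule)."""
    qa, qb = quantile_function(sample_a), quantile_function(sample_b)
    ts = [(g + 0.5) / grid for g in range(grid)]
    return math.sqrt(sum((qa(t) - qb(t)) ** 2 for t in ts) / grid)

a = [1, 2, 3, 4]
b = [4, 5, 6, 7]  # the same sample shifted by 3
print(quantile_distance(a, b))
```

A matrix of such pairwise distances could then be passed to any standard agglomerative clustering routine.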
References
BRITO, P. and ICHINO, M. (2010): Symbolic clustering based on quantile representation. Proceedings of COMPSTAT2010, Paris, France.
MIZUTA, M. (2011): Hierarchical clustering for distribution valued dissimilarity data. Proceedings of Joint Conference of GfKl, DAGM and IFCS.
Keywords
SYMBOLIC DATA ANALYSIS, DATA MINING, DISTRIBUTION VALUED DISSIMILARITY
Graduate School of Information Science and Technology, Hokkaido University, [email protected] · Information Initiative Center, Hokkaido University, [email protected], [email protected]
Multilevel Consumer Preference Model on Symbolic Data
Adam Sagan1, Marcin Pełka2, and Aneta Rybicka2
Abstract
Multilevel data arise from a hierarchical and contextual data structure that comes from heterogeneous populations and complex sampling. This type of data is very popular in international, educational as well as marketing research (pupils nested in schools, individuals nested in households etc.). Multilevel modeling usually involves the classical types of data and is based on the decomposition of the covariance matrix into “within” and “between” submatrices.
The main aim of the paper is to propose a new way of analyzing structured data (multilevel-like data) and to use symbolic data in multilevel modeling of family members’ consumer preferences. Symbolic data analysis makes it possible to represent and model two types of objects: single individuals (first-level objects) and aggregate objects (super-individuals, second-level objects). This makes it possible to analyze dependencies, clusters, etc. not only at the individual level of data but also at the aggregate level. Moreover, symbolic data analysis makes it possible to represent data in a more detailed way and to keep all the information from the individual level at aggregate levels.
Keywords
PREFERENCES, SYMBOLIC DATA, MULTILEVEL DATA ANALYSIS
Cracow University of Economics, Department of Market Analysis and Marketing Research, Cracow, Poland, [email protected] · Wroclaw University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland, [email protected], [email protected]
The Variance of the Adjusted Rand Index (and other properties)
Doug Steinley1
Abstract
The variance of the adjusted Rand index (Hubert & Arabie, 1985) is provided and its properties are explored. The variance is then used to highlight the differences between two formulations of the expected value of the Rand index (Hubert & Arabie, 1985; Morey & Agresti, 1984), showing that the latter is asymptotically under-biased and its associated variance is consistently underestimated.
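For reference, the adjusted Rand index itself, in the Hubert and Arabie form with the expected value taken under fixed marginal cluster sizes, can be computed as in the sketch below. The variance derived in the talk is not reproduced here, and the function name is hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same objects."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    # Pairs of objects placed together in both, in the first, in the second partition.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # expectation under fixed margins
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (pairs_both - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))
```

The index equals 1 for identical partitions (up to relabelling) and has expectation 0 under the permutation model, so negative values are possible.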
Keywords
CLUSTERING ALGORITHMS, BENCHMARKING
University of Missouri, Columbia
Identifying Clusters in Bayesian Disease Mapping
Nema Dean1, Craig Anderson1, and Duncan Lee1
Abstract
In spatial modelling, it is often the case that, instead of individual point data, only aggregate data is available for each of a set of sub-areas for a given period. In disease modelling, the most common type of data available is a count of disease cases for particular subdivisions of the area of interest for a year. This results in population level count data rather than individual level binary outcomes. This type of data is known as areal data. In addition to the counts for each sub-area, neighbourhood information about which areas border each other is also available. One common assumption about areal data is that there is a global level of correlation across bordering areas and that the disease risk surface varies smoothly. In practice this is often not the case, with rich areas neighbouring poor areas with drastically different disease risks. This talk will discuss an adaptation of hierarchical clustering to enforce spatial contiguity when clustering log standardised incidence ratios (the ratio of observed to expected counts) in areal data. The candidate clusterings produced by the adapted hierarchical clustering will be modelled with a piecewise constant (across clusters) conditional autoregressive (CAR) hierarchical Poisson log-linear model. The best clustering model is selected using the Deviance Information Criterion. Results of the proposed approach on simulations and real data will be presented and discussed.
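The contiguity-constrained clustering step can be illustrated with a small sketch: compute log standardised incidence ratios, then run an agglomerative procedure in which only clusters that share a border may merge. The merging criterion used here (difference of cluster means) is an assumption made for illustration, as are the function names and the toy map; the CAR model and the DIC-based selection are not shown.

```python
import math

def log_sir(observed, expected):
    """Log standardised incidence ratio per area (counts assumed positive)."""
    return [math.log(o / e) for o, e in zip(observed, expected)]

def contiguous_clustering(values, neighbours, k):
    """Agglomerative clustering in which only bordering clusters may merge.

    values     -- one value (e.g. a log SIR) per area, indexed 0..n-1
    neighbours -- set of frozenset({i, j}) for areas i, j sharing a border
    k          -- number of clusters at which to stop
    """
    clusters = {i: [i] for i in range(len(values))}

    def mean(c):
        return sum(values[i] for i in clusters[c]) / len(clusters[c])

    def adjacent(c, d):
        return any(frozenset((i, j)) in neighbours
                   for i in clusters[c] for j in clusters[d])

    while len(clusters) > k:
        candidates = [(abs(mean(c) - mean(d)), c, d)
                      for c in clusters for d in clusters
                      if c < d and adjacent(c, d)]
        if not candidates:
            break  # the map has more disconnected pieces than k
        _, c, d = min(candidates)
        clusters[c].extend(clusters.pop(d))
    return sorted(sorted(members) for members in clusters.values())

# Toy map: four areas in a row, 0 - 1 - 2 - 3 (invented values).
values = [0.0, 0.1, 1.0, 0.05]
borders = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3))}
result = contiguous_clustering(values, borders, k=3)
print(result)
```

In the toy map, area 3 has a value close to areas 0 and 1 but does not border them, so it cannot join their cluster before area 2 does; an unconstrained procedure would have merged areas 0 and 3 first.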
Keywords
AREAL DATA, CLUSTERING, SPATIAL MODELLING
University of Glasgow, 15 University Gardens, Glasgow G12 8QQ, [email protected]
Classification Boundary Mapping
Yuning He1 and Herbert Lee2
Abstract
In some problems, such as a computer simulation experiment, it may be of interest to map the boundary between two classes. Having a physical understanding of the classification boundary can lead to insights about the underlying problem. Our motivating example of a flight controller simulator leads us to the use of a shape library for parameterizing the boundary, which lets us better understand when the controller will be able to stabilize an aircraft and when it could lead to catastrophic failure.
Modeling of classification is done via tree models. Taking a sequential design approach, the tree models can be updated via particle learning. The shape library is used to best model the classification boundary, with a shape set chosen that provides the best summarization, completeness, and minimality.
Keywords
COMPUTER MODEL, CLASSIFICATION TREE, ACTIVE LEARNING, MODEL SELECTION
National Aeronautics and Space Administration, Ames Research Center, Moffett Field,CA, USA [email protected] · University of California, Santa Cruz, CA, [email protected]
Deduplicating Text Records by Clustering the Results of Aggregated Conditional Classifiers
Rebecca Nugent1 and Samuel L. Ventura2
Abstract
Deduplication, or the process of linking records corresponding to unique entities within a single database, is an atypical record linkage problem. Traditional record linkage methods (e.g. Fellegi and Sunter, 1969) assume a one-to-one matching across two databases and thus cannot be trivially applied to deduplication, where each unique entity may be duplicated any number of times. Recent alternatives extend the Fellegi-Sunter approach to work with three or more databases, but these approaches are not computationally feasible for the deduplication of large databases. We explore the use of clustering approaches to identify (typically singleton or very small) clusters of records corresponding to unique entities. We calculate pairwise distances between records using a novel classification technique that conditions on informative features of record-pairs. We apply our methodology to the identification of unique inventors in the United States Patent and Trademark Office patent-inventor database and demonstrate its efficacy over alternative, more heuristic approaches.
Keywords
RECORD LINKAGE, DEDUPLICATION, AGGREGATION, RANDOM FORESTS, CLUSTERING
Department of Statistics, Carnegie Mellon University, Pittsburgh, [email protected] · Department of Statistics, Carnegie Mellon University,Pittsburgh, [email protected]
Classifications of Baseball Pitching Strategies and Exploring Effects of the New Official Balls in the Japanese Professional Baseball League
Kazunori Yamaguchi1
Abstract
Baseball is one of the favorite sports in the US and Japan. Pitching is one of the most important parts of this game. Much research has been done on baseball statistics for pitching or team offence strategies using MLB data in the US (e.g. see Tango et al. 2007, Albert and Bennett 2003, Thorn and Palmer 1985).
The NPB (Nippon Professional Baseball) organization changed the official balls in 2011. They said that the new balls were similar to the balls used in the World Baseball Classic and that they were less resilient than the balls used before 2011.
We recognized that pitchers have had advantages since the new balls were introduced, but some pitchers’ results in 2011 were not better than those in 2010. We classify all pitchers by their pitching data in 2010. Here we use the numbers of games and pitches, the average speed of fast balls, the maximum speed, the variety of pitches, the courses of pitches, and so on as the pitching data. After the classification, we look for groups in which pitchers got much better results in 2011 than in 2010, and groups in which pitchers got worse results in 2011 than in 2010, in order to explore the effects of the new balls on pitching strategies.
All data sets for this research are provided by Data Stadium Inc.
References
ALBERT, J. and BENNETT, J. (2003): Curve Ball: Baseball, Statistics, and the Role of Chance in the Game. Springer.
TANGO, T., LICHTMAN, M. and DOLPHIN, A. (2007): The Book: Playing the Percentages in Baseball. Potomac Books.
THORN, J. and PALMER, P. (1985): The Hidden Game of Baseball. Doubleday, New York.
Keywords
BASEBALL, CLUSTER ANALYSIS, PITCHING STRATEGIES
College of Business, Rikkyo University, Tokyo 171-8501 [email protected]
Life Long Learning Idea on Background of Poles’ Needs
Marta Dziechciarz-Duda1 and Klaudia Przybysz2
Abstract
The rate of economic change and the aging of the population have made it necessary to give priority to lifelong learning (see for example the Lisbon Strategy). The study conducted concerned the demand for courses, training and certifications in Poland.
This article aims to analyze the educational needs reported by respondents of working age. Our study contains a classification of respondents based on sex, education level, and type of occupation in relation to whether they declare such needs. The research also covered the type of courses undertaken and the respondents’ assessment of their usefulness for further professional life. The proposed approach to this issue may serve as a component of a multidimensional analysis of the situation on the labor market and help to identify the factors determining the attractiveness of potential employees from the point of view of employers’ needs.
Keywords
LIFE LONG LEARNING, MULTIDIMENSIONAL ANALYSIS, LABOR MARKET
Wrocław University of [email protected] ·Wrocław University of [email protected]
Migration Of Population - The Analysis With The Use Of Log-Linear Models
Justyna Brzezinska
Abstract
Log-linear analysis is a widely used tool for the independence analysis of qualitative data in multi-way contingency tables. Cell counts are Poisson distributed and all variables are treated as responses. Log-linear models, in which interaction terms are included, make it possible to examine various types of association (conditional independence, partial association, complete independence, homogeneous association). In log-linear analysis we model cell counts in terms of associations among variables and marginal frequencies. For testing the goodness of fit, the likelihood ratio test and the information criteria AIC [Akaike 1973] and BIC [Raftery 1986] are used. The advantages of this method are that we can use several plots for visualizing the contingency table, we can analyze any number of categorical variables, and we include interactions in the model equation. The use of log-linear analysis will be presented on data on the migration of population in Poland in 2011 reported by the Central Statistical Office. All calculations will be conducted in R with the use of the loglm function in the MASS library.
Keywords
LOG-LINEAR MODELS, MULTI-WAY CONTINGENCY TABLES, MIGRATION OF POPULATION IN POLAND
Faculty of Management, University of Economics in Katowice, 1 Maja 50, 40–287 Katowice, [email protected]
The Influence of Emotion Recognition and Academic Performance on Group Popularity
Ivan Loredana
Abstract
This study analyzed the influence of academic grades and emotion recognition on the way social relations are structured within a relatively large group of college students (N = 154). Using DANVA-2 to assess individual differences in emotion accuracy and peer nomination procedures, we investigated the relative contribution of positive and negative emotions to three popularity dimensions: visibility, social interaction, and friend nominations. Compared with studies on children and adolescents, grades had a marginal effect on popularity only when friendship ties were considered. Furthermore, the accuracy in decoding facial expressions of emotions was negatively correlated with the number of friendship nominations, particularly for sadness. In the case of happiness we found a positive relation between the accuracy of decoding using body items and students’ level of interaction. The results are discussed in the light of functionalist emotion theories.
Keywords
EMOTION DECODING, GROUP POPULARITY, ACADEMIC PERFORMANCE
National School of Political and Administrative Studies, Povernei 6, Bucharest [email protected]
Hierarchical Classes Analysis vs. Formal Concept Analysis
Bernhard Ganter and Cynthia V. Glodeanu
Abstract
Hierarchical Classes Analysis (HCA) is a discrete, categorical data analysis method developed for applications in personality organisation and implicit belief systems. The technique as well as its generalisations for three-way, numerical and ordinal data were successfully applied in different clinical studies.
Formal Concept Analysis (FCA) is an instrument for data analysis based on lattice theory. Amongst other things, FCA represents the whole information contained in a data set by means of so-called formal concepts. These are understood as units with a conceptual extent and a conceptual intent. The extent contains all the objects shared by the attributes from its intent. The dual holds for the intent. Recently, an approach using formal concepts, Boolean factorisations, was discussed that produces the smallest possible number of factors in a sense similar to Factor Analysis.
We show that HCA and Boolean factorisations coincide for binary and three-way data. Moreover, we discuss how this connection allows the two methods to benefit from each other. New doors for the application of Boolean factorisations are opened by HCA. The latter gains structural explanations, graphical representations and algorithmic issues. Further, we propose the modelling of fuzzy, i.e., vague data, within the framework of HCA.
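The notion of a formal concept used above, a pair (extent, intent) in which each part determines the other, can be made concrete with a brute-force sketch for a small binary context. The function name and the toy context are hypothetical; the enumeration is exponential in the number of objects and is meant only as an illustration, not as the Boolean factorisation method itself.

```python
from itertools import combinations

def formal_concepts(context):
    """All formal concepts (extent, intent) of a binary context.

    `context` maps each object to the set of attributes it has.
    """
    objects = sorted(context)
    attributes = sorted(set().union(*context.values()))

    def intent_of(objs):
        # Attributes shared by all objects in objs.
        return frozenset(a for a in attributes
                         if all(a in context[o] for o in objs))

    def extent_of(attrs):
        # Objects that have all attributes in attrs.
        return frozenset(o for o in objects if attrs <= context[o])

    concepts = set()
    # Closing every object subset yields every concept (toy-sized contexts only).
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts

# Hypothetical toy context: two objects, two attributes.
toy = {"a": {"x", "y"}, "b": {"y"}}
for extent, intent in sorted(formal_concepts(toy), key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

Each concept corresponds to a maximal rectangle of ones in the binary table, which is the object that Boolean factorisations use as a factor.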
Keywords
FACTOR ANALYSIS, NON-METRIC ANALYSIS, DATA REDUCTION
Institute of Algebra, TU Dresden, 01062 Dresden, Germany {Bernhard.Ganter,Cynthia.Glodeanu}@tu-dresden.de
The Diversity of Pattern Structures in Formal Concept Analysis
Aleksey Buzmakov1, Sergei O. Kuznetsov2, and Amedeo Napoli3
Abstract
Pattern structures [3] provide an extension of Formal Concept Analysis (FCA [4]) for dealing with complex data. They are based on a triple (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a semi-lattice of descriptions, and δ is a mapping associating an object with a description. The similarity operation ⊓ induces a subsumption relation ⊑ in (D, ⊓) such that c ⊓ d = c iff c ⊑ d.
In this presentation, we would like to discuss the diversity and the capabilities of pattern structures in various applications. Pattern structures are used under many forms, e.g. numbers and intervals [5], graphs [3], strings and sequences [1], and ontology elements [2]. Moreover, the so-called projections are mathematical functions respecting some properties and reducing the computational costs and the volume of resulting patterns. Accordingly, the pattern concept lattice can then be navigated and more easily interpreted by domain experts.
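For the interval case mentioned above, the triple (G, (D, ⊓), δ) can be sketched in a few lines, assuming the usual choice in which the similarity of two interval vectors is their componentwise convex hull. The function names and the toy descriptions are hypothetical.

```python
from functools import reduce

def meet(c, d):
    """Similarity c ⊓ d of two interval vectors: the componentwise convex hull."""
    return tuple((min(a1, a2), max(b1, b2))
                 for (a1, b1), (a2, b2) in zip(c, d))

def subsumed(c, d):
    """c ⊑ d, defined through the meet: c ⊓ d = c."""
    return meet(c, d) == c

def describe(objects, delta):
    """Common description of a non-empty set of objects: the meet of their descriptions."""
    return reduce(meet, (delta[g] for g in objects))

def extent(description, delta):
    """All objects g whose description satisfies description ⊑ delta[g]."""
    return {g for g in delta if subsumed(description, delta[g])}

# Hypothetical objects described by two numerical attributes (degenerate intervals).
delta = {
    "g1": ((5, 5), (7, 7)),
    "g2": ((6, 6), (8, 8)),
    "g3": ((4, 4), (8, 8)),
}
d = describe({"g1", "g2"}, delta)
print(d, sorted(extent(d, delta)))
```

With this choice of ⊓, wider intervals are the more general descriptions: the hull of g1 and g2 is subsumed by both of their descriptions, but not by that of g3, whose first value lies outside the hull.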
References
1. A. BUZMAKOV, E. EGHO, S.O. KUZNETSOV, A. NAPOLI, AND C. RAÏSSI. String Pattern Structures in FCA – An Application to Sequential Data Analysis, 2013. (Submitted.)
2. A. COULET, F. DOMENACH, M. KAYTOUE, AND A. NAPOLI. Using pattern structures for analyzing ontology-based annotations. In Proceedings of ICFCA 2013, Springer LNCS, 2013.
3. B. GANTER AND S.O. KUZNETSOV. Pattern structures and their projections. In Proceedings of ICCS, LNCS 2120, pages 129–142, 2001.
4. B. GANTER AND R. WILLE. Formal Concept Analysis. Springer, 1999.
5. M. KAYTOUE, S.O. KUZNETSOV, AND A. NAPOLI. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In Proceedings of IJCAI, pages 1342–1347, 2011.
Keywords
FORMAL CONCEPT ANALYSIS, PATTERN STRUCTURES, PROJECTION, CLASSIFICATION
LORIA (CNRS – Inria Nancy – U. de Lorraine) [email protected] · HSE [email protected] · LORIA (CNRS – Inria Nancy – U. de Lorraine) [email protected]
Decision Aiding Software And Consensus Theory
Florent Domenach1 and Ali Tayari
Abstract
There is a variety of approaches, solutions, and methods for constructing and deriving a consensus from a selection of phylogenetic trees, whether you consider the case where the trees share the same set of taxa or the case of supertree methods. Despite the number of existing consensus functions, practitioners often use a selected few, either because they are not aware of other existing functions or because they do not know which one(s) would be suitable. In order to tackle this problem, DASACT (Decision Aiding Software for Axiomatic Consensus Theory) has been developed to guide users in their choice depending on a series of axiomatic properties.
DASACT is based on a previously written paper (Domenach and Tayari 2013) which uses an exhaustive approach in order to examine the structural relationship (a concept lattice) among a series of axiomatic properties and consensus functions. This lattice is used to determine the relevance (determined using a variety of distance functions) of consensus functions with respect to the desired constraints (axiomatic properties) set by the user. It then provides the consensus trees for users to compare and to choose the most appropriate one.
References
DAY, W.H.E. and MCMORRIS, F.R. (2003): Axiomatic Consensus Theory in Group Choice and Biomathematics. SIAM, Philadelphia.
DOMENACH, F. and TAYARI, A. (2013): Implications of Axiomatic Consensus Properties. In: Lausen, B., van den Poel, D., and A. Ultsch (Eds.): Algorithms from & for Nature and Life, Studies in Classification, Data Analysis, and Knowledge Organization, Springer-Verlag GmbH, Heidelberg (to appear).
GANTER, B. and WILLE, R. (1999): Formal Concept Analysis: Mathematical Foundations. Springer.
Keywords
CONSENSUS GENERATION, CONSENSUS THEORY, CONCEPT LATTICE, PHYLOGENETIC TREE
Computer Science Department, University of Nicosia, 46 Makedonitissas Ave., PO Box 24005, 1700 Nicosia, [email protected]
Experimental Comparison of Some Triclustering Algorithms
Dmitry V. Gnatyshak, Dmitry I. Ignatov, and Sergei O. Kuznetsov
Abstract
In this talk we show the results of an experimental comparison of five triclustering algorithms on real-world and synthetic data in terms of resource efficiency and four quality measures. We also provide an interpretation of the results for the analyses of real-world datasets.
The talk is organised as follows. In part 1 we give the main definitions and describe the triclustering methods selected for comparison. Part 2 describes all the experiments and their results along with specially introduced quality measures. Part 3 concludes the talk and indicates some directions for further research.
Keywords
FORMAL CONCEPT ANALYSIS, TRICLUSTERING, TRIADIC DATA, DATA MINING
National Research University Higher School of [email protected]
A Framework For Modeling Covariances
Age K. Smilde, M.E. Timmerman, H.C.J. Hoefsloot, J.J. Jansen, and E. Saccenti
Abstract
In modern functional genomics it is more rule than exception that multiple data tables (groups) are collected in a study pertaining to the same organism. In such cases it is worthwhile to analyze all data tables simultaneously to have a global view of the biological system. This is the area of “data fusion”, which is a lively research topic in bioinformatics. Most methods of analyzing such complex data focus on group means, treatment effects or time courses. However, considerable information may be present in the covariances within a group, since this relates directly to individual differences and heterogeneity of responses of the biological system to a perturbation. Hence, the methodology to study such covariances - and their changes upon treatment or in time - deserves attention in computational biology.
We will present a framework for modeling such covariances encompassing several already existing methods. Moreover, we will present a new method coined Combination Simultaneous Component Analysis (COMSCA) which also fits in this framework. COMSCA aims to model differences in covariance matrices in terms of a few low rank prototypical component matrices. The method is illustrated with real-life examples from time-resolved metabolomics data.
Keywords
INDSCAL, COMSCA, IDIOSCAL, Common Principal Component, Covariances
Age K. Smilde · M.E. Timmerman
Heymans Institute, University of Groningen, The Netherlands
H.C.J. Hoefsloot · E. Saccenti
Biosystems Data Analysis, University of Amsterdam, The Netherlands
J.J. Jansen
Institute for Molecules and Materials, Radboud University Nijmegen, The Netherlands
Biadditive Models, Alternative Estimation Procedures And Better Biplots
Fred A. van Eeuwijk1, Gerrit Gort1, Sabine K. Schnabel1, and Paul H.C. Eilers1,2
Abstract
Biadditive models are a useful model class for investigating interactions in two-waytables. An area where biadditive models are popular is plantbreeding and genetics,where sets of genotypes are evaluated across a range of environmental conditions, withthe results being summarized in two-way tables of genotype by environment (GxE)means. For GxE tables, various biadditive models have been proposed, like the Finlay-Wilkinson model (Yi j = µ +Gi +βiE j +εi j ), the additive main effects and multiplicativeinteractions model (Yi j = µ +Gi +E j +∑k γkiδk jεi j ), and the GGE or PCA model (Yi j =µ +E j +∑k γkiδk jεi j ).
For the estimation of parameters in biadditive models, least squares procedures are a common choice. However, inference in a least squares framework offers limited possibilities. We investigate Bayesian and penalized regression methods and discuss their possibilities.
For the interpretation of bilinear model fits, biplots, in which genotypes and environments are assigned coordinates on the basis of their bilinear parameters, are an important tool. Surprisingly, biplots often lack clarity and attractiveness. We propose a number of cosmetic improvements.
Genotypes in the centre of biplots are less interesting, in contrast to those further away. The convex hull has been used to identify the most extreme genotypes. So-called alpha-bags are a generalization; they aim at a hull that contains a chosen percentage of the genotypes. They are hard to compute and visually not very attractive. As a quick and pleasing alternative we present expectile hulls, based on asymmetric least squares.
The convex hull is useful for the identification of groups of environments, mega-environments, which elicit comparable adaptations in genotypes. We extend this idea to expectile hulls.
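The asymmetric least squares idea behind expectiles can be illustrated in one dimension. The sketch below is only an assumption-laden illustration: the function name `expectile` and its parameters are hypothetical, and the construction of the hulls themselves from biplot coordinates is not described in the abstract and is not reproduced here.

```python
import numpy as np

def expectile(y, tau=0.5, n_iter=100, tol=1e-10):
    """Sample tau-expectile of y by iteratively reweighted least squares.

    Observations above the current estimate get weight tau, the others
    weight 1 - tau; the estimate is then the weighted mean.
    """
    y = np.asarray(y, dtype=float)
    m = y.mean()
    for _ in range(n_iter):
        w = np.where(y > m, tau, 1.0 - tau)
        m_new = np.sum(w * y) / np.sum(w)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m
```

With `tau=0.5` the weights are symmetric and the result is the ordinary mean; values of `tau` closer to 1 move the estimate towards the upper extreme of the data.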
Keywords
SVD, EXPECTILES, ASYMMETRIC LEAST SQUARES
Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands fred.vaneeuwijk|gerrit.gort|[email protected] · Erasmus University Medical Center, Rotterdam, The [email protected]
Triadditive Models for Three-way Tables
John C. Gower1, Casper J. Albers2, and Steffen Unkel1
Abstract
In this presentation we are concerned with three-way tables. Essentially, our approach is to adopt the usual linear models for representing main effects, two-factor interactions and three-factor interactions. Just as two-factor interactions may be approximated by multiplicative bilinear terms, three-factor interactions may be approximated by multiplicative trilinear terms. In the bilinear case the approximations have standard least-squares estimates based on singular value decompositions, but in the trilinear case, we propose that the estimates be conditioned on the residuals from the bilinear model. In principle, it would be possible to do a full unconditional least-squares solution but the conditional approach is easier and avoids difficulties with constraints. In the bilinear case identification constraints are not substantive but in the full trilinear case there is a troubling substantive interaction between the bilinear and trilinear parameter constraints. This problem is avoided when using the conditional method of analysis, and the CANDECOMP algorithm may be applied directly to the conditioned residuals.
A special virtue of bilinear models is the way that they lend themselves to simple biplots for visualising the interactions between rows and columns of the two classifying factors. This is particularly useful when bilinear interactions are adequately approximated in two dimensions. It would be helpful if similar visualisations were available for triadditive interactions. We have made some progress in deriving triplots for rank-two tridimensional interaction tables. For each factor, this gives points in two dimensions displayed on three orthogonal surfaces. Each of the three faces may be shown separately and attempts can be made to show the full three-dimensional visualisation.
We give some preliminary results for the ranks of trilinear interaction tables. These are special tables, as all their main and two-way margins are null. However, it is not clear to us that, apart from mathematical interest, trilinear rank has any particular use from the point of view of data analysis. As with bilinear approximation, degree of approximation is more important than rank per se.
Keywords
INTERPRETING INTERACTION, VISUALISING INTERACTION
Department of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United [email protected]/[email protected] · Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands, [email protected]
Three-way Candecomp/Parafac And The Diverging Components Problem
Alwin Stegeman1
Abstract
Three-way Candecomp/Parafac (CP, also known as Canonical Polyadic Decomposition) can be viewed as a three-way generalization of the matrix SVD (or PCA). Finding a best-fitting CP decomposition with R components to a given three-way array Z is equivalent to finding a best rank-R approximation to Z. The CP decomposition consists of R rank-1 arrays, where each rank-1 array is the outer vector product of three vectors. Contrary to PCA, a CP decomposition is rotationally unique under mild conditions. However, in many cases a best-fitting CP decomposition may not exist (when R ≥ 2). Trying to compute a best-fitting CP decomposition then results in diverging components: some (groups of) rank-1 terms become nearly identical up to sign and arbitrarily large in magnitude. To avoid this problem, several constraints can be imposed (orthogonality, nonnegativity). A different approach is to obtain the limit point of the CP-sequence featuring diverging components (Stegeman, 2012). The decomposition of the limit point is more general than CP and its form can be inferred from the diverging CP-sequence. This decomposition form is then fitted to the data Z using initial values computed from the diverging CP-sequence. For a well-studied three-way dataset of ratings of TV shows (15 TV shows by 16 rating scales by 30 raters) it is shown that the decomposition of the limit point has a clear and intuitive interpretation.
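The notion of diverging components can be made concrete with a small numerical sketch. The code below builds a CP array as a sum of rank-1 outer products and then uses a well-known two-term construction (not taken from the abstract) in which both terms grow without bound while their sum approaches a fixed array. The function names `cp_reconstruct`, `limit_target` and `diverging_pair` are hypothetical, and no fitting algorithm is implemented.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Sum of R rank-1 arrays; column r of A, B, C holds the three vectors."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def limit_target():
    """Array a∘a∘b + a∘b∘a + b∘a∘a for the unit vectors a and b in R^2."""
    a = np.array([1.0, 0.0])
    b = np.array([0.0, 1.0])
    outer3 = lambda x, y, z: np.einsum('i,j,k->ijk', x, y, z)
    return outer3(a, a, b) + outer3(a, b, a) + outer3(b, a, a)

def diverging_pair(n):
    """Two rank-1 terms, n*(a+b/n)^{∘3} and -n*a^{∘3}, that nearly cancel."""
    a = np.array([1.0, 0.0])
    b = np.array([0.0, 1.0])
    u = a + b / n
    A = np.column_stack([u, a])
    B = np.column_stack([u, a])
    C = np.column_stack([n * u, -n * a])   # magnitudes grow with n
    return cp_reconstruct(A, B, C)

for n in (10.0, 100.0, 1000.0):
    err = np.linalg.norm(diverging_pair(n) - limit_target())
    print(f"n = {n:7.1f}  approximation error = {err:.5f}")
```

As `n` grows, the two components become nearly identical up to sign and arbitrarily large, yet the approximation error keeps shrinking, which is the behaviour the abstract describes.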
References
STEGEMAN, A. (2012): Candecomp/Parafac: from diverging components to a decomposition in block terms. SIAM Journal on Matrix Analysis and Applications, 33, 291–316.
Keywords
CANDECOMP, PARAFAC, TENSOR DECOMPOSITION, LOW RANK APPROXIMATION, DIVERGING COMPONENTS
Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The [email protected]
Cluster-weighted t-factor Analyzers for Clustering of High-dimensional Data
Sanjeena Dang1, Antonio Punzo2, Salvatore Ingrassia3, and Paul D. McNicholas4
Abstract
Cluster-weighted modelling (CWM) is a flexible statistical framework for modelling local relationships in heterogeneous populations on the basis of weighted combinations of local models. We will extend cluster-weighted models to include an underlying latent factor structure of the independent variable, resulting in a family of parsimonious cluster-weighted t-factor analyzers (CWtFA). This provides the model with the flexibility of clustering high-dimensional data. An expectation-maximization framework along with the Bayesian information criterion (BIC) will be used for parameter estimation and model selection. The approach is illustrated on simulated data sets as well as a real data set.
Keywords
CLUSTER-WEIGHTED MODELS, FACTOR ANALYZERS, HIGH-DIMENSIONAL DATA
Department of Mathematics and Statistics, University of Guelph, ON, [email protected] · Department of Economics and Business, University of Catania, Catania, [email protected] · Department of Economics and Business, University of Catania, Catania, [email protected] · Department of Mathematics and Statistics, University of Guelph, ON, [email protected]
Cluster-Weighted Modeling For Time To Event Data
Utkarsh J. Dang1 and Paul D. McNicholas2
Abstract
We implement a mixture of accelerated failure time models for a competing risks situation in a cluster-weighted modeling (CWM) framework. CWM models the joint probability of data arising from a population of sub-populations using combinations of local models. Both reliability and survival models analyze data on time to some event of interest, in the engineering and medical fields respectively. Here, we present a novel approach to mixture group estimation and classification for time to event data. Finally, we present our results on some simulated and real data where the time to failure and cause of failure were recorded for only some of the observations.
Keywords
CLUSTER-WEIGHTED MODELING, ACCELERATED FAILURE TIME MODEL, COMPETING RISKS, EM-ALGORITHM
University of Guelph, 50 Stone Road East, Guelph, Ontario, N1G 2W1, [email protected] · University of Guelph, 50 Stone Road East, Guelph, Ontario,N1G 2W1, [email protected]
Modeling Bivariate Mixed-Type Data with the Generalized Linear Exponential Cluster-Weighted Model
Salvatore Ingrassia1 and Antonio Punzo2
Abstract
In the mixture with random covariates modeling frame, the recently proposed generalized linear Gaussian cluster-weighted model (CWM) allows for flexible clustering and density estimation of a random vector composed of a response variable and a set of covariates. In each mixture component, while the covariates are assumed to have a real-valued support and are modeled by a Gaussian density, various supports are allowed for the response variable as conceived in the exponential family. For bivariate data, this paper presents the generalized linear exponential CWM. It extends the generalized linear Gaussian CWM by applying an exponential family distribution to the response variable too. This gives the possibility of modeling bivariate data of mixed type. The natural counterparts, in the frames of mixture models with fixed covariates and latent class models, are also defined and compared with the generalized linear exponential CWM. Maximum likelihood parameter estimates are derived using the EM algorithm and model selection is carried out using the Bayesian information criterion (BIC). Artificial and real data are finally considered to exemplify and appreciate the proposed model.
References
INGRASSIA, S., MINOTTI, S. C., and VITTADINI, G. (2012): Local statistical modeling via the cluster-weighted approach with elliptical distributions. Journal of Classification, 29(3), 363–401.
INGRASSIA, S., MINOTTI, S. C., PUNZO, A., and VITTADINI, G. (2012): Generalized linear Gaussian cluster-weighted modeling. arXiv.org e-print 1211.1171, available at: http://arxiv.org/abs/1211.1171.
Keywords
CLUSTER-WEIGHTED MODELS, LATENT CLASS MODELS, MIXED-TYPE DATA, EXPONENTIAL FAMILY DISTRIBUTIONS
Dipartimento di Economia e Impresa - Università di Catania (Italy)[email protected] · Dipartimento di Economia e Impresa - Università di Catania (Italy)[email protected]
Cluster Inference using Modes
Surajit Ray
Abstract
Li, Ray and Lindsay (2007) proposed the method of modal clustering, which identifies the local mode reached from any starting point on the basis of kernel density estimates and then clusters together the data points that converge to the same mode. Assessing the number of clusters after modal clustering has received little consideration. Ray and Lindsay (2005) introduced the ridgeline manifold, which captures the ridgeline between two modes and finds the antimode, defined as the point on the ridgeline with the lowest density between them. In this work, we propose two tests of modal significance based on the ridgeline manifold. The first is a paired test: each data point contributes to the kernel density heights at the mode and at the antimode, and we use the paired t-test statistic as the test statistic. The second method considers the ratio of the density height at the antimode to that at the mode with the lower density. We choose the uniform distribution as the reference distribution and simulate the empirical distribution of both the paired test statistic and the ratio statistic. We also find that the empirical distribution of the paired test statistic is close to that of the t-statistic when the sample size is large.
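The basic modal clustering step, ascending the kernel density estimate from each point and grouping points that reach the same mode, can be sketched as below. This is a simplified mean-shift style iteration with a Gaussian kernel and a single bandwidth `h`, given as an assumption for illustration; the function names are hypothetical, and neither the authors' algorithm nor the ridgeline-based significance tests are reproduced.

```python
import numpy as np

def find_mode(x0, data, h, n_iter=500, tol=1e-8):
    """Ascend a Gaussian kernel density estimate starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((data - x) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / h ** 2)      # kernel weights of all points
        x_new = w @ data / w.sum()          # weighted mean = next iterate
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clusters(data, h, decimals=3):
    """Label each point by the (rounded) mode its ascent converges to."""
    data = np.asarray(data, dtype=float)
    modes = np.array([find_mode(x, data, h) for x in data])
    _, labels = np.unique(np.round(modes, decimals), axis=0,
                          return_inverse=True)
    return labels.ravel(), modes
```

Rounding the converged modes before comparing them is a crude device for deciding that two ascents ended at the same mode; a distance threshold would serve equally well.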
IFCS Presidential Address
Classipedia: A Road Map to Help Traverse the Classification Jungle
Iven Van Mechelen
Abstract
As a research domain, clustering and classification is alive and kicking. Yet, the available clustering models, algorithms, and data analysis techniques, in their entirety, form an inconvenient and intricate jungle. This is a most problematic obstacle for researchers in classification, for students who want to familiarize themselves with the domain, and for applied researchers who are on the lookout for suitable clustering methods to address substantive problems at hand. Within the IFCS, we want to work out a way to overcome this obstacle. The proposed solution takes the form of a road map for the clustering domain: Classipedia. In this talk, I will introduce the aim of the Classipedia project, the guiding questions and conceptual distinctions that constitute its conceptual backbone, and a first blueprint of its architecture. I will conclude by clarifying how the further development of this blueprint constitutes a thrilling challenge for the IFCS community as a whole.
Keywords
CLUSTERING, CONCEPTUAL FRAMEWORK, CLASSIPEDIA
University of Leuven, Tiensestraat 102 box 3713, 3000 Leuven, [email protected]
A Restricted ADCLUS Type Model for Transition Matrices
Tadashi Imaizumi1
Abstract
Product positioning analysis of given brands is very useful for understanding the consumer market. However, what we need to know is not the positioning of each product, but the positioning of the categories to which products belong. The ADCLUS type model (Arabie, Carroll and DeSarbo, 1981) is a useful model and method in product positioning and market segmentation. Let $F(t)$ be a given transition frequency matrix with $n(t-1)$ rows and $n(t)$ columns,

$$f_{ij}(t) \approx \hat{f}_{ij} = \sum_{k=1}^{m} w_k \, p_{ik}(t) \, q_{jk}(t),$$

where $w_k$ represents the salience of the property or the category $k$, and

$$p_{ik} = \begin{cases} 1, & \text{if object } o_i \text{ has property or belongs to the category } k \\ 0, & \text{otherwise.} \end{cases}$$
We extend the above ADCLUS type model to $T$ successive transition matrices in order to find common categories. Let $F(1), F(2), \ldots, F(T)$ be $T$ transition matrices. We assume that the columns of $F(t)$ are the same as the rows of $F(t+1)$. This means that $q_{ik}(t) = p_{ik}(t+1)$. We will estimate the $T$ sets $\{P(t), W(t), m\}$ and $Q(T)$ under this restriction. We adopt an optimization procedure which minimizes

$$\sum_{t=1}^{T} \Big[ f_{ij}(t) - \sum_{k=1}^{m_t} w_k \, p_{ik}(t) \, q_{jk}(t) \Big]^2 + \sum_{t=1}^{T-1} \lambda_t \big[ q_{ik}(t) - p_{ik}(t+1) \big]^2 .$$
References
ARABIE, P., CARROLL, J., and DESARBO, W. (1981): Overlapping clustering: A new method for product positioning. Journal of Marketing Research, 18, 310-317.
Keywords
PRODUCT POSITIONING, CATEGORY
Tama University, 4-1-1 Hijirigaoka, Tama-shi, Tokyo, JAPAN, [email protected]
Clustering Of Time Series Via A Segmentation Approach
Christian Derquenne1
Abstract
The similarity between time series can be seen under two main aspects: shape and level of the curve. But it can also be interesting to discover similar behaviors over a portion of time, to detect a break point at a given instant, or to analyze the links (linear, polynomial, ...) between two curves. Segmentation is a potential aid for summarizing a time series in segments. Each segment has a homogeneous behavior that can be compared with segments of another time series. Many segmentation methods have been developed (Lavielle et al., 2006) and these can be used to cluster curves (Hébrail et al., 2010). We have developed a segmentation method based on an exploratory approach (Derquenne, 2011, 2012) which has given very good results on simulated data and applications. In this paper, we introduce new similarity indexes based on our segmentation method, then we use these to cluster time series. Furthermore, this clustering allows us to identify the characteristics of the time series belonging to each cluster with respect to the properties of similarity or dissimilarity. Lastly we propose some potential research directions in the domain of structural equation models for multivariate time series and forecasting models.
References
DERQUENNE, C. (2011): An Explanatory Segmentation Method for Time Series, in Proc. of Compstat 2010, Y. Lechevallier & G. Saporta (eds.), 935–942.
DERQUENNE, C. (2012): Meta-segmentation of time series for searching a better segmentation, in Proc. of Compstat 2012, Limassol, Cyprus, 191–204.
HÉBRAIL G., HUGUENEY B., LECHEVALLIER Y. and ROSSI F. (2010): Exploratory analysis of functional data via clustering and optimal segmentation. Neurocomputing 73(7-9): 1125–1141.
LAVIELLE, M. and TEYSSIÈRE, G. (2006): Détection de ruptures multiples dans des séries temporelles multivariées. Lietuvos Matematikos Rinikinys, Vol 46.
Keywords
TIME SERIES, SIMILARITY, CLUSTERING, SEGMENTATION
Electricité de France - Research and Development - 1, av. du Général de Gaulle - 92141 Clamart Cedex - France [email protected]
Looking For A Best Compromise Between The Ultrametric Supremum-Norm Approximations
B. Fichet
Abstract
All ultrametric Lp-norm approximations are well-known to be NP-hard, except the L∞-norm (supremum-norm) one, as shown by Farach et al. (1995). The authors provided an algorithm to get a solution. Later on, Chepoi et al. (2000) established, in a general context, the link with the subdominant ultrametric approximation, showing that the greatest L∞-norm solution derives from a simple translation of the subdominant.
Similar results hold for some upperminimal ultrametrics, but not all of them. Fichet (2012) gave an algorithm to build those appropriate approximations, yielding minimal L∞-norm solutions by translation, hence interval solutions given by them and the subdominant. Then, following Chepoi et al. (2000), an optimal consensus may summarize any interval.
In this talk, we try to improve such a compromise. We focus on the choice of the upperminimal ultrametric approximation, with the aim to get a structure similar to the one of the subdominant, hence of the compromise, for instance having similar preordonnances (linear interpoint distance preorders), similar tree-representations or common compatible (Robinsonian) order. We will discuss and justify those approaches, through the existence of solutions and our ability to compute them.
References
FARACH, M., KANNAN, S. and WARNOW, T. (1995): A Robust Model for Finding Optimal Evolutionary Trees. Algorithmica, 13, 155-179.
CHEPOI, V. and FICHET, B. (2000): l∞-Approximation via Subdominants. Journal of Mathematical Psychology, 44, 600-616.
FICHET, B. (2012): Intervals as Ultrametric Approximations According to the Supremum Norm. In: M. Deza, M. Petitjean and K. Markov (Eds.): Mathematics of Distances and Applications. ITHEA, Sofia, 147.
Keywords
ULTRAMETRIC, SUBDOMINANT ULTRAMETRIC, UPPERMINIMAL ULTRAMETRICS, SUPREMUM-NORM APPROXIMATIONS.
LIF. Aix-Marseille University. 163 Avenue de Luminy. Case 901. F-13288 Marseillecedex [email protected]
Ultrametric Tree Representation For Three-Way Three-Mode Data With Weights Of Variables And Occasions
Kensuke Tanioka1 and Hiroshi Yadohisa2
Abstract
Three-way three-mode data is defined as $\mathbf{X} \in \mathbb{R}^{|I| \times |J| \times |K|}$, where $I$, $J$, and $K$ represent a set of objects, variables, and occasions, respectively, and $|\cdot|$ is the cardinality of a set. Here, we focus on three-way three-mode data which consist of a set of multivariate data among various occasions for the same objects and variables. Such data, not to be confused with three-way three-mode proximity data, are commonly observed in panel or psychological research. In panel research, $I$, $J$, and $K$ represent a set of participants, questions, and years, respectively. When a classification structure of $I$ is calculated from the three-way three-mode data, the masking variables and occasions, which possess no classification structure, affect the clustering structure. Milligan (1980) showed the effects of masking variables in two-way two-mode data by conducting Monte Carlo simulations. These effects are also expected to occur in three-way three-mode data.
In this paper, we propose three-way three-mode hierarchical clustering on the basis of the least squares criterion for weighting variables and occasions. Specifically, we extend the method of De Soete (1985) to three-way three-mode data. The method can account for the effects of masking variables and occasions by adding weights to variables and occasions.
References
De SOETE, G., DESARBO, W.S., and CARROLL, J.D. (1985): Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm, Journal of Classification, 2, 173–192.
MILLIGAN, G. W. (1980): An Examination of the effect of six types of error perturbation on fifteen clustering algorithms, Psychometrika, 45, 325–342.
Keywords
ALS, MASKING VARIABLES, MASKING OCCASIONS
Graduate school of Culture and Information Science, Doshisha [email protected] · Department of Culture and Information Science,Doshisha [email protected]
Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras1, Fionn Murtagh1, and Javier Pereira2
Abstract
In previous work we have shown how an ultrametric (Murtagh et al., 2008; Pereira et al., 2010; Contreras et al., 2012) can be used to create hierarchical clusters in constant algorithmic time. In particular we make use of the Baire metric, or longest common prefix, to construct our classification trees. Sometimes, when a technique to reduce the data dimensionality was needed, we opted to project the data randomly to one dimension (Murtagh et al., 2008).
Our aim in this work is to show how the Baire metric can be used to classify, match and retrieve categorical data. We demonstrate this by creating a movie recommendation system based on the Baire metric and using the MovieLens dataset (http://www.grouplens.org/node/73).
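The longest common prefix idea can be sketched in a few lines. The version below assumes items are encoded as equal-length digit strings and uses base 10, so that two items sharing their first k digits are at distance 10^(-k); the function names and the encoding are assumptions for illustration only, and the recommendation system itself is not reproduced.

```python
from collections import defaultdict

def common_prefix_length(x, y):
    """Number of leading positions on which the two strings agree."""
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return k

def baire_distance(x, y, base=10.0):
    """Longest-common-prefix (Baire) distance between two digit strings."""
    if x == y:
        return 0.0
    return base ** (-common_prefix_length(x, y))

def prefix_buckets(items, depth):
    """One level of the hierarchy: group items by their first `depth` digits."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item[:depth]].append(item)
    return dict(buckets)
```

Calling `prefix_buckets` with increasing `depth` yields successively finer partitions in a single pass over the items each time, without any pairwise distance computations.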
References
CONTRERAS, P. and MURTAGH, F. (2012): Fast, Linear Time Hierarchical Clustering Using the Baire Metric. In: Journal of Classification, 29(2):118–143.
MURTAGH, F., DOWNS, G. and CONTRERAS P. (2008): Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding. In: SIAM Journal on Scientific Computing, 30(2):707–730.
PEREIRA, J., SCHMIDT, F., CONTRERAS, P., MURTAGH, F. and H. ASTUDILLO (2010): Clustering and Semantics Preservation in Cultural Heritage Information Spaces. In: RIAO’2010, 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information, 100–105. Paris, France.
Keywords
ULTRAMETRIC, BAIRE METRIC, CLUSTERING, RECOMMENDATION SYSTEMS, INFORMATION RETRIEVAL.
Royal Holloway, University of London. Egham Hill, Egham. England. TW20 [email protected], [email protected] · Universidad Diego Portales. Avenida Ejército 441. Santiago, [email protected]
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Achim Zeileis1, Torsten Hothorn2, and Kurt Hornik3
Abstract
Recursive partitioning (also known as decision trees) is a standard approach for “learning” a nonlinear regression relationship between some response variable and a set of explanatory variables. The result is a partition of the data that can be easily visualized and interpreted. However, classical decision trees typically lack a concept of “significance” and cannot be combined easily with classical parametric models. Therefore, a general framework for model-based recursive partitioning is suggested by Zeileis et al. (2008) that provides a synthesis between parametric models and the algorithmic tree approach.
More formally, an algorithm for model-based recursive partitioning is suggested with the following basic steps: (1) fit a parametric model to a dataset (e.g., via least squares or maximum likelihood), (2) test for parameter instability over a set of partitioning variables, (3) if there is some overall parameter instability, split the model with respect to the variable associated with the highest instability, (4) repeat the procedure in each of the daughter nodes. The algorithm yields a partitioned (or segmented) parametric model that can be effectively visualized and that subject-matter scientists are used to analyzing and interpreting. It enables data-driven detection and modeling of subgroup interactions in parametric regression models. The approach is illustrated using two logistic regression trees for the risk of diabetes in Pima Indian women and the size of treatment effects for a chronic disease, respectively.
References
ZEILEIS, A., HOTHORN, T., and HORNIK, K. (2008): Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 17, 492–514.
Keywords
CHANGE POINTS, MAXIMUM LIKELIHOOD, MODEL TREES, PARAMETER INSTABILITY, RECURSIVE PARTITIONING
Universität Innsbruck, [email protected] · Universität Zürich, [email protected] · WU Wirtschaftsuniversität Wien, [email protected]
Predicting Individual Causal Effects (ICE)
Xiaogang Su1 and Joseph Kang2
Abstract
Within Rubin’s causal model, the individual causal effect (ICE) is defined as $E(Y_1 - Y_0 \mid X = x)$ for a subject with $X = x$, where $Y_1$ and $Y_0$ are potential outcomes. Knowledge of ICE implies that of the average causal effect (ACE) and sub-population causal effects, but not vice versa. Moreover, ICE plays a critical role in advancing personalized or stratified medicine. According to the formulation, estimation of ICE is essentially a predictive modeling problem. In this project, two machine learning methods are proposed for predicting ICE with observational data, where the key is how to tease out the confounding and moderating effects of other covariates on causal inference. The first method is based on the causal inference tree (Su et al., 2012, JMLR) while the second is based on k-nearest neighbor and kernel smoothing. We compare the proposed methods with available approaches via simulation and illustrate their use with the NSW data in Dehejia and Wahba (1999, JASA), where the objective is to assess the impact of a labor training program, the National Supported Work (NSW) demonstration, on post-intervention earnings.
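To make the predictive reading of ICE concrete, a naive nearest-neighbor estimate can be sketched as follows: average the outcomes of the k nearest treated subjects and of the k nearest control subjects around x, and take the difference. This is only a sketch under strong assumptions (all confounders are contained in X, Euclidean distance is meaningful); the function name `knn_ice` and its arguments are hypothetical, and it is not the authors' estimator, which also involves kernel smoothing.

```python
import numpy as np

def knn_ice(x, X, T, Y, k=5):
    """Naive k-nearest-neighbor estimate of E(Y1 - Y0 | X = x).

    X : (n, p) covariates, T : (n,) treatment indicator in {0, 1},
    Y : (n,) observed outcomes, x : (p,) query point.
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T)
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)

    def local_mean(mask):
        # Mean outcome of the k subjects in this arm closest to x.
        d = np.linalg.norm(X[mask] - x, axis=1)
        nearest = np.argsort(d)[:k]
        return Y[mask][nearest].mean()

    return local_mean(T == 1) - local_mean(T == 0)
```

Averaging `knn_ice` over the observed covariate values would give a corresponding plug-in estimate of the average causal effect, in line with the remark that knowledge of ICE implies that of ACE.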
References
DEHEJIA, R. H. and WAHBA, S. (1999): Causal Effects in Nonexperimental Studies: Re-evaluating the Evaluation of Training Programs. Journal of the American Statistical Association, 94, 1053–1062.
SU, X. G., KANG, J., FAN, J. J., LEVINE, R. A., and YAN, X. (2012): Facilitating Score and Causal Inference Trees for Large Observational Studies. Journal of Machine Learning Research, 13, 2955–2994.
Keywords
CAUSAL INFERENCE, CONFOUNDING AND INTERACTING, OBSERVATIONAL DATA, RECURSIVE PARTITIONING, KERNEL SMOOTHING
University of Alabama at Birmingham, [email protected] · Northwestern University, [email protected]
A New Tool For Identifying Qualitative Treatment-Subgroup Interactions: QUINT
Elise Dusseldorp1 and Iven Van Mechelen2
Abstract
When for some disease two alternative treatments - A and B - are available, one subgroup of patients may display a better outcome with treatment A than with B, whereas for another subgroup the reverse may be true. If this is the case, a qualitative (i.e., disordinal) treatment-subgroup interaction is present. Such interactions imply that some subgroups of patients should be treated differently, and are therefore most relevant for personalized medicine. In case of data from randomized clinical trials with many patient characteristics that could interact with treatment in a complex way, a few statistical approaches exist to detect treatment-subgroup interactions; examples include STIMA (Dusseldorp et al., 2010) and Interaction Trees (Su et al., 2009). However, a suitable approach to detect qualitative interactions is not yet available. In the present paper, we propose a new method for this purpose, called QUalitative INteraction Trees (QUINT). QUINT results in a binary tree that subdivides the patients into terminal nodes on the basis of patient characteristics; these nodes are further assigned to one of three classes: a first for which A is better than B, a second for which B is better than A, and an optional third for which type of treatment makes no difference. Results with regard to the optimization and recovery performance of QUINT will be presented.
References
DUSSELDORP, E., CONVERSANO, C., Van OS, B.J. (2010): Combining an additive and tree-based regression model simultaneously: STIMA. Journal of Computational and Graphical Statistics, 19, 514–530.
SU, X., TSAI, C-L., WANG H., NICKERSON, D.M., LI, B. (2009): Subgroup analysis via recursive partitioning. The Journal of Machine Learning Research, 10, 141–158.
Keywords
INTERACTION, MODERATOR, SUBGROUP ANALYSIS, PARTITIONING, CLUSTER
KU Leuven, Netherlands Organisation for Applied Scientific Research TNO [email protected] · KU [email protected]
A Comparison Of Six Sequential Partitioning Methods To Find Subgroups Involved In Treatment-Subgroup Interactions
Lisa Doove1, Elise Dusseldorp2, Katrijn Van Deun3, and Iven Van Mechelen4
Abstract
In case multiple treatment alternatives are available for some medical problem, the detection of treatment-subgroup interactions is of key importance for personalized medicine and the development of optimal treatment assignment strategies. Randomized Clinical Trials (RCT) often go without clear a priori hypotheses on the subgroups involved in treatment-subgroup interactions, and with a large number of pre-treatment characteristics in the data. In situations like this, relevant subgroups (defined in terms of pre-treatment characteristics) are to be induced during the actual data analysis. For such an analysis, six different methods have recently been proposed, all being of a sequential partitioning type. These are Model-based recursive partitioning, Interaction Trees, STIMA, SIDES, Virtual Twins, and QUINT. However, they have been developed almost independently, and the relations between them are not yet understood. This presentation closes this gap. Using an illustrative RCT data set, a systematic comparison of the methods is presented, focusing on major similarities and differences.
References
DOOVE, L.L., DUSSELDORP, E., VAN DEUN, K. and VAN MECHELEN, I. (2013): A comparison of five sequential partitioning methods to find person subgroups involved in meaningful treatment-subgroup interactions. Manuscript submitted for publication.
DUSSELDORP, E. and VAN MECHELEN, I. (2013): Qualitative interaction trees: A tool to identify qualitative treatment-subgroup interactions. Manuscript submitted for publication.
Keywords
TREATMENT HETEROGENEITY, SEQUENTIAL PARTITIONING, SUBGROUP ANALYSIS, TREATMENT-SUBGROUP INTERACTION
KU [email protected] · KU Leuven, [email protected] ·KU [email protected] ·KU [email protected]
Automatic Bayes Factors for Comparing Variances of Two Independent Normal Distributions
Florian Böing-Messing1 and Joris Mulder2
Abstract
When analyzing differences between two independent populations researchers commonly focus on comparing means. However, it is equally important to investigate differences in the populations’ variances. We often would like to know whether two populations are equally heterogeneous, whether population 1 is more heterogeneous than population 2, or whether population 2 is more heterogeneous than population 1. To answer this question we shall perform a multiple hypothesis test on the variances of two independent normal distributions using the Bayes factor, a Bayesian testing criterion. The Bayes factor has two important properties which are not shared by classical p-values. First, Bayes factors can straightforwardly be used for simultaneously testing multiple hypotheses. Second, the Bayes factor has an intuitive interpretation as the relative evidence in the data in favor of a hypothesis against another hypothesis. However, when using Bayes factors for testing equality constrained hypotheses the choice of the prior plays an important role due to the Jeffreys-Lindley paradox. In this paper different automatic priors will be compared when using Bayes factors in the above multiple testing problem. We investigate the performance of these priors by looking at important properties such as consistency, the information paradox, balancedness, and the similarity with classical p-values. Our results can be used as a guideline for choosing a prior when testing hypotheses on the variances of two independent normal distributions.
Keywords
HOMOGENEITY OF VARIANCE, MULTIPLE HYPOTHESIS TESTING, DEFAULT BAYES FACTOR, JEFFREYS-LINDLEY PARADOX, INFORMATION PARADOX
Department of Methodology and Statistics, Tilburg University, Postbus 90153, 5000 LE Tilburg, the Netherlands, [email protected] · Department of Methodology and Statistics, Tilburg University, Postbus 90153, 5000 LE Tilburg, the Netherlands, [email protected]
Bayesian Model Selection For Evaluating Equality And Order Constraints On Correlation Matrices
Joris Mulder
Abstract
Researchers often formulate their expectations using equality constraints and order constraints on correlation coefficients. When translating these expectations into a set of competing equality-constrained and order-constrained models on the zero-level correlations in an unstructured correlation matrix, the goal is to determine which model receives the most evidence from the data. For this purpose, Bayes factors shall be developed. The Bayes factor is a Bayesian model selection criterion that can be used for quantifying the relative evidence in the data in favor of a model in comparison to another model. Particular attention is paid to proper prior specification, which plays a crucial role when computing Bayes factors. Priors will be considered that (i) result in positive definite correlation matrices, (ii) are ‘balanced’ in the sense that every possible ordering is equally likely a priori, and (iii) result in Bayes factors that are consistent when evaluating equality constraints and order constraints on correlations.
Keywords
BAYES FACTOR, CORRELATION MATRIX, PRIOR, COMPLEXITY
Department of Methodology and Statistics, Tilburg University, the Netherlands, [email protected]
Bivariate Dependence Patterns And Copulas: Model Discrimination And Robustness
Lianne Ippel1 and Johan Braeken2
Abstract
Different dependence patterns can be hiding behind the same value of a general dependence measure. In this landscaping Monte Carlo experiment, we investigate the distinguishability of qualitatively different bivariate dependence structures that have equivalent rank-order correlation and fixed univariate distributions with the same means and standard deviations. Hence, the difference in structure only shows graphically and not in any of the summary statistics.
A conceptual and graphical introduction will be given to copula functions, a multivariate modeling approach that allows for construction of such varying dependence structures. Model fit, general and local dependence measures are considered to study the informativeness of the data in discriminating between four of these copula models.
Results stress the importance of focus when assessing differences between models. Although the models discriminate fairly well based upon fit statistics, model misspecification hardly affected general dependence measures. This robustness might make model selection a seemingly non-issue in practice. In contrast, when focus is on local dependence measures and prediction, model misspecification can be rather harmful.
Keywords
COPULA FUNCTIONS, DEPENDENCE, MODEL DISCRIMINATION, MODEL SELECTION
Tilburg School of Social and Behavioral Sciences (TSB), Tilburg University, [email protected] · Department of Methodology and Statistics, Tilburg University, [email protected]
Posterior Predictive Checking as Alternative to Asymptotics and Bootstrapping in Latent Class Analysis
Geert H. van Kollenburg1, Joris Mulder2, and Jeroen K. Vermunt3
Abstract
As the use of latent class analysis becomes more widespread, the importance of correct interpretation and availability of reliable fit statistics increases. Most methods for assessing model fit involve using asymptotic reference distributions, which may not always be appropriate. Using asymptotic p-values on sparse frequency tables can lead to a dramatic increase in Type I error levels (Reiser & Lin, 1999).
Resampling techniques can provide empirical p-values that have good properties, even under sparseness. We apply posterior predictive checks (Gelman, Meng & Stern, 1996) to obtain empirical p-values for a number of commonly used fit statistics within latent class analysis.
In a Monte Carlo simulation study we compared the posterior predictive check to the use of asymptotics and to the parametric bootstrap method. Results show that the posterior predictive check is a sound alternative to the use of asymptotics and that it works as well as the parametric bootstrap.
References
GELMAN, A., MENG, X. L. and STERN, H. (1996): Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733–759.
REISER, M. and LIN, Y. (1999): Goodness-Of-Fit Test for the Latent Class Model When Expected Frequencies Are Small. Sociological Methodology, 29(1), 81–111.
Keywords
LATENT CLASS ANALYSIS, BAYESIAN MODEL CHECKING, POSTERIOR PREDICTIVE CHECK, BOOTSTRAP
Tilburg University, [email protected] · Tilburg University, [email protected] · Tilburg University, [email protected]
Statistical Modeling Of The Distribution Of Financial Returns
Cuevas-Covarrubias C.1, Iñigo-Martínez J.2 and Rosales-Contreras J.3
Abstract
Most of the models applied in Finance assume that daily financial returns are normally distributed; however, this fundamental assumption is not always satisfied in practice. Financial returns frequently show leptokurtic distributions: does this mean that the Normal Distribution is not useful in Financial Modeling? Estimating the distribution function of financial returns is an important task in Actuarial Mathematics and Risk Theory. This article is a practical discussion of finite Gaussian Mixtures and their potential in Financial Risk Modeling. It is based on the analysis of different financial series from several markets in Latin America. Our discussion considers the estimation of Marginal and Joint Distributions and compares the results with those obtained with other models proposed in the literature. The empirical evidence shows that Gaussian Mixture models have an interesting potential in financial modeling for risk assessment. Our conclusion is that financial returns may not be normally distributed, but they frequently behave as a mixture of Gaussians.
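To illustrate why a mixture of Gaussians can reproduce leptokurtic returns, the sketch below computes the excess kurtosis of a two-component Gaussian mixture in which both components have mean zero. It is an illustration only and not the authors' estimation procedure; the function name and the weights and standard deviations in the usage lines are hypothetical. A positive value indicates heavier tails than a single normal distribution.

```python
def mixture_excess_kurtosis(weight, sigma_1, sigma_2):
    """Excess kurtosis of weight*N(0, sigma_1^2) + (1 - weight)*N(0, sigma_2^2)."""
    # The variance of a zero-mean mixture is the weighted average of the component variances.
    variance = weight * sigma_1 ** 2 + (1.0 - weight) * sigma_2 ** 2
    # The fourth moment of N(0, s^2) is 3*s^4, so the mixture's fourth moment
    # is the weighted average of the component fourth moments.
    fourth_moment = 3.0 * (weight * sigma_1 ** 4 + (1.0 - weight) * sigma_2 ** 4)
    return fourth_moment / variance ** 2 - 3.0


# Hypothetical "calm" regime (90% of days) and "turbulent" regime (10% of days).
calm_and_turbulent = mixture_excess_kurtosis(0.9, 1.0, 3.0)
# A mixture of two identical components is simply a normal distribution.
single_normal = mixture_excess_kurtosis(0.5, 1.0, 1.0)
```

With unequal component variances the excess kurtosis is strictly positive, so the mixture has heavier tails than any single Gaussian with the same variance.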
References
KLUGMAN, S.A. and PARSA, R. (1999): Fitting bivariate distributions with copulas. Insurance Mathematics and Economics, 24, 139–148.
BEHR, A. and POETTER, U. (2009): Modeling Marginal Distributions of Ten European Stock Market Index Returns. International Research Journal of Finance and Economics, 28, 104–119.
McLACHLAN, G. and PEEL, D. (2000): Finite Mixture Models. Wiley Series in Probability and Statistics, Wiley-Interscience.
Keywords
FINANCIAL RETURNS, RISK THEORY, GAUSSIAN MIXTURES, EXPECTATION-MAXIMIZATION, COPULA MODELING
Universidad Anáhuac, Estado de México, México, [email protected] · Universidad Anáhuac, Estado de México, México. · Instituto Tecnológico de Estudios Superiores de Monterrey, México.
Combining Decision Trees And Stochastic Curtailment For Assessment Length Reduction Of Test Batteries Used For Classification
Marjolein Fokkema1, Niels Smits2, and Henk Kelderman3
Abstract
For classification problems in psychology (e.g., clinical diagnosis), batteries of tests are often administered. However, not every test or item may be necessary for accurate classification. In this paper, we introduce a combination of classification and regression trees (CART; Breiman, Friedman, Olshen & Stone, 1984) and stochastic curtailment (SC; Finkelman, He, Kim & Lai, 2011) to reduce assessment length of questionnaire batteries. First, the CART algorithm provides relevant subscales and cutoffs needed for accurate classification, in the form of a decision tree. Second, for every subscale and cutoff appearing in the decision tree, SC reduces the number of items needed for accurate classification. This procedure is illustrated by post-hoc simulation on a dataset of 3579 patients, to whom the Mood and Anxiety Symptoms Questionnaire (MASQ) was administered. Subscales of the MASQ are used for predicting diagnoses of depression. Results show that CART-SC provided an assessment length reduction of 56%, without loss of accuracy, compared to the more traditional prediction method of performing linear discriminant analysis (LDA) on subscale scores. CART-SC appears to be an efficient and accurate algorithm for shortening test batteries.
References
BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. (1984): Classification and Regression Trees. Wadsworth, New York.
FINKELMAN, M.D., HE, Y., KIM, W. and LAI, A.M. (2011): Stochastic curtailment of health questionnaires: A method to reduce respondent burden. Statistics in Medicine, 30, 1989–2004.
Keywords
TEST BATTERIES, COMPUTERIZED TESTING, SEQUENTIAL TESTING, CLASSIFICATION AND REGRESSION TREES, STOCHASTIC CURTAILMENT, EFFICIENCY
Vrije Universiteit, Amsterdam, [email protected] · Vrije Universiteit, Amsterdam · Vrije Universiteit, Amsterdam
Gaussian Tree Models For Discrimination
Gonzalo Perez–de–la–Cruz1 and Guillermina Eslava–Gomez2
Abstract
We consider Graphical Gaussian models with tree structure in discriminant analysis for two populations. We restrict attention to the case where each model has the same tree structure, though not necessarily the same concentration matrix. By considering a tree structure, the maximum likelihood estimator (MLE) for the concentration matrices can be expressed analytically. Moreover, by considering the same tree structure for each of the two concentration matrices, the estimation of the unknown structure is solvable by finding the minimum weight spanning tree (MWST).
In this work, we propose to use the J-divergence as a measure of discrimination between two populations, which can be optimized efficiently by finding the MWST. By using the MLE of each concentration matrix and the MWST we get an estimated discriminant function.
We illustrate the empirical performance of the proposed method and of other existing methods using some data. This example shows similar performance for the methods using tree structure on the models, and better performance than linear and quadratic discriminant analysis for small sample sizes.
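The structure-estimation step reduces to a minimum weight spanning tree problem, which can be solved with Kruskal's algorithm. The sketch below assumes that the edge weights have already been computed from the data (for instance, from a J-divergence criterion); the weights in the usage lines are made up for illustration, and the function name is hypothetical rather than taken from the authors' work.

```python
def minimum_weight_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm. `weighted_edges` holds (weight, node_u, node_v) triples."""
    parent = list(range(n_nodes))  # union-find forest, one set per variable

    def find(node):
        # Follow parent links to the set representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(weighted_edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so it creates no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree


# Hypothetical weights between four variables (smaller means "keep this edge").
edges = [(1.0, 0, 1), (2.0, 1, 2), (3.0, 0, 2), (0.5, 2, 3)]
tree = minimum_weight_spanning_tree(4, edges)
total_weight = sum(weight for _, _, weight in tree)
```

A tree on p variables always has p - 1 edges, so the returned list fixes the common structure shared by both concentration matrices.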
References
CHOW, C. and LIU, C. (1968): Approximating Discrete Probability Distributions with Dependence Trees. IEEE Transactions on Information Theory, 14, 462–467.
FRIEDMAN, N., GEIGER, D. and GOLDSZMIDT, M. (1997): Bayesian Network Classifiers. Machine Learning, 29, 131–163.
LAURITZEN, S. L. (1996): Graphical Models. Clarendon Press, Oxford.
TAN, V., SANGHAVI, V., FISHER, J. and WILLSKY, A. (2010): Learning Graphical Models for Hypothesis Testing and Classification. IEEE Transactions on Signal Processing, 58, 5481–5495.
Keywords
DISCRIMINANT ANALYSIS, GRAPHICAL GAUSSIAN MODELS, TREES, J-DIVERGENCE, STRUCTURE ESTIMATION
Postgraduate Studies in Mathematics, National University of Mexico, UNAM. Mexico, D.F. [email protected] · Department of Mathematics, Faculty of Sciences, National University of Mexico, UNAM. Mexico, D.F. [email protected]
Stochastic Curtailment Of Questionnaires For Three Level Classification: Shortening The CES-D For Assessing Low, Moderate, And High Risk Of Depression
Niels Smits1, Matthew Finkelman2, and Henk Kelderman3
Abstract
Health questionnaires are often built up from sets of questions which are totaled to obtain a sum score; often, this score is subsequently used to classify respondents. An important consideration in designing questionnaires is to minimize respondent burden. Finkelman et al. (2011, 2012) introduced stochastic curtailment (SC) as an efficient method of questionnaire administration aimed at classification into two categories, such as ‘at risk’ and ‘not at risk’. SC uses a prediction model for forecasting observed class membership; the strategy is to stop testing when not yet administered items are unlikely to change the respondent’s classification. The current paper adjusts SC for classification into three categories such as ‘low risk’, ‘moderate risk’, and ‘high risk’. It is shown that this adjustment is not trivial. The outcomes of a post hoc simulation study are presented in which real responses on the Center for Epidemiologic Studies Depression scale were used by several versions of SC for classification into three categories. SC substantially reduced the respondent burden while maintaining a high classification quality. Benefits and limitations of this new methodology are discussed.
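The stopping idea can be sketched for the simpler two-category case. The code below implements plain (deterministic) curtailment, which stops only when the remaining items cannot change the classification; stochastic curtailment relaxes "cannot" to "are unlikely to" by means of a prediction model, which is not reproduced here. The function name, the item score range and the cutoff in the usage lines are assumptions made for illustration.

```python
def curtailed_classification(responses, cutoff, max_item_score):
    """Administer items in order and stop as soon as the outcome is decided.

    Returns the classification and the number of items actually administered.
    Item scores are assumed to lie between 0 and `max_item_score`.
    """
    total = 0
    n_items = len(responses)
    for administered, score in enumerate(responses, start=1):
        total += score
        remaining = n_items - administered
        if total >= cutoff:
            # The cutoff is already reached; later items cannot lower the sum.
            return "at risk", administered
        if total + remaining * max_item_score < cutoff:
            # Even maximal scores on all remaining items cannot reach the cutoff.
            return "not at risk", administered
    # Only reached for an empty questionnaire.
    return ("at risk" if total >= cutoff else "not at risk"), n_items


# Hypothetical 20-item questionnaire, items scored 0 to 3, cutoff 16.
high_scorer = curtailed_classification([3] * 20, cutoff=16, max_item_score=3)
low_scorer = curtailed_classification([0] * 20, cutoff=16, max_item_score=3)
```

In this toy setting the high scorer is classified after 6 items and the low scorer after 15, instead of all 20. With three categories a single stopping check no longer suffices, because two cutoffs have to be decided.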
References
FINKELMAN, M. D., HE, Y., KIM, W. and LAI, A. M. (2011): Stochastic curtailment of health questionnaires: A method to reduce respondent burden. Statistics in Medicine, 30, 1989–2004.
FINKELMAN, M. D., SMITS, N., KIM, W. and RILEY, B. (2012): Curtailment and stochastic curtailment to shorten the CES-D. Applied Psychological Measurement, 36, 632–658.
Keywords
COMPUTERIZED TESTING, RESPONDENT BURDEN, CURTAILMENT, ORDINAL REGRESSION
VU University Amsterdam, [email protected] · Tufts University School of Dental Medicine, Boston · VU University Amsterdam
Tree-Based Prediction with Missing Data
Holger Cevallos Valdiviezo, Stefan Van Aelst
Abstract
In prediction problems missing data are frequently encountered. Misleading predictions may be obtained if the missing data issue is not addressed correctly. Thus, it is crucial to find an appropriate prediction rule, with low bias and high precision, which takes the uncertainty caused by missing values into account. To handle this problem, we investigated the performance of ten prediction methods based on trees. Some methods handle incomplete data by themselves (e.g. via surrogates) while others use a preliminary imputation step. The methods in question are: CART (surrogate splits), Random Forest (RF) with imputation by either median or proximity matrix, Bagging (surrogate splits), Multiple Imputation via Sequential Trees (MIST) followed by either CART or RF, bootstrap samples imputed by conditional means followed by either CART or RF, bootstrap samples imputed by draws from the conditional distributions followed by either CART or RF.
We studied the performance of these methods on real and simulated high-dimensional datasets with 5%, 10% and 25% of missing data generated completely at random, at random and not at random. We considered both linear and nonlinear data generating models in the simulations. The performance is evaluated on a large test set using mean squared prediction error for regression and misclassification rate for classification. Overall, MIST followed by RF showed a very good performance in all scenarios for both regression and classification, with stable predictions across missing data fractions and missingness mechanisms. A computationally less intensive alternative is RF with imputation by proximity matrix, which performs well for lower fractions of missing data. Finally, we compare our findings to related results on the use of surrogates versus multiple imputation that have been published recently.
References
BURGETTE, L.F. and REITER, J. (2010): Multiple imputation for missing data via sequential regression trees. American Journal of Epidemiology, 172, 1070–1076.
Keywords
TREE METHODS, PREDICTION, MISSING DATA, IMPUTATION
Ghent [email protected]; [email protected]
Sparse Classifier Ensembles for Improved Interpretability.
Werner Adler1, Zardad Khan2, Sergej Potapov1 and Berthold Lausen2
Abstract
Classification tree ensembles like bagged classification trees or random forests (Breiman, 2001) often show improved classification performance in comparison to single trees. This comes at the cost of reduced interpretability, which is an important aspect e.g. in medical applications, where black box methods are unwanted when it comes to decisions regarding future treatment of patients. Several methods exist that combine improved performance with greater interpretability. For example, Node Harvest, proposed by Meinshausen (2010), is characterized by its interpretability and competitive performance in various situations.
A high diversity between individual base classifiers is deemed to be important for the performance of an ensemble. Hence, our approach to improving the interpretability of classifier ensembles is based on a dramatic reduction of the number of trees constituting the ensemble depending on their diversity. To achieve this goal, we examine several diversity measurements (Tang et al., 2006) and create sparse classifier ensembles by weighting the individual trees based on these measurements. We report and discuss the results obtained using simulated data as well as a clinical example data set.
References
Breiman, L. (2001): Random forests. Machine Learning, 45, 5–32.
Meinshausen, N. (2010): Node Harvest. The Annals of Applied Statistics, 4(4), 2049–2072.
Tang, E.K., Suganthan, P.N., Yao, X. (2006): An analysis of diversity measures. Machine Learning, 65, 247–271.
Keywords
CLASSIFICATION TREES, ENSEMBLES, DIVERSITY, INTERPRETABILITY
Department of Biometry and Epidemiology, University of Erlangen-Nuremberg, Germany · Department of Mathematical Sciences, University of Essex, [email protected]
A ROC-Optimised Multi-Prototype Classifier
Mario Ziller
Abstract
In many diagnostic problems in medicine, biology, and far beyond that, there has been the desire for detecting typical reference objects. They should act as proof-samples for future ruling in practice. In comparable problems, global distance-based classification turned out to be a useful mathematical vehicle. The application of its results to large data sets moreover operates much faster than applying a local nearest-neighbour-like procedure. In this context, we report on a new multi-prototype classifier which works reliably in many respects, even in multi-class diagnostics.
For a short mathematical outline, let all objects be considered as points in a metric space. Any class to be investigated is modelled as an overlap of potentially different-sized hyperspheres, the centres of which represent the sought reference objects, henceforth referred to as prototypes. The radii of the hyperspheres are individually optimised by a generalised ROC-analysis in which all other hyperspheres are held fixed. For the approximate solution of the entire discrete optimisation problem, a greedy algorithm has been developed. It runs in O(n²k²) time, where n is the number of training objects and k is the number of prototypes to be selected.
In case of multi-class problems, prototypes and related cutoffs are determined for each single class, separately. The diagnostic decision is finalised for that class of maximum specificity when in doubt. Objects not recognised as a member of any of the classes are assigned to an additional remainder-class.
The performance of the classification system presented is demonstrated on various practical examples, and in comparison to other methods.
Keywords
PROTOTYPE CLASSIFIER, MULTI CLASS DIAGNOSTICS, ROC ANALYSIS, GREEDY ALGORITHM
Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, Biomathematics Working Group, Greifswald - Insel Riems, [email protected]
Classification of Rounded Shapes with Penalized Signal Regression
Johan J. de Rooi1 and Paul H.C. Eilers1
Abstract
Various medical and biological applications require the classification of two-dimensional (rounded) outlines. In addition to classification, proper preprocessing is needed. We propose a scheme with several steps: 1) rectangular coordinates are converted to polar; 2) scaling and rotation are applied; 3) the radius is lightly smoothed, using (circular) P-splines as a function of the angle; 4) the spline coefficients are used as explanatory variables in logistic penalized signal regression. This set-up has several advantages. Using P-splines makes the signals of equal length, while unsupported regions can be corrected using a difference penalty. The penalty prevents overfitting of the data and makes the problem well-posed. Because the model is a member of the class of generalized linear models, we are not limited to a binomial outcome. Applications show excellent classification performance.
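Steps 1 and 2 of the scheme can be sketched as follows. The code converts outline points to polar coordinates and rescales the radius; taking the centroid as the origin and dividing by the mean radius are assumptions made here for illustration, since the abstract does not specify these choices. Rotation, P-spline smoothing and the regression step are not shown, and the function name is hypothetical.

```python
import math


def outline_to_polar(points):
    """Convert (x, y) outline points to (angle, scaled radius) pairs sorted by angle.

    Assumes at least two distinct points, so that the mean radius is positive.
    """
    n = len(points)
    # Assumed origin: the centroid of the outline points.
    centre_x = sum(x for x, _ in points) / n
    centre_y = sum(y for _, y in points) / n
    polar = [
        (math.atan2(y - centre_y, x - centre_x), math.hypot(x - centre_x, y - centre_y))
        for x, y in points
    ]
    # Assumed scaling: divide by the mean radius so that size differences drop out.
    mean_radius = sum(radius for _, radius in polar) / n
    return sorted((angle, radius / mean_radius) for angle, radius in polar)


# Corners of a unit square as a toy outline.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
polar_square = outline_to_polar(square)
```

The result is a radius signal indexed by angle, which is the kind of input a smoother over the angle would then operate on.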
Keywords
SHAPE ANALYSIS; SIGNAL REGRESSION; P-SPLINES
Department of Biostatistics, Erasmus Medical Center, Rotterdam, The Netherlands, [email protected], [email protected]
Classification of Topics on Twitter in Consideration of Time Series Variation
Atsuho Nakayamar1, Hiroyuki Tsurumi2, and Junya Masuda3
Abstract
We address the task of classifying topics in Twitter tweet data. Twitter is a microblog service that enables its users to post and read text-based messages of up to 140 characters. Twitter has spread rapidly in Japan in recent years thanks to the use of Chinese ideograms. Since Chinese ideograms are symbols representing meanings, the meaning is easy to discern from even a few characters. In Japanese, 140 characters are enough to express a lot of ideas. However, we have to select appropriate words, which represent the keywords of the meaningful topics, from a large number of words. It is important to set criteria for the choice of candidate words. We have used the complementary similarity measure (Sawaki & Hagita, 1996) in order to find appropriate words which represent time series variation of topics and to gain more understanding of those characteristics. The complementary similarity measure method is a classification method that is widely applied in the area of character recognition. Then, we will classify the words extracted from the tweet data by using non-negative matrix factorization (NMF) (Lee & Seung, 2000). NMF has advantages for applications involving large and sparse matrices. We empirically show that our method generates a good summary on the dataset of microblog documents on a new line of beverage.
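As a rough illustration of the factorization step, the sketch below applies multiplicative update rules of the Lee and Seung type for the squared-error objective to a small non-negative matrix. It is a plain-Python toy and not the authors' implementation; the word-by-period count matrix, the rank, the iteration count and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Squared Frobenius distance between v and the product w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate v (n x m, non-negative) by w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive starting values keep every later update well defined.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update h elementwise: h <- h * (w^T v) / (w^T w h).
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)] for k in range(rank)]
        # Update w elementwise: w <- w * (v h^T) / (w h h^T).
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)] for i in range(n)]
    return w, h


# Hypothetical counts of four words (rows) over four time periods (columns).
counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
w, h = nmf(counts, rank=2)
error = squared_error(counts, w, h)
```

Each column of `w` can be read as a topic (a weighting over words) and each row of `h` as that topic's intensity over time; because the updates only multiply by non-negative ratios, both factors stay non-negative.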
References
Lee, D.D. and Seung, H.S. (2000): Algorithms for Non-Negative Matrix Factorization. In K. T. Leen, T. G. Dietterich and V. Tresp (Eds.): Advances in Neural Information Processing Systems, Vol. 13. MIT Press, 556–562.
Sawaki, M. and Hagita, N. (1996): Recognition of Degraded Machine-Printed Characters Using a Complementary Similarity Measure and Error-Correction Learning. IEICE Transactions on Information and Systems, Vol. E79-D, No. 5, 491–497.
Keywords
COMPLEMENTARY SIMILARITY MEASURE, MICROBLOG DATA, NMF
Graduate School of Social Sciences, Tokyo Metropolitan University, 1-1 Minami-Ohsawa, Hachioji-shi, Tokyo 192-0397 Japan, [email protected] · College of Business Administration, Yokohama National University · Dentsu Marketing Insight INC
Classifying Real-World Data With The DDα-Procedure
Pavlo Mozharovskyi1, Karl Mosler1, and Tatjana Lange2
Abstract
The DDα-classifier, a nonparametric, fast and very robust procedure introduced by Lange et al. (201x), is applied to fifty classification problems regarding a broad spectrum of real-world data. The procedure first transforms the data from their original property space into a depth space (Li et al., 2012), which is a low-dimensional unit cube, and then separates them by a projective invariant procedure, called α-procedure (Vasil’ev and Lange, 1998). To each data point the transformation assigns its depth values with respect to the given classes. Here the random Tukey depth (Cuesta-Albertos and Nieto-Reyes, 2008) is employed, which approximates the Tukey depth by minimizing univariate Tukey depths over a finite number of directions. ‘Outsiders’, that is data points having zero depth in all classes, need an additional treatment for classification. Several such treatments are introduced and evaluated. The DDα-procedure has been implemented as an R-package.
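The random Tukey depth described above can be sketched in a few lines: project the data onto randomly drawn directions, compute the univariate Tukey depth of the projected point in each, and keep the minimum. This is an illustrative sketch, not the R-package's code; the function names, the number of directions, the Gaussian sampling of directions and the toy data are assumptions.

```python
import random


def univariate_tukey_depth(value, sample):
    """Smaller of the fractions of sample points at or below and at or above `value`."""
    at_or_below = sum(1 for s in sample if s <= value)
    at_or_above = sum(1 for s in sample if s >= value)
    return min(at_or_below, at_or_above) / len(sample)


def random_tukey_depth(point, data, n_directions=50, seed=0):
    """Minimum univariate Tukey depth of `point` over randomly drawn directions."""
    rng = random.Random(seed)
    depth = 1.0
    for _ in range(n_directions):
        # Gaussian components give a random direction; its length does not
        # matter because only the ordering of the projections is used.
        direction = [rng.gauss(0.0, 1.0) for _ in range(len(point))]
        projected_point = sum(d * p for d, p in zip(direction, point))
        projected_data = [sum(d * x for d, x in zip(direction, row)) for row in data]
        depth = min(depth, univariate_tukey_depth(projected_point, projected_data))
    return depth


# Toy class: the four corners of a square plus its centre.
data = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0), (0.0, 0.0)]
centre_depth = random_tukey_depth((0.0, 0.0), data)
outsider_depth = random_tukey_depth((10.0, 10.0), data)
```

A point far outside the data cloud receives depth zero as soon as one drawn direction separates it from all observations, which is what makes it an ‘outsider’ for that class.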
References
LANGE, T., MOSLER, K. and MOZHAROVSKYI, P. (201x): Fast nonparametric classification based on data depth. Statistical Papers, in press.
LI, J., CUESTA-ALBERTOS, J.A. and LIU, R.Y. (2012): DD-classifier: nonparametric classification procedure based on DD-plot. Journal of the American Statistical Association, 107, 737–753.
VASIL’EV, V.I. and LANGE, T. (1998): The duality principle in learning for pattern recognition (in Russian). Kibernetika i Vytschislit’elnaya Technika, 121, 7–16.
CUESTA-ALBERTOS, J.A. and NIETO-REYES, A. (2008): The random Tukey depth. Computational Statistics and Data Analysis, 52, 4979–4988.
Keywords
CLASSIFICATION, SUPERVISED LEARNING, DATA DEPTH, TUKEY DEPTH, OUTSIDERS
Universität zu Köln, Albertus-Magnus-Platz, 50923 Köln, Germany. {mozharovskyi,mosler}@statistik.uni-koeln.de · Hochschule Merseburg, Geusaer Straße, 06217 Merseburg, [email protected]
Comparing High-Dimensional Classifiers: Abuse and Dangers of Overall Accuracy
A. Pedro Duarte Silva
Abstract
Statistical classification has a respected tradition in the support of medical diagnosis. Early applications relied on classical methodologies that assumed training samples with more patients than disease predictors and understood that simple performance measures, which do not take into account disease prevalence and the different costs of negative and positive predictions, have serious limitations.
More recently, new classification methodologies have been applied to large genomic databases where thousands of genes are measured on a few dozen patients. However, many of the studies that have evaluated these proposals employed only overall accuracy measures. This practice is potentially misleading, as it is known that changing prior probabilities and/or cost assumptions can strongly affect the relative standing of traditional classification rules.
This presentation describes a study on the consequences of comparing high-dimensional classification rules by different performance measures. It will be argued that measures based on expected utilities or decision curves, which focus on the precision of risk estimates near the optimal threshold, should be preferred to overall accuracy. Furthermore, it will be shown that when sample proportions are not close to true disease probabilities corrected by misclassification costs, the use of overall accuracy can indeed lead to incorrect rankings of high-dimensional classifiers.
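A small numerical sketch shows how such a ranking reversal can arise. The two classification rules, their sensitivities and specificities, the prevalence and the misclassification costs below are all invented for illustration and do not come from the study; the function names are hypothetical.

```python
def overall_accuracy(sensitivity, specificity, prevalence):
    """Probability of a correct prediction at the given disease prevalence."""
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


def expected_cost(sensitivity, specificity, prevalence,
                  cost_false_negative, cost_false_positive):
    """Expected misclassification cost per patient."""
    false_negative_rate = prevalence * (1.0 - sensitivity)
    false_positive_rate = (1.0 - prevalence) * (1.0 - specificity)
    return (cost_false_negative * false_negative_rate
            + cost_false_positive * false_positive_rate)


# Hypothetical rules: A is conservative, B detects more cases at the price of false alarms.
rule_a = {"sensitivity": 0.60, "specificity": 0.95}
rule_b = {"sensitivity": 0.90, "specificity": 0.80}
prevalence = 0.10  # assumed disease prevalence

accuracy_a = overall_accuracy(prevalence=prevalence, **rule_a)
accuracy_b = overall_accuracy(prevalence=prevalence, **rule_b)
# Assume a missed case is ten times as costly as a false alarm.
cost_a = expected_cost(prevalence=prevalence, cost_false_negative=10.0,
                       cost_false_positive=1.0, **rule_a)
cost_b = expected_cost(prevalence=prevalence, cost_false_negative=10.0,
                       cost_false_positive=1.0, **rule_b)
```

Under these assumptions rule A has the higher overall accuracy (about 0.915 versus 0.81) but also the higher expected cost (about 0.445 versus 0.28), so the two criteria rank the rules in opposite order.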
References
BAKER, S.G., COOK, N.R., VICKERS, A. and KRAMER, B.S. (2009): Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society A, 172, 729–748.
DUARTE SILVA, A.P., STAM, A. and NETER, J. (2002): The Effects of Misclassification Costs and Skewed Distributions in Two-Group Classification. Communications in Statistics: Simulation and Computation, 31, 401–423.
Keywords
CLASSIFIER EVALUATION, DECISION CURVES, HIGH DIMENSIONAL CLASSIFICATION, MISCLASSIFICATION COSTS
Catholic University of Portugal, Faculdade de Economia e Gestão and CEGE, Rua Diogo Botelho 1327, 4169-005 Porto, [email protected]
Divisive Latent Class Modeling as a Density Estimation Tool: The Estimation Algorithm and an Application to Incomplete Data
Daniel W. van der Palm1, L. Andries van der Ark1, and Jeroen K. Vermunt1
Abstract
Traditionally, latent class (LC) analysis is used as a statistical method to identify substantively meaningful groups from multivariate data. More recently, the LC model has also been used as a tool for density estimation. However, the performance of the LC model as a density estimation tool depends on how well the model fits the data. Thus, the optimal number of latent classes must be determined.
A typical model-fit strategy is to start with a 1-class model, a 2-class model, and so on, until the best fitting model has been found according to a certain criterion. However, such a model-fit strategy may require an excessive amount of computation time, especially for datasets containing a large number of variables. Furthermore, during the search for the best fitting LC model, numerous LC models may have to be estimated and compared manually, which may be an obstacle to researchers and practitioners. Van der Palm, Van der Ark, and Vermunt (2013) have developed a divisive latent class (DLC) model that addresses the above two problems. A DLC model is a top-down clustering of respondents into latent classes. It is obtained by estimating a series of one-class and two-class models. Because a DLC model is estimated sequentially, the computation time is greatly reduced in comparison to a standard LC model. In addition to faster results, a DLC model produces the best fitting latent class model in a single run, without the need for human intervention during the estimation process. In this presentation, we discuss the estimation algorithm of the DLC model, and an application to the problem of missing data.
References
Van der Palm, D. W., Van der Ark, L. A., and Vermunt, J. K. (2013): Divisive Latent Class Modeling as a Density Estimation Tool. Submitted.
Tilburg University, Tilburg, The Netherlands, [email protected]
Determining the Number of Clusters in Categorical Data
Cláudia Silvestre1, Margarida Cardoso2, and Mário Figueiredo3
Abstract
Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data.
In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008).
Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the criteria referred to above, is the speed of execution, which is especially relevant when dealing with large data sets.
References
FIGUEIREDO, M. and JAIN, A. (2002): Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 381–396.
SILVESTRE, C., FIGUEIREDO, M., and CARDOSO, M. (2008): Clustering with Finite Mixture Models and Categorical Variables. In: P. Brito, Physica-Verlag: Proceedings in Computational Statistics 2008. Porto, Portugal, 213.
WALLACE, C. and BOULTON, D. (1968): An information measure for classification. The Computer Journal, 11, 195–209.
Keywords
CLUSTER ANALYSIS, MODEL SELECTION, CATEGORICAL VARIABLES
ESCS, Portugal. [email protected] · BRU-UNIDE, ISCTE-IUL, Portugal. [email protected] · IT, IST, Portugal. mario.a.t.figueiredo@gmail
Identifying Mixtures of Mixtures Using Bayesian Estimation
Gertraud Malsiner-Walli1, Sylvia Frühwirth-Schnatter2, and Bettina Grün1
Abstract
In a mixture of mixtures model the cluster distributions are approximated by a mixture distribution. However, identifying the components forming one cluster is in general not straightforward. To identify the cluster distributions, previous approaches combined mixture components to form clusters after first having selected the total number of components of a suitably fitting model. In our approach the number of clusters and their corresponding cluster distributions are directly estimated during MCMC sampling by imposing suitable priors. In particular, we use informative hierarchical priors for the mixture parameters to encourage the components assigned to the same cluster to have overlapping distributions and to approximate a connected and dense cluster distribution. Using a mixture of mixtures of Gaussian distributions we apply a Bayesian estimation scheme based on MCMC methods and Gibbs sampling to automatically fit a suitable mixture model to each cluster and determine the mixture model on the cluster level. We evaluate our proposed approach in a simulation setup with artificial data and by applying it to benchmark data sets.
References
BAUDRY, J.-P., RAFTERY, A., CELEUX, G., LO, K. and GOTTARDO, R. (2010): Combining Mixture Components for Clustering. Journal of Computational and Graphical Statistics, 19(2), 332–353.
FRÜHWIRTH-SCHNATTER, S. (2011): Label Switching Under Model Uncertainty. In: K.L. Mengersen, C.P. Robert and D.M. Titterington (Eds.): Mixtures: Estimation and Applications. Wiley, 213–239.
HENNIG, C. (2010): Methods for Merging Gaussian Mixture Components. Advances in Data Analysis and Classification, 4(1), 3–34.
Keywords
BAYESIAN FINITE MIXTURE MODEL, MULTIVARIATE NORMAL DISTRIBUTION, HIERARCHICAL PRIOR, NUMBER OF COMPONENTS
Johannes Kepler University Linz, Department of Applied Statistics, Austria, [email protected], [email protected] · WU Wirtschaftsuniversität Wien, Institute for Statistics and Mathematics, Wien, Austria, [email protected]
Logratio Methodology Applied To Model-Based Clustering
M. Comas-Cufí1, G. Mateu-Figueras1 and J.A. Martín-Fernández1
Abstract
According to Martín et al. (1998) and Palarea-Albaladejo et al. (2012), logratio methodology is appropriate when the data to be clustered are vectors of proportions, i.e. compositional data (CoDa). Model-based clustering with CoDa is common in many fields, such as chemotaxonomy, archaeometry, forensic sciences or geochemistry, among others (e.g. Varmuza and Filzmoser, 2009). This work focuses on finite Gaussian mixture models defined in the simplex, the sample space of CoDa, i.e., when each cluster is assumed to be represented by one or several multivariate logratio normal distributions. In addition, we show that any model-based cluster analysis applied to any type of data, not necessarily CoDa, is enriched when the vector of mixing proportions and the vectors of individuals’ conditional or posterior probabilities (group memberships) are considered elements of the simplex.
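As a rough illustration of what a logratio transformation does to a composition, the sketch below implements the centered logratio (clr), a simpler relative of the isometric logratio named in the keywords. It is not taken from the authors' work; the function names `closure`, `clr` and `clr_inverse` are hypothetical, and all parts are assumed to be strictly positive.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1 (a point in the simplex)."""
    total = sum(parts)
    return [p / total for p in parts]


def clr(parts):
    """Centered logratio: log of each part minus the mean of the logs.

    Assumes every part is strictly positive (zeros need separate treatment).
    """
    logs = [math.log(p) for p in closure(parts)]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def clr_inverse(coords):
    """Map clr coordinates back to a composition in the simplex."""
    return closure([math.exp(c) for c in coords])


composition = [1.0, 2.0, 3.0]
coords = clr(composition)            # coordinates sum to zero
recovered = clr_inverse(coords)      # equals closure(composition)
```

Because the transform depends only on ratios between parts, rescaling the input (for example from proportions to percentages) leaves the coordinates unchanged. The same mapping could be applied to a vector of mixing proportions or posterior membership probabilities, which the abstract also treats as elements of the simplex.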
References
MARTÍN, J.A., BARCELÓ, C. and PAWLOWSKY, V. (1998): Critical Approach to Non-Parametric Classification of Compositional Data. In: A. Rizzi, M. Vichi, and H.H. Bock (Eds.): Advances in Data Science and Classification. Springer-Verlag, Berlin, 49–56.
PALAREA-ALBALADEJO, J., MARTÍN-FERNÁNDEZ, J.A. and SOTO, J.A. (2012): Dealing with Distances and Transformations for Fuzzy C-Means Clustering of Compositional Data. Journal of Classification, 29(2), 144–169.
VARMUZA, K. and FILZMOSER, P. (2009): Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press, Boca Raton, FL, USA, 321 pp.
Acknowledgments
Projects MTM2012-33236 (MSI) and 2009SGR424 (AGAUR).
Keywords
COMPOSITIONAL DATA, ISOMETRIC LOGRATIO TRANSFORMATION
Department of Computer Science, Applied Mathematics and Statistics, Univ. of Girona, Campus de Montilivi, 17071 Girona, [email protected]
Model-based Clustering Of Multivariate Longitudinal Data
Laura Anderlucci, Angela Montanari, and Cinzia Viroli
Abstract
Multivariate longitudinal data arise when different individual characteristics are investigated over time. When modeling this kind of data, correlation between measurements on each individual should be taken into account. In this work, we consider the problem of clustering longitudinal data on multiple response variables. The issue can be addressed by means of matrix-normal distributions (Viroli, 2011). An explicit assumption of this approach is that the total variability can be decomposed into a ‘within multiple attributes’ and a ‘between times’ component. This amounts to a separability condition on the total covariance matrix, which factors into two covariance matrices, one referring to the attributes and the other to the times. Following McNicholas and Murphy (2010), we parameterize the class conditional ‘between’ matrices through the modified Cholesky decomposition (Newton, 1988). This mixture model can be fitted using an expectation-maximization (EM) algorithm, and model selection can be performed with the BIC and AIC information criteria. The effectiveness of the proposed approach has been tested through a large simulation study and an application to a sample of data from the Health and Retirement Study (HRS) survey.
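The separability condition can be made concrete with a small sketch. Under the usual matrix-normal convention (an assumption here, not spelled out in the abstract), the covariance of the column-stacked data matrix is the Kronecker product of the ‘between times’ matrix and the ‘within attributes’ matrix. The helper `kron` and the toy matrices `phi` and `sigma` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


# Toy example: p = 2 attributes observed at T = 3 time points.
sigma = [[1.0, 0.5],
         [0.5, 2.0]]              # covariance between attributes (p x p)
phi = [[1.0, 0.3, 0.09],
       [0.3, 1.0, 0.3],
       [0.09, 0.3, 1.0]]          # covariance between times (T x T)

# Covariance of the stacked (p*T)-vector: attribute i at time s sits at s*p + i.
total = kron(phi, sigma)
```

The separable form needs only p(p+1)/2 + T(T+1)/2 free covariance parameters instead of pT(pT+1)/2 for an unstructured matrix, which is what makes mixtures of such components tractable.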
References
McNICHOLAS, P. and MURPHY, B. (2010): Model-based clustering of longitudinal data. The Canadian Journal of Statistics, 38, 153–168.
NEWTON, H.J. (1988): TIMESLAB: A Time Series Analysis Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.
VIROLI, C. (2011): Finite mixtures of matrix normal distributions for classifying three-way data. Statistics and Computing, 21, 511–522.
Keywords
MULTIVARIATE LONGITUDINAL DATA, MIXTURE MODELS, THREE-WAY DATA
Department of Statistical Sciences ‘P.Fortunati’ - University of [email protected],[email protected],[email protected]
A Bayesian Multilevel Modeling of Longitudinal Data: Application to Hygroscopic Expansion in Composite Resins
Nasim Vahabi1, Mahmood Reza Gohari2, and Ali Azarbar3
Abstract
Hierarchically structured and correlated data, particularly longitudinal data, are widely used in many areas of scientific research. Multilevel models (ML), also known as hierarchical linear mixed models or random coefficient models, are used for analyzing clustered data; they contain subject-specific random effects defined over the clusters, as well as covariate effects at any level. For at least the past 10 years, Bayesian multilevel models (BML), which are the focus of this paper, have been considered in many studies; they have the benefit of ensuring that all sources of uncertainty are reflected in the posterior inferences. The application of BML is illustrated with laboratory dental data from the Endodontics Research Center of Shahid Beheshti University of Medical Sciences.
References
Browne, W.J. and Draper, D. (2006): A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1(3), 473–514.
Diggle, P.J., Liang, K.Y. and Zeger, S.L. (2000): Analysis of Longitudinal Data. Oxford University Press, London.
Goldstein, H. (2003): Multilevel statistical models. Thousand Oaks, CA.
Verbeke, G. and Molenberghs, G. (2000): Linear Mixed Models for Longitudinal Data. Springer, New York.
Keywords
BAYESIAN MULTILEVEL MODEL, LONGITUDINAL DATA, HYGROSCOPIC EXPANSION
Tehran University of Medical Sciences, [email protected] · Tehran University of Medical Sciences, [email protected] · Alborz University, [email protected]
A New Approach To Analyse Longitudinal Epidemiological Data With An Excess Of Zeros
A.S. Spriensma123, T.R.S. Hajos34, M.R. de Boer25, M.W. Heijmans123, and J.W.R. Twisk123
Abstract
Within longitudinal epidemiological research, ‘count’ outcome variables with an excess of zeros frequently occur. Although these outcomes are frequently analysed with a linear mixed model or a Poisson mixed model, a two-part mixed model would be better suited to analysing outcome variables with an excess of zeros. The objective of this research was therefore to introduce the relatively ‘new’ method of two-part joint regression modelling in longitudinal data analysis for outcome variables with an excess of zeros, and to compare the performance of this method to current approaches.
Within an observational longitudinal dataset, we compared three techniques: two ‘standard’ approaches (a linear mixed model and a Poisson mixed model) and a two-part joint mixed model (a binomial/Poisson mixed distribution model), including random intercepts and random slopes. Model fit indicators and differences between predicted and observed values were used for comparisons. The analyses were performed with STATA using the GLLAMM procedure.
Regarding the random intercept models, the two-part joint mixed model (binomial/Poisson) performed best. Adding random slopes for time to the models changed the sign of the regression coefficient for both the Poisson mixed model and the two-part joint mixed model (binomial/Poisson) and resulted in a much better fit.
This research showed that a two-part joint mixed model is a more appropriate method to analyse longitudinal data with an excess of zeros than a linear mixed model or a Poisson mixed model. However, in a model with random slopes for time, a Poisson mixed model also performed remarkably well.
Keywords
TWO-PART JOINT MODEL, EXCESS OF ZEROS, COUNT, MIXED MODELLING, LONGITUDINAL, STATISTICAL METHODS
Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The [email protected] · Department of Methodology and Applied Biostatistics, Faculty of Earth and Life Sciences, Institute for Health Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Department of Medical Psychology, VU University Medical Centre, Amsterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands
A Linear Mixed Model with a Mixture of Smooth Random Effects Distributions
Berrie Zielman
Abstract
Longitudinal data, where data are recorded at a series of time points, are often collected in medicine, microeconomics, biology, pharmacokinetics and other fields. The linear mixed effects model is a popular model for the analysis of longitudinal data. These models incorporate fixed effects and random effects. The random effects are drawn from a distribution, which is usually the normal distribution. The assumption of a normal distribution is not always realistic, and is sometimes replaced by a mixture distribution (Verbeke and Lesaffre, 1996).
A smooth random effects distribution in a linear mixed model is proposed that is similar to the one in Ghidey, Lesaffre and Eilers (2004). Our approach differs from theirs in that penalized estimation is not used and that the parameters of the grid in the random effects distribution are estimated from the data and not specified by the user. The random effects distribution in the model is built up from a mixture of normal distributions with equally spaced means. The model contains the linear mixed model as a special case when the estimated distances between the means are zero. By imposing constraints on the mixture probabilities and the means of the normal distributions we obtain a mixture of smooth distributions.
References
GHIDEY, W., LESAFFRE, E. and EILERS, P. (2004): Smooth Random Effects Distribution in a Linear Mixed Model. Biometrics, 60, 945–953.
VERBEKE, G. and LESAFFRE, E. (1996): A Linear Mixed Model with Heterogeneity in the Random-effects Population. Journal of the American Statistical Association, 91, 217–221.
Keywords
LONGITUDINAL DATA, SKEWED DISTRIBUTION, LINEAR MIXED MODEL, MIXTURE OF SMOOTH RANDOM EFFECTS DISTRIBUTIONS
Netherlands Court of Audit, Lange Voorhout 8, Den Haag, [email protected]
Longitudinal IRT Modelling compared with Multilevel Analysis in estimating Development Over Time In Data From Three Likert-Item Questionnaires
R. Gorter13, M.R. de Boer234, M.W. Heijmans123, and J.W.R. Twisk123
Abstract
The objective was to compare the outcomes of Multilevel (ML) modelling with Multilevel Item Response Theory (ML IRT) modelling when estimating development over time in ordinal questionnaire data from a longitudinal cohort study. Data were obtained from the Longitudinal Aging Study Amsterdam (LASA), an observational cohort study among the elderly (n=2987). The two models were fitted to the data and compared in their performance in analysing development over time by means of parameter estimates and observed-predicted plots. We found that the ML IRT model gives a more accurate prediction of the data than the ML model in all three questionnaires. We also found differences in the estimated time effects. The ML IRT and the ML model give different results in terms of predicted values and time effect estimates when applied to the LASA questionnaire data. The difference between the two models is most evident in the HADS questionnaire, which is heavily skewed to the right. The differences in results between the models may lead to incorrect conclusions with respect to the development over time when using the ML model.
Keywords
ML IRT, LONGITUDINAL, ORDINAL DATA
Department of Epidemiology and Biostatistics, VU University Medical Centre, Amsterdam, The [email protected] · Institute for Health Sciences, Faculty of Earth and Life Sciences, VU University, Amsterdam, The Netherlands · EMGO+ Institute for Health and Care Research, Amsterdam, The Netherlands · Department of Health Sciences, University Medical Centre Groningen, University of Groningen, The Netherlands.
Mutual Information, Chi-Squared And Model-Based Clustering For Co-Clustering Of Contingency Tables
Mohamed Nadif1 and Gérard Govaert2
Abstract
Given a data matrix defined on two sets I and J, co-clustering considers the two sets simultaneously and organizes the data into homogeneous blocks. Different approaches and algorithms have been proposed. For co-occurrence data matrices, Dhillon et al. (2003) proposed an information-theoretic co-clustering algorithm that treats a non-negative matrix as an empirical joint probability distribution of two discrete random variables. They cast the co-clustering problem as an optimization problem in information theory and developed a popular algorithm, termed ITCC. The latter consists of maximizing the mutual information associated with a pair of partitions.
In this work, we embed the co-clustering problem in model-based clustering. Two approaches are considered: the first, called block model, assumes that the partitions are unknown parameters, and the second, called latent block model, treats the partitions as latent variables (Govaert and Nadif, 2008, 2010). We develop the two approaches, propose models and algorithms, and establish the connections with ITCC and other algorithms.
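The mutual-information criterion mentioned for ITCC can be sketched as follows. This is only an illustration of the objective for a fixed pair of partitions, not the ITCC algorithm itself and not the authors' models; `mutual_information` and `aggregate` are hypothetical helper names, and the contingency table is a toy example.

```python
import math


def mutual_information(table):
    """Mutual information (in nats) of the joint distribution table / total."""
    total = sum(sum(row) for row in table)
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(row[j] for row in table) / total
                for j in range(len(table[0]))]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:                      # empty cells contribute nothing
                p = count / total
                mi += p * math.log(p / (row_marg[i] * col_marg[j]))
    return mi


def aggregate(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of the table within each (row cluster, column cluster) block."""
    blocks = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            blocks[row_labels[i]][col_labels[j]] += count
    return blocks


table = [[5, 5, 0, 0],
         [5, 5, 0, 0],
         [0, 0, 5, 5],
         [0, 0, 5, 5]]
good = aggregate(table, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = aggregate(table, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
mi_full = mutual_information(table)
mi_good = mutual_information(good)   # keeps all of the information in this toy table
mi_bad = mutual_information(bad)     # the block structure is destroyed
```

Aggregation can never increase mutual information, so a pair of partitions is better the closer the mutual information of the block table stays to that of the original table.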
References
DHILLON, I.S., MALLELA, S. and MODHA, D.S. (2003): Information-theoretic co-clustering. In: Proceedings of the ninth ACM SIGKDD, 89–98.
GOVAERT, G. and NADIF, M. (2008): Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics & Data Analysis, 52, 6, 3233–3245.
GOVAERT, G. and NADIF, M. (2010): Latent Block Model for Contingency Table. Communications in Statistics–Theory and Methods, 39, 3, 416–425.
Keywords
CO-CLUSTERING, BLOCK MODEL, LATENT BLOCK MODEL
LIPADE, University of Paris Descartes, 75006 Paris, France, [email protected] · HEUDIASYC, CNRS 7253, University of Technology of Compiègne, 60205 Compiègne, France, [email protected]
Parsimonious Estimation And Testing Of Two-Way Interaction By Means Of Two-Mode Clustering
Jan Schepers
Abstract
We consider the problem of estimating and testing two-way interaction in between-subjects factorial designs involving two factors and a continuous response variable. Except for 2×2 designs, the classical (ANOVA) omnibus F-test for two-way interaction may imply that too many parameters are estimated. A new method is therefore proposed in which two-mode clustering is applied to capture only the most salient interactions. An F-like test statistic calculated from the parameter estimates of this two-mode clustering, and a resampling approach to estimate its null distribution, are discussed. Simulations suggest that this new method has an empirical Type I error rate close to the nominal level, and an empirical Type II error rate lower than that of the classical omnibus F-test if the numbers of clusters for the two factors are large enough. After interaction has been detected, the method can also be used to interpret it, because the two-mode clustering indicates which tetrad differences (i.e., $\mu_{ij} - \mu_{i'j} - \mu_{ij'} + \mu_{i'j'}$) differ from zero, and which do not.
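To make the notion of a tetrad difference concrete, the following sketch enumerates all of them for a table of cell means. It illustrates the quantity only, not the proposed clustering or test; the function name `tetrad_differences` and the toy tables are hypothetical.

```python
from itertools import combinations


def tetrad_differences(means):
    """Return mu[i][j] - mu[i2][j] - mu[i][j2] + mu[i2][j2] for all level pairs.

    `means` is a table of cell means: rows are levels of the first factor,
    columns are levels of the second factor.
    """
    result = {}
    for i, i2 in combinations(range(len(means)), 2):
        for j, j2 in combinations(range(len(means[0])), 2):
            result[(i, i2, j, j2)] = (
                means[i][j] - means[i2][j] - means[i][j2] + means[i2][j2]
            )
    return result


additive = [[1.0, 2.0, 3.0],
            [2.0, 3.0, 4.0],
            [4.0, 5.0, 6.0]]       # row effect + column effect only
interacting = [[1.0, 2.0],
               [2.0, 5.0]]         # one cell departs from additivity

no_interaction = tetrad_differences(additive)      # every value is 0.0
with_interaction = tetrad_differences(interacting)
```

A purely additive table has all tetrad differences equal to zero, so the nonzero ones pinpoint where the interaction sits.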
References
GOLLOB, H.F. (1968): A Statistical Model Which Combines Features of Factor Analytic and Analysis of Variance Techniques. Psychometrika, 33, 73–115.
CALINSKI, T. and CORSTEN, L.C.A. (1985): Clustering Means in ANOVA by Simultaneous Testing. Biometrics, 41, 39–48.
VAN ROSMALEN, J., GROENEN, P.J.F., TREJOS, J. and CASTILLO, W. (2009): Optimization Strategies for Two-Mode Partitioning. Journal of Classification, 26, 155–181.
Keywords
TWO-WAY INTERACTION, ANOVA, PARSIMONIOUS ESTIMATION, TWO-MODE CLUSTERING, TYPE I AND II ERROR RATE
Faculty of Psychology and Neuroscience, Maastricht [email protected]
A general Model for Two-mode Clustering
Maurizio Vichi1
Abstract
For big data, represented by matrices with a huge number of rows and columns, the main analysis is frequently a two-mode clustering (co-clustering), which tries to mine and synthesize the relevant information by reducing the data to a matrix of compact dimensions formed by prototype objects and variables. This is achieved by simultaneously grouping rows and columns so that the results are informative and easy to interpret, yielding a compressed but relevant representation of the big data while trying to preserve most of the original information. The reduction is generally soft, giving a light compression of the multivariate data that allows the subsequent application of other multivariate statistical methods that are computationally prohibitive for large data sets.
A general two-mode clustering technique is proposed. A coordinate descent algorithm is developed. Applications to both synthetic and real datasets validate the performance and applicability of the new algorithm.
Keywords
TWO-MODE CLUSTERING, DOUBLE K-MEANS, DISJOINT PRINCIPAL COMPONENT ANALYSIS, ROBUSTNESS
Department of Statistics, Sapienza University of [email protected]
Comprehensive Calculations of the Sensitivity and Specificity of Diagnosis Using Bile Cytological Data
Tatsunami S.1, Hayakawa C.2, Koike J.2, Hoshikawa, M.2, and Ueno T.1
Abstract
Some pathological data for clinical diagnosis are composed of dichotomous variables only. Values of 1/0 represent positivity/negativity of specific characteristics for the biological item of interest. Because the number of items in a row of such data is not very large, the number of possible combinations of candidate items that could be used for diagnosis is not very large either. We therefore computed the sensitivity and specificity of diagnosis for all possible combination patterns.
We used bile cytological data that are used for the diagnosis of cholangiocarcinoma. The number of items in a data row was 21, from item 1 to item 21, for each patient. The sensitivity and specificity of tentative diagnostic criteria were computed for all possible patterns of positive item combinations used for the diagnosis. Results were compared to those from multivariate analyses.
When we used positivity of item 16 and item 18 as the diagnostic criterion for cholangiocarcinoma, the sensitivity and specificity were 0.78 and 0.97, respectively. Combinations of any other two items’ positivity, and combinations of three or more items, gave worse results. The sensitivity and specificity from the logistic regression method were 0.96 and 0.97. Both the logistic regression method and Hayashi’s Q2 method showed clear-cut discriminant ability when using more than six items.
Although the final diagnosis for a patient is obtained by also referring to other data, improvement of the accuracy of diagnosis using cytological data alone is expected in the clinical field. The present results showed the potential of cytological diagnosis. At the same time, however, they clearly suggested that diagnosis by simple combinations of two or three items’ positivity will not provide reliable diagnostic criteria in the case of bile data.
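The exhaustive evaluation described above can be sketched as follows. This is a minimal illustration on made-up data, not the authors' program. It assumes that a patient is diagnosed as positive when all selected items are positive, and that both diseased and non-diseased patients are present; ranking by the sum of sensitivity and specificity is also an assumption. The names `sens_spec` and `rank_combinations` are hypothetical.

```python
from itertools import combinations


def sens_spec(records, labels, items):
    """Sensitivity and specificity of the rule 'all chosen items are positive'."""
    tp = fp = tn = fn = 0
    for record, diseased in zip(records, labels):
        predicted = all(record[k] == 1 for k in items)
        if diseased and predicted:
            tp += 1
        elif diseased:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def rank_combinations(records, labels, size):
    """Evaluate every item combination of the given size, best first."""
    scored = []
    for items in combinations(range(len(records[0])), size):
        sensitivity, specificity = sens_spec(records, labels, items)
        scored.append((items, sensitivity, specificity))
    # Assumed ranking: larger sensitivity + specificity is better.
    scored.sort(key=lambda entry: entry[1] + entry[2], reverse=True)
    return scored


# Made-up data: 5 patients, 3 dichotomous items; label 1 = diseased.
records = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
labels = [1, 1, 0, 0, 0]
ranking = rank_combinations(records, labels, 2)
```

With 21 items there are 2^21 = 2,097,152 subsets in total, so enumerating all of them in this way remains computationally feasible.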
Keywords
DICHOTOMOUS DATA, DIAGNOSIS, CYTOLOGY, SENSITIVITY
Unit of Medical Statistics, Faculty of Medical Education and Culture, St. Marianna University School of Medicine, Kawasaki, Japan [email protected] · Department of Pathology, Kawasaki Municipal Tama Hospital, St. Marianna University School of Medicine, Kawasaki, Japan 214-8525
Diagnostics for the Risk Prediction of Each Type of Endoleak Formation after TEVAR Using Statistical Discriminant Analysis
Kuniyoshi Hayashi1,5, Fumio Ishioka2,5, Bhargav Raman3, Daniel Y. Sze3, Hiroshi Suito1,5, Takuya Ueda4,5, and Koji Kurihara1,5
Abstract
A quantitative assessment of results obtained by statistical analysis has the potential to generate findings that will enable better therapy planning by doctors. However, we feel that the usefulness and impact of such assessment in real data analyses in the field of medicine have not been widely and clearly shown. In this study, we focus on thoracic endovascular aortic repair (TEVAR), a minimally invasive technique involving stent-graft placement. Based on Nakatamari et al. (2011), we use linear discriminant analysis to evaluate the risk of formation of each type of endoleak, which is a clinical side effect of TEVAR. Next, we use sensitivity analysis with influence functions to identify influential patients for risk prediction. Finally, we investigate the findings obtained on the basis of an analysis of their characteristics.
References
NAKATAMARI, H., UEDA, T., ISHIOKA, F., RAMAN, B., KURIHARA, K., RUBIN, G.D., ITO, H. and SZE, D.Y. (2011): Discriminant analysis of native thoracic aortic curvature: risk prediction for endoleak formation after thoracic endovascular aortic repair. Journal of Vascular and Interventional Radiology, 22, 974–979.
Keywords
INFLUENCE FUNCTIONS, LINEAR DISCRIMINANT ANALYSIS, QUANTITATIVE ANALYSIS OF AORTIC MORPHOLOGY
Graduate School of Environmental and Life Science, [email protected], [email protected],[email protected] · School of Law, Okayama [email protected] · Department of Radiology, Stanford Univer-sity School of [email protected], [email protected] ·Department of Radiology, St. Luke’s International Hospital [email protected] ·CREST, Japan Science and Technology Agency
Extension Of A Multilingual Medical Lexicon By Combined Feature Extraction Methods
Wiebke Petersen1, Denis Anuschewski1, Pascal Chave1, and Philipp F. Zeitz2
Abstract
The 2011 digital, multilingual dictionary of ophthalmology by the practicing ophthalmologist Zeitz (http://zeitzfrankozeitz.de/index.php/Dictionary_of_Ophthalmology.html), with its more than 24,000 medical terms in 13 languages, arranged by synonymy, was developed to support the practicing physician who, in a time of increased mobility of people, often has to translate international medical reports. These translations can involve severe translation errors (cf. Zeitz & Petersen 2013). Alas, not all languages are covered to the same extent (e.g. German: 6584 terms, Russian: 252 terms). In our talk we introduce an approach to semi-automatically enrich the dictionary’s structure and content. We tag medical terms with attributes corresponding to shared features and calculate the corresponding concept lattice (cf. Ganter & Wille 1999). Following Janssen’s ideas, we use concept lattices as an interlingua in our dictionary in order to facilitate browsing through its content and to fill the lexical gaps by paraphrases based on the attribute tags (cf. Janssen 2004). We aim at tagging the terms semi-automatically by extracting attributes from: (a) the linguistic contexts in which they occur, (b) the morphemes of which they are composed, and (c) their position in given hierarchical categorizations. Terms that are similar with respect to these aspects share a common attribute, and the automatically extracted attributes can be translated and controlled by human experts in a second step. To this end, several sources for automatic extraction of attributes were explored: classifications such as the International Classification of Diseases ICD-10 and English-language Wikipedia articles on ophthalmology.
References
GANTER, B. and WILLE, R. (1999): Formal concept analysis: mathematical foundations. Berlin: Springer.
JANSSEN, M. (2004): Multilingual Lexical Databases, Lexical Gaps, and SIMuLLDA. International Journal of Lexicography, 17, 136–154.
ZEITZ, P.F. and PETERSEN, W. (2013): Übersetzungsfehler in der Augenheilkunde. Klinische Monatsblätter für Augenheilkunde, 230(3), 275–277.
Keywords
MULTILINGUAL LEXICON, FORMAL CONCEPT ANALYSIS, FEATURE EXTRACTION
Institute of Linguistic and Information, University of Düsseldorf · Praxis Zeitz Franko Zeitz, Praxis für Augenheilkunde, Düsseldorf
The Joy of Fuzzy
Michael Greenacre1
Abstract
Canonical correspondence analysis and redundancy analysis are two methods of constrained ordination regularly used in the analysis of ecological data when several response variables (for example, species abundances) are related linearly to several explanatory variables (for example, environmental variables, spatial positions of samples). In this talk I demonstrate the advantages of the fuzzy coding of explanatory variables: first, nonlinear relationships can be diagnosed; second, more variance in the responses can be explained; and third, in the presence of categorical explanatory variables (for example, years, regions) the interpretation of the resulting triplot ordination is unified because all explanatory variables are measured at a categorical level.
Background material and references for the topic of this talk can be found in Asan and Greenacre (2011) and Greenacre (2013).
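As a hedged illustration of what fuzzy coding of a continuous explanatory variable can look like, the sketch below uses triangular membership functions between hinge points. The abstract does not specify the membership functions, so this choice is an assumption; the function name `fuzzy_code` and the hinge values are hypothetical.

```python
def fuzzy_code(x, hinges):
    """Code x into len(hinges) fuzzy categories using triangular memberships.

    `hinges` must be strictly increasing (for example minimum, median, maximum).
    The memberships are non-negative and sum to 1.
    """
    memberships = [0.0] * len(hinges)
    if x <= hinges[0]:
        memberships[0] = 1.0
    elif x >= hinges[-1]:
        memberships[-1] = 1.0
    else:
        for a in range(len(hinges) - 1):
            low, high = hinges[a], hinges[a + 1]
            if low <= x <= high:
                weight = (x - low) / (high - low)
                memberships[a] = 1.0 - weight
                memberships[a + 1] = weight
                break
    return memberships


hinges = [0.0, 10.0, 20.0]           # hypothetical 'low', 'medium', 'high'
coded = fuzzy_code(15.0, hinges)     # halfway between 'medium' and 'high'
```

Within the range of the hinges, the original value can be recovered as the membership-weighted average of the hinge points, so this coding turns a continuous variable into category-like columns without discarding information.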
References
ASAN, Z. and GREENACRE, M. (2011): Biplots of fuzzy coded data. Fuzzy Sets and Systems, 183, 57–71.
GREENACRE, M. (2013): Fuzzy coding in constrained ordinations. Ecology, 94(2), 280–286.
Keywords
FUZZY CODING, CANONICAL CORRESPONDENCE ANALYSIS, MIXED-SCALE PREDICTORS IN CONSTRAINED ORDINATIONS
Universitat Pompeu [email protected]
Fast Iterative Implementation of Correspondence Analysis
Alfonso Iodice D’Enza1, Patrick J. Groenen2 and Michel van de Velden2
Abstract
The eigenvalue decomposition (EVD) and the related singular value decomposition (SVD) are the core step of several dimension reduction methods, such as principal components analysis (PCA; Jolliffe, 2002) and multiple correspondence analysis (MCA; Greenacre, 2007), that apply to quantitative and qualitative data, respectively. Most modern applications have in common the large amount of data to be analyzed. The application of standard implementations of both EVD and SVD then becomes unfeasible due to their high computational cost. For this reason, several algorithms have been proposed in the literature that aim to increase the computational speed and efficiency of EVD and SVD. The majority of the proposed procedures focus on the case of quantitative variables. When dealing with binary variables, however, a further peculiarity arises and has to be taken into account, namely data sparsity. In data flow analysis or interactive data visualization, repeated analyses are needed to keep the solution up to date when new data come in or when the user interacts, respectively. A further case is the assessment of the significance of an MCA solution, which requires repeated analyses of bootstrap replicates of the data. In a common setting, these methods are unfeasible for large data sets.
In the present paper an efficient implementation of MCA is proposed that addresses the sparsity of the data and the need for computational speed in the case of repeated analyses, exploiting both enhanced sparse matrix computations and fast iterative methods for matrix decompositions.
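The keywords name the power method, so a minimal sketch of it on a sparsely stored symmetric matrix may help. It is not the authors' implementation: the row-dictionary storage, the function names `sparse_matvec` and `power_method`, and the stopping rule are all assumptions made for illustration.

```python
import math


def sparse_matvec(rows, v):
    """Multiply a sparse matrix (one {column: value} dict per row) by a vector."""
    return [sum(value * v[j] for j, value in row.items()) for row in rows]


def power_method(rows, max_iter=1000, tol=1e-12):
    """Approximate the dominant eigenpair of a symmetric matrix.

    Assumes the start vector is not orthogonal to the dominant eigenvector
    and that the largest eigenvalue is well separated from the others.
    """
    n = len(rows)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(max_iter):
        w = sparse_matvec(rows, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        previous = eigenvalue
        # Rayleigh quotient of the normalised iterate.
        eigenvalue = sum(a * b for a, b in zip(v, sparse_matvec(rows, v)))
        if abs(eigenvalue - previous) < tol:
            break
    return eigenvalue, v


rows = [{0: 4.0, 1: 1.0},
        {0: 1.0, 1: 3.0}]
dominant_value, dominant_vector = power_method(rows)
```

Only matrix-vector products are needed, and each product touches just the stored non-zero entries, which is why iterative schemes of this kind combine naturally with sparse data.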
References
GREENACRE, M.J. (2007): Correspondence Analysis in Practice, 2nd edition. Chapman & Hall/CRC.
JOLLIFFE, I.T. (2002): Principal Component Analysis, 2nd edition. Springer.
Keywords
CORRESPONDENCE ANALYSIS, EIGENVALUE DECOMPOSITION, POWER METHOD
Università di Cassino e del Lazio Meridionale, Cassino, [email protected] · Erasmus University of Rotterdam, Rotterdam, [email protected], [email protected]
Inverse Multiple Correspondence Analysis
Michel van de Velden1, Patrick Groenen2, and Wilco van den Heuvel3
Abstract
The inverse correspondence analysis (CA) problem can be described as follows. Given a k-dimensional CA solution, find the set of (nonnegative) data matrices that yield this k-dimensional CA solution. Groenen and van de Velden (2004) showed that the set of permissible data matrices is characterized by a set of vertices corresponding to a set of inequalities. Furthermore, it was shown that any convex combination of the obtained vertices produces a candidate data matrix whose CA solution contains the original k-dimensional CA solution. An algorithm was proposed to obtain all vertices, as well as a heuristic to quickly obtain a (sub)set of vertices. A popular extension of CA is multiple correspondence analysis (MCA). In MCA, the data matrix is a so-called indicator or super-indicator matrix: a concatenation of dummy variables where for each individual the observed category is indicated by a one in the column corresponding to that category, and zeros in the other columns. The rows of such a matrix correspond to individuals, and the columns to categories. Such an indicator matrix is typically much larger than a CA contingency matrix. Consequently, the exact inverse CA approach of Groenen and van de Velden (2004) cannot be used to find the set of vertices rendering the MCA solution. The heuristic can perhaps be applied; however, it is not clear whether the set of vertices thus obtained is complete. Using specific properties of MCA, we explore new ways of obtaining a meaningful set of vertices in the context of MCA.
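The indicator matrix described above can be built in a few lines. The sketch below is illustrative only; the function name `indicator_matrix` and the toy data are hypothetical.

```python
def indicator_matrix(data):
    """Build the indicator (dummy) matrix of a table of categorical variables.

    `data` has one row per individual and one column per variable. Returns the
    0/1 matrix and a list of (variable index, category) column labels.
    """
    n_vars = len(data[0])
    categories = [sorted({row[q] for row in data}) for q in range(n_vars)]
    columns = [(q, c) for q in range(n_vars) for c in categories[q]]
    matrix = [[1 if row[q] == c else 0 for (q, c) in columns] for row in data]
    return matrix, columns


data = [["a", "x"],
        ["b", "y"],
        ["a", "y"]]
Z, columns = indicator_matrix(data)
# Every row of Z sums to the number of variables (here 2).
```

The matrix has one column per category of every variable, so it grows quickly with the number of variables and categories, which is the size issue the abstract points to.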
References
GROENEN, P.J.F. and VAN DE VELDEN, M. (2004): Inverse Correspondence Analysis. Linear Algebra and its Applications, 388, 221–238.
Keywords
CORRESPONDENCE ANALYSIS, MULTIPLE CORRESPONDENCE ANALYSIS, INVERSE PROBLEMS
Erasmus University [email protected] · Erasmus University Rotterdam [email protected] · Erasmus University [email protected]
Tracking Association Structures in Categorical Data Flows
Alfonso Iodice D’Enza1 and Angelos Markos2
Abstract
In modern applications, such as signal processing and social network analysis, data are produced at a high rate and the association structures change over time. Multiple Correspondence Analysis (MCA) is a well-established dimension reduction method aiming to explore the underlying structure of categorical data sets (Greenacre, 2007). A critical step of the MCA algorithm is the singular value decomposition (SVD) or eigenvalue decomposition (EVD) of a suitably transformed matrix. The high computational and memory requirements of ordinary SVD and EVD make their application impractical on massive or sequential data sets. Several enhanced SVD/EVD approaches have recently been introduced in an effort to overcome these issues. The aim of the present contribution is to extend MCA to allow for incremental updates (downdates) of existing MCA solutions, which lead to an approximate yet highly accurate solution. For this purpose, two incremental EVD and SVD approaches with desirable properties (Hall et al., 2002; Ross et al., 2008) are revised and embedded in the context of MCA. The proposed method is evaluated in terms of discrepancy from a classic MCA solution and applied to a real dataset.
References
GREENACRE, M.J. (2007): Correspondence Analysis in Practice, 2nd edition. Chapman & Hall/CRC.
HALL, P., MARSHALL, D. and MARTIN, R. (2002): Adding and subtracting eigenspaces with eigenvalue decomposition and singular value decomposition. Image and Vision Computing, 20, 1009–1016.
ROSS, D., LIM, J., LIN, R.S. and YANG, M.H. (2008): Incremental Learning for Robust Visual Tracking. International Journal of Computer Vision, 77, 125–141.
Keywords
MULTIPLE CORRESPONDENCE ANALYSIS, SINGULAR VALUE DECOMPOSITION, INCREMENTAL METHODS
Università di Cassino e del Lazio Meridionale, [email protected] · Democritus University of Thrace, [email protected]
Determining the Number of Clusters: a Problem of Definition or Estimation?
Giovanna Menardi1
Abstract
The problem of determining the optimal number of clusters in a set of data has been addressed from several perspectives, ranging from the naive “elbow” rule of thumb to its formalized version, the gap statistic, and to more refined criteria such as those based on evaluating the stability of a partition. Nonetheless, the question is far from having a clear answer. Asking which configuration is optimal becomes a wild-goose chase unless a true (albeit unknown) population structure, representing the ideal partition that clustering methods should try to approximate, is specified. Therefore, before addressing the problem of determining the right number of clusters, we need to define properly what a cluster is.
A precise statistical notion, not shared by most clustering methods, is provided by the density-based approach, which assumes that clusters are associated with some specific characteristic of the probability distribution underlying the data. Parametric methods usually associate clusters with homogeneous distributions that are combined in a mixture model, while methods following a nonparametric approach draw a correspondence between the groups and the modes of the density underlying the data. An appealing implication of the density-based formulation is that the number of clusters is conceptually well defined. Moreover, it follows that the ill-specified tasks of cluster detection and evaluation of cluster quality can be regarded as more circumscribed problems of estimation and goodness of fit.
However, the density-based framework is far from being an easy answer to the clustering problem. In this work the approach is critically reviewed from both a conceptual and an operational point of view, focusing in particular on the nonparametric perspective. Some connections with alternative formulations of the problem are highlighted, as well as the main challenges and directions for further research.
Keywords
CLUSTER, DENSITY ESTIMATION, MIXTURE MODELS, MODE SEEKING
Department of Statistical Sciences, University of Padua, via C. Battisti 241, [email protected]
Enhancing The Selection Of A Number Of Clusters In Model-Based Clustering With External Qualitative Variables
J.-P. Baudry, M. Cardoso, G. Celeux, M.J. Amorim, and A.S. Ferreira
Abstract
Usual criteria for selecting the number of clusters in model-based clustering, such as BIC, can sometimes lead to an unclear, uncertain choice. We propose a criterion that takes into account a classification of the data which is known a priori and may shed new light on the data, help to drive the selection of the number of clusters and make it clearer, without involving it in the design of the clustering itself. The variables used to build the clustering and the (qualitative) variables used as an a priori clustering have to be chosen carefully and with respect to the modeling purpose. This is an illustration of how the modeling purpose is directly involved in what a “good” number of clusters should be, and of the extent to which the latter should be thought of as dependent on the context.
Choosing the Number of Clusters after, before, and while Clustering
B. Mirkin
Abstract
Methods for determining the number of clusters in data can be categorized into the following three types: (1) post-processing; (2) pre-processing; and (3) at-processing methods.
The “post-clustering” type methods are the most popular: a number of partitions are generated, after which a procedure is run to determine the most suitable ones; the latter is usually based on (i) a “tightness” criterion, (ii) a “stability” criterion, or, more recently, (iii) a consensus approach. In a series of experiments with synthetic data of Gaussian clusters with varying spread, intermix and “elongation”, Chiang and Mirkin (2010) came up with a winner among seven or eight methods: the “rule of thumb” by Hartigan (1975), which involves a relative change in the value of the square-error clustering criterion. Unfortunately, the rule has been much less successful in cluster recovery.
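As a rough illustration of the “rule of thumb” mentioned above, the sketch below uses the form in which Hartigan's rule is commonly stated: with W_K the square-error criterion of a K-cluster partition of N objects, compute H(K) = (W_K / W_{K+1} − 1)(N − K − 1) and take the smallest K for which H(K) does not exceed 10. The function name, the input layout and the example numbers are assumptions made for illustration; they are not taken from the abstract.

```python
def hartigan_number_of_clusters(sse_by_k, n_objects, threshold=10.0):
    """Pick the number of clusters by Hartigan's rule of thumb.

    sse_by_k[k - 1] is the square-error criterion W_k of the best
    k-cluster partition found, for k = 1, 2, ..., K_max (hypothetical input).
    Returns the smallest k with H(k) <= threshold, or K_max if none qualifies.
    """
    for k in range(1, len(sse_by_k)):
        w_k, w_next = sse_by_k[k - 1], sse_by_k[k]
        if w_next == 0.0:
            # A perfect fit with k + 1 clusters: the improvement is unbounded.
            continue
        h = (w_k / w_next - 1.0) * (n_objects - k - 1)
        if h <= threshold:
            return k
    return len(sse_by_k)


# Illustrative criterion values for k = 1..5 on 100 objects (made-up numbers).
example_sse = [1000.0, 400.0, 100.0, 95.0, 92.0]
chosen_k = hartigan_number_of_clusters(example_sse, n_objects=100)  # 3
```

The threshold of 10 is the conventional value; it is exposed as a parameter because it is a heuristic rather than a derived constant.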
The “pre-processing” type methods explore the structure of the data by putting hypothetical cluster centers relatively far from each other in the data set, after which a clustering procedure is applied. Of a number of heuristics, the most promising, in the author’s view, is the “anomalous pattern” method by Mirkin (1987, 2005), which positions the hypothetical center points relative to a “reference” point in such a way that they are central to some dense parts of the data. This method involves a granularity parameter, a threshold on the cluster cardinality that decides whether a center should be discarded. When the threshold is 1 (discarding singletons only), this leads to overestimating the number of clusters on synthetic data (Chiang and Mirkin, 2010), yet on real-world data of moderate size it works well.
For the “at-processing” approach, divisive clustering is probably the only option in which further clustering can be stopped at any step (of sequential divisions). Kovaleva and Mirkin (2013) show that the rule by Tasoulis, Tasoulis and Plagianakos (2010) is superior to many other popular options. The rule involves projecting all the data in a cluster onto the first principal component, building a Parzen-type density function, and testing whether it has any minima at all. If not, the cluster is not split any further. The rule is by far superior to the popular statistical test of a single Gaussian against a mixture of Gaussians. It survives the introduction of noise into the data, except for the insertion of random objects. In the latter case, the rule should be applied to random projections of clusters as well (Kovaleva and Mirkin, 2013).
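The stopping rule described above can be sketched in a few lines. This is only an illustration in the spirit of the rule, not the cited authors' implementation: the power-iteration routine, the Gaussian kernel, the normal-reference bandwidth and the grid search for a minimum are all assumptions chosen to keep the example self-contained.

```python
import math


def first_pc_scores(points, iters=100):
    """Project points onto their first principal component (power iteration)."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Asymmetric start vector; a start orthogonal to the component would fail.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # no variance to follow
            break
        v = [c / norm for c in w]
    return [sum(x[j] * v[j] for j in range(d)) for x in centered]


def density_has_minimum(scores, grid_size=200):
    """Check a Gaussian Parzen density of 1-D scores for an interior minimum."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    if sd == 0.0:
        return False
    h = 1.06 * sd * n ** (-0.2)  # normal-reference bandwidth (an assumption)
    lo, hi = min(scores), max(scores)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [sum(math.exp(-0.5 * ((g - s) / h) ** 2) for s in scores) for g in grid]
    return any(dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
               for i in range(1, grid_size - 1))


def should_split(points):
    """Split a cluster only if its projected density shows a minimum."""
    return density_has_minimum(first_pc_scores(points))


# Two well-separated groups versus one compact group (made-up coordinates).
two_groups = [(0.1 * i, 0.1 * (i % 2)) for i in range(10)] + \
             [(10 + 0.1 * i, 0.1 * (i % 2)) for i in range(10)]
one_group = [(0.1 * i, 0.1 * (i % 2)) for i in range(20)]
```

In a divisive procedure, `should_split` would be called on each current cluster, and division would stop for clusters where it returns `False`.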
Keywords
CLUSTERING, NUMBER OF CLUSTERS, HARTIGAN’S RULE
NRU Higher School of Economics, Moscow, RF and Birkbeck University of London, UK [email protected]
Competitions in Machine Learning: the Fun, the Art, and the Science
Isabelle Guyon1
Abstract
Challenges have recently proved a great stimulus for research in machine learning, pattern recognition, and robotics. Robotics contests seem to be particularly popular, the most visible ones probably being the DARPA grand challenges of autonomous ground vehicle navigation and RoboCup, which features several challenges for robots, including playing soccer or rescuing people. The European network of excellence PASCAL has actively sponsored a number of challenges around hot themes in machine learning, held in conjunction with workshops at major international conferences, including KDD, ICML, and NIPS. These contests are oriented towards scientific research, and the main reward for the winners is to disseminate the product of their research and obtain recognition. In that respect, they play a different role than challenges like the Netflix prize, which offer large monetary rewards for solving a task of value to industry (movie referral in that particular case) but are narrower in scope. Attracting hundreds of participants and the attention of a broad audience of specialists as well as sometimes the general public, these events have been important in several respects: (1) pushing the state of the art, (2) identifying techniques which really work, (3) attracting new researchers, (4) raising the standards of research, (5) giving non-established researchers the opportunity to make themselves rapidly known. Since 2003, we have been organizing challenges in machine learning. We addressed problems of both fundamental and practical interest in machine learning, data mining or statistics, illustrated with data from various domains. For instance,
• in 2003 we organized a challenge on feature selection and in 2009 on sample selection (active learning),
• in 2006, 2007 and 2010, we organized a series of challenges on model construction and selection, including agnostic methods and methods using prior knowledge or knowledge transfer,
• between 2008 and 2013, we organized three challenges on causality.
Our challenge platforms, which remain open for post-challenge submissions, are constantly in use by students and have been used in practical work in our own classes and those of other professors throughout the world. We take great care to give participants opportunities to publish in reputable conference proceedings or journals like JMLR. We think of challenges as a means of carrying out research in machine learning by focusing the mental energy of brilliant researchers around the world. But what makes a good challenge that provides conclusive results having an important impact? This presentation will review the main findings of our past challenges and look upon them with a critical eye to identify strengths, weaknesses, and new directions.
ChaLearn, Berkeley, California
Playing with Data – or How to Discourage Incorrect Data Analysis
Klaas Sijtsma1
Abstract
Recent fraud cases in psychological and medical research have emphasized the need to pay attention to Questionable Research Practices (QRPs). Deliberate or not, QRPs usually have a detrimental effect on the quality and the credibility of research results. QRPs must be revealed, but prevention of QRPs is more important than detection. I suggest two policy measures that I expect to be effective in improving the quality of psychological research. First, the research data and the research materials should be made publicly available so as to allow verification. Second, researchers should more readily consider consulting a methodologist or a statistician. These two measures are simple but run against the common practice of keeping data to oneself and overestimating one’s methodological and statistical skills, thus allowing secrecy and errors to enter research practice.
Tilburg School of Social and Behavioral Sciences, Tilburg University
A Study on Small-Area Geographical Analysis of Residential Characteristics after the Great Hanshin-Awaji Earthquake by Two Individual Differences Models
Mitsuhiro Tsuji, Hiroshi Kageyama1 and Toshio Shimokawa2
Abstract
We discuss several approaches to realizing geographical small-area statistics by using multidimensional scaling (the INDSCAL model) and clustering (the INDCLUS model), which assume that the objects (geographical areas) are embedded in a continuous or discrete space common to all data, with individual differences obtained by weighting each dimension.
We apply some effective geographical approaches based on these two methods to carry out a structural analysis of residential characteristics (damage, population changes and so on) after the Great Hanshin-Awaji Earthquake.
The scaling and clustering of geographical space consider: 1) the characteristics of the feature space (continuous); 2) the spatial nature of the objects to be clustered geometrically (discrete); 3) the latent structure between earthquake damage and residential characteristics.
Keywords
SMALL-AREA STATISTICS, GREAT HANSHIN-AWAJI EARTHQUAKE, INDSCAL MODEL, INDCLUS MODEL
Kansai University, Takatsuki, Osaka, [email protected] · University of Yamanashi, Kofu, Yamanashi, [email protected]
Author Identification of Japanese Classical Literature by Quantitative Analysis
Gen Tsuchiyama1 and Masakatsu Murakami2
Abstract
Singular authorship of The Tale of Genji, the most famous and greatest accomplishment in Japanese classical literature of the Heian period (between 794 and 1185), is contentious. While literary scholars have long debated the authorship of this work, the issue has been largely ignored by Japanese statisticians. Therefore, in this study, we statistically analyze whether the author of the last ten chapters of The Tale of Genji, collectively titled Uji Jugo, also wrote the previous chapters.
In quantitative analyses of texts composed in the Japanese language, when the frequency of function words in certain documents substantially differs from that in other documents, the difference is generally attributed to varying author style. Thus, our analysis is based on word frequency.
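To make the idea of a word-frequency profile concrete, the following minimal sketch turns a tokenized chapter into a vector of relative function-word frequencies, the kind of feature vector that could then be passed to principal component analysis or a random forest. It assumes the text has already been segmented into tokens; the function name and the placeholder tokens are hypothetical and do not come from the study.

```python
from collections import Counter


def function_word_profile(tokens, function_words):
    """Relative frequency of each function word in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0 for _ in function_words]
    return [counts[w] / total for w in function_words]


# Placeholder tokens standing in for a segmented chapter (not real data).
chapter_tokens = ["wa", "x", "wa", "no", "y", "ni", "wa", "z"]
tracked_words = ["wa", "no", "ni", "mo"]
profile = function_word_profile(chapter_tokens, tracked_words)
```

Using relative rather than raw counts keeps chapters of different lengths comparable, which matters when chapter sizes vary widely.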
Word frequency throughout The Tale of Genji was analyzed by principal component analysis and random forests. No obvious difference in word usage was observed between Uji Jugo and the other chapters. Thus, we conclude that The Tale of Genji was likely composed by a single author.
References
BREIMAN, L. (2001): Random Forests. Machine Learning, 45, 5–32.
JIN, M. and MURAKAMI, M. (2007): Authorship Identification Using Random Forests. Proceedings of the Institute of Statistical Mathematics, 55, 255–268.
Keywords
QUANTITATIVE THEORY OF VOCABULARY, PRINCIPAL COMPONENT ANALYSIS, RANDOM FORESTS, JAPANESE LITERATURE
Graduate School of Culture and Information Science, Doshisha University, Kyoto, [email protected] · Faculty of Culture and Information Science, Doshisha University, Kyoto, [email protected]
A Latent Class Approach for Estimating Labour Market Mobility in the Presence of Multiple Indicators and Retrospective Interrogation
Francesca Bassi1, Marcel Croon2, and Davide Vidotto1
Abstract
Measurement errors can induce bias in the estimation of transitions, leading to erroneous conclusions about labour market dynamics. A large body of literature on gross flows estimation is based on the assumption that errors are uncorrelated over time. This assumption is not realistic in many contexts, because of survey design and data collection strategies. We use a model-based approach to adjust observed gross flows for classification errors, which may be correlated. A convenient framework is provided by latent class Markov models (Biemer and Bushery, 2000). We refer to data collected with the Italian Continuous Labour Force Survey, which is cross-sectional and quarterly, with a 2-2-2 rotating design. The questionnaire provides multiple indicators of labour force condition for each quarter: two collected in the same interview and a third collected after one year. Our approach provides a means to estimate labour market mobility taking into account correlated errors and the rotating design of the survey. Specifically, the best-fitting model is a mover-stayer latent class Markov model with covariates affecting latent transitions and correlated errors among indicators. A secondary result of our research is that the mover-stayer model and the latent class Markov model estimate the same amount of measurement error in the data. The better fit of the mixture specification is entirely due to more accurately estimated latent transitions. This evidence contradicts results in the previous literature (see, for example, Magidson et al., 2007).
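The opening claim, that classification errors bias observed transitions, can be illustrated with a small numerical sketch. It uses the simpler independent-errors setting, which is the baseline assumption the abstract describes as often unrealistic, and not the correlated-errors model the authors fit. The two-state setup and all probabilities below are made-up values for illustration.

```python
def observed_flows(initial, transition, error):
    """Joint distribution of observed states at two waves.

    initial[a]        P(true state a at wave 1)
    transition[a][b]  P(true state b at wave 2 | true state a at wave 1)
    error[a][i]       P(observed state i | true state a), independent per wave
    """
    k = len(initial)
    return [[sum(initial[a] * transition[a][b] * error[a][i] * error[b][j]
                 for a in range(k) for b in range(k))
             for j in range(k)]
            for i in range(k)]


# Nobody truly changes state, but 10% of observations are misclassified.
no_mobility = [[1.0, 0.0], [0.0, 1.0]]
misclassification = [[0.9, 0.1], [0.1, 0.9]]
flows = observed_flows([0.5, 0.5], no_mobility, misclassification)

# Share of respondents who appear to change state between the two waves.
spurious_mobility = flows[0][1] + flows[1][0]
```

With these inputs the apparent mobility is about 18% although true mobility is zero, which is why observed gross flows need to be adjusted for classification error before being interpreted.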
References
BIEMER, P.P. and BUSHERY, J.M. (2000): On the validity of Markov latent class analysis for estimating classification errors in labour force data. Survey Methodology, 26, 139–152.
MAGIDSON, J., VERMUNT, J.K. and TRAN, B. (2007): Using a mixture of latent Markov model to analyze longitudinal U.S. employment data involving measurement error. In: K. Shigemasu, A. Okada, T. Imaizumi and T. Hoshino (Eds.): New Trends in Psychometrics. Universal Academy Press, 235–242.
Keywords
MIXTURE LATENT CLASS MODEL, GROSS FLOWS, CORRELATED ERRORS
Department of Statistical Sciences, University of [email protected] · Methodology Department, University of Tilburg, NL
On Finite Mixtures of Skew Distributions
Geoff McLachlan and Sharon Lee
Abstract
Non-normal mixture distributions have received increasing attention in recent years. Finite mixtures of multivariate skew symmetric distributions, in particular the skew normal and skew t-mixture models, are emerging as a promising extension to the traditional normal and t-mixture modelling. Most of these parametric families of skew symmetric distributions are closely related. In this talk, we give a brief overview of various existing proposals for multivariate skew distributions. We consider a classification of them into four forms, namely, the restricted, unrestricted, extended, and generalised forms, based on their characterizations. We compare the relative performance of restricted and unrestricted skew mixture models in clustering and density estimation on four real datasets. We also compare their performance with mixtures having other non-normal component distributions.
Geoff McLachlan · Sharon Lee
University of Queensland
Classification via Mixtures of Shifted Asymmetric Laplace and Mixtures of Generalized Hyperbolic Distributions
Paul D. McNicholas1, Ryan P. Browne1, and Brian C. Franczak1
Abstract
The recent burgeoning of non-Gaussian approaches to model-based classification includes work on the multivariate t-distribution, the skew-normal distribution, and the skew-t distribution, as well as other approaches. We add to the richness of the palette of non-Gaussian mixture model-based approaches to classification by introducing a mixture of shifted asymmetric Laplace distributions and a mixture of generalized hyperbolic distributions. The mathematical development of each mixture model relies on its relationship with the generalized inverse Gaussian distribution. Parameter estimation is outlined within the expectation-maximization framework before the performance of our mixture models is illustrated on simulated and real data. We conclude with a discussion of the anticipated impact of these models and details of some ongoing work.
References
BARNDORFF-NIELSEN, O. (1978): Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics, 5, 151–157.
FRANCZAK, B.C., BROWNE, R.P. and McNICHOLAS, P.D. (2012): Mixtures of shifted asymmetric Laplace distributions. Arxiv preprint arXiv:1207.1727v3.
KOTZ, S., KOZUBOWSKI, T.J. and PODGORSKI, K. (2001): The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Birkhauser, Boston.
JØRGENSEN, B. (1982): Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer-Verlag, New York.
Keywords
ASYMMETRIC LAPLACE, GENERALIZED HYPERBOLIC, GENERALIZED INVERSE GAUSSIAN, MIXTURE MODELS
Department of Mathematics and Statistics, University of Guelph, Ontario, N1G 2W1, Canada. {pmcnicho,rbrowne,bfrancza}@uoguelph.ca
Gaussian And Distance Based Clustering In High-Dimensional Space: Differences And Common Aspects
Francesco Palumbo1, Cristina Tortora2, and Paul McNicholas2
Abstract
Non-hierarchical cluster analysis aims at identifying the optimal partition of a multivariate data set into k groups. The most recent contributions in the field focus on the probabilistic (Celeux and Govaert, 1995) and distance-based (Ben-Israel and Iyigun, 2008) mixture model approaches, which ensure good performance under a wide range of hypotheses. The former assumes that clusters are generated by the same probability function (generally Gaussian) with different parameters for each group; the latter is distribution free, and units are assigned to the groups according to a distance function. This talk aims at presenting and discussing two extensions of the above-mentioned approaches for cases where the high dimensionality of the space requires feature reduction. In particular, we focus on the advantages and drawbacks of the following approaches that integrate clustering and dimensionality reduction: Mixture of Factor Analyzers, Mixture of Parsimonious Gaussian Mixture models (McNicholas and Murphy, 2008), Mixture of High-Dimensional mixture models, Discriminative latent mixture models (Bouveyron and Brunet-Saumard, 2012) and Factor PD-clustering (Tortora et al., 2011). Overall performances are compared using simulated and real data.
References
Ben-Israel, A. and Iyigun, C. (2008): Probabilistic d-clustering. Journal of Classification, 25(1):5–26.
Bouveyron, C. and Brunet-Saumard, C. (2012): Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis.
Celeux, G. and Govaert, G. (1995): Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793.
McNicholas, P.D. and Murphy, T.B. (2008): Parsimonious Gaussian Mixture models. Statistics and Computing, 18(3):285–296.
Tortora, C., Gettler Summa, M., and Palumbo, F. (2011): Factor PD-clustering. Proceedings of the Joint Conference of the German Classification Society.
Keywords
MODEL BASED CLUSTERING, DISTANCE BASED CLUSTERING, SIMULATION STUDY
Università di Napoli Federico II, [email protected] · University of Guelph, [email protected], [email protected]
Clustering and Dimension Reduction using Non-Gaussian Mixtures
Katherine Morris and Paul McNicholas
Abstract
We introduce a dimension reduction method for model-based clustering using non-Gaussian distributions, specifically t, shifted asymmetric Laplace, and generalized hyperbolic distributions. The approach is analogous to existing work within the Gaussian paradigm. By employing sliced inverse regression, the method relies on identifying a reduced subspace of the data by considering the extent to which group means and group covariances vary. This subspace contains linear combinations of the original data, which are ordered by importance via the associated eigenvalues. Observations can be projected onto the subspace and the resulting set of variables captures most of the clustering structure available in the data. Our clustering approaches are illustrated on simulated and real data, and compared to each other as well as their Gaussian counterpart.
References
ANDREWS, J. L. and MCNICHOLAS, P. D. (2012): Model-based Clustering, Classification, and Discriminant Analysis via Mixtures of Multivariate t-distributions: The tEIGEN Family. Statistics and Computing, 22(5), 1021–1029.
FRANCZAK, B., BROWNE, R. P. and MCNICHOLAS, P. D. (2012): Mixtures of Shifted Asymmetric Laplace Distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 263–286.
LI, K. C. (1991): Sliced Inverse Regression for Dimension Reduction (with discussion). Journal of the American Statistical Association, 86, 316–342.
SCRUCCA, L. (2010): Dimension Reduction for Model-based Clustering. Statistics and Computing, 20(4), 471–484.
Keywords
DIMENSION REDUCTION, MIXTURE MODELS, MODEL-BASED CLUSTERING
Department of Mathematics & Statistics, University of Guelph, Ontario, Canada {kmorri09, pmcnicho}@uoguelph.ca
Comparison of Spatial Clusters between Suicide Data and Its Increase-decrease Rates in Japan
Makoto Tomita1, Takafumi Kubota2, Fumio Ishioka3 and Toshiharu Fujita2
Abstract
Our data are the numbers of suicides in Japan between 1973 and 2007, divided into six periods of five years each (only the first period spans ten years) and aggregated from municipality units into 348 secondary medical care zones. These data were compiled as part of medical planning. There are several approaches to detecting hotspots in different kinds of spatial data. The spatial scan statistic, which finds hotspot areas based on a likelihood ratio, is a very common and useful method. However, it tends to detect hotspots much larger than the true hotspot, and therefore does not always detect hotspots with high relative risk. Echelon analysis is a useful technique for systematically and objectively investigating the phase structure of spatial lattice data. In this paper, we study space-time clusters as well as spatial clusters of each increase-decrease rate from one period to the next, evaluating these data using Echelon analysis.
Acknowledgement
This was a part of funded research from National Institute of Mental Health, National Center of Neurology and Psychiatry and was partially supported by KAKENHI 24500337, KAKENHI 21700317 and KAKENHI 21700305.
References
Ishioka, F. and Kurihara, K. (2012): Hotspot Detection Using Scan Method Based on Echelon Analysis. Proceedings of the Institute of Statistical Mathematics, 60(1):93–108.
Keywords
SPATIAL DATA, SUICIDE DATA, SPACE CLUSTERS, SPACE-TIME CLUSTERS
Tokyo Medical and Dental University, Tokyo, [email protected] · The Institute of Statistical [email protected] · School of law, Okayama [email protected]
Detection of Spatial Clusters for High and Low Suicidal Risk Areas in Japan
Takafumi Kubota1, Makoto Tomita2, Fumio Ishioka3, Tomokazu Fujino4 and Hiroe Tsubaki5
Abstract
This study detected spatial clusters with both high and low suicidal risks. Small-area data for the Kanto district of Japan from "Statistics of Community for the Death from Suicide" were used to calculate the standardized mortality ratio (SMR) of suicide and non-suicide. Spatial scan statistics were then applied to both SMRs to find candidate areas with statistically high values. Finally, the detected high and low suicide areas were compared with those in the previous study by Kubota et al. (2011) to discuss the risks of suicide in these areas and to present interpretations of them.
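For readers unfamiliar with the SMR, the sketch below shows the usual indirect-standardization calculation: observed deaths in an area divided by the number expected if reference rates applied to the area's own population. The strata, rates and counts are invented for illustration and are not values from the study; the function names are likewise hypothetical.

```python
def expected_deaths(reference_rates, population):
    """Deaths expected if the reference rates applied to this area's population."""
    return sum(reference_rates[stratum] * population[stratum]
               for stratum in population)


def smr(observed, reference_rates, population):
    """Standardized mortality ratio: observed deaths over expected deaths."""
    return observed / expected_deaths(reference_rates, population)


# Hypothetical reference rates per person and a hypothetical small area.
rates = {"under_40": 0.0001, "40_and_over": 0.0003}
area_population = {"under_40": 20000, "40_and_over": 10000}
area_smr = smr(10, rates, area_population)
```

An SMR above 1 indicates more deaths than expected under the reference rates and a value below 1 indicates fewer; in small areas the ratio is unstable because the expected count is small, which is one reason scan-type methods pool neighbouring areas.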
Acknowledgement
This is a part of funded research from National Institute of Mental Health, National Center of Neurology and Psychiatry and is also partially supported by KAKENHI 21700305, KAKENHI 24500337 and KAKENHI 23500358.
References
Fujita, T. (2009): Statistics of Community for the Death from Suicide. National Institute of Mental Health, National Center of Neurology and Psychiatry, Japan.
Kubota, T., Tomita, M., Ishioka, F. and Fujita, T. (2011): Spatial Autocorrelation Statistics and Spatial Clustering in the Areas in Japan with Low Suicide Rates. Joint Meeting of 7th Conference of the Asian Regional Section of the IASC and 2011 Taipei International Statistical Symposium, 99-100.
Keywords
SPATIAL CLUSTERING, SUICIDE DATA, SMALL AREA DATA
The Institute of Statistical [email protected] · Tokyo Medical and Dental University · Okayama University · Fukuoka Women’s University · The Institute of Statistical Mathematics
Patterns of Cultural Practices and Characteristics of the Cultural Omnivore
Miki Nakai
Abstract
In this paper, we attempt to examine how styles of cultural consumption are classified and characterized. In sociological arguments focusing on cultural stratification, it has been theorized that participation in highbrow culture is a feature of the elite in society (Bourdieu 1979). Other researchers, on the other hand, have argued that the omnivorous taste pattern shows up in numerous countries (Peterson and Simkus 1992), and this hypothesis has become prevalent. However, the characteristics of omnivorousness, and how omnivore and univore types of cultural clusters are associated with socioeconomic status, have received less consideration in Japan (Nakai 2011). Using data from a national sample in Japan in 2005 (N=2915), patterns and determinants of cultural consumption are examined. Our findings from latent class analysis seem to reveal that there is a small number of notable groups in terms of cultural practices (Vermunt 1997). These include the omnivore class as well as the inactive class. The resultant cultural clusters seem consistent with the omnivore-univore hypothesis.
References
BOURDIEU, P. (1979): La Distinction: Critique Sociale du Jugement. Paris: Minuit.
NAKAI, M. (2011): Social Stratification and Consumption Patterns: Cultural Practices and Lifestyles in Japan. In: S. Ingrassia, R. Rocci, and M. Vichi (Eds.): New Perspectives in Statistical Modeling and Data Analysis. Springer, Berlin, 211–218.
PETERSON, R. A. and SIMKUS, A. (1992): How Musical Taste Groups Mark Occupational Status Groups. In: M. Lamont and M. Fournier (Eds.): Cultivating Differences. Chicago, IL: Univ. of Chicago Press.
VERMUNT, J.K. (1997): LEM: A General Program for the Analysis of Categorical Data. Department of Methodology and Statistics, Tilburg University.
Keywords
CULTURAL PRACTICES, OMNIVORE, SOCIAL STRATIFICATION
Department of Social Sciences, College of Social Sciences, Ritsumeikan University, 56-1 Toji-in Kitamachi, Kyoto 603-8577 [email protected]
The Structure Of Subjective Social Status In Japan: An Approach Based On Latent Class Model
Yusuke Kanazawa1
Abstract
Previous studies have examined people’s subjective social status using two different approaches. The first is the social psychological approach, which explores the relationship between subjective social status and other kinds of social consciousness (e.g. Nakao 2002). The second is the social class approach, which explains subjective social status on the basis of people’s objective social status (Hodge and Treiman 1968; Hout 2008). This study integrates these two approaches by using latent class models.
First, I analyzed the relationship between subjective social status and other kinds of social consciousness, such as life satisfaction, satisfaction with one’s own socio-economic status (SES) and change in living standard, by latent class analysis (McCutcheon 1987), using a nationally representative dataset (the 2010 Stratification and Social Psychology Interview Survey). As a result, I extracted four latent classes: (a) the subjective middle class (32.7%), who identify themselves as “middle” in society and are satisfied with their life but think their life has been unchanged in recent years; (b) the subjective upper class (28.3%), who identify themselves as “upper” in society, are satisfied with their life and own SES, and think their life has changed for the better; (c) the subjective lower class (21.4%), who identify themselves as “lower” in society, are dissatisfied with their life and own SES, and think their life has changed for the worse; and (d) the neutral response group (17.6%), who answer “middle” to the question on subjective social status and choose the neutral response (i.e. the center of the response categories) in the other questions.
Next, I analyzed the relationship between the four classes and respondents’ objective social status by multinomial logit latent-class regression analysis (Yamaguchi 2000). The results were as follows. (A) Compared to the subjective middle class, the subjective upper class attains higher levels of education and income. (B) Compared to the subjective middle class, the subjective lower class attains lower levels of education, income and occupational prestige. (C) There is no difference between the neutral response group and the subjective middle class in objective social status. However, the neutral response group shows lower levels of cooperation with the survey than the subjective middle class.
References
Hodge, R. W. and Treiman, D. J. (1968): Class Identification in the United States. American Journal of Sociology, 73: 535-47.
Hout, M. (2008): How Class Works: Objective and Subjective Aspects of Class Since the 1970s. In: A. Lareau and D. Conley (Eds.): Social Class: How Does It Work? Russell Sage Foundation, New York, 25-89.
McCutcheon, A. L. (1987): Latent Class Analysis. Sage, Thousand Oaks.
Nakao, K. (2002): Status Identification and Perception about Standard of Living. Sociological Theory and Methods, 17, 135-149. [in Japanese]
Yamaguchi, K. (2000): Multinomial Logit Latent-Class Regression Models: An Analysis of the Predictors of Gender Role Attitude Among Japanese Women. American Journal of Sociology, 105: 1702-40.
Keywords
SUBJECTIVE SOCIAL STATUS, SOCIAL SURVEY, LATENT CLASS ANALYSIS, MULTINOMIAL LOGIT LATENT-CLASS REGRESSION MODEL
Center for Statistics and Information, Rikkyo [email protected]
Reference Set Selection for Multivariate Statistical Process Monitoring with Biplots
RF Rossouw1, RLJ Coetzer1, and NJ Le Roux2
Abstract
The fundamental approach of almost all multivariate process monitoring procedures is to first specify a historical reference set that is within statistical control. However, the current literature focuses on multivariate statistical monitoring of many process variables simultaneously for a single process. The selection of a reference set that is within statistical control or conforms to some specified accepted performance measure(s) for multiple production processes simultaneously has received very little attention. The selection of the optimal reference set for a monitoring biplot for multiple processes has, to our knowledge, not been discussed previously. Therefore, in this paper we present a methodology for selecting a reference set for multivariate process monitoring of many process variables using the biplot (Gower et al., 2011), which allows for efficient monitoring of multiple production processes. It will be demonstrated how a combination of Generalized Orthogonal Procrustes Analysis (Gower and Dijksterhuis, 2004) and biplot methodology (Arnold et al., 2007) can be used to find both the optimal production process and the optimal period for the reference set.
References
Arnold, G. M., Gower, J. C., Gardner-Lubbe, S., and le Roux, N. J. (2007). Biplots of free-choice profile data in generalized orthogonal Procrustes analysis. Applied Statistics, 56, 445-458.
Gower, J. C. and Dijksterhuis, G. B. (2004). Procrustes Problems. Oxford, UK: Oxford University Press.
Gower, J.C., Lubbe, S. and Le Roux, N.J. (2011). Understanding Biplots. Chichester, UK: John Wiley & Sons.
Keywords
BIPLOTS, PROCESS MONITORING, PROCRUSTES ANALYSIS
Sasol Technology Research and Development, Sasol, Private Bag 1, Sasolburg, 1947, South Africa [email protected], [email protected] · Department of Statistics and Actuarial Science, Stellenbosch University, Private Bag X1, Matieland, 7602, South [email protected]
PLS Biplot: Another Graphical Tool for Multivariate Data
Opeoluwa V.F. Oyedele1 and Sugnet Lubbe2
Abstract
In multivariate analysis, data matrices are often very large and therefore it is difficultto describe the structure and make a visual inspection of therelationship between theirrespective rows (samples) and columns (variables). For this reason, biplots, the jointgraphical display of rows and columns of a data matrix, can bea useful tool for anal-ysis. Biplots have been employed in a number of multivariatemethods such as Corre-spondence Analysis, Principal Component Analysis, Canonical Variate Analysis, andDiscriminant Analysis, as a form of graphical display of data.
Another (popular) multivariate method is Partial Least Squares (PLS). Introduced by Wold (1966) as a regression method, PLS is more flexible than multivariate regression, and better suited than Principal Component Regression to the prediction of a set of response variables from a large set of predictors. Different iterative algorithms have been proposed for estimating the PLS regression coefficients. The most popular algorithms are NIPALS (Nonlinear Iterative PArtial Least Squares), Kernel and SIMPLS (Statistically Inspired Modification of Partial Least Squares).
In this paper the biplot is employed in the form of the PLS biplot, a new addition to the biplot family. Like other biplots, the PLS biplot shows, in graphic form, the association between samples and (or) variables, and it provides a single graphical representation of the results from the PLS regression analysis. Two different forms of the PLS biplot are discussed: first, the typical Gower and Hand (1996) biplot style with calibrated biplot axes; second, the area biplot introduced by Gower, Groenen and Van de Velden (2010), which is utilised to ease representation of the matrix of PLS regression coefficients.
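As an illustration of how PLS regression coefficients can be estimated iteratively, the following is a minimal sketch of the NIPALS iteration for a single response (often called PLS1). It is not taken from the paper: the function name `pls1_nipals`, the toy data and the choice of a single response are assumptions made for this example, and it uses plain Python lists rather than a numerical library.

```python
import math

def pls1_nipals(X, y, n_components):
    """Sketch of PLS1 (single response) fitted by the NIPALS iteration.

    Returns (beta, intercept) so that y is approximated by
    intercept + sum_j beta[j] * x[j]. Names are illustrative only.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # deflated X
    f = [v - y_mean for v in y]                                # deflated y
    R, P = [], []            # rotations r_a and X-loadings p_a found so far
    beta = [0.0] * p
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores, X-loadings and the y-loading for this component.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Rotation r_a expresses the scores in terms of the ORIGINAL centred X.
        r = w[:]
        for r_prev, p_prev in zip(R, P):
            coef = sum(pj * wj for pj, wj in zip(p_prev, w))
            r = [rv - coef * rp for rv, rp in zip(r, r_prev)]
        R.append(r)
        P.append(load)
        beta = [b + q * rv for b, rv in zip(beta, r)]
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return beta, intercept

# Toy data (made up): y = 2*x1 - x2 + 3 exactly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2 * a - b + 3 for a, b in X]
beta, intercept = pls1_nipals(X, y, n_components=2)
```

With as many components as the predictor matrix has rank, the PLS fit coincides with ordinary least squares, which is why the toy example recovers the generating coefficients. In practice fewer components are retained, and the matrix of coefficients produced this way is the kind of quantity the area biplot is meant to display.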
References
GOWER, J.C., GROENEN, P.J.F. and VAN DE VELDEN, M. (2010): Area Biplots. Journal of Computational and Graphical Statistics, 19, 46–61.
GOWER, J.C. and HAND, D.J. (1996): Biplots. Monographs on Statistics and Applied Probability. Chapman & Hall, London.
WOLD, H. (1966): Estimation of Principal Components and Related Models by Iterative Least Squares. In P.R. Krishnaiah (Ed.): Multivariate Analysis. Academic Press, New York, 391–420.
Keywords
AREA BIPLOT, BIPLOT, PARTIAL LEAST SQUARES REGRESSION
University of Cape Town, Cape Town, South Africa [email protected] · University of Cape Town, Cape Town, South Africa [email protected]
Variable Selection for Regression and PLS using Genetic Algorithms and Particle Swarm Optimization: A Comparison between the Two Methods
Martin Philip Kidd1 and Martin Kidd2
Abstract
Genetic Algorithms (GA) and Particle Swarm Optimization (PSO) (Moraglio et al., 2008) have previously been shown to be successful for variable selection in a regression setting (Talbi et al., 2008). In this presentation we share some of our experiences in applying these techniques to simulated and actual data for multiple regression and Partial Least Squares (PLS). For PLS the optimal number of components was treated as part of the optimization algorithm, and for both methods the optimal number of variables was also treated as part of the optimization.
A further adaptation, called hybrid GA (PSO), was made to the optimization algorithm. Each member of the population (outer algorithm) is used as input to another GA (PSO) algorithm (inner algorithm). The outer algorithm focuses on diversification while the inner algorithm focuses on intensification.
For multiple regression and PLS, simulated data sets were constructed with only a small number of significant predictors from a "large" pool (in excess of 500) of predictor variables. The time taken for the algorithms to find the significant variables was recorded. Results will be shown for various selections of population (swarm) sizes and other tuning parameters.
In similar fashion, comparisons of the algorithms on real data will also be reported.
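To make the idea of GA-based variable selection concrete, here is a minimal sketch under stated assumptions; it is not the authors' algorithm and does not include the hybrid inner/outer scheme. Each candidate is a list of booleans marking which variables are selected. The names `evolve`, `toy_fitness` and `RELEVANT` are hypothetical, and the toy fitness merely rewards hitting a known set of relevant variables; in a real application it would be replaced by a model-based criterion such as cross-validated prediction error.

```python
import random

RELEVANT = {2, 7, 11}  # hypothetical "truly significant" variables

def toy_fitness(mask):
    """Stand-in fitness: +2 per relevant variable selected, -1 per other."""
    return sum(2 if i in RELEVANT else -1 for i, on in enumerate(mask) if on)

def evolve(fitness, n_vars, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Simple GA with tournament selection, one-point crossover,
    bit-flip mutation and elitism (the best candidate always survives)."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        children = [pop[0][:]]  # elitism: copy the current best unchanged
        while len(children) < pop_size:
            a = max(rng.sample(pop, 2), key=fitness)  # tournament of size 2
            b = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, n_vars)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve(toy_fitness, n_vars=12)
selected = [i for i, on in enumerate(best) if on]
```

Because the best candidate is carried over unchanged, the best fitness never decreases from one generation to the next. A hybrid variant along the lines described above could call a second, shorter `evolve`-like search seeded from each member of the outer population.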
References
Moraglio, A., Di Chio, C., Togelius, J. and Poli, R. (2008): Geometric particle swarm optimization. Journal of Artificial Evolution and Applications, Vol. 2008, doi:10.1155/2008/143624.
Talbi, E-G., Jourdan, L., Garcia-Nieto, J. and Alba, E. (2008): Comparison of population based metaheuristics for feature selection: Application to microarray data classification. 2008 IEEE/ACS International Conference on Computer Systems and Applications, Vols 1-3. Book Series: International Conference on Computer Systems and Applications, pages 45-52.
Keywords
GENETIC ALGORITHMS, PARTICLE SWARM OPTIMIZATION, PLS, REGRESSION, VARIABLE SELECTION
Operations Research Group, Dipartimento di Elettronica, Informatica e Sistemistica (DEIS), Universita degli Studi di Bologna, Bologna, [email protected] · Centre for Statistical Consultation (CSC), Stellenbosch University, Stellenbosch, South Africa [email protected]
Classification with Hyperspheres
Morné Lamont
Abstract
The classification of observations plays a very important role in many applied research areas. The most well-known classification (discriminant) technique was proposed by Fisher (1936) and is called Fisher's linear discriminant analysis. Many traditional statistical techniques such as Fisher's linear discriminant analysis have been kernelized (Mika et al., 1999). Other kernelized methods include kernel principal component analysis, kernel ridge regression and kernel clustering (Cristianini and Shawe-Taylor, 2004). Kernel-based multivariate techniques have gained popularity in statistics over the past few decades. The most well-known kernel-based technique is probably the support vector machine (Boser et al., 1992), which is known for its state-of-the-art performance. In this paper, another kernel-based technique called the smallest enclosing hypersphere is reviewed. This technique was used by Tax and Duin (1999) to develop an outlier detector. In this paper we will use the smallest enclosing hypersphere for statistical classification (called nearest hypersphere classification or NHC). We will give an explanation of how the NHC is performed. The NHC is compared to other popular statistical classification methods in a simulation study and on two real-world datasets. The properties and advantages of NHC will also be highlighted. NHC is a non-parametric approach to classification and provides more advantages and flexibility than the traditional classification methods.
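The idea of classifying with one enclosing hypersphere per class can be sketched as follows. This is a simplified illustration and not the method of the paper: it works in the input space (no kernel), it approximates each smallest enclosing ball with the simple Badoiu-Clarkson iteration instead of solving the quadratic programme used by Tax and Duin, and the decision rule (smallest squared distance to the centre minus squared radius) is an assumption made for this example. The names `enclosing_ball`, `fit_nhc` and `classify_nhc` are hypothetical.

```python
import math

def enclosing_ball(points, iterations=1000):
    """Approximate smallest enclosing ball of a list of points.

    Badoiu-Clarkson iteration: repeatedly move the centre a shrinking
    step towards the point that is currently farthest away.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        far = max(points, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in points)
    return center, radius

def fit_nhc(samples_by_class):
    """Fit one (approximate) enclosing ball per class label."""
    return {label: enclosing_ball(pts) for label, pts in samples_by_class.items()}

def classify_nhc(x, balls):
    """Assumed rule: pick the class whose ball surface is 'nearest',
    measured as squared distance to the centre minus squared radius."""
    return min(balls, key=lambda k: math.dist(x, balls[k][0]) ** 2 - balls[k][1] ** 2)

# Toy training data (made up for illustration).
training = {
    "a": [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)],
    "b": [(9.0, 0.0), (11.0, 0.0), (10.0, 0.5)],
}
balls = fit_nhc(training)
label_1 = classify_nhc((0.2, 0.1), balls)
label_2 = classify_nhc((10.3, 0.0), balls)
```

A kernelized version would replace the Euclidean distances by distances in the feature space induced by a kernel function, so that the enclosing surfaces are no longer spherical in the original variables.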
References
BOSER, B.E., GUYON, I.M. and VAPNIK, V.N. (1992): A training algorithm for optimal margin classifiers. In: D. Haussler (Ed.): Proceedings of the 5th annual ACM workshop on Computational Learning Theory, 144–152.
CRISTIANINI, N. and SHAWE-TAYLOR, J. (2004): Kernel Methods for Pattern Analysis. Cambridge University Press, New York.
FISHER, R.A. (1936): The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
MIKA, S., RÄTSCH, G., WESTON, J., SCHÖLKOPF, B. and MÜLLER, K.-R. (1999): Fisher discriminant analysis with kernels. In: Y.-H. Hu, J. Larsen, E. Wilson and S. Douglas (Eds.): Neural Networks for Signal Processing IX. IEEE, 41–48.
TAX, D.M.J. and DUIN, R.P.W. (1999): Support vector domain description. Pattern Recognition Letters, 20, 11–13.
Keywords
DISCRIMINANT ANALYSIS, HYPERSPHERE, KERNEL FUNCTION, SUPPORT VECTORS
Department of Statistics and Actuarial Science, Stellenbosch University, Private Bag X1, 7602, South Africa, [email protected]
Separation And Convexity Properties Of Hierarchical And Non Hierarchical Clustering
Patrice Bertrand1 and Jean Diatta2
Abstract
Weak hierarchies and paired hierarchies both extend the well-known hierarchical clustering structure. Weak hierarchies are collections of clusters such that the intersection of any three clusters is the intersection of some two of them. They play a central role in the study of theoretical properties of arbitrary cluster structures. Paired hierarchies are a type of weak hierarchy, and they are represented by planar graphs which are very similar to dendrograms. As in a hierarchy, each cluster of a paired hierarchy is displayed as an interval of some linear ordering of the data set, the only difference being the possible existence of cluster overlaps, at most one for each cluster. The purpose of this presentation is to characterize the previously mentioned cluster structures, namely hierarchies, weak hierarchies and paired hierarchies, both in terms of a ternary separation relation, on the one hand, and, on the other hand, in terms of some abstract convexity which depends on the type of cluster structure being considered.
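The defining condition of a weak hierarchy can be checked directly on a small collection of clusters. The sketch below is only an illustration of the definitions quoted above; the function names `is_hierarchy` and `is_weak_hierarchy` and the example collections are assumptions for this example, and a hierarchy is taken here in the usual sense that any two clusters are either disjoint or nested.

```python
from itertools import combinations

def is_hierarchy(clusters):
    """Any two clusters are either disjoint or nested."""
    sets = [frozenset(c) for c in clusters]
    for a, b in combinations(sets, 2):
        common = a & b
        if common and common != a and common != b:
            return False
    return True

def is_weak_hierarchy(clusters):
    """The intersection of any three clusters equals the intersection
    of some two of them."""
    sets = [frozenset(c) for c in clusters]
    for a, b, c in combinations(sets, 3):
        if (a & b & c) not in (a & b, a & c, b & c):
            return False
    return True

# Overlapping clusters: not a hierarchy, but still a weak hierarchy.
overlapping = [{1, 2}, {2, 3}, {1, 2, 3}]
# Three pairwise-overlapping pairs with empty common intersection:
# the triple intersection is not any of the pairwise intersections.
cyclic = [{1, 2}, {2, 3}, {1, 3}]

results = (
    is_hierarchy(overlapping),
    is_weak_hierarchy(overlapping),
    is_weak_hierarchy(cyclic),
)
```

The first collection shows how a weak hierarchy admits overlapping clusters that a hierarchy forbids, while the second shows a collection that fails even the weaker condition.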
References
BANDELT, H.J. and DRESS, A.W.M. (1989): Weak hierarchies associated with similarity measures: an additive clustering technique. Bull. Math. Biology 51, 113–166.
BERTRAND, P. (2008): Set systems for which each set properly intersects at most one other set - Application to cluster analysis. Discrete Applied Mathematics 156(8), 1220–1236.
DIATTA, J. and FICHET, B. (1998): Quasi-ultrametrics and their 2-ball hypergraphs. Discrete Mathematics 192, 87–102.
POWERS, R.C. (2007): Hierarchies and ternary separation. Applied Mathematics Letters 20(3), 279–283.
Keywords
TERNARY SEPARATION, ABSTRACT CONVEXITY, HIERARCHY, WEAK HIERARCHY, PAIRED HIERARCHY
CEREMADE, Université Paris Dauphine, Paris, [email protected] · LIM-EA2525, Université de la Réunion, Saint-Denis, [email protected]
Latticial Approach for Perfect Phylogeny Problems
François Brucker and Pascal Préa
Abstract
We present a combinatorial model which generalizes phylogenetic trees. This model links together a graph model (strongly chordal graphs), a lattice model (crown-free lattices) and a clustering model (chordal quasi-ultrametrics). This structure makes it possible to model phylogenetic networks and to associate attributes with a phylogenetic tree.
In classification, this kind of approximation yields a global visualization of the clusters and their relationships through dedicated 2-dimensional or 3-dimensional representations. It can be seen as a compromise between hierarchies (simple structure; easy to interpret) and general lattices (rich interactions between elements; hard to interpret).
References
BRUCKER, F. and GÉLY, A. Crown-free Lattices and Their Related Graphs. Order, 28:443–454, 2010.
FARBER, M. Characterizations of strongly chordal graphs. Discrete Mathematics, 43:173–189, 1983.
KELLY, D. and RIVAL, I. Crowns, fences, and dismantlable lattices. Canadian Journal of Mathematics, 26:1257–1271, 1974.
SPINRAD, J. P. Efficient Graph Representations. American Mathematical Society, Providence, Rhode Island, 2003.
Keywords
PERFECT PHYLOGENY, CROWN-FREE LATTICES, DISSIMILARITY, STRONGLY CHORDAL GRAPHS
Laboratoire LIF, UMR 7279, École Centrale Marseille, 38 rue Joliot-Curie - F-13451 Marseille [email protected]; [email protected]
Some Aspects of Formal Concept Analysis in Hierarchical Classification and Data Analysis
Mehdi Kaytoue1, Sergei O. Kuznetsov2, and Amedeo Napoli3
Abstract
In Formal Concept Analysis (FCA [1]), the formalization of a classification problem relies on a formal context K = (G, M, I), where G is a set of objects, M a set of attributes and I ⊆ G × M a binary relation describing links between objects and attributes. A formal concept then corresponds to a maximal set of objects (the extent) associated with a maximal set of attributes (the intent). Formal concepts are ordered within a complete lattice thanks to a subsumption relation based on extent inclusion. The standard FCA formalism can be extended to deal with complex data such as numbers, intervals, strings, and even graphs, within the so-called pattern structures [3]. In addition, a similarity between objects based on the closeness of attribute values can be considered and formalized as a tolerance relation, i.e. reflexive and symmetric [2].
In our presentation, we would like to emphasize the links existing between FCA variations and (hierarchical) clustering methods in data analysis. The framework of FCA offers many possibilities w.r.t. classification and data analysis, e.g. a powerful and diverse algorithmic machinery of FCA for dealing with large and complex data. Moreover, the joint use of pattern structures and similarities materializes a convergence between symbolic classification (e.g. FCA) and numerical classification methods.
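The notions of formal context, extent and intent can be illustrated on a tiny example. The sketch below enumerates all formal concepts of a context by brute force (closing every subset of objects), which is only practical for toy data; dedicated FCA algorithms are far more efficient. The context, the function names `intent`, `extent` and `formal_concepts`, and the dictionary representation of the relation I are assumptions made for this illustration.

```python
from itertools import combinations

def intent(objs, context, attributes):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set(attributes)
    for g in objs:
        shared &= context[g]
    return frozenset(shared)

def extent(attrs, context):
    """Objects that have every attribute in attrs."""
    return frozenset(g for g, has in context.items() if attrs <= has)

def formal_concepts(context):
    """Brute-force enumeration of all (extent, intent) pairs."""
    attributes = set().union(*context.values())
    objects = list(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            b = intent(subset, context, attributes)  # close the object set ...
            a = extent(b, context)                   # ... to a maximal extent
            concepts.add((a, b))
    return concepts

# Hypothetical context: objects mapped to the attributes they have.
context = {
    "duck": {"flies", "swims"},
    "eagle": {"flies", "hunts"},
    "shark": {"swims", "hunts"},
}
concepts = formal_concepts(context)

# Subsumption order of the concept lattice: compare extents by inclusion.
ordered = sorted(concepts, key=lambda c: len(c[0]))
```

For this context there are eight concepts, from the bottom concept with empty extent and all attributes up to the top concept with all objects and empty intent; the pair ({duck, eagle}, {flies}) is one of the intermediate ones.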
References
1. B. Ganter and R. Wille. Formal Concept Analysis. Springer, 1999.
2. M. Kaytoue, Z. Assaghir, A. Napoli, and S.O. Kuznetsov. Embedding Tolerance Relations in Formal Concept Analysis – An Application in Information Fusion. In Proceedings of CIKM, pages 1689–1692. ACM, 2010.
3. M. Kaytoue, S.O. Kuznetsov, and A. Napoli. Revisiting Numerical Pattern Mining with Formal Concept Analysis. In Proceedings of IJCAI, pages 1342–1347, 2011.
Keywords
FORMAL CONCEPT ANALYSIS, CLASSIFICATION, PATTERN STRUCTURES, SIMILARITY
LIRIS/INSA Lyon [email protected] · HSE [email protected] · LORIA (CNRS – Inria Nancy – U. de Lorraine) [email protected]
Which Movie Shall I Watch? Ultrametric Based Recommendation System
Pedro Contreras1, Fionn Murtagh1, and Javier Pereira2
Abstract
In previous work we have shown how an ultrametric (Murtagh et al., 2008; Pereira et al., 2010; Contreras and Murtagh, 2012) can be used to create hierarchical clusters in constant algorithmic time. In particular we make use of the Baire metric, or the longest common prefix, to construct our classification trees. When a technique to reduce the data dimensionality was needed, we sometimes opted to project the data randomly to one dimension (Murtagh et al., 2008).
Our aim in this work is to show how the Baire metric can be used to classify, match and retrieve categorical data. We demonstrate this by creating a movie recommendation system based on the Baire metric and using the MovieLens dataset (http://www.grouplens.org/node/73).
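The longest-common-prefix idea behind the Baire metric can be sketched in a few lines. This is an illustration only, not the authors' system: the base of the distance (2 here), the encoding of each movie as a short categorical code string, the movie titles and the function names are all assumptions made for the example.

```python
from collections import defaultdict

def common_prefix_len(a, b):
    """Length of the longest common prefix of two sequences."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def baire_distance(a, b, base=2.0):
    """Baire-style distance: base ** (-k), where k is the length of the
    longest common prefix; identical codes are at distance 0."""
    if a == b:
        return 0.0
    return base ** (-common_prefix_len(a, b))

def bucket_by_prefix(codes, k):
    """One pass over the data: items sharing their first k symbols
    fall into the same cluster (a ball of the ultrametric)."""
    buckets = defaultdict(list)
    for name, code in codes.items():
        buckets[code[:k]].append(name)
    return dict(buckets)

def recommend(name, codes, k):
    """Other items in the same prefix bucket as the given item."""
    bucket = bucket_by_prefix(codes, k)[codes[name][:k]]
    return [other for other in bucket if other != name]

# Hypothetical categorical codes, e.g. genre / decade / rating band.
codes = {
    "Movie A": "C9H",
    "Movie B": "C9L",
    "Movie C": "C8H",
    "Movie D": "D9H",
}
d_ab = baire_distance(codes["Movie A"], codes["Movie B"])
suggestions = recommend("Movie A", codes, k=2)
```

Because clusters are obtained by grouping on a prefix, no pairwise distance matrix is needed, and increasing `k` refines the buckets into a hierarchy of nested clusters. Note that the order of the symbols in the code matters: attributes placed earlier dominate the distance.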
References
CONTRERAS, P. and MURTAGH, F. (2012): Fast, Linear Time Hierarchical Clustering Using the Baire Metric. In: Journal of Classification, 29(2):118–143.
MURTAGH, F., DOWNS, G. and CONTRERAS, P. (2008): Hierarchical Clustering of Massive, High Dimensional Data Sets by Exploiting Ultrametric Embedding. In: SIAM Journal on Scientific Computing, 30(2):707–730.
PEREIRA, J., SCHMIDT, F., CONTRERAS, P., MURTAGH, F. and ASTUDILLO, H. (2010): Clustering and Semantics Preservation in Cultural Heritage Information Spaces. In: RIAO'2010, 9th International Conference on Adaptivity, Personalization and Fusion of Heterogeneous Information, 100–105. Paris, France.
Keywords
ULTRAMETRIC, BAIRE METRIC, CLUSTERING, RECOMMENDATION SYSTEMS, INFORMATION RETRIEVAL
Royal Holloway, University of London. Egham Hill, Egham. England. TW20 [email protected], [email protected] · Universidad Diego Portales. Avenida Ejército 441. Santiago, [email protected]
Automatic Annotation and Classification of new Papillomavirus genomes
Mohamed Amine Remita, Ahmed Halioui and Abdoulaye Baniré Diallo
Abstract
Papillomaviruses (PVs) are a group of viruses harboring a circular dsDNA and causing cutaneous and mucosal epithelial lesions in several vertebrate species. Over the last two decades, improvements in cloning and sequencing technologies have permitted the massive sequencing of new complete PV genomes (about 8 kb, conserved structure and complex set of protein-coding genes [1]). The annotation of these genomes and their classification within known PV-types [2] (genotyping) constitute an important asset for knowledge discovery in the mechanisms of disease diagnosis as well as the whole PV classification. However, public PV genotyping and annotation tools still lack accuracy (they are derived only from sequence similarity searches). Here, we propose a method that exploits both statistical and similarity-based methods to automatically annotate and genotype genomes. Our approach is composed of two main modules. The annotation module can detect protein-coding regions based on conservation patterns among aligned genomes and accurately identify complex features such as overlapping genes and ribosomal frameshifts. The PV genotyping module is derived from a supervised machine learning approach relying on a decision tree learned from multiple features such as the genome annotation data from the first module, statistical evidence (nucleotide frequencies in each codon position, GC content, etc.), physical and chemical characteristics of proteins, and other genome features like restriction fragment length polymorphism.
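Two of the statistical features mentioned above, GC content and nucleotide frequencies at each codon position, are simple to compute from a sequence. The sketch below is only an illustration of such feature extraction and is not the authors' pipeline; the function names and the example sequence are assumptions, and the input is assumed to be a non-empty coding sequence read in frame from its first base.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in a non-empty nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_position_frequencies(cds):
    """Relative frequency of A, C, G, T at codon positions 1, 2 and 3.

    Trailing bases that do not form a complete codon are ignored.
    Returns a list of three dictionaries, one per codon position.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    freqs = []
    for pos in range(3):
        counts = Counter(cds[pos:usable:3])
        total = sum(counts.values())
        freqs.append({nt: counts[nt] / total for nt in "ACGT"})
    return freqs

# Made-up toy sequence: three ATG codons.
toy_cds = "ATGATGATG"
features = {
    "gc": gc_content(toy_cds),
    "codon_pos": codon_position_frequencies(toy_cds),
}
```

Feature vectors of this kind, computed per gene or per genome, are the sort of input a decision tree learner could use alongside annotation-derived features.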
References
1. Zheng, Z. M. and Baker, C. C. (2006): Papillomavirus genome structure, expression, and post-transcriptional regulation. Front. Biosci., 11, 2286–2302.
2. Bernard, H.U., Burk, R.D., Chen, Z., van Doorslaer, K., Hausen, H.Z., and de Villiers, E.M. (2010): Classification of papillomaviruses (PVs) based on 189 PV types and proposal of taxonomic amendments. Virology 401, 70–79.
Keywords
BIOINFORMATICS, PAPILLOMAVIRUS, ANNOTATION, CLASSIFICATION, KNOWLEDGE DISCOVERY, MACHINE LEARNING
Department of Computer Science, Université du Québec à Montréal,P.O. Box 8888 Downtown Station, Montreal, Quebec, H3C 3P8, [email protected]
Different Approaches To Modeling Family Data In GWAS: Application To Cannabis Use
Camelia C. Minica1, Conor V. Dolan1,2, Jouke-Jan Hottenga1, Dorret I. Boomsma1 and Jacqueline M. Vink1
Abstract
Power in genome-wide association studies (GWAS) of complex traits has gained much importance lately, given that the causal genes are commonly assumed to have small effects, requiring large samples for detection. Despite their potential to increase power, the cohorts followed longitudinally in twin registries remain largely unexploited, as the incorporation of many genetic variants in complex models that explicitly account for kinship among individuals poses computational challenges. Hence, one strategy is to limit the association analysis to unrelated individuals. Given the availability of family data, it is of interest to determine which analytic strategy is most efficient in the context of GWAS, where power and computational tractability are both important. We compared the performance of three approaches: (a) analysis limited to unrelated individuals, versus analysis of family data (b) by using a robust estimator (Huber, 1967), or (c) by employing a mixed-effects approach (Guo and Wang, 2002). We evaluated these approaches by considering features of samples typically collected in twin registries: a large number of small clusters (i.e., clusters of sibs, or of sibs and monozygotic and dizygotic twins, with or without parents) that vary in size and may show a wide range of phenotypic correlations (from .1 to .7). In addition, we expect the individuals within a cluster to display sex, age and generation effects. The performance of the three approaches was assessed first in simulated data. Next, the most efficient strategy was applied in a GWAS in which we used genotypes and lifetime cannabis use data collected in 2619 families with up to 4 siblings from the Adult Netherlands Twin Register.
References
GUO, G. and WANG, J. (2002): The mixed or multilevel model for behavior genetics analysis. Behavior Genetics, 32, 37–49.
HUBER, P.J. (1967): The behaviour of maximum likelihood estimates under non-standard conditions. The 5th Berkeley Symp on Math Stat and Prob, I, 221–233.
Keywords
POWER, ROBUST ESTIMATOR, MIXED MODEL
Vrije Universiteit Amsterdam, Department of Biological Psychology, Van der Boechorststraat 1, 1081 [email protected] · Universiteit van Amsterdam, Department of Psychology, Weesperplein 4, 1018 XA
Utilization Of Machine-Learning Methodologies In Order To Understand Complex Evolutionary And Functional Links Among Bacterial Genomes
Olivier Poiron1 and Benedicte Lafay2
Abstract
We are searching for evolutionary trends among genome maintenance-related genes present on the replicon sets (i.e., chromosomes and plasmids) of bacterial genomes. Traditional bioinformatic and phylogenetic methods are not adapted to large-scale and high-dimensional studies. We thus developed a semi-supervised analytical pipeline relying on data-mining methodologies. Generic unsupervised (SOM, K-means, SUBCLU, Bayesian networks) and supervised (SVM, decision trees) classification methods were combined with specific bioinformatic algorithms based on sequence homology search (BLAST). Through this approach, important evolutionary processes could be characterized among genome-integrated plasmids and chromosomes. We here report on the inherent difficulties (input data bias, high-dimensional analysis, noise) and the applied methodology, and conclude on the significance of the data-mining methodology in knowledge discovery.
Keywords
COMPARATIVE GENOMICS, HOMOLOGY SEARCH, CLASSIFICATION, ANALYTICAL PIPELINE
Laboratoire AMPERE Ecole Centrale de Lyon, [email protected] · Laboratoire AMPERE Ecole Centrale de Lyon, [email protected]
Application of a Bayesian Artificial Neural Network to the Breast Cancer Survival Data
Masoud Salehi1 and Mahmood Reza Gohari2
Abstract
Artificial neural networks (ANNs) were first developed in the 1940s to imitate the function of the brain. They have become more popular as prediction tools in different fields over the last two decades because of the development of new techniques and increases in computational power. ANNs are mathematical models that contain a number of processing units called nodes, which accomplish limited and simple computations. Moreover, ANNs are considered nonparametric and distribution-free models, which can be used for prediction and treated as linear or nonlinear regression models. Multi-layer perceptrons (MLPs) are the most popular and widely used of the different types of ANN, which are distinguished by structure and type of operation. A Bayesian framework for training and selecting the complexity of ANNs, based on Markov chain Monte Carlo (MCMC) techniques, has the benefit of ensuring that uncertainty in ANNs is reflected in the posterior information. Real breast cancer data were used to illustrate the application of Bayesian Artificial Neural Networks.
References
Bishop, C.M. (2006): Pattern Recognition and Machine Learning. Springer, New York.
Mackay, D.J.C. (1995): Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469-505.
McCulloch, W.S. and Pitts, W. (1943): A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.
Keywords
BAYESIAN ARTIFICIAL NEURAL NETWORKS, MCMC, BREAST CANCER
Tehran University of Medical Sciences, [email protected] · Tehran University of Medical Sciences, [email protected]
Achieving Near-perfect Classification for Functional Data
Peter Hall (and Aurore Delaigle)1
Abstract
It can be shown that, in supervised classification problems involving functional data, asymptotically perfect classification is possible, making use of the intrinsic very high dimensional nature of functional data. This performance is often achieved by linear methods, which are optimal in important cases. The results point to a marked difference between classification for functional data and its counterpart in conventional multivariate analysis, where dimension is kept fixed as sample size diverges. In the latter setting, linear methods can sometimes be quite inefficient, and there are no prospects for asymptotically perfect classification, except in pathological cases where, for example, a variance vanishes. By way of contrast, in finite samples of functional data, good performance can be achieved by truncated versions of linear methods. Truncation can be implemented by partial least-squares or projection onto a finite number of principal components, using, in both cases, cross-validation to determine the truncation point.
Department of Mathematics and Statistics, The University of [email protected]