COMBO-17 Galaxy Dataset Colin Holden COSC 4335 April 17, 2012


COMBO-17 Galaxy Dataset

Colin Holden, COSC 4335

April 17, 2012


• Contains data on 3,462 objects classified as galaxies in the Chandra Deep Field South, a patch of sky in the constellation Fornax.

• The dataset has 65 columns, ranging from luminosities in 10 different bands of the spectrum to size and brightness. However, the website notes that the vast majority of these attributes are redundant and not independent.

• Focusing on three main attributes of this dataset:

– Rmag, the total R (red band) magnitude, is a measure of the brightness of the galaxy. Magnitudes are on an inverted logarithmic scale, so a galaxy with R=21 is 100 times brighter than one with R=26.

– ApDRmag is the difference between the total and aperture magnitudes in the R band. This is a rough measure of the size of the galaxy.

– rsMAG is the magnitude of the vector coming from the galaxy, used here as a rough measure of distance.
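
The R=21 versus R=26 comparison follows from the standard magnitude scale, on which a difference of 5 magnitudes corresponds to a factor of 100 in brightness. A minimal sketch of that arithmetic (the function name `brightness_ratio` is made up for illustration):

```python
def brightness_ratio(mag_bright: float, mag_faint: float) -> float:
    """Brightness ratio implied by two magnitudes.

    Magnitudes are inverted and logarithmic: each step of 5 magnitudes
    is a factor of 100, so the ratio is 10 ** (0.4 * difference).
    """
    return 10 ** (0.4 * (mag_faint - mag_bright))


# The slide's example: R=21 compared with R=26 (a 5-magnitude gap).
print(brightness_ratio(21.0, 26.0))
```

A one-magnitude gap works out to a factor of roughly 2.5, which is why small differences in Rmag still matter when comparing galaxies.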


| Nr | Rmag | ApDRmag | rsMAG | e.rsMAG | UbMAG | e.UbMAG | BbMAG | e.BbMAG | VnMAG | e.VbMAG | S280MAG | e.S280MA | W420FE | e.W420FE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 24.995 | 0.935 | -17.97 | 0.25 | -17.76 | 0.14 | -17.53 | 0.25 | -17.76 | 0.25 | -18.22 | 0.17 | 6.60E-04 | 3.85E-03 |
| 9 | 25.013 | -0.135 | -18.43 | 0.55 | -18.36 | 0.22 | -17.85 | 0.55 | -18.19 | 0.55 | -17.97 | 0.54 | 3.24E-04 | 3.19E-03 |
| 16 | 24.246 | 0.821 | -20.71 | 0.14 | -19.82 | 0.14 | -19.89 | 0.14 | -20.4 | 0.14 | -19.77 | 0.12 | 1.30E-02 | 4.11E-03 |
| 21 | 25.203 | 0.639 | -17.89 | 0.31 | -17.92 | 0.17 | -17.38 | 0.31 | -17.67 | 0.31 | -18.12 | 0.28 | 1.19E-02 | 2.70E-03 |
| 26 | 25.504 | -1.588 | -19.88 | 0.83 | -17.76 | 0.42 | -18.35 | 0.83 | -19.37 | 0.83 | -13.93 | 45.11 | 1.35E-03 | 3.71E-03 |
| 29 | 23.74 | -1.636 | -19.05 | 1.37 | -19.3 | 0.16 | -18.08 | 1.37 | -18.69 | 1.37 | -19.18 | 0.41 | 3.24E-03 | 3.02E-03 |
| 45 | 25.706 | 0.199 | -16.39 | 1.94 | -17.19 | 0.3 | -16.05 | 1.94 | -16.22 | 1.94 | -17.81 | 0.39 | 8.98E-03 | 2.88E-03 |
| 49 | 25.139 | -0.31 | -17.32 | 1.81 | -16.95 | 0.44 | -16.46 | 1.81 | -17.01 | 1.81 | -14.37 | 14.19 | 4.36E-03 | 4.26E-03 |
| 50 | 24.699 | 0.268 | -18.1 | 0.32 | -17.76 | 0.15 | -17.66 | 0.15 | -17.86 | 0.32 | -17.86 | 0.23 | 1.44E-02 | 3.84E-03 |
| 51 | 24.849 | 0.399 | -11.6 | 0.14 | -11.73 | 0.16 | -11.13 | 0.19 | -11.31 | 0.14 | -12.22 | 0.21 | 2.00E-02 | 2.83E-03 |
| 60 | 25.309 | 0.03 | -17.93 | 0.64 | -17.68 | 0.23 | -16.96 | 0.64 | -17.58 | 0.64 | -17.75 | 0.49 | 4.52E-03 | 3.27E-03 |
| 62 | 24.091 | 0.098 | -14.68 | 0.12 | -13.84 | 0.17 | -13.97 | 0.15 | -14.41 | 0.12 | -13.85 | 0.38 | 4.75E-03 | 3.55E-03 |
| 64 | 25.219 | 0.316 | -18.97 | 0.28 | -18.66 | 0.28 | -18.48 | 0.28 | -18.75 | 0.28 | -18.74 | 0.19 | 7.46E-03 | 3.26E-03 |
| 66 | 26.269 | 0.672 | -12.19 | 0.31 | -11.03 | 0.6 | -10.28 | 1.08 | -11.81 | 0.31 | -10.16 | 2.93 | 1.45E-03 | 5.15E-03 |
| 71 | 23.596 | -0.084 | -19.57 | 0.13 | -17.82 | 0.11 | -18.18 | 0.11 | -19.11 | 0.13 | -17.32 | 0.25 | 2.75E-03 | 3.17E-03 |
| 72 | 23.204 | -0.026 | -20.51 | 0.15 | -20.6 | 0.15 | -20.13 | 0.15 | -20.29 | 0.15 | -20.96 | 0.11 | 4.87E-02 | 3.31E-03 |
| 75 | 25.161 | -0.028 | -16.47 | 2.16 | -17.97 | 0.28 | -16.12 | 2.16 | -16.3 | 2.16 | -18.66 | 0.34 | 7.01E-03 | 3.68E-03 |
| 83 | 22.884 | -0.097 | -19.93 | 0.14 | -19.91 | 0.1 | -19.41 | 0.14 | -19.63 | 0.14 | -20.24 | 0.11 | 4.31E-02 | 3.77E-03 |
| 84 | 24.346 | -0.046 | -18.81 | 0.22 | -18.21 | 0.14 | -18.1 | 0.14 | -18.58 | 0.22 | -18.31 | 0.2 | 1.18E-02 | 3.39E-03 |
| 87 | 25.453 | 0.159 | -18.28 | 0.42 | -17.86 | 0.23 | -17.79 | 0.42 | -18.06 | 0.42 | -18.42 | 0.34 | 3.58E-03 | 2.93E-03 |
| 88 | 25.911 | 0.787 | -18.43 | 0.35 | -17.41 | 0.22 | -17.56 | 0.35 | -18.11 | 0.35 | -17.63 | 0.37 | 7.98E-03 | 4.69E-03 |
| 89 | 26.004 | 0.662 | -17.53 | 0.66 | -17.47 | 0.24 | -16.96 | 0.66 | -17.29 | 0.66 | -16.98 | 0.59 | 5.15E-03 | 2.65E-03 |
| 91 | 26.803 | 0.471 | -19.68 | 0.47 | -16.74 | 0.56 | -18.13 | 0.47 | -19.17 | 0.47 |  |  | 7.66E-04 | 3.34E-03 |
| 95 | 25.204 | -1.157 | -20.13 | 0.53 | -18.12 | 0.38 | -18.63 | 0.53 | -19.63 | 0.53 | -18.01 | 1.19 | 4.54E-04 | 2.87E-03 |
| 97 | 25.357 | 0.484 | -17.84 | 0.38 | -17.73 | 0.2 | -17.47 | 0.38 | -17.67 | 0.38 | -18.29 | 0.25 | 1.26E-02 | 4.32E-03 |
| 105 | 24.117 | 0.066 | -16.75 | 0.42 | -16.74 | 0.22 | -16.59 | 0.12 | -16.83 | 0.12 | -17.08 | 0.22 | 1.80E-02 | 4.11E-03 |
| 107 | 26.108 | 0.807 | -15.83 | 1.58 | -16.35 | 0.31 | -16.37 | 0.31 | -15.54 | 1.58 | -16.64 | 0.41 | 8.76E-03 | 3.39E-03 |
| 108 | 24.909 | -0.012 | -18.14 | 0.33 | -17.17 | 0.18 | -17.19 | 0.18 | -17.85 | 0.33 | -17.06 | 0.42 | 2.93E-03 | 3.03E-03 |
| 117 | 24.474 | -0.1 | -19.43 | 0.26 | -19.04 | 0.15 | -18.94 | 0.26 | -19.22 | 0.26 | -19.26 | 0.24 | 1.44E-02 | 3.19E-03 |


[Figure: scatter plot "Brightness vs. Size". x-axis: Total R (red band) Magnitude (Measure of Brightness), ticks 15 to 27; y-axis: ApDRmag (Relative size of Galaxy), ticks -6 to 2.]


• At first glance, the data appeared to have some sort of linear relationship, so the first step was to compute the Pearson correlation coefficient to test for one.

• The Pearson correlation coefficient was calculated to be about 0.6789.

• Significance tests on the Pearson correlation coefficient assume the data are normally distributed, which may not be the case here, but this was just a first step and the data seem to have a moderately linear relationship.

• The brightness of the galaxy seems to decrease as the size grows (a larger R magnitude means a fainter galaxy).
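
For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it with the standard library only; the function name `pearson` is illustrative, and the five sample pairs are taken from the first rows of the data table, which is far too few points to reproduce the value reported for the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# First five (Rmag, ApDRmag) pairs from the sample table, for illustration.
rmag = [24.995, 25.013, 24.246, 25.203, 25.504]
apdrmag = [0.935, -0.135, 0.821, 0.639, -1.588]
print(round(pearson(rmag, apdrmag), 4))
```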


[Figure: scatter plot "Brightness vs. Size". x-axis: Total R (red band) Magnitude (Measure of Brightness), ticks 15 to 27; y-axis: ApDRmag (Relative size of Galaxy), ticks -6 to 2.]


K Means Clustering

• An attempt to break the data set into smaller data sets.

• The number of clusters was chosen to be 5.

• The number of iterations had to be capped to decide when to stop trying to improve the centroid of each cluster.

• The initial centroids were chosen to be the first 5 records.
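
The settings above (a fixed K, an iteration cap, and the first K records as starting centroids) can be sketched with a plain implementation of Lloyd's algorithm. This is an illustration under those assumptions, not the tool used for the slides; the function name `kmeans`, the default `max_iter=100`, and the toy points are all made up here.

```python
import math


def kmeans(points, k=5, max_iter=100):
    """K-Means (Lloyd's algorithm) seeded with the first k points.

    Returns (labels, centroids). Stops when the centroids no longer
    move or after max_iter iterations, whichever comes first.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged before hitting the iteration cap
        centroids = new_centroids
    return labels, centroids


# Toy 2-D points (hypothetical values, not from the dataset).
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (9.5, 10.1), (0.1, 0.4), (10.2, 9.8)]
labels, centroids = kmeans(points, k=2, max_iter=20)
print(labels)
```

Seeding with the first records makes the result depend on the order of the rows, so reordering the data can change the clusters.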


[Figure: scatter plot "K Means Cluster". x-axis: Total R (red band) Magnitude (Measure of Brightness), ticks 15 to 29; y-axis: ApDRmag (Relative size of Galaxy), ticks -5 to 2.]


Hierarchical Clustering

• Chose to stop at 5 clusters to allow comparison with the K-Means results.

• Proximity was measured using Euclidean distance.

• Used Ward’s method to determine cluster similarity when merging clusters.

• Computationally expensive.
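
Ward's method merges, at each step, the pair of clusters whose union adds the least to the total within-cluster sum of squared Euclidean distances. For clusters of sizes n_a and n_b with centroids c_a and c_b, that increase is n_a·n_b/(n_a+n_b) times the squared distance between the centroids. A naive sketch follows (the function name `ward_clustering` and the toy points are assumptions for illustration):

```python
def ward_clustering(points, n_clusters=5):
    """Naive agglomerative clustering using Ward's merge criterion.

    Starts with one cluster per point and repeatedly merges the pair
    with the smallest increase in within-cluster sum of squares until
    n_clusters remain. Returns a list of lists of point indices.
    """
    # Each cluster is (member indices, centroid).
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ia, ca), (ib, cb) = clusters[a], clusters[b]
                na, nb = len(ia), len(ib)
                sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * sq_dist  # Ward merge cost
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ia, ca), (ib, cb) = clusters[a], clusters[b]
        na, nb = len(ia), len(ib)
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters[a] = (ia + ib, merged)
        del clusters[b]  # b > a, so index a is unaffected
    return [members for members, _ in clusters]


# Toy 2-D points (hypothetical values, not from the dataset).
toy = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
result = ward_clustering(toy, n_clusters=3)
print(sorted(sorted(c) for c in result))
```

Every merge rescans all remaining pairs, so the work grows roughly with the cube of the number of points, which is one way to see why the method is expensive on a few thousand objects.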


[Figure: scatter plot "Hierarchical Clustering". x-axis: Total R (red band) Magnitude (Measure of Brightness), ticks 15 to 29; y-axis: ApDRmag (Relative size of Galaxy), ticks -5 to 2.]


K Means with 3 Variables

• Wanted to see what results clustering on 3 variables would yield.

• Same parameters as the previous K-Means runs.

• Chose brightness, size, and distance from Earth as the 3 variables.

• Difficult to present graphically.


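| Observation | Class | Distance to centroid |
|---|---|---|
| Obs1 | 1 | 0.864 |
| Obs2 | 1 | 0.525 |
| Obs3 | 2 | 1.648 |
| Obs4 | 1 | 0.551 |
| Obs5 | 2 | 2.159 |
| Obs6 | 2 | 1.466 |
| Obs7 | 1 | 1.754 |
| Obs8 | 1 | 0.916 |
| Obs9 | 1 | 0.627 |
| Obs10 | 3 | 2.021 |
| Obs11 | 1 | 0.202 |
| Obs12 | 3 | 1.244 |
| Obs13 | 1 | 0.894 |
| Obs14 | 3 | 2.247 |
| Obs15 | 2 | 0.303 |
| Obs16 | 2 | 1.193 |
| Obs17 | 1 | 1.644 |
| Obs18 | 2 | 1.051 |
| Obs19 | 2 | 0.901 |
| Obs20 | 1 | 0.230 |
| Obs21 | 1 | 0.941 |
| Obs22 | 1 | 1.036 |
| Obs23 | 1 | 2.198 |
| Obs24 | 2 | 1.741 |
| Obs25 | 1 | 0.431 |
| Obs26 | 4 | 1.311 |
| Obs27 | 1 | 2.494 |
| Obs28 | 1 | 0.434 |
| Obs29 | 2 | 0.657 |
| Obs30 | 2 | 1.319 |
| Obs31 | 4 | 1.676 |
| Obs32 | 4 | 1.355 |
| Obs33 | 5 | 1.052 |
| Obs34 | 2 | 0.682 |
| Obs35 | 3 | 1.295 |
| Obs36 | 1 | 1.597 |
| Obs37 | 1 | 0.580 |
| Obs38 | 2 | 0.143 |
| Obs39 | 1 | 0.115 |
| Obs40 | 3 | 0.649 |
| Obs41 | 1 | 0.861 |
| Obs42 | 3 | 3.572 |
| Obs43 | 2 | 0.895 |
| Obs44 | 2 | 0.967 |
| Obs45 | 2 | 1.277 |
| Obs46 | 2 | 2.152 |
| Obs47 | 4 | 1.040 |
| Obs48 | 4 | 0.238 |
| Obs49 | 4 | 2.020 |
| Obs50 | 1 | 0.956 |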
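
One way to read the "Class" and "Distance to centroid" columns is as the index of the nearest centroid and the Euclidean distance to it in the three-variable space. The sketch below shows that calculation; the centroid coordinates are hypothetical placeholders, the function name `nearest_centroid` is made up, and whether the slide's distances were computed on raw or rescaled values is not stated, so the numbers here are not expected to match the table.

```python
import math


def nearest_centroid(point, centroids):
    """Return (1-based class, Euclidean distance) for the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best + 1, distances[best]


# Hypothetical centroids in (Rmag, ApDRmag, rsMAG) space, for illustration only.
example_centroids = [(25.0, 0.5, -18.0), (24.0, 0.0, -20.0)]

# First row of the sample data table: Rmag, ApDRmag, rsMAG.
obs = (24.995, 0.935, -17.97)
cls, dist = nearest_centroid(obs, example_centroids)
print(cls, round(dist, 3))
```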


Conclusions

• Got to see how outliers affect the clustering algorithms, comparing AHC (agglomerative hierarchical clustering) with K-Means. K-Means was more sensitive to outliers.

• Also got to see how versatile cluster analysis can be, with many options to choose from, e.g. the value of K and the number of attributes to compare.

– The many options can also be a downfall of clustering, in that one small change can yield very different results.


Afterthoughts

• I would have run another K-Means clustering analysis after removing the outliers from my original data, to see how the clusters and their centroids differ.

• I would have experimented with different values of K and compared the results.