COMBO-17 Galaxy Dataset
Colin Holden
COSC 4335
April 17, 2012
• Contains data on 3,462 objects classified as galaxies in the Chandra Deep Field South, a patch of sky in the constellation Fornax.
• There are 65 columns of data in this dataset, ranging from luminosities in 10 different bands of the spectrum to size and brightness. However, the dataset's website notes that the vast majority of these attributes are redundant and not independent.
• Focusing on three main attributes of this dataset:
– Total R (red band) magnitude is a measure of the brightness of the galaxy. Magnitudes are inverted logarithmic measurements, so a galaxy with R=21 is 100 times brighter than one with R=26.
– ApDRmag is the difference between the total and aperture magnitudes in the R band. This is a rough measure of the size of the galaxy.
– rsMAG is the absolute magnitude of the galaxy in the rest-frame r band. This analysis uses it as a rough stand-in for distance.
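The R=21 versus R=26 comparison follows from the standard magnitude relation, in which a difference of 5 magnitudes corresponds to a flux ratio of 100. A minimal sketch of that arithmetic (the function name `brightness_ratio` is made up for illustration and is not part of the dataset or its tooling):

```python
def brightness_ratio(m_bright: float, m_faint: float) -> float:
    """Return how many times brighter the first object is than the second.

    Magnitudes are inverted and logarithmic: each step of 5 magnitudes
    is a factor of 100 in flux, so the ratio is 10 ** (0.4 * difference).
    """
    return 10 ** (0.4 * (m_faint - m_bright))


# Example from the slide: R=21 compared with R=26 (a 5-magnitude gap).
print(brightness_ratio(21.0, 26.0))  # approximately 100
```

Swapping the arguments gives the reciprocal, i.e. how much fainter the R=26 galaxy is.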
Nr Rmag ApDRmag rsMAG e.rsMAG UbMAG e.UbMAG BbMAG e.BbMAG VnMAG e.VbMAG S280MAG e.S280MA W420FE e.W420FE
6 24.995 0.935 -17.97 0.25 -17.76 0.14 -17.53 0.25 -17.76 0.25 -18.22 0.17 6.60E-04 3.85E-03
9 25.013 -0.135 -18.43 0.55 -18.36 0.22 -17.85 0.55 -18.19 0.55 -17.97 0.54 3.24E-04 3.19E-03
16 24.246 0.821 -20.71 0.14 -19.82 0.14 -19.89 0.14 -20.4 0.14 -19.77 0.12 1.30E-02 4.11E-03
21 25.203 0.639 -17.89 0.31 -17.92 0.17 -17.38 0.31 -17.67 0.31 -18.12 0.28 1.19E-02 2.70E-03
26 25.504 -1.588 -19.88 0.83 -17.76 0.42 -18.35 0.83 -19.37 0.83 -13.93 45.11 1.35E-03 3.71E-03
29 23.74 -1.636 -19.05 1.37 -19.3 0.16 -18.08 1.37 -18.69 1.37 -19.18 0.41 3.24E-03 3.02E-03
45 25.706 0.199 -16.39 1.94 -17.19 0.3 -16.05 1.94 -16.22 1.94 -17.81 0.39 8.98E-03 2.88E-03
49 25.139 -0.31 -17.32 1.81 -16.95 0.44 -16.46 1.81 -17.01 1.81 -14.37 14.19 4.36E-03 4.26E-03
50 24.699 0.268 -18.1 0.32 -17.76 0.15 -17.66 0.15 -17.86 0.32 -17.86 0.23 1.44E-02 3.84E-03
51 24.849 0.399 -11.6 0.14 -11.73 0.16 -11.13 0.19 -11.31 0.14 -12.22 0.21 2.00E-02 2.83E-03
60 25.309 0.03 -17.93 0.64 -17.68 0.23 -16.96 0.64 -17.58 0.64 -17.75 0.49 4.52E-03 3.27E-03
62 24.091 0.098 -14.68 0.12 -13.84 0.17 -13.97 0.15 -14.41 0.12 -13.85 0.38 4.75E-03 3.55E-03
64 25.219 0.316 -18.97 0.28 -18.66 0.28 -18.48 0.28 -18.75 0.28 -18.74 0.19 7.46E-03 3.26E-03
66 26.269 0.672 -12.19 0.31 -11.03 0.6 -10.28 1.08 -11.81 0.31 -10.16 2.93 1.45E-03 5.15E-03
71 23.596 -0.084 -19.57 0.13 -17.82 0.11 -18.18 0.11 -19.11 0.13 -17.32 0.25 2.75E-03 3.17E-03
72 23.204 -0.026 -20.51 0.15 -20.6 0.15 -20.13 0.15 -20.29 0.15 -20.96 0.11 4.87E-02 3.31E-03
75 25.161 -0.028 -16.47 2.16 -17.97 0.28 -16.12 2.16 -16.3 2.16 -18.66 0.34 7.01E-03 3.68E-03
83 22.884 -0.097 -19.93 0.14 -19.91 0.1 -19.41 0.14 -19.63 0.14 -20.24 0.11 4.31E-02 3.77E-03
84 24.346 -0.046 -18.81 0.22 -18.21 0.14 -18.1 0.14 -18.58 0.22 -18.31 0.2 1.18E-02 3.39E-03
87 25.453 0.159 -18.28 0.42 -17.86 0.23 -17.79 0.42 -18.06 0.42 -18.42 0.34 3.58E-03 2.93E-03
88 25.911 0.787 -18.43 0.35 -17.41 0.22 -17.56 0.35 -18.11 0.35 -17.63 0.37 7.98E-03 4.69E-03
89 26.004 0.662 -17.53 0.66 -17.47 0.24 -16.96 0.66 -17.29 0.66 -16.98 0.59 5.15E-03 2.65E-03
91 26.803 0.471 -19.68 0.47 -16.74 0.56 -18.13 0.47 -19.17 0.47 NA NA 7.66E-04 3.34E-03
95 25.204 -1.157 -20.13 0.53 -18.12 0.38 -18.63 0.53 -19.63 0.53 -18.01 1.19 4.54E-04 2.87E-03
97 25.357 0.484 -17.84 0.38 -17.73 0.2 -17.47 0.38 -17.67 0.38 -18.29 0.25 1.26E-02 4.32E-03
105 24.117 0.066 -16.75 0.42 -16.74 0.22 -16.59 0.12 -16.83 0.12 -17.08 0.22 1.80E-02 4.11E-03
107 26.108 0.807 -15.83 1.58 -16.35 0.31 -16.37 0.31 -15.54 1.58 -16.64 0.41 8.76E-03 3.39E-03
108 24.909 -0.012 -18.14 0.33 -17.17 0.18 -17.19 0.18 -17.85 0.33 -17.06 0.42 2.93E-03 3.03E-03
117 24.474 -0.1 -19.43 0.26 -19.04 0.15 -18.94 0.26 -19.22 0.26 -19.26 0.24 1.44E-02 3.19E-03
[Figure: Brightness vs. Size. x-axis: Total R (red band) Magnitude (Measure of Brightness), 15 to 27; y-axis: ApDRmag (Relative size of Galaxy), -6 to 2]
• At first glance, the data appeared to have some sort of linear relationship, so the Pearson correlation coefficient was computed first to test for one.
• The Pearson correlation coefficient came out to about 0.6789.
• Significance tests on the Pearson correlation coefficient assume the data are normally distributed, which may not be the case, but this was just a first step, and the data do seem to have a somewhat linear relationship.
• The brightness of the galaxy seems to decrease as the size grows.
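For reference, the Pearson coefficient can be computed directly from its definition. The sketch below is illustrative only: the helper name `pearson_r` is an assumption, and the six (Rmag, ApDRmag) pairs are just the first rows of the sample table, far too few to reproduce the value reported for the full 3,462 objects.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First six rows of the sample table (objects 6, 9, 16, 21, 26, 29).
rmag = [24.995, 25.013, 24.246, 25.203, 25.504, 23.74]
apdrmag = [0.935, -0.135, 0.821, 0.639, -1.588, -1.636]
print(round(pearson_r(rmag, apdrmag), 4))
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.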
K Means Clustering
• Attempt to break the data set into smaller data sets.
• The number of clusters was chosen to be 5.
• The number of iterations had to be capped to decide when to stop trying to improve the centroid of each cluster.
• Initial centroids were chosen to be the first 5 records.
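The setup described above (a fixed K, an iteration cap, and the first K records as starting centroids) can be sketched in plain Python as follows. This is a generic version of the usual assign-then-update loop, not the tool actually used for the slides; the function names, the default `max_iter=100`, and the toy points are all assumptions made for illustration.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """K-means using the first k points as the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):  # iteration cap, as described in the slide
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid is the per-coordinate mean of the members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return assign(points, centroids), centroids


# Toy (made-up) 2-D points standing in for (Rmag, ApDRmag) pairs.
toy = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centroids = kmeans(toy, k=2)
print(labels, centroids)
```

Because the starting centroids are simply the first K records, reordering the input can change the result, which is one reason different runs of K-means may disagree.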
[Figure: K Means Cluster. x-axis: Total R (red band) Magnitude (Measure of Brightness), 15 to 29; y-axis: ApDRmag (Relative size of Galaxy), -5 to 2]
Hierarchical Clustering
• Chose to stop at 5 clusters to have comparison with the K-Means results.
• Proximity measured using Euclidean distance.
• Used Ward’s Method to determine cluster similarity when merging clusters.
• Computationally expensive.
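To illustrate both Ward's criterion and why the method is expensive, here is a deliberately naive sketch. It assumes the standard Ward merge cost, the increase in within-cluster sum of squared Euclidean distances, and it rescans every pair of clusters at each step. The function name and the toy points are made up; this is not the implementation used for the slides.

```python
import math


def ward_clustering(points, n_clusters):
    """Agglomerative clustering with Ward's criterion (naive version).

    Each cluster is stored as (member indices, centroid). At every step
    the pair whose merge adds the least within-cluster sum of squares is
    merged, until n_clusters remain.
    """
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ia, ca), (ib, cb) = clusters[a], clusters[b]
                na, nb = len(ia), len(ib)
                # Ward cost of merging clusters a and b.
                cost = na * nb / (na + nb) * math.dist(ca, cb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ia, ca), (ib, cb) = clusters[a], clusters[b]
        na, nb = len(ia), len(ib)
        # Centroid of the merged cluster is the size-weighted mean.
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ia + ib, centroid))
    labels = [0] * len(points)
    for label, (members, _) in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy (made-up) 2-D points forming three obvious pairs.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (20.0, 1.0)]
print(ward_clustering(toy, n_clusters=3))
```

Scanning all pairs at every merge makes this version roughly cubic in the number of objects, which is the kind of cost the last bullet refers to; library implementations use more efficient update schemes.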
[Figure: Hierarchical Clustering. x-axis: Total R (red band) Magnitude (Measure of Brightness), 15 to 29; y-axis: ApDRmag (Relative size of Galaxy), -5 to 2]
K Means with 3 Variables
• Wanted to see what kind of results clustering on 3 variables would yield.
• Same parameters as for the previous K-Means run.
• Chose Brightness, Size, and Distance from Earth as the 3 Variables.
• Difficult to present graphically.
Observation Class Distance to centroid
Obs1 1 0.864
Obs2 1 0.525
Obs3 2 1.648
Obs4 1 0.551
Obs5 2 2.159
Obs6 2 1.466
Obs7 1 1.754
Obs8 1 0.916
Obs9 1 0.627
Obs10 3 2.021
Obs11 1 0.202
Obs12 3 1.244
Obs13 1 0.894
Obs14 3 2.247
Obs15 2 0.303
Obs16 2 1.193
Obs17 1 1.644
Obs18 2 1.051
Obs19 2 0.901
Obs20 1 0.230
Obs21 1 0.941
Obs22 1 1.036
Obs23 1 2.198
Obs24 2 1.741
Obs25 1 0.431
Obs26 4 1.311
Obs27 1 2.494
Obs28 1 0.434
Obs29 2 0.657
Obs30 2 1.319
Obs31 4 1.676
Obs32 4 1.355
Obs33 5 1.052
Obs34 2 0.682
Obs35 3 1.295
Obs36 1 1.597
Obs37 1 0.580
Obs38 2 0.143
Obs39 1 0.115
Obs40 3 0.649
Obs41 1 0.861
Obs42 3 3.572
Obs43 2 0.895
Obs44 2 0.967
Obs45 2 1.277
Obs46 2 2.152
Obs47 4 1.040
Obs48 4 0.238
Obs49 4 2.020
Obs50 1 0.956
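A "Distance to centroid" column like the one above can be produced once each observation has a class and each class has a centroid. The sketch below assumes plain Euclidean distance in the 3-variable space, which the slides do not state explicitly; the function name, the points, and the centroids are made-up values for illustration, not the actual cluster centres.

```python
import math


def distances_to_centroids(points, classes, centroids):
    """Euclidean distance from each point to the centroid of its class.

    `centroids` maps a class label (1-based, as in the table) to a
    coordinate tuple with the same number of dimensions as the points.
    """
    return [math.dist(p, centroids[c]) for p, c in zip(points, classes)]


# Hypothetical 3-variable observations and centroids (illustrative only).
points = [(1.0, 2.0, 2.0), (4.0, 0.0, 3.0), (0.5, 0.0, 0.0)]
classes = [1, 2, 1]
centroids = {1: (0.0, 0.0, 0.0), 2: (4.0, 0.0, 0.0)}

for i, d in enumerate(distances_to_centroids(points, classes, centroids), start=1):
    print(f"Obs{i} {classes[i - 1]} {d:.3f}")
```

Observations with unusually large distances are natural candidates to inspect as outliers.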
Conclusions
• Got to see how outliers affect the clustering algorithms differently for AHC vs. K-Means. K-Means was more sensitive to outliers.
• Also got to see how versatile cluster analysis can be, with lots of different options, e.g., the value of K, the number of attributes to compare, etc.
– Having so many options can also be a downfall of clustering, in that one small change can yield very different results.
Afterthoughts
• I would have done another K-Means clustering analysis after removing the outliers from my original data to see how the clusters and their centroids differ.
• I would have experimented with different values of K and looked at the results.