I recently made use of k-means (in R) to cluster some data. While the actual implementation of the model was fairly straightforward, I was at first a bit stumped as to how to decide which variables to include, and how to interpret my resulting clusters.

Here are a few tricks I picked up.

Analysing your input variables

The first part of this is deciding which variables you think actually matter to your clusters. For instance, I was using demographic data to look at customers, so I selected fields that I thought would influence customer metrics. In my case I had age groups (at a very granular level: 0-4, 5-7, 8-9, etc.), ethnicity, household types (married families with children, couples, retired, students, etc.) and a socio-economic status classification.

However, by the time I had included every field that seemed potentially relevant, I had more than 20, which felt like overkill. The next step helps to alleviate this, but first I'll explain why it's worth doing.

For the k-means model, each point is assigned to a cluster based on the Euclidean distance between it and each of the k cluster centres. This Euclidean distance is just sqrt((x1-k1)^2 + (x2-k2)^2 + (x3-k3)^2 + ... + (xn-kn)^2), where x1, x2, ..., xn are each of your input variables, and k1, ..., kn are the corresponding values of the cluster centre to which you are measuring the distance.
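In R terms, that distance for a single observation is tiny to write out; here's an illustrative sketch:

    # Euclidean distance between an observation x and a cluster centre k,
    # both numeric vectors of the same length
    euclidean <- function(x, k) sqrt(sum((x - k)^2))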

This is why it is important to scale your input variables before running the model. Otherwise, if some of your input variables vary between observations on a scale of thousands while others vary between just 0 and 1, the results of the model will be dominated by the larger-scale dimensions. (Luckily, in R, we can just run scale(myData) to turn each variable into a z-score, i.e. (x-mu)/sigma, where myData is your dataset.)
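As a minimal sketch of what that does, with a made-up two-column data frame:

    # Two variables on wildly different scales (made-up data)
    myData <- data.frame(income = c(25000, 40000, 87000),
                         ratio  = c(0.2, 0.5, 0.9))

    # scale() centres each column on its mean and divides by its standard
    # deviation, returning z-scores: (x - mu) / sigma
    scaledData <- scale(myData)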

However, the story doesn't end there. Imagine we have all of our nicely scaled input variables, but we accidentally duplicated one of them, so it appears in the dataset twice. Clearly, the results are going to be skewed towards that variable, as the Euclidean distance will count that term twice. Now imagine that, instead of duplicating a variable, we have two genuinely different variables that are very closely correlated. For instance, the number of individuals between the ages of 2-3 could be very strongly correlated with the number between 3-4. Although these are different variables, because they are correlated they reinforce each other in every observation, which has the effect of over-weighting young children. In that case, we could consider throwing one of the variables out of the analysis, or combining them both into a single 2-4 variable (which can then be scaled as before).
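As a sketch of the combining option (the column names here are hypothetical):

    # Hypothetical columns: counts of individuals aged 2-3 and aged 3-4
    myData$age2to4 <- myData$age2to3 + myData$age3to4
    myData$age2to3 <- NULL   # drop the originals so the same information
    myData$age3to4 <- NULL   # isn't counted twice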

How do we find all the places where this happens? Well, we take all the variables in our dataset and correlate each one with every other variable, then we look at each correlation score and consider updating our input variables. Sounds laborious, but in R we can do it with one command, cor, which correlates every column with every other (including itself). Then we use another command to visualise the output as a heatmap, and look for any very red or white squares. Where we see these, we may wish to combine or drop variables.
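Something like the following would do it, reusing scaledData from earlier (this uses base R's heatmap; a package like corrplot would give prettier output):

    corMat <- cor(scaledData)   # correlates every column with every other
    # Rowv/Colv = NA turn off the dendrogram reordering, so the variables
    # stay in their original order along both axes
    heatmap(corMat, Rowv = NA, Colv = NA, symm = TRUE)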

[Image: correlation heatmap of the input variables]

Using this approach, I combined or dropped about half of my input variables, knowing that I wasn't losing real 'dimensions' of the data, just controlling what goes in so that one or two features couldn't dominate the rest.

Analysing your clusters

OK, so having run the model (and used a number of well-documented approaches, such as plotting the within-cluster SSE, to find the optimum number of clusters), I got a set of roughly equal-sized clusters. What do they mean?
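For reference, a minimal sketch of that SSE ('elbow') check in R, reusing scaledData from earlier:

    # Total within-cluster SSE for k = 1..10; look for the 'elbow' where
    # adding more clusters stops paying off
    set.seed(42)   # kmeans uses random starting centres
    wss <- sapply(1:10, function(k)
      kmeans(scaledData, centers = k, nstart = 25)$tot.withinss)
    plot(1:10, wss, type = "b",
         xlab = "Number of clusters k", ylab = "Within-cluster SSE")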

This is where I picked up another good visual method for looking at your results, which works something like this.

You have the complete set of data, and you have also classified each observation into one of your clusters. Pick a set of key descriptors for your data. In my example, this was my original set of input variables (before all the scaling, combining and dropping).

Now calculate the aggregate (proportional) values of these descriptors for the complete dataset, and again for each of the clusters (a sketch of how to compute these in R follows the table below). Then, in Excel, put each of the descriptors down the left, and each of the clusters across the columns, like so:

                    All Data   Cluster 1   Cluster 2   Cluster 3
    Age 0-2         2%         10%         0.1%        0.3%
    Age 3-4         ...        ...         ...         ...
    Age 5-7         ...        ...         ...         ...
    Married w/kids  ...        ...         ...         ...
    Students        ...        ...         ...         ...
    Retired         ...        ...         ...         ...
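Here is the sketch promised above of how those aggregates could be computed in R; myData and fit are hypothetical names for the raw descriptor columns and the object returned by kmeans:

    # 'All Data' column: mean of each descriptor across every observation
    overall <- colMeans(myData)
    # One row per cluster: mean of each descriptor within that cluster
    byCluster <- aggregate(myData, by = list(cluster = fit$cluster), FUN = mean)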

Finally, in Excel, make another set of columns to the right of the first set, one for each of the clusters. The value in each of these cells should equal the corresponding value on the left, divided by the value in the 'All Data' column. Then use conditional formatting on these new columns to 'heatmap' colour-code the cells based on their value.

              All Data   Cluster 1   Cluster 2   Cluster 3      Cluster 1   Cluster 2   Cluster 3
    Age 0-2   2%         10%         0.1%        0.3%           5           0.05        0.15
    Age 3-4   ...        ...         ...         ...            ...         ...         ...
    Age 5-7   ...        ...         ...         ...            ...         ...         ...
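If you'd rather compute those ratio columns in R than in Excel, one line does it (reusing the hypothetical overall and byCluster from the sketch above):

    # Divide each cluster's descriptor means by the population means,
    # column by column; values far from 1 are the interesting ones
    ratios <- sweep(as.matrix(byCluster[, -1]), 2, overall, FUN = "/")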

Visually review the results!

Once you have a thorough list of descriptors down the sheet, each colour-coded against the population mean, understanding and naming your clusters becomes easy (and fun). Where before you saw lots and lots of similar numbers, you now see big blotches of red, yellow and green grouped by cluster, and the meaning behind each cluster tends to jump out at you.