CS: A Repository of Things

Asado!


When I was at university, I went to Argentina for a few weeks. I saw and ate a lot of things while I was there, but one of my favourite things was something I was able to bring back with me. The asado is a special Argentinian barbecue cooked over hot embers, and it's quite easy to replicate at home, provided you have a plentiful supply of firewood, meat and salt.

Click permalink to read more...

Coursera

I love Coursera! Here is a selection of courses I've taken, each with a brief (personal) outline:

Introduction to Data Science - This was the first course I took on Coursera, and it was a real eye-opener into the world of free online courses. It went from traditional databases to NoSQL, Hadoop on AWS, sentiment analysis and more, and it really sparked my interest.

Social Network Analysis - This course provided a great introduction to some of the more interesting data mining you can do on graph data, from measuring influence with algorithms like betweenness centrality and PageRank to calculating the robustness of networks.

Machine Learning - A pivotal course by Andrew Ng, one of the founders of Coursera. It has students building a lot of machine learning models from first principles in MATLAB, to truly understand the mechanics behind them.

R Programming - I took this course to formalise my otherwise very 'organic' learning of R. A solid foundations course, it made me a lot more confident manipulating data in R.

Programming Mobile Applications for Android Handheld Systems - Although I was comfortable in object-oriented programming languages like Java, I struggled to find the time to learn the basics of app development. Quite simply, a few weeks on this course took me from being unable to build apps to putting together my first (drinking game) apps with relative ease.

It's also worth noting that while I really enjoy Coursera and other sites like Cloudera, edX and Udacity, I really dislike the term MOOC, which sounds stupid.

SQL to JSON (SQL2JSON)

I tend to make SQL Server the workhorse of most projects. Obviously it's a powerful tool, but it also works well as a central data store, on the basis that any tool worth its salt (e.g. Python, R, Octave, QlikView, Tableau) has a mature ODBC connector that allows me to connect in, do some work, and deliver the results back to SQL Server with little bother.

However, since making use of more bespoke visualisation tools like D3 and Sigma.js, which tend to use JSON as their data format, I've found myself awkwardly crafting JSON with a lot of string manipulation and 'FOR XML PATH' tricks in SQL.
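To give a flavour of the sort of trick I mean, here is a minimal sketch that builds a JSON array by string concatenation (the dbo.People table and its columns are hypothetical, and no escaping of special characters is handled):

-- Concatenate rows into a JSON array like [{"id":1,"name":"Neil"},...]
-- FOR XML PATH('') does the concatenation; STUFF strips the leading comma.
SELECT '[' + STUFF((
    SELECT ',{"id":' + CAST(p.ID AS VARCHAR(10))
         + ',"name":"' + p.Name + '"}'
    FROM dbo.People p
    FOR XML PATH(''), TYPE
).value('.', 'NVARCHAR(MAX)'), 1, 1, '') + ']' AS [Json];

It works, but it's fragile and tedious to write for anything non-trivial, which is exactly the problem.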

Recently I decided to fix this once and for all by knocking up a quick SQL-to-JSON converter in Python.

Click permalink to read more...

Advanced de-duplication framework in SQL

A simple technical framework in SQL for de-duplicating data.

Introduction

When de-duplicating data, we could consider just doing a group-by on a number of fields, looking for groups into which two or more rows fall, and calling those rows duplicates. While this is computationally efficient, it is a little blunt: it will only capture duplicates where the grouped fields are identical for both records.
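As a sketch, using the Individuals table from the worked example further down:

-- The blunt approach: any group containing more than one row is a set of duplicates.
SELECT Firstname, Surname, DOB, COUNT(*) AS GroupSize
FROM dbo.Individuals
GROUP BY Firstname, Surname, DOB
HAVING COUNT(*) > 1;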

We can improve this situation a little using the concept of 'match keys'. This is where we create new fields in the data which are modified or combined versions of existing fields. For instance, if we wanted to match individuals on first name, but deal with the fact that the same name can be spelt differently, we could create a new field which is the Soundex of the first name. Soundex is a very old algorithm for converting a word into a four-character code which represents its sound. You can read about the Soundex algorithm online, but for now it's enough to know that Neal and Neil will both resolve to the Soundex code 'N400'. In this way, you can still use the group-by method to look for duplicates in data, but cater for the cases where the underlying values aren't exactly the same between rows.
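SQL Server happens to have SOUNDEX built in, so the match key can even live on the table as a computed column - a minimal sketch, again assuming the dbo.Individuals table from the example below:

-- Both of these return 'N400'
SELECT SOUNDEX('Neal') AS NealCode, SOUNDEX('Neil') AS NeilCode;

-- Persist the match key as a computed column
ALTER TABLE dbo.Individuals ADD SoundexFirst AS SOUNDEX(Firstname);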

However, some more advanced tests exist which simply cannot be done using a group-by method. For example, consider the simple test: 'How many individual letter edits would it take to get from this piece of text to that piece of text?' (Implementations of this test are known as 'edit distance' or 'Levenshtein distance'.) If I want to use this test when looking for duplicate individuals, say by applying it to first names, then I simply cannot use a group-by method - I need to consider individual pairs of records.
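SQL Server has no built-in edit-distance function, so one option is a scalar UDF. The following is a minimal, unoptimised sketch (dbo.EditDistance is my own name for it - for real volumes you would want something faster, such as a CLR function):

-- Straightforward dynamic-programming Levenshtein distance.
-- d(i, j) = edits needed to turn the first i chars of @a into the first j chars of @b.
CREATE FUNCTION dbo.EditDistance (@a NVARCHAR(50), @b NVARCHAR(50))
RETURNS INT
AS
BEGIN
    DECLARE @la INT = LEN(@a), @lb INT = LEN(@b);
    IF @la = 0 RETURN @lb;
    IF @lb = 0 RETURN @la;

    DECLARE @d TABLE (i INT, j INT, cost INT, PRIMARY KEY (i, j));
    DECLARE @i INT = 0, @j INT, @del INT, @ins INT, @sub INT, @best INT;

    WHILE @i <= @la  -- base case: delete all i characters of @a
    BEGIN
        INSERT @d VALUES (@i, 0, @i);
        SET @i += 1;
    END
    SET @j = 1;
    WHILE @j <= @lb  -- base case: insert all j characters of @b
    BEGIN
        INSERT @d VALUES (0, @j, @j);
        SET @j += 1;
    END

    SET @i = 1;
    WHILE @i <= @la
    BEGIN
        SET @j = 1;
        WHILE @j <= @lb
        BEGIN
            SET @del = (SELECT cost FROM @d WHERE i = @i - 1 AND j = @j) + 1;
            SET @ins = (SELECT cost FROM @d WHERE i = @i AND j = @j - 1) + 1;
            SET @sub = (SELECT cost FROM @d WHERE i = @i - 1 AND j = @j - 1)
                     + CASE WHEN SUBSTRING(@a, @i, 1) = SUBSTRING(@b, @j, 1) THEN 0 ELSE 1 END;
            SET @best = @del;
            IF @ins < @best SET @best = @ins;
            IF @sub < @best SET @best = @sub;
            INSERT @d VALUES (@i, @j, @best);
            SET @j += 1;
        END
        SET @i += 1;
    END

    RETURN (SELECT cost FROM @d WHERE i = @la AND j = @lb);
END

So SELECT dbo.EditDistance('Neal', 'Neil') returns 1 - a single substitution.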

So, on the other end of the scale, you could consider comparing every row of data against every other row (in SQL I can do this by self-joining a table of individuals to itself) and running a selection of tests on every single pair of records. If we did this, we could do tests equivalent to the group-by method (e.g. does A.Surname = B.Surname?), but also more advanced pairwise tests like 'what is the edit distance between row A's surname and row B's surname?'

So it seems like the self-join method is better. However, if we actually did this, and we had N records to start with, the resulting output would have roughly (N^2)/2 rows - a million-record table, for instance, would produce around half a trillion pairs - most of which would be pairs of records which very definitely aren't duplicates. Consequently we would have wasted a lot of time and disk space comparing every record, when we only want to compare some sets of records together.

Instead, this framework uses a mix of these two methods. We use a group-by to create 'windows' of records to compare in a pair-wise fashion. Then we put the resulting output into a results table, and do it again, using a different group-by window.

We can do this as many times as we like, once for each 'scenario' of duplicates we are considering. Finally, when we have finished, we can turn our attention to our big set of results, and the process of deciding which of those potential pairs are actually duplicates, based on the results of each of the pairwise tests.

In this way, this framework separates the computationally expensive process of 'testing' the records from the interesting process of working out which pairs are actually duplicates. (This also leaves us free to apply some interesting machine learning to the second problem...)

Example

Let's imagine we have data about individuals with the following fields: Firstname, Surname, DOB and Address, and we are looking for duplicates. Let's also imagine that we have these four duplicate records for Neal Simpson in our data (Address is omitted from the tables below):

ID  Firstname  Surname  DOB
1   Neil       Simpson  08/06/1986
2   Neil       Simpson  08/06/1986
3   Neal       Simpson  08/06/1986
4   Neil       Simpson  09/06/1986

1. The basic method we could take is to group by Firstname, Surname and DOB. 

This would correctly identify rows 1 and 2 as duplicates, but would miss 3 and 4.

2. Now we could try adding some match keys to the data, for instance, we could add a Soundex(Firstname) field:

ID  Firstname  Surname  DOB         SoundexFirst
1   Neil       Simpson  08/06/1986  N400
2   Neil       Simpson  08/06/1986  N400
3   Neal       Simpson  08/06/1986  N400
4   Neil       Simpson  09/06/1986  N400

Now we could group by 'SoundexFirst', Surname and DOB and the system would correctly identify 1, 2, and 3 as duplicates - but miss 4.

3. We could try to also catch row 4 in our group-by approach by grouping by the 'week' of the DOB rather than the date itself, on the basis that if two dates are close, they are likely to be in the same week. However, in this case the 8th is a Sunday and the 9th is a Monday, so even though the dates are only one day apart, we would miss the match, since they fall in different weeks. In order to capture the fact that the dates are one day apart, we must consider pairwise matching.

So now consider doing a pairwise match on these rows and putting the results into a potential-match table, only considering pairs where SoundexFirst and Surname match:

INSERT INTO dbo.MatchResults
SELECT
    A.ID AS [ID1],
    B.ID AS [ID2],
    CASE WHEN A.Firstname = B.Firstname THEN 1 ELSE 0 END AS [Firstname_Match],
    1 AS [SoundexFirst_Match],  -- always 1: the join below requires a match
    1 AS [Surname_Match],       -- always 1, for the same reason
    CASE WHEN A.DOB = B.DOB THEN 1 ELSE 0 END AS [DOB_Match],
    DATEDIFF(DAY, A.DOB, B.DOB) AS [DOB_DifferenceInDays]
FROM Individuals A
INNER JOIN Individuals B
    ON A.ID < B.ID              -- each pair appears once, and rows don't pair with themselves
    AND A.SoundexFirst = B.SoundexFirst
    AND A.Surname = B.Surname
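(For reference, dbo.MatchResults might look something like this - a sketch, with column types assumed:)

CREATE TABLE dbo.MatchResults (
    ID1 INT NOT NULL,            -- the lower of the two record IDs
    ID2 INT NOT NULL,            -- the higher of the two record IDs
    Firstname_Match BIT NOT NULL,
    SoundexFirst_Match BIT NOT NULL,
    Surname_Match BIT NOT NULL,
    DOB_Match BIT NOT NULL,
    DOB_DifferenceInDays INT NOT NULL
);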

This would make a results table something like this:

ID1  ID2  Firstname_Match  SoundexFirst_Match  Surname_Match  DOB_Match  DOB_DifferenceInDays
1    2    1                1                   1              1          0
1    3    0                1                   1              1          0
1    4    1                1                   1              0          1
2    3    0                1                   1              1          0
2    4    1                1                   1              0          1
3    4    0                1                   1              0          1

4. In a real situation we would now go on to consider a number of other scenarios, each with its own group-by window and its own pairwise tests.
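For instance, the window above would miss duplicates whose surnames differ, so a second pass might window on exact DOB instead. A sketch of what appending that scenario to the same results table could look like (in practice you would also de-duplicate the results table itself, since windows overlap):

-- Scenario 2: window on exact DOB, and test the name fields pairwise instead
INSERT INTO dbo.MatchResults
SELECT
    A.ID,
    B.ID,
    CASE WHEN A.Firstname = B.Firstname THEN 1 ELSE 0 END,
    CASE WHEN A.SoundexFirst = B.SoundexFirst THEN 1 ELSE 0 END,
    CASE WHEN A.Surname = B.Surname THEN 1 ELSE 0 END,
    1,  -- DOB_Match is always 1 in this window
    0   -- so the difference in days is always 0
FROM Individuals A
INNER JOIN Individuals B
    ON A.ID < B.ID
    AND A.DOB = B.DOB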

 

To be continued...

K-means in Practice

I recently made use of k-means (in R) to cluster some data. While the actual implementation of the model was fairly straightforward, I was at first a bit stumped as to how to decide which variables to include, and how to understand the resulting clusters.

Here are a few tricks I picked up.

Click permalink to read more...
