As part of a recent project, I discovered the Sankey diagram implementation in D3. As it happened this graph perfectly fulfilled one of project objectives, which was the visualise the flow of candidates through a system. Here's an example:

Click permalink to read more...

The input of data into the diagram (by JSON) is very similar to that for a network graph, where each rectangle is a node, and each of the flows between them is an edge. However, there is one key distinction; in a Sankey diagram, the nodes are ordered. That means, in the diagram above, we can have an edge that goes from A->B, but if you include an edge from B->A, you will get an error. Likewise, if you include an edge between two 'nodes' on the same level, like C->D, you will get an error.

Clearly, before fitting your data into a diagram like this, you need to make sure it makes sense. Is there a direction to your data, so that nodes have an explicit order in which they flow? And if you chose to put two nodes on the same 'level', are they mutually exclusive (i.e. you can't have a flow splitting and going into both of them.)

If your data meets these criteria, then this code is for you. This snippet works on the assumption that you have data for each of your underlying observations (or in my case candidates) and information about whether or not each candidate passed through a particular stage.

Prior to this step, I have created a table exampleData with a row for each of my candidates, and a 1/0 column for each stage I want in my Sankey diagram.

ID A B C D E
1 1 0 1 0 1
2 1 1 1 0 0
3 1 0 0 1 1
4 1 1 0 1 0

Now, I know from my business understanding of the data that all candidates start in A, then some move through B, and then all candidates move through either C or D, and finally some will flow to E. So I know I want a Sankey diagram with A on the left, and E on the right, with nodes C and D on the same level. Therefore I create the following table exampleNodes, where the 'Level' column fixes the position of the nodes:

NodeID Name ColumnName Level
1 A A 1
2 B B 2
3 C C 3
4 D D 3
5 E E 4

Now I am ready to run my code which will create all of my nodes and edge data:

drop table outputNodes
drop table outputEdges
create table outputNodes (NodeID int, Name varchar(20), Size int)
create table outputEdges ([Source] int, [Target] int, Size int) 

declare @csr as cursor
declare @csr2 as cursor
declare @sql as varchar(max)
declare @Name as varchar(20)
declare @Name2 as varchar(20)
declare @ColumnName as varchar(20)
declare @Level as int
declare @NodeID as int
declare @NodeID2 as int
declare @ColumnName2 as varchar(20)
declare @Level2 as int
declare @nots varchar(1000)

declare csr cursor for select row_number() over (order by level, nodeid)-1 as [NodeID], Name, ColumnName, Level from exampleNodes order by NodeID
open csr 
fetch next from csr into @NodeID, @Name, @ColumnName, @Level
WHILE @@FETCH_STATUS = 0
BEGIN
	print @Name + '1'
	select @nots =
	(select ' AND [' + ColumnName + ']=0' as [text()] 
	from exampleNodes
	where Level = @Level 
	and Name <> @Name
	for xml path (''))

	print @nots + '2'
	set @sql = 'select ' + cast(@NodeID as varchar(10)) + ', ''' + @Name + ''', count(1) from exampleData where [' + @ColumnName + '] = 1 ' + isnull(@nots,'')
	print @sql
	insert into outputNodes
	exec(@sql)

	declare csr2 cursor for select row_number() over (order by level, nodeid)-1 as [NodeID], Name, ColumnName, Level from exampleNodes where Level < @Level
	open csr2
	fetch next from csr2 into @NodeID2, @Name2, @ColumnName2, @Level2
	WHILE @@FETCH_STATUS = 0
	BEGIN

		select @nots =
		(select ' AND [' + ColumnName + ']=0' as [text()] 
		from exampleNodes
		where Level between @Level2 and @Level
		and Name NOT IN (@Name, @Name2)
		for xml path (''))

		set @sql = 'select ' + cast(@NodeID2 as varchar(10)) + ', ' + Cast(@NodeID as varchar(10)) + ', count(1) from exampleData where [' + @ColumnName + '] = 1 AND [' + @ColumnName2 + '] = 1 ' + isnull(@nots,'')
		print '     ' + @sql
		insert into outputEdges
		exec(@sql)

	fetch next from csr2 into @NodeID2, @Name2, @ColumnName2, @Level2
	end
	close csr2
	deallocate csr2

fetch next from csr into @NodeID, @Name, @ColumnName, @Level
end
close csr
deallocate csr

select * from outputNodes
select * from outputEdges

Finally, I can use SQL2JSON to convert the data from SQL into a JSON ready for D3.