London Knowledge Lab - Clustering for Large-scale Social Networks

Collaborative media and social networks

Clustering for Large-scale Social Networks

Friday, 20 August 2010

Full time PhD student

Thidawan Klaysri

Email: This e-mail address is being protected from spam bots, you need JavaScript enabled to view it

Supervised by

Dr. Giovanna Di Marzo Serugendo

Prof. Mark Levene

PhD Research

Funded by

RMUTP & Science Ministry of Thailand

Start date: 20 April 2009

End date:19 April 2012

Keywords

Social Networks, Clustering, Graph Mining, Local Clustering, Bipartite graph

Scope and aims of the Research

The Wikipedia Category System combines the notions of a thesaurus and a folksonomy i.e. a collaborative tagging system. Categories in Wikipedia are created collaboratively by users. However, the number of categories and the hierarchical relationships between them have been increasing rapidly. In fact, it appears that the number of categories has grown more rapidly than the number of articles! We are looking at how to reduce the number of categories by clustering categories according to their similarity. To this end, we are applying a range of clustering techniques for large-scale social networks.

The first stage:

We are using the English Wikipedia Category dataset in our research. We have been applying several clustering algorithms based on shortest paths between categories connected by common pages as a measurement of their proximity.

Figure 1: a sample from English Wikipedia category network

Figure 1 shows a small sample of the English Wikipedia category network. It is an undirected graph containing 39 categories, article pages and the links between them. We see that it is a bipartite graph between categories and pages.

This multimode graph was transformed into a single-mode graph representing connections among categories having at least a page in common. Figure 2 shows this single mode graph.

The Geodesic matrix, as shown in Figure 3, was computed automatically based on the shortest path length, and it shows the distances between pairs of categories.

Figure 2: an undirected graph of Wikipedia category network

Figure 3: geodesic matrix of the network

Using the distances from the Geodesic matrix, we applied k-Means clustering to group the categories, and this resulted in two clusters as shown in Figure 4.

Figure 4: K-Means clustering

Next stage:

We will next apply this algorithm to the whole English Wikipedia Category network, and we will also consider alternative notions of distances. Then, we will apply our clustering algorithm to different social network datasets; for example, other datasets in Wikipedia and other social network sites such as Facebook, Yahoo or other networks related to Wikipedia.