Social Networks,
Clustering, Graph Mining, Local Clustering, Bipartite graph
Scope and aims of the Research
The Wikipedia
Category System combines the notions of a thesaurus and a folksonomy i.e. a
collaborative tagging system. Categories in Wikipedia are created
collaboratively by users. However, the number of categories and the
hierarchical relationships between them have been increasing rapidly. In fact,
it appears that the number of categories has grown more rapidly than the number
of articles! We are looking at how to reduce the number of categories by
clustering categories according to their similarity. To this end, we are applying
a range of clustering techniques for large-scale social networks.
The first stage:
We are using
the English Wikipedia Category dataset in our research. We have been applying
several clustering algorithms based on shortest paths between categories
connected by common pages as a measurement of their proximity.
Figure 1: a sample from English Wikipedia category network
Figure 1
shows a small sample of the English Wikipedia category network. It is an undirected
graph containing 39 categories, article pages and the links between them. We
see that it is a bipartite graph between categories and pages.
This multimode graph was transformed into a
single-mode graph representing connections among categories having at least a
page in common. Figure 2 shows this single mode graph.
The Geodesic matrix, as shown in Figure 3, was
computed automatically based on the shortest path length, and it shows the
distances between pairs of categories.
Figure 2: an undirected graph of Wikipedia category network
Figure 3: geodesic matrix of the network
Using the distances from the Geodesic matrix,
we applied k-Means clustering to group the categories, and this resulted in two
clusters as shown in Figure 4.
Figure 4: K-Means clustering
Next stage:
We will next apply this algorithm to the whole English Wikipedia
Category network, and we will also consider alternative notions of distances. Then,
we will apply our clustering algorithm to different social network datasets;
for example, other datasets in Wikipedia and other social network sites such as
Facebook, Yahoo or other networks related to Wikipedia.