Nowadays, extracting information from collected data is an important task in research and industry. Cluster analysis is one way to gain such information. The task is to partition the data set into several clusters such that similar data points are likely to be contained in the same cluster. To evaluate how well this property is fulfilled, one can choose from a variety of objectives, with k-center, k-median, and k-means among the most popular. The number of allowed clusters has to be chosen carefully and often depends on the specific application. However, there are situations in which it is not desirable to compute only a single clustering. A hierarchical clustering is a collection of clusterings of different sizes such that, for every pair of clusterings, one is a refinement of the other. Phylogenetic trees, for example, exhibit such a hierarchical structure.
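The refinement property can be made concrete with a small sketch (not part of the text; all names are ours, and clusterings are modeled as lists of frozensets of data points):

```python
def refines(finer, coarser):
    """True if every cluster of `finer` is contained in some cluster of `coarser`."""
    return all(any(c <= d for d in coarser) for c in finer)

def is_hierarchical(clusterings):
    """Check that for every pair of clusterings one is a refinement of the other."""
    ordered = sorted(clusterings, key=len, reverse=True)  # most clusters first
    return all(refines(ordered[i], ordered[j])
               for i in range(len(ordered)) for j in range(i + 1, len(ordered)))
```

For instance, the clusterings {1}{2}{3}, {1,2}{3}, {1,2,3} form a hierarchy, while {1,2}{3} and {1}{2,3} are not hierarchically compatible, since {1,2} lies in no cluster of the second clustering.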
One main aspect of our research is the analysis of popular greedy heuristics for computing such hierarchical clusterings. Agglomerative methods start with the finest possible clustering and iteratively merge pairs of clusters until only one cluster remains. Among these, linkage algorithms are given a linkage function that assigns a cost to every pair of clusters, and in each step greedily merge the pair with the smallest cost. The cost of every emerging clustering is then compared to the cost of an optimal clustering of the same size, which yields the approximation guarantee. Although linkage algorithms are popular in practice, there are only a few theoretical results on the approximation guarantees of such algorithms. In our work we analyze two popular linkage algorithms, complete linkage and Ward's method.
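As an illustration, here is a minimal sketch of one such linkage algorithm, complete linkage, where the linkage cost of two clusters is taken as the maximum distance between a point of one and a point of the other. This is our own naive formulation under the stated assumptions (an arbitrary caller-supplied metric `dist`), not the implementation analyzed in the work:

```python
from itertools import combinations

def complete_linkage(points, dist):
    """Agglomerative clustering with complete linkage: repeatedly merge the pair
    of clusters with the smallest linkage cost (the maximum distance between a
    point of one cluster and a point of the other). Returns the resulting
    hierarchy as a list of clusterings, from finest to coarsest."""
    clusters = [frozenset({p}) for p in points]
    hierarchy = [list(clusters)]
    while len(clusters) > 1:
        # pick the pair of clusters whose complete-linkage cost is smallest
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: max(dist(p, q) for p in pair[0] for q in pair[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        hierarchy.append(list(clusters))
    return hierarchy
```

On the points 0, 1, 10, 11 with the absolute difference as metric, the sketch first merges the two close pairs and only then joins them, producing clusterings of sizes 4, 3, 2, and 1 that refine one another by construction.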
Optimal clusterings of different sizes are generally not hierarchically compatible, so a hierarchical clustering with approximation factor one does not exist in general. We are interested in which approximation factors a hierarchical clustering can and cannot achieve.
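This incompatibility already shows up for four points on a line under the k-means objective. The following brute-force sketch (all names ours; not from the text) finds optimal 2- and 3-clusterings of {0, 2, 3, 5} and checks that the latter does not refine the former:

```python
from itertools import product

def kmeans_cost(clustering):
    """Sum of squared distances of 1-D points to their cluster mean."""
    total = 0.0
    for c in clustering:
        mu = sum(c) / len(c)
        total += sum((x - mu) ** 2 for x in c)
    return total

def optimal_clustering(points, k):
    """Brute force over all assignments of points to k labels."""
    best, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) != k:  # every label must be used
            continue
        clustering = [frozenset(p for p, l in zip(points, labels) if l == i)
                      for i in range(k)]
        cost = kmeans_cost(clustering)
        if cost < best_cost:
            best, best_cost = clustering, cost
    return best

def refines(finer, coarser):
    return all(any(c <= d for d in coarser) for c in finer)

points = [0, 2, 3, 5]
opt2 = optimal_clustering(points, 2)  # {0,2} {3,5}
opt3 = optimal_clustering(points, 3)  # {0} {2,3} {5}
```

Here the optimal 3-clustering separates 2 and 3 from the rest, while the optimal 2-clustering splits them apart, so no hierarchy can contain both.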