Visualizing Large Collections of URLs Using the Hilbert Curve
Publication type: Conference Paper
Published in: International Cross-Domain Conference for Machine Learning and Knowledge Extraction
Publisher: Springer, Cham
Search engines like Google provide an aggregation mechanism for the web and constitute the main access point to the Internet for a large part of the population. For this reason, biases and personalization schemes in search results may have significant societal implications that require scientific inquiry and monitoring. This work is dedicated to visualizing the data such inquiry produces, as well as to understanding changes and developments over time in that data. We argue that this data is closely akin to text corpora but possesses some distinct characteristics that require novel visualization methods. The key differences between URLs and other textual data are their lack of internal cohesion, their relatively short lengths, and—most importantly—their semi-structured nature, which is attributable to their standardized constituents (protocol, top-level domain, country domain, etc.). We present a technique to spatially represent such data while retaining comparability over time: a corpus of URLs in alphabetical order is evenly distributed onto the so-called Hilbert curve, a space-filling curve that can be used to map a one-dimensional space into higher dimensions. Rank and other associated metadata can then be mapped to other visualization primitives. We demonstrate the viability of this technique by applying it to a data set of Google search result lists. The data retains much of its spatial structure (i.e., the closeness between similar URLs), and the spatial stability of the Hilbert curve enables comparisons over time. To make our technique accessible, we provide an R package compatible with the ggplot2 package.
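The core idea—placing alphabetically sorted URLs at evenly spaced positions along a Hilbert curve—can be sketched with the standard iterative index-to-coordinate conversion. This is a minimal Python illustration, not the authors' R implementation; the function name `d2xy`, the curve order, and the sample URLs are assumptions for demonstration.

```python
def d2xy(order, d):
    """Convert a distance d along a Hilbert curve filling a 2^order x 2^order
    grid into (x, y) cell coordinates (standard iterative algorithm)."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # reflect the quadrant before rotating
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x  # rotate
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def place_urls(urls, order=4):
    """Sort URLs alphabetically and distribute them evenly along the curve,
    so lexicographically close URLs land in nearby cells."""
    n_cells = (1 << order) ** 2
    urls = sorted(urls)
    n = len(urls)
    positions = {}
    for i, url in enumerate(urls):
        d = round(i * (n_cells - 1) / max(n - 1, 1))  # even spacing
        positions[url] = d2xy(order, d)
    return positions

# Hypothetical sample corpus; (x, y) positions could feed a tile/point plot.
sample = ["https://example.org/a", "https://example.org/b",
          "https://example.com/", "https://example.net/"]
coords = place_urls(sample)
```

Because consecutive Hilbert indices are always adjacent grid cells, alphabetical neighbors stay spatially close, and because a URL's position depends only on its rank in the sorted corpus, the layout remains comparable across snapshots in time.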