A traditional approach to the investigation of unexplained phenomena is to infer laws from the patterns (information) in collected data. Presumably, the more information available with regard to a phenomenon, the better for the analysis, synthesis and modeling of patterns in data. However, with today's data acquisition technology, our ability to acquire data grows faster than our ability to process and analyze it. Specifically, high-dimensional and massive data sets are potentially a significant barrier to the investigation.
The understanding of patterns in high-dimensional and massive data sets is proving to be a key ingredient to knowledge discovery. For example, snapshots of clouds and sky give rise to a family of images with a common characteristic while at the same time exhibiting significant variations across the images. Likewise, a set of digital images of faces provides another example of an image family. While members of a family clearly possess distinct features, it is sensible to pose the question of whether an image, or set of unlabelled images, belongs to a certain family of images.
The analysis of patterns in data has typically been a subject in statistics and engineering. Recently, however, fundamental mathematical theory in areas such as differential geometry and topology has provided a new mathematical framework and insights for understanding large data sets residing in spaces of large ambient dimensions. The research at the Pattern Analysis Laboratory (PAL), led by Professor Michael Kirby, emphasizes the transition of mathematical theory to efficient algorithms for exploring, understanding and modeling massive data sets.
For example, one of the approaches developed is geometric in nature, and its main tool is the dimensionality-reducing mapping. Specifically, each member of a family of patterns shares features common to the other members of the family. This correlation between images is a footprint of low dimensionality, that is, a sign that a simpler representation exists. The question then becomes how the data sit geometrically in the ambient space and how this coherence can be exploited for data reduction. The main techniques involved include optimal orthogonal expansions, Fourier analysis, radial basis functions, neural networks and wavelets.
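As an illustration of how correlation within a family of patterns points to a simpler representation, the following is a minimal sketch of one dimensionality-reducing mapping. It assumes that "optimal orthogonal expansion" can be illustrated by a truncated singular value decomposition (a principal component basis); this is our choice of example, not necessarily the laboratory's own implementation. The data are synthetic, and the function name `orthogonal_expansion` and all parameter values are hypothetical.

```python
import numpy as np

def orthogonal_expansion(X, k):
    """Project the rows of X onto the k leading orthonormal directions.

    X has shape (n_patterns, ambient_dim). Returns the mean pattern,
    the k basis vectors, the reduced coordinates, and the fraction of
    the total variance ("energy") captured by those k directions.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the family
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:k]                                  # (k, ambient_dim), orthonormal rows
    coeffs = Xc @ basis.T                           # (n_patterns, k) reduced coordinates
    energy = (s[:k] ** 2).sum() / (s ** 2).sum()
    return mean, basis, coeffs, energy

# Synthetic "family": 200 patterns in a 1000-dimensional ambient space
# that actually vary along only 3 hidden directions, plus small noise.
# All sizes and the noise level are assumptions made for illustration.
rng = np.random.default_rng(0)
n, d, r = 200, 1000, 3
latent = rng.normal(size=(n, r))
mixing = rng.normal(size=(r, d))
X = latent @ mixing + 0.01 * rng.normal(size=(n, d))

mean, basis, coeffs, energy = orthogonal_expansion(X, k=3)

# Reconstruct each pattern from its 3 coordinates and measure the error
# relative to the spread of the family around its mean.
X_hat = mean + coeffs @ basis
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X - mean)

print(f"energy captured by 3 modes: {energy:.6f}")
print(f"relative reconstruction error: {rel_err:.6f}")
```

In this sketch each 1000-dimensional pattern is replaced by three coordinates. A small reconstruction error is the numerical counterpart of the coherence described above: the family sits close to a low-dimensional subspace of its ambient space.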
The analysis of massive data sets involves a myriad of theoretical, algorithmic and computational challenges, from which a new field of applied mathematics has emerged. It also promotes interdisciplinary collaboration, for example with high-performance computing, for the fast manipulation of high-dimensional and massive data sets.