My research is broadly in the area of large-scale distributed systems. Below you will find short descriptions of the ongoing projects. Please visit the respective project pages for the latest updates relating to publications, software releases, and ongoing research activities.
Granules supports the processing of data streams over a distributed collection of processing elements. Such streams can be generated in settings involving observational and monitoring equipment, simulations, and computational workflows. In Granules these computations can be long-running, with multiple rounds of execution and the ability to retain state across successive rounds. Granules allows a collection of related computations to be expressed as directed graphs that may contain cycles, and orchestrates the completion of such distributed processing. Granules manages the lifecycle and finite state machine associated with each computation. The system can orchestrate stream processing computations within traditional clusters, collections of desktops, or IaaS VM-based settings. The processing encapsulated within these computations can be arbitrary, and encoded in C, C++, C#, Java, R, or Python. Granules also incorporates support for variants of the MapReduce paradigm that make it amenable to scientific applications. By abstracting the complexities of performing I/O and the vagaries of execution in distributed settings, Granules allows a domain scientist to focus on the problem at hand rather than on artifacts related to deployments in large-scale distributed systems. A broad class of compute- and data-intensive applications can benefit from the capabilities available in Granules. Application domains where Granules is currently deployed include brain-computer interfaces, epidemiological modeling, handwriting recognition, data clustering algorithms, and bioinformatics (mRNA sequencing).
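To make the execution model concrete, below is a minimal sketch in Java of a stateful computation that is triggered once per arriving stream packet and retains its state across rounds. The interface and driver are illustrative assumptions for exposition only, not the actual Granules API.

// Hypothetical sketch of a stateful, multi-round stream computation.
// The interface and driver below are illustrative, not the Granules API.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

interface StreamComputation {
    // Invoked once per execution round; state is retained across rounds.
    void onRound(String payload, Map<String, Object> state);
}

// Example: a running count that keeps its tally across successive rounds.
class RunningCount implements StreamComputation {
    public void onRound(String payload, Map<String, Object> state) {
        int count = (int) state.getOrDefault("count", 0) + 1;
        state.put("count", count);
        System.out.println("round payload=" + payload + " count=" + count);
    }
}

public class Driver {
    public static void main(String[] args) {
        Queue<String> stream = new ArrayDeque<>();
        stream.add("pkt-1"); stream.add("pkt-2"); stream.add("pkt-3");

        StreamComputation comp = new RunningCount();
        Map<String, Object> state = new HashMap<>();  // survives rounds

        // Each arriving packet triggers one execution round.
        while (!stream.isEmpty()) {
            comp.onRound(stream.poll(), state);
        }
    }
}

Retaining state in this fashion is what allows long-running computations, such as iterative model updates, to be expressed as repeated rounds rather than one monolithic pass over the data.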
The Glean project focuses on performing analytics at scale over Big Data. The datasets we consider are on the order of petabytes and encompass billions of files representing trillions of observations, measurements, or simulation datapoints. Glean achieves this by combining innovations in large-scale storage systems, cloud computing, machine learning, and statistics. A particular focus of this effort is to perform analytics in real time over streaming data representing time-series observations.
Time-series data occurs in settings such as observations initiated by radars and satellites, checkpointing data representing the state of a system at regular intervals, and analytics representing the evolution of extracted knowledge over time. Galileo is a demonstrably scalable storage framework for managing such time-series data. The distributed storage system is incrementally scalable, with the ability to assimilate new storage nodes as they become available. How data is stored and dispersed impacts the efficiency of subsequent retrievals. The data dispersion algorithm in Galileo stores similar data items in network proximity without introducing storage imbalances at individual storage nodes. This allows for a significant reduction in the search space for queries and analyses that may be performed on the stored data.
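The sketch below illustrates one way such a dispersion scheme can work, assuming geohash-style keys in which similar observations share a prefix: the prefix routes items to a network-proximate node group, shrinking the search space for queries, while a hash of the full key spreads load evenly within the group. The group counts and prefix length are illustrative assumptions, not Galileo's actual parameters.

// Hypothetical sketch of similarity-aware dispersion: a short feature
// prefix (e.g., a geohash) routes similar items to the same node group,
// while a hash of the full key avoids hotspots within the group.
import java.util.List;

public class Dispersion {
    static final int GROUPS = 8;           // network-proximate node groups
    static final int NODES_PER_GROUP = 4;  // storage nodes per group

    // Similar keys share a prefix, so they map to the same group.
    static int groupFor(String featureKey) {
        String prefix = featureKey.substring(0, Math.min(3, featureKey.length()));
        return Math.floorMod(prefix.hashCode(), GROUPS);
    }

    // Within a group, hash the full key to balance storage load.
    static int nodeFor(String featureKey) {
        return Math.floorMod(featureKey.hashCode(), NODES_PER_GROUP);
    }

    public static void main(String[] args) {
        // Two nearby observations (shared prefix "9xj") land in the same
        // group; a query over that region searches one group instead of
        // the entire cluster.
        for (String key : List.of("9xj5smu:2024-01-01", "9xj64kd:2024-01-01",
                                  "dr5ru7q:2024-01-01")) {
            System.out.printf("%s -> group %d, node %d%n",
                    key, groupFor(key), nodeFor(key));
        }
    }
}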
The Spindle effort focuses on issues in the migration of applications to Infrastructure-as-a-Service clouds. We construct performance models for applications by profiling their constituent components during execution, extracting several features relating to CPU processing, memory consumption, and I/O. We then use these performance models to inform our VM placement and composition decisions while also reducing economic costs.
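The following is a minimal sketch of how such a profile-driven model can inform placement: profiled CPU, memory, and I/O demands are matched against candidate VM types, and the cheapest type that satisfies the profile is selected. The VM types, prices, and the simple runtime model are illustrative assumptions rather than Spindle's actual formulation.

// Hypothetical sketch of profile-driven VM placement: pick the cheapest
// VM type that meets the profiled memory and I/O demands, estimating
// runtime (and hence cost) from the profiled CPU work.
import java.util.List;

public class Placement {
    record Profile(double cpuSeconds, double memGB, double ioMBps) {}
    record VmType(String name, double cpus, double memGB,
                  double ioMBps, double dollarsPerHour) {}

    // True when the VM satisfies the profiled memory and I/O demands.
    static boolean fits(Profile p, VmType vm) {
        return vm.memGB() >= p.memGB() && vm.ioMBps() >= p.ioMBps();
    }

    // Simplification: runtime scales inversely with the number of vCPUs.
    static double cost(Profile p, VmType vm) {
        double hours = (p.cpuSeconds() / vm.cpus()) / 3600.0;
        return hours * vm.dollarsPerHour();
    }

    public static void main(String[] args) {
        Profile app = new Profile(3600, 6.0, 80);  // from profiling runs
        List<VmType> candidates = List.of(
            new VmType("small",  2,  4, 100, 0.10),
            new VmType("medium", 4,  8, 200, 0.20),
            new VmType("large",  8, 16, 400, 0.40));

        VmType best = candidates.stream()
            .filter(vm -> fits(app, vm))
            .min((a, b) -> Double.compare(cost(app, a), cost(app, b)))
            .orElseThrow();
        System.out.printf("placement: %s, est. $%.3f%n", best.name(), cost(app, best));
    }
}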