My research is broadly in the area of large-scale distributed systems. Below you will find short descriptions of the ongoing projects. Please visit the respective project pages for the latest updates relating to publications, software releases, and ongoing research activities.
Granules supports the processing of data streams over a distributed collection of processing elements. Such streams can be generated in settings involving observational and monitoring equipment, simulations, and computational workflows. In Granules these computations can be long-running, with multiple rounds of execution and the ability to retain state across successive rounds. Granules allows a collection of related computations to be expressed as directed graphs that may contain cycles, and orchestrates the completion of such distributed processing. Granules manages the lifecycle and finite state machine associated with each computation. The system can orchestrate stream processing computations within traditional clusters, collections of desktops, or IaaS VM-based settings. The processing encapsulated within these computations can be arbitrary, and encoded in C, C++, C#, Java, R, or Python. Granules also incorporates support for variants of the MapReduce paradigm that make it amenable to scientific applications. By abstracting the complexities of performing I/O and the vagaries of execution in distributed settings, Granules allows a domain scientist to focus on the problem at hand rather than on artifacts related to deployments in large-scale distributed systems. A broad class of compute- and data-intensive applications can benefit from the capabilities available in Granules. Application domains where Granules is currently deployed include brain-computer interfaces, epidemiological modeling, handwriting recognition, data clustering algorithms, and bioinformatics (mRNA sequencing).
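To make the execution model concrete, below is a minimal sketch in Java of a stateful computation that is triggered once per arriving stream packet and retains its state across rounds. The interface and driver are illustrative assumptions for exposition only, not the actual Granules API.

// Hypothetical sketch of a stateful, multi-round stream computation.
// The interface and driver below are illustrative, not the Granules API.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

interface StreamComputation {
    // Invoked once per execution round; state is retained across rounds.
    void onRound(String payload, Map<String, Object> state);
}

// Example: a running count that keeps its tally across successive rounds.
class RunningCount implements StreamComputation {
    public void onRound(String payload, Map<String, Object> state) {
        int count = (int) state.getOrDefault("count", 0) + 1;
        state.put("count", count);
        System.out.println("round payload=" + payload + " count=" + count);
    }
}

public class Driver {
    public static void main(String[] args) {
        Queue<String> stream = new ArrayDeque<>();
        stream.add("pkt-1"); stream.add("pkt-2"); stream.add("pkt-3");

        StreamComputation comp = new RunningCount();
        Map<String, Object> state = new HashMap<>();  // survives rounds

        // Each arriving packet triggers one execution round.
        while (!stream.isEmpty()) {
            comp.onRound(stream.poll(), state);
        }
    }
}

Retaining state in this fashion is what allows long-running computations, such as iterative model updates, to be expressed as repeated rounds rather than one monolithic pass over the data.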
The Glean project focuses on performing analytics at scale over Big Data. The datasets we consider are on the order of petabytes and encompass billions of files representing trillions of observations, measurements, or simulation datapoints. Glean achieves this by combining innovations in large-scale storage systems, cloud computing, machine learning, and statistics. A particular focus of this effort is to perform analytics in real time over streaming data representing time-series observations.
Time-series data occurs in settings such as observations initiated by radars and satellites, checkpointing data representing the state of a system at regular intervals, and analytics representing the evolution of extracted knowledge over time. Galileo is a demonstrably scalable storage framework for managing such time-series data. The distributed storage system is incrementally scalable, with the ability to assimilate new storage nodes as they become available. How data is stored and dispersed impacts the efficiency of subsequent retrievals. The data dispersion algorithm in Galileo stores similar data items in network proximity without introducing storage imbalances at individual storage nodes. This allows for a significant reduction in the search space for queries and analyses that may be performed on the stored data.
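The sketch below illustrates one way such a dispersion scheme can work, assuming geohash-style keys in which similar observations share a prefix: the prefix routes items to a network-proximate node group, shrinking the search space for queries, while a hash of the full key spreads load evenly within the group. The group counts and prefix length are illustrative assumptions, not Galileo's actual parameters.

// Hypothetical sketch of similarity-aware dispersion: a short feature
// prefix (e.g., a geohash) routes similar items to the same node group,
// while a hash of the full key avoids hotspots within the group.
import java.util.List;

public class Dispersion {
    static final int GROUPS = 8;           // network-proximate node groups
    static final int NODES_PER_GROUP = 4;  // storage nodes per group

    // Similar keys share a prefix, so they map to the same group.
    static int groupFor(String featureKey) {
        String prefix = featureKey.substring(0, Math.min(3, featureKey.length()));
        return Math.floorMod(prefix.hashCode(), GROUPS);
    }

    // Within a group, hash the full key to balance storage load.
    static int nodeFor(String featureKey) {
        return Math.floorMod(featureKey.hashCode(), NODES_PER_GROUP);
    }

    public static void main(String[] args) {
        // Two nearby observations (shared prefix "9xj") land in the same
        // group; a query over that region searches one group instead of
        // the entire cluster.
        for (String key : List.of("9xj5smu:2024-01-01", "9xj64kd:2024-01-01",
                                  "dr5ru7q:2024-01-01")) {
            System.out.printf("%s -> group %d, node %d%n",
                    key, groupFor(key), nodeFor(key));
        }
    }
}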
The Spindle effort focuses on issues in the migration of applications to Infrastructure-as-a-Service clouds. We construct performance models for applications by profiling their constituent components during execution, extracting several features relating to CPU processing, memory consumption, and I/O. We then use these performance models to inform our VM placement and composition decisions while also reducing economic costs.
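The following is a minimal sketch of how such a profile-driven model can inform placement: profiled CPU, memory, and I/O demands are matched against candidate VM types, and the cheapest type that satisfies the profile is selected. The VM types, prices, and the simple runtime model are illustrative assumptions rather than Spindle's actual formulation.

// Hypothetical sketch of profile-driven VM placement: pick the cheapest
// VM type that meets the profiled memory and I/O demands, estimating
// runtime (and hence cost) from the profiled CPU work.
import java.util.List;

public class Placement {
    record Profile(double cpuSeconds, double memGB, double ioMBps) {}
    record VmType(String name, double cpus, double memGB,
                  double ioMBps, double dollarsPerHour) {}

    // True when the VM satisfies the profiled memory and I/O demands.
    static boolean fits(Profile p, VmType vm) {
        return vm.memGB() >= p.memGB() && vm.ioMBps() >= p.ioMBps();
    }

    // Simplification: runtime scales inversely with the number of vCPUs.
    static double cost(Profile p, VmType vm) {
        double hours = (p.cpuSeconds() / vm.cpus()) / 3600.0;
        return hours * vm.dollarsPerHour();
    }

    public static void main(String[] args) {
        Profile app = new Profile(3600, 6.0, 80);  // from profiling runs
        List<VmType> candidates = List.of(
            new VmType("small",  2,  4, 100, 0.10),
            new VmType("medium", 4,  8, 200, 0.20),
            new VmType("large",  8, 16, 400, 0.40));

        VmType best = candidates.stream()
            .filter(vm -> fits(app, vm))
            .min((a, b) -> Double.compare(cost(app, a), cost(app, b)))
            .orElseThrow();
        System.out.printf("placement: %s, est. $%.3f%n", best.name(), cost(app, best));
    }
}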