My research encompasses methodological and algorithmic innovations in three broad areas: (1) spatiotemporal data management and analytics, (2) file systems, and (3) stream processing for Internet-of-Things and Cyber-Physical Systems settings.

Spatiotemporal data analysis at scale: We have designed a suite of algorithms and software to simplify voluminous spatiotemporal data management and analytics. These algorithms are data-format agnostic, and our reference implementations can cope with data stored in over 20 different formats including, inter alia, CSV, netCDF, HDF, XML, GRIB, BUFR, DMSP, NEXRAD, and SIGMET. These systems have been deployed in epidemiology, ecological monitoring, methane gas leak detection, and the atmospheric sciences.

* Galileo: Fast storage and retrieval of voluminous multidimensional time-series data. Galileo supports approximate, analytical, fuzzy, t-test, and significance evaluation queries over petascale datasets encompassing on the order of a trillion files and a quadrillion observations. [Effort with the Big Data group]
|
* Synopsis: A distributed and scalable sketching algorithm for voluminous spatiotemporal data. The data that is sketched can be either real-time observational streams or on-disk data. The sketch is memory-resident and serves as an effective surrogate for on-disk data (a minimal illustration appears after this list).
|
* Glean: Predictive analytics at scale, including correlation analysis, hypothesis testing, feature selection, and anomaly detection. Glean targets real-time analytics over time-series data and supports the creation and online updating of statistical, ensemble, machine learning, and probabilistic models (an example of such an online update appears after this list). [Effort with the Big Data group]
|
* Symphony: This effort develops tools for assessing the epidemiological impact and economic costs of livestock disease outbreaks at the national scale while reconciling the heterogeneity of the U.S. mainland. This includes exploring the consequences of countermeasures, vaccination strategies (including vaccine effectiveness), and policy on disease spread, economic costs, and the depletion of resources (a toy disease-spread model appears after this list).
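
To make the Synopsis idea concrete, the following is a minimal, memory-resident sketch over spatiotemporal bins. It is an illustrative sketch only: the binning scheme (a coarse lat/lon grid with hourly buckets), the class names, and the Welford-style running statistics are assumptions for exposition, not Synopsis's actual design.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal memory-resident sketch in the spirit of Synopsis (illustrative
 * only): observations are binned by a coarse spatial key and a time bucket,
 * and each bin keeps running statistics so aggregate queries can be
 * answered without touching on-disk data.
 */
public class SpatiotemporalSketch {
    /** Running count/mean/variance (Welford's algorithm) for one bin. */
    static final class BinStats {
        long count;
        double mean;
        double m2; // sum of squared deviations from the mean

        void update(double x) {
            count++;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        }
        double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }
    }

    private final Map<String, BinStats> bins = new HashMap<>();

    /** Bin key: lat/lon rounded to one decimal degree plus an hourly bucket. */
    private static String key(double lat, double lon, long epochSeconds) {
        long hour = epochSeconds / 3600;
        return String.format("%.1f:%.1f:%d", lat, lon, hour);
    }

    public void ingest(double lat, double lon, long epochSeconds, double value) {
        bins.computeIfAbsent(key(lat, lon, epochSeconds), k -> new BinStats())
            .update(value);
    }

    /** Approximate mean for a cell/hour, answered entirely from memory. */
    public double approximateMean(double lat, double lon, long epochSeconds) {
        BinStats s = bins.get(key(lat, lon, epochSeconds));
        return s == null ? Double.NaN : s.mean;
    }

    public static void main(String[] args) {
        SpatiotemporalSketch sketch = new SpatiotemporalSketch();
        sketch.ingest(40.5, -105.0, 1_700_000_000L, 21.7); // e.g., temperature
        sketch.ingest(40.5, -105.0, 1_700_000_100L, 22.3);
        System.out.println(sketch.approximateMean(40.5, -105.0, 1_700_000_000L));
    }
}
```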
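
In the same spirit, the example below shows the kind of one-pass update that keeps a statistic current as observations stream in: an online Pearson correlation. The class and method names are hypothetical; Glean's actual APIs and models are richer.

```java
/**
 * Online Pearson correlation between two time-series, updated one
 * observation at a time so correlation analysis never requires a second
 * pass over the data. Illustrative sketch, not Glean's actual API.
 */
public class OnlineCorrelation {
    private long n;
    private double meanX, meanY;
    private double c2;        // running co-moment of x and y
    private double m2x, m2y;  // running second moments of x and y

    /** Fold in one paired observation in O(1) time and space. */
    public void update(double x, double y) {
        n++;
        double dxOld = x - meanX;
        meanX += dxOld / n;
        double dxNew = x - meanX;
        double dyOld = y - meanY;
        meanY += dyOld / n;
        double dyNew = y - meanY;
        c2  += dxOld * dyNew; // standard one-pass covariance update
        m2x += dxOld * dxNew;
        m2y += dyOld * dyNew;
    }

    public double correlation() {
        if (n < 2 || m2x == 0 || m2y == 0) return Double.NaN;
        return c2 / Math.sqrt(m2x * m2y);
    }

    public static void main(String[] args) {
        OnlineCorrelation corr = new OnlineCorrelation();
        double[][] pairs = {{1, 2.1}, {2, 3.9}, {3, 6.2}, {4, 7.8}};
        for (double[] p : pairs) corr.update(p[0], p[1]);
        System.out.println(corr.correlation()); // close to 1.0
    }
}
```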
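
Finally, a toy SIR compartment model hints at the disease-spread core that national-scale outbreak tools build on. The parameters below are invented for illustration; Symphony's models additionally capture economics, countermeasures, and the spatial heterogeneity of the U.S. mainland.

```java
/**
 * Toy SIR compartment model (forward-Euler, one-day steps). Illustrative
 * sketch only: the rates are made-up values, not Symphony's calibrated
 * parameters.
 */
public class SirSketch {
    public static void main(String[] args) {
        double s = 0.99, i = 0.01, r = 0.0; // population fractions
        final double beta = 0.30;   // transmission rate per day (assumed)
        final double gamma = 0.10;  // removal/recovery rate per day (assumed)

        for (int day = 0; day < 120; day++) {
            if (day % 30 == 0) {
                System.out.printf("day %3d: S=%.3f I=%.3f R=%.3f%n", day, s, i, r);
            }
            double infections = beta * s * i; // dS/dt = -beta*S*I
            double removals = gamma * i;      // dR/dt = gamma*I
            s -= infections;
            i += infections - removals;
            r += removals;
            // A vaccination countermeasure could be modeled here by moving
            // an additional fraction of s directly into r each day.
        }
        System.out.printf("day %3d: S=%.3f I=%.3f R=%.3f%n", 120, s, i, r);
    }
}
```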
|
|
|
|
File Systems: My research targets the micro- and macroscopic aspects of distributed file systems design. At the individual machine level, this has targeted the efficiency of disk scheduling algorithms and contention. At distributed scales, this has involved metadata management, query support, preservation of timeliness and throughput, and overlay design. Ongoing research in this area has focused on designing file systems that facilitate high-performance training of ensemble-based data fitting algorithms such as random forests and gradient boosting.
|
|
* Galileo: Fast storage and retrieval of voluminous multidimensional time-series data. Galileo supports approximate, analytical, fuzzy, t-test, and significance evaluation queries over petascale datasets encompassing on the order of a trillion files and a quadrillion observations.
|
|
* Minerva: This effort targets the efficiency of file systems in virtualized cloud environments. Minerva proactively alleviates disk contention, amortizes I/O costs, and selectively prioritizes VMs based on access patterns and durations.
|
|
* Gossamer: This effort explores foundational issues in file systems design for data generated in continuous sensing environments. In particular, we are exploring how distributed file systems can cope with situations where data arrival rates outpace write throughputs and disk capacities (see the sketch after this list).
|
|
* Concerto: This is a distributed file system designed specifically to simplify the construction of analytical models using ensemble methods such as random forests and gradient boosting. In particular, the objective of this effort is to facilitate dispersion and data accesses in multidimensional data spaces so as to preserve accuracy while significantly reducing training times (a sketch of locality-preserving dispersion appears after this list).
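
The sketch below illustrates one plausible response to the regime Gossamer targets, where arrivals outpace write throughput: a bounded buffer that sheds an increasing fraction of arrivals as occupancy climbs, rather than blocking sensors. The shedding policy and all names here are assumptions for illustration, not Gossamer's actual design.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Load-shedding write buffer for continuous sensing: when data arrival
 * rates exceed what the disk can absorb, probabilistically shed arrivals
 * instead of stalling producers. Illustrative sketch only.
 */
public class SheddingWriteBuffer {
    private final BlockingQueue<byte[]> pending;
    private final int capacity;
    private long accepted, shed;

    public SheddingWriteBuffer(int capacity) {
        this.capacity = capacity;
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Admit an observation. As occupancy climbs past 50%, an increasing
     * fraction of arrivals is shed rather than blocking the sensor.
     */
    public boolean admit(byte[] record) {
        double occupancy = pending.size() / (double) capacity;
        double dropProbability = Math.max(0.0, (occupancy - 0.5) * 2.0);
        if (ThreadLocalRandom.current().nextDouble() < dropProbability
                || !pending.offer(record)) {
            shed++;
            return false;
        }
        accepted++;
        return true;
    }

    /** Called by the disk-writer thread; drains at the disk's pace. */
    public byte[] nextForDisk() throws InterruptedException {
        return pending.take();
    }

    public static void main(String[] args) {
        SheddingWriteBuffer buffer = new SheddingWriteBuffer(4);
        for (int i = 0; i < 10; i++) buffer.admit(new byte[] {(byte) i});
        // Nothing drains in this demo, so most later arrivals are shed.
        System.out.println("accepted=" + buffer.accepted + " shed=" + buffer.shed);
    }
}
```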
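
For Concerto, the following sketches locality-preserving dispersion of multidimensional records using a Z-order (bit-interleaved) key, one standard way to keep nearby points in feature space on the same storage node. This is a hypothetical building block for exposition, not Concerto's published placement scheme.

```java
/**
 * Locality-preserving dispersion over a two-feature space: feature values
 * are quantized and bit-interleaved into a Z-order key, and contiguous key
 * ranges map to storage nodes, so records that are close in feature space
 * tend to land on the same node. Illustrative sketch only.
 */
public class ZOrderPlacement {
    /** Quantize a feature value in [min, max] to `bits` bits. */
    static long quantize(double v, double min, double max, int bits) {
        double unit = (v - min) / (max - min);
        long levels = (1L << bits) - 1;
        return Math.min(levels, Math.max(0, Math.round(unit * levels)));
    }

    /** Interleave the bits of two quantized features into a Z-order key. */
    static long zOrder(long a, long b, int bits) {
        long key = 0;
        for (int i = bits - 1; i >= 0; i--) {
            key = (key << 1) | ((a >>> i) & 1);
            key = (key << 1) | ((b >>> i) & 1);
        }
        return key;
    }

    /** Map a two-feature record in [0,1]^2 to one of `nodes` storage nodes. */
    static int nodeFor(double x, double y, int nodes) {
        int bits = 10;
        long key = zOrder(quantize(x, 0, 1, bits), quantize(y, 0, 1, bits), bits);
        long keySpace = 1L << (2 * bits);
        return (int) (key * nodes / keySpace);
    }

    public static void main(String[] args) {
        // Nearby points map to the same node; distant points usually do not.
        System.out.println(nodeFor(0.10, 0.12, 8) + " " + nodeFor(0.11, 0.13, 8)
                + " " + nodeFor(0.90, 0.95, 8));
    }
}
```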
|
|
|
|
|
Real-time Stream Processing: This effort targets the processing of data streams generated in IoT and Cyber-Physical Systems settings. Optimal stream scheduling is NP-hard, and our algorithm based on interference scores and time-series models is currently the state of the art for single-stage stream processing (a simplified sketch follows). Our efforts have targeted stream processing at edge devices such as the Raspberry Pi.
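
The sketch below conveys the flavor of interference-aware placement: each stream computation goes to the worker whose most contended resource it pushes least toward saturation. The scoring function and all names are simplified stand-ins; the actual algorithm also incorporates time-series models of stream behavior.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Greedy interference-aware placement of stream computations onto workers.
 * Simplified sketch: the interference score here is an illustrative
 * stand-in, not the published algorithm.
 */
public class InterferenceScheduler {
    static final class Worker {
        final String name;
        double cpuLoad, netLoad; // fractions of capacity in [0, 1]
        Worker(String name) { this.name = name; }
    }

    static final class Stream {
        final String name;
        final double cpuDemand, netDemand;
        Stream(String name, double cpu, double net) {
            this.name = name; this.cpuDemand = cpu; this.netDemand = net;
        }
    }

    /**
     * Interference score: post-placement load on the worker's most
     * contended resource (lower is better).
     */
    static double interference(Worker w, Stream s) {
        return Math.max(w.cpuLoad + s.cpuDemand, w.netLoad + s.netDemand);
    }

    static Worker place(List<Worker> workers, Stream s) {
        Worker best = workers.stream()
                .min(Comparator.comparingDouble(w -> interference(w, s)))
                .orElseThrow();
        best.cpuLoad += s.cpuDemand;
        best.netLoad += s.netDemand;
        return best;
    }

    public static void main(String[] args) {
        List<Worker> workers = List.of(new Worker("pi-1"), new Worker("pi-2"));
        for (Stream s : List.of(new Stream("temp", 0.4, 0.1),
                                new Stream("video", 0.3, 0.6),
                                new Stream("audio", 0.2, 0.3))) {
            System.out.println(s.name + " -> " + place(workers, s).name);
        }
    }
}
```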
|
* Granules: Granules is a cloud runtime with support for MapReduce and real-time processing of data streams generated by sensors, medical devices, and programs. Our new system, NEPTUNE, builds on Granules and targets extreme throughputs in sensing environments. This includes orchestrating packet processing while accounting for both stream production patterns and resource utilization to ensure real-time, high-throughput processing.
|
|
* VitalHome: We have been experimenting with a diverse set of physiological and environmental sensors as part of our VitalHome instrumentation project, which continuously and non-invasively harvests vital-sign data to identify incipient signs of health problems.
|
* Spindle: Autonomous deployment of multitier and streaming applications in the cloud. Spindle leverages spot instances and manages vertical and horizontal scaling across different VM appliance types to provide faster response times at lower economic cost.
|
* NaradaBrokering: Scalable dissemination of voluminous streams with support for Web Services and JMS. NaradaBrokering has been used in commercial Internet conferencing systems to support thousands of concurrent online meetings.