Current Projects

Intel Parallel Computing Center at Indiana University (IPCC@IU)

Through generous funding from Intel® Corporation, Indiana University Bloomington will become the newest Intel® Parallel Computing Center (Intel® PCC) this September. Intel® PCCs are universities, institutions, and labs identified as leaders in their fields, focusing on modernizing applications to increase parallelism and scalability through optimizations that leverage cores, caches, threads, and vector capabilities of microprocessors and coprocessors. 

This latest interdisciplinary center is led by Judy Qiu, an Assistant Professor in the School of Informatics and Computing. The work of Steven Gottlieb, a Distinguished Professor of Physics, is also supported by the center. Qiu's research focuses on novel parallel systems supporting data analytics, while Gottlieb focuses on adapting the physics simulation code of the MILC Collaboration to the Intel® Xeon Phi™ Processor Family. 

“The Intel® Parallel Computing Center highlights IU’s leadership and strength in high performance computing. It represents collaboration between industry and higher education, and across schools and departments within IU, that will benefit the research community and the private sector in a variety of important ways,” said School of Informatics and Computing Dean Bobby Schnabel. 

Indiana University will benefit from its role as an Intel® PCC by having access to Intel expertise, software tools, and advanced technologies. Qiu and Gottlieb also look forward to sharing the results of their work in collaboration with Intel at conferences such as the International Supercomputing Conference held in Europe, the SC conference held in the US and the Intel® Xeon Phi™ User Group meetings. This initial work could be followed by further projects with other IU faculty funded by this Intel® PCC.

Cloud-Based Perception and Control of Sensor Nets and Robot Swarms

            Sponsored by: Air Force Office of Scientific Research (AFOSR)

            Abstract:

Cloud computing has long been identified as a key enabling technology for Internet of Things (IoT) applications. We have developed an open-source framework called IoTCloud[1] to connect IoT devices to cloud services. IoTCloud was developed as part of research funded by AFOSR on Cloud-Based Perception and Control of Sensor Nets and Robot Swarms. IoTCloud consists of: a set of distributed nodes running close to the devices to gather data; a set of publish-subscribe brokers to relay the information to the cloud services; and a distributed stream processing framework (DSPF), coupled with batch processing engines in the cloud, to process the data and return (control) information to the IoT devices. Real-time applications execute data analytics at the DSPF layer, achieving streaming real-time processing. Our open-source IoTCloud platform [2] uses Apache Storm[3] as the DSPF, RabbitMQ[4] or Kafka[5] as the message broker, and an OpenStack academic cloud[6] (or bare-metal cluster) as the platform. To scale applications with the number of devices, we need distributed coordination among parallel tasks and discovery of devices; both are achieved with a ZooKeeper[7]-based coordination and discovery service.
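The data path described above (device-side gateways publishing through a broker to a stream processor that returns control messages) can be sketched in miniature. This is an illustrative stand-in built on Python's standard library, not IoTCloud's actual API; the in-process queues play the role of the RabbitMQ/Kafka broker channels, and the function and device names are invented:

```python
import queue

# Broker channels (standing in for RabbitMQ/Kafka topics).
to_cloud = queue.Queue()    # devices -> cloud services
to_devices = queue.Queue()  # cloud services -> devices (control messages)

def gateway(device_id, readings):
    """Distributed node running close to a device: gathers data and
    publishes it to the broker."""
    for r in readings:
        to_cloud.put({"device": device_id, "value": r})

def process_stream():
    """DSPF layer (Apache Storm in IoTCloud): analyzes each event and
    returns control information to the originating device."""
    while not to_cloud.empty():
        event = to_cloud.get()
        # Toy analytic: command the device to slow down when readings spike.
        command = "slow" if event["value"] > 50 else "ok"
        to_devices.put({"device": event["device"], "command": command})

gateway("robot-1", [10, 70, 30])
process_stream()
commands = []
while not to_devices.empty():
    commands.append(to_devices.get())
print(commands)
```

In the real platform the stream-processing logic runs as a parallel Storm topology, and ZooKeeper coordinates the parallel tasks and tracks which devices are connected.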

 

Developing Machine Learning Algorithms to Access Bedrock and Internal Layers in Polar Radar Imagery

            Sponsored by: NASA

            Abstract:

 

Planning Grant: I/UCRC for joining Center for Cloud and Autonomic Computing

            Sponsored by: National Science Foundation

            Abstract:

The proposed planning activity seeks to establish a new Industry/University Cooperative Research Center (I/UCRC) site at Indiana University within the existing Center for Cloud and Autonomic Computing. The center currently involves Arizona, Florida, Mississippi State, and Rutgers, along with industry and government partners in the field. The Indiana University site intends to provide additional capabilities for the center, covering basic cloud infrastructure including tools to support the FutureGrid testbed, security, networking, storage, programming models, and applications. The site intends to explore the utility of clouds for supporting scientific research, including infrastructure tools and development of innovations in extensions of the MapReduce programming model, identification of science applications suitable for clouds, privacy-preserving algorithms for health informatics on clouds, and integration of high-performance networking and data architectures. The planned site, in combination with the existing center, plans to help build innovation capacity in cloud computing and its supporting science and engineering through cooperatively defined research with industry on key challenges. The outcomes from the center have the potential to broadly impact the public and private sectors. The site plans to have a significant impact on curriculum and student training at the undergraduate and graduate levels. The planned site intends to actively engage in outreach to Minority Serving Institutions through existing consortia.

 

 

Extensible Computational Services for Discovery of New Particles

            Sponsored by: National Science Foundation

            Abstract:

The origin of matter is one of the great puzzles of nature. Even though the visible component is only a small part of the mass/energy balance of the universe, the strong force that permanently confines quarks and gluons to atomic nuclei remains mysterious. Understanding the phenomenon of confinement is one of the fundamental questions in physics. Protons, neutrons, and other hadrons are the bound states of quarks and gluons allowed by confinement. This project will pursue studies of individual hadrons and their interactions and is expected to offer a unique window into this fundamental phenomenon. In the process, this project will enable best-practice spectroscopy analysis across a broad range of forthcoming experiments, particularly at the Jefferson Laboratory. Education activities are an integral part of this effort. Students will have ample training opportunities and will acquire computational and analytical skills that will serve them well in their future careers in industry or academia. A vibrant set of summer undergraduate research experiences, with both computer science and particle physics projects, will be hosted by the investigators at the Jefferson Laboratory.

 

Developments in particle accelerators and detection techniques have led to a new generation of experiments in hadron physics that are flourishing around the world. An important set of experiments studies hadron spectroscopy: what hadrons exist and with what properties. The new experiments at these facilities generate complicated data sets, which demand a qualitatively new level of sophistication in analysis never achieved before. The objective of the project is to develop new theoretical tools, and the underlying computational services, for analysis of large-statistics data sets from current and future experiments in hadron spectroscopy, with the goal of searching for new hadrons, in particular the so-called hybrids and glueballs, which are expected to carry some unique signatures of the confinement phenomenon. The unique features of the underlying theory of strong interactions also make it an attractive template for constructing theories beyond the Standard Model of fundamental interactions. This project will enable such work.

 

XPS: FULL: DSD: Collaborative Research: Rapid Prototyping HPC Environment for Deep Learning

            Sponsored by: National Science Foundation

            Abstract:

The impact of Big Data is all around us and is enabling a plethora of commercial services. Further, it is establishing the fourth paradigm of scientific investigation, where discovery is based on mining data rather than on theories verified by observation. Big Data has established a new discipline (Data Science) with vibrant research activities across several areas of computer science. This “Rapid Python Deep Learning Infrastructure” (RaPyDLI) project advances Deep Learning (DL), a novel and exciting artificial intelligence approach to Big Data problems that also involves a sophisticated model and a corresponding “big compute” requirement needing high-end supercomputer architectures. DL has already seen success in areas like speech recognition, drug discovery, and computer vision, where self-driving cars are an early target. DL uses a very general, unbiased way of analyzing large data sets inspired by the brain as a set of connected neurons. As with the brain, the artificial neurons learn from experience corresponding to a “training dataset,” and the “trained network” can be used to make decisions. Trained on voices, a DL network can enhance voice recognition; trained on images, it can recognize objects in an image. A recent study by the Stanford participants in this project trained 10 billion connections on 10 million images to recognize objects in an image. This study involved a dataset that was approximately 0.1% of the size of the data “learnt” by an adult human in their lifetime, and one billionth of the total digital data stored in the world today. The 1.5 billion images uploaded to social media sites every day emphasize the staggering size of big data. The project aims to enhance DL by allowing it to use large supercomputers efficiently and by providing a convenient DL computing environment that enables rapid prototyping, i.e., interactive experimentation with new algorithms.
This will enable DL to be applied to much larger datasets, such as those “seen” by a human in their lifetime. The RaPyDLI partnership of Indiana University, the University of Tennessee, and Stanford enables this with expertise in parallel computing algorithms and runtimes, big data, clouds, and DL itself.
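The learn-from-a-training-dataset idea described above can be illustrated with the smallest possible example: a single artificial neuron (a perceptron) that learns the logical AND function from four labeled examples. This toy sketch is unrelated to RaPyDLI's actual code; real DL networks stack millions of such units and train with backpropagation rather than the simple perceptron rule used here, and all names and values below are invented:

```python
# A single artificial neuron learning from a four-example training dataset.
training = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # logical AND

w = [0.0, 0.0]  # connection weights, adjusted by experience
b = 0.0         # bias term
lr = 0.1        # learning rate

def predict(x):
    """Fire (output 1) when the weighted sum of the inputs exceeds zero."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Repeated passes over the training set: nudge the weights after each mistake.
for _ in range(20):
    for x, target in training:
        err = target - predict(x)
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        b += lr * err

predictions = [predict(x) for x, _ in training]
print(predictions)
```

After a handful of passes the neuron classifies all four training examples correctly; the same learn-from-error loop, scaled up to billions of connections, is what the abstract's "training" refers to.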

RaPyDLI will reach out to DL practitioners with workshops, both to gather requirements for and to get feedback on its software. Further, it will proactively reach out to under-represented communities with summer experiences and DL curriculum modules that include demonstrations built as “Deep Learning as a Service”.

RaPyDLI will be built as a set of open source modules that can be accessed from a Python user interface but executed interoperably in a C/C++ or Java environment on the largest supercomputers or clouds with interactive analysis and visualization. RaPyDLI will support GPU accelerators and Intel Phi coprocessors and a broad range of storage approaches including files, NoSQL, HDFS and databases. RaPyDLI will include benchmarks as well as software and will offer a repository so users can contribute the high level code for a range of neural networks with benefits to research and education.

 

CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science

            Sponsored by: National Science Foundation

            Abstract:

Many scientific problems depend on the ability to analyze and compute on large amounts of data. This analysis often does not scale well; its effectiveness is hampered by the increasing volume, variety and rate of change (velocity) of big data. This project will design, develop and implement building blocks that enable a fundamental improvement in the ability to support data intensive analysis on a broad range of cyberinfrastructure, including that supported by NSF for the scientific community. The project will integrate features of traditional high-performance computing, such as scientific libraries, communication and resource management middleware, with the rich set of capabilities found in the commercial Big Data ecosystem. The latter includes many important software systems such as Hadoop, available from the Apache open source community. A collaboration between university teams at Arizona, Emory, Indiana (lead), Kansas, Rutgers, Virginia Tech, and Utah provides the broad expertise needed to design and successfully execute the project. The project will engage scientists and educators with annual workshops and activities at discipline-specific meetings, both to gather requirements for and feedback on its software. It will include under-represented communities with summer experiences, and will develop curriculum modules that include demonstrations built as 'Data Analytics as a Service.'

The project will design and implement a software Middleware for Data-Intensive Analytics and Science (MIDAS) that will enable scalable applications with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Further, this project will design and implement a set of cross-cutting high-performance data-analysis libraries: SPIDAL (Scalable Parallel Interoperable Data Analytics Library) will support new programming and execution models for data-intensive analysis in a wide range of science and engineering applications. The project addresses major data challenges in seven different communities: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science, and Pathology Informatics. The project libraries will have the same beneficial impact on data analytics that scientific libraries such as PETSc, MPI and ScaLAPACK have had for supercomputer simulations. These libraries will be implemented to be scalable and interoperable across a range of computing systems including clouds, clusters and supercomputers.

 

 

International Summer School on Data Science for Scattering Reactions

            Sponsored by: National Science Foundation

            Abstract:

This award will provide partial support for a summer school focused on the theory and phenomenology of relativistic scattering and its practical implementations in data analysis of modern experiments in hadron physics. The organizers propose to bring together experts in hadron spectroscopy and interested students, postdocs, staff, and faculty for a series of lectures and laboratory sessions. The lectures given at the school will be recorded and used to develop online resources that can be used for future training purposes.

This Summer School is intended to address a critical need in the community for the training of a new generation of physicists in advanced scattering theory, in response to the specific needs of data analysis at experimental programs at facilities such as the Relativistic Heavy Ion Collider at Brookhaven and the Jefferson Laboratory. The data sets generated at these facilities are of unprecedented quality and will demand a qualitatively new level of sophistication in analysis never before achieved. As such, the summer school will provide a unique opportunity for scientists in the field of hadronic physics to enhance their training by interacting with recognized experts in this field.

 

CAREER: Programming Environments and Runtime for Data Enabled Science

            Sponsored by: National Science Foundation

            Abstract:

This research is at the nexus of the data deluge in science and business and two major computing thrusts - clouds and exascale scientific systems - which are unified with an interoperable runtime system. The project has the potential to transform the approach to applications ranging from data mining of genomic and proteomic data for science to data analytics for business. Computer science areas at the heart of the research - namely the Iterative Map-Collective runtime, fault tolerance, data-computing co-location, and high-level languages - will be advanced. Furthermore, the new applications enabled and the new software paradigms will feed back into the architecture of cloud and exascale systems, possibly suggesting particular storage and communication choices and new directions for the national infrastructure. The investigator will incorporate this novel research into courses and into graduate and undergraduate research experiences, both at Indiana University and with national and international collaborators. The work blends scientific research (computer science and applications) with mainstream commercial practice (clouds). Thus, curricula built around this research will motivate and inspire students entering the workforce, and so it has potential for supporting needed economic development.

The research is based on initial work on Iterative MapReduce with the successful prototypes Twister (on HPC) and Twister4Azure (on clouds). The project will architect and prototype a Discovery Environment for Data-Enabled Science and Engineering with the following components: (1) a next-generation Iterative MapReduce using a Map-Collective model as the runtime for data analysis (mining), interoperable between clouds and clusters; (2) polymorphic Collective operations needed to support parallel linear algebra and other data analysis operations such as those in MapReduce; (3) software message routing using publish-subscribe to scale to tens of thousands of nodes or more; (4) a storage model that builds on current object stores, data-parallel file systems (as in Hadoop), and wide-area models like Lustre, but respects compute-data co-location; (5) a fault tolerance model, implemented as a Collective operation with configurable settings, that supports checkpointing between iterations for robustness and tolerates individual node failures without compromising performance. Later research objectives include security and a higher-level programming model that compiles to an iterative MapReduce runtime.
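The Map-Collective pattern in component (1) can be sketched with k-means clustering, a typical iterative data-mining kernel: each map task assigns its partition of points to the nearest center and emits partial sums, and an allreduce-style collective combines the partials into the new centers every task uses in the next iteration. This is an illustration of the programming model only, not Twister's actual API; the function names and data are invented:

```python
def kmeans_map(points, centers):
    """Map phase: assign this task's points to their nearest center and
    emit partial sums (sum of coordinates, count) per center."""
    partial = {i: [0.0, 0] for i in range(len(centers))}
    for p in points:
        i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        partial[i][0] += p
        partial[i][1] += 1
    return partial

def kmeans_collective(partials, k):
    """Collective phase: allreduce-style combine of partial sums across
    tasks, yielding the new centers."""
    total = {i: [0.0, 0] for i in range(k)}
    for part in partials:
        for i, (s, n) in part.items():
            total[i][0] += s
            total[i][1] += n
    return [s / n if n else 0.0 for s, n in total.values()]

partitions = [[1.0, 2.0], [9.0, 10.0]]  # 1-D data split across two parallel tasks
centers = [0.0, 5.0]
for _ in range(3):                      # the iterative part of the model
    partials = [kmeans_map(part, centers) for part in partitions]
    centers = kmeans_collective(partials, len(centers))
print(centers)
```

In the runtime described above, the collective step is where the polymorphic Collective operations, fault tolerance (checkpointing between iterations), and publish-subscribe message routing all attach.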

 

CSR:Medium:Collaborative Research: An Analytic Approach to Quantifying Availability (AQUA) for Cloud Resource Provisioning and Allocation

            Sponsored by: National Science Foundation

            Abstract:

Cloud computing will significantly transform the landscape of the IT industry and also impact the economy and society in many ways. The reliability and availability of cloud services, affected by various hardware and software component failures, become increasingly critical as government agencies, businesses, and people are expected to rely more and more on these services. The lack of a guaranteed high availability of cloud services and applications is considered by many IT professionals the top concern preventing a successful implementation of cloud services, followed by device-based security and cloud application performance. This project aims to predict the service availability for a given setting, and to design effective resource provisioning and allocation algorithms that guarantee the high availability level required by cloud services. The project is expected to significantly advance the state of the art by offering deep insights into the accurate prediction and cost-effective guarantee of the availability/reliability of cloud services. The outputs from this project can be used to 1) improve service availability, performance, and resource utilization while minimizing the cost of overprovisioning, and 2) reduce huge losses in revenue and productivity due to service outages while enabling new (mission-critical) applications and services.

The existing approaches to ensuring availability are qualitative, in that they use heuristics to duplicate data or restrict the number of virtual machines (VMs) that should be placed in the same rack/server to improve the reliability/availability of cloud services. However, it is essential to be able to quantify availability for a given setting. Quantifying availability for an often finite service duration via analysis (as opposed to measurement or qualitative evaluation) requires transient, instead of steady-state, probability analysis based on a wide range of failure and repair/backup models. This project takes a holistic approach to addressing the open challenges via both rigorous analysis and extensive experiments. More specifically, the project leverages two large-scale HPC/Cloud production systems at the PIs' institutions to generate a rich set of fine-grained data about physical component failures (which is not available in the public domain). The data is then analyzed to build and verify/validate failure models. Based on the failure models, and for a given Infrastructure-as-a-Service (IaaS) request for n virtual machines (VMs), a service duration of t time units, and a desired availability level a < 1, the project develops an analytical model to predict the availability that can be achieved for the service duration t if an additional k backup VMs are allocated. The project also develops cost-effective, multi-objective-optimization-based cloud resource provisioning and allocation algorithms that determine the appropriate value of k (and the placement of these n+k VMs) in order to achieve the required availability level a.
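The flavor of such an analytical model can be sketched under strongly simplifying assumptions that are this sketch's own (independent VM failures, exponential lifetimes, no repair during the service): each VM survives the duration t with probability e^(-λt), and the service is available if at least n of the n+k VMs survive. The function names and parameter values below are invented for illustration and are not the project's actual model:

```python
import math

def availability(n, k, lam, t):
    """Probability that at least n of the n+k VMs survive the full duration t,
    assuming independent exponential failures with rate lam and no repair."""
    q = math.exp(-lam * t)          # per-VM survival probability
    total = n + k
    return sum(
        math.comb(total, j) * q**j * (1 - q)**(total - j)
        for j in range(n, total + 1)
    )

def backups_needed(n, lam, t, a):
    """Smallest k whose predicted availability meets the required level a."""
    k = 0
    while availability(n, k, lam, t) < a:
        k += 1
    return k

# e.g. 10 VMs, failure rate 0.0001 per hour, a 720-hour (one-month) service
print(round(availability(10, 2, 0.0001, 720), 4))
print(backups_needed(10, 0.0001, 720, 0.999))
```

Even this toy model shows the trade-off the abstract targets: each extra backup VM buys a quantifiable increment of availability, so k can be chosen to just meet the required level a instead of overprovisioning.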

 

CAREER: Observing the World through the Lens of Social Media

            Sponsored by: National Science Foundation

            Abstract:

Every day, millions of people across the world take photos and upload them to social media websites. Their goal is to share photos with friends and others, but collectively they are creating vast repositories of visual information about the world. Each photo is an observation of how the world looked at a particular point in time and space. Aggregated together, these photos could provide new sources of observational data for use in disciplines like biology, earth science, social science or history. This project is investigating the algorithms and technologies needed for mining these large collections of photographs and noisy metadata to draw inferences about the physical world. The project has four research thrusts: (1) investigating techniques for identifying and correcting noise in metadata like geo-tags and timestamps, (2) developing algorithms for extracting semantic information from images and metadata, (3) creating methods for robust aggregation of noisy evidence from multiple photos, and (4) validating these techniques on interdisciplinary applications in biology, sociology, and earth science.
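Thrust (3), robust aggregation of noisy evidence, can be illustrated with the simplest possible estimator: taking the per-coordinate median of many photos' geo-tags, which shrugs off a minority of wildly wrong tags (such as an all-zeros camera default) that would badly skew a mean. The coordinates and function name below are invented for this sketch and do not come from the project:

```python
import statistics

def robust_location(geotags):
    """Estimate a landmark's location as the per-coordinate median of the
    photos' geo-tags; resistant to a minority of bad tags, unlike the mean."""
    lats = [lat for lat, lon in geotags]
    lons = [lon for lat, lon in geotags]
    return (statistics.median(lats), statistics.median(lons))

tags = [(39.17, -86.52), (39.16, -86.53), (39.17, -86.51),
        (39.18, -86.52), (0.0, 0.0)]  # last photo carries a bogus geo-tag
print(robust_location(tags))
```

A mean over the same five tags would be dragged roughly a fifth of the way toward (0, 0); the median ignores the outlier entirely, which is the sense in which aggregation over many noisy photos can still yield reliable observations.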

The project is laying the foundation for using visual social media as a new source of observational data for a variety of scientific disciplines. The educational component is preparing students for the next generation of "big data" jobs through new undergraduate and graduate courses and online instructional materials. Undergraduate students (particularly from under-represented groups) are recruited to participate in the research program and encouraged to pursue scientific careers. An annual workshop is planned to educate general audiences, particularly senior citizens, about data mining and social media. Source code, datasets, course materials, and other results of the project will be disseminated to the public via the project web site (http://vision.soic.indiana.edu/career/).

 

Technology Audit and Insertion Services (TAIS) for TeraGrid

            Sponsored by: University at Buffalo

            Abstract:

 

An Open Resource for Collaborative Biomedical Big Data Training

            Sponsored by: University of California, San Diego

            Abstract:

 

Center for Remote Sensing of Ice Sheets

            Sponsored by: University of Kansas (through NASA)

            Abstract:

 

Deployment of CReSIS Radar Instrumentation and Data Management Activities in Support of Operation IceBridge

            Sponsored by: University of Kansas (through NASA)

            Abstract:

 

Visual Analysis for Image Geo-location

            Sponsored by: ObjectVideo, Inc.

            Abstract:

Streaming and Steering Applications: Requirements and Infrastructure

            Sponsored by: National Science Foundation (Division of Advanced Cyberinfrastructure), Department of Energy (Advanced Scientific Computing Research), and the Air Force Office of Scientific Research (AFOSR)

            Abstract:

The workshops STREAM2015 and STREAM2016 cover a class of applications – those associated with streaming data and related (near) real-time steering and control – that are of growing interest and importance. The goal of the workshops is to identify the application and technology communities in this area and to clarify the challenges that they face. We will focus on application features and requirements as well as the hardware and software needed to support them. We will also look at research issues and possible future steps after the workshops. We have surveyed the field and identified application classes including the Internet of People, social media, financial transactions, the Industrial Internet of Things, DDDAS, cyberphysical systems, satellite and airborne monitors, national security, astronomy, light sources, instruments like the LHC, sequencers, data assimilation, analysis of simulation results, and steering and control. We also survey technology developments across academia and industry, where all the major commercial clouds offer significant new systems aimed at this area. The field needs such an interdisciplinary workshop addressing the big picture: applications, infrastructure, research, and futures.

STREAM2015 will be the first in the series of two workshops and will focus on NSF applications and infrastructure. It will be held in Indianapolis, where two full meeting days (October 27-28) will be followed by a report-writing day on October 29, 2015. The second workshop, STREAM2016, will focus on DOE activities and applications as well as follow up on ideas raised in STREAM2015. STREAM2016 will be held in Washington on March 21-24, 2016. We will produce separate reports on the discussions at each workshop, to be completed around two months after each event. The first workshop budget covers travel and meeting expenses for about 30 attendees.

We have identified an organizing committee expanding the core group – Fox (Indiana), Jha (Rutgers), and Ramakrishnan (LBNL) – proposing these two workshops. In selecting attendees we will reach out to underrepresented communities, in particular women and ethnic minorities. The real-time streaming of sessions for STREAM2015 will enhance opportunities for a broad community to engage at the meeting, and we will support questions and comments from remote participants. The web site http://streamingsystems.org will support this workshop and contain the final report, presentations, position papers, archival copies of streamed video, and a repository of useful documents and links.

Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science

            Sponsored by: National Science Foundation

            Abstract: 

            
