Leading cloud providers join with NSF to support data science frontiers
The National Science Foundation (NSF) is providing nearly $30 million in new funding for research in data science and engineering through its Critical Techniques, Technologies and Methodologies for Advancing Foundations and Applications of Big Data Sciences and Engineering (BIGDATA) program.
NSF's awards are paired with support from Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, which have each committed up to $3 million in cloud resources for relevant BIGDATA projects over a three-year period, beginning with this year's awards. A key goal of this collaboration is to encourage research projects to focus on large-scale experimentation and scalability studies.
"NSF's participation with major cloud providers is an innovative approach to combining resources to better support data science research," said Jim Kurose, assistant director of NSF for Computer and Information Science and Engineering (CISE). "This type of collaboration enables fundamental research and spurs technology development and economic growth in areas of mutual interest to the participants, driving innovation for the long-term benefit of our nation."
The BIGDATA program funds novel research in computer science, statistics, computational science and mathematics that seeks to advance the frontiers of data science. The program also supports work on innovative applications that leverage data science advances to enhance knowledge in various domains, including the social and behavioral sciences, education, biology, physical sciences, and engineering.
Data used in research originate from a variety of disparate sources, including scientific instruments, social media, transactions, machine-generated information from the Internet of Things (IoT), administrative systems and large-scale simulations. BIGDATA-enabled innovations will provide new insights derived from the growing wealth of data in myriad fields, leading to improved problem-solving and real-time decision-making.
The new BIGDATA awards will benefit from the new engagement between NSF and leading cloud providers, which is intended to foster innovation and provide a platform for computation, storage and analytics at large scale. Specifically, the collaboration will provide BIGDATA projects with cloud credits that enable access to cloud-based storage and computing.
NSF's collaboration with the technology industry through BIGDATA is vital, especially in the area of data science. In its first year, this collaboration is driving creative and principled approaches to address data management, modeling, and analysis of big data, and applying novel techniques to solve data-intensive domain science and engineering problems. Furthermore, NSF is actively seeking to expand this collaboration through a recently released Dear Colleague Letter.
The awards NSF is announcing today are part of a portfolio of over $100 million in big data and data science research, education, and research infrastructure across the agency in Fiscal Year 2017. The 21 new BIGDATA awards support foundational elements of data science — the theories, techniques and methodologies that use big data to solve problems — as well as the innovative applications that are enabled by these foundational advances. Of these, eight will benefit from additional cloud credits and resources made possible by the new participation by cloud providers.
Examples of this year's awards, including two that are receiving cloud credits, are:
- Scalable and Interpretable Machine Learning: Bridging Mechanistic and Data-Driven Modeling in the Biological Sciences, University of California, Berkeley
While predictive models are an important step in understanding complex systems, equally important is the ability of humans to interpret the results from such models. This project focuses on developing novel, scalable, statistical machine learning algorithms that can effectively guide human decision-making and discovery in biological systems. Leveraging the cloud computing resources provided via the BIGDATA program, and using Apache Spark, these approaches and methods will be implemented for massive datasets in genomics and beyond.
- Predictive Analytics of Driver's Engagement for Injury Prevention, Drexel University; Virginia Tech; Children's Hospital of Philadelphia
This project aims to develop scalable predictive analytics that can detect driver disengagement and provide alerts capable of reducing motor vehicle crashes. Utilizing cloud computing resources provided via the BIGDATA program, researchers will analyze the federal Strategic Highway Research Program 2 (SHRP 2) dataset, a publicly available repository of driving patterns consisting of two petabytes of heterogeneous data spanning more than 100 variables. The analysis will use instance-based learning and heterogeneous network mining, together with a novel distributed computing infrastructure.
- Foundations of Responsible Data Management, Drexel University; University of Washington; University of Massachusetts, Amherst; University of Michigan
Responsible data management, which includes fairness and the related concepts of representativeness and diversity, transparency and accountability, and data protection, is often considered only in the final step of data analysis, such as data mining and machine learning. The objective of this project is to develop the conceptual frameworks and algorithmic techniques to support responsible data management through all stages of the data lifecycle, from data discovery and acquisition to cleaning, integration, querying and analysis.