Clemson scientists receive $2.95M to improve and simplify large-scale data analysis
CLEMSON, South Carolina — Clemson University scientists Alex Feltus and Melissa Smith have received a $2.95 million collaborative award from the National Science Foundation to develop cyberinfrastructure aimed at providing researchers around the nation and world with a more fluid and flexible system of analyzing large-scale data.
Biologists, hydrologists, computer engineers and computer scientists will join forces with Feltus and Smith to design a system called Scientific Data Analysis at Scale (SciDAS). Their goal is to help current researchers and future innovators discover data, move it smoothly across advanced networks and gain more flexible, accessible use of national and global resources.
Feltus is the principal investigator of the three-year project. Smith is a co-principal investigator, along with Claris Castillo and Ray Idaszak of Renaissance Computing Institute (RENCI) at the University of North Carolina, Chapel Hill; and Stephen Ficklin of Washington State University in Pullman.
"A key aspect of the SciDAS team is that we'll be processing scientific data at the same time that we're gluing together all the parts needed for a national cyberinfrastructure ecosystem," said Feltus, associate professor of genetics and biochemistry in Clemson University's College of Science. "We're trying to avoid the problem of 'if you build it they will come' and instead enlist the input of a variety of scientists to join us on the ground floor and help us build it. Thus, our software will be refined by using real data by real users with real habits."
Scientific discovery has become increasingly dependent on terascale (one trillion floating-point operations per second) and even petascale (one quadrillion per second) data processing that only the world's fastest supercomputers can perform. Fortunately, years of significant and strategic support from the public and private sectors have created a distributed computational ecosystem to help meet these extraordinary demands. Available resources include high-speed networks like Internet2, open source scientific software packages, supercomputers in national labs, campus supercomputers, commercial cloud providers and deep data repositories like the National Center for Biotechnology Information. The Internet2 cyberteam will assist the research team in optimizing end-to-end data transfer rates.
"Many fields are awash with huge datasets. This is certainly true of biology and hydrology, but it also includes researchers who are studying satellite imagery, remote sensors and education analytics, to name a few," Feltus said. "Today's scientists are now required to understand both the underlying science and the cyberinfrastructure ecosystem to design and execute mind-bogglingly complex computations. SciDAS will combine new software with existing software to construct a system that will be efficient, practical and user-friendly."
SciDAS will enable a broad range of scientists not only to get information faster, but also to use much larger datasets and tease out information that they might not even know exists.
"The need for large data computing brings new challenges for scientists to be able to use complex systems efficiently and effectively," said Smith, associate professor in the Holcombe Department of Electrical and Computer Engineering in Clemson's College of Engineering, Computing and Applied Sciences.
"My specialty is in computing architectures, application optimization and machine learning. Using these tools and techniques, we're going to be building an infrastructure that is easier for data scientists to manage. We have a good body of software and data repositories already in place that have been individually tried and tested. We're going to bring these components together and make their use seamless for the scientist across existing cyberinfrastructure and also cyberinfrastructure that will be available in the future."
On a technical level, SciDAS will combine access to multiple national cyberinfrastructure resources, including NSF Clouds, the Open Science Grid, the Extreme Science and Engineering Discovery Environment, petascale supercomputers such as COMET, and a variety of nationwide university resources such as Clemson's Palmetto Cluster. The distributed and scalable nature of both the data-sharing and the computing infrastructure will be exploited to boost workflow performance and scientific productivity.
"Given the huge problems and opportunities at play in the 21st century, we intend to speed up the discovery process and complex end-to-end data analysis process through a tight coupling of science and cyberinfrastructure experts," Feltus said. "This is not about making one-size-fits-all software. Rather, we'll be binding together the national cyberinfrastructure ecosystem to focus real data of interest to practicing scientists."
RENCI will lead the effort to integrate existing cyber tools and technologies into the new SciDAS infrastructure that will be designed to support all aspects of distributed, data-driven research. Development of the SciDAS framework will involve integrating a number of NSF-funded cyberinfrastructure systems into one package.
"We will build on successful cyberinfrastructure projects developed here at RENCI, most of them with funding from the National Science Foundation," said Castillo, a senior computational and networked systems researcher at RENCI. "Through NSF support, RENCI has developed a number of tools and environments that make science more productive. SciDAS will integrate those tools and work environments into a unified cyberinfrastructure tailored to support science applications at scale. It is a win for scientists and a way to extend the value of our funded projects."
Ficklin, a computational biologist with the department of horticulture at Washington State University, will demonstrate the effectiveness of SciDAS by building gene co-expression networks for plants, animals, insects and people as a use-case for systems biology.
This data-intensive project, which maps the interactions of tens of thousands of genes in organisms, could help farmers breed new crops using traditional methods or aid scientists in finding new genes that influence plant and animal health.
"In the end, we will create the most complete repository of gene co-expression networks that exists anywhere," Ficklin said. "Improving our cyberinfrastructure helps make our country more competitive in research. It keeps us in the forefront of data science."
RENCI communications director Karen Green and Washington State communications coordinator Seth Truscott contributed to this article.
Internet2 is a nonprofit, member-driven advanced technology community founded by the nation's leading higher education institutions in 1996. Internet2 serves more than 94,000 community anchor institutions, 317 U.S. universities, 70 government agencies, 43 regional and state education networks, more than 900 InCommon participants, 78 leading corporations working with our community, and 61 national research and education network partners that represent more than 100 countries. Internet2 offices are located in Ann Arbor, Michigan; Denver; Emeryville, California; Washington, D.C.; and West Hartford, Connecticut. For more information, visit http://www.internet2.edu or follow @Internet2 on Twitter.
This material is based upon work supported by the National Science Foundation under Grant No. 1659300. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF. The exact amount of the grant is $2,952,217.