Three distinct problems in data science — trend identification in graphs, the quantitative study of scientific literature and evaluation of single-cell genomics — will all be addressed by new research in large-scale network analytics, jointly led by Distinguished Professor David Bader at New Jersey Institute of Technology.
The problems share a common challenge: finding patterns, a task known as community detection, within extremely large datasets. The work is funded by a $648,000 National Science Foundation grant, Cyber-Infrastructure for Community Detection, Extraction, and Search in Large Networks. Bader, a principal investigator, is receiving $250,000 for his research. He is working with George Chacko and Tandy Warnow, both of the University of Illinois Urbana-Champaign.
The next step is to develop new algorithms that identify clusters within such networks. Bader’s role is scalability and performance. He brings Arachne, open-source software that he and colleagues have worked on since 2022. It is designed to organize graphs with trillions of vertices and edges while presenting a Python interface that almost anyone can learn. Chacko is working on project benchmarking, and Warnow is developing new algorithms and interoperability with other methods.
The team plans to test the new methods across several applications. In single-cell genomics, that will involve “clustering very large networks whose vertices represent cells and where the weight of an edge between two vertices is computed based on gene expression profiles,” the team explained.
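To make that description concrete, here is a minimal toy sketch in plain Python. It is not the team's method: the cell names and expression values are invented, cosine similarity is only one assumed way to turn expression profiles into edge weights, and grouping cells by connected components over a thresholded graph is a deliberately simple stand-in for a real community detection algorithm.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression profiles."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_cell_graph(profiles, min_weight):
    """Return {(cell_u, cell_v): weight} for pairs at or above min_weight."""
    edges = {}
    for u, v in combinations(sorted(profiles), 2):
        weight = cosine_similarity(profiles[u], profiles[v])
        if weight >= min_weight:
            edges[(u, v)] = weight
    return edges


def connected_components(nodes, edges):
    """Group nodes into clusters; a simple stand-in for community detection."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    clusters = []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical expression profiles: one list of gene measurements per cell.
profiles = {
    "cell_a": [9.0, 8.0, 0.5, 0.0],
    "cell_b": [8.5, 9.0, 0.0, 0.5],
    "cell_c": [0.0, 0.5, 7.0, 9.0],
    "cell_d": [0.5, 0.0, 8.0, 8.5],
}

edges = build_cell_graph(profiles, min_weight=0.9)
clusters = connected_components(profiles, edges)
print(clusters)
```

The sketch compares every pair of cells, so its cost grows quadratically with the number of cells. That is fine for four cells and unworkable at the scale the project targets, which is where the scalability work comes in.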
But it’s scientometrics that has Bader most excited. This is a meta-field where researchers study how to mathematically evaluate the impact of papers and citations.
“A lot of information can be learned by looking at the metadata of published literature,” said Bader, director of NJIT’s Institute for Data Science. “For instance, looking at which papers cite others, which authors co-published together, and so on. And when there are questions emerging, like who are the experts on the following topic, or what are the papers I need to read to really get an understanding of a particular field, that’s where scientometrics comes into play.”
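As a rough illustration of the citation metadata Bader describes, the toy sketch below treats "which papers cite others" as a list of directed pairs and ranks papers by how often they are cited. The paper identifiers are invented, and a raw citation count is only one simple, assumed measure; it does not represent the project's algorithms or the Arkouda or Arachne interfaces.

```python
from collections import Counter

# Hypothetical citation records: (citing_paper, cited_paper).
citations = [
    ("paper_B", "paper_A"),
    ("paper_C", "paper_A"),
    ("paper_C", "paper_B"),
    ("paper_D", "paper_A"),
    ("paper_D", "paper_C"),
]


def citation_counts(pairs):
    """Count how many times each paper is cited (its in-degree)."""
    return Counter(cited for _, cited in pairs)


def most_cited(pairs, top_n):
    """Return the top_n (paper, count) pairs, ties broken by paper name."""
    counts = citation_counts(pairs)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


ranking = most_cited(citations, top_n=2)
print(ranking)
```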
“Right now, the data sets for scientometrics are very, very large in size, and current tools, for instance, NetworkX in Python, or other such packages, don’t scale to the size problems needed by the scientometrics community. So what this grant is enabling is new algorithms and new implementations that are scalable to be able to solve their problems. And so this is where we’re going to pick up Arkouda and Arachne to work on them,” he noted, referring to his own prior research.
So far, “There are prototypes for some of the algorithms that [Warnow] has developed, and they work on small test cases. So the idea is, we essentially have a proof of concept, but there’s sequential implementations, there’s slow implementations and our challenge is going to be able to get them to scale to the problem sizes that the scientometrics community needs for their data sets, and to be able to make it run fast.”