Future-proofing ‘big data’ biological research depends on good digital identifiers

Credit: Julie McMurry and Lilly Winfree from the Monarch Initiative.

"Big data" research runs the risk of being undermined by the poor design of the digital identifiers that tag data. An international group of researchers, led by Julie McMurry at Oregon Health & Science University, has assembled a set of pragmatic guidelines for creating, referencing and maintaining web-based identifiers to improve reproducibility, attribution, and scientific discovery. The guidance, publishing June 29 in the open access journal PLOS Biology, helps address the frequent problems associated with persistent identifiers linked to scientific data.

Over the past decade, the life sciences have changed drastically as data have become larger, more interdependent and natively web-based. In this landscape, the broader scientific research community has struggled to engineer these data for the web so that they are persistently accessible, reusable and attributable.

Depending on the individual database involved, identifiers can signify a gene, a genome, a chemical, an organism, a set of experimental data, or even a published article. The usefulness of all these items depends on the robustness and uniqueness of their respective identifiers, enabling them to be linked and discovered in perpetuity. The authors point out that the organic way in which most identifiers have arisen threatens that usefulness, and recognise that it is difficult to create and sustain persistent identifiers or web addresses that won't break and that are used consistently.

This work calls on professionals to do a better job of identifier engineering – according to emerging community-developed conventions – so that data can be utilized more effectively for scientific discovery. It also calls on users to be aware enough of these conventions, and of available tooling, to not get burned by broken links and missed connections.

"As with plumbing fixtures, the question of how identifiers work should only need to be understood by those that build and maintain them. However, everyone needs to know how identifiers should be used, and this is where convention is important," said McMurry. "Through this work, we hope to encourage all participants in the scholarly ecosystem – including authors, data creators, data integrators, publishers, software developers, and resolvers – to adhere to best practice in order to maximize the utility and impact of life science data."

###

In your coverage please use this URL to provide access to the freely available article in PLOS Biology: https://doi.org/10.1371/journal.pbio.2001414

Press-only preview: http://plos.io/2t1XfVR

Contact: Julie McMurry, [email protected]

Citation: McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, et al. (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. https://doi.org/10.1371/journal.pbio.2001414

Image Caption: In the life sciences, if individual communities think about identifiers at all, it is usually in the context of a single database 'hub' and a variety of cross-referenced 'spokes', or an aggregation of these; however, the real complexity of the inter-relationships is often overlooked, and with it the importance of persistent identifiers to hold everything together. Identifier issues such as broken links undermine the flow and integrity of data for data providers and consumers alike.

Funding: NIH https://taggs.hhs.gov/Detail/AwardDetail?arg_AwardNum=R24OD011883&arg_ProgOfficeCode=205 (grant number R24OD011883 "Monarch Initiative"). Received by JA McMurry, CJ Mungall, MA Haendel, NL Washington. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

NIH https://taggs.hhs.gov/Detail/AwardDetail?arg_AwardNum=U41HG007822&arg_ProgOfficeCode=55 (grant number U41HG007822 "UniProt"). Received by MJ Martin. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

NIH https://taggs.hhs.gov/Detail/AwardDetail?arg_AwardNum=U24AI117966&arg_ProgOfficeCode=104 (grant number U24AI117966 "bioCADDIE"). Received by SA Sansone, A Gonzalez-Beltran, P Rocca-Serra, J McMurry, J Grethe, L Winfree, C Mungall, T Conlin, M Dumontier. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

NIH https://taggs.hhs.gov/Detail/AwardDetail?arg_AwardNum=U54AI117925&arg_ProgOfficeCode=104 (grant number U54AI117925 "CEDAR"). Received by M Dumontier, SA Sansone, A Gonzalez-Beltran, P Rocca-Serra. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

NIH https://taggs.hhs.gov/Detail/AwardDetail?arg_AwardNum=P41HG002273&arg_ProgOfficeCode=55 (grant number NHGRI P41HG002273-09 "Gene Ontology Consortium"). Received by CJ Mungall. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Department of Energy Received from the Director, Office of Science, Office of Basic Energy Sciences http://science.energy.gov/bso/contract-management/ (grant number DE-AC02-05CH11231). Received by CJ Mungall, NL Washington. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The Drug Disease Model Resources http://www.imi.europa.eu/content/ddmore (grant number 115156 "Innovative Medicines Initiative"). Received by C Laibe. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The European Commission http://cordis.europa.eu/projects/675728 (grant number 675728 "BioMedBridges project"). Received by JA McMurry, T Burdett, N Juty, S Jupp, C Morris. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The European Commission http://cordis.europa.eu/projects/312455 (grant number 312455 "Infrastructure for Systems Biology–Europe ISBE"). Received by N Juty, H Hermjakob, C Goble. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The European Commission http://cordis.europa.eu/projects/654248 (grant number 654248 "Coordinated Research Infrastructures Building Enduring Life-science services"). Received by C Goble. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The European Commission http://cordis.europa.eu/projects/601043 (grant number 601043 "DIACHRON"). Received by S Jupp. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The European Commission https://www.elixir-europe.org/about-us/how-funded (grant number "ELIXIR core funding"). Received by N Blomberg, R Jimenez. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/L005069/1 (grant number BB/L005069/1 "ELIXIR-UK, Oxford"). Received by SA Sansone, A Gonzalez-Beltran, P Rocca-Serra. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/M013189/1 (grant number BB/M013189/1 "DMM Core"). Received by C Goble, J Snoep, N Stanford. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/K019783/1 (grant number BB/K019783/1 "Continued development of ChEBI"). Received by N Swainston. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BBS/E/B/000C0419 (grant number BBS/E/B/000C0419 "A systems approach to understanding lipid, Ca2+ and MAPK signalling networks"). Received by N Le Novère. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/M006891/1 (grant number BB/M006891/1 "EMPATHY"). Received by N Swainston. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/M017702/1 (grant number BB/M017702/1 "SYNBIOCHEM"). Received by N Swainston, A Williams, D Fellows. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

BBSRC http://www.bbsrc.ac.uk/research/grants-search/AwardDetails/?FundingReference=BB/L005050/1 (grant number BB/L005050/1 "ELIXIR-UK, Manchester"). Received by SA Sansone, A Gonzalez-Beltran, C Goble. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Media Contact

Julie McMurry
[email protected]

http://www.plos.org

Related Journal Article

http://dx.doi.org/10.1371/journal.pbio.2001414
