The SciDataCon 2025 Programme is now published.

13–16 Oct 2025
Brisbane Convention & Exhibition Centre
Australia/Brisbane timezone

Co-Designing AI Readiness: CODATA’s Call to the Global Data Community

14 Oct 2025, 12:36
11m
Plaza Terrace Room (Brisbane Convention & Exhibition Centre)

Presentation
Rigorous, responsible and reproducible science in the era of FAIR data and AI
Presentations Session 3: Rigorous, responsible and reproducible science in the era of FAIR data and AI

Speakers

Christine Kirkpatrick (San Diego Supercomputer Center / CODATA)
Mercè Crosas (Barcelona Supercomputing Center)
Simon Hodson (CODATA)
Tyng-Ruey Chuang (Academia Sinica, Taiwan)

Description

Rapid advances in AI technology have the potential to ease research data management challenges or to speed research into them. The CODATA and WDS communities are already developing ways to leverage AI for data stewardship. Much of the research data community is dedicated to FAIR implementation, and FAIR has acquired a colloquial second meaning, ‘Fully AI Ready’. As a result, there is conflation and confusion about which of the FAIR Principles lead to data that can be consumed by AI, or ‘AI Ready Data’. The data community had just begun to confront these questions as they relate to machine learning (ML) when it saw the emergence of deep learning technologies based on Large Language Models (LLMs) and Generative AI (Gen-AI). An even more recent development, the Model Context Protocol (MCP), brings a way to create agents and workflows on top of interactive LLMs. This provides great opportunities, such as simpler ways to work with external data and LLMs. At the same time, the introduction of MCP raises open questions about how best to harness these capabilities in data stewardship practices.

There are wide gaps in understanding and in ‘gut checks’ between the FAIR/research data community and the computer science community. For example, it is a closely held truth in computer science that more data is better for ML, and that a lack of data quality can be overcome by having ‘enough’ training data. In the research data community, a common assumption is that FAIR data means ‘AI Ready data’. We will discuss these assumptions and examine community norms and tenets in the light of existing scholarship (publications) and the state of the art (current best practices). This sets the stage for practical advice that can be used by data stewards, researchers, decision makers allocating resources, funders, and policymakers.

The term ‘AI Ready Data’ suggests scenarios in which datasets have been prepared and are ready to use for various AI systems and applications. There is less discussion, however, of whether and how to shape research data workflows to meet general AI needs. In addition, trustworthy AI can only be driven by trustworthy data. Issues of data provenance, integrity, and measures of data quality will only become more pressing in the face of rapid data turnover and changing workflows. Going further, we can view research datasets not just as end products to be consumed by AI, but as carriers of information in collaborative research networks aided by AI.

This presentation is CODATA’s call to the global data community to critically examine the roles of data in AI for the advancement of science. We shall illuminate the opportunities and challenges brought by AI as they relate to the CODATA community and its current initiatives. Topics include a level-setting discussion and an explanation of the key sections and recommendations of an upcoming concept paper. The aim is to solicit feedback and to ask participants to help set priorities, both through interactive polls and by collecting use cases of AI for data and data for AI.

The recommendations and future work should serve as a source for focusing new momentum on understudied and underdeveloped areas at the intersection of AI and data, while reorienting existing projects.

Primary authors

Christine Kirkpatrick (San Diego Supercomputer Center / CODATA)
Mercè Crosas (Barcelona Supercomputing Center)
Simon Hodson (CODATA)
Steven McEachern (UK Data Service)
Tyng-Ruey Chuang (Academia Sinica, Taiwan)

Presentation materials

There are no materials yet.