Frequently asked questions about data curation - Digital Humanities Data Curation

What is data curation?
We define data curation as the active and ongoing management of data through its lifecycle of interest and usefulness to scholarly and educational activities across the sciences, social sciences, and the humanities (Cragin et al. 2007). Data curation activities enable data discovery and retrieval, maintain data quality, add value, and provide for re-use over time. This new field includes representation, archiving, authentication, management, preservation, retrieval, and use. See the Introduction to Humanities Data Curation article for more detail.

Whose responsibility is data curation?
Responsibility for data curation rests, in different ways, on a number of different professional roles. Increasingly, within the library, data curation responsibilities are being associated with specific jobs (with titles like “data curator” or “data curation specialist”), and the rise of specialized training programs within library schools has reinforced this process by providing a stream of qualified staff to fill these roles. At the same time, other kinds of library staff have responsibilities that may dovetail with (or even take the place of) these specific roles: for instance, metadata librarians are strongly engaged in curatorial processes, as are repository managers and subject librarians who work closely with data creators in specific fields. Finally, it is increasingly being recognized that the data creators themselves (faculty researchers or library staff) have a very important responsibility at the outset: to follow relevant standards, to document their work and methods, and to work closely with data curation specialists so that their data is curatable over the long term.

Where does data curation fit into the data creation process?
Data curation and data creation are closely linked. It is now commonplace to observe that data curation starts with the data creation process. But even if we distinguish these processes, the sooner data curation starts, the more effectively it can operate to ensure that data creation processes are carefully documented and that appropriate standards and best practices are followed.

How is data curation different from data storage?
Data storage is confined to keeping data in existence and ensuring that it can be accessed when needed. It does not necessarily entail practices of refreshment or format migration (essential to maintaining the data in a usable form), nor does it entail higher-level curatorial practices such as enhancement of the data through added metadata, or migration from one representational standard to another. Data curation thus goes far beyond the scope of data storage.

Are data curation practices subject-specific?
Yes, they often are, although some data curation practices may apply very broadly (for instance, the creation and enhancement of metadata). Subject-specific practices might include migrating data from one standard to another (which might require detailed knowledge of subject-specific methods of data representation or familiarity with a subject-specific schema), or the creation of subject-specific metadata, such as topical keywords or identification of named entities.

Where, institutionally, should data curation be managed?
The institutional library is often a natural place for data curation activities, especially if the library is also responsible for curatorial infrastructure such as an institutional repository. The library also often serves as a hub that connects with all departments, which would help ensure consistency of data curation practices across the institution. However, in many institutions the IT organization may also have a crucial role to play, either as the chief entity responsible for managing data curation, or as a partner with the library.

What data and file formats pose the greatest challenges for data curation?
Proprietary data and file formats pose significant challenges for data curation simply because they may not remain current, and the tools and software necessary to work with these formats may become inaccessible (for reasons of cost or obsolescence). Any data or file format that undergoes repeated, substantive changes is also challenging because data submitted over time is likely to vary, and will probably require frequent updating to maintain currency and consistency.
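
As a rough illustration of how a curator might begin to assess this risk, the following Python sketch inventories a collection by file extension and flags files in formats treated as at risk. The extension list and the function names (format_inventory, flag_at_risk) are hypothetical and chosen only for this example. A real assessment would rely on the institution's own format policy rather than on file extensions alone.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical, illustrative list of extensions treated as at risk.
# This is an assumption for the example, not an authoritative registry.
AT_RISK_EXTENSIONS = {".doc", ".wpd", ".fla"}


def format_inventory(filenames):
    """Count files by lowercased extension."""
    return Counter(PurePath(name).suffix.lower() for name in filenames)


def flag_at_risk(filenames, at_risk=AT_RISK_EXTENSIONS):
    """Return, in sorted order, the names whose extension is in the at-risk set."""
    return sorted(
        name for name in filenames
        if PurePath(name).suffix.lower() in at_risk
    )


if __name__ == "__main__":
    collection = ["letters.DOC", "corpus.xml", "notes.txt", "draft.wpd"]
    print(format_inventory(collection))
    print(flag_at_risk(collection))
```

Running an inventory like this at regular intervals would give a simple way to notice when submitted data begins to drift toward formats that need migration.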