Data and Research Design

The quality of data management affects the quality of data and, to a large extent, determines the adequacy of research findings. The same holds at a more general level, namely for building a good environment for research work. Systematic data management is, above all, a necessary condition for preventing errors and false findings, but it can also save a lot of time and make research work both clearer and easier. Stories about ridiculous mistakes caused by analysing incorrect data abound among researchers; many analyses have failed because of flawed datasets; and numerous researchers have spent long hours disentangling poorly documented data. Much research information has been lost due to disorder in datasets and a lack of effort in their documentation.

Demands on data management are higher than in independent research work, especially when we plan the long-term preservation of data and their sharing between different research teams. In recent years, archiving social science data and providing open access to them for secondary analysis have become highly important parts of scientific work. Consequently, demands on professional data management have also risen.

Digital data in social sciences (see Mantra)

  • Documents (text, Word), spreadsheets
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audiotapes, videotapes
  • Photographs, films
  • Test responses
  • Slides, artefacts, specimens, samples
  • Collection of digital objects acquired and generated during the research
  • Data files
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols

The Data Life Cycle

The requirements of data sharing have changed the functions of data management. When a dataset is being produced, one must expect that it will be archived and published without knowing who will use the data or for what purposes. This naturally affects the preparation of datasets, the safeguards taken during that process and the requirements of data documentation. Moreover, not only individual tasks but also the entire concept of data management is affected.

Research takes the form of a cycle where the results of one research study feed back into the research process as background for other research studies. In an environment characterised by open access to data, secondary analysis of research data plays an important role in this cycle. This gives data a new function in the dissemination and reproduction of knowledge, and this function must be reflected in the ways data are managed. The management of digital information is incorporated into a cyclical system of creating scientific knowledge. [see e.g. Humphrey 2006: e-Science and the Life Cycle of Research]

Data life cycle management represents a comprehensive set of methods. It relies on a model of how data are used across the individual stages of the life cycle, each with its own goals, functions and actors. Relations between the model’s elements are determined by the life cycle of scientific knowledge.

Such approaches are applied, for example, in the UK Data Archive’s data management guidelines and by the ICPSR consortium in the United States.

Data and Research Design

There are at least five good reasons to pay attention to data management from the very start of planning a proposal:

  1. Existing databases can often be used as sources of empirical data, replacing or supplementing our own data collections. Moreover, a combination of several data sources may make comparisons over time or across countries possible.
  2. Such databases and especially their documentation can also be used in preparing the research instruments and the design of a new survey, in replicating or verifying some procedures, or as inspiration for some tasks of research planning.
  3. Data processing is subject to a series of formal and legal conditions that must be met in order to implement a research study, especially in the fields of personal data disclosure and copyright.
  4. Systematic, well-prepared dataset management will improve data quality and thus the reliability of research results.
  5. Data operations are not free, and their costs should be considered in budgeting.

Reviewing data sources

Every proposal for empirical research should start not only by reviewing the relevant literature but also by carefully reviewing available data sources. This is equally important for studies that are based on their own data collection, because besides data one can also utilise the accompanying information about methodologies, procedures and research instruments from prior research. For that purpose, researchers planning a new proposal should always browse data archives and other scientific data sources for the following information:

  • the availability of any data relevant to their research questions,
  • the availability of documentation on the relevant data and research projects,
  • the complementarity and quality of the relevant data.

In addition, when planning a new survey:

  • the availability of research instruments and information on specific methodological procedures used in similar research projects.

Data management planning

Data management takes place in several phases of a research study and includes numerous, often interrelated, elements and processes. Omitting any of them might have significant negative effects on the usefulness of a database and on the course and results of the research process. For these reasons, we should proceed in a systematic and planned manner.

Many funding agencies require researchers to include a formal data management plan in their grant application and use it to assess and check the project’s compliance with the principles of open access to data. This is the practice of the US National Science Foundation (NSF), the UK Research Councils and many other funding agencies.

Whether or not a funding agency requires it, the plan can be drafted according to an established model that provides a checklist of items that should not be omitted from the plan.

Budgeting

Data management entails financial expense, and this should not be ignored in project budgeting. The UKDA has prepared a special instrument for planning data management expenses: UK Data Service – Data management costing tool and checklist. In it, budgeting covers the following activities, which may be relevant for different projects:

  • obtaining informed consent
  • anonymisation
  • data security and access (unauthorised access, personal data protection)
  • digitisation
  • transcription (e.g. of interviews)
  • formatting and organising files (formatting and changes in the arrangement of databases)
  • data labelling and coding
  • cleaning
  • data context description
  • documentation (obtaining documentation during or after the process)
  • metadata (creating data description/documentation)
  • file format (costly conversion of audiovisual data etc.)
  • planning, distribution of roles and responsibilities (collaboration between multiple institutions etc.)
  • operationalising (data management planning and implementation)
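
The activity checklist above can be turned into a rough budget estimate. The following is a minimal sketch in Python, not the UK Data Service costing tool itself. The activity names come from the list, while the planned hours, the hourly rate and the function name `estimate_costs` are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: estimate data management costs per activity.
# The hours and the hourly rate below are invented for illustration only.

def estimate_costs(hours_by_activity, hourly_rate):
    """Return (cost per activity, total cost) for the planned hours."""
    costs = {
        activity: hours * hourly_rate
        for activity, hours in hours_by_activity.items()
    }
    return costs, sum(costs.values())


# Assumed effort in hours for a few activities from the checklist.
planned_hours = {
    "anonymisation": 20,
    "transcription": 60,
    "cleaning": 15,
    "documentation": 25,
}

costs, total = estimate_costs(planned_hours, hourly_rate=30)
for activity, cost in costs.items():
    print(f"{activity}: {cost}")
print(f"total: {total}")
```

Listing each activity separately, rather than entering a single lump sum, makes it easier to see which items drive the cost and to leave out activities that are not relevant for a given project.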

Prepared by: Jindrich Krejci, 2013