Data file structure
Data files may have different internal structures, and a research study may encompass several data files in different relations to one another. The structure of a data file is also determined by the formatting and arrangement of variables. The choice of file structure largely depends on the requirements of the software we use. At the same time, the structural decisions that must be taken when a data file is created define the possibilities of later data processing and analysis; once the structure has been filled with data, any changes to it are usually laborious and costly. The following points should therefore be considered in advance:
- Clarify the unit of analysis and define database records in line with the structure.
- Take into consideration our analytical objectives, as well as the methods and software we are going to apply. Special attention should be paid to any specific analytical procedures and/or specific software applications for implementing them.
- Consider the relations between variables including assumptions about variables that will be created during data entry.
- Consider possible follow-up surveys and related external data sources, their structure, and possible compatibility needs. We may add new data to our collection, create cumulative files, or build interconnections with other databases and information sources. We will probably create different versions of the file in the course of our own analysis as well.
- Consider the number of variables, the number of cases, or the total size of the database.
Even when using very recent technologies, we may be faced with limits to available computing or storage capacity. Another issue is the blank cells in the file, which are multiplied geometrically in some file structures, resulting in the excessive growth of the size of the file. Given the structure and variable format selected, we often reserve a large space in the file for information whose slots will remain blank. An example of a structure with large size requirements is a hierarchically structured database of households which foresees up to eight members for each household and defines variables and positions in the file for each of those members. Most households have less than eight members, while the variables and positions remain in the file even if they are blank.
The following basic types of data files are distinguished in terms of structure [see, e.g., ICPSR 2009: 26–27]:
- Flat file: the data are organised in long rows, variable by variable. An ID number usually comes first. If variable values are organised column by column, we obtain a rectangular matrix. This is the simplest structure. For example, SPSS system files consist of one rectangular data matrix, accompanied by variable and value labels.
- Hierarchical file: The file contains higher-order and lower-order records that are arranged in a hierarchical structure, i.e. several lower-order units may be linked to one higher-order unit. Such a structure may be used, for example, for household surveys where data on the household are recorded at one level and data on household members at another level. Database applications such as MS Access or dBase often structure data in this way.
- Relational database: This is a system of data matrices and defined associations between them. For example, in a household survey, information about household members may be recorded in independent matrices that are interconnected by means of a household ID or a more complex parameter that represents not only the sharing of a household but also the type of family relationship between household members. For instance, users can search for rows with equal attributes in this type of database. Relational databases may also serve as a basis for creating files adapted for individual exercises by combining information from different matrices.
If we wanted to use a flat file for the above-mentioned household data exercise, we could add to the file a household ID variable and then organise records about the respondent and other members of his/her household row by row. This would create a set of individuals. Another possibility would be to organise records for all household members in long rows, which would create a set of households. Another alternative would be to create several interconnectable files.
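As a minimal sketch of the reshaping described above (using Python/pandas; the household ID and member variables are invented), a person-level file can be pivoted into a household-level file and back:

```python
# A sketch of the flat-file variants described above, using pandas.
# Column names (hh_id, person_no, age) are illustrative, not from any real survey.
import pandas as pd

# Person-level ("set of individuals"): one row per household member.
persons = pd.DataFrame({
    "hh_id":     [1, 1, 2, 2, 2, 3],
    "person_no": [1, 2, 1, 2, 3, 1],
    "age":       [42, 40, 35, 33, 5, 67],
})

# Household-level ("set of households"): one row per household,
# members arranged side by side in long rows (age_1, age_2, ...).
households = (
    persons
    .pivot(index="hh_id", columns="person_no", values="age")
    .add_prefix("age_")
    .reset_index()
)
print(households)

# Back to a set of individuals: melt the wide file and drop the blank
# slots that the wide structure reserves for non-existent members.
individuals = (
    households
    .melt(id_vars="hh_id", var_name="member", value_name="age")
    .dropna(subset=["age"])
)
print(individuals)
```

Note how the household-level variant reserves blank slots for members who do not exist, which is exactly the source of file-size growth mentioned above.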
The requirements and abilities of our software often dictate whether we create one compact but complicated and sizable file, or several simpler and smaller interconnectable files. In addition, large surveys often use several different databases. These may consist, for instance, of files from different waves of a survey, of databases with different types of research units, or, to give a concrete example, of data from the main questionnaire, different supplements, contact forms, contextual data etc. As a result, decisions about them are more complicated. A specific situation exists for data obtained from continuous surveys or several different sources. We should always define reliable and unique identifiers of records in different files so that we can interconnect files where links exist between them.
Variables
Variable names
The file’s structure is further shaped by the arrangement and labelling of variables. There are relationships between variables, e.g. an original and a derived variable, or sets of variables linked to the measurement of the same phenomenon. At the same time, there are links to other elements of the research study such as the questionnaire, data source, or another dataset. Besides variables which comprise directly measured data, there are also derived variables. All of this should be reflected in the location of variables in the data file and in their labelling, in order to help users better understand the contents of a database, orient themselves in it during analysis, and avoid mistakes.
Most data files also include auxiliary variables which facilitate orientation and management, ensure integrity, or are necessary for some analyses. As a rule, we should include a unique identifier of cases in the file and place it at the very beginning of the file. Other variables may help us distinguish between different sources of information, methods of observation, temporal or other links. Yet others may provide information about the organisation of data collection, such as interviewer ID or interviewing date, or distinguish cases which belong to various groups. It is absolutely necessary for analysis to distinguish data that result from overrepresentation sampling strategies, different survey waves etc., especially if the groups of cases they distinguish are to be analysed in different ways.
The naming and labelling of variables is as important as their location in the data file. Since variable names are also used as calling codes in software operations, they should be kept short and respect the usual requirements of standard software. Although the limits of contemporary software are usually looser, adherence to standards and avoidance of software-specific arrangements are necessary for transferability. Therefore, variable names should be no longer than eight characters, should start with a letter rather than a number or another character such as a question mark or exclamation mark, and should not contain special characters such as #, &, $, @, which are often reserved for specific purposes in software applications. Do not use diacritics or other language-specific characters under any circumstances.
At the same time, variable names should not be completely meaningless since they can be used for better orientation in the file. Three basic ways of variable labelling are customary:
- a numeric code that reflects the variable’s position in a system (e.g. V001, V002, V003...),
- a code that refers to the research instrument (e.g. question number in a questionnaire: Q1a, Q1b, Q2, Q3a...),
- mnemonic names referring to the content of variables (BIRTH for year of birth, AGE for respondent’s age etc.), sometimes with prefixes, roots and suffixes to distinguish variables’ membership in groups or links between them (e.g., AGECAT for a categorised variable derived from AGE).
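The naming rules above can also be checked automatically. The following is a small illustrative sketch in Python; the exact rule encoded in the regular expression is an assumption and should be adjusted to the target software:

```python
# A sketch of a variable-name check following the conventions above.
# The exact rule set is an assumption; adjust it to your target software.
import re

# Letter first, at most eight characters, ASCII letters/digits/underscore only.
NAME_RULE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,7}$")

def check_name(name: str) -> bool:
    """Return True if the variable name respects the conservative conventions."""
    return bool(NAME_RULE.match(name))

for name in ["Q1a", "AGECAT", "V001", "1stVAR", "income?", "VĚK", "RESPONDENT_AGE"]:
    print(f"{name!r:18} {'ok' if check_name(name) else 'violates the conventions'}")
```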
Variable labels – documentation in the data file
Specialised statistical software normally makes it possible to use not only variable names but also variable labels, which can be longer and help us better specify the variable’s content. Here we can enter more complete information, e.g. a short or full version of the question, or alternative labels such as question codes if the variable names are not constructed around them. Although more emphasis is placed on comprehensibility, and size limits are less strict here, it is advisable to keep variable labels rather brief or to find an adequate compromise.
- Excessively lengthy labels make analytic outcomes unclear and complicate format conversion. In some uses or after conversion between software applications, only a part of a lengthy label is kept and the loss of the rest of it may be to the detriment of comprehensibility.
- The use of diacritics should be considered carefully because they tend to cause similar problems. On the one hand, they increase the quality and comprehensibility of analytic outcomes written in Czech, but on the other hand, they may complicate conversion between software formats or work with results in a different language version of the same software.
Variable values and coding
There are different types of variable values. Variables may refer to numeric values, but most variables in a sample survey contain respondents’ oral responses to open-ended or closed-ended questions in the questionnaire. Additionally, datasets may contain photographs, video recordings, audio recordings or samples of different materials.
For the purpose of quantitative analysis, the information collected is usually represented by numeric codes. Numeric coding is common to all statistical software applications, which, among other things, facilitates data conversion and the comparison of measurements. Our decisions about the specific structure of coded categories also affect the prevention of errors in data entry and data processing and define the basis of analysis. Generally speaking, coded categories should refer to the contents of the hypotheses tested. At the same time, one must take into consideration any possible future uses of the data and maximise the data’s informational value.
The meaning of codes must be documented. Specialised analytic software lets the user assign value labels directly to variable values. The construction of value labels follows principles similar to those of variable labels. We should make them comprehensible but avoid wasting room in order to maintain the clarity of the analytic outcomes using them and the transferability of information between software. If the application does not allow us to assign codes directly to data, we have to document the values in a separate file or as part of the database’s more general documentation file.
In practice, coding takes several different forms. For closed-ended questions, the coding scheme is incorporated directly in the questionnaire and data are entered numerically. This process is automated in computer-assisted interviewing. More complex coding exercises, e.g. for textual answers to open-ended questions, require an independent coding process with a clearly defined design: a coding structure and, if there are several coders, a procedure and schedule of coding tasks. Answers can also be coded on paper questionnaires, where coders record codes in a designated spot on the questionnaire before they are entered into the computer. Such codes are then digitised along with other data. It is more advisable, however, to enter the complete wording of textual answers and then conduct the coding on the computer. This enhances control both during and after the process, facilitating the correction of mistakes or subsequent changes to the coding system. Sometimes, part of the process can be automated as well.
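As a hedged illustration (the variable and its codes are invented), numeric codes and the value labels documenting them can be kept together in the processing script:

```python
# A sketch of numeric coding with documented value labels (pandas).
# The variable name and categories are invented for illustration.
import pandas as pd

answers = pd.Series(["agree", "disagree", "agree", "strongly agree"], name="Q3a")

# Coding scheme: textual answers -> numeric codes, plus the value labels
# that document what each code means.
codes = {"strongly agree": 1, "agree": 2, "disagree": 3, "strongly disagree": 4}
value_labels = {v: k for k, v in codes.items()}

q3a_coded = answers.map(codes)
print(q3a_coded.tolist())          # [2, 3, 2, 1]
print(value_labels[q3a_coded[0]])  # 'agree'
```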
For example, the following coding recommendations are included in the best practices for creating data files defined by the ICPSR (see ICPSR, 2012, Guide to Social Science Data Preparation and Archiving. Best Practice Throughout the Data Life Cycle). I have slightly modified and simplified them for the purposes of this paper:
- Identification variables: Include identification variables at the beginning of each record to ensure unique identification of each case.
- Code categories: Code categories should be mutually exclusive, exhaustive, and precisely defined. You should be able to assign each response of the respondent into one and only one category.
- Preserving original information: Code as much detail as possible. Detailed information can later be converted into less detailed categories, but not the other way around. Less detail will limit the possibilities of secondary analysts.
- Closed-ended questions: For responses to survey questions that are coded in the questionnaire, you should digitise the coding scheme to avoid errors.
- Open-ended questions: Any coding scheme applied should be reported in the documentation.
- Recording responses as full verbatim text: Such responses must be reviewed for personal data protection.
- Check-coding: It is advisable to verify the coding of selected cases by repeating the process with an independent coder. This checks both the coder’s work and the coding scheme.
- Series of responses: If a series of responses requires more than one field, it is advisable to apply a common coding scheme distinguishing between major and secondary categories etc. The first digit of the code identifies a major category, the second digit a secondary category etc.
The same coding structures can be used for several variables in one research study, so we can add one more recommendation:
- Consistency across variables: The coding scheme for each variable should be constructed with the rest of the database in mind, so that coding conventions do not deviate from one another excessively. For example, response scales should run in the same direction. Such a rule facilitates coding and data processing and helps prevent errors (both in data entry and in analysis).
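If a coding scheme uses positional digits as recommended above (the first digit identifying the major category, the second a secondary category), this structure can be exploited directly during processing. The codes in this sketch are purely illustrative:

```python
# A sketch of working with a positional coding scheme (invented codes):
# the first digit marks the major category, the second the secondary one.
import pandas as pd

occupation = pd.Series([11, 12, 23, 31, 34], name="OCC")

major = occupation // 10      # 1, 1, 2, 3, 3
secondary = occupation % 10   # 1, 2, 3, 1, 4
print(pd.DataFrame({"code": occupation, "major": major, "secondary": secondary}))
```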
The coding of answers to fully standardised questions is usually a simple exercise. Yet open-ended questions may pose a methodological challenge. We must decide to what extent different answers to the same question are equivalent and can be assigned to the same categories. Coding is thus a complicated cognitive process, and the coder may exert a significant influence on the information that appears in the database, as well as be a source of substantial mistakes and systematic errors of measurement. Moreover, in some cases, we need to code information in great detail, using a complicated scheme. The design of coding schemes therefore requires theoretically and empirically well-founded planning and testing. Similarly, coding procedures must be planned, which requires the establishment and implementation of a specific coding design and places specific demands on coders’ competences and training. A part of the coding procedures is concerned with reviewing the quality of the coding process, including, for instance, the assessment of coder variance.
The demands of research in this respect are somewhat alleviated by the fact that the same coding structures can be applied in different studies. The use of standardised classifications and coding schemes has multiple advantages, in particular (1) economy and quality as a result of adopting an existing structure which has a solid basis and has been verified in many studies, (2) comparability with data from other studies using the same concept, (3) comprehensibility for researchers used to working with these concepts. A disadvantage lies in the need to adapt one’s research intent.
Missing values
Not all the questions in a questionnaire are answered by all respondents, which results in missing values for the corresponding variables. It is advisable to distinguish between various missing data situations (see also ICPSR). As a rule, the following situations are distinguished (frequently used acronyms are bracketed):
- No answer (NA): The respondent did not answer a question when he/she should have.
- Refusal: The respondent explicitly refused to answer.
- Don’t Know (DK): The respondent did not answer a question because he/she had no opinion or did not know the information required for answering. As a result, the respondent chose ‘don’t know’, ‘no opinion’ etc. as the answer.
- Processing Error: The respondent provided an answer but, for some reason (interviewer error, illegible record, incorrect coding etc.), it was not recorded in the database.
- Not Applicable/Inapplicable (NAP/INAP): A question did not apply to the respondent. For example, a question was skipped following a filter question (e.g. respondents without a partner did not answer partner-related questions) or some sets of questions were only asked of random subsamples.
- No Match: In this case, data are drawn from different sources, and information from one source cannot be matched with a corresponding value from another source.
- No Data Available: The question should have been asked, but the answer is missing for a reason other than those above or for an unknown reason.
Not all these missing data situations tend to be distinguished. It is crucial for the integrity of a database to at least document which questions were not asked of some respondents and specifically which respondents were not asked those questions. During coding, this information should be adequately recorded as ‘Not Applicable/Inapplicable’ (NAP, INAP). Furthermore, it is useful for many analyses to distinguish whether the respondent did not know the answer or simply did not answer (or refused to).
In order to facilitate data processing and error prevention, it is advisable to establish a uniform system for coding missing values for the entire database. Typically, negative values or values like 7, 8, 9, 97, 98, 99 (where the number of digits corresponds to the variable’s format and number of valid values) are used for these purposes. The coding scheme should prevent overlapping codes for valid and missing values. For instance, whenever the digit zero is used for missing values, we should bear in mind that zero may represent a valid value for many variables such as personal income.
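A minimal sketch of such a uniform scheme (the codes 7, 8 and 9 and the variable name are invented): the reason for missingness is kept in a companion variable before the codes are declared missing for analysis:

```python
# A sketch of handling uniform missing-value codes (invented: 7 = refusal,
# 8 = don't know, 9 = no answer) without losing the distinction between them.
import numpy as np
import pandas as pd

df = pd.DataFrame({"TRUST": [1, 3, 8, 2, 9, 7, 4]})

MISSING_CODES = {7: "Refusal", 8: "Don't know", 9: "No answer"}

# Keep the reason in a companion variable, then declare the codes missing.
df["TRUST_miss"] = df["TRUST"].map(MISSING_CODES)
df["TRUST"] = df["TRUST"].where(~df["TRUST"].isin(list(MISSING_CODES)), np.nan)

print(df)
print(df["TRUST"].mean())  # computed over valid values only
```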
Data entry and data file integrity
The goal is to enter data in a digital format that is fit for analysis and, in this process, to minimise errors such as typos and meaningless values and ensure internal consistency of the database. For this reason, it is advisable to plan the procedure of data entry and subsequent checks in advance.
Data entry procedures have changed in recent years. Operators entering data into a computer manually are being replaced by computer technologies, while the universal distinction between three phases – (1) data collection, (2) data entry and (3) editing and checks – is becoming obsolete. These changes have largely been ushered in by the emergence of computer-assisted data collection (CAPI, CATI, internet surveys etc.). Here, data entry occurs simultaneously with data collection, and the software used includes a number of functions for checking data integrity. Thus, different techniques of checks are used that offer different correction options than in a classic survey design. Even where computers are not applied directly in the field, data can still be digitised with scanners. In spite of its limits (e.g. in recording textual answers), greater automation of processes generally prevents some types of errors, produces other types of errors, and changes the options for checking data.
The integrity of a data file is based on its structure and on links between data and elements of documentation such as variable labels and value labels. Below is a set of recommendations on minimising errors based on the data management guides of the UKDA (see Create & Manage Data at the UK Data Archive) and the ICPSR (Guide to Social Science Data Preparation and Archiving):
- Manual data entry requires routine and concentration. Operators should not be burdened by multiple tasks. Tasks such as coding and data entry should be implemented separately.
- Final entry should be done through a smaller rather than a larger number of steps. This reduces the likelihood of errors.
- A great advantage lies in the use of specialised software with which it is possible to set the range of valid values for each category and to apply filters to manage the entry process (or the entire data collection process in the case of computer-assisted interviewing). These automatic checks prevent meaningless values from being entered and often help to discover inconsistencies that arise when some values are skipped or omitted, and they make the interviewer’s or operator’s work substantially clearer and easier, thus generally reducing the number of errors they make.
- Data entry errors can usually be prevented if data entry is conducted twice and the results are compared. For example, double data entry is a standard for scanning.
- Check the completeness of records.
- There are multiple methods for logical and consistency checks, including the following:
- check the value range (e.g. a respondent over the age of 100 is unlikely),
- check the lowest and highest values and extremes,
- check the relations between associated variables (e.g. educational attainment should correspond with a minimum age, the total number of hours spent doing various activities should not exceed 100% of available time),
- compare with historical data (e.g. check the number of household members with the previous wave of a panel survey).
- Many checks can be conducted automatically by computer. Even logical checks can be programmed directly into specialised CAPI, CATI or data entry software. Here, software can distinguish between permanent rules that cannot be bent and warnings that only notify the operator of entering an unlikely value.
- A certain percentage, e.g. 5–10% of all records, should be subject to a more detailed, in-depth check.
- Changes should be documented and original data should be restorable.
We can either delete or try to correct erroneous values. Simple data entry errors can be easily corrected by comparison with respondents’ original answers. However, we should bear in mind that inconsistencies can also be generated by the respondents themselves, and a correction should make minimal or no changes or reductions to their original answers. Any replacement of values must be planned and done in conformity with the concepts of measurement.
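A hedged sketch of the automated range and consistency checks mentioned above; the variable names and plausibility limits are assumptions:

```python
# A sketch of simple range and consistency checks (pandas); variable names
# and limits are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "ID":  [101, 102, 103, 104],
    "AGE": [34, 27, 134, 51],    # 134 is almost certainly a typo
    "EDU": [3, 2, 1, 4],         # e.g. 4 = university degree
})

problems = []

# Range check: a respondent over the age of 100 is unlikely.
bad_age = df[(df["AGE"] < 15) | (df["AGE"] > 100)]
problems += [f"ID {r.ID}: implausible AGE={r.AGE}" for r in bad_age.itertuples()]

# Consistency check: educational attainment should correspond with a minimum age.
bad_edu = df[(df["EDU"] == 4) & (df["AGE"] < 22)]
problems += [f"ID {r.ID}: degree reported at AGE={r.AGE}" for r in bad_edu.itertuples()]

for p in problems:
    print(p)
```

Flagged cases would then be checked against the original answers and any corrections documented, in line with the rules above.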
Anonymisation
The protection of respondents’ personal data is one of the most important ethical and legal requirements of data management, and personal data can be processed only in line with the respondents’ informed consent (see above). The goal of social research is not to identify information about individuals but to obtain generalised information. Consequently, it is often possible to avoid working with personal data in a research study. If the database is not anonymous, and either the respondents’ informed consent does not allow the processing of personal data or the personal data are not necessary for our research intentions, then the database has to be anonymised as soon as possible.
However, many databases that appear anonymous at first sight may actually harbour a significant risk of revealing respondents’ identity. In quantitative research, this is particularly the case of surveys among smaller groups, surveys identifying some information in great detail, and ones that deal with certain specific cases in a file. Therefore, every data file should be assessed and analysed for the risk of disclosing respondents’ identities before it can be considered anonymous. In some cases, methods of data anonymisation can ensure anonymity without critically damaging data quality.
A database is not anonymous if natural persons to whom the data in the database relate can be identified based on direct or indirect identifiers. Direct identifiers are, for example, names, national identification numbers, addresses, telephone numbers, respondents’ photographs etc. Indirect identifiers make a person’s identification possible when connected with other known data, for example, about a person’s job, place of residence, workplace etc., or based on the extreme values of some variables. Indirect identification may also occur when several variables in a file are combined.
The following are basic anonymisation methods:
- Removing direct identifiers: Direct identifiers can often be replaced by anonymous codes while their basic functions are maintained. For example, the national identification number is removed and a unique questionnaire ID is kept which does not point to a specific person but helps distinguish between cases in the file.
- Removing or replacing interconnections with other available non-anonymous databases or sources of information.
- Aggregating information or reducing the variable’s level of detail: Some information can be aggregated into categories referring to broader groups of subjects without losing informational value. For example, year of birth is recorded instead of the complete date of birth, or region of residence is recorded instead of the exact address. Special attention should be paid to geographic identifiers because persons can often be identified when the names of smaller municipalities are combined with other variables.
- Treating extreme values of variables: The risk of identifying persons based on atypical, extreme values can often be eliminated by introducing minimum and maximum limits of the range of valid values.
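A minimal sketch of the basic methods listed above (removing a direct identifier, reducing detail, top-coding extremes); all column names, category mappings and limits are invented:

```python
# A sketch of simple anonymisation steps (pandas); column names, categories
# and the top-coding limit are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "name":         ["A. Novak", "B. Svoboda", "C. Dvorak"],
    "birth_date":   ["1980-05-12", "1992-11-03", "1947-02-27"],
    "municipality": ["Horni Lhota", "Praha", "Brno"],
    "income":       [28000, 31000, 250000],
})

# 1. Remove the direct identifier, keep an anonymous case ID instead.
df["case_id"] = range(1, len(df) + 1)
df = df.drop(columns=["name"])

# 2. Reduce detail: year of birth instead of the full date,
#    region instead of the exact municipality (mapping is illustrative).
df["birth_year"] = pd.to_datetime(df["birth_date"]).dt.year
region_map = {"Horni Lhota": "Moravskoslezsky kraj", "Praha": "Praha", "Brno": "Jihomoravsky kraj"}
df["region"] = df["municipality"].map(region_map)
df = df.drop(columns=["birth_date", "municipality"])

# 3. Treat extreme values: top-code income at an upper limit.
df["income"] = df["income"].clip(upper=100000)

print(df)
```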
For example, in a panel survey we need to keep non-anonymous identifiers in order to build interconnections with data from other waves of the survey, but we do not need them for the analysis. Thus, we can create two files that can be interconnected by means of a unique ID. The first file contains all the survey information we need for the analysis, it is anonymous and can only be accessed for the purposes of investigation. The other file contains personal data, it is secured in terms of personal data protection, and is used exclusively for the purposes of building interconnections with data from the other waves.
Weighting
Weights of sample survey data are constructed in order to take into account the characteristics of sampling design and correct identifiable deviations from population characteristics. Each individual case in the file is assigned a certain coefficient – individual weight – which is used to multiply the case in order to attain the desired characteristics of the sample. If the weight of a case equals 1 then the values measured are not adjusted.
Using weights may be dysfunctional with some methods of analysis. There are also general theoretical and methodological issues which discourage some researchers from using weights. Either way, the type and purpose of weighting cannot be omitted from the final decision. For this reason, I mention here that there are different types of weights for different purposes, although I do not devote attention to specific procedures of weight construction:
- Design weights are constructed in order to mutually adjust individual units’ probabilities of being sampled, which are normally not equal when complex sampling procedures combining multiple methods (stratification, cluster sampling) in several stages are implemented. For example, we may want to adjust the probabilities of being sampled for all respondents in households: although individuals are the final sampling units, households are sampled in the first stage, so respondents’ probabilities of being selected depend on the number of household members.
- Non-response weighting: During the implementation of a survey, we are normally not able to get a response from some units sampled due to their refusal, our failure to contact them, or other administrative reasons. Response rates differ between various population groups and those inequalities can be compensated by weighting.
- Post-stratification weighting: This is done in order to achieve a distribution equal with that of some known characteristics of the population (e.g. sex, educational attainment).
- Population size weighting: Different groups may be represented in the database in different proportions than they are in reality. Such discrepancies are normally compensated through weighting. For example, international data files combine data from various countries. However, similarly large surveys are usually implemented in each of these countries, although their total populations are radically different in size. If we want to analyse data about large populations, such as in Europe, then we have to adjust the proportions in the representation of individual European countries.
- Combined weighting: Several different types of weights for different purposes may be constructed in the file. Subsequently, they are combined into a final, combined weight.
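As a simple illustration of post-stratification weighting only (the population shares and variable name below are invented), a weight can be constructed as the ratio of the population share to the sample share of each category:

```python
# A sketch of simple post-stratification weights (pandas): weight = population
# share / sample share within each category. Shares and names are invented.
import pandas as pd

df = pd.DataFrame({"SEX": ["m", "m", "m", "f", "f", "f", "f", "f", "f", "f"]})

population_share = {"m": 0.49, "f": 0.51}     # known population distribution
sample_share = df["SEX"].value_counts(normalize=True)

df["W_POST"] = df["SEX"].map(lambda s: population_share[s] / sample_share[s])
print(df["W_POST"].round(3).tolist())

# A weighted estimate then multiplies each case by its weight, e.g.:
# (df["X"] * df["W_POST"]).sum() / df["W_POST"].sum()
```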
Data file documentation
The format of documentation should correspond with the intended uses of the data. We have to take into consideration our own intentions, the requirements of the recipients of our research results, or the requirements of archives that will be making our data available. Selected information, especially documentation of variables, is usually integrated in the data file. Other metadata should be included in a special structured document. The wording of questions might also be included in it, although this is often also provided in a separately attached research instrument. Even if variable documentation forms part of the data file, some variables should be accompanied by additional specific information, for example weighting algorithms, a detailed description of performed transformations, the syntax used in converting classifications etc.
In the past, data was usually accompanied by so-called codebooks or other reference guides printed on paper. These contained information about the research study and its methods, lists of codes, or even lists of frequencies and selected contingency tables. At present, such documentation tends to be provided electronically, which makes it easier to search through and connect with data and other materials. Like data, ‘metadata’ also have to be in line with the objectives of long-term preservation, i.e. their formats should be stable vis-à-vis software development. The DDI international standard covers the format and contents of documentation in a comprehensive way. It presents a universal structure of documentation for the widest range of types of social science data and purposes, including long-term preservation of metadata and recording of data file history.
When documenting data files from sample surveys, we should distinguish two general levels – information about the survey and information about the data. In particular, it is important not to omit the following items:
(a) Information about the survey
- Data file origin: the title of the survey (including acronyms and their explanation, alternative titles in foreign languages etc.), institutional information (authors, implementing, funding and commissioning institutions, grant numbers etc.), project abstract, objectives, concepts, hypotheses, and references to follow-up projects.
- Description and methods of data collection: description of all the sources the data originate from (e.g. for derived data, data added from other sources), the time period of data collection, temporal and geographical coverage, target population, units of observation, description of sampling design including frame, methods of data collection, the wording of the questions in the questionnaire, the original research instrument and other materials used in data collection (letters of invitation, instructions for interviewers etc.), classification schemes and concepts applied, response rate and other assessments (e.g. known deviations from the research population), identification of methodological changes for time series and longitudinal studies.
- Description of data files: specification of the version and the edition of a collection, the structure of the data files, a specification of associations and interconnections (including technical information for forming links between files etc.), size information (the number of units and variables), information about formats and compatibility.
- Data edits and modifications: methods and results of integrity checks, validation, data cleaning, or any other applicable procedures for increasing data quality (calibration, the imputation of missing values, checks and corrections of transcripts etc.), anonymisation, transformation and construction of derived variables, weighting (the identification of variables for weighting and a description of weighting methods and design).
- Access to data: a definition of authorised persons, a specification of terms of use, information about personal data protection.
- Cataloguing and citation information: bibliographic information, suggested citation, key words, cataloguing information.
- References to related materials and sources if applicable.
(b) Information about the data
- Information about the variables in the file: the names, labels and descriptions of variables, their values, a description of derived variables or, if applicable, frequencies, basic contingencies etc. The exact original wording of the question should also be available.
- Information about the cases in the file: a specification of cases if applicable.
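Part of this variable-level documentation can be generated directly from the data file. The following is a minimal sketch producing a simple codebook with names, labels and frequencies; the variables and labels are invented:

```python
# A sketch of exporting simple variable-level documentation (a minimal
# codebook with names, labels and frequencies). Labels are invented.
import pandas as pd

df = pd.DataFrame({"SEX": [1, 2, 2, 1, 2], "AGECAT": [1, 3, 2, 2, 3]})
variable_labels = {"SEX": "Sex of respondent", "AGECAT": "Age category"}

with open("codebook.txt", "w", encoding="utf-8") as out:
    for var in df.columns:
        out.write(f"{var}: {variable_labels.get(var, '')}\n")
        out.write(df[var].value_counts().sort_index().to_string())
        out.write("\n\n")
```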
Versions and editions, ensuring authenticity
The data management procedures typically result in several versions of the data file. New versions are created when errors discovered during data analysis, after data cleaning, are subsequently corrected; when the data are processed for the purposes of analysis; or when new data or data from other sources are added. The treatment of errors and the inclusion of new data may result in the publication of different editions of the same dataset, which may even differ substantially in their contents (e.g. when data from additional countries are included in an international database). However, different working versions of the file are created even in research studies with simple databases or for the researchers’ own needs. Thus, it is advisable to keep track of the contents of each version and avoid overwriting the authentic original file.
A good strategy for managing data file versions and editions is necessary to ensure that the data are safe, the data file contents are comprehensible, and mismatches and mistakes are avoided. The objective is (1) to clearly distinguish between individual versions and editions and keep track of their differences, (2) to ensure data authenticity, i.e. prevent unauthorised modification of files and loss of information. The following basic rules apply (see also UKDA 2010: Version Control & Authenticity http://www.data-archive.ac.uk/create-manage/format/versions):
- Establish the terms and conditions of data use and make them known to team members and other users.
- Distinguish between versions shared by multiple researchers and individuals’ working versions.
- Introduce clear and systematic naming of data file versions and editions.
- Maintain records about the creation of versions and editions, their specific contents and mutual relations.
- Document any changes made.
- Keep original versions of data files, or keep documentation that allows the reconstruction of original files.
- Create a ‘master file’ and take measures to preserve its authenticity, i.e. place it in an adequate location and define access rights and responsibilities – who is authorised to make what kind of changes.
- If there are several copies of the same version, check that they are identical.
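One way to check that several copies of the same version are identical (the last rule above) is to compare file checksums; the file names in this sketch are placeholders:

```python
# A sketch of verifying that copies of a data file are identical by comparing
# SHA-256 checksums. File names are placeholders.
import hashlib

def checksum(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

if checksum("master/survey_v1-2.csv") == checksum("backup/survey_v1-2.csv"):
    print("Copies are identical.")
else:
    print("Copies differ - investigate before trusting either file.")
```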
Data preservation: Backups, formats, media
Digital media are unreliable in principle. Software development results in frequent format changes, which affects compatibility. Institutions preserving data go through organisational changes, too. Additional risks are related to software failure and viruses, inadequate human intervention, and natural disasters. As a result, data preservation represents a process, rather than a state, and requires a well-considered approach.
Backup is the fundamental element of security. The extent and methods of backup implementation should be in line with its objective, namely a complete restoration of lost files. Backup should take place in a systematic and regular fashion. For that purpose, many institutions have formulated uniform backup policies. Such policies should take into account the needs related to research data files. Moreover, they should cover, if applicable, any substantial modifications to data as soon as possible after they are made. Backup processes can be automated by means of specialised software applications and backup devices. A backup copy has to be labelled and documented. Data should be deposited in such formats and on such media that are favourable for long-term preservation (see below). Its completeness and integrity should be verified and checked. Given the risks backup is expected to cover (e.g. floods, fire), backup copies should be located elsewhere than the original data.
In order to secure our data and ensure its long-term preservation, we should choose adequate formats and a favourable location.
(a) Data formats and documentation
Software and software formats are developing rapidly. For short-term operability, it is often practical to choose a format associated with a specific software application. We also have to take into consideration how widespread this format is and to what extent the computer environment is friendly to it in terms of the compatibility and availability of conversion tools. We should keep in mind that very specific formats (e.g. SPSS files with the *.sav extension, Stata files with the *.dta extension, MS Access files with the *.mdb or *.accdb extensions, dBase files with the *.dbf extension) undergo frequent changes and may not be fully transferable, even between different versions of the same software application. However, some software utilises so-called portable versions of formats that are associated with a specific application or group of applications and allow easy transfer of data between different versions and hardware platforms (e.g. ‘portable’ SPSS files with the *.por extension or ‘transport’ SAS files). Open, widespread formats are more advisable for long-term storage as they typically undergo fewer changes.
The use of language-specific characters in the database, e.g. Czech diacritics, creates another issue. In this case we need to pay close attention to character encoding. Some character encoding systems (e.g. the Microsoft Windows ANSI code pages) do not cover all characters in one system. As a result, the appropriate language environment (e.g. Central European) has to be set to ensure correct display, and this cannot always be done. Other encoding systems (e.g. Unicode encodings such as UTF-8) allow several character sets to be displayed correctly at the same time.
Long-term preservation of quantitative data (depending on the arrangement and type of data) is normally best done using simple text (ASCII) formats and a structured documentation file with information about the variables in it, their position in the file, formats, variable labels, value labels etc. Records for each unit in such a file are normally located on separate rows. If any of the records extend over multiple rows then we should note this in the documentation. In terms of the location of variables in the file, a distinction is made between fixed and free formats. In a fixed format, variables are arranged in columns and their exact positions, i.e. the start and end of each variable, are known.
The position of a variable is irrelevant in a free format – the data for each variable are separated by blanks or specific characters, such as a tab or a dash. For instance, in the Czech language environment it is necessary to remember that the comma, a frequent data separator in English versions of databases, is used as a decimal separator instead of a dot. It is essential to make sure that no symbol in a file has more than one possible meaning.
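A hedged sketch of reading such files with pandas: the fixed-format column positions would be taken from the documentation (the layout, file names and separator here are invented):

```python
# A sketch of reading a fixed-format ASCII data file; the column positions
# (start, end) would normally come from the documentation. Layout is invented.
import pandas as pd

# Documentation says: ID in columns 1-4, SEX in column 5, AGE in columns 6-8.
colspecs = [(0, 4), (4, 5), (5, 8)]          # zero-based, end-exclusive
names = ["ID", "SEX", "AGE"]

df = pd.read_fwf("survey_fixed.dat", colspecs=colspecs, names=names)
print(df.head())

# For a free format with a separator, the decimal mark matters
# (e.g. Czech files often use ',' as the decimal separator):
# df = pd.read_csv("survey_free.csv", sep=";", decimal=",")
```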
Digital versions of paper documentation are usually kept in PDF/A format. This is the official version of the PDF format for archiving conforming to the ISO 19005-1:2005 standard. It guarantees independence from the platform, includes all display information (including fonts, colours etc.) and metadata in the XML format, and disallows encryption or password protection. Structured textual documentation should be saved in a simple text format, with tags and in line with a standard structure (e.g. DDI).
(b) Digital media
The choice of media for the long-term preservation of data depends not only on the type of media but also on the quality of different media of the same type. It also importantly depends on the methods of storage and, last but not least, on technological changes – after several years, compatible readers are difficult to find for some media. For example, a high-quality 5.25 inch floppy disc may preserve data for up to twenty years, but it is easily damaged if handled without care. At the same time, very few institutions these days have the equipment for reading this once widespread medium.
Generally speaking, all digital media are subject to high risk of damage, loss of information, and outdating due to technological development. Therefore, it is of primary importance to formulate a strategy for data backup and preservation in consideration of existing risks and expected developments.
Prepared by: Jindrich Krejci, 2013 - 2014