This is the second in a series of articles describing the process of creating a comprehensive, canonical source of documentation for the Midlife in the United States (MIDUS) research project. The goal is to take the diverse set of sources that currently document the MIDUS study and create a standardized, DDI 3-based set of documentation that better enables researchers to discover and use the MIDUS data. We chose DDI 3 because it is the only standardized metadata format capable of storing the breadth of information required for the MIDUS study. The project is a joint effort between MIDUS and Colectica.
- Part 1: Taking Inventory
- Part 2: Importing Diverse Documentation Sources into DDI 3
A First Draft
In Part 1 we took inventory of the available sources of documentation and decided how to map information from each of the sources to fields in DDI 3.
The next step is to create a first draft of the DDI 3 metadata. The table below, repeated from Part 1, shows the existing sources of documentation and how they will map to DDI 3 elements.
Converting these existing sources into a single, canonical, DDI 3-based source of documentation was accomplished using a combination of off-the-shelf tools and custom programming. The steps of this conversion process are discussed below.
| Source | DDI 3 Mapping |
|---|---|
| Data Files | PhysicalInstance, Variable, VariableStatistics |
| Web documents and Codebooks | StudyUnit, FundingInformation, OtherMaterial |
| DDI 2 | PhysicalInstance, Variable, QuestionItem, VariableStatistics |
| Spreadsheets | Variable, QuestionItem |
| CAI Source Code | Instrument, QuestionItem, ControlConstructSequence |
| PDF Documents | OtherMaterial |
Step 1: Create the MIDUS project in Colectica
The main tool used to manage the MIDUS DDI 3 upgrade project is Colectica. A MIDUS-specific deployment of Colectica Repository is used as the authoritative source of metadata. Colectica Designer is used to import, create, edit, and synchronize the metadata with the repository.
This means the first step is to fire up Colectica Designer and create the project that will hold each of the MIDUS studies. Since that’s fairly simple, let’s move on to more interesting steps.
Step 2: Data Files and DDI 2
MIDUS is currently broken up into two waves: MIDUS 1 and MIDUS 2. MIDUS 1 contains two sub-projects, and MIDUS 2 contains five sub-projects. There is also MIDUS Japan (MIDJA), a comparable study conducted in Japan.
Each of these projects has its own data file, and most of them already had DDI 2 documentation. This made it easy to bring in all the data-related information quickly.
- First, for the projects that did not have a DDI 2 file, we created one. This is a two-step process in Colectica:
  - Use the Import SPSS feature to load the dataset into Colectica.
  - Use the Export to DDI 2 feature to create the DDI 2 file.
- Next, we brought each study’s dataset description into DDI 3. This is another two-step process for each project:
  - In Colectica Designer, use the Import DDI 2 feature to automatically create the corresponding information in DDI 3.
  - Navigate to the dataset view and use the Calculate Summary Statistics feature. As you might guess, this calculates descriptive statistics and frequencies for the entire dataset and stores them in the DDI 3 metadata.
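To make concrete what a summary statistics step produces, here is a minimal Python sketch that computes descriptive statistics and frequencies per variable. It is not how Colectica implements the feature; the `summarize` function, the `dataset` dictionary, and the variable names and values are all made up for illustration.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values):
    """Return counts, frequencies, and (for numeric data) descriptive statistics."""
    valid = [v for v in values if v is not None]
    summary = {
        "valid": len(valid),
        "missing": len(values) - len(valid),
        "frequencies": dict(Counter(valid)),
    }
    # Descriptive statistics only make sense for numeric variables.
    if valid and all(isinstance(v, (int, float)) for v in valid):
        summary["min"] = min(valid)
        summary["max"] = max(valid)
        summary["mean"] = mean(valid)
        summary["stdev"] = stdev(valid) if len(valid) > 1 else None
    return summary


# Hypothetical dataset: variable name -> responses (None marks a missing value).
dataset = {
    "AGE": [34, 51, 47, None, 51],
    "SEX": ["F", "M", "F", "F", None],
}

statistics_by_variable = {name: summarize(vals) for name, vals in dataset.items()}
```

In a DDI 3 workflow, results like these would be stored alongside each variable rather than kept in a dictionary.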
Step 3: Spreadsheets
While the SPSS files and DDI 2 contain a decent amount of metadata, MIDUS also has spreadsheets that hold lengthier variable labels, question text, and notes. These are highly structured files where each row has a column for the variable’s name, followed by columns holding the extended information.
Since there isn’t an off-the-shelf tool that can merge these custom spreadsheets with the DDI 3 metadata, this step required some custom programming. With the Colectica SDK, this was a fairly quick task. Here is a rough outline of the custom program:
- For each row in the spreadsheet:
  - Read the variable name and retrieve the variable from Colectica Repository
  - If an extended label is specified for the variable, update the variable’s label
  - If a description is specified for the variable, set the variable’s description
  - If question text is specified:
    - Create a new DDI 3 question item with the appropriate text
    - Add the question to the repository
    - Associate the variable with the new question
  - Finally, save the updated variable and continue to the next row
- After processing the spreadsheet, synchronize the changes with the repository
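The outline above can be sketched in a few lines. The actual program used the Colectica SDK, whose API is not shown here; this Python version swaps in a hypothetical in-memory repository. The class names, the column headers (`name`, `label`, `description`, `question_text`), and the sample rows are all assumptions made for illustration. The sketch also records rows whose variable name has no match, which the outline does not mention.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuestionItem:
    text: str


@dataclass
class Variable:
    name: str
    label: str = ""
    description: str = ""
    question: Optional[QuestionItem] = None


class InMemoryRepository:
    """Hypothetical stand-in for a metadata repository, keyed by variable name."""

    def __init__(self, variables):
        self.variables = {v.name: v for v in variables}
        self.questions = []

    def get_variable(self, name):
        return self.variables.get(name)

    def add_question(self, question):
        self.questions.append(question)

    def save_variable(self, variable):
        self.variables[variable.name] = variable


def merge_spreadsheet(rows, repo):
    """Apply each spreadsheet row to its variable; return names with no match."""
    unmatched = []
    for row in rows:
        name = row["name"].strip()
        variable = repo.get_variable(name)
        if variable is None:
            unmatched.append(name)
            continue
        if row.get("label"):
            variable.label = row["label"]
        if row.get("description"):
            variable.description = row["description"]
        if row.get("question_text"):
            question = QuestionItem(text=row["question_text"])
            repo.add_question(question)
            variable.question = question
        repo.save_variable(variable)
    return unmatched


# Made-up spreadsheet content; real files would be read from disk.
SPREADSHEET = """name,label,description,question_text
VAR1,Extended label for VAR1,Longer note about VAR1,Example question text for VAR1
VAR2,,,
VAR9,Label for a variable missing from the repository,,
"""

repo = InMemoryRepository([
    Variable("VAR1", label="Short label 1"),
    Variable("VAR2", label="Short label 2"),
])
unmatched = merge_spreadsheet(csv.DictReader(io.StringIO(SPREADSHEET)), repo)
```

Because this stand-in repository lives in memory, the final synchronization step has no equivalent here. Collecting unmatched names makes it easy to review spreadsheet rows that refer to variables the repository does not know about.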
Step 4: Web documents and Codebooks
The MIDUS web site contains abstracts, funding information, citation information, and more details that give a broad overview of the MIDUS study. DDI 3 provides fields for this information, so we simply copied and pasted the information from the web site into the corresponding fields in Colectica Designer.
The advantage of managing this information in Colectica instead of as simple HTML files is that from now on, changes to the information will be tracked as new versions. This results in a full audit log and version history for the MIDUS documentation.
The MIDUS web site also links to codebooks describing each of the MIDUS datasets. These codebooks don’t contain any information that isn’t also found in the DDI 2, data files, or spreadsheets. That makes the codebooks easy to deal with: we just ignored them. A future article will describe how we automatically generated new, more interactive codebooks from the DDI 3 metadata.
Step 5: PDF Documents
The MIDUS web site also links to many PDF documents that contain additional information useful to researchers. Adding these links to the DDI 3 metadata was straightforward with Colectica Designer.
- Enter the document title, documentation type, and URL in each study’s Related Materials section
- Use the Detect MIME Type feature to automatically set the content type of each document
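For readers curious what MIME type detection can look like, here is a minimal Python sketch that guesses a content type from a URL’s file extension using the standard library. This is an assumption about one possible approach, not a description of how Colectica Designer detects types; the `detect_mime_type` helper, the record fields, and the `example.org` URLs are hypothetical.

```python
import mimetypes


def detect_mime_type(url, default="application/octet-stream"):
    """Guess a content type from the URL's file extension."""
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed or default


# Hypothetical related-material records: title, documentation type, and URL.
related_materials = [
    {"title": "Example field report", "type": "Report",
     "url": "https://example.org/docs/field-report.pdf"},
    {"title": "Example download", "type": "Other",
     "url": "https://example.org/docs/download"},
]

for material in related_materials:
    material["mime_type"] = detect_mime_type(material["url"])
```

Extension-based guessing is fast but cannot identify a document whose URL has no recognizable extension, which is why the sketch falls back to a generic default.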
Step 6: Computer Assisted Interviewing (CAI) Source Code
The MIDUS surveys were conducted using the CASES computer assisted interviewing (CAI) system. Since Colectica Designer supports importing CASES source code, this was a simple step. We just used the Import CASES feature to bring in all the questions and the flow logic of the survey.
But what about the questions that were created when we imported from DDI 2 and merged the information from the custom spreadsheets? MIDUS researchers decided that the CAI source code should be canonical when question text differed, since it is the language respondents actually heard or saw. A future article will detail how we made this happen.
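The precedence rule itself is simple to state in code. The following Python sketch illustrates only the rule, not the merge process the future article will cover. The function name and the fallback to the previously documented text when no CAI text exists are assumptions.

```python
def canonical_question_text(cai_text, documented_text):
    """Prefer the CAI wording; fall back to the documented wording if it is absent."""
    cai = (cai_text or "").strip()
    documented = (documented_text or "").strip()
    return cai if cai else documented


# Made-up strings standing in for real question wording.
chosen = canonical_question_text("CAI wording", "Spreadsheet wording")
fallback = canonical_question_text("", "Spreadsheet wording")
```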
Where does this leave us?
This article describes how we used off-the-shelf tools, with a bit of custom programming, to convert documentation from a wide range of sources into a single, standards-based format. The MIDUS documentation is now stored in one authoritative repository. With this repository, we can reuse resources like classifications, questions, and variable descriptions; track all changes to the MIDUS project; and generate rich forms of documentation.
Future Articles
Since MIDUS is a longitudinal study, many classifications, questions, and variables are repeated in each wave of data collection. The next article will explore how we automatically harmonized these resources. It will also discuss how we merged the questions that were described in the DDI 2 metadata with the actual question text found in the computer-assisted interviewing system source code.
Subsequent articles will show how we used this harmonized metadata to generate modern, interactive codebooks that make it easy for researchers to discover the data they need.