
Curating the Information Wanted historical dataset

Today I presented my work on the historical dataset “Information Wanted” to an audience of Harvard Library graduate fellows in historic data curation, who are working on a project to make the Chinese Maritime Customs Collection available as data.

Project citation

Harris, Ruth-Ann, 2017, “Information Wanted,” https://doi.org/10.7910/DVN/UNJU3N, Harvard Dataverse, V2, UNF:6:Zvk/OKdbWggU43MHjTVAWg== [fileUNF].

From database to dataset

Earlier in my career, I was the Digital Scholarship Librarian and Bibliographer for History at Boston College Libraries, a dual role as both a functional and subject librarian.

One day the Bibliographer for Irish Studies, Kathleen Williams, came to me and said, “Central IT shut down a custom web database created for a faculty-led project because the site had serious security concerns. They exported the data. Now what?”

So I walked over to the special collections library and asked Kathleen to tell me the story from the beginning. How did this project come to be? What was its scholarly purpose? What people were involved? What functionality did the site as it formerly existed offer?

This was back in Fall 2017. The Institute of Museum and Library Services (IMLS) had just funded the project “Always Already Computational: Library Collections As Data,” led by Thomas Padilla and the rest of the project team. Their project convened library workers in order to develop a shared strategic approach to facilitating use of library collections for computationally driven research. After the project team drafted the Santa Barbara Statement on Collections as Data, they elicited community feedback at many conferences. I was fortunate to attend the final feedback workshop at the 2017 Digital Library Federation Forum.

So by the time I was talking with Kathleen, my ears were newly attuned to thinking of library collections as data. Through our conversation, it became clear to me that the central aspiration of this digital project had always been to structure and normalize messy free-text data, making it usable for mapping and visualization and other not-yet-imagined uses.

So we set out to curate a dataset that would, we hoped, facilitate re-use.

What’s the data?

So far I’ve told you how I got involved. Now’s the fun part. I’m going to tell you about the data itself.

The data come from classified advertisements placed in the Boston Pilot newspaper by Irish immigrants to the United States seeking information about friends and family who had also migrated. The person placing the advertisement would go to the newspaper office and speak with a clerk, who would take down their information and compose an advertisement.

Let’s take a look at an example from a digitized issue of the Boston Pilot: 

These notices could be a source base for lots of scholarly questions that could be asked about the Irish diaspora! What data points do you notice in the advertisements cited above? Perhaps:

  • Hometown
  • Where they left from
  • Where they arrived
  • Current place of residence
  • Familial relationships
  • Etc.

We just looked at one example. The faculty member, Ruth-Ann Harris, looked at tens of thousands of examples. She co-edited a multi-volume set of books that collected advertisements extracted from the Boston Pilot, so she was deeply familiar with the range of data points that these advertisements collectively included. After publishing the multi-volume set of collected advertisements, she moved on to creating a digital project. She collaborated with university communications to create the web database called Information Wanted. She established data entry protocols and trained student assistants to perform data entry according to those protocols. A quick note: The student assistants worked from the printed volumes, rather than directly from the newspapers themselves, whether print or digitized.

The data entry protocol identified fields including:

  • Name of the person being sought
  • Gender
  • Occupation
  • Departure date
  • Arrival date
  • Intended destination
  • Port of departure
  • Port of arrival
  • Name of the person seeking information
  • Relationship to missing person
  • Additional seekers

In addition to structuring fields, Ruth-Ann also created custom controlled vocabularies for allowable values within those fields. One controlled vocabulary for occupation included values such as:

  • Blacksmith
  • Cooper
  • Farm laborer
  • Grocer
  • Soldier

Another controlled vocabulary for family relation included values such as:

  • Mother
  • Brother
  • Husband
  • Sister-in-law
  • Cousin
  • Friend
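To make the idea of structured fields plus controlled vocabularies concrete, here is a minimal Python sketch. It is not the code behind the original web database or the deposited dataset. The class name, the field names, and the validate function are all hypothetical, and the two vocabularies contain only the sample values listed above, not the full lists Ruth-Ann created.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sample values only, copied from the lists in this post.
# The project's real vocabularies were larger.
OCCUPATIONS = {"Blacksmith", "Cooper", "Farm laborer", "Grocer", "Soldier"}
RELATIONSHIPS = {"Mother", "Brother", "Husband", "Sister-in-law", "Cousin", "Friend"}


@dataclass
class AdvertisementRecord:
    """One advertisement as structured data (hypothetical field names)."""
    sought_name: str
    seeker_name: str
    occupation: Optional[str] = None
    relationship: Optional[str] = None
    port_of_departure: Optional[str] = None
    port_of_arrival: Optional[str] = None


def validate(record: AdvertisementRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # An empty field is allowed: many advertisements omit details.
    if record.occupation is not None and record.occupation not in OCCUPATIONS:
        problems.append(f"occupation not in vocabulary: {record.occupation!r}")
    if record.relationship is not None and record.relationship not in RELATIONSHIPS:
        problems.append(f"relationship not in vocabulary: {record.relationship!r}")
    return problems


# Placeholder names, not taken from any real advertisement.
record = AdvertisementRecord(
    sought_name="Person Sought",
    seeker_name="Person Seeking",
    occupation="Cooper",
    relationship="Brother",
)
problems = validate(record)
```

The point of checking values this way is the same as the point of the original protocol: free-text variants get normalized to one agreed value at data entry, so that later counting, mapping, and visualization do not have to untangle them.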

Fields that had a geographic component in Ireland were enhanced to include other geographic units. For example, if an advertisement named a townland in Ireland, the person performing data entry consulted reference works, both printed and online, to identify other units such as Barony, Poor Law Union, or civil or parochial parish.

At some point, the scholar, Ruth-Ann Harris, retired, and the Irish Studies librarian, Kathleen Williams, continued to supervise student workers performing data entry from the edited volumes of advertisements.

How did we curate the dataset?

After deciding that the project was to curate a dataset, I reached out to our then GIS and Data Librarian, Barbara Mento. We quickly decided to deposit in the Boston College Dataverse, which is a collection within the Harvard Dataverse. We didn’t really perform much of a requirements analysis that we could use to assess potential repositories. Neither of us was aware of disciplinary data repositories for the humanities that might reach an existing disciplinary audience. Now there’s enslaved.org, for example, for datasets that help to reconstruct the lives of people involved in the historical slave trade. But at the time we weren’t aware of anything for humanities datasets or Irish Studies datasets. So we kept it simple and decided that we would deposit in the platform available to us at our institution, the newly launched Boston College Dataverse.

For us, this would be the first dataset of its kind: a humanities dataset, rather than a dataset underlying a scientific journal article. We wanted to describe this data in a way that would help it to be usable by its disciplinary audience. I took the lead on exploring the data dictionary for Dataverse Metadata, available in the Appendix to the User Guide, and thinking about how it might apply to Information Wanted. The three of us (Kathleen, Barbara, and I) came back together for further discussion, and we identified our core goals for publishing the dataset to Dataverse:

  1. Enable ongoing data entry and continual updates to the dataset.
  2. Describe the dataset’s complex provenance.
  3. Point users to related materials.
  4. Credit the labor of all contributors.

For our first goal, to enable ongoing data entry and updates to the dataset, we relied on Dataverse’s versioning capabilities.

For goals two and three, about provenance and related materials, we relied on citation metadata for the dataset overall, file-level metadata, and the codebook. Hats off to Barbara, who wrote the codebook that we deposited.

Our fourth goal, to credit the labor of all contributors, was the easiest to accomplish. We identified Ruth-Ann as the author and all library workers and student workers as contributors. The contributor field also allowed us to identify roles. We identified Kathleen as the Project Manager. We identified Barbara and me as Data Curators. And we identified all student workers who performed data entry as Data Collectors.

How has the dataset been used?

Kathleen printed postcards with a direct link to the dataset in Dataverse before attending a major conference in the field of Irish Studies. She handed out hundreds of postcards at the conference. Soon after, both Kathleen and Barbara retired. After we deposited the dataset in December 2017, I didn’t look at it again until Spring 2019, and I was stunned that there were more than 1,000 downloads. Here we are in 2023 approaching 4,000 downloads. I don’t know how it’s being used, but I credit Kathleen for knowing how to reach her disciplinary audience so that it is being used.

As of July 18, 2023, files from this dataset had been downloaded 3,831 times. I would be interested in using the newly released Digital Content Reuse Assessment Framework Toolkit, or D-CRAFT, to explore further.
