Love Your Data Week – Data Dispatch

Love Your Data Week 2017: Rescuing Unloved Data

Nicholas Wolf — Fri, 17 Feb 2017 17:34:45 +0000

The end of Love Your Data Week has arrived, but naturally care for our research data will continue beyond the confines of a single week in February!

Today’s final theme is “Rescuing Unloved Data.” Data in need of rescue can range from at-risk paper archives containing structured information to recently created digital files at risk of loss because of the collapse of support or infrastructure for that particular file format. Other data that can use a helping hand include data in a stable format that simply need some further enrichment to reach their full potential, and data at risk of disappearance because of a change in administrative oversight or ideological outlook.

Recently, Data Services hosted a data enrichment hack-a-thon for Soviet-era maps that had been digitized in TIFF format but not yet georeferenced so that they could be displayed in a GIS. Moving that spatial data from a basic digital format to an image format with embedded spatial referents involved a bit of manual labor by hack-a-thon participants, reminding us that some of the best data rescues involve collective efforts. In the end, a helpful new resource was built for researchers out of a set of files that had been in need of a little love and attention.

We hope you’ve taken some time out of your schedule this week to think about the future of your research data. Check back for next year’s Love Your Data Week 2018 for more tips and new themes.

Love Your Data Week 2017: Finding the Right Data

Nicholas Wolf — Thu, 16 Feb 2017 17:35:16 +0000

Thursday’s theme for Love Your Data Week is finding data, making it timely to take a look at the resources that we have here at NYU for researchers seeking data. Bobst Library receives many requests for data–increasingly so as data-driven research questions have begun to inform a larger number of disciplines than ever before.

Needing data and knowing which providers offer it is only the beginning of such inquiries, however. No two data providers follow the same approach to formatting and serving up their data, so it helps to keep your search relatively open ended until you see the format in which the data is available. For example, one might seek population health data and find it from a number of providers, from the National Institutes of Health and Centers for Disease Control to the U.S. Census. But the data provided by those sources might be aggregated, sampled, or available in exploratory (rather than downloadable) formats. Data, while plentiful, may be limited in what it measures. So build in extra time for transforming your datasets, especially if you need to link them to other datasets in your arsenal.

Here at NYU Data Services, we have aggregated many of the data resources available at the library and beyond into a handy Finding Data guide that allows you to search by genre and provider type. It also helps you consider the differences between pure data and statistics, two common formats in which providers will offer information.

As when accumulating your own data, don’t forget to document your sources each step of the way, saving you time at the end of your research process and making it harder to make errors in citation!
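One lightweight way to document sources as you go is to keep a small machine-readable log alongside your downloads. The following is only a minimal sketch in Python: the function name, the field names, and the provider, title, and URL shown are all hypothetical placeholders rather than a prescribed citation standard.

```python
import json
from datetime import date


def source_record(provider, title, url, retrieved):
    """Build one log entry describing where a dataset came from (hypothetical fields)."""
    return {
        "provider": provider,
        "title": title,
        "url": url,
        "retrieved": retrieved.isoformat(),  # ISO 8601 date, e.g. 2017-02-16
    }


# Placeholder values for illustration only; substitute your real source details.
log = [
    source_record(
        "Example Provider",
        "Example Population Health Table",
        "https://example.org/data.csv",
        date(2017, 2, 16),
    )
]

# One JSON object per line keeps the log easy to append to and to read later.
log_text = "\n".join(json.dumps(record, sort_keys=True) for record in log)
print(log_text)
```

Adding an entry at the moment of download means the details needed for a citation are already on hand when the writing starts.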

Love Your Data Week 2017: Good Data Examples

Nicholas Wolf — Wed, 15 Feb 2017 18:14:15 +0000

The principles behind good data have a nice handy acronym to spark your memory. Good data are FAIR:

Findable
Accessible
Interoperable
Reusable

Here at NYU’s Data Services, we emphasize interoperability and reusability through our guides and classes focusing on research data management in which we stress the need to select long-term file formats that are software agnostic. In practice, this often means open-access or open-licensed file formats and software.

But we often find it a struggle to convince researchers to consider the ramifications of how to make data findable and accessible. If we think of “accessible” and “reusable” as principles that call for preservation–a process that demands robust curatorial and archival services–the principle of “findable” often demands more flexible, dynamic (especially in terms of online manifestations of research data and analysis) approaches such as a website, print- or e-publication, or some other unique access point. In other words, it often seems natural for researchers to publish their data on a personal or departmental website (and even more effective in terms of “findability”) rather than think of long-term solutions such as a data repository.

Ideally, of course, both approaches would be blended, and there is nothing to suggest that one cannot achieve FAIR principles by putting research data in both places, so long as those locations are linked in some way and at least one of them has the appropriate permanency.

Needless to say, accessible and reusable data are also well-organized and well-documented. It is always worth a visit to a repository such as the ICPSR’s political and social research collections to see how highly curated data should look.

Love Your Data Week 2017: Documenting, Describing, Defining

Nicholas Wolf — Wed, 15 Feb 2017 00:34:02 +0000

Documenting one’s research data and methods often feels more like winning a battle against inertia than learning new tricks to documentation. After all, don’t we all know that we should be recording how and why we did things…it is just a question of finding time to do it?

This may be true, but it is also helpful to consider that documenting data efficiently is really the key. Once we set up a system that enables effective documentation, contributing to it should require far less effort and it should ideally make it easier to continue documenting as the project progresses: in other words, a lot of initial effort followed by inertia in favor of documentation rather than against it. After all, once a workflow is in place for documenting, that system can be replicated across many projects, reducing the work the next time around.

Moreover, since we don’t want to take shortcuts in documentation comprehensiveness, we need to take advantage of every time saver available. Make use of the codebook-making function in SPSS, for example (select Analyze > Reports > Codebook from your menu); use automatic TEI-XML global element generators like this one from the University of Rochester; when documenting computer code through commenting, read through style guides like this one for Python and implement their recommendations as you go so that it becomes second nature; and don’t hesitate to grab templates like this one from Cornell University for readme files. Data documenting is a social experience (you want others to immediately comprehend your codebook), not an exercise in authoring unique gems that only you can decipher–so templates are great.

Lastly, make sure your documentation is going to last as long as your data. It does no good to go through all the trouble of building a nice software-agnostic, cross-platform readable file only to provide documentation that no software will be able to read in five years. Use helpful formats like PDF, TXT, and even HTML that have staying power!
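As a rough illustration of documentation that stays readable, a plain-text codebook can be generated straight from a CSV header. This is a minimal sketch in Python, not a replacement for the SPSS codebook function or the templates mentioned above; the function name, the sample columns, and the descriptions are hypothetical.

```python
import csv
import io


def build_codebook(csv_text, descriptions):
    """Return a plain-text codebook with one line per column.

    Columns without an entry in `descriptions` are flagged as UNDOCUMENTED
    so that gaps in the documentation are easy to spot.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader, None) or {}  # used only to show an example value
    lines = ["CODEBOOK", ""]
    for name in reader.fieldnames or []:
        description = descriptions.get(name, "UNDOCUMENTED")
        example = first_row.get(name, "")
        lines.append(f"{name}: {description} (example: {example})")
    return "\n".join(lines) + "\n"


# Hypothetical sample data and descriptions for illustration.
sample = "id,age\n1,34\n"
codebook = build_codebook(sample, {"id": "Participant identifier"})
print(codebook)
```

The resulting string can be saved as a .txt file next to the data, and any column still marked UNDOCUMENTED is a reminder of work left to do.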

Love Your Data Week 2017: Defining Data Quality

Nicholas Wolf — Mon, 13 Feb 2017 21:27:08 +0000

Love Your Data (#LYD17) Week kicks off today with a subject that we all struggle with when working with our research data: how do we define and establish data quality? One of the key ways we can set standards for data quality is by considering how any given dataset meets the goals for its intended use. That may mean high standards for completeness (no missing values), but it might also mean something else, such as establishing accuracy of captured values (while recording missing values systematically), setting criteria for validity (such as a confidence interval), or enabling further verification by future researchers by providing excellent documentation to accompany the data.
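To make the completeness standard concrete, here is a minimal Python sketch that reports the share of missing values in each column of a CSV file. The set of markers treated as missing and the sample data are assumptions for illustration; your own dataset may code missing values differently.

```python
import csv
import io

# Assumed markers for "missing"; adjust to your own coding scheme.
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def missing_value_report(csv_text):
    """Return {column name: fraction of rows in which the value is missing}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None; blank or marker values also count as missing.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: counts[name] / total for name in counts}


# Hypothetical sample data for illustration.
sample = "id,age,county\n1,34,Kings\n2,NA,Queens\n3,,Bronx\n4,51,\n"
report = missing_value_report(sample)
for column, share in report.items():
    print(f"{column}: {share:.0%} missing")
```

Recording missing values with a consistent marker, as the paragraph above suggests, is what makes a simple count like this meaningful.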

Sometimes, it is also useful to set parameters for good data by considering what constitutes bad data. If you haven’t come across this resource, it is always useful to consult the “Quartz Guide to Bad Data,” an extensive (maybe even exhaustive!) list of commonly found elements of bad data. Or check out this collection of “how not to do data” examples.

All too often, we also think of good data as data that is suitable for the research question at hand (meaning it is crafted and understandable by the single researcher and sometimes the single researcher only). But well-attested data can also be data that is attuned to community standards–and not just community standards within a single discipline, but across the wider data community. Selecting helpful authorities and formats for the way data is represented, an approach that can encompass everything from deploying a specific set of terms from a curated medical terminology to utilizing established formats for geographic locations, can boost the quality of one’s data by asserting specificity and standardization.

No matter how data quality is conceived, remember that achieving it very often requires peer/community review, so it is always useful to bring colleagues into the conversation!

Love Your Data Week 2017

Nicholas Wolf — Thu, 13 Oct 2016 16:11:21 +0000

Love Your Data (LYD) Week has arrived, and we have more reasons than ever to love our research data this year. We dedicate a lot of effort and expend a considerable amount of resources to building our data over many years, making it essential that we take stock of how we will ensure our digital research materials have a long life for our own use and the use of our fellow scholars. But this year loving our data takes on an extra sense of urgency as we contemplate the vulnerability of data at a time when not all organizations and agents share that goal of making the results of scholarly endeavor transparent and available.

This year Love Your Data Week encompasses the following daily themes, each dedicated to an important aspect of shepherding your data through the research lifecycle:

Monday, February 13: Defining Data Quality
Tuesday, February 14: Documenting, Describing, Defining
Wednesday, February 15: Good Data Examples
Thursday, February 16: Finding the Right Data
Friday, February 17: Rescuing Unloved Data

Nearly one hundred universities, data repositories, and data providers are participating in LYD2017 this year. You’ll find tips and information in conjunction with each day’s theme here on the Data Dispatch. Follow the discussion on Twitter using #LYD17 or #loveyourdata. And this year, we’ve added a full session at week’s end dedicated to prepping and cleaning data. Sign up for “Love Your Data; a Prepping and Cleaning Data Workshop” on Friday, February 17, from 1-3:00 pm.