Love Your Data (#LYD17) Week kicks off today with a subject that we all struggle with when working with our research data: how do we define and establish data quality? One of the key ways we can set standards for data quality is by considering how well any given dataset meets the goals of its intended use. That may mean high standards for completeness (no missing values), but it might also mean something else, such as establishing the accuracy of captured values (while recording missing values systematically), setting criteria for validity (such as a confidence interval), or enabling verification by future researchers through excellent documentation that accompanies the data.
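To make this a little more concrete, here is a minimal sketch of what "fitness for use" checks might look like for a tabular dataset. The file name, column name, and missing-value codes (measurements.csv, temperature_c, -999) are purely illustrative assumptions, not a prescribed standard; the sketch records missing values systematically at load time, reports column completeness, and computes a rough 95% confidence interval for one measured variable.

```python
import math
import pandas as pd

# Record missing values systematically at load time rather than leaving
# sentinel codes in place (here "", "NA", and "-999" are treated as missing).
df = pd.read_csv("measurements.csv", na_values=["", "NA", "-999"])

# Completeness: fraction of non-missing entries per column.
completeness = df.notna().mean()
print(completeness)

# Validity: a rough 95% confidence interval for one measured variable,
# assuming an approximately normal sampling distribution.
values = df["temperature_c"].dropna()
mean = values.mean()
sem = values.std(ddof=1) / math.sqrt(len(values))
print(f"95% CI for temperature_c: {mean - 1.96 * sem:.2f} to {mean + 1.96 * sem:.2f}")
```

Whether a 95% completeness rate or a tight confidence interval counts as "good" depends entirely on the intended use of the data, which is the point: the checks are easy to write once the goals are spelled out.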
Sometimes it is also useful to set parameters for good data by considering what constitutes bad data. If you haven’t come across it yet, the “Quartz Guide to Bad Data” is well worth consulting: an extensive (maybe even exhaustive!) list of commonly encountered elements of bad data. Or check out this collection of “how not to do data” examples.
All too often, we also think of good data simply as data that is suitable for the research question at hand (meaning it is crafted and understandable by the single researcher, and sometimes the single researcher only). But well-attested data can also be data that is attuned to community standards, and not just community standards within a single discipline, but across the wider data community. Selecting established authorities and formats for representing data, an approach that can encompass everything from drawing terms from a curated medical terminology to using standard formats for geographic locations, can boost the quality of one’s data by ensuring both specificity and standardization.
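As one illustration, the short sketch below validates free-text entries against a tiny stand-in for a curated terminology and normalizes geographic locations to signed decimal degrees (WGS 84). The vocabulary, function names, and example values are hypothetical assumptions made for the sake of the example; a real project would draw on an established vocabulary and format maintained by the relevant community.

```python
# A tiny stand-in for a curated terminology; in practice this would come
# from an established, community-maintained vocabulary.
CONTROLLED_TERMS = {"myocardial infarction", "hypertension", "type 2 diabetes mellitus"}

def validate_term(term: str) -> str:
    """Accept only terms drawn from the controlled vocabulary."""
    normalized = term.strip().lower()
    if normalized not in CONTROLLED_TERMS:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    return normalized

def format_location(lat: float, lon: float) -> str:
    """Represent a location as signed decimal degrees (WGS 84) at fixed precision."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    return f"{lat:+.5f},{lon:+.5f}"

print(validate_term("Hypertension"))           # hypertension
print(format_location(42.27756, -83.74088))    # +42.27756,-83.74088
```

The payoff is that anyone else working with the data, inside or outside the original discipline, can interpret the values without guessing at the conventions behind them.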
No matter how data quality is conceived, remember that achieving it very often requires peer/community review, so it is always useful to bring colleagues into the conversation!