De-Identification by Design: Creating Ethical Data Derivatives with Python
Research and proprietary data often contain personally identifiable information, with variables that reveal details about the lives of individuals and that may have been collected without those individuals’ knowledge or consent. Datasets aggregated at the individual level often interest social science scholars, yet such data pose a risk of identification and create an ethical dilemma for curators.
While some types of information and data are legally protected, other social data, such as home mortgage files, voter registration files, and tax parcel records, are public and are often augmented with modeled indicators, such as religious belief or personal income, that may not represent the reality of people’s lives. Library information and data specialists must develop infrastructure, workflows, and policies to ensure the ethical stewardship and use of these datasets.
Workshop Learning Objectives
- Develop fluency with generating random samples in order to make analysis with large files more manageable
- Know how to assess the identification risk of specific variables within a dataset in order to protect the identity of human subjects
- Create a Jupyter Notebook workflow that enables cleaning, redacting, and sharing data for research use
- Learn some fundamental Pandas features for exploring, cleaning, and transforming data
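As a rough illustration of the first two objectives, here is a minimal sketch, not the workshop's actual notebook. It assumes a small made-up table; the column names (`zip_code`, `birth_year`, `income_band`) and the threshold `k` are hypothetical choices for demonstration only. It draws a reproducible random sample with Pandas, then counts how many records share each combination of potentially identifying variables.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
records = pd.DataFrame({
    "zip_code": ["10003", "10003", "10003", "10012", "10012", "11201"],
    "birth_year": [1980, 1980, 1980, 1975, 1975, 1990],
    "income_band": ["low", "high", "mid", "mid", "low", "high"],
})

# Draw a reproducible random sample (half of the rows) so that
# exploration of a large file stays manageable.
sample = records.sample(frac=0.5, random_state=42)

# Treat these columns as quasi-identifiers (an assumption for this sketch)
# and count how many rows share each combination of their values.
quasi_identifiers = ["zip_code", "birth_year"]
group_sizes = records.groupby(quasi_identifiers)["income_band"].transform("size")

# Rows in very small groups are easier to re-identify; flag any row
# whose combination is shared by fewer than k records.
k = 2
at_risk = records[group_sizes < k]

print(at_risk)
```

In this toy table, only the record with a unique combination of `zip_code` and `birth_year` is flagged. In practice, the choice of quasi-identifiers and of `k` depends on the dataset and on the curator's risk tolerance.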
Workshop Code of Conduct
This workshop is dedicated to being a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, or religion (or lack thereof). We will not tolerate harassment of participants in any form. Sexual language and imagery are not appropriate for any venue associated with this workshop. Workshop participants violating these rules may be sanctioned or expelled.
Workshop Presenters
- Andrew Battista (ab6137@nyu.edu)
- Katie Wissel (kam245@nyu.edu)
- Daniel Hickey (dh142@nyu.edu)
Acknowledgements
Thanks to our colleagues, Marii Nyrop and Vicky Rampin, for helping to set up this website and for advising us on the best steps for setting up the working environments.