5 Things I Learned At … iPres 2019! https://data-services.hosting.nyu.edu/5-things-i-learned-at-ipres-2019/ Thu, 26 Sep 2019 15:30:24 +0000

Last week I was lucky enough to head to Amsterdam for iPres 2019 — the International Conference on Digital Preservation (iPres is shorthand for both the conference and its organizing body).

This year iPres was held at the EYE Film Museum, the national museum for film located on Amsterdam’s IJ harbour. The conference program included fun activities such as Hackathons and the “Great Digital Preservation Bakeoff” alongside more traditional offerings like a poster and demo session, panels, and paper presentations.

This year at iPres I did a fair amount of presenting — I had a panel describing a qualitative study on workplace dissatisfaction amongst digital preservation practitioners (here are the collaborative notes), a poster for the IASGE project (view the poster online), and a paper on a project to archive data journalism (read it on the LIS Scholarship Archive). Even though I was quite busy with all the preparations, the great spread of the program meant I was still able to attend some wonderful sessions.

So without further ado, here are the 5 things I learned at iPres 2019!

  1. Memento Tracer: a framework for scalable, high-quality web archiving. Martin Klein led a workshop on Monday afternoon, and you can read the collaborative notes here. Memento Tracer has 3 essential parts: a browser extension that records Traces (a set of instructions for capturing the essence of web publications of a certain class, like slides on SlideShare or GitHub repositories); a repository where anyone can upload, download, and reuse Traces (this is great, because it means Traces can be versioned and no one has to reinvent the wheel!); and a headless browser process that uses a Trace as guidance while it navigates and captures web publications (so if we have one working Trace for SlideShare, we can use it for every SlideShare deck we want to capture!). A rough sketch of what a Trace-driven capture can look like appears after this list. Martin explained that Memento Tracer can be used in conjunction with ORCID to track researchers across all the platforms they use for scholarship and preserve their work. You can see examples of how this works for 16 test (but real!) researchers at: https://myresearch.institute.
  2. SARA – Software Archiving of Research Artifacts: this was the poster next to our IASGE poster, and it has a very similar goal of preserving academic code! The goal of SARA is to “enable [researchers] to capture the intermediate statuses of their research work already during the process […] The collected research data and the different versions of the associated software tools are therefore traceable for later research.” Right now the requirement for capturing Git repositories is that they must live in GitLab (which I love!), but I’m going to keep my eye on this project for next steps. A generic sketch of snapshotting a Git repository for preservation also appears after this list.
  3. The Universal Virtual Interactor (UVI): part of the Emulation as a Service Infrastructure ecosystem, the UVI is a program that lets users click on a file (like a CAD file from 1990) and have it open in the original program and computing environment (like AutoCAD 1990 in the appropriate old version of Windows!) using an emulator, right in their browser! It also lets users click around and interact (hence the name) with the files and the old operating systems/software. It is designed to work for any file, though files that can be read by many programs (like .txt files) can be tricky — which one should the UVI choose? Check out the gif below, which demonstrates clicking a link to automatically open a Microsoft Works file running in Windows 98 within a web browser (from the UVI DPC blog post): Universal Virtual Interactor opening a Microsoft Works file in Windows 98!
  4. The file formats most present in the Library of Congress’s collections! By far, file extensions for images (.jpg, .tif, .jp2) are the most prevalent in LOC’s digital collections, followed by extensions typically associated with documents (.txt, .pdf, and .xml). Even though those are the most common by count, GZip files (.gz) dominate the collection by size. There are roughly 3,937.79 TB worth of .gz files!
  5. Setting Up Open Access Repositories: Challenges and Lessons from Palestine: sadly, I wasn’t able to attend this presentation (it overlapped with one of mine!), but I was really intrigued by the content of the paper, which describes a “holistic approach for deploying open access repositories and building research data management services,” using four Palestinian universities as case studies. Given that some of our data and information literacy classes in the library examine the digital occupation of the Gaza Strip, I am also interested to see how this scholarship might be included.
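To make the Trace idea from item 1 a bit more concrete, here is a minimal, illustrative Python sketch: a small set of declarative capture instructions interpreted by a headless browser. This is only my own toy simplification and is not the real Memento Tracer Trace format or API; the URL, CSS selector, and JSON structure below are all made up.

```python
# A toy illustration of the Trace idea: declarative capture instructions
# interpreted by a headless browser. NOT Memento Tracer's real format or API;
# the URL, selector, and JSON structure are invented for illustration.
# Assumes `pip install selenium` and a matching chromedriver on the PATH.
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

toy_trace = json.loads("""
{
  "class": "example-slide-site",
  "steps": [
    {"action": "visit", "url": "https://example.org/slides/123"},
    {"action": "follow_links", "css": "a.slide-page"}
  ]
}
""")

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

snapshots = []  # captured HTML for each page the trace led us to
for step in toy_trace["steps"]:
    if step["action"] == "visit":
        driver.get(step["url"])
        snapshots.append(driver.page_source)
    elif step["action"] == "follow_links":
        # Collect the link targets first, then visit and capture each one.
        hrefs = [a.get_attribute("href")
                 for a in driver.find_elements(By.CSS_SELECTOR, step["css"])]
        for href in hrefs:
            driver.get(href)
            snapshots.append(driver.page_source)

driver.quit()
print(f"Captured {len(snapshots)} page snapshots")
```

The appeal of the real framework is exactly that these instructions live in a shared, versioned repository, so one good Trace for a platform can be reused by everyone.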
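And a rough companion for item 2: SARA’s actual pipeline is its own GitLab-integrated service, but the basic move of snapshotting a Git repository with its full history is easy to sketch generically. The repository URL and paths below are placeholders, and none of this is SARA’s code.

```python
# Generic sketch: snapshot a Git repository (all branches, tags, and history)
# into a single self-contained bundle file that could go to a preservation
# system. Assumes git is installed; the URL and paths are placeholders.
import subprocess
from pathlib import Path

def archive_repo(repo_url: str, work_dir: Path, bundle_path: Path) -> Path:
    work_dir.mkdir(parents=True, exist_ok=True)
    mirror_dir = work_dir / "mirror.git"
    # A mirror clone copies every ref (branches, tags, notes), not just the default branch.
    subprocess.run(["git", "clone", "--mirror", repo_url, str(mirror_dir)], check=True)
    # A bundle packs the whole repository into one file that `git clone` can later read.
    subprocess.run(
        ["git", "bundle", "create", str(bundle_path), "--all"],
        cwd=mirror_dir,
        check=True,
    )
    return bundle_path

if __name__ == "__main__":
    archive_repo(
        "https://gitlab.example.org/lab/analysis-code.git",  # placeholder repository
        Path("/tmp/git-archive-sketch"),
        Path("/tmp/git-archive-sketch/analysis-code.bundle"),
    )
```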

5 Things I Learned at … IASSIST 2019 https://data-services.hosting.nyu.edu/5-things-i-learned-at-iassist-2019/ Mon, 12 Aug 2019 15:30:02 +0000

At the end of May, I headed to Sydney, Australia for this year’s IASSIST conference! IASSIST is a great organization that puts on an awesome conference, and I try to attend as often as my schedule allows. This year, I was also able to present some work on ReproZip and ReproServer (slides on the OSF!), as well as participate in a panel about promoting reproducibility as a partner in campus-wide efforts.

While I was there, I learned about some great initiatives, projects, and tools! So without further ado, here are the 5 things I learned at IASSIST 2019:

  1. IASSIST Qualitative Social Science and Humanities Data Interest Group — formed in 2016, this interest group is meant to generate conversations around the needs of researchers who work with qualitative data and methods, and the types of services librarians and other information professionals can develop to support them. I had never been to a special interest group meeting at IASSIST before, but this was a great first one! It was very well organized and I loved hearing about the group’s agenda.
  2. University of Washington Libraries Data Services RDM MOOC — the Data Services team at UW pulled from existing RDM curricula and expertise in the library, editing the materials down into a Massive Open Online Course scoped as a non-credit, 4-day class with about a 1 hour/day time commitment and a 1:8 tutor-to-student ratio. The course was offered once each in winter, spring, and summer, with attendance ranging from 40 to 110. Learners in the MOOC overwhelmingly reported both that the class was very clear about its contents (expectation management!) and that it exceeded expectations. A nice model for a global campus to learn from, too!
  3. ipysheet – a neat implementation for Jupyter notebooks and JupyterLab that lets folks edit spreadsheets right in the notebook! The add-on makes a widget that can be embedded in a code cell (see the short sketch after this list). Try out a live version of it on Binder: https://mybinder.org/v2/gh/QuantStack/ipysheet/master?filepath=docs%2Fsource%2Findex.ipynb
  4. Cornell Institute for Social and Economic Research (CISER) houses R-squared, a research verification service at Cornell University. The staff at R-squared aims to reproduce the claims from a paper or report by taking the data and code and rerunning it to check whether the reported output is valid. The staff then creates an archive package with all information in one place and deposits it into their data archive. On their site, they say the average review takes 4-8 hours. They do not check methods, replicate studies, question conclusions/theories, or make any direct changes to patrons’ work.
  5. QAMyData, an open source data quality assurance tool for SPSS, Stata, and SAS files, written in the Rust programming language (one of my faves!). The speakers mentioned that CSV files are partially supported as well. QAMyData aims to “automatically detect some of the most common problems in survey and other numeric data and creates a ‘data health check’, assisting with the clean up of data and providing an assurance that data is of a high quality.” It’s in a very alpha state, so I look forward to seeing this grow and add support for open formats!
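To give a feel for item 3, here is a minimal ipysheet sketch. It assumes ipysheet is installed (pip install ipysheet) and that you are running it inside a Jupyter notebook or JupyterLab cell with widgets enabled; the cell values are just made-up examples.

```python
# Run inside a Jupyter notebook/JupyterLab cell with ipysheet installed.
import ipysheet

# Create a small editable grid that renders as a widget in the output area.
sheet = ipysheet.sheet(rows=3, columns=2)

# Pre-fill a couple of cells; edits made in the rendered widget are reflected
# back in these cell objects' .value attributes.
label_cell = ipysheet.cell(0, 0, "temperature")
value_cell = ipysheet.cell(0, 1, 21.5)

sheet  # the sheet as the cell's last expression displays the widget
```

ipysheet also ships from_dataframe and to_dataframe helpers for round-tripping with pandas, which is handy when you want to hand-edit a few values mid-analysis.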
Five Things We Learned At . . . SciPy2017 https://data-services.hosting.nyu.edu/five-things-we-learned-at-scipy2017/ Thu, 13 Jul 2017 21:36:52 +0000

So I’m at SciPy 2017 (I had a talk on ReproZip accepted – slides) and I learned about some amazing open source tools for research! This year, SciPy was held in Austin, Texas from July 10-16, 2017. It was the 16th annual Scientific Computing with Python conference, and it focused on great new tools and methods for research with Python.

These are my top 5 favourite takeaways from SciPy 2017:

  1. SciSheets: Anyone who knows me knows that I really can’t stand Excel. It encodes your data weirdly, and it is such a black box that it causes more errors in research than it ever helps analysis. This is why I was pumped to see a session on building a better spreadsheet – one that combines programming with the simplicity of spreadsheets. SciSheets is a web application that allows users to run Python expressions or scripts in a spreadsheet, and also to export a spreadsheet as a standalone Python program! You can find a demo video here!
  2. nbgrader: This is a phenomenal application for assignment management and grading in Jupyter notebooks. nbgrader guides the instructor through assignment and grading tasks using the familiar Jupyter notebook interface, and it’s made up of a few Jupyter notebook extensions. The formgrader extension lets instructors use nbgrader’s functionality to generate student versions of assignments (including releasing them to students), collect submissions, and grade them automatically or manually. Students just work in the notebook and submit! You can read more at the GitHub repo.

    nbgrader workflow from presentation at SciPy 2017
  3. Dataflow: This extension to Jupyter notebooks answers the questions “how can a notebook be structured so rewriting isn’t necessary?” and “how can cells in a notebook be linked more robustly?” Their solution was to make cell IDs persistent, similar to UUIDs, which lets users reference previous outputs much more powerfully. You can see the slides from SciPy here.
  4. The Journal of Open Source Software: OK, I didn’t just learn about JOSS (I have a paper there!) but it’s still one of my favourite things. It’s an open source journal for software. Developers just have to write a short paper (a roughly 2-paragraph Markdown file with some references and an image) and have their code available for review on GitHub. Reviewers look at the source code and test it out before acceptance. From their website: “The Journal of Open Source Software (JOSS) is an academic journal with a formal peer review process that is designed to improve the quality of the software submitted.” It’s a great way for developers in academia to get their work reviewed, and get credit for their excellent software.
  5. Elegant SciPy book: Written by Juan Nunez-Iglesias (@jni), Harriet Dashnow (@hdashnow), and Stéfan van der Walt (@stefanv), and published by O’Reilly Media, this fully free and open book focuses on the foundations of scientific Python. You can download the book from the GitHub repository as Markdown or as executable Jupyter notebooks. Great work opening the book up in a machine-readable and executable format!

Five Things We Learned At . . . IASSIST 2017 https://data-services.hosting.nyu.edu/five-things-we-learned-at-iassist-2017/ Fri, 26 May 2017 21:23:12 +0000

I just got back from IASSIST 2017 and I have to say…I was very impressed! This year, IASSIST (The International Association for Social Science Information Services & Technology) was held in Lawrence, Kansas from May 23-26, 2017. True to its name, this conference brought people from all around the world.

These are my top 5 favourite takeaways from IASSIST 2017:

  1. An interesting project recently published in PLoS One, Research data management in academic institutions: A scoping review, was presented as a poster during the conference. This was essentially a systematic review designed to describe the volume, topics, and methodologies of existing scholarly work on research data management in academia. They looked at 301 articles out of an original 13,002 titles, and they made the data (the text, methods, etc.) available on Zenodo: Dataset for: Research data management in academic institutions: a scoping review!
  2. Packrat: a dependency manager for R that looks to solve the problem of “dependency hell” — software depends on other packages to run, those packages change all the time with little warning, and those changes can break existing code. Packrat works by making a project-specific package library, rather than relying on R’s shared system library (where packages are updated as new releases come out). This means R code can be packaged up together with its dependencies. However, it doesn’t pack up the version of R itself, which can pose problems.
  3. Sam Spencer of the Aristotle metadata registry gave a great talk about work done in the open metadata space, with a strong use case: government data hosted on data.gov.au. He shocked the crowd by keeping metadata in CSV format. He asks users for 10 basic fields of metadata in CSV form — and there it stays! He mentioned he was scared to admit this to this crowd, but it has yielded good things for him, including data linkages without explicitly doing linked data. He spoke specifically about using this for geo-metadata; you can check out how it’s worked out on this map. (For a rough picture of the flat-CSV approach, see the sketch after this list.)
  4. One of the more interesting talks I went to was about digital preservation of 3D data! The speaker laid out 5 methods of creation: freeform (like CAD), measurement, observation, “mix,” and algorithm/scanning or photogrammetry. 3D data is difficult to preserve mainly because of a lack of standards, particularly metadata standards. The speaker presented a case study that used Dublin Core as a basis for metadata for the Awash National Park Baboon Research Project’s 3D data.
  5. The Digital Curation Network gave an update on their initial planning grant. The DCN lets universities share staff for data curation, which is often too much work for a single local data curator. The first grant allowed six universities to test how local curation practices translate into a network practice. The next phase includes implementation of the network, during which time other institutions can join. The network also came up with a centralized set of curation steps:
    1. Check data files and read documentation
    2. Understand/try to understand the data
    3. Request missing information or changes
    4. Augment the submission with metadata
    5. Transform file format for reuse and long-term preservation
    6. Evaluate and rate the overall submission using the FAIR principles
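Since I found the flat-CSV approach from item 3 so striking, here is a tiny sketch of what keeping dataset metadata as plain CSV can look like. The ten field names are my own generic guesses for illustration, not the actual fields the data.gov.au/Aristotle workflow collects.

```python
# Illustrative only: write one dataset-description record to a flat CSV file.
# The field names and values are generic placeholders, not a real schema.
import csv

FIELDS = [
    "identifier", "title", "description", "keywords", "publisher",
    "contact_email", "license", "spatial_coverage", "temporal_coverage",
    "landing_page",
]

record = {
    "identifier": "dataset-0001",
    "title": "Example water quality readings",
    "description": "Placeholder record showing the flat CSV approach.",
    "keywords": "water;quality;example",
    "publisher": "Example Agency",
    "contact_email": "data@example.org",
    "license": "CC-BY-4.0",
    "spatial_coverage": "Australia",
    "temporal_coverage": "2016",
    "landing_page": "https://data.example.org/dataset-0001",
}

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(record)
```

The design appeal is that a table this simple can be versioned, diffed, and edited by anyone, which is a lot of the value Sam described.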

Funding Opportunity: Digging into Data Challenge https://data-services.hosting.nyu.edu/funding-opportunity-digging-into-data-challenge/ Tue, 05 Apr 2016 15:51:15 +0000

The Institute of Museum and Library Services (IMLS), in partnership with 15 other national funding agencies representing 11 countries on three continents, has announced the fourth round of the “Digging into Data Challenge.”

The Digging into Data Challenge has awarded close to $12 million in funding for international research projects that use big data sources and methodologies to answer questions in the social sciences and humanities.

This year, the grant competition is presented under the guidance of the Trans-Atlantic Platform (T-AP), a consortium of 16 funders of social science and humanities from Europe, South America, and North America. U.S. agencies include IMLS, the National Endowment for the Humanities, and the National Science Foundation.

This funding opportunity is open to international projects that consist of teams from at least three member countries, and must include partners from both sides of the Atlantic. Projects must address a research question in humanities and/or social sciences disciplines by using large-scale, digital data analysis techniques, and show how these techniques can lead to new insights. Research partners will receive funding from their own national funding agencies for projects that can last for up to 36 months.

The deadline for final applications is June 29, 2016.
