Data Cleaning and Management Using OpenRefine


Vicky Steeves & Nick Wolf | February 12, 2019


What is OpenRefine?

OpenRefine, formerly Google Refine, is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it.

Why Use OpenRefine?

  • Simple, easy installation
  • Lots of great import formats: .tsv, .csv, XML, RDF Triples, JSON, Google Sheets, Excel
  • Upload from local drive or import from URL
  • Many export formats: .tsv, .csv, Excel, HTML table
  • Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life
  • Active development community

Installing OpenRefine

  • Make sure you have the most updated Java JRE installed
  • Download Open Refine at http://openrefine.org/download.html and place it in the directory you wish it to run from (i.e. in your program files folder)

Installing OpenRefine

  • Extract the zipped file.
  • Click on the OpenRefine executable

Installing OpenRefine

  • A local server will be launched on the desktop.

Installing OpenRefine

  • A browser window will open in your default browser to begin session. (Note: if a window does not open, open a new browser window and visit the URL http://127.0.0.1:3333/)

Opening and Starting a Project

  • Use your own data or try using this sample data extract available here. Click on Browse and select your data.
  • Click Next

Opening and Starting a Project

  • Preview the data. Ensure encoding is correct.
  • Check that OpenRefine recognizes character (comma, tab, etc.) being used to separate values in table.

Navigating the Data Window

  • Use drop-down arrows on column names to access tools.
  • Try functions like Text Filter to select certain row values

Creating a Text Facet to Clean/Edit Cell Values

  • Under Facet option of the column to be edited, select Text Facet.
  • View universe of cell values for that column. See a list of values by clicking on Choices

Creating a Text Facet to Clean/Edit Cell Values

  • Edit columns values in-line by hovering over a value and clicking on Edit.
  • Changes made will apply to all instances of that value in the datasheet.

Creating a Text Facet to Clean/Edit Cell Values

  • Use Cluster for more powerful cleaning of data entry errors and non-standardization problems such as capitalization, extra whitespace.

Creating a Text Facet to Clean/Edit Cell Values

  • Change Method and Keying Function to look for more groupings to merge and re-cluster.
  • Enter replacement values in New Cell Value field.

Splitting Columns

  • Select Edit Column >> Split into several columns.
  • Name the new column and choose a method for splitting original column (by separator, by whitespace, by character length).

Advanced Column Expressions: String Evaluations

  • Select Edit Column >> Add column based on this column.
  • Name the new column and enter an Expression that will transform the value of the original column cell into something new.

Advanced Column Expressions: String Evaluations

  • Example:
    endsWith(value, “Agent”)

    Examines the first parameter (here, value or the cell’s value to see if it ends with the string “Agent.” If it does, returns true, otherwise false.

Advanced Column Expressions: Geocoding

  • Under the column Street_address, select Edit Column >> Add Column by Fetching URLs.
  • Name the new column and enter in the Expression box:
    "http://maps.google.com/maps/api/geocode/json?address=" + escape(value, "url")

Advanced Column Expressions: Geocoding

  • Or copy and paste expression from this tutorial

Advanced Column Expressions: Geocoding

  • Once geocoding is complete, you’ll see a new column with a JSON response from Google containing coordinates for that street address.
  • Under the new JSON response column select Edit column >> Add column based on this column.

Advanced Column Expressions: Geocoding

  • In expression box, enter:
    with(value.parseJson().results[0].geometry.location, pair, pair.lat +", " + pair.lng)

Advanced Column Expressions BONUS: Jython

  • Like Python? Try setting language to Jython and entering whole Python script blocks in your expression field.

Thank you! Questions?


Email us: vicky.steeves@nyu.edu & nicholas.wolf@nyu.edu

Learn more about RDM: guides.nyu.edu/data_management

Get this presentation: guides.nyu.edu/data_management/resources

Make an appointment: guides.nyu.edu/appointment