class: center, middle # What is OpenRefine > A tool for working with ‘messy’ data ??? Most of this class will involve following along from a worksheet. Much of the time you will be encouraged to go along at your own pace. For some of you this will feel like a lot of material, for others it might not feel like enough. Remember that the session is introductory. If you finish early, our end time is not a hard stop, so you may of course leave. Alternatively, you might want to use the time to search online for more information or advanced skills guides. You may even wish to deepen your own skills by staying around and helping someone else out: there is nothing better for really getting to know something than teaching it to someone else! If you don't finish, don't worry, there are no prerequisites between classes and if you have time, you can always carry on at home. --- It can help you: * Get an overview of a data set * Resolve inconsistencies in a data set * Help you split data up into more granular parts * Match local data up to other data sets * Enhance a data set with data from other sources ??? Some common scenarios might be: * Where you want to know how many times a particular value appears in a column in your data * Where you want to know how values are distributed across your whole data set * Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format --- For example: | Data you have | Desired data | |-----------------|:-------------| | 1st January 2014| 2014-01-01 | | 01/01/2014 | 2014-01-01 | | Jan 1 2014 | 2014-01-01 | | 2014-01-01 | 2014-01-01 | ??? --- ####Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example: | Data you have | Desired data | |-----------------|:-------------| | London | London | | London] | London | | London,] | London | | london | London | --- ####Where you have several bits of data combined together in a single column, and you want to separate them out into individual bits of data with one column for each bit of the data. > Address in a single field: >'University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom' >"University of Wales", "Llyfrgel Thomas Parry Library", "Llanbadarn Fawr", "ABERYSTWYTH", "Ceredigion", "SY23 3AS", "United Kingdom" --- ####Where you want to add to your data from an external data source: | Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) | |-----------------|:-------------|:-------------| | Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 | | Rossetti, William Michael | 1829 | 1919 | | Prest, Thomas Peckett | 1810 | 1879 | --- ## Exercise 1: Importing data
--- ## Basic Operations * Layout * Sorting data * Facets ??? * OpenRefine displays data in a tabular format. Each row will usually represent a 'record' in the data, while each column represents a type of information * OpenRefine only displays a limited number of lines of data at one time. You can adjust the number choosing between 5, 10 (the default), 25 and 50. * You can reord or drop columns by using the first column 'all' and choosing 'Edit columns->Re-order / remove columns …' * drag and drop columns ### Sorting data * Choose the drop down of the column you want to sort and choose sort * unlike excel, sorts are temporary, you can remove sort and the data will go back to original order * you can reverse sort ### Facets * A 'Facet' groups all the values that appear in a column, and then allow you to filter the data by these values and edit values across many records at the same time. * simplest facet is 'text facet' * To create a Text Facet for a column, click on the drop down menu at the top of the column and choose 'Facet -> Text Facet'. The facet will then appear in the left hand panel * filter by facet by clicking or 'including' * you can 'invert' your selections to exclude ### filters * you can also use text filters to filter data * text filter look for a specifice string ** when you have filtered the data displayed in OpenRefine, any operations you carry out will apply only to the rows that match the filter ** * more on facets:
* if there are too many unique facets OF will cough and it won't create the facet (like faceting on word on title) --- Exercise 3: Find all publications without a DOI
--- ## Changing Data through Facets * You can change the values for faceted data * Mouse over facet and click edit ??? This approach is useful in relatively small facets where you might have small variations through punctuation or typing errors etc. For example, a column that should contain only terms from a small restricted list such as days of the week or months of the year. --- ## Exercise 4: Correct the Language values via a facet
--- ## Rows and Records * OF has two modes: Rows and Records * We've been using rows so far: each row representing a single case or record in a dataset (an article in our case) * How this works can be seen in Exercise 5... --- ## Exercise 5: Split author names into separate cells
--- ## Clustering * Groups together values in a column that are 'similar' and enables you to merge together several different, but similar, values into a single value * To use the 'Cluster' function, click on the 'Edit Cells' menu option in the relevant column and choose 'Cluster and edit...' * Clusters are created using an algorithm & OF has multiple clustering algorithms -- experiment to see what does best * For more information on the methods used to create Clusters see
??? * This is very effective where you have data where there can be minor variations in data values that are likely such as names of people, organisations and places. --- ## Exercise 6: Use Clustering to clean up author data
--- ## Break --- ## Transforming data * Splitting data that is in a single column into multiple columns (e.g. splitting an address into multiple parts) * Standardizing the format of data in a column without changing the values (e.g. removing punctuation or standardizing a date format) * Extracting a particular type of data from a longer text string (e.g. finding ISBNs in a bibliographic citation) ??? Through facets, filters and clusters OpenRefine offers relatively straightforward ways of getting an overview of your data, and making changes where you want to standardise terms used to a common set of values. However, sometimes there will be changes you want to make to the data that cannot be achieved in this way. To support this type of activity OpenRefine supports 'Transformations' which are ways of manipulating data in columns. Transformations are normally written in a special language called 'GREL' (Google Refine Expression Language). To some extent GREL expressions are similar to Excel Formula, although they tend to focus on text manipulations rather than numeric functions. --- ## GREL * 'GREL' (Google Refine Expression Language). To some extent GREL expressions are similar to Excel Formula, although they tend to focus on text manipulations rather than numeric functions. * [https://github.com/OpenRefine/OpenRefine/wiki/Google-refine-expression-language](https://github.com/OpenRefine/OpenRefine/wiki/Google-refine-expression-language) * This tutorial covers only a small subset of the commands available. --- ## Common Transformations * Available through column drop downs Common Transformation | Action | GREL expression --------------------| ------------- | ------------- To Uppercase| Converts the current value to uppercase | ```value.toUppercase()``` To Lowercase| Converts the current value to lowercase | ```value.toLowercase()``` To Titlecase| Converts the current value to titlecase (i.e. each word starts with an uppercase character and all other characters are converted to lowercase) | ```value.toTitlecase()``` Trim leading and trailing whitespace | Removes any 'whitespace' characters (e.g. spaces, tabs) from the start or end of the current value | ```value.trim()``` --- ### Exercise 7: Correct Publisher data * Create a text facet on the Publisher column * Note that in the values there are two that look identical - why does this value appear twice? * On the publisher column use the dropdown menu to select 'Edit cells->Common transforms->Trim leading and trailing whitespace' * Look at the publisher facet now - has it changed? (if it hasn't changed try clicking the Refresh option to make sure it updates) --- ## Writing Transformations in GREL * Select the column on which you wish to perform a transformation and choose 'Edit cells->Transform…' * Note the box where you write the expression * Needs to be valid GREL (if you don't get a preview of the changed data it isn't GREL) --- ## GREL function syntax GREL functions are written by giving a value of some kind (a text string, a date, a number etc.) to a GREL function. Some GREL functions take additional parameters or options which control how the function works. GREL supports two types of syntax: * value.function(options) * function(value, options) ??? Either is valid, and which is used is completely down to personal preference. In these notes the first syntax is used. Next to the 'Preview' option are options to view: * History - a list of transformations you've previously used with the option to reuse them immediately or to 'star' them for easy access * Starred - a list of transformations you've 'starred' via the 'History' view * Help - a list of all the GREL functions and brief information on how to use them --- ## Exercise 8: Put titles into Title Case
--- ## Undo and Redo * OpenRefine lets you undo, and redo, any number of steps you have taken in cleaning the data * Feel free to try things out and undo/redo * Open OF and see how you can unwind the transformations * If you unwind to a point and then create new transformations from that point the 'grayed out' previous transformations will be lost ??? if you 'undo' a set of steps and then start doing new transformations, the greyed out steps will disappear and you will no longer have the option to 'redo' these steps. --- ## Undo and Refine -- Getting the JSON * Click extract JSON - save in file * You can apply work from someone else or share your own with others * Use the 'apply' button to do this --- ## Data Types in OF Every piece of data in OpenRefine has a 'type'. The most common 'type' is a 'string' - that is a piece of text. The data types supported are: * String * Number * Date * Boolean * Array ??? there are other data types available and transformations let you convert data from one type to another where appropriate So far we've been looking only at 'String' type data. Much of the time it is possible to treat numbers and dates as strings. For example in the Date column we have the date of publication represented as a String. However, some operations and transformations only work on 'number' or 'date' type operations. The simplest example is sorting values in numeric or date order. To carry out these functions we need to convert the values to a date or number first. --- ## Exercise 9: Reformat the Date
--- ## Booleans and Arrays * Booleans and Arrays are data types that are more often used while manipulating data in a GREL expression than for actually storing in a cell (in fact, Arrays cannot be stored in a cell in OpenRefine). * 'Boolean' is a binary value that can either be 'true' or 'false': ``` value.contains("test") ``` ??? Boolean often use to generate true and false in a test --- ## Arrays * An 'Array' is a list of values, represented in Refine by the use of square brackets containing a list of values surrounded by inverted commas and separated by commas. E.g.: ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] ??? * Can be sorted, de-duplicated, and manipulated in other ways in GREL expressions, but cannot appear directly in an OpenRefine cell. * Arrays in OpenRefine are usually the result of a transformation. For example the 'split' function takes a string, and changes it into an array based on a 'separator' --- ## Arrays For example if a cell has the value: "Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday" This can be transformed into an array using the 'split' function ``` value.split(",") ``` And would create; ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"] --- ##Arrays This can be combined with array operations like 'sort'. For example, assuming the cell contains the same value as above, then the function ``` value.split(",").sort() ``` would result in an array containing the days of the week sorted in alphabetical order: ["Friday","Monday","Saturday","Sunday","Thursday","Tuesday","Wednesday"] --- ??? To output a value from an array you can either select a specific value depending on its position in the list (with the first position treated as 'zero'). For example ``` value.split(",")[0] ``` would extract the first value --- ## Arrays You can also join arrays together to make a 'String'. The GREL expression would look like ``` value.split(",").sort().join(",") ``` Taking the above example again, this would result in a string with the days of the week in alphabetical order, listed with commas between each day. --- ## Exercise 10: Reverse author names
--- ## Looking up data from an URL * OF can retrieve data from the web * As an example, you can look up names against the Virtual International Authority File (VIAF), and retrieve additional information such as dates of birth/death and identifiers. * Typically this is a two step process - firstly a step to retrieve data from a remote service, and secondly to extract the relevant information from the data you have retrieved. * To retrieve data from an external source, from the drop down menu at a column heading use the option 'Edit column->Add column by fetching URLs'. ??? This will prompt you for a GREL expression to create a URL. Usually this would be a URL that uses existing values in your data to build a query. When the query runs OpenRefine will request each URL (for each line) and retrieve whatever data is returned (this may often be structured data, but could be simply HTML). The data retrieved will be stored in a cell in the new column that has been added to the project. You can then use OpenRefine transformations to extract relevant information from the data that has been retrieved. Two specific OpenRefine functions used for this are: * parseHtml() * parseJson() The 'parseHtml()' function can also be used to extract data from XML. The next exercise demonstrates this two stage process in full. --- ## Exercise 11: Retrieving journal details from CrossRef via ISSN
--- ## Reconciliation services * allow you to lookup terms from your data in OpenRefine against external services, and use values from the external services in your data. * Reconciliation services can be more sophisticated and often quicker than using the method described above to retrieve data from a URL * There are a few services where you can find an OpenRefine Reconciliation option available. For example WikiData has a (fledgling) reconciliation service at [https://tools.wmflabs.org/wikidata-reconcile/](https://tools.wmflabs.org/wikidata-reconcile/). ??? * However, to use the ‘Reconciliation’ function in OpenRefine requires the external resource to support the necessary service for OpenRefine to work with, which means unless the service you wish to use supports such a service you cannot use the ‘Reconciliation’ approach. * People have built reconciliation services * One of the most common ways of using the reconciliation option in OpenRefine is with an extension (see below for more on extensions to OpenRefine) can use linked data sources for reconciliation. The extension is called "RDF Refine" and can be downloaded from [http://refine.deri.ie](http://refine.deri.ie). --- ## Reconciliation services More info: * For more information on using Reconciliation services see [https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API](https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API) * There also exist extensions to do reconciliation against local data such as csv files (see [http://okfnlabs.org/reconcile-csv/](http://okfnlabs.org/reconcile-csv/)) and maintained lists of values (see [http://okfnlabs.org/projects/nomenklatura/index.html](http://okfnlabs.org/projects/nomenklatura/index.html)). --- ## Exercise 12: Reconcile Publisher names with VIAF IDs