class: center, middle

# Regular Expressions

???

* One reason consistent and predictable directory and file naming conventions are valuable is that they enable you to use the computer to select files based on the characteristics of their file names.
* So, for example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2014', then you can.
* Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files and then do something with them.

Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* files. A powerful means of selecting on these characteristics is to use regular expressions, often abbreviated to regex.

---
class: center, middle

# What are they?

???

A regular expression is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. Regular expressions are typically surrounded by `/` characters, though we will (mostly) ignore those for ease of comprehension.

---

# What do they do?

- Match on types of character (e.g. 'upper case letters', 'digits', 'spaces', etc.)
- Match patterns that repeat any number of times
- Capture the parts of the original string that match your pattern

???

As most computational software has regular expression functionality built in, and as many computational tasks in libraries are built around complex matching, it is a good place for Library Carpentry to start in earnest.

---
class: center, middle

# The regular expression
# `organi[sz]e`
# matches both "organise" and "organize"

???

But it would also find `reorganise`. So there is a range of special syntax that helps us be more precise.
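A quick way to see this is the minimal sketch below. It assumes Python's built-in `re` module, and the word list is made up purely for illustration.

```python
import re

# Pattern from the slide: the character class accepts either "s" or "z".
pattern = re.compile(r"organi[sz]e")

# Hypothetical sample words, chosen only for illustration.
words = ["organise", "organize", "reorganise", "organic"]

# re.search looks for the pattern anywhere in the string, so a longer word
# that merely contains the pattern still counts as a match.
matches = [word for word in words if pattern.search(word)]
print(matches)
```

`reorganise` is reported because nothing in the pattern says where the match has to start or stop.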
---

# Brackets

- `[ABC]` matches A or B or C
- `[A-Z]` matches any upper case letter
- `[A-Za-z0-9]` matches any upper or lower case letter or any digit (note: this is case-sensitive)

???

Square brackets can be used to define a list or range of characters to be found.

---

# Then there are:

- `.` matches any character
- `\d` matches any single digit
- `\w` matches any word character (equivalent to `[A-Za-z0-9_]`; note that the underscore is included)
- `\s` matches any space, tab, or newline
- `\` NB: this is also used to escape the following character when that character is a special character. So, for example, a regular expression that found `.com` would be `\.com` because `.` is a special character that matches any character.

---

# And

- `^` asserts the position at the start of the line. So what you put after it will only match the first characters of a line or contents of a cell.
- `$` asserts the position at the end of the line. So what you put before it will only match the last characters of a line or contents of a cell.
- `\b` adds a word boundary. Putting this on either side of a word stops the regular expression matching longer variants of words. So:
    - the regular expression `foobar` will match `foobar` and find `666foobar`, `foobar777`, `8thfoobar8th` et cetera
    - the regular expression `\bfoobar` will match `foobar` and find `foobar777`
    - the regular expression `foobar\b` will match `foobar` and find `666foobar`
    - the regular expression `\bfoobar\b` will find only `foobar`

---
class: center, middle

# So, what is `^[Oo]rgani.e\b` going to match?

???

---

# Other useful special characters are:

- `*` matches when the preceding character appears any number of times including zero
- `+` matches when the preceding character appears any number of times excluding zero
- `?` matches when the preceding character appears one or zero times
- `{VALUE}` matches the preceding character the number of times defined by VALUE; ranges can be specified with the syntax `{VALUE,VALUE}`
- `|` means or.
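???

The word-boundary and repetition rules on the last few slides can be checked with a short Python sketch. This is only an illustration, assuming Python's built-in `re` module. The helper `found` and the `colour` spellings are made up for the example, while the `foobar` strings are the ones used on the earlier slide.

```python
import re

def found(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [text for text in candidates if re.search(pattern, text)]

# Word boundaries, using the sample strings from the earlier slide.
foobars = ["foobar", "666foobar", "foobar777", "8thfoobar8th"]
plain = found(r"foobar", foobars)        # no boundaries: found in all four
starts = found(r"\bfoobar", foobars)     # a word must begin at "foobar"
ends = found(r"foobar\b", foobars)       # a word must end at "foobar"
whole = found(r"\bfoobar\b", foobars)    # "foobar" must be a whole word

# Repetition, using hypothetical spellings chosen for illustration.
spellings = ["color", "colour", "colouur", "colr"]
star = found(r"^colou*r$", spellings)         # zero or more "u"
plus = found(r"^colou+r$", spellings)         # one or more "u"
optional = found(r"^colou?r$", spellings)     # zero or one "u"
exactly = found(r"^colou{2}r$", spellings)    # exactly two "u"
ranged = found(r"^colou{1,2}r$", spellings)   # one or two "u"
either = found(r"^color$|^colr$", spellings)  # one alternative or the other

for name, hits in [("plain", plain), ("starts", starts), ("ends", ends),
                   ("whole", whole), ("star", star), ("plus", plus),
                   ("optional", optional), ("exactly", exactly),
                   ("ranged", ranged), ("either", either)]:
    print(name, hits)
```

Each quantifier applies only to the single `u` that precedes it, which is why `colr` is never matched by the first five spelling patterns.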
---
class: center, middle

# So, what are these going to match?

---
class: center, middle

# `^[Oo]rgani.e\w*`

---
class: center, middle

# `[Oo]rgani.e\w+$`

---
class: center, middle

# `^[Oo]rgani.e\w?\b`

---
class: center, middle

# `^[Oo]rgani.e\w?$`

---
class: center, middle

# `\b[Oo]rgani.e\w{2}\b`

---
class: center, middle

- `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`

???

This logic is super useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. Or for looking at cells in spreadsheets for certain values. Or for extracting some data from a column of a spreadsheet to make new columns. I could go on. The point is, it is super useful in many contexts.

To embed this knowledge we won't, however, be using computers. Instead we'll use pen and paper. I want you to work in teams of four to work through the exercises in the handout. I have an answer sheet over here if you want to check where you've gone wrong.

When you finish, I'd like you to split your team into two groups and write each other some tests. These should include a) strings you want the other team to write regex for and b) regular expressions you want the other team to work out what they would match. Then test each other on the answers.

If you want to check your logic, use [regex101](https://regex101.com/), [myregexp](http://myregexp.com/) or [regexper.com](http://regexper.com/): the first two help you see what text your regular expression will match, while the last visualises the workflow of a regular expression.
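After the pen-and-paper exercise, answers can also be checked locally. The sketch below is one hedged way to do that, assuming Python's built-in `re` module. The six patterns are the ones from the slides; the sample strings are hypothetical and can be swapped for whatever the teams come up with.

```python
import re

# The six patterns from the exercise slides.
patterns = [
    r"^[Oo]rgani.e\w*",
    r"[Oo]rgani.e\w+$",
    r"^[Oo]rgani.e\w?\b",
    r"^[Oo]rgani.e\w?$",
    r"\b[Oo]rgani.e\w{2}\b",
    r"\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b",
]

# Hypothetical test strings, chosen only for illustration.
samples = ["Organise", "organizer", "Organisers", "reorganised"]

# For every pattern, collect the samples in which it is found.
results = {
    pattern: [text for text in samples if re.search(pattern, text)]
    for pattern in patterns
}

for pattern, hits in results.items():
    print(f"{pattern:45} {hits}")
```

Note that `re.search` treats each sample as one line, so `^` and `$` here mean the start and end of the whole string.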