For this section, you will begin by counting the contents of files using the Unix shell. * The Unix shell can be used to quickly generate counts from across files, something that is tricky to achieve using the graphical user interfaces of standard office suites.
To get started, use the cd command to navigate to the directory that contains our data. * Remember, if at any time you are not sure where you are in your directory structure, type pwd and hit enter.
pwd
/Users/rotsuji/desktop/libcarp-data-notes/data
Type ls -F and then hit enter. This prints, or displays, a list of files and directories in the current directory
ls -F
2014-01-31_JA-africa.tsv* everything-together.txt
2014-01-31_JA-america.tsv* firstdir/
2014-01_JA.tsv gallic.txt*
2014-01_JA.tsv.zip gulliver-backup.txt*
2014-02-01_JA-art .tsv* gulliver-twice.txt
2014-02-02_JA-britain.tsv* gulliver.txt*
829-0.txt*
The files in this directory includes a zipeed and unzipped version of the dataset 2014-01_JA.tsv that contains journal article metadata. We unpacked this file at the beginning of the workshop.
Looking at the other data files, there are also four .tsv files derived from 2014-01_JA.tsv. * Each of these four includes all data where a keyword such as africa or america appears in the ‘Title’ field of 2014-01_JA.tsv.
.TSV files are those in which within each row the units of data (or cells) are separated by tabs. * They are similar to CSV (comma separated value) files were the values are separated by commas. * The latter are more common but can cause problems with the kind of data we have, where commas can be found within the cells (though with the right encoding this can be overcome). * Either way both can be read in simple text editors or in spreadsheet programs such as Libre Office Calc or Microsoft Excel.
Before we begin working with these files, lets double check we are in the right directory in which they are stored.
pwd
/Users/rotsuji/desktop/libcarp-data-notes/data
Next we want to Count the contents of the files. * The Unix command for counting is wc. * Type wc -w 2014-01-31_JA-africa.tsv and hit enter.
wc -w 2014-01-31_JA-africa.tsv
511261 2014-01-31_JA-africa.tsv
The file contains 511261 words. * The flag -w combined with wc instructs the computer to print a word count, and the name of the file that has been counted, into the shell. * As was seen earlier today flags such as -w are an essential part of getting the most out of the Unix shell as they give you better control over commands.
If your reader request or piece of work is more concerned with number of entries (or lines) than the number of words, you can use the line count flag. * Type wc -l 2014-01-31_JA-africa.tsv and hit enter.
wc -l 2014-01-31_JA-africa.tsv
13712 2014-01-31_JA-africa.tsv
Finally, type wc -c 2014-01-31_JA-africa.tsv and hit enter.
wc -m 2014-01-31_JA-africa.tsv
3773660 2014-01-31_JA-africa.tsv
This wc command uses the flag -c in combination with the command wc to print a character count for 2014-01-31_JA-africa.tsv
With these three flags, the most obvious thing we can use wc for is to quickly compare the shape of sources in digital format: * for example word counts per page of a book, the distribution of characters per page across a collection of newspapers, the average line lengths used by poets. * You can also use wc with a combination of wildcards and flags to build more complex queries.
Can you guess what the line **wc -l 2014-01-31_JA-a*.tsv** will do? (try running the command)
wc -l 2014-01-31_JA-a*.tsv
13712 2014-01-31_JA-africa.tsv
27392 2014-01-31_JA-america.tsv
41104 total
This wc command prints the line counts for 2014-01-31_JA-africa.tsv and 2014-01-31_JA-america.tsv, offering a simple means of comparing these two sets of research data.
It may be faster if you only have a handful of files to compare the line count for the two documents in Libre Office Calc, Microsoft Excel, or a similar spreadsheet program. * But when wishing to compare the line count for tens, hundreds, or thousands of documents, the Unix shell has a clear speed advantage.
As datasets increase in size you can use the Unix shell to do more than copy these line counts by hand, by the use of print screen, or by copy and paste methods.
Using the > redirect operator we used earlier you can export query results to a new file for example: * type **wc -l 2014-01-31_JA-a*.tsv > DATE_JA-a-wc.txt** (or something like that, the last bit after results/ could be anything!) and hit enter.
wc -l 2014-01-31_JA-a*.tsv > DATE_JA-a-wc.txt
Check to make sure the file was created.
ls -F
2014-01-31_JA-africa.tsv* DATE_JA-a-wc.txt
2014-01-31_JA-america.tsv* everything-together.txt
2014-01_JA.tsv firstdir/
2014-01_JA.tsv.zip gallic.txt*
2014-02-01_JA-art .tsv* gulliver-backup.txt*
2014-02-02_JA-britain.tsv* gulliver-twice.txt
829-0.txt* gulliver.txt*
This runs the same query as before, but rather than print the results within the Unix shell it saves the results as DATE_JA_a-wc.txt.
To check the results in the file * Type head DATE_JA-a-wc.txt to see the file contents in the shell (as it is 10 lines or fewer in length, all the file contents will be shown here).
head DATE_JA-a-wc.txt
13712 2014-01-31_JA-africa.tsv
27392 2014-01-31_JA-america.tsv
41104 total
As you can see, the redirect command is useful to organize and group wc results in a single file.