pwd
print working directory
ls
list directory
-l
: list file information
-lh
: list human-readable file information
cd
change directory
mkdir
make directory
cat
send file or files to output (in most cases, print to shell)
head
output first parts of a file or files
tail
output last parts of a file or files
mv
rename or move a file or files. Syntax for renaming a file: mv FILENAME NEWFILENAME
cp
copy a file or files. Syntax: cp FILENAME NEWFILENAME
>
redirect output. Syntax with cat
: cat FILENAME1 FILENAME2 > NEWFILENAME
rm
remove a file or files. NB: USE WITH CAUTION!!!
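A short worked sequence tying several of these together; the directory and file names (my-backup, notes.txt) are only hypothetical examples:
pwd
mkdir my-backup
cp notes.txt my-backup/notes-copy.txt
mv my-backup/notes-copy.txt my-backup/notes-old.txt
head my-backup/notes-old.txt
This prints the current directory, makes a new directory, copies notes.txt into it, renames the copy, and prints the first lines of the renamed file.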
?
a placeholder for one character or number
*
a placeholder for zero or more characters or numbers
[]
defines a class of characters
Examples
foobar?
: matches seven-character strings starting with foobar and ending with one character or number
foobar*
: matches strings starting with foobar and ending with zero or more further characters or numbers
foobar*.txt
: matches strings starting with foobar and ending with .txt
[1-9]foobar?
: matches eight-character strings that start with a number, have foobar after the number, and end with any character or number
wc
word count
-w
: count words
-l
: count lines
-c
: count characters (or -m for Mac users)
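For example, assuming you are in the directory that holds gulliver.txt (the file used elsewhere in this lesson), each flag can be tried on it:
wc -l gulliver.txt
wc -w gulliver.txt
wc gulliver.txt
The last form, with no flag, prints lines, words, and bytes together.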
grep
global regular expression print
-c
: displays counts of matches for each file
-i
: match with case insensitivity
-w
: match whole words
-v
: exclude match
--file=FILENAME.txt
: use the file FILENAME.txt as the source of strings used in the query
grep exercise
Work on this exercise with the person next to you: select a word to search for and use what you have learnt to do the following:
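The tasks themselves are not listed here, but as an illustration of the flags above, suppose the word you pick is lilliput (just an example):
grep -c lilliput gulliver.txt
grep -i -c lilliput gulliver.txt
grep -i -w lilliput gulliver.txt
The first counts the lines containing the exact lowercase string, the second repeats that count case-insensitively, and the third prints the case-insensitive whole-word matches themselves.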
Grabbing a text, cleaning it up
Head to .../libcarp-data-notes/data. You're going to work again with the gulliver.txt file we saw earlier.
The sed command lets you edit the contents of a file as they stream through the shell. This can be used to remove all the header and footer information that Project Gutenberg adds before and after a text.
Type sed '9352,9714d' gulliver.txt > gulliver-nofoot.txt and hit enter.
The command sed in combination with the d value will look at gulliver.txt and delete all lines between the rows specified. The > operator then redirects this edited text to the new file specified.
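An optional check with a command you already know:
wc -l gulliver.txt gulliver-nofoot.txt
The new file should be 363 lines shorter than the original, since rows 9352 to 9714 inclusive were deleted.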
Now type sed '1,37d' gulliver-nofoot.txt > gulliver-noheadfoot.txt and hit enter. This does the same as before, but for the header.
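Again, an optional check:
head gulliver-noheadfoot.txt
This should now show the opening of the text itself rather than the Project Gutenberg header.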
You now have a cleaner text. The next step is to prepare it even further for rigorous analysis.
The tr command is used for translating or deleting characters. Type tr -d '[:punct:]' < gulliver-noheadfoot.txt > gulliver-noheadfootpunct.txt and hit enter (the single quotes stop the shell from treating the square brackets as a wildcard pattern).
This uses the translate command and a special character-class syntax to remove all punctuation. It also requires the use of both the output redirect > we have seen and the input redirect < we haven't seen.
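To see the effect (optional), compare the opening lines of the two files:
head -n 5 gulliver-noheadfoot.txt
head -n 5 gulliver-noheadfootpunct.txt
The second should read the same as the first, only with commas, full stops, apostrophes and other punctuation marks removed.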
Finally, regularise the text by removing all the uppercase lettering. Type tr '[:upper:]' '[:lower:]' < gulliver-noheadfootpunct.txt > gulliver-clean.txt and hit enter.
Open the gulliver-clean.txt file in a text editor. Note how the text has been transformed, ready for analysis.
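You can also peek at it from the shell instead of a text editor:
head gulliver-clean.txt
tail gulliver-clean.txt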
Pulling a text apart, counting word frequencies
You are now ready to pull the text apart.
Type tr ' ' '\n' < gulliver-clean.txt > gulliver-linebyline.txt and hit enter. Note: there is a space between the first two single quotation marks.
This uses the translate command again, this time to translate every blank space into \n, which renders as a new line. Every word in the file will now have its own line.
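An optional check:
head gulliver-linebyline.txt
wc -l gulliver-linebyline.txt
The first shows one word per line; the second gives a line count that only approximates the word count of gulliver-clean.txt, because blank lines appear wherever there were extra spaces or empty lines.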
This isn't much use, so to get a better sense of the data we need to use another new command called sort. Type sort gulliver-linebyline.txt > gulliver-ordered.txt and hit enter.
This script uses the sort command to rearrange the text from its original order into an alphabetical configuration. Open the file in a text editor and, after scrolling past some blank space, you will begin to see some numbers and finally words, or at least lots of copies of 'a'!
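A quick way to see both ends of the sorted file without opening it:
head gulliver-ordered.txt
tail gulliver-ordered.txt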
This is looking more useful, but you can go one step further. Type uniq -c gulliver-ordered.txt > gulliver-final.txt and hit enter.
This script uses uniq, another new command, in combination with the -c flag to collapse repeated lines into one and prefix each with a count of how many times that word appeared.
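If you want to go one step further (an optional extra, not required for this walkthrough), sort can also order those counts numerically; gulliver-ranked.txt is just a hypothetical name:
sort -n -r gulliver-final.txt > gulliver-ranked.txt
head gulliver-ranked.txt
The -n flag sorts numerically rather than alphabetically and -r reverses the order, so the most frequent words appear first.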
You have now taken the text apart and produced a count for each word in it. Congratulations!
Before you finish, take a look at the data: do you notice anything odd? And if so, what might be at fault and how might you fix it?