2  Unix

We are going to use Unix to create and prepare a directory for a data analysis project.

2.1 Naming convention

In general you want to name your files in a way that is related to their contents and specifies how they relate to other files. The Smithsonian Data Management Best Practices has “five precepts of file naming and organization” and they are:

  • Have a distinctive, human-readable name that gives an indication of the content.
  • Follow a consistent pattern that is machine-friendly.
  • Organize files into directories (when necessary) that follow a consistent pattern.
  • Avoid repetition of semantic elements among file and directory names.
  • Have a file extension that matches the file format (no changing extensions!)

For specific recommendations, we highly recommend that you follow The Tidyverse Style Guide [1].

2.2 The terminal

echo "Hello world"
Hello world

2.3 The filesystem

2.3.1 Directories and subdirectories

filesystem

2.3.2 The home directory

Home directory in Windows

Home directory in MacOS

The structure on Windows looks something like this:

And on MacOS something like this:

2.4 Working directory

The working directory is the directory you are currently in. Later we will see that we can move to other directories using the command line. It’s similar to clicking on folders.

You can see your working directory like this:

pwd
/Users/rafa/Documents/teaching/bst260/2023

In R we can use

getwd()
[1] "/Users/rafa/Documents/teaching/bst260/2023"

2.5 Paths

The string returned by the previous command is the full path to the working directory.

The full path to your home directory is stored in an environment variable, discussed in more detail later:

echo $HOME
/Users/rafa

In Unix, we use the shorthand ~ as a nickname for your home directory.

Example: the full path for docs (in image above) can be written like this ~/docs.
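As a quick check of the ~ shorthand, the following sketch prints the home directory in three ways. It assumes a Bash-like shell. The docs directory is only the example name used above, and it does not need to exist for echo to print the path.

```shell
# Print the home directory stored in the HOME environment variable.
echo "$HOME"

# An unquoted ~ is expanded by the shell to the same path.
echo ~

# ~ also works as the start of a longer path.
echo ~/docs
```

Note that the shell only expands ~ when it is not inside quotes.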

Most terminals will show the path to your working directory right on the command line.

Exercise: Open a terminal window and see if the working directory is listed.

2.6 Unix commands

2.6.1 ls: Listing directory content


ls

2.6.2 mkdir and rmdir: make and remove a directory

mkdir projects

If you do this correctly, nothing will happen: no news is good news. If the directory already exists, you will get an error message and the existing directory will remain untouched.

To confirm that you created these directories, you can list the directories:

ls

You should see the directories we just created listed.

mkdir docs teaching

If you made a mistake and need to remove the directory, you can use the command rmdir to remove it.

mkdir junk
rmdir junk
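The commands above can be tried safely in a throwaway location. The sketch below is one way to do that: it assumes the mktemp command is available to create a temporary directory, and the variable name sandbox is just for illustration.

```shell
# Work inside a temporary directory so nothing in your home directory changes.
sandbox=$(mktemp -d)
cd "$sandbox"

# Create three directories at once, then list them.
mkdir projects docs teaching
ls

# Create a directory by mistake and remove it again.
mkdir junk
rmdir junk
ls
```

Keep in mind that rmdir only removes empty directories. If the directory contains files, rmdir reports an error and leaves it in place.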

2.6.3 cd: navigating the filesystem by changing directories

cd projects

To check that the working directory changed, we can use a command we previously learned to see our location:

pwd

2.7 Autocomplete

In Unix you can auto-complete by hitting tab. This means that we can type cd d and then hit tab. Unix will auto-complete if docs is the only directory or file starting with d, and otherwise it will show you the options. Try it out! Using Unix without auto-complete is unbearable.

2.7.1 cd continued

Going back one:

cd ..

Going home:

cd ~

or simply:

cd

Staying put (we will see later why this is useful):

cd .

Going far:

cd /c/Users/yourusername/projects

Using relative paths:

cd ../..

Going to the previous working directory:

cd -
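To see what each variant of cd does, it helps to run pwd after every step. The following sketch does this in a temporary directory tree. The directory names are borrowed from this chapter, while the sandbox variable and the use of mktemp and mkdir -p (which creates missing parent directories) are assumptions made for the illustration.

```shell
# Build a small throwaway tree so the commands have somewhere to go.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/projects/project-1/figs"

cd "$sandbox/projects/project-1/figs"   # full path
pwd
cd ..                                    # up one level, into project-1
pwd
cd ../..                                 # up two more levels, into the sandbox
pwd
cd -                                     # back to the previous directory, project-1
cd .                                     # stay put
pwd
```

Note that cd - also prints the directory it moves to.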

2.8 Practice

Let’s explore some examples of navigating a filesystem using the command line. Download and expand this file into a temporary directory and you will have the directory structure shown in the following image.

Practice file system

  1. Suppose our working directory is ~/projects. Move to figs in project-1.

cd project-1/figs

  2. Now suppose our working directory is ~/projects. Move to reports in docs in two different ways.

This is a relative path:

cd ../docs/reports

The full path:

cd ~/docs/reports ## assuming ~ is home

  3. Suppose we are in ~/projects/project-1/figs and want to change to ~/projects/project-2. Show two different ways, one with a relative path and one with a full path.

This is with a relative path:

cd ../../project-2

With a full path:

cd ~/projects/project-2 ## assuming home is ~
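One way to convince yourself that a relative path and a full path lead to the same place is to try both and compare the output of pwd. The sketch below does this for the last practice question. It recreates only the directories named in the questions under a temporary directory that stands in for ~; the variable names are hypothetical.

```shell
# root plays the role of the home directory in this sketch.
root=$(mktemp -d)
mkdir -p "$root/projects/project-1/figs" "$root/projects/project-2" "$root/docs/reports"

# Relative path: go up from figs to project-1, up again to projects, then down.
cd "$root/projects/project-1/figs"
cd ../../project-2
relative_result=$(pwd)

# Full path: the starting point does not matter.
cd "$root/projects/project-1/figs"
cd "$root/projects/project-2"
full_result=$(pwd)

# Both approaches should end in the same directory.
[ "$relative_result" = "$full_result" ] && echo "same directory"
```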

2.9 More Unix commands

2.9.1 mv: moving files

mv path-to-file path-to-destination-directory

For example, if we want to move the file cv.tex from resumes to reports, you could use the full paths like this:

mv ~/docs/resumes/cv.tex ~/docs/reports/

You can also use relative paths. So you could do this:

cd ~/docs/resumes
mv cv.tex ../reports/

or this:

cd ~/docs/reports/
mv ../resumes/cv.tex ./

We can also use mv to change the name of a file.

cd ~/docs/resumes
mv cv.tex resume.tex

We can also combine the move and a rename. For example:

cd ~/docs/resumes
mv cv.tex ../reports/resume.tex

And we can move entire directories. To move the resumes directory into reports, we do as follows:

mv ~/docs/resumes ~/docs/reports/

It is important to add the last / to make it clear you do not want to rename the resumes directory to reports, but rather move it into the reports directory.
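The mv examples above can be rehearsed without risking real files. This sketch, which assumes mktemp is available and uses a made-up one-line cv.tex, moves and renames a file in one step and then moves a whole directory.

```shell
# Set up a throwaway copy of the docs layout used in this section.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/docs/resumes" "$sandbox/docs/reports"
echo "draft" > "$sandbox/docs/resumes/cv.tex"

# Move cv.tex into reports and rename it to resume.tex at the same time.
cd "$sandbox/docs/resumes"
mv cv.tex ../reports/resume.tex
ls ../reports

# Move the (now empty) resumes directory into reports.
cd "$sandbox"
mv docs/resumes docs/reports/
ls docs/reports
```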

2.9.2 cp: copying files

The command cp behaves similarly to mv except that instead of moving, we copy the file, meaning that the original file stays untouched.
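As a minimal sketch of the difference from mv, the following commands copy a file and then list both directories. The file content and the sandbox variable are made up for the illustration.

```shell
# Create a throwaway file to copy.
sandbox=$(mktemp -d)
mkdir "$sandbox/resumes" "$sandbox/reports"
echo "draft" > "$sandbox/resumes/cv.tex"

# Copy the file into reports; the original stays in resumes.
cp "$sandbox/resumes/cv.tex" "$sandbox/reports/"

# Both directories now contain cv.tex.
ls "$sandbox/resumes" "$sandbox/reports"
```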

2.9.3 rm: removing files

In point-and-click systems, we remove files by dragging and dropping them into the trash or using a special click on the mouse. In Unix, we use the rm command.

Warning

Unlike throwing files into the trash, rm is permanent. Be careful!

The general way it works is as follows:

rm filename

You can also list several files at once like this:

rm filename-1 filename-2 filename-3

You can use full or relative paths. To remove directories, you will have to learn about arguments, which we do later.
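Because rm is permanent, a temporary directory is a good place to practice. The sketch below assumes mktemp is available and uses the placeholder file names from this section.

```shell
# Create three throwaway files.
sandbox=$(mktemp -d)
cd "$sandbox"
echo a > filename-1
echo b > filename-2
echo c > filename-3

# Remove two of them in one command.
rm filename-1 filename-2

# Only filename-3 is left.
ls
```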

2.9.4 less: looking at a file

Often you want to quickly look at the content of a file. If this file is a text file, the quickest way to do so is with the command less. To look at the file cv.tex, you do this:

cd ~/docs/resumes
less cv.tex 

To exit the viewer, you type q. If the files are long, you can use the arrow keys to move up and down. There are many other keyboard commands you can use within less to, for example, search or jump pages.

2.10 Preparing for a data science project

We are now ready to prepare a directory for a project. We will use the US murders project [2] as an example.

You should start by creating a directory where you will keep all your projects. We recommend a directory called projects in your home directory. To do this you would type:

cd ~
mkdir projects

Our project relates to gun murders, so we will call the directory for our project murders. It will be a subdirectory of our projects directory. In the murders directory, we will create two subdirectories to hold the raw data and intermediate data. We will call these data and rdas, respectively.

Open a terminal and make sure you are in the home directory:

cd ~

Now run the following commands to create the directory structure we want. At the end, we use ls and pwd to confirm we have generated the correct directories in the correct working directory:

cd projects
mkdir murders
cd murders
mkdir data rdas 
ls
pwd

Note that the full path of our murders directory is ~/projects/murders.

So if we open a new terminal and want to navigate into that directory we type:

cd projects/murders
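The same structure can also be created in one step with mkdir -p, which creates any missing parent directories and does not complain if a directory already exists. The sketch below builds it under a temporary directory standing in for the home directory, so it can be run without touching your real projects directory. The variable name home_standin is just for illustration.

```shell
# A temporary directory plays the role of ~ in this sketch.
home_standin=$(mktemp -d)

# Create projects, murders, and both subdirectories in a single command.
mkdir -p "$home_standin/projects/murders/data" "$home_standin/projects/murders/rdas"

# Confirm the result from inside the project directory.
cd "$home_standin/projects/murders"
ls
pwd
```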

2.11 Text editors

In the course we will be using RStudio to edit files. But there will be situations in which this is not the most efficient approach. You might also need to write R code on a server that does not have RStudio installed. For this reason, you need to learn to use a command-line, or terminal-based, text editor. A key feature of these editors is that you can do everything you need in a terminal, without a graphical interface. This is often necessary when using remote servers or computers you are not sitting in front of.

Command-line text editors are essential tools, especially for system administrators, developers, and other users who frequently work in a terminal environment. Here are some of the most popular command-line text editors:

  • Nano - Easy to use and beginner-friendly.

    • Features: Simple interface, easy-to-use command prompts at the bottom of the screen, syntax highlighting.
  • Pico - Originally part of the Pine email client (Pico = PIne COmposer). It’s a simple editor and was widely used before Nano came around.

  • Vi or Vim - Vi is one of the oldest text editors and comes pre-installed on many UNIX systems. It is harder to use than Nano and Pico but is much more powerful. Vim is an enhanced version of Vi.

  • Emacs - Another old and powerful text editor. It’s known for being extremely extensible.

To use these to edit a file you type, for example,

nano filename

2.12 Advanced Unix

2.12.1 Arguments

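Most Unix commands accept arguments, typed after the command name and usually starting with a dash, that change how the command behaves. For example, rm accepts the argument -r, which stands for recursive. If you type:

rm -r directory-name

all files, subdirectories, files in subdirectories, subdirectories in subdirectories, and so on, will be removed. This is equivalent to throwing a folder in the trash, except you can’t recover it. Once you remove it, it is deleted for good. Often, when you are removing directories, you will encounter files that are protected. In such cases, you can use the argument -f, which stands for force.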

You can also combine arguments. For instance, to remove a directory regardless of protected files, you type:

rm -rf directory-name

Warning

Remember that once you remove there is no going back, so use this command very carefully.
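To see the recursive argument at work without putting real files at risk, you can build a nested directory in a temporary location and remove it there. This is only a sketch: the sandbox variable and the file names are made up, and mktemp and mkdir -p are assumed to be available.

```shell
# Build a nested directory with a file at the bottom.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/directory-name/sub/subsub"
echo x > "$sandbox/directory-name/sub/subsub/file.txt"

# rmdir fails because the directory is not empty.
rmdir "$sandbox/directory-name" 2>/dev/null || echo "rmdir refuses to remove a non-empty directory"

# rm -r removes the directory and everything inside it.
rm -r "$sandbox/directory-name"

# The sandbox is now empty apart from the . and .. entries.
ls -a "$sandbox"
```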

A command that is often called with arguments is ls. Here are some examples:

ls -a 
ls -l 

It is often useful to see files sorted by modification time, with the most recently modified first. For that we use:

ls -t 

and to reverse the order of how files are shown you can use:

ls -r 

We can combine all these arguments to show more information for all files, sorted by modification time with the most recently modified files listed last:

ls -lart 
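The effect of each argument is easiest to see on a small directory with known contents. The sketch below is one way to set that up. It assumes the touch command, which creates empty files and, with -t, sets their modification time; the file names and dates are invented for the example.

```shell
# Create two files with different modification times and one hidden file.
sandbox=$(mktemp -d)
cd "$sandbox"
touch -t 202301010000 old.txt
touch -t 202306010000 new.txt
touch .hidden

ls        # hidden files, whose names start with a dot, are not shown
ls -a     # -a shows all files, including .hidden
ls -l     # -l shows a long listing with permissions, size, and date
ls -t     # -t sorts by modification time, newest first
ls -rt    # adding -r reverses the order, so the oldest comes first
```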

Each command has a different set of arguments. In the next section, we learn how to find out what they each do.

2.12.2 Getting help

To see the manual page for a command, type man followed by the name of the command:

man ls

or

ls --help

2.12.3 Pipes

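The pipe | sends the output of one command to another command as its input. For example, to page through the help file for ls, we can type:

man ls | less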

or in Git Bash:

ls --help | less 

This is also useful when listing directories with many files. We can type:

ls -lart | less 
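Pipes are not limited to less. As a small sketch, the output of ls can be piped into wc -l, which counts lines, to find out how many entries a directory has. The wc command is not covered in this chapter, and the file names below are made up.

```shell
# Create a directory with three files.
sandbox=$(mktemp -d)
cd "$sandbox"
touch a.txt b.txt c.txt

# When its output goes into a pipe, ls prints one name per line,
# so counting lines counts the files.
file_count=$(ls | wc -l)
echo "Number of files: $file_count"
```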

2.12.4 Wild cards

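The * wild card matches any combination of characters. For example, to list all html files in a directory, we type:

ls *.html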

To remove all html files in a directory, we would type:

rm *.html

The other useful wild card is the ? symbol, which matches exactly one character:

rm file-???.html

This will only remove files whose names have exactly three characters between file- and .html.

We can combine wild cards. For example, to remove all files with the name file-001 regardless of suffix, we can type:

rm file-001.*

Warning

Combining rm with the * wild card can be dangerous. There are combinations of these commands that will erase your entire filesystem without asking “are you sure?”. Make sure you understand how it works before using this wild card with the rm command.
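A habit that reduces the risk is to run ls with the same pattern first, so you can see exactly which files match before removing anything. The sketch below shows this in a temporary directory; the file names are invented and mktemp and touch are assumed to be available.

```shell
# Create a few throwaway files.
sandbox=$(mktemp -d)
cd "$sandbox"
touch file-001.html file-002.html file-001.txt notes.html

# Preview which files the pattern matches.
ls file-???.html

# Remove them once the preview looks right.
rm file-???.html

# file-001.txt and notes.html do not match the pattern and are still there.
ls
```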

2.12.5 Environment variables

Earlier we saw this:

echo $HOME 

You can see them all by typing:

env

You can change some of these environment variables. But their names vary across different shells. We describe shells in the next section.

2.12.6 Shells

echo $SHELL

The most common one is bash.

Once you know the shell, you can change environment variables. In the Bash shell, we do it using export variable=value, with no spaces around the = sign. To change the path, described in more detail soon, type: (Don’t actually run this command though!)

export PATH=/usr/bin/
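A safer way to try export is with a variable that nothing else depends on. In the sketch below, COURSE_DIR is a made-up name chosen for this illustration; it assumes a Bash-like shell.

```shell
# COURSE_DIR is a hypothetical variable used only for this example.
# Note that there are no spaces around the = sign.
export COURSE_DIR="$HOME/projects"

# The current shell sees the new variable.
echo "$COURSE_DIR"

# So does any program started from this shell, here a child sh process.
sh -c 'echo "$COURSE_DIR"'
```

The variable lasts only until you close the terminal.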

2.12.7 Executables

which git

That directory is probably full of program files. The directory /usr/bin usually holds many program files. If you type:

ls /usr/bin

in your terminal, you will see several executable files.

There are other directories that usually hold program files. The Applications directory on the Mac and the Program Files directory on Windows are examples.

To see where your system looks for executables, type:

echo $PATH

You will see a list of directories separated by :. The directory /usr/bin is probably one of the first ones on the list.
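The list is easier to read with one directory per line. The sketch below uses tr to replace each : with a line break, and command -v, a shell built-in that behaves much like which. Neither is covered in this chapter, so treat this as an optional aside.

```shell
# Print each directory in PATH on its own line, in the order they are searched.
echo "$PATH" | tr ':' '\n'

# Ask the shell which file it would run for the ls command.
command -v ls
```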

If your command is called my-ls, you can type:

./my-ls
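Typing ./ before the name tells the shell to look for the file in the working directory rather than in the directories listed in PATH. As a sketch of what such a command might look like, the following creates a tiny script called my-ls in a temporary directory and runs it. The content of the script is hypothetical, and chmod +x, which marks a file as executable, is not covered in this chapter.

```shell
# Work in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox"

# Write a two-line shell script that runs ls with our favorite arguments.
printf '%s\n' '#!/bin/sh' 'ls -lart "$@"' > my-ls

# Mark the file as executable, then run it from the working directory.
chmod +x my-ls
./my-ls
```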

Once you have mastered the basics of Unix, you should consider learning to write your own executables as they can help alleviate repetitive work.

2.12.8 Permissions and file types

If you type:

ls -l

At the beginning of each line, you will see a series of symbols like this -rw-r--r--. The first character indicates the type of file: - for a regular file or d for a directory. The remaining characters indicate the permissions of the file, with r for readable, w for writable, and x for executable: Is it readable? Writable? Executable? Can other users on the system read the file? Can other users on the system edit the file? Can other users execute the file if it is executable? This is more advanced than what we cover here, but you can learn much more in a Unix reference book.
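To connect the string to something concrete, the sketch below creates one file and one directory in a temporary location and lists them before and after changing a permission. The names are made up, and chmod, which changes permissions, is shown here only as an illustration.

```shell
# Create one regular file and one directory.
sandbox=$(mktemp -d)
cd "$sandbox"
touch notes.txt
mkdir figs

# The line for notes.txt starts with - and the line for figs starts with d.
ls -l

# Give the owner (u) permission to execute (x) the file.
chmod u+x notes.txt

# The permission string for notes.txt now contains an x.
ls -l notes.txt
```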

2.12.9 Commands you should learn

  • curl - download data from the internet.

  • tar - archive files and subdirectories of a directory into one file.

  • ssh - connect to another computer.

  • find - search for files by filename in your system.

  • grep - search for patterns in a file.

  • awk/sed - These are two very powerful commands that permit you to find specific strings in files and change them.

  • ln - create a symbolic link. We do not recommend its use, but you should be familiar with it.
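As a first taste of two of these commands, the sketch below searches a file with grep and bundles a directory with tar. It is a minimal illustration under assumptions: the directory and file are created in a temporary location, the one-line R script is invented, and only the most common tar arguments are shown.

```shell
# Create a small project directory to work with.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/murders/code"
echo 'dat <- read.csv("murders.csv")' > "$sandbox/murders/code/import.R"

# grep prints the lines of a file that contain the pattern.
grep "read.csv" "$sandbox/murders/code/import.R"

# tar -czf bundles the murders directory into one compressed file.
# -C makes tar change into the sandbox first, so the archive holds relative paths.
tar -czf "$sandbox/murders.tar.gz" -C "$sandbox" murders

# tar -tzf lists the contents of the archive without extracting it.
tar -tzf "$sandbox/murders.tar.gz"
```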

2.13 Resources

To get started.

2.14 Exercises

You are not allowed to use RStudio or point and click for any of the exercises below. Open a text file called commands.txt using a text editor and keep a log of the commands you use in the exercises below. If you want to take notes, you can use # to distinguish notes from commands.

  1. Decide on a directory where you will save your class materials. Navigate into the directory using a full path.

  2. Make a directory called project-1 and cd into that directory.

  3. Make directories called data, rdas, code, and docs.

  4. Use curl or wget to download the file https://raw.githubusercontent.com/rafalab/dslabs/master/inst/extdata/murders.csv and store it in rdas.

  5. Create an R file in the code directory called code-1.R. Write the following code in the file so that, if the working directory is code, it reads in the csv file you just downloaded. Use only relative paths.

filename <- ""
dat <- read.csv(filename)

  6. Add the following lines to your R code so that it saves the file to the rdas directory. Use only relative paths.

out <- ""
save(dat, file = out)

  7. Create a file code-2.R in the code directory. Use the following command to add a line to the file.

echo "load('../rdas/murders.rda')" > code/code-2.R

Check to see if the line of code was added without opening a text editor.

  8. Navigate to the code directory and list all the files ending in .R.

  9. Navigate to the project-1 directory. Without navigating away, change the name of code-1.R to import.R, but keep the file in the same directory.

  10. Change the name of the project directory to murders. Describe what you have to change so the R script still does the right thing and how this would be different if you had used full paths.

  11. Bonus: Navigate to the murders directory. Read the man page for the find command. Use find to list all the files ending in .R.


  1. https://style.tidyverse.org/

  2. https://github.com/rairizarry/murders