Q1. Git/GitHub

No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.

  1. Apply for the Student Developer Pack at GitHub using your snu.ac.kr email.

  2. Links to join the 326.621A GitHub Classroom and to create an individual GitHub repository for homework are provided on eTL. First join the classroom, and then create your own homework repo, accepting the two invitations in that order.

  3. For each homework, the teaching assistant will make a pull request. Merge each pull request into your homework repo.

  4. Maintain two branches, master and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up your report. The master branch will be your presentation area. Submit your homework files (the R Markdown file (Rmd), the html file converted from it, and all code and data sets needed to reproduce the results) in the master branch.

  5. Before each homework’s due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. Commit time will be used as your submission time. That means if you commit your Homework 1 submission after the deadline, penalty points will be deducted for late submission according to the syllabus.
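
The two-branch workflow in items 4 and 5 can be sketched as follows. This is a minimal illustration in a throwaway local repository, not the required procedure: the file name hw1.Rmd, the commit messages, and the placeholder identity are all assumptions, and in your actual homework repo you would also push master to GitHub.

```shell
set -e
# Throwaway repository so the sketch does not touch any real work.
cd "$(mktemp -d)"
git init -q .
# Name the initial branch master regardless of the local git default.
git symbolic-ref HEAD refs/heads/master
# Placeholder identity, stored only in this throwaway repo.
git config user.name "Student"
git config user.email "student@example.com"

# An initial commit on master.
echo "Homework repo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Do the actual work on develop, committing often with clear messages.
git checkout -q -b develop
echo "Q1 draft" > hw1.Rmd          # hypothetical report file
git add hw1.Rmd
git commit -q -m "Draft answer to Q1"

# When ready to submit, bring the work over to master.
git checkout -q master
git merge -q develop
# In the real repo, follow with: git push origin master
git log --oneline master
```

Keeping drafts on develop means master only ever contains states you are willing to have graded.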

Q2. Getting used to Unix shell

The /home/stat326_621a/data/molecules directory on both teaching servers contains the following files:

cubane.pdb  ethane.pdb  methane.pdb  octane.pdb  pentane.pdb  propane.pdb

You can check the above output by typing ls /home/stat326_621a/data/molecules at the shell prompt. Do not copy these data files into your home directory or your GitHub repository when you submit this homework. Just read them directly from the directory /home/stat326_621a/data/molecules.

  1. Write down a command such that, after running this command, typing wc *.pdb at the prompt would produce the following output:

      20  156  1158  cubane.pdb
      12  84   622   ethane.pdb
       9  57   422   methane.pdb
      30  246  1828  octane.pdb
      21  165  1226  pentane.pdb
      15  111  825   propane.pdb
     107  819  6081  total

  2. Studying man wc if necessary, write down a command whose output shows only the number of lines per file.

  3. Symbol > tells the shell to redirect the command’s output to a file instead of printing it to the screen. Using this knowledge, write down a command that creates a file (lengths.txt) whose content is the output of the command of the previous question.

  4. What’s the difference between the following two commands?

    echo 'hello, world' > test1.txt

    and

    echo 'hello, world' >> test2.txt

  5. Write down a command that takes lengths.txt as the input and prints out a list of files sorted in ascending order of the number of lines.

  6. In the current directory, we want to find the three files that have the fewest lines. Using pipes, write down a command that does the desired task.

  7. What is the output of the following code?

    for datafile in *.pdb
    do
        ls *.pdb
    done

    Now, what is the output of the following code?

    for datafile in *.pdb
    do
        ls $datafile
    done

    Why do these two loops give different outputs?
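
The redirection symbol > described in item 3 can be tried on a toy command first. The sketch below uses a throwaway directory and a hypothetical file name demo.txt, so nothing in the course data directory is touched; it only illustrates the mechanism and is not an answer to any item above.

```shell
set -e
# Work in a throwaway directory.
cd "$(mktemp -d)"

# Without redirection, the command's output is printed to the screen.
printf 'alpha\nbeta\n'

# With >, the same output is written to demo.txt and nothing is printed.
printf 'alpha\nbeta\n' > demo.txt

# Inspect the file to confirm it holds the command's output.
cat demo.txt
```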

Q4. Doing something more serious with the shell

  1. To learn English, you and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Luckily, you have a file pride_and_prejudice.txt containing the full text of the novel in /home/stat326_621a/data/novels. Using a for loop, how would you tabulate the number of times each of the four characters is mentioned?

  2. Write down a line of Unix commands that finds all files in /home/stat326_621a/data/ (including subdirectories) whose names do not end in [vowel].txt (e.g., little_women.txt).

  3. Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:

    #!/bin/sh
    # Select lines from the middle of a file.
    # Usage: bash middle.sh filename end_line num_lines
    head -n "$2" "$1" | tail -n "$3"

    Using chmod, make the file executable by the owner, and run

    ./middle.sh /home/stat326_621a/data/molecules/pentane.pdb 20 5

    Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?

Q3. R Batch Run

In class we discussed using R to organize simulation studies.

  1. Expand the runSim.R script to include arguments seed (random seed), n (sample size), dist (distribution) and rep (number of simulation replicates). When dist="gaussian", generate data from the standard normal distribution; when dist="t1", generate data from the t-distribution with 1 degree of freedom (same as the Cauchy distribution); when dist="t5", generate data from the t-distribution with 5 degrees of freedom. Calling runSim.R will (1) set the random seed according to argument seed, (2) generate data according to argument dist, (3) compute the prime-indexed average estimator from class and the classical sample average estimator for each simulation replicate, and (4) report the average mean squared error (MSE) \[ \frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}} \] for both methods.

  2. Modify the autoSim.R script to run simulations with all combinations of sample sizes nVals = seq(100, 500, by=100) and distributions distTypes = c("gaussian", "t1", "t5"), and write the output to appropriately named files. Use rep = 50 and seed = 280.

  3. Write an R script to collect simulation results from output files and print average MSEs in a table of format

    \(n\)   Method     \(t_1\)   \(t_5\)   Gaussian
    100     PrimeAvg
            SampAvg
    200     PrimeAvg
            SampAvg
    300     PrimeAvg
            SampAvg
    400     PrimeAvg
            SampAvg
    500     PrimeAvg
            SampAvg
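
As a quick sanity check on the MSE formula in item 1, here is a worked instance with made-up numbers. It assumes three hypothetical replicate estimates \(\widehat \mu_1 = 0.1\), \(\widehat \mu_2 = -0.2\), \(\widehat \mu_3 = 0.3\) and \(\mu_{\text{true}} = 0\), so rep = 3; these values are illustrative only and are not simulation results.

\[ \frac{(0.1 - 0)^2 + (-0.2 - 0)^2 + (0.3 - 0)^2}{3} = \frac{0.01 + 0.04 + 0.09}{3} = \frac{0.14}{3} \approx 0.047 \]

Each cell of the table holds one such number, computed for one estimator at one sample size and one distribution.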