No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your `snu.ac.kr` email.
A link to join the 326.621A GitHub Classroom and a link to create an individual GitHub repository for homework are provided on eTL. First join the classroom, and then create your own homework repo, accepting the two invitations in turn.
For each homework, the teaching assistant will make a pull request. Merge each pull request into your homework repo.
Maintain two branches, `master` and `develop`. The `develop` branch will be your main playground, the place where you develop solutions (code) to homework problems and write up your report. The `master` branch will be your presentation area. Submit your homework files (the R Markdown file `Rmd`, the `html` file converted from R Markdown, and all code and data sets needed to reproduce results) in the `master` branch.
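The two-branch routine described above might look like the following in practice. This is a minimal sketch run in a throwaway repository: the file name `hw1.Rmd`, the commit messages, and the committer identity are placeholders, and in the real homework repo you would also `git push` both branches to GitHub.

```shell
#!/bin/sh
# Sketch of the develop/master routine in a scratch repository.
# hw1.Rmd, the commit messages, and the identity below are placeholders.
set -e

demo_repo=$(mktemp -d)
cd "$demo_repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Student Name"
git config user.email "student@example.com"

# An initial commit so that master exists before branching.
echo "# Homework" > README.md
git add README.md
git commit -q -m "Initial commit"

# Do the day-to-day work on develop.
git checkout -q -b develop
echo "draft solution" > hw1.Rmd
git add hw1.Rmd
git commit -q -m "HW1: draft solution to Q1"

# When the homework is ready to present, bring it into master.
git checkout -q master
git merge -q --no-edit develop

# Inspect what master now contains and when its last commit was made.
git log --oneline master
git log -1 --format=%cI master
```

Committing in small steps on `develop` and merging into `master` only when the work is presentable keeps the graded branch clean while preserving a detailed history.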
Before each homework’s due date, commit your master branch. The teaching assistant and the instructor will check out your committed master branch for grading. Commit time will be used as your submission time. That means if you commit your Homework 1 submission after the deadline, penalty points will be deducted for late submission according to the syllabus.
The `/home/stat326_621a/data/molecules` directory on both teaching servers contains the following files:
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
You can check the above output by typing `ls /home/stat326_621a/data/molecules` at the shell prompt. Do not copy these data files into your home directory or your GitHub repository when you submit this homework. Just read from the directory `/home/stat326_621a/data/molecules` directly.
Write down a command such that, after running this command, typing `wc *.pdb` at the prompt would produce the following output:
20 156 1158 cubane.pdb
12 84 622 ethane.pdb
9 57 422 methane.pdb
30 246 1828 octane.pdb
21 165 1226 pentane.pdb
15 111 825 propane.pdb
107 819 6081 total
Studying `man wc` if necessary, write down a command whose output shows only the number of lines per file.
The symbol `>` tells the shell to redirect a command's output to a file instead of printing it to the screen. Using this knowledge, write down a command that creates a file (`lengths.txt`) whose content is the output of the command from the previous question.
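As a quick illustration of `>` using an unrelated command, the sketch below sends the output of `printf` to a file and then reads it back. The scratch directory and the file name `fruits.txt` are made up for this example.

```shell
#!/bin/sh
# Illustration of output redirection with `>`; fruits.txt is a made-up name.
scratch=$(mktemp -d)

# Without redirection, the two lines appear on the screen.
printf 'apple\nbanana\n'

# With `>`, nothing is printed; the same two lines land in the file instead.
printf 'apple\nbanana\n' > "$scratch/fruits.txt"

# Read the file back to confirm what was captured.
cat "$scratch/fruits.txt"
```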
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Write down a command that takes `lengths.txt` as the input and prints out a list of files sorted in ascending order of the number of lines.
In the current directory, we want to find the 3 files that have the fewest lines. Using pipes, write down a command that does the desired task.
What is the output of the following code?
for datafile in *.pdb
do
ls *.pdb
done
Now, what is the output of the following code?
for datafile in *.pdb
do
ls $datafile
done
Why do these two loops give different outputs?
To learn English, you and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Luckily, you have a file `pride_and_prejudice.txt` containing the full text of the novel in `/home/stat326_621a/data/novels`. Using a `for` loop, how would you tabulate the number of times each of the four characters is mentioned?
Write down a line of Unix commands that finds all files in `/home/stat326_621a/data/` (including subdirectories) whose names do not end in `[vowel].txt` (e.g., `little_women.txt`).
Using your favorite text editor (e.g., `vi`), type the following and save the file as `middle.sh`:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Using `chmod`, make the file executable by the owner, and run
./middle.sh /home/stat326_621a/data/molecules/pentane.pdb 20 5
Explain the output. Explain the meaning of `"$1"`, `"$2"`, and `"$3"` in this shell script. Why do we need the first line of the shell script?
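To see how `head` and `tail` combine, the sketch below applies the same pipeline to a generated file of numbered lines rather than to the course data. The function name `middle` and the file `numbers.txt` are hypothetical stand-ins; inside a shell function, `$1`, `$2`, and `$3` refer to the function's arguments in the same way they refer to a script's arguments.

```shell
#!/bin/sh
# Same pipeline as middle.sh, wrapped in a function for experimentation.
# `middle` and numbers.txt are illustrative names, not part of the assignment.
middle() {
    # $1 = file name, $2 = last line to keep, $3 = how many lines to show
    head -n "$2" "$1" | tail -n "$3"
}

scratch=$(mktemp -d)
seq 1 30 > "$scratch/numbers.txt"   # a 30-line file: 1, 2, ..., 30

# Keep the first 20 lines, then the last 5 of those: lines 16 through 20.
middle "$scratch/numbers.txt" 20 5
```

Trying other values for the second and third arguments on a file whose contents are its own line numbers makes it easy to check which lines each stage of the pipeline keeps.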
In class we discussed using R to organize simulation studies.
Expand the `runSim.R` script to include arguments `seed` (random seed), `n` (sample size), `dist` (distribution), and `rep` (number of simulation replicates). When `dist="gaussian"`, generate data from the standard normal distribution; when `dist="t1"`, generate data from the t-distribution with 1 degree of freedom (same as the Cauchy distribution); when `dist="t5"`, generate data from the t-distribution with 5 degrees of freedom. Calling `runSim.R` will (1) set the random seed according to argument `seed`, (2) generate data according to argument `dist`, (3) compute the prime-indexed average estimator from class and the classical sample average estimator for each simulation replicate, and (4) report the average mean squared error (MSE)
\[
\frac{\sum_{r=1}^{\text{rep}} (\widehat \mu_r - \mu_{\text{true}})^2}{\text{rep}}
\]
for both methods.
Modify the `autoSim.R` script to run simulations with combinations of sample sizes `nVals = seq(100, 500, by=100)` and distributions `distTypes = c("gaussian", "t1", "t5")`, and write output to appropriately named files. Use `rep = 50` and `seed = 280`.
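The grid of runs requested here can be previewed from the shell before writing any R. The sketch below only prints the commands it would run. The `name=value` argument style passed to `Rscript` and the output file pattern `n<size>dist<name>.txt` are assumptions made for illustration; use whatever convention your `runSim.R` and `autoSim.R` actually follow.

```shell
#!/bin/sh
# Dry run: list one Rscript call per (n, dist) combination.
# The argument syntax and output file names are assumptions, not course requirements.
print_sim_commands() {
    seed=280
    rep=50
    for n in 100 200 300 400 500; do
        for dist in gaussian t1 t5; do
            out="n${n}dist${dist}.txt"
            echo "Rscript runSim.R seed=${seed} n=${n} dist=${dist} rep=${rep} > ${out}"
        done
    done
}

print_sim_commands
```

Five sample sizes times three distributions gives 15 runs, so the listing should contain 15 lines, one per output file.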
Write an R script to collect simulation results from output files and print average MSEs in a table of format
| \(n\) | Method | \(t_1\) | \(t_5\) | Gaussian |
|---|---|---|---|---|
| 100 | PrimeAvg | | | |
| | SampAvg | | | |
| 200 | PrimeAvg | | | |
| | SampAvg | | | |
| 300 | PrimeAvg | | | |
| | SampAvg | | | |
| 400 | PrimeAvg | | | |
| | SampAvg | | | |
| 500 | PrimeAvg | | | |
| | SampAvg | | | |