Write programs that do one thing and do it well.
Write programs to work together.
Write programs to handle text streams, because that is a universal interface.
–The UNIX philosophy, Doug McIlroy
In this exercise we will write a few functions and execute R scripts from the command line and provide them with options and data.
Let’s jump right in and create a function. Your function should have it’s own function_name, atleast two arguments one of which with a default value, code that performs some operations on the input and a return value. I do encourage you to think about it and make your own function, rather than copying the examples. Examples should be seen more as explanatory and inspirational.
Example:
function_name <- function(arg1, arg2 = "Lucy"){
if (arg2 == "Lucy") {
output <- paste("Lucy in the sky with ",arg1,"\n",sep="")}
else {
output <- paste(arg1," in the ocean with ",arg2,"\n",sep="")}
return(cat(output))
}
Test your function to make sure it gives the output you would expect given different arguments.
return()
function in my example?cat()
function in the return()
function?variable_a <- function_name("Tom")
What happens to the variable if the return value is the output of cat()
?Now lets see the power of your function in action. Write a loop that executes your function multiple times. Functions are great for performing operations on a number of files, objects or elements in objects that fit a certain condition.
Example
for (i in c("Tom","Mary","Mahesh","Henrik")){
function_name(arg1=i,arg2="Brad")
function_name(arg1=i,arg2="Lucy")
}
## Tom in the ocean with Brad
## Lucy in the sky with Tom
## Mary in the ocean with Brad
## Lucy in the sky with Mary
## Mahesh in the ocean with Brad
## Lucy in the sky with Mahesh
## Henrik in the ocean with Brad
## Lucy in the sky with Henrik
Now let’s take a look at the special ellipsis argument for functions. Basically the ellipsis say that a function can take additional arguments. The function should work without arguments given at the ellipsis. One great example of this is the data.frame()
function.
From ?data.frame
Usage
data.frame(..., row.names = NULL, check.rows = FALSE,
check.names = TRUE, fix.empty.names = TRUE,
stringsAsFactors = default.stringsAsFactors())
The ellipsis can be used to add columns to the data.frame. Try creating a data.frame with one column and one with several. Also, try creating a data.frame with no arguments for ...
, what happens?
data.frame(a=c(1,2,3))
data.frame(a=c(1,2,3),b=c(4,5,6))
Another example where ellipsis are often used is to pass plot arguments into a function.
Example:
my_plot <- function(...){
plot(c(1,2,3),c(1,2,3),...)
}
my_plot(main="Great title", xlab = "This will be x-label of plot", ylab="Y label")
Try creating your own function that uses ellipsis arguments.
While many R users write and execute code interactively (in e.g. RStudio), you can run the content of a script by writing source(myscriptfile.R)
in the R console. This is also a convenient way to load your custom functions (and this is actually what happens when you load an installed package with library()
).
But once you have code that works you may want to run it routinely without an interactive R environment. R scripts can be executed directly from the command line by typing Rscript myscriptfile.R
.
.R
file and execute it.# Example of a small script
mydata=rnorm(1000)
print(summary(mydata))
You can also execute your script by typing its name in the console, provided it:
#!/usr/bin/env Rscript
chmod +x myscriptfile.R
on Unix systems. If you are on windows you are on your own =).Task:
Rscript
../
)?It’s unlikely that you would need to run the exact same process over and over again without any change in what data is processed or how it’s processed. One way to control the behaviour of your code is to provide arguments to it. These commonly refer to file names or settings. You can supply arguments after the name of your script where you invoke it. In R, they are available from commandArgs()
.
You can use commandArgs(trailingOnly = TRUE)
to suppress the first few items and access your actual arguments.
#!/usr/bin/env Rscript
firstarg=as.numeric(commandArgs(trailingOnly = TRUE)[1])
mydata=rnorm(1000,mean = firstarg)
print(summary(mydata))
Processing multiple arguments may become complicated, especially if you want to be able to use C-like long and short flags such as -o outputfile -i inputfile --distribution normal
. Packages that support such options include getopt
, optparse
and argparser
.
optparse
package to modify your script to accept the argument -m
or --mean
(followed by the value) for mean value.#!/usr/bin/env Rscript
# don't say "Loading required package: optparse" every time
suppressPackageStartupMessages(require(optparse))
option_list = list(
make_option(c("-m", "--mean"), default=0)
# you could put the next option here
)
options = parse_args(OptionParser(option_list=option_list))
my_mean=as.numeric(options$mean)
mydata=rnorm(1000,mean = my_mean)
print(summary(mydata))
A convenient feature of command line scripts is the possibility to pipe data from one script to another, thereby avoiding the need for intermediate files. You can use file('stdin')
and open()
to define and open the connection in R and readLines()
to read one or more lines from it.
echo 100 | ./myscriptfile.R
.cat
.stdin()
?#!/usr/bin/env Rscript
input_con <- file("stdin")
open(input_con)
oneline <- readLines(con = input_con, n = 1, warn = FALSE)
close(input_con)
mean=as.numeric(oneline)
mydata=rnorm(1000,mean = mean)
print(summary(mydata))
You can pipe your output to another process (any script or tool that accepts a stream) by appending | next_tool_or_script_call
to the call, or to a file by appending > filename
.
warning('Something is wrong')
and you pipe the output to a file?You can use write(x,file=stderr())
or write(x,file=stdout())
to explicitly divert certain output.
If you have time, practice by writing an R script that you need in your own work, or select either of the following:
Write a script that parses any text stream by line from either a file (when specified) or stdin, and writes to either another file (when specified) or stdout. Each line is written with a certain probability which is also provided as an argument (lines that start with # are always written). Report to stderr the number of lines read and written. Try your script on any fastq, bam/sam, or vcf-formatted data.
Write a script that summarizes the content of a table contained in a plain text file. The table is supplied either as a file (when specified) or as a stream. Make sure any lines starting with # are ignored. For speed, do not use more than a maximum of 1000 lines as default, or another number if supplied as an argument. Try your script on any tabular data you hav available.