Introduction to (R) Programming

Marcin Kierczak

Sun, 11 Nov 2018

R – what is it?

Features of R

a programming language,
a programming platform (= environment + interpreter),
a software project driven by the core team and the community.
a very powerful tool for statistical computing,
a very powerful computational tool in general,
a catalyst between an idea and its realization.

What is R not?

a tool to replace a statistician,
the very best programming language,
the most elegant programming solution,
the most efficient programming language.

A brief history of R

conceived ca. 1992 by Robert Gentleman and Ross Ihaka (R&R) at the University of Auckland, NZ, as a tool for teaching statistics,
1994 – initial version,
2000 – stable version.

A brief history of R ctd.

open-source solution –> fast development,
based on the S language created at Bell Labs by John Chambers to turn ideas into software, quickly and faithfully,
also inspired by Scheme, a Lisp dialect (lexical scoping),
developed since 1997 by the R Development Core Team (~20 (6) experts, with Chambers on board),
overseen by The R Foundation for Statistical Computing,

The system of R packages – an overview

developed by the community,
cover several very diverse areas of science/life,
uniformly structured and documented,
organised in repositories:
- CRAN - The Comprehensive R Archive Network,
- R-Forge,
- Bioconductor,
- GitHub.
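
A minimal sketch of working with packages, assuming a standard R installation. The package name `ggplot2` is only an illustration, and the install line is commented out because it needs a network connection.

```r
# Install a package from CRAN (needs network access, hence commented out);
# "ggplot2" is just an example name.
# install.packages("ggplot2")

# Packages that ship with R can be loaded right away.
library(stats)

# List what is installed and check for a given package.
pkgs <- rownames(installed.packages())
hasStats <- "stats" %in% pkgs

# Every package exposes its documentation in the same way, e.g.:
# help(package = "stats")
```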

R packages in the main repos

Advantages of using R

a very powerful ecosystem of packages,
uniform, clear and clean system of documentation and help,
good interconnectivity with compiled languages like Java or C,
free and open source, released under the GNU GPL (version 2 or 3),
easy to generate high quality graphics.

Disadvantages of R

steep learning curve,
sometimes slow,
difficulties due to limited object-oriented programming capabilities, e.g. an agent-based simulation is a challenge,
cannot order a pizza for you?

What is a programming language?

Definition of a programming language

A programming language is a formal computer language or constructed language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs to control the behavior of a machine or to express algorithms. [source: Wikipedia]

What defines a programming language?

Think of a program as a flow of data from one function to another that does something to the data.

There are three main things that define a programming language:

type system – what types of data can I process,
syntax – the form defined by a language grammar,
semantics – the meaning of statements.

Syntax

Syntax as form

Syntax is the form, typically defined by a Chomsky type-2 (context-free) grammar. The same computation can be written in different forms:
- 2 * 1 + 1
- (+ (* 2 1) 1)
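
Both forms can be tried in R itself. This is a small sketch, not part of the original slides: it relies on the fact that R operators are ordinary functions that can be called in prefix form when quoted with backticks.

```r
# Infix form, the way it is usually written
infix <- 2 * 1 + 1

# Prefix form, close to the Lisp notation: (+ (* 2 1) 1)
prefix <- `+`(`*`(2, 1), 1)

# Different syntax, same value
infix == prefix

# The parsed infix expression has the operator first, just like the prefix form
as.list(quote(2 * 1 + 1))[[1]]
```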

The programming language Lisp is defined by the following grammar (BNF or Backus-Naur Form):

expression ::= atom | list
atom       ::= number | symbol
number     ::= [+-]?['0'-'9']+
symbol     ::= ['A'-'Z''a'-'z'].*
list       ::= '(' expression* ')'

Semantics

Semantics – give it a meaning

Semantics is the meaning. A grammatically correct sentence does not necessarily have a proper meaning:

“Colorful yellow train sleeps on a crazy wave.” – has no generally accepted meaning.
“There is $500 on his empty bank account.” – cannot evaluate to true.
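
The same distinction can be sketched in R. The expression below is an illustrative choice: it is accepted by the parser (correct syntax) but fails when evaluated (no valid meaning).

```r
# Syntactically valid: the parser accepts the expression
expr <- parse(text = '"train" + 1')[[1]]

# Semantically invalid: adding a number to a piece of text has no meaning in R,
# so evaluation stops with an error, which is caught here
result <- tryCatch(
  eval(expr),
  error = function(e) paste("Error:", conditionMessage(e))
)
result
```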

Semantics – extra details

Static semantics – in compiled languages, e.g. checking that every identifier is declared before the first use or that the conditionals have distinct predicates.
Dynamic semantics – how the chunks of code are executed. For instance lazy vs. eager evaluation.
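
Lazy vs. eager evaluation can be observed directly in R, where function arguments are evaluated only when they are first used. A minimal sketch; the function names `firstArg` and `eagerFirstArg` are made up for this illustration.

```r
# Lazy: y is never used inside the function, so it is never evaluated
firstArg <- function(x, y) {
  x
}
lazyResult <- firstArg(1, stop("this is never evaluated"))

# Eager (simulated): force() evaluates y immediately, so the error surfaces
eagerFirstArg <- function(x, y) {
  force(y)
  x
}
eagerResult <- tryCatch(
  eagerFirstArg(1, stop("evaluated after all")),
  error = function(e) "error"
)
```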

The type system

How to represent things?

Untyped:
- Assembler – everything is a byte.
Typed:
- 1 - integer
- 1.0 - float
- “1.0” - string
Single type:
- HTML – everything is a character.
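
In R, the type of a value can be inspected with `typeof()`. A short sketch of the three typed examples above (note that R calls a float "double" and a string "character"):

```r
typeof(1L)     # "integer"   (the L suffix marks an integer literal)
typeof(1)      # "double"    (plain numbers are floating point by default)
typeof("1.0")  # "character" (text, even if it looks like a number)
```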

The type system – extra details

Static vs. dynamic typing.
- Static - type determined before execution, either declared by the programmer (manifestly typed) or inferred by the compiler (type-inferred):
```
integer i   # Declaration  
i = 1       # Initialization
```
- Dynamic - type determined when executing.
```
i = 1
```
Weak vs. strong typing: what is 1 + '1'?
- Weak - 2 or “11”
- Strong - ERROR.
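
A sketch of where R sits on both axes: types are determined at run time, and `1 + '1'` is an error, although some implicit conversions (e.g. logical to number) are allowed. Variable names are illustrative only.

```r
# Dynamic typing: no declaration, and the same name can change type
i <- 1
typeBefore <- typeof(i)   # "double"
i <- "one"
typeAfter <- typeof(i)    # "character"

# 1 + "1" is refused with an error rather than silently converted
mixed <- tryCatch(1 + "1", error = function(e) "ERROR")

# Some implicit coercion does exist: TRUE is treated as 1
1 + TRUE                  # 2

# Explicit conversion is always possible
1 + as.numeric("1")       # 2
```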

Types – ERROR checking!

Programming paradigms

Major types of programming languages

imperative – a set of step-by-step instructions (R),
declarative – a clearly defined goal.

About programming paradigms

There are many programming paradigms, e.g.:

imperative:
- literate (R, knitr, Sweave, R Markdown),
- procedural (R - functions),
- …
declarative:
- functional (R, $\lambda$-abstraction),
- …
agent-oriented,
structured:
- object-oriented (R, S3 and S4 classes),
- …
…

Interpreted vs. compiled languages

Machine code

Computers understand machine code, not programming languages!
Machine code is what the processor (CPU) understands.
Code written in any programming language has to be turned into machine code in some way.

Interpreters and compilers

Two major approaches exist to turn code in a particular language into machine code:

Interpretation – on-the-fly translation of your code, theoretically line-by-line. This is done every time you run your program and the job is done by software called an interpreter.
Compilation – your program is translated and saved as machine code and as such can be directly executed on the machine. The job is performed by a compiler.

A more formal description of R

Interpreted – the code is translated by the interpreter every time it is run.
Dynamically typed – you do not declare types.
Multi-paradigm:
- array – works on multi-dimensional data structures, like vectors or matrices,
- functional – treats computation as evaluation of math functions,
- imperative – the programmer specifies how to solve the problem,
- object-oriented – allows working with objects: data + things you can do to the data,
- procedural – structure is organised in procedures and procedure calls, e.g. functions,
- reflective – the code can modify itself at runtime.
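
A compact sketch of several of these paradigms side by side. All names (`describe`, `rex`, `addOne`) are invented for this illustration and the S3 class is a toy example.

```r
x <- c(1, 4, 9)

# array: one call operates on the whole vector
roots <- sqrt(x)

# functional: a function is passed as an argument to another function
squares <- sapply(x, function(v) v^2)

# imperative / procedural: explicit step-by-step instructions
total <- 0
for (v in x) {
  total <- total + v
}

# object-oriented (S3): the method is chosen based on the class of the object
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "something"
describe.pet <- function(obj) paste("a pet called", obj$name)
rex <- structure(list(name = "Rex"), class = "pet")
description <- describe(rex)

# reflective: code is data and can be changed while the program runs
addOne <- function(a) a + 1
body(addOne) <- quote(a + 2)
addOne(1)   # 3, because the body was replaced
```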

So how to program?

Divide et impera – Divide and rule.

Top-down approach: define the big problem and split it into smaller ones. Assume you have solutions to the small problems and continue – push the responsibility down. Wishful thinking!

You've got a csv file that contains data about people: 
year of birth, favorite music genre, salary and the 
name of a pet, if the person has one. Your task is 
to read the data and, for people born in particular 
decades (..., 1950s, 1960s, ...), compute the mean 
and the variance of salary and find the most 
frequent pet name.

Problem decomposition 1

This task can be decomposed into:

read data from csv file,
split the data into age classes based on the decade of birth,
compute the mean and the variance of salary per class,
find the most frequent pet name per class.
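
The four steps can be sketched in R as follows. This is only one possible approach: the column names (`birth_year`, `genre`, `pet`, `salary`), the data values and the helper `mostFrequent` are all made up, and the data are read from inline text instead of a real file. Note that R's built-in `var()` divides by n - 1 (sample variance).

```r
# 1. Read the data (in practice: people <- read.csv("people.csv"))
csvText <- "birth_year,genre,pet,salary
1952,jazz,Rex,3000
1958,rock,Rex,4000
1955,folk,Luna,5000
1963,rock,,3500
1967,pop,Luna,4500
1961,jazz,Luna,2500"
people <- read.csv(text = csvText, stringsAsFactors = FALSE, na.strings = "")

# 2. Split into classes based on the decade of birth
people$decade <- floor(people$birth_year / 10) * 10

# 3. Mean and variance of salary per class
meanSalary <- tapply(people$salary, people$decade, mean)
varSalary  <- tapply(people$salary, people$decade, var)

# 4. Most frequent pet name per class (people without a pet are skipped)
mostFrequent <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}
topPet <- tapply(people$pet, people$decade, mostFrequent)
```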

Problem decomposition 2

To compute the mean you have to: sum all values, divide the sum by the number of values – simple enough, we can program it right away.

To compute the variance you first need to recall the formula: \[Var(X) = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2\]

Thus, you realise that you need to compute the mean, but you know how to do this from the previous point. So, instead of coding computation of the mean twice, make a function that you can reuse! Laziness is the major driving force of a programmer!
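
A quick worked example of the formula, using the made-up values \(x = (2, 4, 6)\), so \(n = 3\). The mean is needed first:

\[\bar{x} = \frac{2 + 4 + 6}{3} = 4\]

and is then reused inside the variance:

\[Var(X) = \frac{1}{3} \left( (2 - 4)^2 + (4 - 4)^2 + (6 - 4)^2 \right) = \frac{8}{3} \approx 2.67\]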

Let’s put it down!

Pseudocode 1

\[Var(X) = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2\]

Task: create the computeMean procedure that computes the mean for a sequence of numbers

Input: a sequence of numbers, e.g.: {1, 4, 5.7, 42357.533, 42}. Wait, isn’t it a vector?

Output: the computed mean, a single number, that is what we want our procedure to return.

function computeMean(aVector) {  
    sum = sum all numbers in aVector  
    count = count how many numbers are in aVector
    theMean is: sum / count  
    return theMean  
}
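
One possible translation of the pseudocode into R, given as a sketch. `computeVariance` is a name added here to show the reuse of `computeMean`; it follows the 1/n formula above, whereas R's built-in `var()` divides by n - 1.

```r
# Mean: sum all values, then divide by how many there are
computeMean <- function(aVector) {
  total <- sum(aVector)
  count <- length(aVector)
  theMean <- total / count
  return(theMean)
}

# Variance: the mean of squared deviations from the mean,
# so computeMean is simply reused twice
computeVariance <- function(aVector) {
  theMean <- computeMean(aVector)
  return(computeMean((aVector - theMean)^2))
}

computeMean(c(1, 4, 5.7, 42357.533, 42))
computeVariance(c(2, 4, 6))
```

In everyday work the built-in `mean()` and `var()` would be used instead; writing them by hand is only an exercise in decomposition.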

Summary

So far, we have learnt about:

what R is and what it is not,
history of R,
the system of packages,
advantages and disadvantages of the language,
definition of a programming language,
elements of a programming language (types, syntax and semantics),
programming paradigms,
wishful thinking,
problem decomposition,
pseudocode.

Quite a bit, right?