Introduction to (R) Programming

Marcin Kierczak

Sun, 11 Nov 2018

R – what is it?

Features of R

  • a programming language,
  • a programming platform (= environment + interpreter),
  • a software project driven by the core team and the community,
  • a very powerful tool for statistical computing,
  • a very powerful computational tool in general,
  • a catalyst between an idea and its realization.

What R is not

  • a tool to replace a statistician,
  • the very best programming language,
  • the most elegant programming solution,
  • the most efficient programming language.

A brief history of R

  • conceived ca. 1992 by Robert Gentleman and Ross Ihaka (R&R) at the University of Auckland, NZ – a tool for teaching statistics,

  • 1994 – initial version,
  • 2000 – stable version.

A brief history of R (continued)

  • open-source solution –> fast development,
  • based on the S language created at Bell Labs by John Chambers to turn ideas into software, quickly and faithfully,
  • inspired also by Scheme, a Lisp dialect (lexical scoping),
  • since 1997 developed by the R Development Core Team (~20 (6) experts, with Chambers onboard),
  • overseen by The R Foundation for Statistical Computing,

The system of R packages – an overview

  • developed by the community,
  • cover several very diverse areas of science/life,
  • uniformly structured and documented,
  • organised in repositories:

R packages in the main repos

Advantages of using R

  • a very powerful ecosystem of packages,
  • uniform, clear and clean system of documentation and help,
  • good interconnectivity with compiled languages like Java or C,
  • free and open source, GNU GPL and GNU GPL 2.0,
  • easy to generate high quality graphics.

Disadvantages of R

  • steep learning curve,
  • sometimes slow,
  • difficulties due to limited object-oriented programming capabilities, e.g. an agent-based simulation is a challenge,
  • cannot order a pizza for you?

What is a programming language?

Definition of a programming language

A programming language is a formal computer language or constructed language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs to control the behavior of a machine or to express algorithms. [source: Wikipedia]

What defines a programming language?

Think of a program as a flow of data from one function to another that does something to the data.


There are three main things that define a programming language:

  • type system – what types of data can I process,
  • syntax – the form defined by a language grammar,
  • semantics – the meaning of statements.

Syntax

Syntax as form

  • Syntax is the form, typically defined by a context-free (Chomsky type-2) grammar, e.g. the same computation written in two syntaxes:

    • 2 * 1 + 1
    • (+ (* 2 1) 1)

The programming language Lisp is defined by the following grammar (BNF, or Backus-Naur Form):

expression ::= atom | list
atom       ::= number | symbol
number     ::= [+-]?['0'-'9']+
symbol     ::= ['A'-'Z''a'-'z'].*
list       ::= '(' expression* ')'
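
The two notations above can be tried in R itself. The following is a minimal sketch using only base R: every infix operator in R is also an ordinary function that can be called in prefix form with backticks, which mirrors the Lisp-style expression. The variable names are chosen for illustration only.

```r
# Infix form, as usually written in R
infix <- 2 * 1 + 1

# The same computation in prefix form: operators are ordinary functions
prefix <- `+`(`*`(2, 1), 1)

# Both spellings are parsed into the same expression tree
same_tree <- identical(quote(2 * 1 + 1), quote(`+`(`*`(2, 1), 1)))

# as.list() exposes the top of the tree: the function, then its arguments
tree <- as.list(quote(2 * 1 + 1))

print(infix)
print(prefix)
print(same_tree)
print(tree)
```

The form differs, the meaning does not: both spellings describe "multiply 2 by 1, then add 1".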

Semantics

Semantics – give it a meaning

Semantics is the meaning. A grammatically correct sentence does not necessarily have a proper meaning:

  • “Colorful yellow train sleeps on a crazy wave.” – has no generally accepted meaning.
  • “There is $500 on his empty bank account.” – cannot evaluate to true.

Semantics – extra details

  • Static semantics – in compiled languages, e.g. checking that every identifier is declared before the first use or that the conditionals have distinct predicates.

  • Dynamic semantics – how the chunks of code are executed. For instance lazy vs. eager evaluation.
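
Lazy versus eager evaluation can be observed directly in R, which evaluates function arguments lazily. The sketch below is an illustration only; `firstOnly` and `bothForced` are hypothetical function names.

```r
# R evaluates function arguments lazily: an argument is evaluated
# only when (and if) the function body actually uses it.
firstOnly <- function(a, b) {
  a  # b is never touched, so it is never evaluated
}

# Under eager evaluation stop() would raise an error before
# firstOnly() even starts; in R the call simply returns 1.
lazy_result <- firstOnly(1, stop("this argument is never evaluated"))

# force() requests the evaluation explicitly, so the error now surfaces
bothForced <- function(a, b) {
  force(b)
  a
}
eager_result <- tryCatch(
  bothForced(1, stop("now it is evaluated")),
  error = function(e) "error"
)

print(lazy_result)
print(eager_result)
```

The syntax of both calls is identical; only the rules for when the arguments are executed decide whether an error occurs.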

The type system

How to represent things?

  • Untyped:
    • Assembler – everything is a byte.
  • Typed:
    • 1 - integer
    • 1.0 - float
    • “1.0” - string
  • Single type:
    • HTML – everything is a character.

The type system – extra details

  • Static vs. dynamic typing.
    • Static - type determined before execution, either declared by the programmer (manifestly typed) or inferred by the compiler (type-inferred):

      integer i   # Declaration  
      i = 1       # Initialization
    • Dynamic - type determined when executing.

      i = 1
  • Weak vs. strong typing, e.g. 1 + '1' =
    • Weak - 2 or “11”
    • Strong - ERROR.
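
How R itself handles these cases can be checked with a few lines of base R. This is a small sketch, not a full description of R's type system: for this particular expression R behaves like a strongly typed language, although it does coerce types silently in some other situations.

```r
# Dynamic typing: the type belongs to the value, not to the name
i <- 1L
type_before <- class(i)   # integer
i <- "one"
type_after <- class(i)    # character

# The typed literals from the list above
int_class    <- class(1L)      # integer
float_class  <- class(1.0)     # numeric (a double-precision float)
string_class <- class("1.0")   # character (a string)

# 1 + "1": R refuses to add a number and a string
sum_result <- tryCatch(
  1 + "1",
  error = function(e) "ERROR"
)

# A conversion has to be requested explicitly
explicit <- 1 + as.numeric("1")

print(c(type_before, type_after))
print(sum_result)
print(explicit)
```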

Types – ERROR checking!

Programming paradigms

Major types of programming languages

  • imperative – a set of step-by-step instructions (R),
  • declarative – a clearly defined goal.

About programming paradigms

There are many programming paradigms, e.g.:

  • imperative:
    • literate (R, knitr, Sweave, R Markdown),
    • procedural (R - functions),
  • declarative:
    • functional (R, \(\lambda\)-abstraction),
  • agent-oriented,
  • structured:
    • object-oriented (R, S3 and S4 classes).

Interpreted vs. compiled languages

Machine code

  • Computers understand machine code, not programming languages!
  • Machine code is what the processor (CPU) understands.
  • Code written in any programming language has to be turned into machine code in one way or another.

Interpreters and compilers

Two major approaches exist to turn code written in a particular language into machine code:

  • Interpretation – on-the-fly translation of your code, theoretically line-by-line. This is done every time you run your program and the job is done by software called an interpreter.

  • Compilation – your program is translated and saved as machine code and as such can be directly executed on the machine. The job is performed by a compiler.

A more formal description of R

  • Interpreted – the code is translated by the interpreter every time it is run.
  • Dynamically typed – you do not declare types.
  • Multi-paradigm:
    • array – works on multi-dimensional data structures, like vectors or matrices,
    • functional – treats computation as evaluation of math functions,
    • imperative – the programmer specifies how to solve the problem,
    • object-oriented – allows working with objects: data + things you can do to the data,
    • procedural – structure is organised in procedures and procedure calls, e.g. functions, and
    • reflective – the code can modify itself at runtime.
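
The paradigms listed above can each be shown in a line or two of base R. The snippet is only a sketch; the S3 class `pet` and the generic `describe` are hypothetical names made up for this illustration.

```r
# array: operations apply to whole vectors at once
x <- c(1, 2, 3)
squares <- x^2

# functional: functions are values that can be passed to other functions
doubled <- sapply(x, function(v) 2 * v)

# imperative / procedural: explicit step-by-step instructions
total <- 0
for (v in x) {
  total <- total + v
}

# object-oriented (S3): data plus a method that acts on the data
pet <- structure(list(name = "Rex"), class = "pet")
describe <- function(obj) UseMethod("describe")
describe.pet <- function(obj) paste("A pet called", obj$name)
description <- describe(pet)

# reflective: code is data that can be modified at runtime
expr <- quote(1 + 2)
expr[[1]] <- as.name("*")   # turn 1 + 2 into 1 * 2
reflected <- eval(expr)

print(squares)
print(doubled)
print(total)
print(description)
print(reflected)
```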

So how to program?

Divide et impera – Divide and rule.

Top-down approach: define the big problem and split it into smaller ones. Assume you have solutions to the small problems and continue – push the responsibility down. Wishful thinking!

You've got a csv file that contains data about people:
year of birth, favorite music genre, salary and the
name of a pet if the person has one. Your task is
to read the data and, for people born in particular
decades (..., 1950s, 1960s, ...), compute the
mean and the variance of salary and find the most
frequent pet name.

Problem decomposition 1

This task can be decomposed into:

  • read data from csv file,
  • split the data into age classes based on the decade of birth,
  • compute the mean and the variance of salary per class,
  • find the most frequent pet name per class.
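
The four steps can be sketched in base R as follows. Everything specific here is an assumption made for illustration: the column names (`birth_year`, `genre`, `pet`, `salary`), the six example records, and the helper `mostFrequent`. The data are read from a text string instead of a file so that the example is self-contained.

```r
# Hypothetical data standing in for the csv file
csv_text <- "birth_year,genre,pet,salary
1952,jazz,Rex,3000
1958,rock,Rex,5000
1955,folk,,4000
1961,rock,Tom,2000
1967,pop,Tom,6000
1969,jazz,Kit,4000"

# 1. read data from csv file (here: from text)
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# 2. split the data into classes based on the decade of birth
people$decade <- (people$birth_year %/% 10) * 10

# 3. compute the mean and the variance of salary per class
#    (note: the built-in var() divides by n - 1, not by n)
mean_salary <- tapply(people$salary, people$decade, mean)
var_salary  <- tapply(people$salary, people$decade, var)

# 4. find the most frequent pet name per class, ignoring people without a pet
mostFrequent <- function(pet_names) {
  pet_names <- pet_names[!is.na(pet_names) & pet_names != ""]
  names(which.max(table(pet_names)))
}
top_pet <- tapply(people$pet, people$decade, mostFrequent)

print(mean_salary)
print(var_salary)
print(top_pet)
```

Each numbered comment corresponds to one sub-problem, so every step can be written and tested separately.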

Problem decomposition 2

To compute the mean you have to: sum all values, divide the sum by the number of values – simple enough, we can program it right away.

To compute the variance you need to first refresh the formula: \[\mathrm{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2\]

Thus, you realise that you need to compute the mean, but you know how to do this from the previous point. So, instead of coding computation of the mean twice, make a function that you can reuse! Laziness is the major driving force of a programmer!

Let’s put it down!

Pseudocode 1

\[\mathrm{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2\]

Task: create the computeMean procedure that computes the mean for a sequence of numbers

Input: a sequence of numbers, e.g.: {1, 4, 5.7, 42357.533, 42}. Wait, isn’t it a vector?

Output: the computed mean, a single number, that is what we want our procedure to return.

function computeMean(aVector) {  
    sum = sum all numbers in aVector  
    count = count how many numbers are in aVector
    theMean is: sum / count  
    return theMean  
} 
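
One possible translation of the pseudocode above into R is sketched below. `computeVariance` is a hypothetical companion function, not defined in the text, added to show how the mean can be reused; it follows the formula on this slide and divides by n.

```r
# Mean: sum all values, divide the sum by the number of values
computeMean <- function(aVector) {
  total <- sum(aVector)
  count <- length(aVector)
  theMean <- total / count
  return(theMean)
}

# Variance as in the formula above: the mean of the squared
# deviations from the mean, so computeMean is simply reused twice
computeVariance <- function(aVector) {
  theMean <- computeMean(aVector)
  computeMean((aVector - theMean)^2)
}

x <- c(1, 2, 3, 6)
print(computeMean(x))       # 3
print(computeVariance(x))   # 3.5
```

Note that R's built-in `var()` divides by n - 1 instead of n, so it returns a slightly larger value for the same input.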

Summary

So far, we have learnt about:

  • what R is and what it is not,
  • history of R,
  • the system of packages,
  • advantages and disadvantages of the language,
  • definition of a programming language,
  • elements of a programming language (types, syntax and semantics),
  • programming paradigms,
  • wishful thinking,
  • problem decomposition,
  • pseudocode.

Quite a bit, right?