Up until now we have mostly created the object we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also show to save data from R. After this exercise you will know how to:
The function scan() can be used both to read data from files and directly from keyboard. The function is very flexible and have many different settings that allow to read data in different formats. To read and store a set of words that you type on your keyboard try the following code that will prompt your for input. After each word press enter and R will prompt you for new input. After the last word have been typed press enter twice to get back to your R prompt and have your character vector named words available in R your session.
words <- scan(what = character())
Download the file book chapter from this link. Read the manual for scan and read the text file named book_chapter.txt into R, first as vector and then as a list, with each word in the chapter saved as a entry in the vector or as a single vector in a list.
shelley.vec <- scan(file = "book_chapter.txt", what = character()) str(shelley.vec) shelley.list <- scan(file = "book_chapter.txt", what = list(character())) class(shelley.list) Read 420 items chr [1:420] "My" "present" "situation" "was" "one" "in" ... Read 420 records [1] "list"
Check that your newly created objects contain the correct information and have been saved as you have intended eg. each entry of the vector or the list should contain a single word. Once your convinced that you have a sound word vector and list.
sort(nchar(shelley.vec), decreasing = TRUE) which(nchar(shelley.vec) == max(nchar(shelley.vec))) shelley.vec[381] [1] 690 12 11 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 [19] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 [37] 9 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 [55] 8 8 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 [73] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 6 [91] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 [109] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 [127] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 [145] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 [163] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 [181] 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 [199] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 [217] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 [235] 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 [253] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 [271] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 [289] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 [307] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 [325] 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [343] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [361] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [379] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [397] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [415] 1 1 1 1 1 1 [1] 381 [1] "By the sacred earth on which I kneel, by the shades that wander near me, by the deep and eternal grief that I feel, I swear; and by thee, O Night, and the spirits that preside over thee, to pursue the daemon who caused this misery, until he or I shall perish in mortal conflict. For this purpose I will preserve my life; to execute this dear revenge will I again behold the sun and tread the green herbage of earth, which otherwise should vanish from my eyes forever. And I call on you, spirits of the dead, and on you, wandering ministers of vengeance, to aid and conduct me in my work. Let the cursed and hellish monster drink deep of agony; let him feel the despair that now torments me."
shelley.vec2 <- scan(file = "book_chapter.txt", what = 'character', sep = ' ', quote = NULL) shelley.filt2 <- gsub(pattern = '[^[:alnum:] ]', replacement = "", x = shelley.vec2) which(nchar(shelley.filt2) == max(nchar(shelley.filt2))) shelley.filt2[301] Read 551 items [1] 301 [1] "uninterested"
This is the by far most common way to get data into R. As the function creates a data frame at import it will only work for data set that fits those criteria, meaning that the data needs to have a set of columns of equal length that are separated with a common string eg. tab, comma, semicolon etc.
In this code block with first import the data from normalized.txt from a file (that you can get here) and accept the defaults for all other arguments in the function. With this settings R will read it as a tab delimited file and will use the first row of the data as colnames (header) and the first column as rownames.
expr.At <- read.table("normalized.txt")
head(expr.At)
bZIP29_1 bZIP29_2 bZIP29_3 WT_1 WT_2 WT_3
AT1G01020.1 13.572739 14.167143 12.972703 14.8181738 15.904017 11.3270623
AT1G01030.1 1.417234 1.201454 1.385434 0.8590246 1.096829 0.8596431
AT1G01040.2 41.862906 41.199853 42.696566 37.5286358 34.849241 48.6456871
AT1G01050.1 102.422397 98.318969 92.068406 104.2104178 97.418336 86.0654463
AT1G01060.1 3.216031 4.004846 3.589534 3.7045434 2.642360 6.2703380
AT1G01070.1 8.230858 17.871625 14.924906 7.9996663 8.824486 13.8554244
One does however not have to have all data as a file an the local disk, instead one can read data from online resources. The following command will read in a file from a web server.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
abalone = read.table(url, header = F , sep = ',')
head(abalone)
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775 0.120 8
ed <- read.table("example.data", sep = ":") head(ed) str(ed)
Encode all NA values as “missing”, at export. \
write.table(x = ed, na = "missing", file = "example_mis.data")
df.test <- read.table("example_mis.data", na.strings = "missing")
For this exercise we will load a GTF file into R and calculate some basic summary statistics from the file. In the first part we will use basic manipulations of data frames to extract the information. In the second part you get a try out a library designed to work with annotation data, that stores the information in a more complex format, that allow for easy manipulation and calculation of summaries from genome annotation files.
For those not familiar with the gtf format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.
A valid GTF file should contain the following tab delimited fields (taken from the ensembl home page).
1. | 2. | 3. | 4. | 5. | 6. | 7. | 8. | 9. |
---|---|---|---|---|---|---|---|---|
1 | transcribed_unprocessed_pseudogene | gene | 11869 | 14409 | . | + | . | gene_id; "ENSG00000223972"; |
1 | processed_transcript | transcript | 11869 | 14409 | . | + | . | gene_id; "ENSG00000223972"; |
The last column can contain a large number of attributes that are comma-separated.
As these files for many organisms are large we will in this exercise use the latest version of Drosophila melanogaster genome annotation available at ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster that is small enough for analysis even on a laptop.
Open this and download the file named Drosophila_melanogaster.BDGP6.86.gtf.gz to your computer. Unzip this file and keep track of where your store the file.
With this done read this file into R using the function read.table and add meaningful column names to the table.
d.gtf <- read.table("Drosophila_melanogaster.BDGP6.86.gtf", header = FALSE, comment.char = "#", sep = "\t") colnames(d.gtf) <- c("Chromosome", "Source", "Feature", "Start", "End", "Score", "Strand", "Frame", "Attribute")
Prior to any analysis you should make sure that your attempt to read in the file has worked as expected. This can for example be done by having a look at the dimension of the stored object and making sure that it has the structure you expect.
dim(d.gtf) str(d.gtf) [1] 538684 9 'data.frame': 538684 obs. of 9 variables: $ Chromosome: Factor w/ 57 levels "211000022278158",..: 51 51 51 51 51 51 51 51 51 51 ... $ Source : Factor w/ 2 levels "FlyBase","ensembl": 1 1 1 1 1 1 1 1 1 1 ... $ Feature : Factor w/ 9 levels "CDS","Selenocysteine",..: 5 9 3 5 9 3 1 6 3 1 ... $ Start : int 722370 722370 722370 835381 835381 835381 835381 835381 869486 869486 ... $ End : int 722621 722621 722621 2503907 2503907 835491 835491 835383 869548 869548 ... $ Score : Factor w/ 1 level ".": 1 1 1 1 1 1 1 1 1 1 ... $ Strand : Factor w/ 2 levels "+","-": 2 2 2 1 1 1 1 1 1 1 ... $ Frame : Factor w/ 4 levels ".","0","1","2": 1 1 1 1 1 1 2 2 1 2 ... $ Attribute : Factor w/ 455338 levels "gene_id FBgn0000003; gene_name 7SLRNA:CR32864; gene_source FlyBase; gene_biotype lincRNA;",..: 348579 348581 348580 453172 453215 453194 453195 453193 453199 453200 ...
levels(d.gtf$Chromosome) [1] "211000022278158" "211000022278279" [3] "211000022278282" "211000022278298" [5] "211000022278307" "211000022278309" [7] "211000022278436" "211000022278449" [9] "211000022278498" "211000022278522" [11] "211000022278603" "211000022278604" [13] "211000022278664" "211000022278724" [15] "211000022278750" "211000022278760" [17] "211000022278875" "211000022278877" [19] "211000022278878" "211000022278879" [21] "211000022278880" "211000022278985" [23] "211000022279055" "211000022279108" [25] "211000022279132" "211000022279134" [27] "211000022279165" "211000022279188" [29] "211000022279222" "211000022279264" [31] "211000022279342" "211000022279392" [33] "211000022279446" "211000022279528" [35] "211000022279529" "211000022279531" [37] "211000022279555" "211000022279681" [39] "211000022279708" "211000022280133" [41] "211000022280328" "211000022280341" [43] "211000022280347" "211000022280481" [45] "211000022280494" "211000022280645" [47] "211000022280703" "2L" [49] "2R" "3L" [51] "3R" "4" [53] "Unmapped_Scaffold_8" "X" [55] "Y" "dmel_mitochondrion_genome" [57] "rDNA"
aggregate(d.gtf$Feature, by = list(d.gtf$Chromosome), summary) Group.1 x.CDS x.Selenocysteine x.exon x.five_prime_utr 1 211000022278158 0 0 1 0 2 211000022278279 0 0 1 0 3 211000022278282 0 0 1 0 4 211000022278298 0 0 1 0 5 211000022278307 0 0 1 0 6 211000022278309 0 0 1 0 7 211000022278436 0 0 1 0 8 211000022278449 0 0 2 0 9 211000022278498 0 0 1 0 10 211000022278522 0 0 1 0 11 211000022278603 0 0 1 0 12 211000022278604 0 0 1 0 13 211000022278664 0 0 1 0 14 211000022278724 0 0 1 0 15 211000022278750 0 0 1 0 16 211000022278760 2 0 2 0 17 211000022278875 0 0 1 0 18 211000022278877 0 0 1 0 19 211000022278878 0 0 1 0 20 211000022278879 0 0 1 0 21 211000022278880 0 0 1 0 22 211000022278985 0 0 1 0 23 211000022279055 0 0 1 0 24 211000022279108 0 0 1 0 25 211000022279132 0 0 1 0 26 211000022279134 0 0 1 0 27 211000022279165 0 0 1 0 28 211000022279188 3 0 3 1 29 211000022279222 0 0 1 0 30 211000022279264 0 0 1 0 31 211000022279342 0 0 1 0 32 211000022279392 0 0 1 0 33 211000022279446 0 0 1 0 34 211000022279528 0 0 1 0 35 211000022279529 0 0 1 0 36 211000022279531 0 0 1 0 37 211000022279555 0 0 1 0 38 211000022279681 0 0 1 0 39 211000022279708 0 0 1 0 40 211000022280133 0 0 1 0 41 211000022280328 4 0 4 1 42 211000022280341 0 0 1 0 43 211000022280347 0 0 1 0 44 211000022280481 0 0 1 0 45 211000022280494 0 0 2 0 46 211000022280645 0 0 1 0 47 211000022280703 0 0 1 0 48 2L 28047 2 32747 8374 49 2R 33657 0 38551 8842 50 3L 29473 0 34347 8738 51 3R 37167 0 43159 10693 52 4 2732 0 3165 570 53 Unmapped_Scaffold_8 12 0 14 4 54 X 28967 2 34136 8824 55 Y 111 0 182 13 56 dmel_mitochondrion_genome 13 0 37 0 57 rDNA 0 0 21 0
by(data = d.gtf$Feature, d.gtf[,"Chromosome"], summary) d.gtf[, "Chromosome"]: 211000022278158 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278279 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278282 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278298 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278307 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278309 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278436 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278449 CDS Selenocysteine exon five_prime_utr gene 0 0 2 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278498 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278522 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278603 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278604 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278664 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278724 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278750 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278760 CDS Selenocysteine exon five_prime_utr gene 2 0 2 0 1 start_codon stop_codon three_prime_utr transcript 1 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278875 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278877 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278878 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278879 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278880 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022278985 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279055 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279108 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279132 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279134 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279165 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279188 CDS Selenocysteine exon five_prime_utr gene 3 0 3 1 1 start_codon stop_codon three_prime_utr transcript 1 1 1 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279222 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279264 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279342 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279392 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279446 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279528 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279529 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279531 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279555 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279681 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022279708 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280133 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280328 CDS Selenocysteine exon five_prime_utr gene 4 0 4 1 1 start_codon stop_codon three_prime_utr transcript 1 1 1 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280341 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280347 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280481 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280494 CDS Selenocysteine exon five_prime_utr gene 0 0 2 0 2 start_codon stop_codon three_prime_utr transcript 0 0 0 2 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280645 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 211000022280703 CDS Selenocysteine exon five_prime_utr gene 0 0 1 0 1 start_codon stop_codon three_prime_utr transcript 0 0 0 1 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 2L CDS Selenocysteine exon five_prime_utr gene 28047 2 32747 8374 3465 start_codon stop_codon three_prime_utr transcript 5675 5656 6142 6632 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 2R CDS Selenocysteine exon five_prime_utr gene 33657 0 38551 8842 3601 start_codon stop_codon three_prime_utr transcript 6028 6018 6484 6927 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 3L CDS Selenocysteine exon five_prime_utr gene 29473 0 34347 8738 3433 start_codon stop_codon three_prime_utr transcript 5873 5853 6352 6676 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 3R CDS Selenocysteine exon five_prime_utr gene 37167 0 43159 10693 4125 start_codon stop_codon three_prime_utr transcript 7105 7092 7782 8010 ------------------------------------------------------------ d.gtf[, "Chromosome"]: 4 CDS Selenocysteine exon five_prime_utr gene 2732 0 3165 570 111 start_codon stop_codon three_prime_utr transcript 295 289 339 343 ------------------------------------------------------------ d.gtf[, "Chromosome"]: Unmapped_Scaffold_8 CDS Selenocysteine exon five_prime_utr gene 12 0 14 4 2 start_codon stop_codon three_prime_utr transcript 3 3 2 3 ------------------------------------------------------------ d.gtf[, "Chromosome"]: X CDS Selenocysteine exon five_prime_utr gene 28967 2 34136 8824 2647 start_codon stop_codon three_prime_utr transcript 5372 5351 5921 5973 ------------------------------------------------------------ d.gtf[, "Chromosome"]: Y CDS Selenocysteine exon five_prime_utr gene 111 0 182 13 71 start_codon stop_codon three_prime_utr transcript 23 22 10 72 ------------------------------------------------------------ d.gtf[, "Chromosome"]: dmel_mitochondrion_genome CDS Selenocysteine exon five_prime_utr gene 13 0 37 0 37 start_codon stop_codon three_prime_utr transcript 12 10 0 37 ------------------------------------------------------------ d.gtf[, "Chromosome"]: rDNA CDS Selenocysteine exon five_prime_utr gene 0 0 21 0 19 start_codon stop_codon three_prime_utr transcript 0 0 0 19
d.gtf.gene <- d.gtf[d.gtf$Feature == "gene",]
mean(abs(d.gtf.gene$Start - d.gtf.gene$End)) [1] 5753.282
sum(d.gtf.gene$Strand == "+") / length(d.gtf.gene$Strand) [1] 0.5016231
d.gtf3R <- d.gtf[d.gtf$Chromosome == "3R",] exon.position <- d.gtf3R[d.gtf3R$Feature == "exon",c("Start", "End")] median(abs(exon.position$Start - exon.position$End)) mean(abs(exon.position$Start - exon.position$End)) [1] 251 [1] 468.6693
chr <- c("2L", "2R", "3L", "4", "X", "Y") for (i in chr) { d.gtf.tmp <- d.gtf[d.gtf$Chromosome == i,] exon.position <- d.gtf.tmp[d.gtf.tmp$Feature == "exon", c("Start", "End")] exon.med <- median(abs(exon.position$Start - exon.position$End)) exon.mean <- mean(abs(exon.position$Start - exon.position$End)) txt <- sprintf("The median and mean exon length for %s is %g and %g, respectively", i, exon.med, exon.mean) print(txt) } [1] "The median and mean exon length for 2L is 279 and 502.617, respectively" [1] "The median and mean exon length for 2R is 225 and 437.187, respectively" [1] "The median and mean exon length for 3L is 255 and 502.116, respectively" [1] "The median and mean exon length for 4 is 198 and 429.573, respectively" [1] "The median and mean exon length for X is 257 and 526.301, respectively" [1] "The median and mean exon length for Y is 410.5 and 657.055, respectively"