Archive

R

The following process, describing how to build R from sources, was run on a Ubuntu 16.04 Xenial Xerus.

Minimal Dependencies

The minimal dependencies to configure a standard version of a R can be installed through APT:

sudo apt-get install gfortran
sudo apt-get install build-essential

sudo apt-get install libreadline6 libreadline6-dev

sudo apt-get install xorg-dev

Java is a requirement. Under ubuntu, the package openjdk-9-jdk has a problem that can be avoided by forcing it to overwrite system files:

sudo apt-get -o Dpkg::Options::="--force-overwrite" install openjdk-9-jdk
sudo apt-get install openjdk-9-jre

Once the installation of the dependencies are done we can go for a minimal configuration.

Minimal Configuration

/.configure

Check the last section Errors if you get some exception and do not get the following result:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, NLS
Options enabled:           shared BLAS, R profiling

Capabilities skipped:      JPEG, TIFF, cairo, ICU
Options not enabled:       memory profiling

Recommended packages:      yes

configure: WARNING: you cannot build info or HTML versions of the R manuals
configure: WARNING: you cannot build PDF versions of the R manuals
configure: WARNING: you cannot build PDF versions of vignettes and help pages

OPTIONAL. To remove the WARNING that makes reference to the HTML documentation, we need to install the tool texinfo:

sudo apt-get install texinfo

OPTIONAL. To remove the WARNING that makes reference to the PDF documentation, we need to install LaTeX. I propose to use texlive.

WARNING. The package texlive-full will install a lot of packages, hence it will be very time consuming.

sudo apt-get install texlive-full

Extended Configuration

The first extra step is to enable R and RStudio to talk each other. This is done by adding the modifier --enable-R-shlib. Moreover we usually want to link R with system BLAS libraries rather than use the internal versions from R.

./configure --enable-R-shlib --with-blas --with-lapack

This will result in:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, NLS
Options enabled:           shared R library, shared BLAS, R profiling

Capabilities skipped:      JPEG, TIFF, cairo, ICU
Options not enabled:       memory profiling

Recommended packages:      yes

We can see that the section Options enabled includes shared R library.

Next is to allow to use external libraries like cairo or jpeglib. Before, we need to install the dependencies for them.

To allow R to use jpeglib:

sudo apt-get install libjpeg9-dev

To allow R to use cairo:

sudo apt-get install libcairo2-dev libxt-dev

Then we configure again R:

./configure --enable-R-shlib --with-blas --with-lapack --with-cairo --with-jpeglib --with-readline --enable-R-profiling --enable-memory-profiling

The line Additional capabilities has changed:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, JPEG, NLS, cairo
Options enabled:           shared R library, shared BLAS, R profiling, memory profiling

Capabilities skipped:      TIFF, ICU
Options not enabled:

Recommended packages:      yes

Now, adding the argument --prefix we locate where to install the new version of R:

./configure --prefix=/home/carleshf/Software/R-3.3.3 --enable-R-shlib --with-blas --with-lapack --with-cairo --with-jpeglib --with-readline --enable-R-profiling --enable-memory-profiling

Compiling R

Once the configuration is done, R needs to be compiled. For this operation we will use the tool make.

Being in R-sources folder we run:

make

This command will generate lots of text. Take care that the following to fragments are clean of errors.

configuring Java ...
Java interpreter : /usr/bin/java
Java version     : 9-internal
Java home path   : /usr/lib/jvm/java-9-openjdk-amd64
Java compiler    : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar

[...]

JAVA_HOME        : /usr/lib/jvm/java-9-openjdk-amd64
Java library path: $(JAVA_HOME)/lib/amd64/server
JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
JNI linker flags : -L$(JAVA_HOME)/lib/amd64/server -ljvm
Updating Java configuration in /home/carleshf/Downloads/R-3.3.3
Done.

After the command, a new folder was created with R binaries:

carleshf@sky:~/Downloads/R-3.3.3$ ll
total 2624
drwxr-xr-x 15 carleshf carleshf    4096 abr  4 23:23 ./
drwxr-xr-x  3 carleshf carleshf    4096 abr  4 23:16 ../
drwxrwxr-x  3 carleshf carleshf    4096 abr  4 23:18 bin/

Optional. Once R is compiled, it can be checked using the same make command.

make check

Installing R

After the configuration process and the compilation, a single command is required to install R:

make install

Optionally, you might want to point the R command to the latest R build (in this case, R version 3.3.3).

sudo ln -s /home/carleshf/Software/R-3.3.3/bin/R /bin/R
sudo ln -s /home/carleshf/Software/R-3.3.3/bin/Rscript /bin/Rscript

Optional. Now we can check if the flags used in the configuration process were properly applied. In an R session:

R> capabilities()
jpeg         png        tiff       tcltk         X11        aqua
TRUE        TRUE       FALSE       FALSE        TRUE       FALSE
http/ftp     sockets      libxml        fifo      cledit       iconv
TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
NLS     profmem       cairo         ICU long.double     libcurl
TRUE        TRUE        TRUE       FALSE        TRUE        TRUE

Errors

Configuration

The following is a list of common errors and the packages required to overcome the problem.

bzip2

checking whether bzip2 support suffices... configure: error: bzip2 library and headers are required
sudo apt-get install libbz2-dev

liblzma

configure: error: "liblzma library and headers are required"
sudo apt-get install liblzma-dev

PCRE

checking whether PCRE support suffices... configure: error: pcre >= 8.10 library and headers are required
sudo apt-get install libpcre3-dev

libcurl

configure: error: libcurl >= 7.28.0 library and headers are required with support for https
sudo apt-get install libcurl4-openssl-dev

Post Installation

XML R package

sudo apt-get install libxml2-dev

devtools R package

ERROR: dependencies ‘httr’, ‘git2r’ are not available for package ‘devtools’
* removing ‘/home/kuragari/Software/R-3.3.3/lib/R/library/devtools’
sudo apt-get install libssl-dev

In R there are many functions that work with a pattern written as a regular expression. Today I needed to deal with one of these functions: str_locate_all (doc) from stringr

My goal was to find "223777_at [Chip: U133B]" in a series of strings like the following one:

text <- "11753227_s_at [Chip: PrimeView]; 223777_at [Chip: HT_HG-U133B]; 223777_PM_at [Chip: U133_Plus_PM]; 48336_at [Chip: U95B]; 223777_at [Chip: GeneProfilingArray]; g13477210_3p_at [Chip: U133_X3P]; MmugDNA.4759.1.S1_at [Chip: Rhesus]; 11753227_s_at [Chip: HG-U219]; ADXECADA.19261_s_at [Chip: Xcel]; ADXECRS.13279_at [Chip: Xcel]; ADXECRS.13279_x_at [Chip: Xcel]; 223777_at [Chip: U133B]; 223777_at [Chip: U133_Plus_2]; RC_T49570_at [Chip: Hu35KsubB]"

The way to find the location (in my case all the locations) of the pattern number slash at white-space bracket Chip two-points U133B brecket follows:

str_locate_all(text, pattern="[0-9]+_at \\[Chip: U133B\\]")

This returns a list of matrix with the locations (start and end point) of all the occurrences found by the given pattern. Take care, so the brackets need to be escaped by double slash.

A long the process of creating and testing R packages I usually need to unload a loaded package. The function to perform this operation is called detach and must be run as:

detach(name="package:rexposome", unload=TRUE)

The name argument needs to be filled with package: and followed by the name of the package to be unload, that can be found at search().

This last week I used lost of Microsoft Excel files that I need in my R scripts. Hence I discovered the excellent and complete package XLConnect. But, for easy and fast working, its not the best solution.

So I codded a wrapper of XLConnect, calling it loadxls that allows to read and write Excel files in an easy way.

loadxls

The R package loadxls can be installed from its github repository with:

devtools::install_github("carleshf/loadxls")

It implements only 4 functions: read_all, read_sheet, write_all and write_sheet.

Functions

read_all

read_all(filename, environment = parent.frame(), verbose = TRUE)

This functions reads a given Excel file and loads each sheet as a data.frame in the current environment. The created objects will have the name of the sheet.

read_sheet

read_sheet(filename, sheetname, varname, environment = parent.frame(), verbose = TRUE)

This functions loads, instead of the full content of a file, only a given sheet. If the argument varname is supplied, the object loaded from the given sheet will take its name.

write_all

write_all(..., filename, verbose = TRUE)

This function writes all the objects given though ... to a new Excel file, saving each objects as a new sheet.

write_sheet

write_sheet(data, sheetname, filename, replace = FALSE, verbose = TRUE)

This function allows to save a single object to and giving the name of the sheet where it will be write.

I had a file containing ~1M SNPs in their rsid (and their position). I needed to complete the information with their chromosome.

Since the list is really large I use scan combined with a bash command (to know the length of the file).

I found this solution:

library("SNPlocs.Hsapiens.dbSNP.20120608")

## Create connection to big file
inputName <- "input_file.gen"
outputName <- "output_file.gen"
inputCon <- file(description = inputName, open = "r")
outputCon <- file(description = outputName, open = "w")

## We need to know the number of lines of the big file
## This will work for GNU/Linux environments
command <- paste("wc -l ", inputName, " | awk '{ print $1 }'", sep="")
nLines <- as.numeric(system(command = command, intern = TRUE))
rm(command)

## Loop over the file connection until end of lines
pb <- txtProgressBar(min = 0, max = nLines, style = 3)
for(ii in 1:nLines) {
  readLine <- scan(file = inputCon, nlines = 1, what = "list", quiet = TRUE)

  ## Get the chr from the SNP
  x <- tryCatch({
    x <- rsidsToGRanges(readLine[[2]])
    as.character(x@seqnames[1])
  }, error = function(e) {
    "---"
  })

  ## Update on list and save on output
  readLine[[1]] <- x
  writeLines(paste(readLine, collapse = " "), outputCon)

  setTxtProgressBar(pb, ii)
}

close(inputCon)
close(outputCon)

This solution is really slow but it gets the SNP’s chromosome and fills it as "---" when the SNP is not found in dbSNP.

WARNING: The SNP’s rsid must be located into the second column of the file. Take a look at readLine[[2]] in the for.


I am open to know different ways of doing this.

PsyGeNET is a database that integrates information on psychiatric disorders and their genes (check its About page for more information). The current version of the database centered the information of three main psychiatric disorders: Alcoholism, Depression and Cocaine-Related-Disorders.

Currently the author of PsyGeNET, Alba Gutiérrez, and me are developing an R package (PsyGeNET2R) to query the information stored into the database and to perform some analysis using this information. We thought that could be a good idea to perform an enrichment analysis on the three main psychiatric disorders given a list of genes of interest.

Using the PsyGeNET‘s browser we can get the following code to download the stored data:

library(RCurl)
oql <- "DEFINE
    c0='/data/main',
    c1='/data/genes',
    c2='/data/diseases',
    c3='/data/datasources'
  ON
   'http://www.psygenet.org/web/PsyGeNET'
  SELECT
    c1 (Gene_Symbol, Gene_Id, Gene_Description),
    c2 (Disease_Id, Disease_code, DiseaseName, PsychiatricDisorder),
    c0 (Score, Number_of_Abstracts)
  FROM
    c0
  WHERE
    c3 ='ALL'
  ORDER BY
    c0.Score DESC";

dataTsv <- getURLContent("http://www.psygenet.org/oql", readfunction = charToRaw(oql), upload = TRUE, customrequest = "POST")
data <- read.csv(textConnection(dataTsv), header = TRUE, sep=";")

The data.frame called data corresponds to a generalization of the information stored into PsyGeNET:

> colnames(data)
[1] "c1.Gene_Symbol"         "c1.Gene_Id"             "c1.Gene_Description"
[4] "c2.Disease_Id2"         "c2.Disease_code"        "c2.DiseaseName"
[7] "c2.PsychiatricDisorder" "c0.Score"               "c0.Number_of_Abstracts"

The following tables shows the content of the first three rows of data:

c1.Gene_Symbol c1.Gene_Id c1.Gene_Description c2.Disease_Id
1 NPY 4852 neuropeptide Y umls:C0001973
2 ADH1B 125 alcohol dehydrogenase 1B (class I), beta polypeptide umls:C0001973
3 ADH1C 126 alcohol dehydrogenase 1C (class I), gamma polypeptide umls:C0001973
c2.Disease_code c2.DiseaseName c2.PsychiatricDisorder c0.Score
1 C0001973 Alcoholism Alcoholism 0.9613497
2 C0001973 Alcoholism Alcoholism 0.9000000
3 C0001973 Alcoholism Alcoholism 0.9000000
c0.Number_of_Abstracts
1 21
2 82
3 37

From the vignette “psyGeNET2R: Case study with genes from GWAS study related to bipolar disorder” I extracted a list of genes of interest:

genesOfInterest <- c( "ADCY2", "FAM155A", "MSI2", "AKAP13",
  "FLJ16124", "NFIX", "SIPA1L2", "SNX8", "ANK3", "FSTL5",
  "NGF", "SPERT", "ANKS1A", "GATA5", "NPAS3", "STK39", "ATP6V1G3",
  "GNA14", "ODZ4", "SYNE1", "ATXN1", "GPR81", "PAPOLG", "THSD7A",
  "C11orf80", "HHAT", "PAX1", "TNR", "C15orf53", "IFI44", "PBRM1",
  "TRANK1", "CACNA1C", "ITIH3", "PTPRE", "TRIM9", "CACNA1D", "KDM5B",
  "PTPRT", "UBE2E3", "CACNB3", "KIF1A", "RASIP1", "UBR1", "CROT",
  "LOC150197", "RIMBP2", "ZMIZ1", "DLG2", "MAD1L1", "RXRG", "ZNF274",
  "DNAJB4", "MAPK10", "SGCG", "DUSP22", "MCM9", "SH3PXD2A"
)

But from the total of 58 genes, only 21 are in PsyGeNET today (January 06th, 2015):

> length(genesOfInterest)
[1] 58
> sum(genesOfInterest %in% data$c1.Gene_Symbol)
[1] 21
> sel <- genesOfInterest[genesOfInterest %in% data$c1.Gene_Symbol]
> sel
 [1] "MSI2"     "NFIX"     "ANK3"     "NGF"      "ANKS1A"   "NPAS3"
 [7] "SYNE1"    "THSD7A"   "C11orf80" "C15orf53" "PBRM1"    "CACNA1C"
[13] "ITIH3"    "PTPRE"    "PTPRT"    "CACNB3"   "RASIP1"   "ZMIZ1"
[19] "DLG2"     "MAD1L1"   "SGCG"

Using the strategy explained at the post entiteled Understanding Hypergeometric Tests to know if the selected genes are enriched into “Depression”, the description-table will look like:

Depression ¬Depression Total
selection 17 21 – 17
background 877 1317 – 877
total 877 440 1317

All the data comes from:

> data.dep = data[data$c2.PsychiatricDisorder == "Depression", ]
> x <- sum(genesOfInterest %in% data$c1.Gene_Symbol)
> n <- sum(sel %in% data.dep$c1.Gene_Symbol)
> t <- length(unique(data.dep$c1.Gene_Symbol))
> T <- length(unique(data$c1.Gene_Symbol))
> x
[1] 21
> n
[1] 17
> t
[1] 877
> T
[1] 1317

To end, we can say that those selected genes are not enriched in “Depression”:

> phyper(n - 1, t, T - t, x, lower.tail = FALSE)
[1] 0.1177713

Hypergeometric test are useful to perform enrichment analysis. As I see, the most performed enrichment analysis is the one where people want to obtain a list of enriched GO terms given a list of genes.

The hypergeometric test is the equivalent of the one-tailed Fisher’s exact test, giving the statistical confidence in p-values.

For example, given a shuffled poker deck with no jokers we want to see if getting five random cards the result is diamond-enriched:

  • x (x = 5): Is the sample’s size
  • t (t = 13): Number of cards of each suit
  • T (T = 52): Total number of cards in the deck (T = t x 4)
  • n: Number of diamonds in sample
Diamonds ¬Diamonds Total
selection n x – n
left t – n T – t – x + n
total T

This table summaries the content of the selection and the deck from diamonds’ point of view. The first row corresponds to the description of the selection cards: x total cards and n diamonds. The second row describes the left part: 39 (from the original 52 cards we subtracted the diamonds suit)

This experiment can be seen from the background point of view, instead of the left point. This table is:

Diamonds ¬Diamonds Total
selection n x – n
background t T – t
total T

If our selection of 5 cards was 4♣, A♦, K♦, 8♥ and 5♠, the left-table for n = 2 should be:

Diamonds ¬Diamonds Total
selection n = 2 5 – 2 = 3
left 13 – 2 = 11 52 – 13 – 5 + 2 = 36
total 2 + 11 = 13 36 + 3 = 39 13 + 36 = 52

And the background-table:

Diamonds ¬Diamonds Total
selection n = 2 5 – 2 = 3
background t = 13 52 – 13 = 39
total 13 39 13 + 39 = 52

Using R, this experiment can be done from both points of view: the fisher.test function uses the left-point of view while the function phyper uses the background-point of view.

From the help of R function phyper:

Description
Density, distribution function, quantile function and random generation for the hypergeometric distribution.

Usage
dhyper(x, m, n, k, log = FALSE)
phype(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

Arguments
x, q: vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.
m: the number of white balls in the urn.
n: the number of black balls in the urn.
k: the number of balls drawn from the urn.

log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise, P[X > x].

So, from our previous description, the call to phyper, must be:

> phyper(n, t, T - t, x, lower.tail = FALSE)
[1] 0.09276711

In other words, we are calculating the Cumulative Probability of P(X > n). But in fact, as we are looking to this experiment to understand the mechanism under an enrichment analysis, we want the Cumulative Probability of P(X >= n). That is because we want to know the probability of having one, two, three, four and five diamond cards (one, one and two, one and two and three,…) in our selection. This p-value is obtained from phyper as:

> phyper(n - 1, t, T - t, x, lower.tail = FALSE)
[1] 0.3670468

Just as a curiosity, to get the Hypergemoetric Probability of P(X = n) we need to use the R function dhyper as follows:

> dhyper(n, t, T - t, x)
[1] 0.2742797