Archive


Today I spent some time looking for this trick: checking, in Excel, whether a value from one column is present in another. The function to use is called VLOOKUP. From the Office Support web page:

Use VLOOKUP, one of the lookup and reference functions, when you need to find things in a table or a range by row. For example, look up a price of an automotive part by the part number.

The VLOOKUP function requires four arguments. The first two are easy to understand, but the third one deserves special attention:

  1. The value to look for.
  2. The range where we want to look for the value.
  3. Index of the column in the range containing the return value.
  4. [optional] TRUE for an approximate match, FALSE for an exact match.

Let’s see an example for argument 3, from the same Office Support site:

For example, if you specify B2:D11 as the range, you should count B as the first column, C as the second, and so on.

So, this means that given a range of a single column (e.g. B2:B2550) this argument will be 1. If the range includes more than one column, it should be the index of the column containing the value to return.

The following is an example of using VLOOKUP to find the value of an item:
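A minimal sketch (the range and values are made up for illustration: part numbers in column B, descriptions in column C and prices in column D, so the table is B2:D11):

=VLOOKUP("A-101", B2:D11, 3, FALSE)

This looks for the part number "A-101" in column B (the first column of the range) and returns the matching value from column D, the third column of the range.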

In my case I want to get a typical boolean TRUE / FALSE to know if the value to look for is in a given column or not. To this end I used two more functions: IF and ISERROR.

The following picture shows the use of VLOOKUP to check if the elements in column A are in column C, using the 4th argument of VLOOKUP set to FALSE:
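In formula form, the check for the first element would be (assuming, as in the final formula below, that the value to look for is in A2 and the column to search is C2:C7):

=VLOOKUP(A2, C2:C7, 1, FALSE)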

As can be seen, when an element from A is not present in C, VLOOKUP returns #N/A (i.e. an error). So we can use ISERROR to check whether VLOOKUP produced an error and IF to return TRUE or FALSE:

The final formula follows:

=IF(ISERROR(VLOOKUP(A2,C2:C7, 1, FALSE)), FALSE, TRUE)

The following process, describing how to build R from sources, was run on an Ubuntu 16.04 (Xenial Xerus) system.

Minimal Dependencies

The minimal dependencies to configure a standard version of R can be installed through APT:

sudo apt-get install gfortran
sudo apt-get install build-essential

sudo apt-get install libreadline6 libreadline6-dev

sudo apt-get install xorg-dev

Java is a requirement. Under Ubuntu, the package openjdk-9-jdk has a problem that can be avoided by forcing it to overwrite system files:

sudo apt-get -o Dpkg::Options::="--force-overwrite" install openjdk-9-jdk
sudo apt-get install openjdk-9-jre

Once the installation of the dependencies is done, we can go for a minimal configuration.

Minimal Configuration

./configure

Check the last section, Errors, if you get an error and do not obtain the following result:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, NLS
Options enabled:           shared BLAS, R profiling

Capabilities skipped:      JPEG, TIFF, cairo, ICU
Options not enabled:       memory profiling

Recommended packages:      yes

configure: WARNING: you cannot build info or HTML versions of the R manuals
configure: WARNING: you cannot build PDF versions of the R manuals
configure: WARNING: you cannot build PDF versions of vignettes and help pages

OPTIONAL. To remove the WARNING that makes reference to the HTML documentation, we need to install the tool texinfo:

sudo apt-get install texinfo

OPTIONAL. To remove the WARNING that makes reference to the PDF documentation, we need to install LaTeX. I propose to use texlive.

WARNING. The package texlive-full will install a lot of packages, hence it will be very time consuming.

sudo apt-get install texlive-full

Extended Configuration

The first extra step is to enable R and RStudio to talk to each other. This is done by adding the flag --enable-R-shlib. Moreover, we usually want to link R against the system BLAS libraries rather than use the internal versions shipped with R.

./configure --enable-R-shlib --with-blas --with-lapack

This will result in:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, NLS
Options enabled:           shared R library, shared BLAS, R profiling

Capabilities skipped:      JPEG, TIFF, cairo, ICU
Options not enabled:       memory profiling

Recommended packages:      yes

We can see that the section Options enabled includes shared R library.

The next step is to allow R to use external libraries like cairo or jpeglib. First, we need to install their dependencies.

To allow R to use jpeglib:

sudo apt-get install libjpeg9-dev

To allow R to use cairo:

sudo apt-get install libcairo2-dev libxt-dev

Then we configure R again:

./configure --enable-R-shlib --with-blas --with-lapack --with-cairo --with-jpeglib --with-readline --enable-R-profiling --enable-memory-profiling

The line Additional capabilities has changed:

R is now configured for x86_64-pc-linux-gnu

Source directory:          .
Installation directory:    /usr/local

C compiler:                gcc  -g -O2
Fortran 77 compiler:       f95  -g -O2

C++ compiler:              g++  -g -O2
C++11 compiler:            g++  -std=c++11 -g -O2
Fortran 90/95 compiler:    gfortran -g -O2
Obj-C compiler:

Interfaces supported:      X11
External libraries:        readline, curl
Additional capabilities:   PNG, JPEG, NLS, cairo
Options enabled:           shared R library, shared BLAS, R profiling, memory profiling

Capabilities skipped:      TIFF, ICU
Options not enabled:

Recommended packages:      yes

Now, with the argument --prefix, we set where the new version of R will be installed:

./configure --prefix=/home/carleshf/Software/R-3.3.3 --enable-R-shlib --with-blas --with-lapack --with-cairo --with-jpeglib --with-readline --enable-R-profiling --enable-memory-profiling

Compiling R

Once the configuration is done, R needs to be compiled. For this operation we will use the tool make.

From the R sources folder we run:

make

This command will generate lots of text. Make sure the following two fragments are free of errors.

configuring Java ...
Java interpreter : /usr/bin/java
Java version     : 9-internal
Java home path   : /usr/lib/jvm/java-9-openjdk-amd64
Java compiler    : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar

[...]

JAVA_HOME        : /usr/lib/jvm/java-9-openjdk-amd64
Java library path: $(JAVA_HOME)/lib/amd64/server
JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/linux
JNI linker flags : -L$(JAVA_HOME)/lib/amd64/server -ljvm
Updating Java configuration in /home/carleshf/Downloads/R-3.3.3
Done.

After the command finishes, a new bin folder containing the R binaries has been created:

carleshf@sky:~/Downloads/R-3.3.3$ ll
total 2624
drwxr-xr-x 15 carleshf carleshf    4096 abr  4 23:23 ./
drwxr-xr-x  3 carleshf carleshf    4096 abr  4 23:16 ../
drwxrwxr-x  3 carleshf carleshf    4096 abr  4 23:18 bin/

Optional. Once R is compiled, it can be checked using the same make command.

make check

Installing R

After the configuration process and the compilation, a single command is required to install R:

make install

Optionally, you might want to point the R command to the latest R build (in this case, R version 3.3.3).

sudo ln -s /home/carleshf/Software/R-3.3.3/bin/R /bin/R
sudo ln -s /home/carleshf/Software/R-3.3.3/bin/Rscript /bin/Rscript
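To verify that the symlinked version is the one being picked up (a quick check; the exact result depends on your PATH):

which R
R --version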

Optional. Now we can check if the flags used in the configuration process were properly applied. In an R session:

R> capabilities()
jpeg         png        tiff       tcltk         X11        aqua
TRUE        TRUE       FALSE       FALSE        TRUE       FALSE
http/ftp     sockets      libxml        fifo      cledit       iconv
TRUE        TRUE        TRUE        TRUE        TRUE        TRUE
NLS     profmem       cairo         ICU long.double     libcurl
TRUE        TRUE        TRUE       FALSE        TRUE        TRUE

Errors

Configuration

The following is a list of common errors and the packages required to overcome them.

bzip2

checking whether bzip2 support suffices... configure: error: bzip2 library and headers are required
sudo apt-get install libbz2-dev

liblzma

configure: error: "liblzma library and headers are required"
sudo apt-get install liblzma-dev

PCRE

checking whether PCRE support suffices... configure: error: pcre >= 8.10 library and headers are required
sudo apt-get install libpcre3-dev

libcurl

configure: error: libcurl >= 7.28.0 library and headers are required with support for https
sudo apt-get install libcurl4-openssl-dev

Post Installation

XML R package

sudo apt-get install libxml2-dev

devtools R package

ERROR: dependencies ‘httr’, ‘git2r’ are not available for package ‘devtools’
* removing ‘/home/kuragari/Software/R-3.3.3/lib/R/library/devtools’
sudo apt-get install libssl-dev

I was a fan of the package Dictionaries, but it seems that it is no longer available and that it will not be re-included. The package's web page in Package Control is here, indicating that the package was removed.

Anyway, the package is on GitHub and it can be installed from the repository. The steps follow (the equivalent shell commands are shown after the list):

  1. Download the package as a ZIP file.
  2. Open a terminal in your Downloads folder.
  3. Unzip the file with unzip Dictionaries-master.zip
  4. Rename the folder to a more suitable name with mv Dictionaries-master Dictionaries
  5. Move the unzipped content to your packages folder with mv Dictionaries ~/.config/sublime-text-3/Packages
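Put together, and assuming the ZIP ended up in ~/Downloads and Sublime Text 3 uses its default packages path, the commands would look roughly like:

cd ~/Downloads
unzip Dictionaries-master.zip
mv Dictionaries-master Dictionaries
mv Dictionaries ~/.config/sublime-text-3/Packages/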

The steps described here are also in the package’s repository (at the end).

SRA stores all the sequencing data from GEO experiments in files in .sra format. These files are managed using the SRA Toolkit.

I recently downloaded some .sra files from GEO corresponding to paired-end sequencing data. To my surprise, when I ran the fastq-dump utility (from the SRA Toolkit) I got only one file rather than two.

From the tool's documentation, it seems that the option --split-files should be enough, but it is not. We need to add the --split-3 option. If we run fastq-dump with this configuration on a single-end experiment, a single .fastq file will be created; otherwise, two files with the suffixes _1 and _2 will be the matched paired read files (.fastq), while a possible third file (no suffix) will contain the unmatched reads.

I currently run fastq-dump as:

fastq-dump --split-files --split-3 SRR1813404.sra -O SRR1813404

This last week I used lots of Microsoft Excel files that I needed in my R scripts. Hence I discovered the excellent and complete package XLConnect. But, for easy and fast work, it is not the best solution.

So I coded a wrapper around XLConnect, calling it loadxls, that allows reading and writing Excel files in an easy way.

loadxls

The R package loadxls can be installed from its GitHub repository with:

devtools::install_github("carleshf/loadxls")

It implements only 4 functions: read_all, read_sheet, write_all and write_sheet.

Functions

read_all

read_all(filename, environment = parent.frame(), verbose = TRUE)

This function reads a given Excel file and loads each sheet as a data.frame in the current environment. The created objects will take the names of the sheets.

read_sheet

read_sheet(filename, sheetname, varname, environment = parent.frame(), verbose = TRUE)

This function loads only a given sheet instead of the full content of the file. If the argument varname is supplied, the object loaded from the sheet will take that name.

write_all

write_all(..., filename, verbose = TRUE)

This function writes all the objects passed through ... to a new Excel file, saving each object as a new sheet.

write_sheet

write_sheet(data, sheetname, filename, replace = FALSE, verbose = TRUE)

This function saves a single object to an Excel file, giving the name of the sheet where it will be written.
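A minimal usage sketch (the file data.xlsx and the sheet names cases and controls are made up for illustration):

library(loadxls)

## Load every sheet of data.xlsx as a data.frame named after its sheet,
## here creating the objects cases and controls
read_all("data.xlsx")

## Load only the sheet "cases", naming the resulting object my_cases
read_sheet("data.xlsx", sheetname = "cases", varname = "my_cases")

## Write both data.frames to a new workbook, one sheet per object
write_all(cases, controls, filename = "data_copy.xlsx")

## Save a single object as one sheet of that workbook, replacing the sheet if it exists
write_sheet(my_cases, sheetname = "cases", filename = "data_copy.xlsx", replace = TRUE)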

I had a file containing ~1M SNPs identified by their rsid (and their position). I needed to complete the information with their chromosome.

Since the list is really large, I used scan combined with a bash command (to get the number of lines of the file).

I found this solution:

library("SNPlocs.Hsapiens.dbSNP.20120608")

## Create connection to big file
inputName <- "input_file.gen"
outputName <- "output_file.gen"
inputCon <- file(description = inputName, open = "r")
outputCon <- file(description = outputName, open = "w")

## We need to know the number of lines of the big file
## This will work for GNU/Linux environments
command <- paste("wc -l ", inputName, " | awk '{ print $1 }'", sep="")
nLines <- as.numeric(system(command = command, intern = TRUE))
rm(command)

## Loop over the file connection until end of lines
pb <- txtProgressBar(min = 0, max = nLines, style = 3)
for(ii in 1:nLines) {
  ## Read the next line as a character vector of fields
  readLine <- scan(file = inputCon, nlines = 1, what = character(), quiet = TRUE)

  ## Look up the SNP's rsid in dbSNP and keep its chromosome
  x <- tryCatch({
    x <- rsidsToGRanges(readLine[[2]])
    as.character(x@seqnames[1])
  }, error = function(e) {
    "---"
  })

  ## Put the chromosome in the first field and write the updated line to the output
  readLine[[1]] <- x
  writeLines(paste(readLine, collapse = " "), outputCon)

  setTxtProgressBar(pb, ii)
}

close(inputCon)
close(outputCon)

This solution is really slow but it gets the SNP’s chromosome and fills it as "---" when the SNP is not found in dbSNP.

WARNING: The SNP's rsid must be located in the second column of the file. Take a look at readLine[[2]] in the for loop.


I am open to hearing about different ways of doing this.