I was working on a kata from CodeWars that asks you to implement a series of methods for working with arrays. This led me to discover a feature of the Ruby language that impressed me: you can extend the functionality of an existing class! In other words, you can add methods to an existing class.

To extend the Array class with two methods that return the even and odd numbers within the array, we can do:

class Array

  def even
    self.select{|x| x % 2 == 0}
  end

  def odd
    self.reject{|x| x % 2 == 0}
  end

end

And now, every Array object will have both methods:

> [1, 2, 3, 4, 5].even
=> [2, 4]
> [1, 2, 3, 4, 5].odd
=> [1, 3, 5]
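The same extension can be written more compactly by delegating to Ruby's built-in Integer#even? and Integer#odd? predicates (a sketch that assumes the array holds integers):

```ruby
# Same behavior as above, reusing the Integer#even? / Integer#odd? predicates
class Array
  def even
    select(&:even?)
  end

  def odd
    select(&:odd?)
  end
end
```

Both versions return the same result for [1, 2, 3, 4, 5].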

No programmer can spend an entire week at home without typing something. The Ruby script squarefy I'm introducing in this post is the result of one afternoon, the 8th day of my Christmas holidays.

  • The repository of the project can be found on GitHub (link).

The idea behind the script is to pixelize a picture by drawing squares on it. These squares add a Gaussian-blur effect and a tint effect to the original picture. So, given the following picture:

The squarefy script will add a layer of colored and blurred squares on it, with the aim of adding a pixelized effect to the picture:

Using the Dir and File classes we can create a function to list the contents of a given path:

    def list_files( path, filter = nil, full_names = false )
        # Ruby uses the lowercase literals true/false; skip the '.' and '..' entries
        files = Dir.entries( path ).select{ | ff | ff != '.' && ff != '..' }
        files = files.map{ | ff | File.join( path, ff ) } if full_names

        # Keep only the entries matching the filter, when one is given
        files = files.select{ | ff | ff[ filter ] } if filter

        return files
    end

Moreover, the previous function can return the full names of the files (relative to the given path) and can filter the result with a regular expression.

For example, having a folder with:

    [carleshf@lambdafunction Data]$ ll aligned/
    -rw-rw----+ 1 carleshf carleshf 2200853068 apl 21 06:06 mRNA10_AGTTCC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4778016 apl 21 06:06 mRNA10_AGTTCC_f2.bam.bai
    -rw-rw----+ 1 carleshf carleshf 1621816053 apl 21 06:06 mRNA10_CAGATC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4774552 apl 21 06:06 mRNA10_CAGATC_f2.bam.bai
    -rw-rw----+ 1 carleshf carleshf 2195712479 apl 21 06:06 mRNA10_CCGTCC_f2.bam
    -rw-rw----+ 1 carleshf carleshf    4445664 apl 21 06:06 mRNA10_CCGTCC_f2.bam.bai
    ...

We can list the .bam files with:

    list_files( 'aligned', /\.bam$/, true )

Let’s imagine we have a file-architecture like:

analysis ( folder )
    |
    +-- sample_name_1 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |    |
    |   |    +-- align_summary.txt
    |   |    ...
    |   +-- qc
    |   ...
    +-- sample_name_2 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |   |
    |   |   +-- align_summary.txt
    |   |   ...
    |   +-- qc
    |   ...
    ...

Now suppose we want to get the number and the percentage of duplicated reads from the file align_summary.txt, generated by tophat2. The file align_summary.txt contains a block like this one:

        Reads:
          Input     :  26373797
           Mapped   :  21928071 (83.1% of input)
            of these:   5067439 (23.1%) have multiple alignments (6356 have >20) 
83.1% overall read mapping rate.

So the file can be read as plain text, taking its 4th line and parsing it to extract 5067439 and 23.1%. That can be done as:

from os import listdir
from os.path import join

def getDuplicateds( file_name ):
    # The 4th line holds the duplicated reads; strip '(', ')' and '%' and
    # keep the 3rd and 4th non-empty tokens: the count and the percentage
    with open( file_name, 'r' ) as fh:
        line = fh.readlines()[ 3 ].replace( '(', '' ).replace( ')', '' ).replace( '%', '' )
        return [ x for x in line.split( ' ' ) if x != '' ][ 2:4 ]

input_folder = "."
all_summary = [ join( input_folder, x, "tophat_out", "align_summary.txt" ) for x in listdir( input_folder ) ]
dupValue = [ getDuplicateds( ff ) for ff in all_summary ]
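As a quick sanity check, the same token-cleaning logic can be applied to the example line shown above, hard-coded here for illustration:

```python
# The 4th line of the example align_summary.txt, hard-coded for illustration
line = "    of these:   5067439 (23.1%) have multiple alignments (6356 have >20)"

# Strip '(', ')' and '%' and keep only the non-empty tokens
tokens = [ x for x in line.replace( '(', '' ).replace( ')', '' ).replace( '%', '' ).split( ' ' ) if x != '' ]

print( tokens[ 2:4 ] )  # ['5067439', '23.1']
```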

Moreover, we can add each sample's name to the list, convert it into a data frame and save it as a .tsv:

import pandas as pd

def getName( facility_name ):
    # Build a short sample name from the facility's file name
    x = facility_name.replace( ".fastq", '' ).split( '_' )
    return x[ 1 ] + "_" + x[ 3 ]

names    = [ getName( x ) for x in listdir( input_folder ) ]
dupValue = list( zip( names, [ x[ 0 ] for x in dupValue ], [ x[ 1 ] for x in dupValue ] ) )
df = pd.DataFrame( dupValue )
df.columns = [ "Sample Name", "Duplicated (#)", "Duplicated(%)" ]
df.to_csv( "duplicate_reads.tsv", sep = "\t", index = False )

So, the content of the output file is:

[carleshf@lambdafunction]$ head duplicate_reads.tsv 
Sample Name     Duplicated (#)  Duplicated(%)
mRNA4_I19       1892308 8.3
mRNA4_I5        5104835 22.7
mRNA4_I6        1722464 8.7
mRNA1_I13       3970997 14.4
mRNA1_I15       9024103 32.8
mRNA1_I19       2632723 9.2
mRNA1_I5        1175850 14.9
mRNA1_I6        221270  8.3
mRNA2_I14       7177404 19.5

The required R packages are:

library( Rsamtools )
library( TxDb.Hsapiens.UCSC.hg19.knownGene )
library( ShortRead )

We start from a set of aligned BAM files. The BAM files will be loaded using:

bamfls <- BamFileList( list.files( "...", pattern = "bam$", full.names = TRUE ) )

The reference (exons grouped by gene) is obtained as:

ex   <- unlist( exonsBy( TxDb.Hsapiens.UCSC.hg19.knownGene, "gene" ) )

And the function to extract the GC content is defined as:

getGCcontent <- function( bam_path, ex ) {
    if( missing( ex ) ) {
        stop( "Needed reference as 'ex'." )
    }
    aln   <- readGappedReads( bam_path )
    olap  <- findOverlaps( aln, ex, ignore.strand = TRUE )
    hits  <- tabulate( queryHits( olap ), length( aln ) )
    keep  <- queryHits( olap ) %in% which( hits == 1 )
    reads <- qseq( aln )[ queryHits( olap )[ keep ] ]
    readGc <- rowSums( alphabetFrequency( reads, as.prob = TRUE )[ , c( "C", "G" ) ] )

    return( mean( readGc ) )
}

I usually create my R scripts inside .Rnw files that I run with cacheSweave and then convert to PDF. Since I started my PhD I have been creating a lot of these files, so I developed a script that runs a .Rnw file and creates the corresponding .pdf file in a single command, without needing to open R and run the source.

The script is the following:

# Get the arguments from the Rscript
args <- commandArgs( trailingOnly = TRUE )

# Get the sweave (original) and the tex (created with sweave) files
sweave_file <- args[ 1 ]
file_name   <- strsplit( sweave_file, "[/]" )[[ 1 ]]
tex_file    <- paste0( getwd(), "/", strsplit( file_name[ length( file_name ) ], "[.]")[[ 1 ]][ 1 ], ".tex" )

# Load cacheSweave and run the sweave file
library( cacheSweave )
Sweave( sweave_file, driver = cacheSweaveDriver )

# Convert the tex file to pdf
system( paste0( "pdflatex ", tex_file ) )
system("rm *.aux *.log *.out *.toc")

And I run it as:

Rscript ../../scripts/creaPdf.R ../../scripts/mosaicism_analysis.Rnw

The file ../../scripts/creaPdf.R is the one containing the code above. The file ../../scripts/mosaicism_analysis.Rnw is the one that will be executed (and this execution will create a .tex file) and converted to PDF.