Get duplicate reads from tophat2

Let’s imagine we have a directory layout like this:

analysis ( folder )
    |
    +-- sample_name_1 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |   |
    |   |   +-- align_summary.txt
    |   |   ...
    |   +-- qc
    |   ...
    +-- sample_name_2 ( folder )
    |   |
    |   +-- tophat_out ( folder )
    |   |   |
    |   |   +-- align_summary.txt
    |   |   ...
    |   +-- qc
    |   ...
    ...

We want to get the number and the percentage of duplicated reads (the reads that tophat2 reports as having multiple alignments) from the align_summary.txt file that tophat2 generates. The content of align_summary.txt looks like this:

        Reads:
          Input     :  26373797
           Mapped   :  21928071 (83.1% of input)
            of these:   5067439 (23.1%) have multiple alignments (6356 have >20) 
83.1% overall read mapping rate.

The file is plain text, so we can read it, take the 4th line, and parse that line to get 5067439 and 23.1%. That can be done as:

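from os import listdir
from os.path import join

def getDuplicateds( file_name ):
    # The 4th line of align_summary.txt looks like:
    #   of these:   5067439 (23.1%) have multiple alignments (6356 have >20)
    with open( file_name, 'r' ) as fh:
        line = fh.readlines()[ 3 ]
    # Drop the parentheses and the percent sign, then split on whitespace
    fields = line.replace( '(', '' ).replace( ')', '' ).replace( '%', '' ).split()
    return( fields[ 2:4 ] )   # [ count, percentage ], both as strings

# Assumes every entry in input_folder is a sample folder with a tophat_out subfolder
input_folder = "."
all_summary = [ join( input_folder, x, "tophat_out", "align_summary.txt" ) for x in listdir( input_folder ) ]
dupValue = [ getDuplicateds( ff ) for ff in all_summary ]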
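
Reading the 4th line by position works for the file shown above, but it breaks silently if the layout shifts. As a minimal sketch of an alternative, the line can be located by its content with a regular expression. The helper name parse_multiple_alignments and the pattern are illustrative assumptions, not part of the original script, and the sketch assumes the single-section layout shown above (if a file contained several "of these:" lines, only the first would be returned).

```python
import re

# Hypothetical helper (not part of the original script): find the
# "of these:" line by its content rather than by its position in the file.
MULTI_RE = re.compile(
    r"of these:\s*(\d+)\s*\(\s*([\d.]+)%\)\s*have multiple alignments"
)

def parse_multiple_alignments(text):
    """Return (count, percentage) parsed from the text of an
    align_summary.txt, or None when no matching line is found."""
    match = MULTI_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

# The example content quoted earlier in this post
sample = """Reads:
          Input     :  26373797
           Mapped   :  21928071 (83.1% of input)
            of these:   5067439 (23.1%) have multiple alignments (6356 have >20)
83.1% overall read mapping rate.
"""

print(parse_multiple_alignments(sample))  # (5067439, 23.1)
```

Returning numbers instead of strings also means the values can be sorted or averaged later without another conversion step.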

Moreover, we can add each sample’s name to the list, convert it into a data frame, and save it as a .tsv:

import pandas as pd

def getName( facility_name ):
    # Keep the 2nd and 4th underscore-separated fields of the folder name
    x = facility_name.replace( ".fastq", '' ).split( '_' )
    return( x[ 1 ] + "_" + x[ 3 ] )

names = [ getName( x ) for x in listdir( input_folder ) ]
# zip() returns an iterator in Python 3, so turn it into a list of rows first
dupValue = list( zip( names, [ x[ 0 ] for x in dupValue ], [ x[ 1 ] for x in dupValue ] ) )
df = pd.DataFrame( dupValue )
df.columns = [ "Sample Name", "Duplicated (#)", "Duplicated(%)" ]
df.to_csv( "duplicate_reads.tsv", sep="\t", index = False )   # tab-separated output

So, the content of the output file is:

[carleshf@lambdafunction]$ head duplicate_reads.tsv 
Sample Name     Duplicated (#)  Duplicated(%)
mRNA4_I19       1892308 8.3
mRNA4_I5        5104835 22.7
mRNA4_I6        1722464 8.7
mRNA1_I13       3970997 14.4
mRNA1_I15       9024103 32.8
mRNA1_I19       2632723 9.2
mRNA1_I5        1175850 14.9
mRNA1_I6        221270  8.3
mRNA2_I14       7177404 19.5
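
The script above calls listdir twice, once for the summary paths and once for the names, so the two lists only line up when every entry in the input folder is a sample folder. A minimal sketch of a single-pass variant follows; it keeps each name next to its values and skips entries that have no align_summary.txt. The function names read_multi_line and collect_rows are hypothetical, and the parsing assumes the same 4th-line layout as before.

```python
import os

def read_multi_line(path):
    # Same parsing idea as getDuplicateds: take the 4th line,
    # strip "(", ")" and "%", then split on whitespace.
    with open(path) as fh:
        line = fh.readlines()[3]
    fields = line.replace("(", "").replace(")", "").replace("%", "").split()
    return int(fields[2]), float(fields[3])

def collect_rows(input_folder):
    """Return one (folder name, count, percentage) tuple per sample
    folder that actually contains tophat_out/align_summary.txt."""
    rows = []
    for entry in sorted(os.listdir(input_folder)):
        summary = os.path.join(input_folder, entry, "tophat_out", "align_summary.txt")
        if not os.path.isfile(summary):
            continue  # plain files, or folders without a tophat2 run
        count, pct = read_multi_line(summary)
        rows.append((entry, count, pct))
    return rows
```

The returned list can be passed to pd.DataFrame( rows, columns = [ ... ] ) in place of the zipped lists, and getName can be applied to the first element of each tuple if the shortened sample names are wanted.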