Welcome to riboraptor’s documentation!

riboraptor library is a Python library for analysis of Ribo-seq data. It assumes the reads have already been aligned to a reference and are available as BAM/SAM for input.

Current capabilities of riboraptor include:

Visualization:
  • Plot read length distribution
  • Plot metagene coverage
Utilities:
  • Export all gene coverages
  • Export metagene coverage
  • Export read length distribution
  • Calculate periodicity
$ Usage: riboraptor [OPTIONS] COMMAND [ARGS]...

  riboraptor: Tool for ribosome profiling analysis

 Options:
    --version  Show the version and exit.
    --help     Show this message and exit.

 Commands:
    bam-to-bedgraph              Convert bam to bedgraph
    bedgraph-to-bigwig           Convert bedgraph to bigwig
    export-bed-fasta             Export gene level fasta from specified bed
    export-gene-coverages        Export gene level coverage for all genes for given region
    export-metagene-coverage     Export metagene coverage for given region
    export-read-length           Calculate read length distribution
    periodicity                  Calculate periodicity
    plot-metagene                Plot metagene profile
    plot-read-length             Plot read length distribution
    uniq-bam                     Create a new bam with unique mapping reads only
    uniq-mapping-count           Count number of unique mapping reads

Contents:

Installation

Stable release

To install riboraptor, run this command in your terminal:

$ pip install riboraptor

This is the preferred method to install riboraptor, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

From sources

The sources for riboraptor can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/saketkc/riboraptor

Or download the tarball:

$ curl  -OL https://github.com/saketkc/riboraptor/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Example Workflow

We will be working with the first published Ribo-seq dataset GSE13750 from Ingolia et al. (2009) which has samples for both mRNA-seq and Ribo-seq from Yeast grown in starved and nutrient rich media.

At this point, we assume you have already completed all the steps under “Installing dependencies” section of the README.

Step 1: Downloading datasets

We will download all SRA files corresponding to GSE13750.

cd riboraptor
download_sra_data --sradb=../riboraptor-data/SRAmetadb.sqlite \
--geodb=../riboraptor-data/GEOmetadb.sqlite GSE13750

GEO IDs are automatatiicaly converted to corresponding SRP IDs. GSE13750 corresponds to SRP000637.

There are 6 experiments in total (SRX003184-SRX003191), but we will be working with only two: SRX003187`and `SRX003191 one of which is mRNA-seq while other is Ribo-seq. (We will figure out which is which later.) You can delete are SRX directories except the above two. We will be using sacCerR64 as our reference.

We will now use Snakemake to run all the downstream steps. Here is what the overall workflow looks like:

_images/dag.svg

Step 2: Copy template

cd snakemake
cp configs/SRP000637.py.sample configs/SRP000637.py

Edit the paths inside SRP000637.py to point to your RAW data, GTF and BED files. BED files for a lot of assemblies are inbuilt into riboraptor. In most cases, those should suffice. However, you stil need to provide paths to fasta, chromosome sizes and the GTF. The BED files are created from a particular version of the GTF, so if you are using your own GTF, you should ideally be using your own BED files too.

An example of a config would be:

## Path to SRP directory
RAWDATA_DIR = '/staging/as/skchoudh/SRA_datasets/SRP000637'

## Output directory (will be created if does not exist)
OUT_DIR = '/staging/as/skchoudh/riboraptor-analysis/SRP000637'

## Genome fasta location
GENOME_FASTA = '/home/cmb-06/as/skchoudh/genomes/sacCerR64/fasta/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa'

## Chromosome sizes location
CHROM_SIZES = '/home/cmb-06/as/skchoudh/genomes/sacCerR64/fasta/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.sizes'

## Path to STAR index (will be generated if does not exist)
STAR_INDEX = '/home/cmb-06/as/skchoudh/genomes/sacCerR64/star_annotated'

## GTF path
GTF = '/home/cmb-06/as/skchoudh/genomes/sacCerR64/annotation/Saccharomyces_cerevisiae.R64-1-1.91.gtf'


## Path to bed file containing Intron coordinates
INTRON_BED = '/home/cmb-panasas2/skchoudh/github_projects/riboraptor/riboraptor/annotation/sacCerR64/intron.bed'

## Path to bed file containing CDS coordinates
CDS_BED = '/home/cmb-panasas2/skchoudh/github_projects/riboraptor/riboraptor/annotation/sacCerR64/cds.bed'

## Path to bed file containing 5'UTR coordinates
UTR5_BED = '/home/cmb-panasas2/skchoudh/github_projects/riboraptor/riboraptor/annotation/sacCerR64/utr5.bed'

## Path to bed file containing 3'UTR coordinates
UTR3_BED = '/home/cmb-panasas2/skchoudh/github_projects/riboraptor/riboraptor/annotation/sacCerR64/utr3.bed'

Step 3 : Change your miniconda path

Edit Line4 snakemake/jobscript.sh pointing to your conda root directory.

An example path would be:
export PATH="/home/cmb-panasas2/wenzhenl/miniconda3/bin:$PATH"

Step 4: Edit snakemake/cluster.yaml

Edit Line6 snakemake/cluster.yaml and Line7 snakemake/cluster.yaml to point to your log directory error log file.

An example path would be:
logout: '/home/cmb-06/as/skchoudh/logs/{rule}.{wildcards}.out'
logerror: '/home/cmb-06/as/skchoudh/logs/{rule}.{wildcards}.err'

You would want to just edit the directory path leading to /home/cmb-06/as/skchoudh/logs/ and leave the rest as it is.

Step 5: Submit job

bash submitall.sh SRP000637

The submitall.sh looks for a file named SRP000637.py in the configs directory, so make sure SRP000637.py exists inside configs/ directory.

Visualizing Results

When the entire pipeline as run, it will create an html file riboraptor_report.html as output. You can copy it locally to visualize metagene profiles and read length distribution for all samples.

Manual

Contents

roboraptor is a comprehensive pipeline and set of tools for analyzing Ribo-seq data. This manual explains the stages in our pipeline, how to use the analysis tools and how to modify the pipeline for your specific context.

Assumptions

Our pipeline was designed to run in a cluster computing context, with many processing nodes available, and a job submission system like PBS or SGE. Much of this analysis is computationally intensive. We assume that individual nodes will have several GB of memory available for processing.

Translatome construction

Ribo-seq experiments are always single-end sequenced. Ribosome protected fragments range from 28-32 nucleotides and hence most experiments involve 50 bp single end reads. Before mapping, we need to get rid of the adapters ligated at the 3’ end of the fragments as part of the library protocol.

Trimming Reads

We use trim_galore for trimming. It automates adapter trimming:

$ trim_galore -o <out_dir> -q <min_quality> <input.fq.gz>
-o out_dir Output directory
-q min_quality Trim low-quality ends from reads in addition to adapter removal

Mapping Reads

We use STAR to map reads, though other splice-aware aligners can also be used. The first step is to create an index, preferably using a GTF file. If the index step is run without a GTF file (which is optional), STAR will not be splice-aware.

Creating Index
$ STAR --runThreadN <threads>\
      --runMode genomeGenerate\
      --genomeDir <index_out_dir>\
      --genomeSAindexNbases <SA_INDEX_Nbases>\
      --genomeFastaFiles <input.fasta>\
      --sjdbGTFfile <input.gtf>
--runThreadN Number of threads to use
--runMode Flag to set for index mode
--genomeDir Directory to write index files to
--genomeSAindexNbases
 min(14, log2(GenomeLength)/2 - 1), this must be scaled down for small genomes
--genomeFastaFiles
 Path to reference fasta
--sjdbGTFfile Path to GTF file
Mapping

Often Ribo-seq experiments have a matching RNA-seq library. Ideally, the RNA-seq library should be as similar to Ribo-seq library and hence will often be single-ended. We recommend both RNA-seq and Ribo-seq samples be mapped using the following parameters:

$ STAR --runThreadN <threads>\
      --genomeDir <input.index>\
      --outFilterMismatchNmax 2\
      --alignIntronMin <ALIGN_INTRON_Nmin>\
      --alignIntronMax <ALIGN_INTRON_Nmax>\
      --outFileNamePrefix <params.prefix> --readFilesIn <input.R1>\
      --outSAMtype BAM Unsorted\
      --readFilesCommand zcat\
      --quantMode TranscriptomeSAM\
      --outTmpDir /tmp/<params.name>_tmp\
      --outReadsUnmapped Fastx\
--runThreadN Number of threads to use
--genomeDir Path to index directory
--outFilterMismatchNmax
 Allow a maximum of mismatches=2
--alignIntronMin
 Minimum intron size. Any genomic gap is considered intron if its length >= alignIntronMin. (Default = 20)
--alignIntronMax
 Maximum intron size (Default = 1000000)
--outFileNamePrefix
 Prefix for output files
--readFilesIn Path to input fastq.gz
--outSAMtype Output an unsorted BAM file (outtype=BAM Unsorted)
--readFilesCommand
 cat/zcat depending on input is fq/fq.gz
--quantMode Also output BAM aligned to the transcriptome
--outTmpDir Directory to use for writing temporary files
--outReadsUnmapped
 Write unmapped reads to separate fastq file
Sorting and Indexing

STAR outputted BAM files are not sorted. We need a BAM file sorted by coordinates.

$ samtools sort <prefix>Aligned.out.bam -o <output.bam> -T <tmpdir>_sort &&\
$ samtools index <prefix>Aligned.out.bam

Additionaly, we also need BAM file sorted by name, since htseq-counts (and featureCounts) prefer a BAM sorted by name in their default mode.

$ samtools sort -on <input.bam> -T <tmpdir> -o <output.bam> &&\
$ samtools index <output.bam>

Translatome analysis

Once we have the bams, we are ready for downstream analysis. The downstream step often involves a number of steps. The following list summarises these steps along with their recommended values (wherever applicable):

  • Quality Control
    • Number of uniquely mapped reads : >=5M
    • Periodicity : TODO
    • Ratio of CDS/(5’UTR+3’UTR) : >1 after length normalization
    • Fragment length distribution : Peak around 28-32 nt
  • Metagene analysis
    • P-site offsets : Around 12-14 nt upstream of the start codon when counting based on 5’end

Counting uniquely mapped reads

The first step is to simply caculate the number of uniquely mapped reads. We recommend a minimum of 5 million reads for any downstream analysis. TODO: list different recommendation for different species

$ riboraptor uniq-mapping-count --bam <input.bam>

–bam input.bam Path to bam file

Read length distribution

An ideal Ribo-seq library is expected to have 28-32 nt long fragments most enriched. We can calculate enrichment and plot the fragment size distribution using riboraptor.

Readd length distribution can be calculated using the read-length-dist subcommand:

$ riboraptor read-length-dist --bam <input.bam>

This will print out the read length and associated counts on the console. In order to visualize thhese counts as a barplot, we can use the plot-read-dist subcommand:

$ riboraptor read-length-dist --bam <input.bam>\
     | riboraptor plot-read-dist --saveto <output.png>

Metagene Analysis

A metagene plot is used as a summary statistic to visualize the distribution of ribosome protected fragments along the positions of a gene often starting (ending) at the start (stop) codon. This is useful for estimating P-site offsets. The ribosome subunuits are known to protect 28-32 nt and hence the P-site is often located 12 nt downstream the 5’ position of the mapped read.

Creating bigWig file

To perform metagene analysis, we will work with bigWig format. in order to do that, we need an intermediate bedGraph file. This can be done using bam-to-bedgraph subcommand:

$ riboraptor bam-to-bedgraph --bam <input.bam>

This will print the bedGraph to console. this cna be piped to bedgraph-to-bigwig subcommand:

$ riboraptor bam-to-bedgraph --bam <input.bam> \
     | riboraptor bedgraph-to-bigwig --sizes <genome.sizes> --saveto <output.bw>

We now have <output.bw> ready for further downstream analysis.

Distribution in 5’UTR/3’UTR/CDS regions

TODO (See Example)

Metagene plot

TODO (See Example)

Example

We will use two samples from GSE94454 , one RNA-seq sample (SRR5227310) and one Ribo-seq sample (SRR5227306) as examples for examples that follow.

$ riboraptor uniq-mapping-count --bam data/SRR5227310.bam
28637667

This is a pretty deep library.

$ riboraptor read-length-dist --bam data/SRR5227310.bam\
     | riboraptor plot-read-dist --saveto SRR5227310.png
Fragment length distribution SRR5227310

Fragment length distribution for SRR5227310

How enriched is it in 27-32 nt fragment range?

$ riboraptor read-length-dist --bam data/SRR5227310.bam\
     | riboraptor read-enrichment
 (Enrichment: 1.52768004237, pval: 0.458943823895)

So the fragment length distribution doesn’t seem to be enriched. We next perform metagene analysis. Ribo-seq data is expected to have an inherent periodicity of 3, since ribosomes move one codon at a time during active translation.

$ riboraptor bedgraph-to-bigwig -bg data/SRR5227310.bg -s hg38 -o data/SRR5227310.bw
$  riboraptor metagene-coverage -bw data/SRR5227310.bw \
   --region_bed hg38_cds --max-positions 500 \
   --prefix data/SRR5227310.metagene --offset 60 --ignore_tx_version
$ riboraptor plot-read-counts \
    --counts data/SRR5227310.metagene_metagene_normalized.pickle\
    --saveto data/SRR5227310.metagene.png
Metagene distribution for SRR5227310

Metagene distribution for SRR5227310

Since metagene gives a summary statistic, we can also look at the abolute counts distribution per frame:

$ riboraptor plot-framewise-counts --counts data/SRR5227310.metagene_metagene_raw.pickle\
     --saveto data/SRR5227310.framewise.png
Fragment length distribution SRR5227310

Framewise distribution for SRR5227310

Let’s try another sample: SRR5227306 and compare it with SRR5227310 with respect to distribution of reads.

$ riboraptor uniq-mapping-count --bam data/SRR5227306.bam
10658208
$ riboraptor read-length-dist --bam data/SRR5227306.bam | riboraptor plot-read-dist --saveto SRR5227306.png
Fragment length distribution SRR5227306

Fragment length distribution for SRR5227306

$ riboraptor read-length-dist --bam data/SRR5227306.bam | riboraptor read-enrichment
(Enrichment: 14.0292145986, pval: 0.135220082438)

As compared to SRR5227310, the enrichment in this case is almost 10 times higher.

$ riboraptor plot-framewise-counts --counts data/SRR5227306.metagene_metagene_raw.pickle\
     --saveto data/SRR5227306.framewise.png
Fragment length distribution SRR5227306

Framewise distribution for SRR5227306

We can see the framewise distribution of reads in SRR5227310 is more or less uniform, but not so in SRR5227306.

$ riboraptor bedgraph-to-bigwig -bg data/SRR5227306.bg -s hg38 -o data/SRR5227306.bw
$  riboraptor metagene-coverage -bw data/SRR5227306.bw \
   --region_bed hg38_cds --max-positions 500 \
   --prefix data/SRR5227306.metagene --offset 60 --ignore_tx_version
$ riboraptor plot-read-counts \
    --counts data/SRR5227306.metagene_metagene_normalized.pickle\
    --saveto data/SRR5227306.metagene.png
Metagene distribution for SRR5227306

Metagene distribution for SRR5227306

The metagene of a Ribo-seq sample will show periodicity as in the case of SRR5227306 sample. On the other hand a RNA-seq sample like SRR5227310 will tend to have a flat profile.

Distribution of 5’UTR/CDS/3’UTR counts

TODO

API Usage

from riboraptor import read_length_distribution
from riboraptor import fragment_enrichment
from riboraptor.plotting import  plot_read_counts
from riboraptor.plotting import plot_fragment_dist

Counting Fragment Lengths

ribo_bam_f = '../data/U251_ribo.bam'
fragment_lengths = read_length_distribution(ribo_bam_f)
ax, fig = plot_fragment_dist(fragment_lengths)
fragment length distribution

Fragment length distribution of an Ideal Ribo-seq library

The abosve is a good example of an ideal Ribo-seq library where the fragment lengths are concentrated between 28 and 31 nucelotides.

We can also “quantify” this enrichment:

ribo_enrichment = fragment_enrichment(ribo_bam_f)
print(ribo_enrichment)
>>> (75.19322097821424, 0.045749574004642399)

Let’s compare this with a RNA-Seq sample from the same sample.

rna_bam_f = '../data/U251_rna.bam'
fragment_lengths = read_length_distribution(rna_bam_f)
ax, fig = plot_fragment_dist(fragment_lengths)
fragment length distribution

Fragment length distribution of a paired-end RNA-seq library

rna_enrichment = fragment_enrichment(rnabam_f)
print(rna_enrichment)
>>> (2.334333862251336e-05, 1.0)

Gene Coverage

In order to visualize genewise coverage, we need to create a bigwig file first. This inturn requires a bedgraph file.

Creating bedgraph

from riboraptor import create_bedgraph
create_bedgraph(ribo_bam_f, strand='both', end_type='5prime', outfile='../data/U251_ribo.bg')

Creating bigwig

from riboraptor import bedgraph_to_bigwig
bedgraph_f = '../data/U251_ribo.bg'
chrom_sizes = '../data/hg38.sizes'
bedgraph_to_bigwig(bedgraph_f, chrom_sizes, '../data/U251_ribo.bw')

Gene coverage plot

from riboraptor import gene_coverage
cds_bed = '../data/hg38.cds.bed'
bw = '../data/U251_ribo.bw'
coverage, _, _, _ = gene_coverage('ENSG00000080824', cds_bed, bw, 60)

The last argument 60 here specifies the number of upstream bases to count. We visualize only the first 100 bases:

ax, fig, peak = plot_read_counts(coverage[range(-60,100)],
                                 majorticks=10,
                                 minorticks=5,
                                 marker='o',
                                 millify_labels=False)
Gene coverage

Gene coverage across ENSG00000080824

Periodicity Index

TODO

5’UTR/CDS/3’UTR coverage

TODO

Scores

http://latex.codecogs.com/svg.latex?%5Cinline%20%5Cfrac%7B%5Cbig%28%5Cfrac%7BCounts_%7BCDS%7D%7D%7BCounts_%7BUTR%7D%7D%5Cbig%29_%7BRibo%7D%7D%7B%5Cbig%28%5Cfrac%7BCounts_%7BCDS%7D%7D%7BCounts_%7BUTR%7D%7D%5Cbig%29_%7BRNA%7D%7D

A score to determine if protein translation is complete. Defined as the ratio between reads in coding region to reads in the 3’UTR region normalized by the corresponding ratio in mRNA data allowing for incorrect 3’UTR annotations (CDS to 3’UTR ratio in mRNA is not 1 and is reflective of different protein producing potential)

  • ORFScore [Bazzini2014] : Compares count of number of RPFs in each frame to a uniform distribution using Chi-Squared statistic to identify actively translated ORFs.
https://latex.codecogs.com/svg.latex?%5Clog_2%5Cbig%281%20&plus;%20%5Csum_%7Bi%3D1%7D%5E3%20%5Cfrac%7B%28F_i-%5Cbar%7BF%7D%29%5E2%7D%7BF%7D%20%5Cbig%29%20%5Ctimes%20%5Cbegin%7Bcases%7D%20-1%20%26%20%28F_1%20%3C%20F2%29%20%5Ccup%20%28F_1%20%3C%20F_3%29%2C%5C%5C%201%20%26%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D

where $F_i$ represents number of reads in frame $i$ and $bar{F}$ represents $mean(F1,F2,F3)$

https://latex.codecogs.com/svg.latex?0.5%20%5Ctimes%20%5Csum_%7Bl%3D26%7D%5E%7Bl%3D34%7D%20f%28l%29%20-%20f_%7Bref%7D%28l%29

where the $f_{ref}$ is contructed by counting the number of fragments of a particular read length over only annotated protein-coding genes. Cutiff is determined by identifying outliers using Tukey’s method.

The FLOSS cutoff, calculated as a function of the total number of reads in the transcript histogram, was established by considering a rolling window of individual annotated genes and the computing the upper extreme outlier cutoff for each window using Tukey’s method (Q3 + 3*IQR, where Q3 is the 3rd quartile and IQR is the interquartile range).

riboraptor

riboraptor package

Submodules

riboraptor.cli module

riboraptor.coherence module

riboraptor.coherence.get_periodicity(values, input_is_stream=False)[source]

Calculate periodicty wrt 1-0-0 signal.

Parameters:
values : array like

List of values

Returns:
periodicity : float

Periodicity calculated as cross correlation between input and idea 1-0-0 signal

riboraptor.coherence.naive_periodicity(values, identify_peak=False)[source]

Calculate periodicity in a naive manner

Take ratio of frame1 over avg(frame2+frame3) counts. By default the first value is treated as the first frame as well

Parameters:
values : Series

Metagene profile

Returns:
periodicity : float

Periodicity

riboraptor.count module

Utilities for read counting operations.

riboraptor.count.bam_to_bedgraph(bam, strand=u'both', end_type=u'5prime', saveto=None)[source]

Create bigwig from bam.

Parameters:
bam : str

Path to bam file

strand : str, optional

Use reads mapping to ‘+/-/both’ strands

end_type : str

Use only end_type=5prime(5’) or “3prime(3’)”

saveto : str, optional

Path to write bedgraph

Returns:
genome_cov : str

Bedgraph output

riboraptor.count.bedgraph_to_bigwig(bedgraph, sizes, saveto, input_is_stream=False)[source]

Convert bedgraph to bigwig.

Parameters:
bedgraph : str

Path to bedgraph file

sizes : str

Path to genome chromosome sizes file or genome name

saveto : str

Path to write bigwig file

input_is_stream : bool

True if input is sent through stdin

riboraptor.count.collapse_gene_coverage_to_metagene(gene_coverages, target_length, outfile=None)[source]

Collapse gene coverages to specific target length.

Parameters:
gene_coverages : string

Path to gene coverages.tsv

target_lenght : int

Collapse to target length

Returns:
collapsed_gene_coverage : Series like

Collapsed version

riboraptor.count.count_feature_genewise(feature_bed, bam, force_strandedness=False, use_multiprocessing=False)[source]

Count features genewise.

Parameters:
bam : str

Path to bam file

feature_bed : str

Path to features bed file

Returns:
counts : dict

Genewise feature counts

riboraptor.count.count_reads_bed(bam, region_bed_f, saveto)[source]

Count number of reads following in each region.

Parameters:
bam : str

Path to bam file (unique mapping only)

region_bed_f : pybedtools.BedTool or str

Genomic regions to get distance from

prefix : str

Prefix to output pickle files

Returns:
counts_by_region : Series

Series with counts indexed by gene id

region_lengths : Series

Series with gene lengths

counts_normalized_by_length : Series

Series with normalized counts

riboraptor.count.count_reads_in_features(feature_bed, bam, force_strandedness=False, use_multiprocessing=False)[source]

Count reads overlapping features.

Parameters:
feature_bed : str

Path to features bed file

bam : str

Path to bam file

force_strandedness : bool

Should count feature only if on the same strand

use_multiprocessing : bool

True if multiprocessing mode

Returns
——-
counts : int

Number of intersection between bam and bed

riboraptor.count.count_reads_per_gene(bw, bed, prefix=None, n_cores=16, collapse_intervals=True)[source]

Count number of reads following in each region.

Parameters:
bw : str

Path to bigWig file

bed : pybedtools.BedTool or str

Genomic regions to get distance from

prefix : str

Prefix to output pickle files

n_cores : int

Use multiple cores (Default: 16). Set to 1 to disable multiprocessing

collapse_intervals : bool

Should the intervals be collapsed based on the ‘name’ column in gene This should be set to False for things like tRNA where the tRNA can span multiple chromosomes

Returns:
counts_by_region : Series

Series with counts indexed by gene id

region_lengths : Series

Series with gene lengths

counts_normalized_by_length : Series

Series with normalized counts

riboraptor.count.count_utr5_utr3_cds(bam, utr5_bed=None, cds_bed=None, utr3_bed=None, genome=None, force_strandedness=False, genewise=False, saveto=None, use_multiprocessing=False)[source]

One shot counts over UTR5/UTR3/CDS.

Parameters:
bam : str

Path to bam file

utr5_bed : str

Path to 5’UTR feature bed file

utr3_bed : str

Path to 3’UTR feature bed file

cds_bed : str

Path to CDS feature bed file

saveto : str, optional

Path to output file

use_multiprocessing : bool

SHould use multiprocessing? Not been well tested if it really helps

Returns:
counts : dict

Dict with keys as feature type and counts as values

riboraptor.count.diff_region_enrichment(numerator, denominator, prefix)[source]

Calculate enrichment of counts of one region over another.

Parameters:
numerator : str

Path to pickle file

denominator : str

Path to pickle file

prefix : str

Prefix to save pickles to

Returns:
enrichment : series
riboraptor.count.export_gene_coverages(bigwig, region_bed_f, saveto, offset_5p=60, offset_3p=0, ignore_tx_version=True)[source]

Export all gene coverages.

Parameters:
bigwig : str

Path to bigwig file

region_bed_f : str

Path to region bed file (CDS/3’UTR/5’UTR) with bed name column as gene or a genome name (hg38_utr5, hg38_cds, hg38_utr3)

saveto : str

Path to write output tsv file

offset_5p : int

number of bases to count upstream (5’)

offset_30 : int

number of bases to count downstream (3’)

ignore_tx_version : bool

Should versions be ignored for gene names

Returns:
gene_profiles: file

with the following format: gene1 5poffset1 3poffset1 length1 mean1 median1 stdev1 cnt1_1 cnt1_2 cnt1_3 …

gene2 5poffset2 3poffset2 length2 mean2 median2 stdev2 cnt2_1 cnt2_2 cnt2_3 cnt2_4 …

riboraptor.count.export_metagene_coverage(bigwig, region_bed_f, max_positions=None, saveto=None, offset_5p=60, offset_3p=0, ignore_tx_version=True)[source]

Calculate metagene coverage.

Parameters:
bigwig : str

Path to bigwig file

region_bed_f : str

Path to region bed file (CDS/3’UTR/5’UTR) or a genome name (hg38_utr5, hg38_cds, hg38_utr3)

max_positions: int

Number of positions to consider while calculating the normalized coverage Higher values lead to slower implementation

saveto : str

Path to write output tsv file

offset_5p : int

Number of bases to offset upstream(5’)

offset_3p : int

Number of bases to offset downstream(3’)

ignore_tx_version : bool

Should versions be ignored for gene names

Returns:
metagene_profile : series

Metagene profile

riboraptor.count.extract_uniq_mapping_reads(inbam, outbam)[source]

Extract only uniquely mapping reads from a bam.

Parameters:
inbam : string

Path to input bam file

outbam : string

Path to write unique reads bam to

riboraptor.count.gene_coverage(gene_name, bed, bw, gene_group=None, offset_5p=0, offset_3p=0, collapse_intervals=True)[source]

Get gene coverage.

Parameters:
gene_name : str

Gene name

bed : str

Path to CDS or 5’UTR or 3’UTR bed

bw : str

Path to bigwig to fetch the scores from

offset_5p : int (positive)

Number of bases to count upstream (5’)

offset_3p : int (positive)

Number of bases to count downstream (3’)

collapse_intervals : bool

Should bed be collapsed based on gene name

Returns:
coverage_combined : series

Series with index as position and value as coverage

intervals_for_fasta_read : list

List of tuples

index_to_genomic_pos_map : series
gene_offset : int

Gene wise offsets

riboraptor.count.gene_coverage_sum(gene_name, bed, bw, collapse_intervals=True)[source]

Keep track of only the sum

Parameters:
gene_name : str

Name of gene

bed : str

Path to bed file

bw : str

Path to bigwig file

collapse_intervals : bool

Should the intervals be collapsed based on the ‘name’ column in gene This should be set to False for things like tRNA where the tRNA can span multiple chromosomes

riboraptor.count.get_fasta_sequence(fasta, intervals)[source]

Extract fasta sequence given a list of intervals.

Parameters:
fasta : str

Path to fasta file

intervals : list(tuple)

A list of tuple in the form [(chrom, start, stop, strand)]

Returns:
seq : list

List of sequences at intervals

riboraptor.count.get_region_sizes(bed)[source]

Get collapsed lengths of gene in bed.

Parameters:
bed : str

Path to bed file

Returns:
region_sizes : dict

Region sies with gene names as key and value as size of this named region

riboraptor.count.htseq_to_cpm(htseq_f, saveto=None)[source]

Convert HTSeq counts to CPM.

Parameters:
htseq_f : str

Path to HTseq counts file

saveto : str, optional

Path to output file

Returns:
cpm : dataframe

CPM

riboraptor.count.htseq_to_tpm(htseq_f, cds_bed_f, saveto=None)[source]

Convert HTSeq counts to TPM.

Parameters:
htseq_f : str

Path to HTseq counts file

region_sizes : dict

Dict with keys as gene and values as length (CDS/Exon) of that gene

saveto : str, optional

Path to output file

Returns:
tpm : dataframe

TPM

riboraptor.count.interval_coverage(bw, intervals)[source]

Get coverage at custom intervals

Parameters:
bw : str

Path to bigwig file

intervals : list of tuples

[(chrom, start, stop, strand)]

Returns:
coverage : list of series

Coverage for each interval, so that it is sorted oritentation wise

riboraptor.count.mapping_reads_summary(bam, prefix)[source]

Count number of mapped reads.

Parameters:
bam : str

Path to bam file

prefix : str

Prefix to save pickle to (optional)

Returns:
counts : counter

Counter with keys as number of times read maps and values as number of reads of that type

riboraptor.count.pickle_bed_file(bed, collapse_intervals=True)[source]

Create a lookup pickle file for genewise CDS/UTR coordinates.

In order to prevent recalculating the coordinates that should be fetched for each genes’ CDS or UTR regions, they can be stored in a pickle file.

Parameters:
bed : string

Path to bed file

collapse_intervals : bool

Should the intervals be collapsed based on the ‘name’ column in gene This should be set to False for things like tRNA where the tRNA can span multiple chromosomes

riboraptor.count.read_enrichment(read_lengths, enrichment_range=[28, 29, 30, 31, 32], input_is_stream=False, input_is_file=False)[source]

Calculate read enrichment for a certain range of lengths

Parameters:
read_lengths : Counter

A counter with read lengths and their counts

enrichment_range : range or str

Range of reads to concentrate upon (28-32 or range(28,33))

input_is_stream : bool

True if input is sent through stdin

Returns:
ratio : float

Enrichment in this range (Scae 0-1)

riboraptor.count.read_htseq(htseq_f)[source]

Read HTSeq file.

Parameters:
htseq_f : str

Path to htseq counts file

Returns:
htseq_df : dataframe

HTseq counts as in a dataframe

riboraptor.count.read_length_distribution(bam, saveto)[source]

Count read lengths.

Parameters:
bam : str

Path to bam file

saveto: str

Path to write output tsv file

Returns
——-
lengths : counter

Counter of read length and counts

riboraptor.count.unique_mapping_reads_count(bam)[source]

Count number of mapped reads.

Parameters:
bam : str

Path to bam file

Returns:
n_mapped : int

Count of mapped reads

riboraptor.download module

Utilities to download data from NCBI SRA

riboraptor.download.run_download_sra_script(download_root_location=None, ascp_key_path=None, srp_id_file=None, srp_id_list=None)[source]

Download data from SRA.

Parameters:
download_root_location : string

Path to download SRA files

ascp_key_path : string

Location for aspera private keypp

srp_id_list : list

List of SRP ids for download

srp_id_file : string

File containing list of SRP Ids, one per line

riboraptor.dtw module

riboraptor.dtw.dtw(X, Y, metric=u'euclidean', ddtw=False, ddtw_order=1)[source]
Parameters:
X : array_like

M x D matrix

Y : array_like

N x D matrix

metric : string

The distance metric to use. Can be : ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’. See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

ddtw : bool

Should use derivative DTW where the distance matrix is created using the derivate values at each point rather than the point themselves

ddtw_order : int [1,2]

First order uses one difference method Second order uses np.gradient for an approximation upto second order

Returns
——-
total_cost : float

Total (minimum) cost of warping

pointwise_cost : array_like

M x N matrix with cost at each (i, j)

accumulated_cost : array_like

M x N matrix with (minimum) cost accumulated till (i,j) having started from (0, 0)

riboraptor.dtw.get_path(D)[source]

Traceback path of minimum cost

Given accumulated cost matrix D, trace back the minimum cost path

Parameters:
D : array_like

M x N matrix as obtained from accumulated_cost using: total_cost, pointwise_cost, accumulated_cost = dtw(X, Y, metric=’euclidean’)

Returns:
traceback_x, traceback_x : array_like

M x 1 and N x 1 array containing indices of movement starting from (0, 0) going to (M-1, N-1)

riboraptor.dtw.plot_warped_timeseries(x, y, pointwise_cost, accumulated_cost, path, colormap=<Mock name='mock.pyplot.cm.Blues' id='139824605096208'>, linecolor=u'#D55E00')[source]

riboraptor.fasta module

riboraptor.fasta.complete_gene_fasta(utr5_bed_f, cds_bed_f, utr3_bed_f, fasta_f, prefix)[source]

Merge Utr5, CDS, UTR3 coordinates to get one fasta.

Parameters:
utr5_bed : str

Path to 5’UTR bed

cds_bed : str

Path to CDS bed

utr3_bed : str

Path to 3’UTR bed

riboraptor.fasta.export_all_fasta(region_bed_f, chrom_sizes, fasta, prefix, offset_5p=60, offset_3p=0, ignore_tx_version=True)[source]

Export all gene coverages.

Parameters:
region_bed_f : str

Path to region bed file (CDS/3’UTR/5’UTR) with bed name column as gene

chrom_sizes : str

Path to chrom.sizes file

prefix : str

Prefix to write output file

offset_5p : int

number of bases to count upstream (5’)

offset_30 : int

number of bases to count downstream (3’)

ignore_tx_version : bool

Should versions be ignored for gene names

riboraptor.fasta.export_fasta_from_bed(gene_name, bed, chrom_sizes, fasta_f, gene_group=None, offset_5p=0, offset_3p=0)[source]

Extract fasta genewise given coordinates in bed file

Parameters:
gene_name : str

Gene name

bed : str

Path to CDS or 5’UTR or 3’UTR bed

fasta_f : str

Path to fasta file

chrom_sizes : str

Path to chrom.sizes file

offset_5p : int (positive)

Number of bases to count upstream (5’)

offset_3p : int (positive)

Number of bases to count downstream (3’)

Returns:
gene_offset : int

Gene wise offsets

riboraptor.fasta.get_fasta_sequence(fasta_f, intervals)[source]

Extract fasta sequence given a list of intervals.

Parameters:
fasta_f : str

Path to fasta file

intervals : list(tuple)

A list of tuple in the form [(chrom, start, stop, strand)] NOTE: 1-based start and stop only!

Returns:
seq : list

List of sequences at intervals

riboraptor.genome module

riboraptor.helpers module

All functions that are not so useful, but still useful.

riboraptor.helpers.check_file_exists(filepath)[source]

Check if file exists.

Parameters:
filepath : str

Path to file

riboraptor.helpers.codon_to_anticodon(codon)[source]

Codon to anticodon.

Parameters:
codon : string

Input codon

riboraptor.helpers.collapse_bed_intervals(intervals, chromosome_lengths=None, offset_5p=0, offset_3p=0)[source]

Collapse intervals into non overlapping manner

# NOTE # TODO : This function has a subtle bug that it will be offset by 1 # position when the gene is on negative strand # So essentially if you have CDS on a negative strand # The first position should be discarded # Similary for the last position in the gene on + strand # you have an extra position in the end

Parameters:
intervals : list of tuples

Like [(‘chr1’, 310, 320, ‘+’), (‘chr1’, 321, 330, ‘+’)]

chromosome_lengths : dict

A map of each chromosome’e length Only used with offset_3p, offset_5p>0

offset_5p : int (positive)

Number of bases to count upstream (5’)

offset_3p : int (positive)

Number of bases to count downstream (3’)

Returns:
interval_combined : list of tuples

A collapsed version of interval This is useful when the annotations are overlapping. Example: chr1 310 320 gene1 + chr1 319 324 gene1 + Returns: chr1 310 324 gene1 +

intervals_for_fasta_read : list of tuples

This list can be used to directly fetch fasta from pyfaidx. NOTE: DO NOT do offset adjustments as they are already adjusted for pyfaidx format (1-end both start and end)

gene_offset_5p, gene_offset_3 : in

Gene wise offsets. This might be different from offset_5p in cases where offset_5p leads to a negative coordinate

riboraptor.helpers.create_ideal_periodic_signal(signal_length)[source]

Create ideal ribo-seq signal.

Parameters:
signal_length : int

Length of signal to create

Returns:
signal : array_like

1-0-0 signal

riboraptor.helpers.get_strandedness(filepath)[source]

Parse output of infer_experiment.py from RSeqC to get strandedness.

Parameters:
filepath : str

Path to infer_experiment.py output

Returns:
strandedness : str

reverse or forward or none

riboraptor.helpers.identify_peaks(coverage)[source]

Given coverage array, find the site of maximum density

riboraptor.helpers.list_to_ranges(list_of_int)[source]

Convert a list to a list of range object

Parameters:
list_of_int: list

List of integers to be squeezed into range

Returns:
list_of_range: list

List of range objects

riboraptor.helpers.load_pickle(filepath)[source]

Read pickled files easy in Python 2/3

riboraptor.helpers.millify(n)[source]

Convert integer to human readable format.

Parameters:
n : int
Returns:
millidx : str

Formatted integer

riboraptor.helpers.mkdir_p(path)[source]

Python version mkdir -p

Parameters:
path : str
riboraptor.helpers.pad_five_prime_or_truncate(some_list, offset_5p, target_len)[source]

Pad first the 5prime end and then the 3prime end or truncate

Parameters:
some_list : list

Input list

offset_5p : int

5’ offset

target_length : int

Final length of list

If being extended, returns list padded with NAs.
riboraptor.helpers.pad_or_truncate(some_list, target_len)[source]

Pad or truncate a list upto given target length

Parameters:
some_list : list

Input list

target_length : int

Final length of list

If being extended, returns list padded with NAs.
riboraptor.helpers.parse_star_logs(infile, outfile=None)[source]

Parse star logs into a dict

Parameters:
infile : str

Path to starlogs.final.out file

Returns:
star_info : dict

Dict with necessary records parsed

riboraptor.helpers.path_leaf(path)[source]

Get path’s tail from a filepath

riboraptor.helpers.r2(x, y)[source]

Calculate pearson correlation between two vectors.

Parameters:
x : array_like

Input

y : array_like

Input

riboraptor.helpers.round_to_nearest(x, base=5)[source]

Round to nearest base.

Parameters:
x : float

Input

Returns:
v : int

Output

riboraptor.helpers.set_xrotation(ax, degrees)[source]

Rotate labels on x-axis.

Parameters:
ax : matplotlib.Axes

Axes object

degrees : int

Rotation degrees

riboraptor.helpers.summarize_counters(samplewise_dict)[source]

Summarize gene counts for a collection of samples.

Parameters:
samplewise_dict : dict

A dictionary with key as sample name and value as another dictionary of counts for each gene

Returns:
totals : dict

A dictionary with key as sample name and value as total gene count

riboraptor.helpers.summary_stats_two_arrays_welch(old_mean_array, new_array, old_var_array=None, old_n_counter=None, carried_forward_observations=None)[source]

Average two arrays using welch’s method

Parameters:
old_mean_array : Series

Series of previous means with index as positions

old_var_array : Series

Series of previous variances with index as positions

new_array : array like

Series of new observations (Does noes Ciunts of number of positions at a certain index

Returns:
m : array like

Column wise Mean array

var : array like

Column wise variance

Consider an example: [1,2,3], [1,2,3,4], [1,2,3,4,5]
old = [1,2,3]
new = [1,2,3,4]
counter = [1,1,1]
mean = [1,2,3,4] Var =[na, na, na, na], carried_fowrad = [[1,1], [2,2], [3,3], [4]]
old = [1,2,3,4]
new = [1,2,3,4,5]
couter = [2,2,2,1]
mean = [1,2,3,4,5]
var = [0,0,0, na, na]
carried_forward = [[], [], [], [4,4], [5]]

riboraptor.normalization module

riboraptor.normalization.deseq2_normalization(list_of_profiles)[source]

Perform DESeq2 like normalization position specific scores

Parameters:
list_of_profiles: array-like

array of profiles across samples for one gene

Returns:
normalized_profiles: array-like

array of profiles across samples

riboraptor.plotting module

Plotting methods.

riboraptor.plotting.create_wavelet(data, ax)[source]
riboraptor.plotting.plot_featurewise_barplot(utr5_counts, cds_counts, utr3_counts, ax=None, saveto=None, **kwargs)[source]

Plot barplots for 5’UTR/CDS/3’UTR counts.

Parameters:
utr5_counts : int or dict

Total number of reads in 5’UTR region or alternatively a dictionary/series with genes as key and 5’UTR counts as values

cds_counts : int or dict

Total number of reads in CDs region or alternatively a dictionary/series with genes as key and CDS counts as values

utr3_counts : int or dict

Total number of reads in 3’UTR region or alternatively a dictionary/series with genes as key and 3’UTR counts as values

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_framewise_counts(counts, frames_to_plot=u'all', ax=None, title=None, millify_labels=False, position_range=None, saveto=None, ascii=False, input_is_stream=False, **kwargs)[source]

Plot framewise distribution of reads.

Parameters:
counts : Series

A series with position as index and value as counts

frames_to_plot : str or range

A comma seaprated list of frames to highlight or a range

ax : matplotlib.Axes

Default none

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_read_counts(counts, ax=None, marker=None, color=u'royalblue', title=None, label=None, millify_labels=False, identify_peak=True, saveto=None, position_range=None, ascii=False, input_is_stream=False, ylabel=u'Normalized RPF density', **kwargs)[source]

Plot RPF density aro und start/stop codons.

Parameters:
counts : Series/Counter

A series with coordinates as index and counts as values

ax : matplotlib.Axes

Axis to create object on

marker : string

‘o’/’x’

color : string

Line color

label : string

Label (useful only if plotting multiple objects on same axes)

millify_labels : bool

True if labels should be formatted to read millions/trillions etc

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.plot_read_length_dist(read_lengths, ax=None, millify_labels=True, input_is_stream=False, title=None, saveto=None, ascii=False, **kwargs)[source]

Plot read length distribution.

Parameters:
read_lengths : array_like

Array of read lengths

ax : matplotlib.Axes

Axis object

millify_labels : bool

True if labels should be formatted to read millions/trillions etc

input_is_stream : bool

True if input is sent through stdin

saveto : str

Path to save output file to (<filename>.png/<filename>.pdf)

riboraptor.plotting.setup_axis(ax, axis=u'x', majorticks=5, minorticks=1, xrotation=45, yrotation=0)[source]

Setup axes defaults

Parameters:
ax : matplotlib.Axes
axis : str

Setup ‘x’ or ‘y’ axis

majorticks : int

Length of interval between two major ticks

minorticks : int

Length of interval between two major ticks

xrotation : int

Rotate x axis labels by xrotation degrees

yrotation : int

Rotate x axis labels by xrotation degrees

riboraptor.plotting.setup_plot()[source]

Setup plotting defaults

riboraptor.statistics module

riboraptor.statistics.KDE(values)[source]

Perform Univariate Kernel Density Estimation.

Wrapper utility around statsmodels for quick KDE TODO: scikit-learn has a faster implementation (?)

Parameters:
values : array like
Returns:
support : array_like
cdf : array_like
riboraptor.statistics.KS_test(a, b)[source]

Perform KS test between a and b values

Parameters:
a, b : array-like

Input

Returns:
D : int

KS D statistic

effect_size : float

maximum difference at point of D-statistic

cdf_a, cdf_b : float

CDF of a, b

Note: By default this method does testing for alternative=lesser implying
that the test will reject H0 when the CDf of b is ‘above’ a
riboraptor.statistics.calculate_cdf(data)[source]

Calculate CDF given data points

Parameters:
data : array-like

Input values

Returns:
cdf : series

Cumulative distribution funvtion calculated at indexed points

riboraptor.statistics.series_cdf(series)[source]

Calculate cdf of series preserving the index

Parameters:
series : series like
Returns:
cdf : series

riboraptor.utils module

riboraptor.utils.determine_cell_type(sample_attribute)[source]
riboraptor.utils.get_cell_line_or_tissue(row)[source]
riboraptor.utils.get_enrichment_cds_stats(pickle_file)[source]
riboraptor.utils.get_fragment_enrichment_score(txt_file)[source]
riboraptor.utils.get_strain_type(sample_attribute)[source]
riboraptor.utils.get_tissue_type(sample_attribute)[source]
riboraptor.utils.load_tpm(path)[source]
riboraptor.utils.summary_starlogs_over_runs(directory, list_of_srr)[source]

riboraptor.wig module

class riboraptor.wig.WigReader(wig_location)[source]

Bases: object

Class for reading and querying wigfiles.

get_chromosomes

Return list of chromsome and their sizes as in the wig file.

Returns:
chroms : dict

Dictionary with {“chr”: “Length”} format

.. currentmodule:: .WigReader
.. autosummary::

.WigReader

query(intervals)[source]

Query regions for scores.

Parameters:
intervals : list(tuple)
A list of tuples with the following format:

(chr, chrStart, chrEnd, strand)

Returns:
scores : array_like

A numpy array containing scores for each tuple

.. currentmodule:: .WigReader
.. autosummary::

.WigReader

Module contents

History

0.2.0 (04-23-2018)

  • First release on PyPI.

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/saketkc/riboraptor/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.
  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

riboraptor could always use more documentation, whether as part of the official riboraptor docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/saketkc/riboraptor/issues.

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up riboraptor for local development.

  1. Fork the riboraptor repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/riboraptor.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv riboraptor
    $ cd riboraptor/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 riboraptor tests
    $ python setup.py test or py.test
    $ tox
    

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.
  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
  3. The pull request should work for Python 2.6, 2.7, 3.3, 3.4 and 3.5, and for PyPy. Check https://travis-ci.org/saketkc/riboraptor/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ py.test tests.test_riboraptor

Credits

Development Lead

Contributors

None yet. Why not be the first?