Friday, 30 October 2015

Converting BAM files into BED Files

Purpose:
I have a set of aligned RNA-Seq data from S. pombe in the form of BAM files.  I want to use several tools in my analysis which take BED files as input, not BAM files.  The script below automates the process of converting many BAM files to BED files on a Linux system using the BEDTools utility.

Implementation:
To convert BAM alignments to BED using BEDTools, the general usage is:
bedtools bamtobed -i reads.bam > reads.bed
To process multiple BAM files automatically, I wrote the Perl script below, which calls the command above on every BAM file in the hierarchy below the starting directory.  Note that this implementation will take a long time to run if there are many BAM files or if it needs to search through many directories:

Script: all_bam_to_bed.pl

#!/usr/bin/perl

# Script to traverse files in folder recursively creating BED files from BAM files:
# Usage:
# perl all_bam_to_bed.pl directory/path/ 


use strict;
use warnings;

use File::Basename;
use File::Find;

my $dir = shift @ARGV; # Starting directory as specified by the user
die "Usage: perl all_bam_to_bed.pl directory/path/\n" unless defined $dir && -d $dir;

find(\&process_file, $dir);

exit;

sub process_file {

    my $file = $_;
    # fileparse returns the file name, its path, and the matched suffix;
    # the path variable is named $path to avoid masking $dir above:
    my ($name, $path, $ext) = fileparse($file, "bam");
    if ($ext eq "bam") {
        my $new_ext = "bed";
        # Use nohup and & to run with no hang up and in the background:
        my $cmd = "nohup bedtools bamtobed -i $file > $name$new_ext &";
        # Make a system call using the command created dynamically above:
        system($cmd);
        # Update the log file with progress:
        open my $fh, ">>", glob('~/bam_to_bed_logfile.txt')
            or die "Can't open the log file: $!";
        print $fh "$cmd\n";
        close $fh;
    }
    return;
}

Note that this script is Linux specific but can be adapted to work on other operating systems.  I used glob on the tilde (~) to send the log file to the home directory: my first attempt contained an error where it wrote a log file in every folder in which it found a BAM file as it recursively moved through the hierarchy.  Also, if it finds a lot of BAM files the system may run out of memory, as it currently runs bedtools in parallel on all the BAM files detected.  Removing the & from the end of $cmd will make it process the files sequentially - this will take longer, but the memory/CPU requirements will be reduced.

Wednesday, 28 October 2015

Advice to Students Wanting to Learn Bioinformatics: Key Skills and Learning Resources

During my PhD, several students new to computational biology have asked me for advice regarding both the skills they should be developing and the best learning resources.  This is what I told them:

There are several key computer skills you may be interested in developing:

Linux - The standard operating system used in scientific research.  An essential skill is being able to save and manipulate files on a Linux system from the command line.  Linux is also important for installing bioinformatics tools, many of which run under Linux/Unix.

R - A statistical programming language primarily used for producing figures and graphs and for statistical analysis.  It includes Bioconductor - a library of modules written for bioinformatics - for example, DESeq2 will detect differentially expressed genes from RNA-Seq data (see the sketch after this list).

Regular Expressions - A regular expression is a sequence of characters that defines a search pattern.  A common task using regex is find and replace, where you define the word to search for and the word to replace it with.  Regular expressions are extremely useful for extracting information from text such as code, log files, spreadsheets, or FASTA files.  For example, a correctly written search pattern could parse a FASTA file and extract all lines beginning with '>', returning all the sequence headers.

A scripting language - Perl or Python - used for many "data wrangling" tasks, for example manipulating data files into the correct format for a particular tool.

Vim - A text editor available on most distributions of Linux.  Many times when working on a server you will not have access to fancy text editors like those included with RStudio and Canopy, and you will need to use a basic editor like Vim, which works via numerous keyboard shortcuts and has quite a steep learning curve.
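
To give a flavour of Bioconductor, here is a minimal, hypothetical DESeq2 sketch; 'counts' (a matrix of raw read counts, genes by samples) and 'coldata' (a data frame of sample information with a 'condition' column) are assumed inputs, not anything defined in this post:

library(DESeq2)

# Build a DESeq2 dataset from an assumed count matrix and sample table:
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

dds <- DESeq(dds)              # Normalise, fit the model, and test
res <- results(dds)            # Extract the results table

head(res[order(res$padj), ])   # Top genes ranked by adjusted p-value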

Core Skills: Learning Resources

If you want to learn some computer skills in preparation for your PhD, I would recommend starting out by learning some basic Linux/Unix commands - any system you are likely to be using will be Unix based, and these skills are essential for running many bioinformatics tools.  The book "UNIX and Perl to the Rescue! A Field Guide for the Life Sciences" is a good place to start - even a small amount of familiarity with creating and manipulating files from the command line will be a big advantage.

Once you are comfortable with Linux - using the commands ls, cd, pwd, rm, cp, mkdir, chmod, head, and tail - I would recommend familiarising yourself with R.  The easiest way is to download RStudio - https://www.rstudio.com/ - which gives you a single interface where you can run your programs and view your output.  There is an R package, swirl - http://swirlstats.com/ - which will lead you through the basic concepts of using R (see the snippet below).  A good textbook once you have mastered the basics of R is "The Art of R Programming: A Tour of Statistical Software Design".
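
Installing and launching swirl takes just a few lines at the R console:

install.packages("swirl")  # Install swirl from CRAN
library(swirl)             # Load the package
swirl()                    # Start the interactive lessons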

Regular Expressions
The easiest way to pick up regular expressions is via the online tutorial http://regexone.com/ - languages like Python and R allow you to include regular expressions in your code.  There are also online regex testers which allow users to try out search patterns and are useful when first learning the grammar of regular expressions.
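
As a minimal sketch in R of the FASTA example mentioned above (the file name 'example.fasta' is a placeholder - substitute your own):

# Read the file into a character vector, one element per line:
fasta_lines <- readLines("example.fasta")

# Extract all header lines (those beginning with '>'):
headers <- grep("^>", fasta_lines, value = TRUE)

# A simple find and replace: strip the leading '>' to leave the IDs:
ids <- sub("^>", "", headers)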

Vim
Vim is a text editor available on most distributions of Linux - the best way to learn is via the tutorial http://www.openvim.com/, and then maybe by playing the game http://vim-adventures.com/.

Advanced Skills

If you have familiarised yourself with the above you will be in an excellent starting position - other skills you may consider developing are: 

Python - I would select this over Perl if learning from scratch as it is a more user-friendly language.

Get yourself an installation of the Enthought Python distribution (it's free for academic use):


Marketing Blurb from their website:

"Enthought Canopy is a comprehensive Python analysis environment that provides easy installation of the core scientific analytic and scientific Python packages, creating a robust platform you can explore, develop, and visualize on. In addition to its pre-built, tested Python distribution, Enthought Canopy has valuable tools for iterative data analysis, visualization and application development including:

  • One-Click Python Package Deployment with a Graphical Package Manager
  • Code Editor with IPython Notebook Support
  • Interactive Graphical Python Code Debugger
  • Integrated IPython Prompt
  • Convenient Documentation Browser"
The main advantage, in addition to easier package management and the Python editor, is the IPython Notebook: you write your code, annotate it, and produce the plots all in one document, exportable as a PDF file - an instant lab book of all your computational work.

The best bit if you are starting out in Python is 'Enthought Training on Demand':


These are a series of training videos (10-20 hours' worth) available through the Canopy app.  They are well ordered, so you can jump into the areas you want to apply to your project, and they are aimed specifically at scientific analysis.  Logging on is a bit awkward: fire up Canopy, select Training, and select 'Buy training' - at the bottom of the page you can request an academic licence.

Tuesday, 27 October 2015

Importing Example S. pombe Expression Data

Purpose:

While writing some of the posts for this blog I realised that it would be useful to provide a sample dataset which anyone trying to follow my code could use when running my scripts.  I had a quick look on the web and couldn't find any easily downloadable processed expression data for pombe.  The script below imports some summarised proteomics data from a recent Cell paper (Marguerat et al., 2012) which can be used to test the scripts published on this blog; it also demonstrates how versatile R is when importing data.  Here we use the package 'openxlsx' to get data from an Excel spreadsheet, composed of multiple worksheets, which is hosted on an academic website.  I prefer the 'openxlsx' package over alternatives such as 'xlsx' and 'XLConnect' as, in my experience, it is more reliable when dealing with the large spreadsheets which bioinformatics produces - the alternatives are liable to freeze.

Implementation:
This expression data is saved as an Excel table in the Supplementary Data section of the Cell website, so we need to use the package 'openxlsx' to import it into R and do some data 'wrangling' to get it into a usable format:

require(openxlsx) # Used to import the excel table
 
# URL of supplementary tables from the paper: 
expressDat_url <- "http://www.sciencedirect.com/science/MiamiMultiMediaURL/1-s2.0-S0092867412011269/1-s2.0-S0092867412011269-mmc1.xlsx/272196/html/S0092867412011269/89ccbeb370726444c25578be91383841/mmc1.xlsx"
 
# Create a temp file to hold downloaded excel file
temp <- tempfile() 
 
# Download excel file (mode = "wb" keeps the binary .xlsx intact on Windows)
download.file(expressDat_url, temp, mode = "wb") 
 
#Extract the correct worksheet and ignore top rows which contain comments
excel.file <- read.xlsx(temp, sheet = 5, startRow = 7) 
 
#Remove temp file
unlink(temp)
 
#Extract the proteomics data
expression_data <- excel.file[,c("MM.protein.cpc", "MN.protein.cpc")] 
 
#Change the column names to make them more informative and add gene IDs as row names
colnames(expression_data) <- c("Proliferating.protein.cpc", "Quiescent.protein.cpc")
row.names(expression_data) <- excel.file$Systematic.name
 
#R is importing some of the columns as character, which causes problems in
#later analysis (these columns contain "< 100" for values less than 100, which
#forces R to import the column as character) - this is corrected below; entries
#like "< 100" become NA and are dropped by na.omit() in the next step:
expression_data <- transform(expression_data, 
                   Proliferating.protein.cpc = as.numeric(Proliferating.protein.cpc),
                   Quiescent.protein.cpc = as.numeric(Quiescent.protein.cpc))
 
# Remove any rows with NAs
expression_data <- na.omit(expression_data)
 
We now have a dataframe with two columns of proteomics data (and gene IDs as row names) in the format below:


Systematic.name   Proliferating.protein.cpc   Quiescent.protein.cpc
SPAC977.10                           199.69                  1161.6
SPAC977.14c                        17920.94                17001.01
SPAC1F8.07c                        500408.2               200256.44

The first column contains the gene IDs (stored as row names), the column Proliferating.protein.cpc is the protein count per cell in proliferating cells, and the column Quiescent.protein.cpc is the protein count per cell in quiescent cells.  Note that the data from this paper gives absolute values for the number of proteins per cell.
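
As a quick sanity check (a minimal sketch - the exact numbers will depend on the rows dropped by na.omit), you can inspect the result and compare the two conditions on a log scale:

head(expression_data)  # First few rows of the dataframe
dim(expression_data)   # Number of genes (rows) and columns

# Log-log scatter of protein copies per cell in the two conditions
# (a pseudocount of 1 avoids taking log10 of zero):
plot(log10(expression_data$Proliferating.protein.cpc + 1),
     log10(expression_data$Quiescent.protein.cpc + 1),
     xlab = "log10 protein copies per cell (proliferating)",
     ylab = "log10 protein copies per cell (quiescent)",
     pch  = 20)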

References:


Marguerat, S., Schmidt, A., Codlin, S., Chen, W., Aebersold, R., & Bähler, J. (2012). Quantitative analysis of fission yeast transcriptomes and proteomes in proliferating and quiescent cells. Cell, 151(3), 671–683. doi:10.1016/j.cell.2012.09.019