I have a set of aligned RNA-Seq data from S. pombe in the form of BAM files. Several of the tools I want to use in my analysis take BED files as input rather than BAM files. The script below automates converting many BAM files to BED files on a Linux system using the Bedtools utility.
Implementation:
To convert BAM alignments to BED using Bedtools the general usage is (input.bam and output.bed are placeholder file names):
    bedtools bamtobed -i input.bam > output.bed
To process multiple BAM files automatically I wrote the Perl script below, which runs the command above on every BAM file in the hierarchy below the starting directory. Note that it will take a long time to run if there are many BAM files or many directories to search:
Script: all_bam_to_bed.pl
#!/usr/bin/perl
# Script to traverse files in folder recursively creating BED files from BAM files:
# Usage:
#   perl all_bam_to_bed.pl directory/path/
use strict;
use warnings;
use File::Basename;
use File::Find;

my $dir = shift @ARGV; # Starting directory as specified by the user
find(\&process_file, $dir);
exit;

sub process_file {
    my $file = $_;
    my ($name, $dir, $ext) = fileparse($file, "bam");
    if ($ext eq "bam") {
        my $new_ext = "bed";
        # Use nohup and & to run with no hang up and in the background:
        my $cmd = "nohup bedtools bamtobed -i $file > $name$new_ext &";
        # Make a system call using the command created dynamically above:
        system($cmd);
        # Update log file with progress:
        open my $fh, ">>", glob('~/bam_to_bed_logfile.txt')
            or die "Can't open the log file: $!";
        print $fh "$cmd \n";
        close $fh;
    }
    return;
}
Note that this script is Linux-specific but can be adapted to work on other operating systems. I used glob on the tilde (~) to send the log file to the home directory because my first attempt had a bug: it wrote a log file in every folder where it found a BAM file as it moved recursively through the hierarchy (File::Find changes into each directory it visits, so a relative log path ends up there). Also, if it finds a lot of BAM files the system may run out of memory, as it currently runs bedtools in parallel on all the BAM files detected. Removing the & from the end of $cmd makes it process the files sequentially: it will take longer but memory and CPU requirements will be reduced.
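As a middle ground between launching every conversion at once and running them strictly one after another, the number of simultaneous bedtools processes can be capped. Below is a minimal shell sketch of that idea, not a tested replacement for the Perl script. It assumes GNU find and xargs (for the -print0, -0, -r and -P options); the function name convert_all and the default of 2 jobs are illustrative choices of mine.

```shell
#!/bin/sh
# Sketch: convert every *.bam below a starting directory to BED while
# running at most a fixed number of bedtools processes at a time.
# Assumes GNU find/xargs; convert_all is a hypothetical helper name.

convert_all() {
    start_dir=$1        # directory to search recursively
    max_jobs=${2:-2}    # cap on simultaneous conversions (assumed default: 2)

    # -print0 / -0 keep file names containing spaces intact.
    # -r skips the command entirely when no BAM files are found.
    # -n 1 hands one file to each sh invocation; -P caps the parallelism.
    # Inside sh -c, "$1" is the BAM path and ${1%.bam}.bed is the BED path
    # written next to it (sample.bam -> sample.bed).
    find "$start_dir" -type f -name '*.bam' -print0 |
        xargs -0 -r -n 1 -P "$max_jobs" \
            sh -c 'bedtools bamtobed -i "$1" > "${1%.bam}.bed"' _
}

# Example call (commented out so that sourcing this file does nothing):
# convert_all directory/path/ 4
```

Calling convert_all with a job count of 1 gives the sequential behaviour described above, while a larger number trades memory and CPU for speed. Unlike the Perl script, this sketch waits for all conversions to finish before returning and does not write a log file.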