Wednesday, 15 June 2016

Shell Script To Prefix All Files in a Directory With The Directory Name

I'm currently analysing multiple RNA-seq fastq files for members of my laboratory. My pipeline for analysing this data takes all the fastq files from the specified directory and distributes each as a separate job to a node on our cluster. Quality control is performed by FastQC, each sample is aligned to the reference genome using TopHat, and a count table is produced using htseq-count. The results for each sample are output into a separate directory named after the fastq file. The issue is that the contents of each folder are:

Directory:
tophat_alignment_1511_6.fastq
Files:
       accepted_hits.bam
       accepted_hits.txt
       deletions.bed
       insertions.bed
       junctions.bed
       prep_reads.info
       unmapped.bam

with no indication of the sample they came from. This is not a problem for me, as my subsequent pipelines use the directory name when importing samples. However, as this analysis will be used by many different members of the lab long after I have left, there is potential for confusion. To remedy this I wrote the script below, which takes the directory name, for example tophat_alignment_1511_6.fastq, cuts out the sample name 1511_6 and prefixes it to each file name in that directory. I kept the script simple so that it was easy to read and test before implementation: as it loops through every sample directory renaming files, a bug has the potential to cause massive problems.

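# Prefix every file in each tophat_alignment_<sample>.fastq directory with <sample>.
ALIGNMENTS=/SEQDATA/RNASEQ/my_mutants/Alignments
for i in "$ALIGNMENTS"/tophat_alignment_*.fastq
do
    [ -d "$i" ] || continue    # skip anything that is not a directory
    # Cut the sample name out of the directory name, e.g. 1511_6
    SAMPLENAME=$(basename "$i" | sed 's/^tophat_alignment_\(.*\)\.fastq$/\1/')
    for x in "$i"/*
    do
        [ -f "$x" ] || continue    # skip subdirectories and empty directories
        FNAME=$(basename "$x")
        # Quoted arguments instead of eval, so odd file names cannot break the command
        mv -- "$x" "$i/$SAMPLENAME.$FNAME"
    done
done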

After running the script, the example directory contains:
       1511_6.accepted_hits.bam
       1511_6.accepted_hits.txt
       1511_6.deletions.bed
       1511_6.insertions.bed
       1511_6.junctions.bed
       1511_6.prep_reads.info
       1511_6.unmapped.bam
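
Because a renaming bug could do real damage, it can help to preview the commands before anything is moved. The following is only a sketch of that idea, not the script I actually ran: the function name prefix_sample_files and the DRY_RUN variable are hypothetical names chosen for illustration. It assumes the same tophat_alignment_<sample>.fastq directory layout, uses shell parameter expansion in place of sed, and skips files that already carry the prefix so that it can be re-run safely.

```shell
#!/usr/bin/env bash
# Sketch only: prefix_sample_files and DRY_RUN are hypothetical names.
# With DRY_RUN unset or set to 1, the mv commands are printed, not executed.

prefix_sample_files() {
    local root dir sample file name
    root=$1
    for dir in "$root"/tophat_alignment_*.fastq/; do
        [ -d "$dir" ] || continue            # glob matched nothing
        dir=${dir%/}                         # drop the trailing slash
        sample=${dir##*/tophat_alignment_}   # strip path and fixed prefix
        sample=${sample%.fastq}              # strip the .fastq suffix
        for file in "$dir"/*; do
            [ -f "$file" ] || continue       # skip subdirectories, empty globs
            name=${file##*/}
            case $name in
                "$sample".*) continue ;;     # already prefixed, leave alone
            esac
            if [ "${DRY_RUN:-1}" = 1 ]; then
                printf 'mv -- %s %s\n' "$file" "$dir/$sample.$name"
            else
                mv -- "$file" "$dir/$sample.$name"
            fi
        done
    done
}
```

Calling prefix_sample_files /SEQDATA/RNASEQ/my_mutants/Alignments would only print the planned renames; once the output looks right, the same call with DRY_RUN=0 in front of it would perform them.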