cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
The output would look something like this for some RNA-seq data downloaded from the Galaxy RNA-seq tutorial:
99115 60567 61.1078 ACCTCAGGA 354 0.357161
This is telling you:
- The total number of reads (99,115).
- The number of unique reads (60,567).
- The frequency of unique reads as a proportion of the total (61%).
- The most abundant sequence (useful for finding adapters, linkers, etc).
- The number of times that sequence is present (354).
- The frequency of that sequence as a proportion of the total number of reads (0.35%).
If you have a handful of fastq files in a directory and you'd like to do this for each of them, you can wrap this in a for loop in bash:
for read in `ls *.fq`; do echo -n "$read "; awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}' $read; done
This does the same thing, but adds an extra field at the beginning for the file name. I haven't yet figured out how to wrap this into GNU parallel, but the for loop should do the trick for multiple files.
Check out FASTQC for more extensive quality assessment.