[Playing with Data] 607 pylori part III: Start digging

By November 14, 2017Bio-informatic, [PwD] H. pylori

Hello Folks, it’s been a while! I thought it would be time to continue this big Helicobacter pylori genomes experiment and keep the blog going on.

We left with the previous post on preparing the data for our analyses. The main next step will be re-annotating all the genomes in order to compare everything with the same bias algorithm and database. But before this we can do a quality control step in order to check that we are actually comparing complete genomes.

Quality control

To run the check we use Checkm from the Australian center for ecogenomic. This is quite a neat little tool which gives a relatively quick summary of your metagenomic bins. This software offers multiple analysis options, I used here the “lineage_wf”. This workflow identifies marker genes in bins, infer lineage-specific marker sets for each bin, places bins in the reference genome tree and assess bins for contamination and completeness.

Below is how I ran it. For the script below, I am located in the root of my analysis folder, the “fasta” folder contains the H. pylori genomes renamed and stripped of their plasmids.

mkdir fasta/checkm
checkm lineage_wf -x fna --pplacer_threads 10 -t 10 -f checkm/Checkm.results.log ./fasta/Renamed ./fasta/checkm/

The -x option is used to tell the software to use files with the fna extension, -t determines how many CPU to use, but note that –placer_threads is need to set how many CPUs are used to build the tree. I also used the -f option to have the output saved in a log file instead of stdout.

Here is what the file looks like:

First things first, all the genomes have been predicted to be Epsilonproteobacteria. At least we are sure that we are comparing the “right” things.

Now with some little table juggling you can plot the completeness and contamination as shown below:

 

 

Well that’s good news, overall we do have a very good data set. All are complete or very close. Nevertheless we do see a little drop of quality with a few of the genomes on the right of the graph. A handful of genomes with a completeness value estimated to be below 95%.  The two worst being HP-0467 and HP-0017, the latest even having the second highest contamination level (9.81%). We also have one genome with 10.53% contamination.

From this we could remove all the dodgy genomes from the analysis and just keep the good data. But I am curious about why those are so bad. I will just flag them and carry on my analysis to see what comes out of those.

Annotation

It is time to annotate all of this. To annotate so much in a relatively easy manner I decided to use Prokka, developed by Torsten Seeman. As the website describes it well:

“Prokka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and scales well to 32 core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately submitted to Genbank/DDJB/ENA”

I used the following little script to run Prokka on the 607 H. pylori genomes:

ls fasta/Renamed/*.fna | sed 's/.fna//g' > ./New.names.tab
mkdir prokka
Nname=$(<New.names.tab)
for i in $Nname;
do prokka --cpus 10 --outdir prokka/$i/ --force --quiet --prefix $i fasta/Renamed/$i.fna && echo "$i annotation done";
done

You can leave it to run for a while, but after a couple of hours I got everything annotated and nicely packed. At the end of the run I will have 607 folders containing the following files:

#HP0001 used as example
HP-0001.wop.err   #Discrepancies log
HP-0001.wop.faa   #Amino acid sequences of the different proteins
HP-0001.wop.ffn   #Nuceotide sequences of the different proteins
HP-0001.wop.fna   #Fasta file of the annotated contigs
HP-0001.wop.fsa   #Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines
HP-0001.wop.gbk #Annotations in a genbank format
HP-0001.wop.gff   #Gff annotation file
HP-0001.wop.log
HP-0001.wop.sqn #An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
HP-0001.wop.tbl #Annotation Table
HP-0001.wop.txt #text summary of the annotations

Some of the file descriptions have been shamelessly copied from this very good tutorial.

Now we can get our hands dirty and dig into the genome analysis! But this is a story for the next post… *teaser*

Leave a Reply