[Playing with Data] 607 pylori part III: Start digging

By November 14, 2017Bio-informatic, [PwD] H. pylori

Hello Folks, it’s been a while! I thought it would be time to continue this big Helicobacter pylori genomes experiment and keep the blog going on.

We left with the previous post on preparing the data for our analyses. The main point being re-annotating all the genome in order to compare everything with the same bias algorithm and database. But before this we can do a quality control step in order to check that we are actually comparing complete genomes.

Quality control

To tun the check we use Checkm from the Australian center for ecogenomic. This is quite a little neat tool which allow to have a relatively quick summary of your metagenomic bins. This software offers multiple analysis options, I used here the “lineage_wf”. This workflow identify marker genes in bins, infer lineage-specific marker sets for each bin, place bins in the reference genome tree and assess bins for contamination and completeness.

Below is how I ran it. For the script below, I am located in the root of my analysis folder, the “fasta” folder contain the H. pylori genomes renamed and stripped from their plasmids.

mkdir fasta/checkm
checkm lineage_wf -x fna --pplacer_threads 10 -t 10 -f checkm/Checkm.results.log ./fasta/Renamed ./fasta/checkm/

The -x option is used to tell the software to use file with the fna extension, -t determine how many CPU to use, but not that –placer_threads is need to set how many CPU are used to build the tree. I also used the -f option to have the output saved in a log file instead of stdout.

Here is how the file look like:

First thing first, all the genomes have been predicted to be Epsilonproteobacteria. At least we are sure that we are comparing the “right” things.

Now with some little table juggling you can plot the completeness and contamination as below:



Well that’s a good news, overall we do have a very good data set. All are, if not almost, complete. Nevertheless we do see a little drop of quality with few genomes on the right of the graph. A handful of genome with a completeness value estimated to be below 95%.  The two worst being HP-0467 and HP-0017, the latest even having the second highest contamination level (9.81%). We also have one genome with 10.53% contamination.

From this we could remove all the dodgy genomes from the analysis and just keep the good data. But I am curious of why those are so bad. I will just flag them and carry on my analysis to see what comes out of those.


It is time to annotate all of this. To annotate so much in a relatively easy manner I decided to use Prokka, devellopped by Torsten Seeman. As the website describes it well :

“Prokka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and scales well to 32 core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately submitted to Genbank/DDJB/ENA”

I used the following little script to run prokka on the 607 H. pylori genomes:

ls fasta/Renamed/*.fna | sed 's/.fna//g' > ./New.names.tab
mkdir prokka
for i in $Nname;
do prokka --cpus 10 --outdir prokka/$i/ --force --quiet --prefix $i fasta/Renamed/$i.fna && echo "$i annotation done";

You can leave to run for a while, but after a couple of hours I got everything annotated and nicely packed. At the end of the tun I will have 607 folder containing the following files:

#HP0001 used as example
HP-0001.wop.err   #Discrepancies log
HP-0001.wop.faa   #Amino acid sequences of the different proteins
HP-0001.wop.ffn   #Nuceotide sequences of the different proteins
HP-0001.wop.fna   #Fasta file of the annotated contigs
HP-0001.wop.fsa   #Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines
HP-0001.wop.gbk #Annotations in a genbank format
HP-0001.wop.gff   #Gff annotation file
HP-0001.wop.sqn #An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
HP-0001.wop.tbl #Annotation Table
HP-0001.wop.txt #text summary of the annotations

Some of the file description have been shamelessly copied from this very good tutorial.

Now we can get our hand dirty and dig into the genome analysis ! But it will be in the next post… *teaser*

Leave a Reply