[Playing with Data] Comparing Helicobacter pylori genomes

March 5, 2017 | November 14th, 2017 | Bio-informatic, [PwD] H. pylori

With the gloom of a possible period of unemployment looming, I decided to fill my non-existent free time with a fun little project I’ve had in mind for a while, as one of my previous posts might have indicated. It is time to tackle it!

I have been working with a new family of Epsilonproteobacteria (more info here) during my PhD and part of my postdoctoral contract. Some of that study involved looking at their genomes. Once a genome is assembled, one thing to do is to look at its closest relatives and compare what both groups have in common and what they do not. To do this, I searched various databases to see which genomes of close relatives are available. There are thousands of epsilonproteobacterial genomes available, for example on IMG/ER. This may sound great for a comparison; however, a closer look quickly makes it clear that diversity among these genomes is very low: they cover only between 40 and 60 different species. The most common databases are biased by medical studies, which focus on two economically important pathogens, namely Helicobacter pylori and Campylobacter coli/jejuni. This is not ideal for me because my environmental epsilonproteobacterial comparison will not be as complete as it could be (although it is still quite full of surprises, stay tuned for that one later).

Actually… What could be done with the data available on epsilonproteobacterial pathogens? What could comparing all of these genomes highlight? How do we actually compare so many genomes in a meaningful way? I am very curious about all of these questions, so I decided to look into it!

In the following weeks, I will dive into a detailed comparison of these genomes and share my progress and frustrations one step at a time. I have decided to start by working with the Helicobacter pylori genomes. There are 607 genomes publicly available on NCBI, which is a pretty decent pile of data to process. Inspired by that number, I quickly designed the fun little logo above to go with the project.

For now, I will try to cover the following topics:

  • A small Introduction

  • Data preparation & metadata gathering

  • Evolution of Helicobacter pylori

  • Genome plasticity

This blog post will be updated as I go and will serve as a table of contents for the future articles. If you have any questions, opinions or suggestions, don’t hesitate to leave a comment below!
