{"id":510,"date":"2017-04-17T12:05:58","date_gmt":"2017-04-17T12:05:58","guid":{"rendered":"http:\/\/www.aassie.net\/Pro\/?p=510"},"modified":"2017-11-14T20:37:00","modified_gmt":"2017-11-14T20:37:00","slug":"playing-with-data-607-pylori-part-ii-getting-the-data-ready","status":"publish","type":"post","link":"https:\/\/www.aassie.net\/Pro\/playing-with-data-607-pylori-part-ii-getting-the-data-ready\/","title":{"rendered":"[Playing with Data] 607 pylori part II: getting the data ready"},"content":{"rendered":"[vc_row type=&#8221;in_container&#8221; full_screen_row_position=&#8221;middle&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243; tablet_text_alignment=&#8221;default&#8221; phone_text_alignment=&#8221;default&#8221;][vc_column_text]In the <a href=\"https:\/\/www.aassie.net\/Pro\/?p=501\">previous post<\/a> we had a look at some interesting features of <em>Helicobacter pylori.<\/em> Now it is time to finally get our hands on the data. This post will present the data preparation, how to get everything ready in a comparable way for easy analyses.[\/vc_column_text][\/vc_column][\/vc_row][vc_row type=&#8221;in_container&#8221; full_screen_row_position=&#8221;middle&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; top_padding=&#8221;10&#8243; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/2&#8243; tablet_text_alignment=&#8221;default&#8221; phone_text_alignment=&#8221;default&#8221;][vc_column_text]Now that we know more or less what we are looking for, let&#8217;s have a look at the data available. The first stop is the NCBI genome <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/genome\/?term=Helicobacter+pylori\">section<\/a> for <em>Helicobacter pylori<\/em>. This page displays the very basic genome information available.<\/p>\n<ul>\n<li>The genome size is around 1,6Mb.<\/li>\n<li>The average GC content is around 38,9%.<\/li>\n<li>Most importantly, there are 683 available genomes (Damn&#8230; when I started this whole thing there were only 607. This is how quickly genomic data moves, people!)<\/li>\n<\/ul>\n<p>To make working easier, I set up a directory called &#8220;Epsilon&#8221;, which will be the root of all <del>evil<\/del> our work. I will download the genomes in the NCBI subfolder:<\/p>\n<p><code>mkdir Epsilon<br \/>\ncd Epsilon<br \/>\nmkdir NCBI<br \/>\nrsync --copy-links --recursive --include \"*.gbff.gz\" --verbose rsync:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/genbank\/bacteria\/Helicobacter_pylori\/latest_assembly_versions\/ .\/NCBI\/<\/code><\/p>\n<p>We now have all the available genomes from NCBI in multiple formats: statistics files (assembly report and stats), a Genbank annotated file (<em>gbff<\/em>), raw sequence files (<em>fna<\/em>) which sometimes includes a subfile with only the RNA or CDS sequences, only the annotations (<em>gff<\/em>), the proteins sequences only (<em>faa<\/em>), etc&#8230; A quick scan through the random folders showed some variation in file composition. For example, some genomes don&#8217;t have any annotations at all. So from this file dump, we will mainly focus on two types of files. Firstly the <em>fna<\/em> files, which are the raw unannotated data. We will re-annotate all of the different genomes giving us a complete dataset annotated in the same way (with all annotations suffering the same bias). In parallel, we will use the GenBank files to collect the metadata available in those files (see box on the side).<\/p>\n<p>To have our data tidy and organised, let&#8217;s get the <em>fna<\/em> files out of the NCBI rsync&#8217;d folder. Here I created a separate directory called &#8220;fasta&#8221;. Among the different genomic files, we are only interested in the complete set of sequences, not the CDS or RNA subsets ones (which is why they were not selected &#8211; see script below).[\/vc_column_text][\/vc_column][vc_column enable_animation=&#8221;true&#8221; animation=&#8221;fade-in-from-right&#8221; boxed=&#8221;true&#8221; centered_text=&#8221;true&#8221; column_padding=&#8221;padding-2-percent&#8221; column_padding_position=&#8221;all&#8221; background_color=&#8221;#ffcc56&#8243; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/2&#8243; tablet_text_alignment=&#8221;default&#8221; phone_text_alignment=&#8221;default&#8221;][vc_text_separator title=&#8221;The Meta data rage&#8221;][vc_column_text]\n<p style=\"text-align: center;\">When I started this project, one important point became obvious to me. Having all these metagenomes to play with is very nice, but a key piece of information is where do they come from? Having the contextual information for each data set is very important, if we want to be able to infer any kind of hypothesis or correlation from the analysis.<br \/>\nGathering this information from more than 20 years of <em>H. pylori<\/em> genome research is actually quite challenging. Tracking the origin of the strain, the condition of the host or even just the country of origin is sometimes impossible. Just as an example: To find where the original <em>H. pylori<\/em> closed genome strain came from, I had to dig through three papers to at the end have a very vague idea that the strain was sampled from either one of the original researchers or came from their laboratory&#8217;s culture collections. Who knows? (No really, Who? That would be helpful).<br \/>\nBut things are getting better, I understand that more than 10 years ago standards were different and the need for metadata wasn&#8217;t as strong as it is today. I manage to collect more than half of the basic metadata that I am using at the moment from the original Genbank files. The information I want is: the country of origin, ethnicity of the host and the disease associated with the host.<\/p>\n<p style=\"text-align: center;\">I have compiled all of this information in a publicly available list, accessible via my GitHub page <a href=\"https:\/\/github.com\/aassie\/607-pylori\">here<\/a> or the big button bellow. Please feel free to use it for your own research.<\/p>\n[\/vc_column_text][nectar_btn size=&#8221;medium&#8221; open_new_tab=&#8221;true&#8221; button_style=&#8221;see-through-2&#8243; color_override=&#8221;#ffffff&#8221; hover_color_override=&#8221;#25cee8&#8243; hover_text_color_override=&#8221;#ffffff&#8221; icon_family=&#8221;fontawesome&#8221; url=&#8221;https:\/\/github.com\/aassie\/607-pylori&#8221; text=&#8221;The Metadata&#8221; icon_fontawesome=&#8221;fa fa-github&#8221; margin_top=&#8221;10Px&#8221;][vc_column_text]\n<p style=\"text-align: center;\">Now, for any person who can help me complete this list, either by pointing out a mistake or a filling a blank send me an e-mail I am willing to send you an <em>H. pylori<\/em> postcard (on <a href=\"https:\/\/www.aassie.net\/Pro\/wp-content\/uploads\/2017\/03\/pylori-logo-small.png\">this template<\/a>) as a thank you.<\/p>\n[\/vc_column_text][\/vc_column][\/vc_row][vc_row type=&#8221;in_container&#8221; full_screen_row_position=&#8221;middle&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243; tablet_text_alignment=&#8221;default&#8221; phone_text_alignment=&#8221;default&#8221;][vc_column_text]<code>ls Pylori\/*\/*genomic.fna.gz | grep -v \"cds\" | grep -v \"rna\" &gt; Original.names.w.folder.tab<br \/>\nOname=$(&lt;Original.names.w.folder.tab)<br \/>\nfor i in $Oname; do cp $i fasta\/; done<br \/>\ncd fasta<br \/>\ngunzip -d *<\/code><\/p>\n<p>The NCBI page also mentions the presence of plasmids associated with some of the genomes. Even if we could consider them as part of the genetic potential of our <em>H. pylori, <\/em>we will only compare the genomes for this exercise. We use the following script to remove the plasmid sequences and then change the file and genome sequence names to something shorter, which will make things easier for browsing\/managing\/scripting. Finally, we will place them in their own directory.<\/p>\n<p><code>mkdir Renamed_wo_plasmid<br \/>\na=1;<br \/>\nfor i in Original\/*.fna;<br \/>\ndo old=$(printf $i);<br \/>\nnew=$(printf \"HP-%04d.wop.fna\" \"$a\");<br \/>\nnname=$(printf \"HPwop-%04d\" \"$a\");<br \/>\n#keep track of the old name and the new name in a list<br \/>\necho -e $old ' \\t ' $new &gt;&gt; Old.and.new.wo.plasmid.tab;<br \/>\n#Transform a multiline fasta file in one line per sequence<br \/>\nawk '\/^&gt;\/ {printf(\"\\n%s\\n\",$0);next; } { printf(\"%s\",$0);} END {printf(\"\\n\");}' &lt; $i &gt; Renamed_wo_plasmid\/tmp.$new; let a=a+1;<br \/>\n#Found the (P\/p)lasmid mention in the fasta header and then removes it along with the following line, then rename the sequence header with the name of the renamed file<br \/>\nsed '\/lasmid\/,+1 d' Renamed_wo_plasmid\/tmp.$new | awk -v A=$nname '\/^&gt;\/{print \"&gt;\"A\"_\" ++i; next}{print}' &gt; Renamed_wo_plasmid\/$new;<br \/>\nrm Renamed_wo_plasmid\/tmp.$new;<br \/>\ndone<\/code><\/p>\n<p>However, let&#8217;s not forget about the plasmids completely, we will have a look at them later. For now just run the following script to have only the plasmids in a specific directory.<\/p>\n<p><code>mkdir Renamed_plasmid<br \/>\na=1;<br \/>\nfor i in Original\/*.fna;<br \/>\ndo old=$(printf $i);<br \/>\nnew=$(printf \"HP-%04d.p.fna\" \"$a\");<br \/>\nnname=$(printf \"HPp-%04d\" \"$a\");<br \/>\necho -e $old ' \\t ' $new &gt;&gt; Old.and.new.plasmid.tab;<br \/>\nawk '\/^&gt;\/ {printf(\"\\n%s\\n\",$0);next; } { printf(\"%s\",$0);} END {printf(\"\\n\");}' &lt; $i &gt; Renamed_plasmid\/tmp.$new; let a=a+1;<br \/>\ngrep -A 1 \"lasmid\" Renamed_plasmid\/tmp.$new | awk -v A=$nname '\/^&gt;\/{print \"&gt;\"A\"_\" ++i; next}{print}' &gt; Renamed_plasmid\/$new;<br \/>\ngcount=$(grep -c \"&gt;\" Renamed_plasmid\/$new);<br \/>\nif [ $gcount == 0 ];<br \/>\nthen rm Renamed_plasmid\/$new;<br \/>\nfi;<br \/>\nrm Renamed_plasmid\/tmp.$new;<br \/>\ndone<\/code><\/p>\n<p>Once this is done, we can run the following commands to extract the available metadata from the GenBank files (<em>gbff<\/em>) present in the rsync&#8217;d directory. We move the <em>gbff<\/em> files out of the original directoy then use the following scrapper:<\/p>\n<p><code>mkdir Genbank.original<br \/>\nmv .\/NCBI\/*gbff\u00a0 .\/Genbank.original<br \/>\nfor i in *gbff; do printf \"%s\\t %s\\t %s\\t %s\\t %s\\t %s\\n\" \"$i\" \"$(grep \"TITLE\" $i | head -n 1)\" \"$(grep \"AUTHORS\" $i | head -n 1)\" \"$(grep \"BioProject\" $i | head -n 1)\" \"$(grep \"country\" $i | head -n 1)\" \"$(grep \"LOCUS\" $i | head -n 1)\"; done &gt; Meta.data.tab<\/code><\/p>\n<p>The Meta.data.tab tabular file can then be merged with the &#8220;Old.and.new.wo.plasmid.tab&#8221; and we have a table with the original <em>gbff<\/em> file name, the new name and some of the metadata when available. As mentioned in the box above I had a look at the literature and tried to fill in as much information about the contextual data of the genomes as I could find.<\/p>\n<p>All table operations were performed in excel and exported as a<em> .csv<\/em> file (available on my GitHub page), which we will use in later analyses.<\/p>\n<p>We have now our raw <em>fasta<\/em> and metadata files ready to be processed in the next blog post.[\/vc_column_text][\/vc_column][\/vc_row]\n","protected":false},"excerpt":{"rendered":"<p>[vc_row type=&#8221;in_container&#8221; full_screen_row_position=&#8221;middle&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243; tablet_text_alignment=&#8221;default&#8221; phone_text_alignment=&#8221;default&#8221;][vc_column_text]In the previous post we had a look at some interesting features of Helicobacter pylori. Now it is time to finally get our hands on the data. This post will present the data preparation, how to get&#8230;<\/p>\n","protected":false},"author":2,"featured_media":679,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,56],"tags":[11,25,53,50,54,51],"_links":{"self":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/510"}],"collection":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/comments?post=510"}],"version-history":[{"count":29,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/510\/revisions"}],"predecessor-version":[{"id":562,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/510\/revisions\/562"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/media\/679"}],"wp:attachment":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/media?parent=510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/categories?post=510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/tags?post=510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}