{"id":324,"date":"2016-07-26T14:09:05","date_gmt":"2016-07-26T14:09:05","guid":{"rendered":"http:\/\/www.aassie.net\/Pro\/?p=324"},"modified":"2016-07-26T14:26:56","modified_gmt":"2016-07-26T14:26:56","slug":"fetching-information-from-ncbi-directly-from-the-command-line","status":"publish","type":"post","link":"https:\/\/www.aassie.net\/Pro\/fetching-information-from-ncbi-directly-from-the-command-line\/","title":{"rendered":"Fetching information from NCBI directly from the command line"},"content":{"rendered":"[vc_row type=&#8221;in_container&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243;][vc_column_text]I recently (re)discovered the edirect software package published on NCBI. As described on <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/news\/02-06-2014-entrez-direct-released\/\">the website<\/a> this &#8220;<em>software suite that enables users to use the UNIX command line to directly access NCBI databases, as well as to parse and format the data to create customized downloads<\/em>&#8220;.<\/p>\n<p>How does it work? Basically like the &#8220;Entrez&#8221; research system on the NCBI website but with some extra feature. For example, I have a protein accession number and I want to know to what it correspond. I will use the command as follow:<br \/>\n<code>esearch -db protein -query \"AAC43379\"<\/code><br \/>\nAnd it will output the following:<br \/>\n<code>&lt;ENTREZ_DIRECT&gt;<br \/>\n&lt;Db&gt;protein&lt;\/Db&gt;<br \/>\n&lt;WebEnv&gt;NCID_1_3752066_130.14.22.215_9001_1469541026_407945051_0MetA0_S_MegaStore_F_1&lt;\/WebEnv&gt;<br \/>\n&lt;QueryKey&gt;1&lt;\/QueryKey&gt;<br \/>\n&lt;Count&gt;1&lt;\/Count&gt;<br \/>\n&lt;Step&gt;1&lt;\/Step&gt;<br \/>\n&lt;\/ENTREZ_DIRECT&gt;<\/code><br \/>\nBut then if coupled with the efetch script like this :<br \/>\n<code>esearch -db protein -query \"AAC43379\" | efetch -format gb<\/code><br \/>\nThis will output on the terminal the genbank format of the protein as you see it on the <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/protein\/AAC43379.1\">NCBI page<\/a>.[\/vc_column_text][vc_row_inner][vc_column_inner column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243;][\/vc_column_inner][\/vc_row_inner][toggles][toggle color=&#8221;Extra-Color-1&#8243; title=&#8221;What will be printed on the terminal (Example)&#8221;][vc_column_text]\n<pre class=\"genbank\">LOCUS       AAC43379                 347 aa            linear   BCT 23-JUN-1995\r\nDEFINITION  RecA [Helicobacter pylori].\r\nACCESSION   AAC43379\r\nVERSION     AAC43379.1  GI:687249\r\nDBSOURCE    locus HPU13756 accession U13756.1\r\nKEYWORDS    .\r\nSOURCE      Helicobacter pylori\r\n  ORGANISM  Helicobacter pylori\r\n            Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales;\r\n            Helicobacteraceae; Helicobacter.\r\nREFERENCE   1  (residues 1 to 347)\r\n  AUTHORS   Thompson,S.A. and Blaser,M.J.\r\n  TITLE     Isolation of the Helicobacter pylori recA gene and involvement of\r\n            the recA region in resistance to low pH\r\n  JOURNAL   Infect. Immun. 63 (6), 2185-2193 (1995)\r\n   PUBMED   7768597\r\nREFERENCE   2  (residues 1 to 347)\r\n  AUTHORS   Thompson,S.A.\r\n  TITLE     Direct Submission\r\n  JOURNAL   Submitted (17-AUG-1994) Stuart A. Thompson, Division of Infectious\r\n            Diseases, Vanderbilt University School of Medicine, A3310 Medical\r\n            Center North, Nashville, TN 37232, USA\r\n<a name=\"comment_687249\"><\/a>COMMENT     Method: conceptual translation.\r\n<a name=\"feature_687249\"><\/a>FEATURES             Location\/Qualifiers\r\n<span id=\"feature_687249_source_0\" class=\"feature\">     source          1..347\r\n                     \/organism=\"Helicobacter pylori\"\r\n                     \/strain=\"84-183 (ATCC 53726)\"\r\n                     \/db_xref=\"taxon:210\"\r\n                     \/clone=\"pSAT105\"\r\n                     \/clone_lib=\"lambda ZAP II\"\r\n<\/span><span id=\"feature_687249_Protein_0\" class=\"feature\">     Protein         1..347\r\n                     \/product=\"RecA\"\r\n<\/span><span id=\"feature_687249_Region_0\" class=\"feature\">     Region          1..347\r\n                     \/region_name=\"recA\"\r\n                     \/note=\"recombinase A; Provisional; PRK09354\"\r\n                     \/db_xref=\"CDD:236476\"\r\n<\/span><span id=\"feature_687249_Region_1\" class=\"feature\">     Region          6..331\r\n                     \/region_name=\"recA\"\r\n                     \/note=\"RecA is a bacterial enzyme which has roles in\r\n                     homologous recombination, DNA repair, and the induction of\r\n                     the SOS response. RecA couples ATP hydrolysis to DNA\r\n                     strand exchange; cd00983\"\r\n                     \/db_xref=\"CDD:238483\"\r\n<\/span><span id=\"feature_687249_Site_0\" class=\"feature\">     Site            order(14..15,17..18,21,25..29,98..99,101..107,112..113,\r\n                     215,219,267,313)\r\n                     \/site_type=\"other\"\r\n                     \/note=\"hexamer interface [polypeptide binding]\"\r\n                     \/db_xref=\"CDD:238483\"\r\n<\/span><span id=\"feature_687249_Site_1\" class=\"feature\">     Site            67..74\r\n                     \/site_type=\"other\"\r\n                     \/note=\"Walker A motif\"\r\n                     \/db_xref=\"CDD:238483\"\r\n<\/span><span id=\"feature_687249_Site_2\" class=\"feature\">     Site            order(69..75,97,101,104,145,195,229,242,264..267)\r\n                     \/site_type=\"other\"\r\n                     \/note=\"ATP binding site [chemical binding]\"\r\n                     \/db_xref=\"CDD:238483\"\r\n<\/span><span id=\"feature_687249_Site_3\" class=\"feature\">     Site            141..145\r\n                     \/site_type=\"other\"\r\n                     \/note=\"Walker B motif\"\r\n                     \/db_xref=\"CDD:238483\"\r\n<\/span><span id=\"feature_687249_CDS_0\" class=\"feature\">     CDS             1..347\r\n                     \/gene=\"recA\"\r\n                     \/coded_by=\"U13756.1:350..1393\"\r\n                     \/transl_table=11\r\n<\/span>ORIGIN      \r\n        1 <span id=\"gi_687249_1\" class=\"ff_line\">maidedkqka islaikqidk vfgkgalvrl gdkqvekida istgslgldl algiggvpkg<\/span>\r\n       61 <span id=\"gi_687249_61\" class=\"ff_line\">riieiygpes sgkttlslhi iaecqknggv cafidaehal dvyyakrlgv dtenllvsqp<\/span>\r\n      121 <span id=\"gi_687249_121\" class=\"ff_line\">stgeealeil etitrsggid lvvvdsvaal tpkaeidgdm gdqhvglqar lmshalrkit<\/span>\r\n      181 <span id=\"gi_687249_181\" class=\"ff_line\">gvlhkmnttl ifinqirmki gmtgygspet ttggnalkfy asvridirri aalkqneqhi<\/span>\r\n      241 <span id=\"gi_687249_241\" class=\"ff_line\">gnrakakvvk nkvappfrea efdimfgegi skegeiidyg vkldivdksg awlsyqdkkl<\/span>\r\n      301 <span id=\"gi_687249_301\" class=\"ff_line\">gqgrenakal lkedkalane itlkikesig sneeimplpd epleeme<\/span>\r\n\/\/<\/pre>\n[\/vc_column_text][vc_column_text][\/vc_column_text][\/toggle][\/toggles][\/vc_column][\/vc_row][vc_row type=&#8221;in_container&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243;][vc_column_text]And from this, giving a good NCBI page the possibilities are endless!<\/p>\n<p>This tool actually just made my day. I was working with a huge de novo transcriptome assembly composed of several thousands of contigs. One important step of the analysis is to know which contig code fore which gene. To have a quick and dirty idea of the annotation on function and taxonomy level, I checked if the contig blasted against something on the data base.<\/p>\n<p>I just used diamond software against the ncbi non redundant (nr) database and outputted a tabular file looking like this :<br \/>\n<code>TRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 62.7 220 75 3 799 1455 320 533 3.7e-73 284.6<br \/>\nTRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 48.5 206 94 6 52 651 1 200 5.5e-45 191.0<br \/>\nTRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 64.2 193 68 1 880 1455 357 549 9.0e-72 280.0<br \/>\nTRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 52.0 198 78 6 52 633 1 185 3.4e-47 198.4<br \/>\nTRINITY_DN177_c0_g1_i1 gi|918288237|gb|KOF67725.1| 50.8 264 88 4 805 1476 239 500 1.6e-65 259.2<br \/>\nTRINITY_DN177_c0_g1_i1 gi|443722686|gb|ELU11446.1| 54.0 213 89 6 814 1434 180 389 1.8e-56 229.2<br \/>\nTRINITY_DN177_c0_g1_i1 gi|926634111|ref|XP_013782890.1| 48.1 216 107 3 829 1464 193 407 1.5e-55 226.1<br \/>\nTRINITY_DN184_c0_g1_i1 gi|762125178|ref|XP_011445607.1| 92.4 92 7 0 281 6 1 92 6.8e-40 171.4<br \/>\nTRINITY_DN184_c0_g1_i1 gi|919047297|ref|XP_013406842.1| 87.1 93 12 0 281 3 1 93 1.1e-37 164.1<br \/>\nTRINITY_DN184_c0_g1_i1 gi|919047103|ref|XP_013406743.1| 87.8 90 11 0 281 12 1 90 2.7e-36 159.5<\/code><br \/>\nThen in order to match the Accession number to a protein product, I did the following command line jujitsu:<br \/>\n<code>awk -F \"|\" '{print $4}' Diamond.result.tab &gt; temp ##Extract the accession number in a temporary file<br \/>\na=$(&lt;temp) # put the accession name in a variable<br \/>\nfor i in $a;<br \/>\ndo esearch -db protein -query $i | efetch -format gb &gt; source\/$i.gbk;<br \/>\ncat source\/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's\/\\n\/ \/g'| sed -n 's:.*DEFINITION\\(.*\\)\\b].*\\b:'$i'\\t\\1:p' | sed 's\/[[].*$\/\/' &gt;&gt; result\/d.$SGE_TASK_ID.result; # extract definition name of the protein without the specie name between brackets and add it after the Accession number and a tabulation<br \/>\ncat source\/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's\/\\n\/ \/g'| sed -n 's:.*ORGANISM\\(.*\\)\\b[A-Z].*\\b:\\1:p' | sed 's\/[.].*$\/\/' | awk -F \" \" '{for (i=2; i&lt;=(NF-1); i++) printf $i \" \"; print $NF}' &gt;&gt; result\/t.resu\u013at; # extract taxonomy name of the protein in one line<br \/>\ndone<br \/>\npaste result\/d.result result\/t.result &gt; annotation.tab<\/code><\/p>\n<p>This output two file with some basic information about my contigs which can be used to have an idea of what is inside my transcriptome. The final file looks like this :<\/p>\n<p><code>XP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria<br \/>\nXP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria<br \/>\nXP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia<br \/>\nXP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia<br \/>\nKOF67725.1 hypothetical protein OCBIM_22008945mg Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Cephalopoda; Coleoidea; Neocoleoidea; Octopodiformes; Octopoda; Incirrata; Octopodidae; Octopus<br \/>\nELU11446.1 hypothetical protein CAPTEDRAFT_23984, partial Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Scolecida; Capitellida; Capitellidae; Capitella<br \/>\nXP_013782890.1 PREDICTED: headcase protein-like Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus<\/code>[\/vc_column_text][\/vc_column][\/vc_row]\n","protected":false},"excerpt":{"rendered":"<p>[vc_row type=&#8221;in_container&#8221; scene_position=&#8221;center&#8221; text_color=&#8221;dark&#8221; text_align=&#8221;left&#8221; overlay_strength=&#8221;0.3&#8243;][vc_column column_padding=&#8221;no-extra-padding&#8221; column_padding_position=&#8221;all&#8221; background_color_opacity=&#8221;1&#8243; background_hover_color_opacity=&#8221;1&#8243; width=&#8221;1\/1&#8243;][vc_column_text]I recently (re)discovered the edirect software package published on NCBI. As described on the website this &#8220;software suite that enables users to use the UNIX command line to directly access NCBI databases, as well as to parse and format the&#8230;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[37,36,7,6],"_links":{"self":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/324"}],"collection":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/comments?post=324"}],"version-history":[{"count":14,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/324\/revisions"}],"predecessor-version":[{"id":346,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/posts\/324\/revisions\/346"}],"wp:attachment":[{"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/media?parent=324"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/categories?post=324"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aassie.net\/Pro\/wp-json\/wp\/v2\/tags?post=324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}