Skip to main content

Fetching information from NCBI directly from the command line

By July 26, 2016Bio-informatic

I recently (re)discovered the edirect software package published on NCBI. As described on the website this “software suite that enables users to use the UNIX command line to directly access NCBI databases, as well as to parse and format the data to create customized downloads“.

How does it work? Basically like the “Entrez” research system on the NCBI website but with some extra feature. For example, I have a protein accession number and I want to know to what it correspond. I will use the command as follow:
esearch -db protein -query "AAC43379"
And it will output the following:
<ENTREZ_DIRECT>
<Db>protein</Db>
<WebEnv>NCID_1_3752066_130.14.22.215_9001_1469541026_407945051_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>1</Count>
<Step>1</Step>
</ENTREZ_DIRECT>

But then if coupled with the efetch script like this :
esearch -db protein -query "AAC43379" | efetch -format gb
This will output on the terminal the genbank format of the protein as you see it on the NCBI page.

What will be printed on the terminal (Example)

LOCUS       AAC43379                 347 aa            linear   BCT 23-JUN-1995
DEFINITION  RecA [Helicobacter pylori].
ACCESSION   AAC43379
VERSION     AAC43379.1  GI:687249
DBSOURCE    locus HPU13756 accession U13756.1
KEYWORDS    .
SOURCE      Helicobacter pylori
  ORGANISM  Helicobacter pylori
            Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales;
            Helicobacteraceae; Helicobacter.
REFERENCE   1  (residues 1 to 347)
  AUTHORS   Thompson,S.A. and Blaser,M.J.
  TITLE     Isolation of the Helicobacter pylori recA gene and involvement of
            the recA region in resistance to low pH
  JOURNAL   Infect. Immun. 63 (6), 2185-2193 (1995)
   PUBMED   7768597
REFERENCE   2  (residues 1 to 347)
  AUTHORS   Thompson,S.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-AUG-1994) Stuart A. Thompson, Division of Infectious
            Diseases, Vanderbilt University School of Medicine, A3310 Medical
            Center North, Nashville, TN 37232, USA
COMMENT     Method: conceptual translation.
FEATURES             Location/Qualifiers
     source          1..347
                     /organism="Helicobacter pylori"
                     /strain="84-183 (ATCC 53726)"
                     /db_xref="taxon:210"
                     /clone="pSAT105"
                     /clone_lib="lambda ZAP II"
     Protein         1..347
                     /product="RecA"
     Region          1..347
                     /region_name="recA"
                     /note="recombinase A; Provisional; PRK09354"
                     /db_xref="CDD:236476"
     Region          6..331
                     /region_name="recA"
                     /note="RecA is a bacterial enzyme which has roles in
                     homologous recombination, DNA repair, and the induction of
                     the SOS response. RecA couples ATP hydrolysis to DNA
                     strand exchange; cd00983"
                     /db_xref="CDD:238483"
     Site            order(14..15,17..18,21,25..29,98..99,101..107,112..113,
                     215,219,267,313)
                     /site_type="other"
                     /note="hexamer interface [polypeptide binding]"
                     /db_xref="CDD:238483"
     Site            67..74
                     /site_type="other"
                     /note="Walker A motif"
                     /db_xref="CDD:238483"
     Site            order(69..75,97,101,104,145,195,229,242,264..267)
                     /site_type="other"
                     /note="ATP binding site [chemical binding]"
                     /db_xref="CDD:238483"
     Site            141..145
                     /site_type="other"
                     /note="Walker B motif"
                     /db_xref="CDD:238483"
     CDS             1..347
                     /gene="recA"
                     /coded_by="U13756.1:350..1393"
                     /transl_table=11
ORIGIN      
        1 maidedkqka islaikqidk vfgkgalvrl gdkqvekida istgslgldl algiggvpkg
       61 riieiygpes sgkttlslhi iaecqknggv cafidaehal dvyyakrlgv dtenllvsqp
      121 stgeealeil etitrsggid lvvvdsvaal tpkaeidgdm gdqhvglqar lmshalrkit
      181 gvlhkmnttl ifinqirmki gmtgygspet ttggnalkfy asvridirri aalkqneqhi
      241 gnrakakvvk nkvappfrea efdimfgegi skegeiidyg vkldivdksg awlsyqdkkl
      301 gqgrenakal lkedkalane itlkikesig sneeimplpd epleeme
//

And from this, giving a good NCBI page the possibilities are endless!

This tool actually just made my day. I was working with a huge de novo transcriptome assembly composed of several thousands of contigs. One important step of the analysis is to know which contig code fore which gene. To have a quick and dirty idea of the annotation on function and taxonomy level, I checked if the contig blasted against something on the data base.

I just used diamond software against the ncbi non redundant (nr) database and outputted a tabular file looking like this :
TRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 62.7 220 75 3 799 1455 320 533 3.7e-73 284.6
TRINITY_DN177_c0_g1_i1 gi|908406704|ref|XP_013094042.1| 48.5 206 94 6 52 651 1 200 5.5e-45 191.0
TRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 64.2 193 68 1 880 1455 357 549 9.0e-72 280.0
TRINITY_DN177_c0_g1_i1 gi|524908906|ref|XP_005109542.1| 52.0 198 78 6 52 633 1 185 3.4e-47 198.4
TRINITY_DN177_c0_g1_i1 gi|918288237|gb|KOF67725.1| 50.8 264 88 4 805 1476 239 500 1.6e-65 259.2
TRINITY_DN177_c0_g1_i1 gi|443722686|gb|ELU11446.1| 54.0 213 89 6 814 1434 180 389 1.8e-56 229.2
TRINITY_DN177_c0_g1_i1 gi|926634111|ref|XP_013782890.1| 48.1 216 107 3 829 1464 193 407 1.5e-55 226.1
TRINITY_DN184_c0_g1_i1 gi|762125178|ref|XP_011445607.1| 92.4 92 7 0 281 6 1 92 6.8e-40 171.4
TRINITY_DN184_c0_g1_i1 gi|919047297|ref|XP_013406842.1| 87.1 93 12 0 281 3 1 93 1.1e-37 164.1
TRINITY_DN184_c0_g1_i1 gi|919047103|ref|XP_013406743.1| 87.8 90 11 0 281 12 1 90 2.7e-36 159.5

Then in order to match the Accession number to a protein product, I did the following command line jujitsu:
awk -F "|" '{print $4}' Diamond.result.tab > temp ##Extract the accession number in a temporary file
a=$(<temp) # put the accession name in a variable
for i in $a;
do esearch -db protein -query $i | efetch -format gb > source/$i.gbk;
cat source/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'| sed -n 's:.*DEFINITION\(.*\)\b].*\b:'$i'\t\1:p' | sed 's/[[].*$//' >> result/d.$SGE_TASK_ID.result; # extract definition name of the protein without the specie name between brackets and add it after the Accession number and a tabulation
cat source/$i.gbk | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g'| sed -n 's:.*ORGANISM\(.*\)\b[A-Z].*\b:\1:p' | sed 's/[.].*$//' | awk -F " " '{for (i=2; i<=(NF-1); i++) printf $i " "; print $NF}' >> result/t.resuĺt; # extract taxonomy name of the protein in one line
done
paste result/d.result result/t.result > annotation.tab

This output two file with some basic information about my contigs which can be used to have an idea of what is inside my transcriptome. The final file looks like this :

XP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria
XP_013094042.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Panpulmonata; Hygrophila; Planorboidea; Planorbidae; Biomphalaria
XP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia
XP_005109542.1 PREDICTED: headcase protein homolog Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Gastropoda; Heterobranchia; Euthyneura; Euopisthobranchia; Aplysiomorpha; Aplysioidea; Aplysiidae; Aplysia
KOF67725.1 hypothetical protein OCBIM_22008945mg Eukaryota; Metazoa; Lophotrochozoa; Mollusca; Cephalopoda; Coleoidea; Neocoleoidea; Octopodiformes; Octopoda; Incirrata; Octopodidae; Octopus
ELU11446.1 hypothetical protein CAPTEDRAFT_23984, partial Eukaryota; Metazoa; Lophotrochozoa; Annelida; Polychaeta; Scolecida; Capitellida; Capitellidae; Capitella
XP_013782890.1 PREDICTED: headcase protein-like Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus

Join the discussion 4 Comments

  • Stefan says:

    Hey Adrien,

    nice post on the subject. You can even go further and let the edirect tools handle the extraction of the information. Something along the following lines (the RegEx could of course be tweaked further):

    [code language=”bash”]
    awk -F “|” ‘{print $4}’ diamond.tab > temp
    for i in $(> my_annotation.tab
    done
    rm temp
    [/code]

    Furthermore, you could think about deduplicating the input accessions or using a conditional in the for loop to circumvent downloading GenBank files with the same accession over and over again.

    Best Regards
    Stefan

    • Stefan says:

      It still seems to filter the code. The relevant line is:

      esearch -db protein -query $i | efetch -format gb -mode xml | xtract -pattern GBSeq -element GBSeq_accession-version GBSeq_definition GBSeq_taxonomy | sed “s:\(.*\) \[.*\]\(.*\):\1\2:g”

      • Adrien says:

        Hej Stefan,

        Thank you very much to pass by and leave a comment,
        I will look for including a way to leave code in the comment!

        I will give a try to your suggestion seems a nice option !

        Cheers!

        Adrien

  • Stefan says:

    Hello again 😉

    I tweaked the script a little bit. It should now avoid additional downloads by caching information in an array (should require bash 4). I don’t know how it will perform wit a larger data set. If you have time, feel free to test it out. Replace (lt) and (gt) with the “lowerthan” and “greaterthan” signs:

    [code language=”bash”]
    #!/bin/bash
    declare -A accversion=() # declare associative array to cache information and avoid unnecessary downloads
    awk -F “|” ‘{print $4}’ diamond.tab > temp # write accession.version to temporary file
    for i in $((lt)temp); do
    tr_i=$(echo $i | tr ‘.’ ‘-‘) # “dot” is not allowed in variable/key names, replace it by “dash”
    if [[ -z “${accversion[$tr_i]}” ]]; then
    accversion[$tr_i]=$(esearch -db protein -query $i | efetch -format gb -mode xml | xtract -pattern GBSeq -element GBSeq_accession-version GBSeq_definition GBSeq_taxonomy | sed “s:\(.*\) \[.*\]\(.*\):\1\2:g”) # fetch information
    echo “${accversion[$tr_i]}” (gt)(gt) my_annotation.tab
    else
    echo “${accversion[$tr_i]}” (gt)(gt) my_annotation.tab
    fi
    done
    unset accversion # delete array
    rm temp # delete temporary file
    [/code]

    Stefan

Leave a Reply to Stefan Cancel Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.