November 7, 2009









Database
nifH database (23.25 Mb file size, 13,952 total records)
ARB, constructed from GenBank entries
Aligned by Hidden Markov Model from PFAM
Last updated March 30, 2009

The nifH Gene

(Interpro IPR000392): Nitrogen-fixing bacteria possess a nitrogenase enzyme complex (PUBMED:2672439) comprising two components that together catalyse the reduction of molecular nitrogen to ammonia (PUBMED:6327620; Norel F. and Elmerich C., J. Gen. Microbiol. 133, 1987). Component I (nitrogenase MoFe protein, or dinitrogenase) contains two molecules each of two non-identical subunits. Component II (nitrogenase Fe protein, or dinitrogenase reductase) is a homodimer whose monomer is encoded by the nifH gene (PUBMED:6327620). Component II has two ATP-binding domains and one 4Fe-4S cluster per homodimer: it supplies energy by ATP hydrolysis and transfers electrons from reduced ferredoxin or flavodoxin to component I for the reduction of molecular nitrogen to ammonia (PUBMED:2491672). These proteins contain several conserved regions: the N-terminal section carries an ATP-binding site motif 'A' (P-loop), and the central section carries two conserved cysteines that have been shown, in nifH, to be the ligands of the 4Fe-4S cluster.

Nitrogenase genes (nifH, nifD, nifK, nifE, nifN, and others) form a closely related family that likely arose from a common ancestor. In the alternative nitrogenases, the nifH, nifD, and nifK genes are termed anfH, anfD, and anfK, respectively, and there is an additional subunit encoded by anfG. Because it is the most highly conserved of these genes, nifH has been the target of most studies, particularly environmental studies, and thousands of nifH sequences are now available. Recovering these genes, their coding regions, and their metadata, and aligning them in a coherent manner, is problematic. This is largely due to the lack of shared conventions for data storage among the major genomic repositories, and to the large volume of legacy data in which information is presented inconsistently.

The nifH Database

The ARB software environment is useful for visualizing and manipulating aligned and unaligned sequences, and for maintaining metadata on sources, publications, etc. ARB also contains features for probe design and for the construction of publishable phylogenetic trees. However, ARB is not well suited to highly repetitive tasks, to computationally demanding tasks, or to downloading and validating new data. Therefore, we have designed a semi-automated process for constructing the nifH database from public genomic data sources, which proceeds as follows:

  1. Using MyNCBI, download the GenBank protein records annotated with a feature of "/gene=nifH".
  2. Use eFetch to retrieve the GenBank nucleotide records of interest, as specified in the "/coded_by" and "/codon_start" features of the protein records from step 1.
  3. Reformat the retrieved GenBank nucleotide records into a modified EMBL format for importation into ARB using a custom version of the "ebi_2002_wl.ift" ARB filter.
  4. In ARB, merge the newly added nifH records with the existing ones, translate all nucleotide sequences to amino acid sequences and export them, then align with "hmmalign" using the "Fer4_NifH_fs.hmm" model from Pfam.
  5. Import the aligned amino acid sequences into ARB and realign the corresponding nucleotide sequences to the aligned amino acids.
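
Step 2 above can be illustrated with a minimal Python sketch. It is not the lab's actual script: the function names `parse_coded_by` and `efetch_url` are hypothetical, the accession in the usage note is a made-up placeholder, and the sketch assumes the current NCBI E-utilities EFetch endpoint with its `db`, `id`, `rettype`, `retmode`, `seq_start`, `seq_stop`, and `strand` parameters. Only single-interval "/coded_by" locations are handled; "join(...)" locations are rejected rather than guessed at.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint: the public NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Matches one interval such as 'AB123456.1:1..891' or
# 'complement(AB123456.1:<10..>900)'; '<' and '>' mark partial ends.
_LOCATION = re.compile(
    r"^(complement\()?([A-Za-z0-9_.]+):[<>]?(\d+)\.\.[<>]?(\d+)\)?$"
)


def parse_coded_by(value):
    """Split a single-interval /coded_by value into its parts."""
    match = _LOCATION.match(value.strip())
    if match is None:
        # Multi-interval locations (join(...)) are outside this sketch.
        raise ValueError("unsupported /coded_by location: %r" % value)
    complement, accession, start, stop = match.groups()
    return {
        "accession": accession,
        "start": int(start),
        "stop": int(stop),
        # EFetch convention: 1 = plus strand, 2 = minus strand.
        "strand": 2 if complement else 1,
    }


def efetch_url(location):
    """Build an EFetch URL for the GenBank flat file of the coding region."""
    params = {
        "db": "nuccore",
        "id": location["accession"],
        "rettype": "gb",
        "retmode": "text",
        "seq_start": location["start"],
        "seq_stop": location["stop"],
        "strand": location["strand"],
    }
    return EFETCH + "?" + urlencode(params)
```

For example, `efetch_url(parse_coded_by("complement(AB123456.1:<10..>900)"))` yields a URL requesting bases 10 to 900 of the placeholder accession on the minus strand. The "/codon_start" value is not part of the request; it would be applied afterwards, when the retrieved nucleotides are translated.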

Notes on Using the nifH Database

There are a few points to keep in mind when using the nifH database.

  1. If a nifH sequence does not have a GenBank protein accession with a feature of "/gene=nifH" and valid entries for "/coded_by" and "/codon_start", it will not appear in this database. There are over 2,000 such sequences in GenBank. If you submitted such a sequence and ask GenBank to correct its annotation, the next update to the nifH database should include your corrected sequence.
  2. Many nifH nucleotide sequences have uncertain base positions that prevent ARB from aligning the nucleotide sequence to the aligned amino acid sequence. A field called "CantAlign" contains a value of "Y" for these sequences (209 as of March 2009).
  3. Two kinds of records are excluded from the database: misannotations, identified by amino acid sequences that do not align at all to the HMM, and very short sequences (identified visually; usually they are from promoter studies).
  4. The "ali-nifHDNA" field contains the aligned nucleotides and "ali_nifHprot" contains the aligned amino acids. These are the fields to use for probe design and construction of phylogenetic trees.
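
The relationship between the aligned amino acids and the aligned nucleotides (notes 2 and 4) can be sketched in Python as follows. This is an illustration only, not ARB's realignment routine: the function name `thread_codons` is hypothetical, and the rule used to flag a sequence (any codon containing a character other than A, C, G, T, or U, or too few bases to cover the protein) is an assumption, since the exact criterion ARB applies is not described here.

```python
GAP_CHARS = "-."          # gap symbols commonly seen in protein alignments
UNAMBIGUOUS = set("ACGTU")


def thread_codons(aligned_protein, nucleotides, codon_start=1):
    """Place codons under an aligned protein sequence.

    Returns (aligned_dna, cant_align). cant_align is "Y" and aligned_dna
    is None when the nucleotides cannot be threaded under the assumed rule.
    """
    # /codon_start is 1-based: skip the bases before the first full codon.
    coding = nucleotides[codon_start - 1:].upper()
    residues = sum(1 for ch in aligned_protein if ch not in GAP_CHARS)
    if len(coding) < 3 * residues:
        return None, "Y"  # not enough bases to cover every residue

    columns = []
    position = 0
    for ch in aligned_protein:
        if ch in GAP_CHARS:
            columns.append("---")  # a protein gap becomes a three-base gap
            continue
        codon = coding[position:position + 3]
        if not set(codon) <= UNAMBIGUOUS:
            return None, "Y"  # assumed rule: ambiguous base blocks alignment
        columns.append(codon)
        position += 3
    return "".join(columns), "N"
```

With these assumptions, `thread_codons("M-K.", "ATGAAA")` returns `("ATG---AAA---", "N")`, while `thread_codons("MK", "ATGAAN")` returns `(None, "Y")`, mirroring the role of the "CantAlign" field.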

If you have questions about this database, email htripp at ucsc dot edu.






University of California Santa Cruz
Ocean Sciences Department
Marine Microbiology Laboratory/Zehr Lab
1156 High Street, Santa Cruz, CA 95064
T: 831-459-3128
F: 831-459-4882