Bioinformatics and the Internet

doi:10.1515/ci.1999.21.2.33

Published/Copyright: September 1, 2009

From the journal Chemistry International -- Newsmagazine for IUPAC Volume 21 Issue 2

News from IUPAC

Dr. Jürgen Pleiss and Professor Rolf D. Schmid, Chairman and Titular Members of the IUPAC Commission on Biotechnology (Institute for Technical Biochemistry, University of Stuttgart, Allmandring 31, D-70569 Stuttgart, Germany; e-mail: jpleiss@tebio1.biologie.uni-stuttgart.de; rolf.d.schmid@rus.uni-stuttgart.de), contributed the following article on the combination of two new technologies that are having a major impact on the pharmaceutical, agrochemical, and food industries.

Introduction

Explosive Growth of the World Wide Web

Life Sciences and the World Wide Web

Protein Sequencing Databanks

Bioinformatics Databanks and Web Sites

Challenges to Bioinformatics

Future of Bioinformatics

References

Introduction

At the turn of the millennium, two young technologies can be singled out which have a major impact on science, industry, and society: recombinant DNA and information technology. As they combine in the field of bioinformatics, they are transforming the pharmaceutical, agrochemical, and food industries and, as a consequence, university education. Much of today's information in the life sciences is generated by collaborative efforts at different locations worldwide, and effective communication is essential for success. Thus, the huge amount of data generated by large-scale genome sequencing activities, e.g., the human genome project, depends heavily on computing and telecommunications and stimulates further efforts in this area.

Explosive Growth of the World Wide Web

In information technology, the World Wide Web (WWW) has become the dominant global communication network. It is based on the Internet, which has already served for more than 20 years as a communication resource among scientists. But only when the hypertext transfer protocol (HTTP) was introduced in 1990 did communication via the Internet become sufficiently easy and inexpensive to allow its general use. Moreover, HTTP is hardware-independent and thus accessible even through inexpensive personal computers that are connected to the Internet either directly or via a modem and an Internet provider.

Fig. 1. Number of Internet hosts advertised in the DNS (Internet Domain Survey, July 1998, http://www.nw.com/zone/WWW/report.html)

This development has stimulated all kinds of commercial activities, and the numbers of Internet hosts and web sites have reached nearly 40 million and 4 million, respectively (Fig. 1). At present, the number of web sites doubles every year, 100 million people worldwide are estimated to be active Internet users, and business on the order of USD 8 billion is done via the Internet. It is expected that within two more years the number of active users might increase tenfold to reach 1 billion, a dramatic increase driven mainly by the populous Asian nations, and that Internet-based sales will account for USD 300 billion, or 1% of all global sales, within only four years.
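The projections quoted above can be turned into yearly growth factors with a little arithmetic. The following is only a rough sketch: it assumes constant compound growth between the quoted figures, which the sources of these estimates do not necessarily claim, and the function and variable names are illustrative.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Constant yearly growth factor that compounds to total_factor over `years`.

    Assumption: growth is steady and compounding, which the quoted
    estimates do not state explicitly.
    """
    return total_factor ** (1.0 / years)


# Figures quoted in the text.
users_factor = annual_factor(10.0, 2.0)          # users: tenfold within two years
sales_factor = annual_factor(300.0 / 8.0, 4.0)   # sales: USD 8 billion to USD 300 billion in four years
implied_global_sales = 300e9 / 0.01              # USD 300 billion stated as 1% of global sales

print(f"active users grow by a factor of {users_factor:.2f} per year")
print(f"Internet sales grow by a factor of {sales_factor:.2f} per year")
print(f"implied global sales: USD {implied_global_sales / 1e12:.0f} trillion")
```

Under these assumptions, the user estimate implies growth by a factor of roughly 3.2 per year and the sales estimate a factor of roughly 2.5 per year, both faster than the yearly doubling of web sites. The 1% figure implies global sales of about USD 30 trillion.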

Life Sciences and the World Wide Web

Though by now a majority of the 4 million web sites have a commercial background, the scientific use of the WWW will increase as well. Among the initiatives to enhance its quality and speed up transfer of large volumes of data, the Internet2 project is the most ambitious. It will start by mid-1999 with 141 participating universities and 14 companies across the United States. The Internet2 will serve exclusively scientific purposes and "facilitate and coordinate the development, deployment, operation, and technology transfer of advanced, network-based applications and network services to further U.S. leadership in research and higher education and accelerate the availability of new services and applications on the Internet".

Even now in the era of Internet commerce, many thousands of WWW sites are devoted to the global science network. In fact, many recent discoveries and developments, particularly in the life sciences, would be unthinkable without the Internet. The modern era of life sciences started in the 1950s and accelerated in the early 1970s, when the modern tools of genetic engineering were developed, i.e., how to isolate, sequence, and clone DNA and express it in a host organism of one's choice. In those early days, DNA sequencing was cumbersome and restricted to single genes, minor gene clusters, or small virus genomes. In order to store the resulting DNA sequences, the National Biomedical Research Foundation, Washington, DC, USA, created the first sequence databank in 1965.

Protein Sequencing Databanks

When DNA sequence information started to grow exponentially during the 1980s, three DNA sequence databanks were established: GenBank (National Center for Biotechnology Information) in Bethesda, MD, USA; the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, now at the European Bioinformatics Institute (EBI) in Hinxton, UK; and the DNA Data Bank of Japan (DDBJ) in Mishima, Japan. The three serve as mirror sites to each other.

As shown in Fig. 2, the DNA databases contained 40,000 DNA sequences with a total of 50 million base pairs in 1990, but within only a decade this number has increased 40-fold, now reaching 2 billion base pairs. This increase is due largely to advances in DNA technology and robot-assisted sequencing, allowing a shift from genetics to genomics; by now, the complete genomes of 14 bacteria, baker's yeast, 12 viruses and organelles, and the nematode Caenorhabditis elegans have been published on the Internet, and many others are approaching completion, among them the human genome with a total of about 3 billion base pairs alone. This enormous increase in numbers made new types of databases possible and necessary, e.g., web sites devoted to particular organisms such as the chromosome maps of the mouse. As the number of sequenced genomes increases and can be compared to individual geno- and phenotypes ("polymorphisms"), more and more important conclusions about the structure and regulation of single genes and proteins and their interrelation in health and disease can be drawn.
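The 40-fold increase quoted above can be restated as a doubling time. The sketch below assumes steady exponential growth and takes "within only a decade" to mean the nine years from 1990 to the 1999 print date of this article; both are assumptions made for illustration, not figures from the databanks themselves.

```python
import math


def doubling_time(fold_increase: float, years: float) -> float:
    """Years needed to double, assuming steady exponential growth."""
    return years * math.log(2.0) / math.log(fold_increase)


BASE_PAIRS_1990 = 50e6   # 50 million base pairs, as quoted in the text
BASE_PAIRS_NOW = 2e9     # 2 billion base pairs, as quoted in the text
ASSUMED_YEARS = 9.0      # assumption: 1990 to 1999

fold = BASE_PAIRS_NOW / BASE_PAIRS_1990
t_double = doubling_time(fold, ASSUMED_YEARS)

print(f"{fold:.0f}-fold increase in {ASSUMED_YEARS:.0f} years")
print(f"doubling time: about {t_double:.1f} years ({t_double * 12:.0f} months)")
```

With these inputs the stored sequence data double roughly every 1.7 years, or about every 20 months. A longer assumed interval would give a proportionally longer doubling time.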

On the level of individual proteins, the first sequence databanks were set up in the mid-1980s, including SwissProt at the Swiss Institute of Bioinformatics, Geneva, Switzerland, and the Protein Information Resource established by the National Biomedical Research Foundation, Washington, DC, USA. When protein structure analysis by X-ray crystallography and later by NMR spectroscopy began to grow rapidly in the 1970s, the Protein Data Bank (PDB) was established at the Brookhaven National Laboratory, Upton, Long Island, NY, USA. It contains at present over 9000 entries on protein structures. Protein science, for a long time focused on protein structure and architecture, is now developing vigorously in its own right; comparison of protein sequences based on DNA analysis and prediction of their tertiary structure ("from sequence to structure") is an active area of research, fueled by the quest for the so-called proteome, the sum of proteins expressed by a genome under different conditions of regulation and metabolism.

Bioinformatics Databanks and Web Sites

Table 1 lists a few important examples of the many extremely useful web sites related to the life sciences. Much of the experimental work required to arrive at such findings includes the use of complex algorithms which can, in turn, often be found on appropriate Internet pages. Finally, owing to its widespread accessibility, the Internet has also become a huge blackboard for scientific information, including online versions of scientific journals, free science information (such as the public database PubMed offered over the Internet by the National Library of Medicine at Bethesda, MD, USA, which allows free access to over 9 million scientific publications), tutorials, conference announcements, and information on grants and job offers. As a particular consequence of the Internet, access to information for scientists working in less developed countries has increased dramatically. Thus, as just four among dozens of examples, there now exist the following web sites:

an Asia-Pacific Network of Science and Technology Centers:
http://www.sci-ctr.edu.sg/apnstc/
an African Network for Essential National Health Research:
http://www.healthnet.org/afronets/enhr.htm
a West Africa Research Network (WARN):
http://www.yorku.ca/research/crs/prevent/warn.htm
Uninet - The South African Academic and Research Network:
http://www.idrc.ca/acacia/outputs/op-unin.htm

Challenges to Bioinformatics

The present shift from sequencing single genes to sequencing whole genomes is expected to expand widely our understanding of the regulation of expression, the interaction of proteins, and, finally, of the function of cells and multicellular organisms. Such progress implies new challenges to bioinformatics. There are at present two major problems:

Databases that deal with protein sequences and structures, on one hand, or with the function of whole cells, on the other, contain quite different, though interrelated, types of data. Research groups active in either area tend to choose data formats optimized for their particular purpose. As a result, consistency and coherence of databases can become a major problem.
The higher the complexity of the data, the more difficult their analysis and graphical presentation become. Most future projects will be highly interdisciplinary, requiring the collaboration of experts from several or even many fields. In this situation, it will be necessary to support interaction with databases through expert systems that integrate the knowledge of specialists and are user-friendly.

Future of Bioinformatics

As a probable consequence of all these developments, the biological and biochemical experiments of the future will, to some extent, be carried out not only in vivo and in vitro, but also in silico. Biology-related information, available from databases through the WWW, will be the pertinent raw material and can be exploited profitably. As seen already in the case of the "gene hunt in silico", it becomes more and more feasible to transform this computer-based information into valuable research results or even products. Thus, it is becoming a reality that novel targets for drugs or new powerful biocatalysts can be identified in the huge and growing mass of computer-based genomic sequence information and that metabolic fluxes in living beings can be clustered, via a bioinformatics approach, to allow the genetic reengineering of metabolic pathways in microorganisms, plants, animals, or man.
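A search of this kind typically starts by comparing a query sequence against every entry of a databank, as the BLAST and FASTA services in Table 1 do. The sketch below uses the classic Smith-Waterman local alignment score, which those services approximate with much faster heuristics; it is not their actual implementation. The scoring values, the toy databank, and all names are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between sequences a and b.

    Uses a simple linear gap penalty; the scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_by_similarity(query: str, databank: dict) -> list:
    """Return (score, name) pairs for all databank entries, best score first."""
    scored = [(local_alignment_score(query, seq), name)
              for name, seq in databank.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy databank; real databanks hold millions of entries.
toy_databank = {
    "entry_with_motif": "GGACGTGG",
    "unrelated_entry": "TTTTTTTT",
}

for score, name in rank_by_similarity("ACGT", toy_databank):
    print(name, score)
```

The full dynamic programming comparison takes time proportional to the product of the two sequence lengths, which is why heuristic methods are preferred for databanks containing millions of sequences.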

References

Internet Domain Survey, July 1998, http://www.nw.com/zone/WWW/top.html
The Netcraft Web Server Survey, http://www.netcraft.com/Survey/
Internet Statistics: Growth and Usage of the Web and the Internet, http://www.mit.edu/people/mkgray/net/
eMarketer, http://www.e-land.com/
Hermes project, http://www-personal.umich.edu/~sgupta/hermes/
The Internet2 project, http://www.internet2.edu/

Table 1. Examples of Useful Web Sites in Bioinformatics

Database Type	Description	URL
DNA and Protein Sequence Databases
SRS	SRS Browser for 38 databanks in molecular biology	http://www.embl-heidelberg.de/srs5/
SWISS-PROT and TrEMBL	Annotated protein sequence database (78,082 and 178,957 sequences, respectively)	http://expasy.hcuge.ch/sprot/sprot-top.html
PIR	Protein Information Resource (116,372 sequences)	http://www-nbrf.georgetown.edu/pir/
EMBL Nucleotide Sequence Database	DNA sequence database (3,046,471 sequences)	http://www.ebi.ac.uk/ebi_docs/embl_db/ebi/topembl.html
GenBank	DNA sequence database (3,044,000 sequences)	http://www.ncbi.nlm.nih.gov/Entrez/nucleotide.html
DDBJ	DNA Data Bank of Japan (3,073,166 sequences)	http://www.ddbj.nig.ac.jp/
Genomics
Pedant at MIPS	Software system for completely automatic and exhaustive analysis of protein sequence sets (21 complete, 21 unfinished genomes)	http://pedant.mips.biochem.mpg.de/
TIGR Database	Microbial database (20 published genomes, 60 genomes in progress)	http://www.tigr.org/tdb/tdb.html
Sanger Center	Human genome and 24 more genomes	http://www.sanger.ac.uk/
Protein Structure
PDB	Archive of experimentally determined three-dimensional structures (9,179 entries)	http://www.pdb.bnl.gov/
Literature Searches
Medline	Search for citations	http://www4.ncbi.nlm.nih.gov/PubMed/
SWISS-PROT journals list	List of online journals	http://www.expasy.ch/cgi-bin/jourlist?jourlist.txt
Homology Searches
BLAST	Sequence similarity search in 22 sequence databases and 42 genomes	http://www.ncbi.nlm.nih.gov/BLAST/
FASTA	Sequence similarity search in 25 sequence databases	http://www2.ebi.ac.uk/fasta3/
Structure Prediction
Swiss-Model	Homology modeling	http://expasy.hcuge.ch/swissmod/SWISS-MODEL.html
Biotech Validation Suite for Protein Structures	Quality checks of protein structures	http://biotech.embl-heidelberg.de:8400/
PredictProtein	Prediction of aspects of protein structure	http://www.embl-heidelberg.de/predictprotein/predictprotein.html
Protein Architectures
SCOP	Protein structure classification	http://scop.mrc-lmb.cam.ac.uk/scop/
CATH	Protein structure classification	http://www.biochem.ucl.ac.uk/bsm/cath/
International Organizations
FAO	Partnership programs of FAO	http://www.fao.org/GENINFO/partner/default.htm
UNESCO	Biotechnology fellowship programs of UNESCO	http://www.unesco.org/general/eng/programmes/science/life/index.htm

Published Online: 2009-09-01

Published in Print: 1999-03
