RJ-ROBBINS database Fundamentals

Database Fundamentals

Robert J. Robbins

General definition: A database is any collection of related data.

Restrictive definition: A database is a persistent, logically coherent collection of inherently meaningful data, relevant to some aspects of the real world.

A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. According to the ANSI/SPARC DBMS Report (1977), a DBMS should be envisioned as a multi-layered system:

Database management systems provide several functions in addition to simple file management:

allow concurrency
control security
maintain data integrity
provide for backup and recovery
control redundancy
allow data independence
provide non-procedural query language
perform query optimization

Many different individuals are involved with a database management system over its life:

systems analysts
database designers
database administrators
application developers
users

The top layer of a database system is where users and administrators interact with the system.

The middle layer of a database system is where the logic of the system occurs. The structure of the database is created and maintained using a data definition language (DDL) and the content of the database is added, managed, and searched using a data manipulation language (DML).

The bottom layer is where the system interacts with the operating system and the hardware of the underlying computer infrastructure.

Rarely does any one individual have a complete understanding of all aspects of these different layers.

In large database implementations (say, a financial system), people with domain expertise have mastery of the subject matter of the database and they work closely with application programmers to ensure that the behavior of the systems meets the needs of the end users. Database administrators with knowledge of the system being used (Oracle, Sybase, SQL Server, etc) are needed to make sure that the logical layers behave appropriately. System administrators and other infrastructure experts are required to keep the overall IT infrastructure working, productively and safely.

What is a relational database?

a database that treats all of its data as a collection of relations

What is a relation?

a kind of set
a subset of a cartesian product
an unordered set of ordered tuples

Some elementary discussions of relational databases say that they are called relational because they allow users to relate data in one file (or table) to data in another file (or table). This is incorrect.

E. F. Codd, the inventor of relational database theory, recognized that all business data could easily be represented in a format that satisfied the formal conditions necessary to be considered a relation in the set-theoretic sense. He further realized that IF data were always constrained to exist in the form of relations, then a database management system could be created where the deep machinery would carry out manipulations of the data purely from the point of view of set theory, with no need for knowledge of the meaning, or semantics, of the data.

This incredibly important insight let others begin to implement a database system in a way that would allow constant improvements to the underlying machinery, with no disruption to the overlying meaning of the data.

SET	any collection of distinct entities of any sort
examples	A = {1,2,3,4,5,} B = {H,T} C = {R,B} D = {Grant,Sherman,Lee}
CARTESIAN PRODUCT	a set of ordered pairs, produced by combining each element of one set with each element of another set.
example	B x C = {<H,R>,<H,B>,<T,R>,<T,B>}
Note	Cartesian products may be generated by multiplying any number of sets together. The actual number of sets involved in a particular case is said to be the degree or arity of that Cartesian product.
Relation	a subset of a Cartesian product
example	Q = {<H,R>,<H,B>}
Note	relations may be of any degree (arity).

A set is usually indicated by a comma-delimited list of the names of its members within a pair of wavy brackets.

R	=	{ 1,2,3,4,5,6 }
G	=	{ Marshall, Eisenhower, Bradley }

The members of a set are unordered. Two sets are considered equivalent if and only if they contain exactly the same members, without regard for the order in which the members are listed.

R	=	{ 1,2,3,4,5,6 }
	=	{ 3,2,1,6,4,5 }
G	=	{ Marshall, Eisenhower, Bradley }
	=	{ Bradley, Marshall, Eisenhower }

An ordered double (or triple or quadruple or n-tuple) is usually indicated by including a comma-delimited list of the names of its members within a pair of pointed brackets:

S	=	< 2,4>
C	=	< Marshall, Eisenhower, Bradley >

Order must be maintained in ordered n-tuples. Two tuples are considered different if they contain the same members in a different order.

S	=	< 2,4 > ≠ < 4,2 >
C	=	< Marshall, Eisenhower, Bradley >
	≠	< Bradley, Eisenhower, Marshall >

A set may consist of an unordered collection of ordered tuples. For example, we could imagine the set of all ordered pairs of integers, such that the first element is the square root of the second element.

{ < 1,1 > , < 2,4 > , < 3,9 > , ...}

LET	R be the set of possible outcomes when rolling a single red die.

{ 1,2,3,4,5,6 }

LET	B be the set of possible outcomes when rolling a single blue die.

{ 1,2,3,4,5,6 }

THEN	The Cartesian product R x B gives the set of all possible outcomes when the two dice are rolled together:

A Cartesian product of two sets can be generated by combining every member of one set with every member of the other set. This results in a complete set of ordered pairs, consisting of every possible combination of one member of the first set with one member of the second set. The number of elements in a cartesian product is equal to M x N, where M and N give the number of members in each set.

	Starting two sets
	A Cartesian product of two sets, shown as a list of ordered pairs.
	A Cartesian product of two sets, shown as a connection diagram, with each member of the first set connected to each member of the other set.

A Cartesian product pairs every member of the first set with every member of the second set.

A relation pairs some members of the first set with some members of the second set.

A relation, therefore, must always be representable as a subset of a Cartesian product.

A binary relation is a set of ordered doubles, with one element a member of the first set and one element a member of the second set. Generally, we could represent a set of ordered doubles as below. S1 is the first set and S2 the second.

By adding sets, relations can be extended to include ordered triples, ordered quadruples or, in general, any ordered n-tuple, as below. A relation with n participating sets is said to be of degree n or to possess arity n.

Data models consist of data structures and permitted operations on those data structures. Part of Codd's genius was to recognize that many of the standard set operators that can take relations as operands map nicely to real data manipulation problems:

cartesian product
union
intersection
difference

Codd devised some additional operators to provide extra manipulatory power:

SELECT
PROJECT
JOIN
DIVIDE

The original operators have now been extended to include some additional useful manipulations:

OUTER JOIN
OUTER UNION

Considerable study has been made of the properties of relations as they affect the behavior of relational databases. The results of these studies are captured in the definition of normal forms.

First Normal Form:

A relation is in first normal form (1NF) if and only if all underlying domains contain atomic values only.

Second Normal Form:

A relation is in second normal form (2NF) if and only if it is in 1NF and every non-key attribute is fully dependent on the primary key.

Third Normal Form:

A relation is in third normal form (3NF) if and only if it is in 2 NF and the non-key attributes are mutually independent.

The concepts of normal form were developed in the early days of relational database systems, when many of the older systems being replaced (or designed) had been designed on a very ad-hoc basis, often following the format of paper records used before automation was in place.

The notions of normal form helped to establish guidelines to shape data-modelling behavior away from the old, ad-hoc methods.

Now that database design is seen as a professional endeavor in its own right, and now that notions of good data modelling practices are more widespread, the concepts of normal form are encountered less often.

The Entity-Relationship (E-R) data model is a semantically rich model that can be mapped to a relational system.

The three files represented above are all relations in the formal sense. Chen (1976) noted that different relations may play different roles in a database and that being able to recognize and document those roles is a key part of database design. The "faculty" and the "department" relations above both store information about particular real-world entities. The "member-of" relation, on the other hand, stores information about specific relationships involving individual pairs of real-world entities.

Different needs for access and use of the database can be supported through different user views

Layers may be added to a conceptual design in order to increase the semantic richness available at the top design level.

If layered conceptual models are used, the layering may be perceived differently by the system's users and developers. Users often see the database only in terms of the views that they employ. System analysts and designers may think primarily about the E-R schema, whereas the database administrator is likely to deal primarily with the relational schema and the physical system.

Sets of real-world entities are represented with named rectangles:

Relationships between members of entity sets are represented with named diamonds that are connected to the rectangles of the participating entity sets with directed arcs:

Arcs are drawn with an orientation that "points" from foreign keys to primary keys. The min:max participation cardinality can be indicated by placing pairs of numbers on each arc. Here, "4,n" means that every department is required to have at least four student majors, but can have many more; "1,2" means that each student is required to have at least one major and is permitted to have no more than two majors. Sometimes only the maximum participation cardinalities are shown.

Many different cardinalities are possible. Documenting the cardinalities is an essential part of database analysis and design.

Faculty and departments entities could be related by a many-to-many "member-of" relationship:

They could also be related by a one-to-one "chairman-of" relationship:

Combining these two relationships into a single diagram, we would have:

A database design derived from the figure above would allow a faculty member to chair a department of which he/she was not a member. To indicate an integrity constraint that requires membership in a department in order to chair the department, the E-R diagram would be modified as below:

Class hierarchies ("ISA" hierarchies) could be indicated as below:

Relationships may be recursive. Here, this E-R figure represents all possible mother-child relationships among all humans.

RJR Experience and Expertise

Researcher

Robbins holds BS, MS, and PhD degrees in the life sciences. He served as a tenured faculty member in the Zoology and Biological Science departments at Michigan State University. He is currently exploring the intersection between genomics, microbial ecology, and biodiversity — an area that promises to transform our understanding of the biosphere.

Educator

Robbins has extensive experience in college-level education: At MSU he taught introductory biology, genetics, and population genetics. At JHU, he was an instructor for a special course on biological database design. At FHCRC, he team-taught a graduate-level course on the history of genetics. At Bellevue College he taught medical informatics.

Administrator

Robbins has been involved in science administration at both the federal and the institutional levels. At NSF he was a program officer for database activities in the life sciences, at DOE he was a program officer for information infrastructure in the human genome project. At the Fred Hutchinson Cancer Research Center, he served as a vice president for fifteen years.

Technologist

Robbins has been involved with information technology since writing his first Fortran program as a college student. At NSF he was the first program officer for database activities in the life sciences. At JHU he held an appointment in the CS department and served as director of the informatics core for the Genome Data Base. At the FHCRC he was VP for Information Technology.

Publisher

While still at Michigan State, Robbins started his first publishing venture, founding a small company that addressed the short-run publishing needs of instructors in very large undergraduate classes. For more than 20 years, Robbins has been operating The Electronic Scholarly Publishing Project, a web site dedicated to the digital publishing of critical works in science, especially classical genetics.

Speaker

Robbins is well-known for his speaking abilities and is often called upon to provide keynote or plenary addresses at international meetings. For example, in July, 2012, he gave a well-received keynote address at the Global Biodiversity Informatics Congress, sponsored by GBIF and held in Copenhagen. The slides from that talk can be seen HERE.

Facilitator

Robbins is a skilled meeting facilitator. He prefers a participatory approach, with part of the meeting involving dynamic breakout groups, created by the participants in real time: (1) individuals propose breakout groups; (2) everyone signs up for one (or more) groups; (3) the groups with the most interested parties then meet, with reports from each group presented and discussed in a subsequent plenary session.

Designer

Robbins has been engaged with photography and design since the 1960s, when he worked for a professional photography laboratory. He now prefers digital photography and tools for their precision and reproducibility. He designed his first web site more than 20 years ago and he personally designed and implemented this web site. He engages in graphic design as a hobby.

Collection of publications by R J Robbins

Reprints and preprints of publications, slide presentations, instructional materials, and data compilations written or prepared by Robert Robbins. Most papers deal with computational biology, genome informatics, using information technology to support biomedical research, and related matters.

Research Gate page for R J Robbins

ResearchGate is a social networking site for scientists and researchers to share papers, ask and answer questions, and find collaborators. According to a study by Nature and an article in Times Higher Education , it is the largest academic social network in terms of active users.

Curriculum Vitae for R J Robbins

short personal version

Curriculum Vitae for R J Robbins

long standard version