I. Architecture

The server side of the phenotype database management system (PhD) includes a MySql database server and a Linux Apache Tomcat Web Server. Physically, the database server and the web server are not necessary separated - they can reside on the same machine. In our case, a SUN Microsystem with UNIX OS hosts both database server and web server. The client side of PhD includes development Personal Computers (PCs), lab application PCs and laptops.

MySQL Database Server: MySQL is a popular free ware of relational database that virtually runs on all kinds of OS, such as Windows, Linux and UNIX. It may be downloaded from http://www.mysql.com.

Sun Java Application Server: it is freely available online at http://java.sun.com/.

 

ii. Database Design

An “EAV” database may also use conventional tables when necessary and convenient. Figure 1 is a diagram for the database design for PhD, which includes two parts: EAV design and Entity-Relation (E-R) design. In an EAV database model, attributes with different data types are separated into different tables. Therefore, each EAV table is a data type specific table, suitable for data of various types such as integer, real, string etc.

 

In the EAV design, Trait is the key entity, which stores description for each trait. (Throughout, we use bold and capitalized fonts to indicate a table of the database and a bracket indicates a field of a table). [Trait ID], which is a long integer generated by the system as a unique identifier for a trait, is the primary key of the table Trait. [Trait name], which denotes the name of a trait, must be unique because different traits with the same name will confuse users. [Create User ID] records the user ID of a person who creates the trait. As a rule, a trait can only be modified or deleted by either the user who creates this trait or an administrator.  In correspondence with traits of various data types, such as real number, integer, string with maximum length of 255, string with minimum length of 256, Boolean and date, we define six tables in the database, EAV_Double, EAV_Int, EAV_String_Short, EAV_String_Long, EAV_Boolean and EAV_Date, respectively. The data type of a trait is defined in [Data type] in Trait.

 

Some other tables, such as USER and PROJECT (Figure 1), may be more suitable for E-R design; therefore, these tables are in conventional table form and follow E-R design rules. Because we mainly focus on the EAV design in this paper, we will not elaborate on the E-R design.

 

Figure 1 Database schema

Note: An entity is represented as a rectangle with two sections. Field(s) in the top section of a rectangle denotes primary key(s). FK denotes foreign key.

 

iii. Unified Modeling Language (UML) Design

 

Requirements Analysis

In Molecular and Statistical Genetics field, managements of phenotype traits whether in some available database or not is far from efficient and satisfying. Some notable problems in traditional database where each trait is represented in a separate column in a table are: 1) It’s usually not efficient to add a new trait into a database with fixed structure and phenotype traits are lack of adequate definition; 2) Storage space sometime are not taken advantage of for highly spare data; 3) It’s very hard to flexibly manage the traits since the trait definition and its data are not separated. Recently the Entity-Attribute-Value (EAV) data model originally devised as a general means of information representation in the context of artificial intelligence research become popular in some special fields such as the HELP system and the Columbia-Presbyterian clinical data repository. It is characterized by its Flexibility and Space-efficient storage for highly sparse data. Therefore, in order to efficiently manage phenotypes for Statistical Genetic analysis, we are planning to develop a database with EAV model after much communication with a lot of potential users (Statistic Genetics Investigators). The investigators who directly interact with the system are defined as actors.

The uses-cases for the phenotype management are as follows when actors are a general user:

1.      Define phenotype trait

2.      Define valid values for both quantitative and qualitative traits

3.      Customize phenotype entity forms and input phenotype data by typing

4.      Import phenotype data from other resources

5.      Check whether data are beyond range or not

6.      Remove or update phenotype trait

7.      Query available phenotype data with condition

8.      Jointly compile selected phenotype with pedigree and genotype information and export the results

 

UML Use-case diagram for general users:

 

When actors are an administrator, they are:

1.      Add or remove projects

2.      Add or remove users

3.      Authorize or un-authorize users in projects

4.      Assign and remove traits for projects

 

UML Use-case diagram for administrators:

 

Domain Analysis

After understanding the user’ specifications through use-case analysis, some key classes can be figured out conceptually. It’s just the work of Domain analysis. The key classes in the system are: Trait, Project, User, Subject, PhenotypeManager, EAVManager, FileManager, DatabaseManager. The first four are entity classes and the rest are control classes.

 

Architecture Design
The system will comprise a lot of classes. In order to efficiently manage these classes, they can be classified several categories (packages) technologically.

User-Interface Package.  In this project, Java Server Page (JSP) is responsible to generate graphic interface. Therefore, this package makes up of JSPs rather than some defined classes. However, it has to cooperate with the Business-Objects package, which contains the classes where the data is actually stored.

Business-Objects Package. This includes the domain classes from the analysis model such as Trait, Project, User, Subject, PhenotypeManager and so on. Furthermore, these classed are also separated into two group, entity classes and control classes. The latter focus on complex interaction between the former. The business-object package cooperates with the database package in that all business-object classes operate database by function or classes in Database Package.

Database Package. The Database package supplies services to other classes in the Business-Object package so that they can be stored persistently. The class DatabaseManager provides the direct interface where the business objects can access database. Under the interfaces, some other classes in this package assist in implementing these functions.

Utility Package. The Utility package contains services that are used in other packages in the system. It is used to refer to persistent objects throughout the system including the User-Interface, Business-Object, and Database packages.

 

 

Detailed Design

In this section, all classes and their interaction are described in detail.

The class diagram is as follows (Some classes in Database Package and Utility Package are not shown.)

Four entity classes are User, Project, Trait and Subject. Five important control classes are PhenotypeMnager, Administrator, EAVManager, DatabaseManager and FileManager.

 

Based on the functions involved in phenotype management, six modules are designed where these classes interact with each other.

 

Module one: Define phenotype traits and set control values

Overview:

A graphic interface will be provided for users to set attributes (including the phenotype name) of each new trait. Among these attributes there are two key ones characterizing the EAV model, data type and control type. Data type refers to the format with which phenotype data of a trait will be stored in database. Six types are allowed for in the current version, Double, Integer, Short String, Long String, Boolean and Date. Each trait should be set a data type. Data type of a trait is mainly determined by the feature of the trait and its function. Two kinds of control type are defined as quantitative-trait control and qualitative-trait control. Different kinds of traits based on the control type have different definition of valid values. For quantitative trait, lower and upper bound can be set. Trait values beyond these ranges will be excluded when imported. For qualitative traits, discrete values instead of ranges are needed. Other attributes of a trait are the measure unit, description, creating user and date, where the last two can be filled automatically by system. System will check the definition before storing them into database. Duplicated traits are invalid.

 

Sequence Diagram:

 

Module two: Import phenotypes from other resources

Overview:

In this module, phenotypes will be imported by users conveniently form an Excel or text file with the first containing names of fields. System will recognize each field. If there some fields which the system cannot recognize, some exception will be thrown and all actions will be stopped. On successfully recognizing all fields, new traits, which are not available in database, are captured for users to define them. Until the definition of metadata and control data is complete, system will automatically transform the format of the data from the conventional to the EAV and import them into database. System will obtain trait IDs of the traits from database in a table and replace the traits name as trait IDs; rule out invalid values of the traits based the defined criteria, i.e., control data; select a corresponding EAV table to store the trait values. It turns out that all valid values of each trait are stored into database with EAV format but the invalid are picked out to inform users as a violation-report. Users can check these values in the original file according to the report.

Sequence Diagram:

 

Module Three: Assign phenotype Traits for projects and users

Overview:

Usually, there is more than one project for a laboratory and a project involves a lot of members. Some relations between the projects and traits should be defined. For example, we hope each phenotype 1) belongs to at least one project; 2) can only be used by users of its projects; 3) can only be delete or modified by the user who create it. In this model, we can define our projects and add available phenotypes and users for each project. The phenotypes in the database can be modified and deleted by its creator. Particularly, when a phenotype is deleted, all records in the corresponding EAV data table should be deleted to protect the Referential Integrity. In addition, user can also define some classes in this module. These classes can contribute to organizing the phenotypes systematically.

 

Sequence Diagram:

 

Module Four: Customize the entry forms and type phenotype data

Overview:

Any authorized user can customize his or her interested entry forms of phenotype traits.  After the customization, system can directly show entry forms for a user to input phenotype data by typing. The user can input a series of phenotype data and submit to the Web Server. System will check whether the data are valid or not and transform the valid into EAV format. Finally, valid data will be stored into database with EAV format and the invalid show in front of users.

 

Sequence Diagram:

 

Module Five: updating or deleting phenotype traits

Overview:

User can update and delete available phenotype traits. It’s a routine operation for database management. Some additional alteration is automatically done by the system following users’ changing the data type and control type of a trait. A data type exclusively corresponds to an EAV table in database as stated above. Therefore, an alteration of a trait’s data type from one to another will lead to the automatic deletion of available values of the trait in the former table. Similarly, various control types are uniquely connected with various tables storing control values, quantitative traits with QuantitativeTraitControl and qualitative traits with QualitativeTraitContol. In order to keep integrity between various control types, an auto-deletion by system also takes place succeeding the change of control type.

 

A soft-deletion mechanism was built for the trait deletion. The soft-deletion means that system will not really delete a defined trait after receiving the deletion command. Instead, it just makes the trait unobtainable by routine operation. To be specific, the CreatedUserID of the trait in MetaofTraits will be changed to be \uFFFDC1 and the association between the trait and projects will be canceled (which is indicated in TraitsforProjects). Other parameters in the MetaofTraits table and trait values in corresponding EAV tables will be reserved. Administrators’ filling an available user ID into the CreatedUserID of a soft-deleted trait can restore the trait without loss of information. It should be a safe strategy and thus a rewarding work for most users.

 

Sequence Diagram:

 

Module Six: Ad hoc Query, Jointly compile and Export Phenotypes

Overview:

A powerful query function is built for users to facilitate the retrieval of phenotype data as they expected. Authorized users can choose the phenotypes of interest in a project and easily construct a complex condition for the query simply by clicking some buttons. It is a hard work to efficiently implement this query in the EAV design where database sever cannot directly execute it although it is in a conventional design. Therefore the system rather than the database server has to parse the query, retrieve all involved traits one by one, select and compile the data. Some excellent algorisms such as quick sort and merge sort have to be designed to make the procedure more efficient.

 

Not only phenotype can be queried and compiled, but also pedigree and genotype can be jointly compiled with the phenotype. Users can select some items about pedigree information, marker and phenotype names as they expect in list-boxes. System will search these data in database and re-compile the data. Ultimately, these data will be compiled into a conventional table, where each column represents a select-item and each row list an individual’s information. A hyperlink is also dynamically provided for user to download the compile results in an Excel file.

 

Sequence Diagram: