|
I. Architecture The
server side of the phenotype database management system (PhD) includes a MySql database server and a Linux Apache Tomcat Web
Server. Physically, the database server and the web server are not necessary
separated - they can reside on the same machine. In our case, a SUN Microsystem with UNIX OS hosts both database server and
web server. The client side of PhD includes development Personal Computers
(PCs), lab application PCs and laptops. MySQL Database Server: MySQL is a popular
free ware of relational database that virtually runs on all kinds of OS, such
as Windows, Linux and UNIX. It may be downloaded from http://www.mysql.com. Sun Java Application Server: it is freely available online at http://java.sun.com/. ii.
Database Design An
“EAV” database may also use conventional tables when necessary and convenient.
Figure 1 is a diagram for the database design for PhD, which includes two
parts: EAV design and Entity-Relation (E-R) design. In an EAV database model,
attributes with different data types are separated into different tables.
Therefore, each EAV table is a data type specific table, suitable for data of
various types such as integer, real, string etc. In the EAV design, Trait
is the key entity, which stores description for each trait. (Throughout,
we use bold and capitalized fonts to indicate a table of the database and a
bracket indicates a field of a table). [Trait ID], which is a long integer
generated by the system as a unique identifier for a trait, is the primary
key of the table Trait. [Trait name], which
denotes the name of a trait, must
be unique because different traits with the same name will confuse users.
[Create User ID] records the user ID of a person who creates the trait. As a
rule, a trait can only be modified or deleted by either the user who creates
this trait or an administrator. In
correspondence with traits of various data types, such as real number,
integer, string with maximum length of 255, string with minimum length of
256, Boolean and date, we define six tables in the database, EAV_Double,
EAV_Int, EAV_String_Short, EAV_String_Long, EAV_Boolean and EAV_Date, respectively. The data
type of a trait is defined in [Data type] in Trait. Some
other tables, such as USER and PROJECT (Figure 1), may be more suitable for
E-R design; therefore, these tables are in conventional table form and follow
E-R design rules. Because we mainly focus on the EAV design in this paper, we
will not elaborate on the E-R design. Figure 1 Database schema
Note:
An
entity is represented as a rectangle with two sections. Field(s) in the top
section of a rectangle denotes primary key(s). FK denotes foreign key. iii. Unified Modeling Language (UML) Design Requirements Analysis In Molecular and Statistical Genetics field,
managements of phenotype traits whether in some available database or not is
far from efficient and satisfying. Some notable problems in traditional
database where each trait is represented in a separate column in a table are:
1) It’s usually not efficient to add a new trait into a database with fixed
structure and phenotype traits are lack of adequate definition; 2) Storage
space sometime are not taken advantage of for highly spare data; 3) It’s very
hard to flexibly manage the traits since the trait definition and its data
are not separated. Recently the
Entity-Attribute-Value (EAV) data model originally devised as a general means
of information representation in the context of artificial intelligence
research become popular in some special fields such as the HELP system and
the Columbia-Presbyterian clinical data repository.
It is characterized by its Flexibility and Space-efficient storage for highly
sparse data. Therefore, in order to efficiently manage phenotypes for
Statistical Genetic analysis, we are planning to develop a database with EAV
model after much communication with a lot of potential users (Statistic
Genetics Investigators). The investigators who directly interact with the
system are defined as actors. The uses-cases for the phenotype management are
as follows when actors are a general user: 1.
Define phenotype trait 2.
Define valid values for both quantitative
and qualitative traits 3.
Customize phenotype entity forms and input
phenotype data by typing 4.
Import phenotype data from other resources 5.
Check whether data are beyond range or not 6.
Remove or update phenotype trait 7.
Query available phenotype data with
condition 8.
Jointly compile selected phenotype with pedigree
and genotype information and export the results UML Use-case diagram for general users:
When actors are an administrator, they are: 1.
Add or remove projects 2.
Add or remove users 3.
Authorize or un-authorize users in
projects 4.
Assign and remove traits for projects UML Use-case diagram for administrators:
Domain Analysis After
understanding the user’ specifications through use-case analysis, some key
classes can be figured out conceptually. It’s just the work of Domain
analysis. The key classes in the system are: Trait, Project, User, Subject, PhenotypeManager, EAVManager, FileManager, DatabaseManager.
The first four are entity classes and the rest are control classes. Architecture Design User-Interface Package. In this
project, Java Server Page (JSP) is responsible to generate graphic interface.
Therefore, this package makes up of JSPs rather
than some defined classes. However, it has to cooperate with the
Business-Objects package, which contains the classes where the data is
actually stored. Business-Objects Package. This includes the domain classes from the
analysis model such as Trait, Project, User, Subject, PhenotypeManager and so on. Furthermore, these classed are also
separated into two group, entity classes and control classes. The latter
focus on complex interaction between the former. The business-object package
cooperates with the database package in that all business-object classes
operate database by function or classes in Database Package. Database Package. The Database package supplies services to other
classes in the Business-Object package so that they can be stored
persistently. The class DatabaseManager provides
the direct interface where the business objects can access database. Under
the interfaces, some other classes in this package assist in implementing
these functions. Utility Package. The Utility package contains services that are
used in other packages in the system. It is used to refer to persistent
objects throughout the system including the User-Interface, Business-Object,
and Database packages. Detailed Design In
this section, all classes and their interaction are described in detail. The
class diagram is as follows (Some classes in Database Package and Utility Package are not shown.)
Four
entity classes are User, Project, Trait and Subject. Five important control
classes are PhenotypeMnager, Administrator, EAVManager, DatabaseManager and
FileManager. Based
on the functions involved in phenotype management, six modules are designed
where these classes interact with each other. Module
one: Define phenotype traits and set control values Overview: A graphic interface will be provided for users to
set attributes (including the phenotype name) of each new trait. Among these
attributes there are two key ones characterizing the EAV model, data type and
control type. Data type refers to the format with which phenotype data of a
trait will be stored in database. Six types are allowed for in the current
version, Double, Integer, Short String, Long String, Boolean and Date. Each
trait should be set a data type. Data type of a trait is mainly determined by
the feature of the trait and its function. Two kinds of control type are
defined as quantitative-trait control and qualitative-trait control.
Different kinds of traits based on the control type have different definition
of valid values. For quantitative trait, lower and upper bound can be set.
Trait values beyond these ranges will be excluded when imported. For
qualitative traits, discrete values instead of ranges are needed. Other
attributes of a trait are the measure unit, description, creating user and
date, where the last two can be filled automatically by system. System will
check the definition before storing them into database. Duplicated traits are
invalid. Sequence
Diagram:
Module two:
Import phenotypes from other resources Overview: In
this module, phenotypes will be imported by users conveniently form an Excel or text file with the first containing names
of fields. System will recognize each field. If there some fields which the
system cannot recognize, some exception will be thrown and all actions will
be stopped. On successfully recognizing all fields, new traits, which are not
available in database, are captured for users to define them. Until the
definition of metadata and control data is complete, system will
automatically transform the format of the data from the conventional to the
EAV and import them into database. System will obtain trait IDs of the traits
from database in a table and replace the traits name as trait IDs; rule out
invalid values of the traits based the defined criteria, i.e., control data;
select a corresponding EAV table to store the trait values. It turns out that
all valid values of each trait are stored into database with EAV format but
the invalid are picked out to inform users as a violation-report. Users can
check these values in the original file according to the report. Sequence Diagram:
Module Three: Assign phenotype Traits for
projects and users
Overview: Usually, there is more than one project for a laboratory
and a project involves a lot of members. Some relations between the projects
and traits should be defined. For example, we hope each phenotype 1) belongs
to at least one project; 2) can only be used by users of its projects; 3) can
only be delete or modified by the user who create it. In this model, we can
define our projects and add available phenotypes and users for each project.
The phenotypes in the database can be modified and deleted by its creator.
Particularly, when a phenotype is deleted, all records in the corresponding
EAV data table should be deleted to protect the Referential Integrity. In
addition, user can also define some classes in this module. These classes can
contribute to organizing the phenotypes systematically. Sequence Diagram:
Module Four: Customize the entry forms and type phenotype dataOverview: Any authorized user can customize his or her
interested entry forms of phenotype traits.
After the customization, system can directly show entry forms for a
user to input phenotype data by typing. The user can input a series of
phenotype data and submit to the Web Server. System will check whether the
data are valid or not and transform the valid into EAV format. Finally, valid
data will be stored into database with EAV format and the invalid show in
front of users. Sequence Diagram:
Module
Five: updating or deleting phenotype traits Overview: User
can update and delete available phenotype traits. It’s a routine operation
for database management. Some additional alteration is automatically done by
the system following users’ changing the data type and control type of a
trait. A data type exclusively corresponds to an EAV table in database as
stated above. Therefore, an alteration of a trait’s data type from one to another
will lead to the automatic deletion of available values of the trait in the
former table. Similarly, various control types are uniquely connected with
various tables storing control values, quantitative traits with QuantitativeTraitControl and qualitative traits
with QualitativeTraitContol. In order
to keep integrity between various control types, an auto-deletion by system
also takes place succeeding the change of control type. A soft-deletion mechanism was built for
the trait deletion. The soft-deletion means that system will not really
delete a defined trait after receiving the deletion command. Instead, it just
makes the trait unobtainable by routine operation. To be specific, the CreatedUserID of the trait in MetaofTraits will be changed to be \uFFFDC1 and
the association between the trait and projects will be canceled (which is
indicated in TraitsforProjects).
Other parameters in the MetaofTraits table
and trait values in corresponding EAV tables will be reserved. Administrators’
filling an available user ID into the CreatedUserID
of a soft-deleted trait can restore the trait without loss of information. It
should be a safe strategy and thus a rewarding work for most users. Sequence Diagram:
Module Six: Ad hoc Query, Jointly compile
and Export Phenotypes
Overview: A powerful query function is built for users to
facilitate the retrieval of phenotype data as they expected. Authorized users
can choose the phenotypes of interest in a project and easily construct a
complex condition for the query simply by clicking some buttons. It is a hard
work to efficiently implement this query in the EAV design where database
sever cannot directly execute it although it is in a conventional design.
Therefore the system rather than the database server has to parse the query,
retrieve all involved traits one by one, select and compile the data. Some
excellent algorisms such as quick sort and merge sort have to be designed to
make the procedure more efficient. Not only phenotype can be queried and
compiled, but also pedigree and genotype can be jointly compiled with the
phenotype. Users can select some items about pedigree information, marker and
phenotype names as they expect in list-boxes. System will search these data
in database and re-compile the data. Ultimately, these data will be compiled
into a conventional table, where each column represents a select-item and
each row list an individual’s information. A hyperlink is also dynamically
provided for user to download the compile results in an Excel file. Sequence Diagram:
|