System Requirements
MAGI runs on both Windows and Linux systems. To date, it has been tested on Windows 98, 2000, and XP, and on Red Hat Linux 7.x. Ideally, at least 200 megabytes of RAM should be available, although the actual requirement increases or decreases with the size of the sequences being processed. Generally, sequences of less than one megabase (one million bases) can be run on systems with less than 200 MB of RAM. Additionally, any Windows system running MAGI should have Sun Microsystems' JRE v1.4.1 SE installed, available at the Sun Java homepage. If you are using GNU/Linux, use the Blackdown version of JRE v1.4.1 instead (the Sun version of the JRE has known fatal bugs when run on Linux systems).
Magein, the accompanying program that pre-processes sequence and table files, can run on both Windows and Linux systems as well. However, the version included in this build of MAGI is compatible with Linux systems only.
Installation
To install MAGI, unzip the compressed file and place the "Magi.jar" file in your Linux bin folder. Extract the "magein" file from the archive and place it in the bin folder as well. You should now be able to run MAGI by double-clicking its icon in the directory explorer, or from a shell by typing "java -jar Magi.jar".
Using the Template File
Included in the compressed file with MAGI is also a file called "template.dat". This file is used to obtain initial values for default folders and settings for MAGI. You can open and change this file to suit your individual needs. The fields are:
exec - The executable file to call Magein.
default_seq - The default name used when a sequence file isn't loaded (note that this name must end in a ".").
folder - The default folder where Glimmer scripts are located. If Glimmer is checked as an algorithm to use in the 'New Session' window (see below), then a dialog will appear asking you to select the appropriate Glimmer script to use.
output - The default folder where MAGI-generated files will be saved. If this is blank, then the subdirectory "pred" will be used.
data - The default folder where datafiles, header files, and sequence files are located. If this is blank, then the subdirectory "data" will be used.
glimmer - The default threshold setting for Glimmer algorithms.
tcode - The default threshold setting for TestCode algorithms.
gscan - The default threshold setting for GeneScan algorithms.
codon - The default threshold setting for Codon Usage algorithms.
Notice that the value set for each field immediately follows the "=", and has no trailing spaces or characters. Any or all of the fields can be left blank (MAGI will use its built-in defaults if that is the case).
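To make the "field=value" layout concrete, here is a minimal Python sketch that reads text in that style. It is not MAGI's own loader: the one-field-per-line layout, the function name parse_template, and the sample values are assumptions made for illustration. Only the field names come from the list above.

```python
# Field names documented for template.dat.
TEMPLATE_FIELDS = ("exec", "default_seq", "folder", "output", "data",
                   "glimmer", "tcode", "gscan", "codon")

def parse_template(text):
    """Return a dict of the documented fields.

    Blank or missing fields map to None, standing in for
    "MAGI will use its built-in defaults".
    """
    settings = {name: None for name in TEMPLATE_FIELDS}
    for line in text.splitlines():
        if "=" not in line:
            continue  # not a field line
        key, value = line.split("=", 1)
        key = key.strip()
        # The value is kept verbatim: it starts right after "=" and any
        # trailing characters would become part of the setting.
        if key in settings and value != "":
            settings[key] = value
    return settings

# Hypothetical file contents, for illustration only.
sample = "exec=magein\ndefault_seq=seq.\noutput=\ntcode=0.75\n"
settings = parse_template(sample)
print(settings)
```

In this sketch the blank "output" field comes back as None, which a caller would replace with the "pred" subdirectory described above.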
MAGI Startup
Start MAGI either by double-clicking its icon or by calling it from the shell. A short splash screen confirms that the program is starting up. Once the program has finished loading, enter a name for your current session (such as "chromosome 4") and press 'OK'.
The next screen that pops up is the 'New Session' window, which lets you pre-process files with the algorithms of your choosing and set some session defaults. Essentially, base predictions are made from this window. From top to bottom, the settings are:
Output Directory - The directory where you want all the data that MAGI generates placed. When making predictions, MAGI generates several files that you may or may not choose to view. This is where they are saved. This is also where all the pre-processed data goes. This field is required.
Sequence File - The location of the sequence file that you want to process. Currently, MAGI supports FASTA and EMBL format files. This field is optional (you do not have to create new predictions every time you open MAGI - you can always just open existing ones).
Glimmer Directory - The directory where all your Glimmer executables are. This is only important if you want to run Glimmer predictions. This field is optional.
Algorithms - Mark the corresponding checkbox for the algorithms you want to use for the predictions. Each algorithm will generate its own results file. If you don't mark any checkbox, then MAGI will assume you don't want to generate new predictions, and just want to open existing ones. These fields are optional.
Header Settings - Mark the checkbox if you want to also generate a header (CDS) file from the sequence file, creating a file with a list of identified ORFs. Check this only if EMBL format files are used. This will split the sequence file into two separate files - "header.dat" and "data.txt" - the former containing the header data, and the latter containing the sequence in FASTA format. This field is optional.
Window Scan Settings - Mark the checkbox if you want to specify the window scan settings used while reading the sequence file. After checking the box, you can set the window length and slide length accordingly (increments are by nucleotide). This field is optional.
ORF Settings - The numerical value in this field is the minimum size, in nucleotides, of any identified gene used during predictions. Genes whose length is below this cutoff are ignored.
After setting the initial parameters, press "OK". If you loaded a sequence file and marked some algorithms for use, then MAGI will call Magein, which will generate the initial prediction files based on the algorithms selected. After a delay that depends on the size of the sequence file, a pop-up will inform you that the process is complete. Press "OK" to go to the operation window. If you left the sequence file field blank and the algorithm checkboxes unmarked, MAGI will load straight to the operation window.
Loading Datafiles
The operation window is the main window in MAGI, and where most of the prediction work will be done. If you created new predictions using the 'New Session' window, you can open the generated datafiles by going to "File > Open Datafile" and browsing to the directory specified in the "Output Directory" field of the 'New Session' window (see "MAGI Startup"). Otherwise, browse to already existing pre-processed prediction files. Datafiles are the files that actually contain base predictions - basically, all the ORFs in a sequence, each with a signal value assigned to it by an algorithm.
If you used MAGI to make your prediction files, open one. The normal format of the file names is "<name_of_sequence_file>_<algorithm>.dat" (e.g. "chr4_codon.dat", "seq102a_gscan.dat"). When a datafile is opened, it appears in the operation window as a tab. If you open more than one datafile, the tabs for the opened datafiles line up from left to right. Click on the tab of a file to jump to that datafile. There are several buttons and fields on each datafile tab, which are discussed later.
Loading Headerfiles
To open a header file, go to "File > Open Header" and browse to the directory which contains the header file you want to open. Note that you can preview the file in a small window to the right of the selection screen. After you select a file, it will appear as a tab, with its label starting with "HEADER - ". Notice that unlike the datafile tabs, the header tab doesn't have many buttons or fields. Also notice that, to prevent confusion, only one header file can be open at a time. In order to open a different header file, you will need to close the existing one first.
Loading Sequences
To open a sequence file, click the "Browse" button and browse to the directory which contains the sequence file you want to open. Then click the "Upload" button to load it into memory; a message will appear informing you that the sequence file has been uploaded.
Pasting DNA Sequence
Alternatively, paste a FASTA format sequence into the text box under the header "Paste DNA Sequence". Although the sequence file is not visible from the operation window, you might need to load it in order to use some of MAGI's features, such as the Skew filter and BLAST preparation.
Sequence Data Section
The sequence data section is a text view of your ORF set. At the main menu, there is an option called "Data Views" that allows you to switch between what you see in each of the datafile tabs. The options are:
Raw Data - View all the ORFs that the algorithm specifies. All the ORFs that are identified are shown here, with the only filter being the ORF length (specified in the 'New Session' window).
Predictions - View all the ORFs, after filters are applied. By default, the only filter applied when a new datafile is opened is no overlapping ORFs. When new filters are added, the predicted ORFs in this set will change.
Header Comparison - If you loaded a header file along with the datafile, you can view your predictions compared to what the header says, which can be useful for calibrating MAGI's filters. In this view, there are three columns - the start, the stop, and the comparison column. In the comparison column a '0.0' equals a match, a '1.0' equals a false positive, and a '-1.0' equals a false negative. If you have loaded a header file, selected this option, and your datafile ORF viewer comes up blank or gray, this means that the header and datafile have not been compared yet. To compare them, press the "Run Analysis" button in the 'Prediction Parameters' section and the comparison will appear.
Prediction Parameters Section
From this section, you can set exactly what filters you want applied to the ORFs predicted in the algorithm's datafile tab. The options are:
Threshold Setting - The threshold setting is a floor value for each ORF's signal. All ORFs whose signal is below this threshold are discarded, and not used in predictions.
Filter Density - Checking this box means that the length:signal ratio of all ORFs will be considered in creating predictions.
Skew - Checking this box means that MAGI will discard ORFs that violate the sequence's cumulative skew pattern (more information about this is available at Filters).
Algorithm Filter - Checking this box means that the current algorithm's predicted ORFs will be compared with the ORFs of an algorithm you specify, before predictions are made (more information about this filter is available at Filters).
In addition to the settings, there are two buttons that can be pressed:
Run Analysis - After setting the filters, press this button to actually generate the predictions. Once generation is complete, the predicted ORFs will appear in the viewer of the 'Sequence Data' section when "Data Views" is set to "Predictions".
Threshold Validation - Pressing this will load another window, allowing you to run a sequential series of prediction analyses (basically, you can run many predictions, each with a different threshold). This may be useful when trying to find the optimum threshold for an organism, but it will only work if you have a header file for the sequence loaded. The window that pops up, the 'Threshold Validation' window, allows you to set a file to save the data, the starting threshold, the stopping threshold, and the increment you want to proceed at. Once those are set, you can press the 'Validate' button to run the series. The resulting file will be a list with each threshold and the accompanying false discovery rate, false positives, and false negatives (which is why a header is required).
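The threshold series can be pictured with a short Python sketch. This is an illustration under stated assumptions, not MAGI's code: ORFs are treated as (start, stop, signal) tuples, a prediction counts as a match only when its start and stop both equal a header entry, and no other filter (overlap, density, skew) is applied. The function names and sample numbers are hypothetical. The false discovery rate is left out because its exact formula is not given here.

```python
def compare_to_header(predicted, header):
    """Count matches, false positives and false negatives by exact (start, stop)."""
    predicted, header = set(predicted), set(header)
    return {
        "matches": len(predicted & header),
        "false_positives": len(predicted - header),
        "false_negatives": len(header - predicted),
    }

def threshold_sweep(orfs, header, start, stop, step):
    """Run one comparison per threshold from start to stop (inclusive)."""
    rows = []
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        threshold = round(start + i * step, 10)
        # ORFs whose signal is below the threshold are discarded.
        kept = [(s, e) for s, e, signal in orfs if signal >= threshold]
        row = {"threshold": threshold}
        row.update(compare_to_header(kept, header))
        rows.append(row)
    return rows

# Hypothetical ORFs (start, stop, signal) and header entries (start, stop).
orfs = [(100, 400, 0.9), (500, 800, 0.4), (900, 1500, 0.7)]
header = [(100, 400), (900, 1500), (2000, 2600)]
rows = threshold_sweep(orfs, header, start=0.25, stop=0.75, step=0.25)
for row in rows:
    print(row)
```

Raising the threshold in this toy data first removes the false positive and then starts to lose true genes, which is the trade-off the 'Threshold Validation' window is meant to expose.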
Prediction Statistics Section
This section contains information about the predictions you generate. Some of the information here applies only when you have a header file loaded:
Analysis Number - The ID number of each prediction, to differentiate. Each analysis number applies only to the currently viewed datafile tab.
Datafile - The name of the file associated with the current datafile tab, to let you know which algorithm you are currently viewing.
Threshold - The threshold setting for this prediction.
Skew - "True" if skew was used for this prediction, "false" otherwise.
Density Filter - "True" if the density filter was used for this prediction, "false" otherwise.
Algorithm Filter - Name of the algorithm used as the algorithm filter on the current datafile tab. Blank if no algorithm filter was used.
Matches - If a header is compared, the number of ORFs predicted by MAGI that match with the header.
False Positives - If a header is compared, the number of ORFs predicted by MAGI that do not appear in the header.
False Negatives - If a header is compared, the number of ORFs not predicted by MAGI that appear in the header.
False Discovery Rate - If a header is compared, the value that represents the overall accuracy of MAGI's prediction, based on the number of matches, false positives, and false negatives.
At the bottom of this section, there are two arrow buttons. These allow you to flip between the separate analyses, and view each individually.
Visuals Section
This section is where graphs of the individual algorithm predictions can be launched. After predictions are generated by pressing the "Run Analysis" button in the 'Prediction Parameters' section, you can use this section to view the predictions visually. If the "Predictions" radio-button is checked, then the graph displayed will be of just the predictions (note that a header is not required if you want to just view the predictions). If the "Comparisons" radio-button is checked, then the graph displayed will be of the predictions and of their comparison to the header file. Pressing "Launch Codon Viewer" will open a new window with the graph.
In addition, there is a second button, labeled "Save Plotting Data". Pressing this converts the prediction data into a text file of graph positions that can be read by a plotting program such as EasyPlot. It will not generate header comparison data, however.
Creating Combined Predictions
After creating the individual predictions by setting the appropriate filters and pressing the "Run Analysis" button, go to "Create > Compile Predictions". A new window will appear, the 'Compile Predictions' window. From this window, several options are available:
All Predictions - If checked, all predictions from all algorithms will be included for combination.
By Occurrence - If checked, and a range entered, only predictions that appear within the range will be used (e.g. if the range is 2 to 3, and 2 algorithms find ORF x, then ORF x is included in the combination process; if only 1 algorithm finds ORF x, then ORF x is not included in the combination process).
By Algorithm - If checked, and an algorithm specified, all ORFs used in combination must have been found by the specified algorithm (e.g. if algorithm Y is specified, and algorithms X, Y, Z find ORF g, then ORF g is included; if algorithm Y is specified and only algorithms X, Z find ORF g, then ORF g is omitted). Currently, up to three different algorithms can be given preference.
Once the parameters have been chosen, press "OK" and a 'Save File' dialog will come up. Browse to the folder where you want to save the combined prediction data, and enter a name to save the file as. It's important to know that the file will be saved in ART format (compatible with EMBL). Press "OK" to save the file. When saving is finished, a pop-up will appear informing you. Combining the predictions and saving them to an ART file is generally the final stage of prediction, and although the term "combining" is used, the combination process can be used with any number of algorithms, from one to many.
Viewing Combined Prediction Files
After saving the combined predictions file, as explained above, you can view it by opening it in a text editor such as Notepad, Vi, or Emacs (it is in plain text). This allows you to view the predictions textually. Alternatively, you can use Artemis to view the final predictions graphically.
Viewing Combined Prediction Graphs
If you don't want to use Artemis to view the final predictions, or want to just view the genes without the need of advanced features, then you can use MAGI's built-in viewers. For combined predictions, there are two possible views:
View ORFs - This will let you view all of the individual algorithm predictions at once, allowing you to see which algorithms found the same ORFs, and which algorithms missed particular ORFs. To open this view, go to "Graphical Views > View ORFs". A new window appears with a format similar to that of the datafile tab visuals.
View Final Predictions - This view is the final prediction view, as it was saved in the combined prediction file created through the 'Compile Predictions' window. To open this view, go to "Graphical Views > View Final Predictions". The format is the same as that used in the datafile tabs' individual algorithm visuals.
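Both the "View ORFs" comparison and the "By Occurrence" and "By Algorithm" options described under "Creating Combined Predictions" come down to asking which algorithms found each ORF. The Python sketch below illustrates that bookkeeping. It is only a sketch under assumptions: ORFs are identified by exact (start, stop) pairs, the function names and sample data are hypothetical, and letting the two options be used together in one call is a convenience of the sketch rather than documented MAGI behavior.

```python
def algorithms_per_orf(predictions):
    """Map each ORF (start, stop) to the set of algorithms that predicted it."""
    found_by = {}
    for algorithm, orfs in predictions.items():
        for orf in orfs:
            found_by.setdefault(orf, set()).add(algorithm)
    return found_by

def combine(predictions, min_count=1, max_count=None, required=None):
    """Select ORFs by occurrence range and, optionally, by a required algorithm."""
    found_by = algorithms_per_orf(predictions)
    if max_count is None:
        max_count = len(predictions)
    combined = []
    for orf, algorithms in found_by.items():
        if not (min_count <= len(algorithms) <= max_count):
            continue  # outside the occurrence range
        if required is not None and required not in algorithms:
            continue  # not found by the specified algorithm
        combined.append(orf)
    return sorted(combined)

# Hypothetical per-algorithm predictions.
predictions = {
    "glimmer": [(100, 400), (900, 1500)],
    "codon": [(100, 400), (500, 800)],
    "gscan": [(100, 400), (900, 1500)],
}
all_predictions = combine(predictions)
by_occurrence = combine(predictions, min_count=2, max_count=3)
by_algorithm = combine(predictions, required="codon")
print(by_occurrence)
print(by_algorithm)
```

With the sample data, the ORF found only by "codon" is dropped by the 2 to 3 occurrence range but kept when "codon" is the required algorithm.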
Threshold Cutoff
When you use the 'New Session' window to create predictions, each ORF is assigned one or more signals in each algorithm's datafile. The threshold cutoff is a method of removing ORFs with lower signals. When creating individual algorithm predictions, all ORFs below the set threshold are discarded from predictions. In addition, if you are using Codon Usage as an algorithm, then the signals are used as an inherent filter. ORFs that are out of frame in Codon Usage - that is, the first signal is less than either or both of the other two signals - are discarded from predictions, and from the creation of the algorithm filter (see below).
Density Filter
Density for each ORF is determined by dividing its length by its signal value (the first signal value, if multiple signal values exist in an algorithm). The density filter is used to rank ORFs in individual algorithm predictions. The density score for each ORF is calculated, and ORFs with the highest score get top priority. Overlaps are checked, and if an overlap between two or more ORFs exists, then the ORF with the highest density score is the one kept, while the rest are discarded.
Skew
Sometimes, the location of genes in a sequence may be visible through its nucleotide skew. Using the skew filter, you can "force" predictions onto one strand or the other, depending on the current slope of the skew. This can be useful for organisms which display strand-clustering characteristics in their gene structure. To set the initial skew details, go to "Session Settings > Skew Parameters". From there you can choose to change either the skew correlation or the skew type. If editing the skew correlation:
Positive - If the slope of the skew is positive at range xy, then all predictions within range xy will appear on the top strand.
Negative - If the slope of the skew is negative at range xy, then all predictions within range xy will appear on the bottom strand.
If editing the skew type:
GC - Base the skew on the bases guanine and cytosine.
AT - Base the skew on the bases adenine and thymine.
Pyrimidine - Base the skew on the pyrimidine bases, thymine and cytosine.
Purine - Base the skew on the purine bases, adenine and guanine.
The default settings are negative correlation, GC skew.
Once you are satisfied with the settings, you can actually calculate the skew by going to "Session Settings > Set Skew Parameters > Run Skew". After the skew calculation is complete, you can check the "Skew" checkbox in the individual algorithm prediction datafile tabs to use it in generating predicted ORFs.
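For readers unfamiliar with cumulative skew, the Python sketch below shows one common way to compute it. It uses the conventional windowed definition, (G - C) / (G + C) summed over successive windows; whether MAGI uses exactly this formula is an assumption, as are the function names, the window and slide values, and the toy sequence. Passing pair=("A", "T") gives the corresponding AT skew.

```python
def cumulative_skew(sequence, window, slide, pair=("G", "C")):
    """Return (window_start, cumulative_skew) points along the sequence."""
    first, second = pair
    total = 0.0
    points = []
    for start in range(0, len(sequence) - window + 1, slide):
        chunk = sequence[start:start + window].upper()
        a, b = chunk.count(first), chunk.count(second)
        if a + b:
            total += (a - b) / (a + b)  # skew of this window
        points.append((start, total))
    return points

def slope_signs(points):
    """Sign of the cumulative skew slope between consecutive points (+1, 0, -1)."""
    signs = []
    for (_, previous), (position, current) in zip(points, points[1:]):
        signs.append((position, (current > previous) - (current < previous)))
    return signs

# Toy sequence: G-rich first half, C-rich second half.
sequence = "GGGGCGGGGC" + "CCCCGCCCCG"
points = cumulative_skew(sequence, window=5, slide=5)
print(points)
print(slope_signs(points))
```

In the toy sequence the cumulative skew rises over the G-rich half and falls over the C-rich half, so the slope changes sign near the midpoint. A change of slope like this is what the skew filter uses to decide which strand predictions in a region belong to.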
When calculating the skew, MAGI automatically creates buffer regions between areas where the skew switches from a positive to a negative slope, and vice versa. This is meant to reduce errors in determining the exact switch point, since placing the switch at an incorrect position can increase false positives and/or false negatives when the skew is used.
Algorithm Filter
The algorithm filter allows you to use the ORFs of one algorithm to filter out the ORFs of another algorithm (e.g. using Codon Usage's inherent out-of-frame filtering to remove excess GeneScan ORFs). To use this filter, first load the datafile of the algorithm which you want to use as a filter. Then, go to "Session Settings > Set Algorithm-based Filter". A dialog box will appear with a drop-down box allowing you to choose which algorithm datafile you want to use as the filter. It also contains a field you can change to limit the threshold of the ORFs used from that algorithm. Select the algorithm datafile you want to use, enter any threshold you want to set (the default is 0.0), and press "OK". Now, to apply this filter to any other algorithm datafile, check the "Algorithm Filter" checkbox in the individual datafile tabs.
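One plausible reading of the algorithm filter is sketched below in Python: an ORF from the current algorithm is kept only if the filter algorithm also reports an ORF with the same start and stop whose signal meets the filter threshold. The exact matching rule MAGI applies is not spelled out above, so the exact-coordinate match, the function name, and the sample data are all assumptions.

```python
def algorithm_filter(orfs, filter_orfs, filter_threshold=0.0):
    """Keep ORFs (start, stop, signal) that the filter algorithm also found."""
    # ORFs of the filter algorithm that pass its own threshold.
    allowed = {(start, stop) for start, stop, signal in filter_orfs
               if signal >= filter_threshold}
    return [orf for orf in orfs if (orf[0], orf[1]) in allowed]

# Hypothetical datafiles: GeneScan ORFs filtered by Codon Usage ORFs.
gscan = [(100, 400, 0.8), (500, 800, 0.6), (900, 1500, 0.9)]
codon = [(100, 400, 0.7), (900, 1500, 0.2)]
kept_default = algorithm_filter(gscan, codon)
kept_strict = algorithm_filter(gscan, codon, filter_threshold=0.5)
print(kept_default)
print(kept_strict)
```

Raising the filter threshold from the default 0.0 to 0.5 removes the low-signal Codon Usage ORF from the filter, so the matching GeneScan ORF is no longer kept.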
Gautam Aggarwal, gaggarwal@sbri.org
Peter Myler, mylerpj@sbri.org
Eithon Cadag, ecadag@sbri.org
For post, remit to:
Seattle Biomedical Research Institute
307 Westlake Ave, Suite 500
Seattle, WA 98109