
DomClust: Hierarchical Clustering for Orthologous Domain Classification

                          --------------------

Author: Ikuo Uchiyama
Laboratory of Genome Informatics
National Institute for Basic Biology
National Institutes of Natural Sciences
E-mail: uchiyama@nibb.ac.jp

Please cite the following reference:

Uchiyama, I.: Hierarchical clustering algorithm for comprehensive
orthologous domain classification in multiple genomes.
Nucleic Acids Res. 34, 647-658 (2006)

=============================================================
LICENSE

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

=============================================================
INTRODUCTION

DomClust is a hierarchical clustering tool for orthologous grouping
in multiple genomes. The method takes as input all-against-all
similarity data and classifies genes based on the traditional
hierarchical clustering algorithm UPGMA. In the course of clustering,
the method detects domain fusion or fission events, and splits
clusters into domains if required. The subsequent procedure splits
the resulting trees such that intra-species paralogous genes are
divided into different groups so as to create plausible orthologous
groups. As a result, the procedure can split genes into the domains
minimally required for ortholog grouping. 

DomClust outputs a set of hierarchical clustering trees, which may
overlap with each other. Overlapping trees arise from domain
fusion/fission events and are a distinctive feature of the
DomClust program.

DomClust has been used for classifying more than 200 microbial
genomes in MBGD (Microbial genome database for comparative analysis;
http://mbgd.genome.ad.jp/).

=============================================================

1. INSTALLATION

Simply type 'make' to compile the program. Compiled executables
are produced in a directory named "bin". If Perl is installed on
your computer, you can test the program by typing 'make test'.

You can also directly invoke the program as follows:

% bin/domclust tst/test.hom tst/test.gene

=============================================================
2. INPUT FILES

DomClust requires two files as input:

1) A gene information file containing a list of genes to be classified
and their information.  The file must be space-delimited and contain
the following data in this order: organism_name, gene_name,
sequence_length (a.a.), position (or order), and direction (1/-1).
Only the first three columns are required; the last two are optional
and are not used in the current default setting.

Example:
bha BH0001 449 1 1
bha BH0002 380 2 1
bha BH0003 73 3 1
bha BH0004 371 4 1
bha BH0005 89 5 1

See also the "tst/test.gene" file.

2) A similarity file containing similarity relationships between
genes. The file must be space-delimited and must contain the following
data in this order: geneid1, geneid2, from1, to1, from2,
to2, distance, score. Each gene identifier must consist of an organism
name, a colon, and a gene name, and the combination of organism name
and gene name must appear in the gene information file.

Example:
bha:BH2418 bsu:POLC 3 1433 2 1437 37 5289
bha:BH0126 bsu:RPOB 1 1163 1 1162 13 4988
bsu:CLPE sau:SA1400 316 586 4 266 216 50
bsu:FLIK sau:SA0520 237 277 133 173 163 47

See also the "tst/test.hom" file.

A set of simple sample scripts for constructing input files from a
FASTA-formatted sequence file is given in the "build_input"
directory. You can also obtain similarity relationships from the
MBGD server at http://mbgd.genome.ad.jp/getHomology.
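
The two input formats described above can be read with a few lines of
code. The following is a minimal Python sketch, not part of the
DomClust distribution; the names Gene, Hit, parse_gene_line,
parse_hit_line and unknown_gene_ids are hypothetical. It assumes that
position and direction are integers, as in the sample lines, and it
reports gene identifiers in the similarity data that are missing from
the gene list.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    organism: str
    name: str
    length: int                      # sequence length in amino acids
    position: Optional[int] = None   # optional 4th column (assumed integer)
    direction: Optional[int] = None  # optional 5th column (1 or -1)


class Hit(NamedTuple):
    gene1: str
    gene2: str
    from1: int
    to1: int
    from2: int
    to2: int
    distance: float
    score: float


def parse_gene_line(line: str) -> Gene:
    """Parse one whitespace-delimited line of the gene information file."""
    f = line.split()
    if len(f) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    position = int(f[3]) if len(f) > 3 else None
    direction = int(f[4]) if len(f) > 4 else None
    return Gene(f[0], f[1], int(f[2]), position, direction)


def parse_hit_line(line: str) -> Hit:
    """Parse one whitespace-delimited line of the similarity file."""
    f = line.split()
    if len(f) != 8:
        raise ValueError(f"expected 8 columns: {line!r}")
    return Hit(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
               float(f[6]), float(f[7]))


def unknown_gene_ids(genes, hits):
    """Return identifiers used in hits that are absent from the gene list."""
    known = {f"{g.organism}:{g.name}" for g in genes}
    used = {gid for h in hits for gid in (h.gene1, h.gene2)}
    return sorted(used - known)


# The sample lines shown above.
genes = [parse_gene_line(s) for s in (
    "bha BH0001 449 1 1",
    "bha BH0002 380 2 1",
    "bha BH0003 73 3 1",
    "bha BH0004 371 4 1",
    "bha BH0005 89 5 1",
)]
hits = [parse_hit_line("bha:BH0126 bsu:RPOB 1 1163 1 1162 13 4988")]

# The five sample genes do not include the genes in the sample hit.
print(unknown_gene_ids(genes, hits))  # ['bha:BH0126', 'bsu:RPOB']
```

An empty list from unknown_gene_ids means every identifier in the
similarity data has a matching entry in the gene list.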

Optionally, the following file can also be specified:

3) A tree file containing sets of phylogenetically related organisms.
A set of related organisms must be comma-separated, and be enclosed in
braces, like {organism_name1, organism_name2, ... }, where organism
names must be the ones listed in the gene information file.
This information is used for correcting the effective number of
organisms contained in a cluster, such that organisms belonging to
the same group are counted only once.

Example:
{bsu,bha}
{eco,ecz}
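
To illustrate the idea of counting related organisms only once, here
is a minimal Python sketch. It is not DomClust's own implementation,
and the actual weighting used by the program may differ. The function
names are hypothetical, and the sketch assumes one flat group per
line, as in the example above.

```python
def parse_tree_lines(lines):
    """Parse lines such as '{bsu,bha}' into a list of organism sets."""
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not (line.startswith("{") and line.endswith("}")):
            raise ValueError(f"expected a brace-enclosed list: {raw!r}")
        names = [n.strip() for n in line[1:-1].split(",")]
        groups.append(frozenset(n for n in names if n))
    return groups


def effective_organism_count(organisms, groups):
    """Count organisms, treating members of the same group as one."""
    representative = {}
    for index, group in enumerate(groups):
        for organism in group:
            representative[organism] = ("group", index)
    # Organisms outside every group are counted individually.
    keys = {representative.get(o, ("organism", o)) for o in organisms}
    return len(keys)


groups = parse_tree_lines(["{bsu,bha}", "{eco,ecz}"])
# bsu and bha fall in one group, so four organisms count as three.
print(effective_organism_count(["bsu", "bha", "sau", "oih"], groups))  # 3
```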

=============================================================
3. DOMCLUST COMMAND

Usage: domclust [options ... ] similarity_file gene_file

Options: (default values are indicated in square brackets)
    -S     use similarity as a measure of relatedness [on]
    -d     use distance (or dissimilarity) as a measure of relatedness
    -c#    cutoff score/distance (can also be specified as -S# or -d#) [60]
    -m#    score/distance for missing relationships (m<c)
    -mr#   specify a missing score as a ratio to c (0<mr<1) [0.95]
    -C#    cutoff score for domain split (c<=C)
    -V#    alignment coverage for domain split (0<=V<=1)
    -n#    minimum # of organisms in clusters to be output [2]
    -ne#   minimum # of entries in clusters to be output [2]
    -p#    ratio of phylogenetic pattern overlap for tree cutting [0.5]
    -H     homology clustering (i.e. skip the tree cutting)
    -ai#   member overlap for absorbing adjacent small clusters (0<=ai<=1)
    -ao#   member overlap for merging adjacent clusters (0<=ao<=ai)
    -t<fn> use a tree file for weighting related genomes (see above)
    -R<fn> restore from dump file (see below)
    -o#    output format (see below)

NOTE: Currently the default cutoff score is set based on the JTT-PAM250
scoring matrix in 1/3 bit units. So you may have to change the default
setting when you use other scoring systems.
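
The inequalities given in the option list can be checked before a
run. The following Python sketch simply restates those constraints as
written above; it is not part of DomClust, the function names are
hypothetical, and it makes no claim about how the program itself
validates its arguments. When -ai is not given, the sketch assumes an
upper bound of 1 for -ao.

```python
def missing_score(c=60.0, mr=0.95):
    """Score for missing relationships given as a ratio of the cutoff (-mr)."""
    return mr * c


def check_options(c=60.0, m=None, mr=0.95, C=None, V=None, ai=None, ao=None):
    """Return a list of violated constraints (empty if none are violated)."""
    problems = []
    if not 0 < mr < 1:
        problems.append("-mr must satisfy 0 < mr < 1")
    if m is not None and not m < c:
        problems.append("-m must satisfy m < c")
    if C is not None and not c <= C:
        problems.append("-C must satisfy c <= C")
    if V is not None and not 0 <= V <= 1:
        problems.append("-V must satisfy 0 <= V <= 1")
    if ai is not None and not 0 <= ai <= 1:
        problems.append("-ai must satisfy 0 <= ai <= 1")
    if ao is not None:
        upper = 1 if ai is None else ai  # assumption when -ai is absent
        if not 0 <= ao <= upper:
            problems.append("-ao must satisfy 0 <= ao <= ai")
    return problems


print(check_options())                # no violations with the defaults
print(check_options(c=60, C=50))      # the domain-split cutoff is below c
print(check_options(ai=0.5, ao=0.8))  # ao exceeds ai
```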

=============================================================
4. OUTPUT FORMAT
You can specify an output format with the -o option. The formats
most likely to be useful are as follows.


A. Default format
A simple list of cluster members in the format
"organism:genename(domain) from to". When the program splits a
protein, domain numbers are assigned in order from the N-terminus
to the C-terminus.

Example:
Cluster 334
bsu:MTLA(2) 479 610
sau:SA1962 1 144
bha:BH3852 1 145
oih:OB2601 1 145
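
The default output can be loaded into a dictionary for further
processing. The following is a minimal Python sketch based only on the
example above; parse_default_output is a hypothetical name, and the
sketch assumes that gene names contain no parentheses or whitespace.

```python
import re

# "organism:genename(domain) from to"; the "(domain)" part is optional.
MEMBER = re.compile(
    r"^(?P<org>[^:\s]+):(?P<gene>[^\s(]+)(?:\((?P<dom>\d+)\))?"
    r"\s+(?P<start>\d+)\s+(?P<end>\d+)$"
)


def parse_default_output(text):
    """Map each cluster id to a list of (org, gene, domain, from, to)."""
    clusters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Cluster "):
            current = line.split()[1]
            clusters[current] = []
            continue
        m = MEMBER.match(line)
        if m is None or current is None:
            raise ValueError(f"unexpected line: {raw!r}")
        domain = int(m["dom"]) if m["dom"] else None
        clusters[current].append(
            (m["org"], m["gene"], domain, int(m["start"]), int(m["end"]))
        )
    return clusters


sample = """Cluster 334
bsu:MTLA(2) 479 610
sau:SA1962 1 144
bha:BH3852 1 145
oih:OB2601 1 145
"""
for member in parse_default_output(sample)["334"]:
    print(member)
```

Members without a domain number are stored with None as the domain.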


B. Tree format (-o1)
A text display of the dendrograms of the clustering trees.

Example:
Cluster 334
  +- bsu:MTLA(2) 479 610
+-| 323.0
  | +- sau:SA1962 1 144
  +-| 342.0
    | +- bha:BH3852 1 145
    +-| 371.0
      +- oih:OB2601 1 145


C. Newick format without branch lengths (-o2),
                 or with branch lengths (-o3)

The Newick format represents a tree by nested parentheses. In this
format, the ":" between a species name and a gene name is replaced
with "_", because ":" is not allowed in a name in the Newick format.
The domain number is also written, preceded by a hash mark (#). With
the "-o3" option, branch lengths are also added, but currently this
option is effective only when "distance" is used as a measure of
relatedness (with "-d").

Example (without lengths):
Cluster 334
(bsu_MTLA#2, (sau_SA1962, (bha_BH3852, oih_OB2601)));

Example (with lengths):
Cluster 552
(bsu_MTLA#2:102.0 , (sau_SA1962:83.5 , (bha_BH3852:66.0 , oih_OB2601:66.0 ):17.5 ):18.5 );
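
When reading the Newick output back, the leaf labels have to be
converted to the usual "organism:genename" form. The following Python
sketch shows one way to do this. It is not a general Newick parser,
the function names are hypothetical, and it assumes that organism
names contain no "_" and that gene names contain no "#".

```python
import re

# A leaf label follows "(" or "," and ends at ":", ",", ")", ";" or a space.
LEAF = re.compile(r"[(,]\s*([^(),:;\s]+)")


def decode_leaf(label):
    """Turn 'bsu_MTLA#2' into ('bsu:MTLA', 2); domain is None if absent."""
    name, _, domain = label.partition("#")
    organism, _, gene = name.partition("_")
    return f"{organism}:{gene}", int(domain) if domain else None


def newick_leaves(newick):
    """Return the decoded leaves of one tree in left-to-right order."""
    return [decode_leaf(label) for label in LEAF.findall(newick)]


tree = "(bsu_MTLA#2, (sau_SA1962, (bha_BH3852, oih_OB2601)));"
print(newick_leaves(tree))
```

The same function also accepts trees with branch lengths, because a
label is cut at the first ":".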


D. Cluster table format (-o5)
A tab-delimited tabular format in which each row contains all members
of each cluster. The first column of the table contains a cluster id
and the following columns contain space-delimited lists of the member
genes in each species.

Example:
334	bha:BH3852	bsu:MTLA(2)	oih:OB2601	sau:SA1962
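
A row of the cluster table can be split into per-organism gene lists
as follows. This is a minimal Python sketch based on the description
above; parse_cluster_table_row is a hypothetical name, and the sketch
assumes that the table has no header row.

```python
def parse_cluster_table_row(row):
    """Return (cluster_id, {organism: [member, ...]}) for one table row."""
    columns = row.rstrip("\n").split("\t")
    members = {}
    for column in columns[1:]:
        # A column may hold several space-delimited genes of one species.
        for entry in column.split():
            organism = entry.split(":", 1)[0]
            members.setdefault(organism, []).append(entry)
    return columns[0], members


row = "334\tbha:BH3852\tbsu:MTLA(2)\toih:OB2601\tsau:SA1962"
cluster_id, members = parse_cluster_table_row(row)
print(cluster_id, members)
```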


E. Graph format (-o6)
Outputs the parent-child node connections of the hierarchical trees.
For graphical display, you can convert this output into the "dot"
format of the GraphViz (http://www.graphviz.org/) package with the
Perl script "convgraph.pl" in the "Script" directory.

Example:
Cluster 334
21205 15267
21205 21008
15267 bsu:MTLA
15267 15266 L
21008 sau:SA1962
21008 20697
20697 bha:BH3852
20697 oih:OB2601

Example session:
domclust tst/test.hom tst/test.gene  -o6 | agrep -dCluster bsu:MTLA | \
	Script/convgraph.pl | dotty -

[ To execute this, you also need the agrep command included in the
Glimpse package as well as the GraphViz package. ]
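
To show what such a conversion involves, the following Python sketch
turns the parent-child lines into "dot" text. It is not a
reimplementation of convgraph.pl, and graph_to_dot is a hypothetical
name. The meaning of the optional third column (such as "L") is not
described here, so the sketch ignores it.

```python
def graph_to_dot(text):
    """Convert graph-format output into one 'digraph' block per cluster."""
    out = []
    in_cluster = False
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] == "Cluster":
            if in_cluster:
                out.append("}")
            out.append(f'digraph "Cluster {fields[1]}" {{')
            in_cluster = True
            continue
        if not in_cluster or len(fields) < 2:
            raise ValueError(f"unexpected line: {raw!r}")
        parent, child = fields[0], fields[1]  # extra columns are ignored
        out.append(f'  "{parent}" -> "{child}";')
    if in_cluster:
        out.append("}")
    return "\n".join(out)


sample = """Cluster 334
21205 15267
21205 21008
15267 bsu:MTLA
15267 15266 L
21008 sau:SA1962
21008 20697
20697 bha:BH3852
20697 oih:OB2601
"""
print(graph_to_dot(sample))
```

Node names are quoted because identifiers such as "bsu:MTLA" contain
a colon, which has a special meaning in the "dot" language.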


F. Table format (-o9)
A simple space-delimited tabular format containing the following
columns: cluster_id, "organism:genename", domain_id, from, to.

Example:
334 bsu:MTLA 2 479 610
334 sau:SA1962 1 1 144
334 bha:BH3852 1 1 145
334 oih:OB2601 1 1 145
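
Because every row has the same five columns, this format is easy to
group by cluster. The following is a minimal Python sketch;
parse_table_output and segment_length are hypothetical names, and
segment_length assumes that the from and to coordinates are both
inclusive.

```python
from collections import defaultdict


def parse_table_output(text):
    """Group rows by cluster id as (gene, domain_id, from, to) tuples."""
    clusters = defaultdict(list)
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns: {raw!r}")
        cluster_id, gene, domain_id, start, end = fields
        clusters[cluster_id].append(
            (gene, int(domain_id), int(start), int(end))
        )
    return dict(clusters)


def segment_length(start, end):
    """Length of a segment, assuming inclusive coordinates."""
    return end - start + 1


sample = """334 bsu:MTLA 2 479 610
334 sau:SA1962 1 1 144
334 bha:BH3852 1 1 145
334 oih:OB2601 1 1 145
"""
for gene, domain_id, start, end in parse_table_output(sample)["334"]:
    print(gene, domain_id, segment_length(start, end))
```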


G. Dump format (-o10)
Dumps all internal information of DomClust into a text file
after the clustering procedure is completed, but before
the post-clustering procedures are executed. You can restore the
information with the -R<dump_file> option.  This is particularly
useful when you want to run the program several times while changing
only parameters that affect the post-processing phase, including the
output format; among the options above, these are -n, -ne, -p, -H,
-ai, -ao, -t and -o.

Example session:
domclust homfile genefile -o10 > dumpfile
domclust -Rdumpfile > result1
domclust -Rdumpfile -p0.6 -o1 > result2
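
Repeated runs from one dump file can be scripted. The following
Python sketch only builds the command lines, using the -R, -p and -o
options shown above; restore_commands is a hypothetical helper, the
-p values are arbitrary illustrations, and the sketch assumes that
"domclust" can be found on the command search path.

```python
def restore_commands(dump_file, p_values, output_format=1,
                     program="domclust"):
    """Build one argument list per -p value, restoring from a dump file."""
    return [
        [program, f"-R{dump_file}", f"-p{p}", f"-o{output_format}"]
        for p in p_values
    ]


for command in restore_commands("dumpfile", [0.5, 0.6, 0.7]):
    print(" ".join(command))
    # To actually run each command and save its output, one could use
    # the standard subprocess module, for example:
    #   with open("result.txt", "w") as out:
    #       subprocess.run(command, stdout=out, check=True)
```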
