The COG datasets used in the DomClust paper.

Please see the paper:
    Uchiyama, I.: Hierarchical clustering algorithm for comprehensive
    orthologous domain classification in multiple genomes.
    Nucleic Acids Res. 34, 647-658 (2006)

You also need the DomClust program, available at:
    http://mbgd.genome.ad.jp/domclust/
(please download the domclust.tgz file)

============
1. Archives

cog02.tgz:
    The COG02 dataset used in the DomClust paper
    (the 2002 version of the COGs database).
cog03.tgz:
    The COG03 dataset used in the DomClust paper
    (the 2003 version of the COGs database).

To extract the files, type
    tar xvfz cog02.tgz
if the 'tar' command on your machine supports the -z option.
Alternatively, type
    gzip -d -c cog02.tgz | tar xvf -

============
2. Files

Each archive contains the following files:

cog.seq:
    The protein sequences used for constructing the original COG
    database, which were obtained from the NCBI ftp site.
cog.tab:
    Gene information file that can be used as input to the DomClust
    program. The columns are organism_name, gene_name,
    sequence_length (a.a.), position (or order), and direction (1/-1)
    (see the README file of the DomClust program). The last two
    columns are not currently used by the program and are filled
    with arbitrary values here.
selout_cog:
    The result of the all-against-all similarity search using the
    "cog.seq" file. The protocol for calculating the similarities is
    described in the DomClust paper. The columns are geneid1, geneid2,
    from1, to1, from2, to2, distance, score (see the README file of
    the DomClust program).
cog.tax:
    The taxonomy file that specifies the groups of closely related
    organisms; when the domclust program counts organisms, the
    organisms in the same group are counted only once.
cog.clust:
    The original COG classification, reformatted into the default
    DomClust output format.
cog.tit:
    Function categories and titles that were assigned to COGs; the
    data were taken from the original distribution.
wdog.list:
    The list of well-defined orthologous groups (WDOGs) that were
    introduced in the DomClust paper. The columns are COGID, the
    number of genes contained in the COG, the number of organisms
    contained in the COG, and the average number of 'in-paralogs',
    defined as the number of genes divided by the number of organisms.

============
3. Reconstruction

You can reconstruct clusters from these data with the DomClust
program using the following command:

    % domclust selout_cog cog.tab [options ...]

For example,

    % domclust selout_cog cog.tab -tcog.tax -S60 -C80 -V.6 -p.5 -ai.95 -ao.8 -n3

This is the setting selected in the DomClust paper.
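
As a quick sanity check on the wdog.list columns described in section 2,
the following Python sketch recomputes the average number of in-paralogs
(genes divided by organisms) and compares it with the value stored in the
fourth column. It is only an illustration: it assumes the columns are
separated by whitespace, which this README does not state, and the names
WdogEntry, parse_wdog, inparalog_ratio and inconsistent are hypothetical
helpers, not part of the DomClust distribution. The sample lines are
made-up values, not taken from the real file.

```python
from typing import Iterable, List, NamedTuple


class WdogEntry(NamedTuple):
    cog_id: str            # column 1: COGID
    n_genes: int           # column 2: number of genes in the COG
    n_organisms: int       # column 3: number of organisms in the COG
    avg_inparalogs: float  # column 4: genes / organisms, as stored in the file


def parse_wdog(lines: Iterable[str]) -> List[WdogEntry]:
    """Parse wdog.list-style lines (ASSUMPTION: whitespace-separated columns)."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        cog_id, n_genes, n_orgs, avg = fields[:4]
        entries.append(WdogEntry(cog_id, int(n_genes), int(n_orgs), float(avg)))
    return entries


def inparalog_ratio(entry: WdogEntry) -> float:
    """Average number of in-paralogs: genes divided by organisms."""
    return entry.n_genes / entry.n_organisms


def inconsistent(entries: Iterable[WdogEntry], tol: float = 0.01) -> List[WdogEntry]:
    """Return entries whose stored average differs from genes / organisms."""
    return [e for e in entries if abs(inparalog_ratio(e) - e.avg_inparalogs) > tol]


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = ["COG0001 30 25 1.2", "COG0002 12 12 1.0"]
    for entry in parse_wdog(sample):
        print(entry.cog_id, round(inparalog_ratio(entry), 3))
```

To run the check on the real file, pass an open file handle instead of
the sample list, for example parse_wdog(open("wdog.list")). The
tolerance in inconsistent() is an arbitrary choice meant to absorb
rounding in the stored column.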