PhyloCGN: Beta Release

Mar 3, 2026: Software Release


Phylogeny and Conserved Genome Neighborhood: “PhyloCGN”

Our laboratory is pleased to announce the release of PhyloCGN, a bioinformatics tool designed to extract and visualize gene sets functionally related to a target gene by integrating phylogeny and genomic neighborhood conservation. The source code is now available on GitHub.

This tool is a modernized, lightweight implementation of the analytical framework described in our recent paper: Kosaka and Matsutani, Microbes and Environments (2025).

Pipeline Modernization: Speed and Scalability

We have overhauled the legacy tool stack to reduce computational cost. This “modern stack” makes large-scale genomic analysis feasible on a standard laptop, without the need for high-performance computing (HPC) clusters.

Process          Legacy Workflow         PhyloCGN
Homology Search  BLASTp                  diamond
Clustering       MCL                     MMseqs2
Alignment        muscle3                 muscle5
Tree Building    NJ (Neighbor-Joining)   VeryFastTree
Visualization    R (Static images)       HTML + JS (Interactive)


AI-Assisted Development & Automation via Rake

We refactored the pipeline in collaboration with AI assistants (Gemini and Claude). In particular, the tree clustering logic, which previously depended on TreeCluster.py, has been re-implemented natively as a Ruby script.
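To give a feel for what threshold-based tree clustering involves, here is a minimal Ruby sketch. It is not the PhyloCGN implementation: the Node struct, the function names, and the clustering rule are assumptions made for illustration. The rule used is that each cluster is the largest clade whose longest leaf-to-leaf distance stays within a threshold, and the threshold is assumed to be non-negative.

```ruby
# Hypothetical tree node for illustration. `length` is the branch length
# to the parent; leaves have an empty `children` array.
Node = Struct.new(:name, :length, :children) do
  def leaf?
    children.empty?
  end
end

# Returns [height, diameter] for the subtree rooted at `node`:
#   height   = longest distance from `node` down to any leaf
#   diameter = longest leaf-to-leaf path inside the subtree
def height_and_diameter(node)
  return [0.0, 0.0] if node.leaf?

  heights = []
  diameter = 0.0
  node.children.each do |child|
    h, d = height_and_diameter(child)
    heights << h + child.length
    diameter = [diameter, d].max
  end
  # The longest path through `node` joins its two deepest child branches.
  top_two = heights.max(2)
  diameter = [diameter, top_two.sum].max if top_two.size == 2
  [heights.max, diameter]
end

def leaf_names(node)
  return [node.name] if node.leaf?

  node.children.flat_map { |child| leaf_names(child) }
end

# Walk down from the root and emit a clade as one cluster as soon as its
# diameter fits within the threshold (assumed >= 0).
def cluster_clades(node, threshold)
  _, diameter = height_and_diameter(node)
  return [leaf_names(node)] if diameter <= threshold

  node.children.flat_map { |child| cluster_clades(child, threshold) }
end

# Toy tree ((A:0.1,B:0.1):0.05,(C:0.4,D:0.5):0.05)
left  = Node.new(nil, 0.05, [Node.new("A", 0.1, []), Node.new("B", 0.1, [])])
right = Node.new(nil, 0.05, [Node.new("C", 0.4, []), Node.new("D", 0.5, [])])
root  = Node.new(nil, 0.0, [left, right])

p cluster_clades(root, 0.3)
# prints [["A", "B"], ["C"], ["D"]]
```

This sketch recomputes diameters at every level for brevity. An implementation meant for large trees would compute them once in a single post-order pass.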

A Rakefile manages the entire workflow, so users can run the full analysis, from data retrieval to visualization, with a single command: rake do_all.
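The following sketch shows how a Rakefile can chain pipeline stages so that one top-level task runs them all in order. Only the task name do_all comes from this announcement. The other task names are hypothetical, and each stage merely records its name rather than calling the external tools, so the actual PhyloCGN Rakefile will differ.

```ruby
require "rake"

# Records the order in which stages run. A real pipeline would invoke the
# external tools (diamond, MMseqs2, muscle5, VeryFastTree) here instead.
RUN_LOG = []

# Hypothetical stage names; each task depends on the one before it.
task :homology_search do
  RUN_LOG << "homology_search"
end

task cluster: :homology_search do
  RUN_LOG << "cluster"
end

task align: :cluster do
  RUN_LOG << "align"
end

task tree: :align do
  RUN_LOG << "tree"
end

task visualize: :tree do
  RUN_LOG << "visualize"
end

desc "Run every stage, from homology search to visualization"
task do_all: :visualize
```

Because every task declares its predecessor as a prerequisite, requesting do_all makes Rake resolve the chain and run each stage once, in dependency order. Saved as a file named Rakefile, this sketch would respond to rake do_all.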

Input Specification: High-Precision Single-Query Analysis

In this beta version (v0.9.0), the pipeline is optimized for a single amino acid sequence (one target protein) as input.

This design keeps the evolutionary relationship between the query and its genomic neighbors straightforward to interpret. To analyze multiple targets, users are encouraged to run each analysis in its own directory.
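Before starting a run, it can help to confirm that the input really is a single protein record. The Ruby sketch below is one way to do that. The function name and the accepted alphabet (the 20 standard residues plus X, with an optional trailing stop "*") are assumptions for illustration, and PhyloCGN itself may apply different checks.

```ruby
# Assumed alphabet: 20 standard amino acids plus X (unknown residue),
# optionally followed by a single stop symbol. Rarer codes such as U or B
# are rejected by this sketch.
AMINO_ACIDS = /\A[ACDEFGHIKLMNPQRSTVWYX]+\*?\z/i

# Parses FASTA text and returns { id:, sequence: } if it holds exactly one
# protein record; raises ArgumentError otherwise.
def parse_single_protein(fasta_text)
  lines = fasta_text.lines.map(&:strip).reject(&:empty?)
  headers = lines.select { |line| line.start_with?(">") }

  unless headers.size == 1
    raise ArgumentError, "expected exactly one FASTA record, found #{headers.size}"
  end
  unless lines.first.start_with?(">")
    raise ArgumentError, "FASTA input must begin with a '>' header line"
  end

  sequence = lines.drop(1).join
  unless sequence.match?(AMINO_ACIDS)
    raise ArgumentError, "sequence is empty or contains non-amino-acid characters"
  end

  # Note: a short nucleotide sequence (only A, C, G, T) also passes this
  # check, because those letters are valid residue codes too.
  { id: lines.first[1..-1].to_s.split.first, sequence: sequence.upcase }
end

record = parse_single_protein(">query1 example protein\nMKTAYIAK\nQRQISFVK\n")
puts "#{record[:id]}: #{record[:sequence].length} residues"
# prints "query1: 16 residues"
```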

Mission and Future Outlook

PhyloCGN is currently in public beta. Future updates may include support for multi-sequence processing.

Integrating evolutionary context (phylogeny and neighborhood) is a powerful strategy for identifying complex protein machineries or essential maturation factors. We hope PhyloCGN becomes a valuable platform for researchers exploring the functions of unknown genes.

