Tutorial

This is a tutorial with example data for seabreeze. Please install the software, create the conda enviroment and activate it first.

To get started, navigate to the root of the seabreeze directory, and copy the example/ directory:

cp -r example/ data/

The newly created data/ folder should look like this:

|---data/
|   |
|   |---data.csv
|   |---ori_dif_sequences.csv
|   |---02_genomes/
|   |   |
|   |   |---REL606_evolved_1.fasta
|   |   |---REL606_evolved_2.fasta
|   |   |---REL606.fasta

For this tutorial, structural variant mutations have been simulated on the E. coli strain REL606 genome using breseq. These mutated genome assemblies are REL606_evolved_1.fasta and REL606_evolved_2.fasta. We aim to identify the structural variants in these assemblies relative to their ancestor, REL606.fasta. These pairwise comparisons to be performed are specified in a csv file called data.csv in the data/ directory. For more information about the valid format of this file, please see the usage page.

To view this csv file:

cat data/data.csv

It should look like this:

assembly ancestor
REL606 REL606
REL606_evolved_1 REL606
REL606_evolved_2 REL606

We include a comparison of REL606 to itself as a quality check step, as we do not expect any mutations when a genome is compared to itself.

All of the following commands should be run from the seabreeze root directory. Here, 4 cores have been allocated to run seabreeze but this number can be higher on lower depending on your hardware.

Analyse genome sizes

This command compares the sizes of the assemblies to their specified ancestors in a pairwise manner.

snakemake --use-conda --cores 4 analyse_genome_sizes

The output file generated is data/04_rename_genome/genome_size_stats.csv. Please see output for more information about the fields in this table. This table should look this this:

assembly ancestor size_assembly size_ancestor difference percent_change
REL606 REL606 4629812 4629812 0 0.0
REL606_evolved_1 REL606 4617111 4629812 -12701 -0.2743
REL606_evolved_2 REL606 4549910 4629812 -79902 -1.7258

As we expected, there is no size difference when REL606 is compared to itself. We can see that both of the assemblies REL606_evolved_1 and REL606_evolved_2 are smaller than the ancestor.

Find insertion sequences

This command uses ISEscan to annotate partial and full-length IS elements in the genomes.

snakemake --use-conda --cores 4 predict_IS_elements

All of the output generated is stored in data/05_isescan_tables/. This tutorial should generate the following files in this directory:

|---data/
|   |---05_isescan_tables/
|   |   |
|   |   |---REL606.csv
|   |   |---REL606_evolved_1.csv
|   |   |---REL606_evolved_2.csv
|   |   |---REL606/
|   |   |---REL606_evolved_1/
|   |   |---REL606_evolved_2/

Each assembly has a .csv file that describes the locations of IS elements in that genome, and a directory where the additionally files generated by ISEScan are made. For information about interpreting this csv file, we refer users to the official ISEScan documentation.

Predict SVs

This command aligns the assembly-ancestor pairs, predicts structural variants and generates synteny plots.

snakemake --use-conda --cores 4 predict_structural_variants

This will generate the following directories:

|---data/
|   |---06_nucmer_alignment/
|   |   |
|   |   |---REL606/
|   |   |---REL606_evolved_1/
|   |   |---REL606_evolved_2/
|   |   |
|   |---07_syri_output/
|   |   |
|   |   |---REL606/
|   |   |---REL606_evolved_1/
|   |   |---REL606_evolved_2/

We are primarily interested in a table describing the SVs and the synteny plot. This command also generates other files that may be useful, but are not required. Please see output for information about those files.

Let's look at the SVs for REL606_evolved_1

cd data/07_syri_output/REL606_evolved_1
cat REL606_evolved_1_clean.syri.out | cut --complement -f 4-5

This command prints the tsv file REL606_evolved_1_clean.syri.out to the terminal, while hiding the 4th and 5th columns. These columns show the reference/query sequences for indels, and hence can be quite large. Note that older versions of the UNIX cut command may not support the --complement option, and you may need to either update cut or use other UNIX commands to hide the 4th and 5th columns. You can also view this file with a spreadsheet application, as it a tsv file.

genome  1       590471  genome  1       595344  SYN1    -   SYN -
genome  590473  599656  genome  595344  595344  DEL -   DEL -
genome  599657  700001  genome  595345  697137  SYN5    -   SYN -
genome  699997  2100900 genome  697133  2105267 INV15   -   INV -
genome  2100896 2775884 genome  2105263 2784492 SYN7    -   SYN -
genome  2775970 2789333 genome  2784492 2784492 DEL -   DEL -
genome  2789334 3697156 genome  2784487 3686254 SYN11   -   SYN -
genome  3002909 3008938 genome  2998062 2998062 DEL6    SYN11   DEL -
genome  3697154 4415713 genome  3687698 4400119 SYN12   -   SYN -
genome  3894996 3901133 genome  3885540 3885540 DEL7    SYN12   DEL -
genome  4415710 4629812 genome  4401562 4617111 SYN13   -   SYN -
genome  4615673 4629812 genome  4602972 4617111 SYN14   -   SYN -

We refer the user to the official SyRI documentation for information on interpreting this table. You may notice that the reference and query chromosome ID may read as "genome". This is because SyRI attempts to call variants between homologous chromosomes, and hence expects the reference and query ID to be the same string. To prevent this error from occurring, the genomes are renamed to a placeholder string 'genome'.

Let us view the synteny plot REL606_evolved_1.plot.2.pdf in this same directory. It can be opened with any application that allows you to view a PDF.

plot_1

The ancestor genome is the blue line, and the orange line is the evolved genome REL606_evolved_1. The grey ribbons show syntenic regions. We can see that there was one large inversion (in yellow) and several smaller deletions across the genome (in red).

One of the steps that seabreeze performs is to remove false positive SV calls and misassigned SVs. As an example of what this synteny plot looks like before this processing, we can view the plot REL606_evolved_1.plot.pdf

plot_2

Two deletions (~0.5Mb and ~2.8Mb) were not correctly called. There also appear to be many small "duplications". These correspond to spurious alignments between the many copies of IS elements, and do not always represent bona fide instances of structural variant mutations. It should also be noted that seabreeze does not track IS transposition events.

Annotate genes in SVs

To annotate the genes present in the deletions, inversions and amplifications, run this command. Make sure you navigate back to the seabreeze root directory before that.

snakemake --use-conda --cores 4 annotate_SV_regions

The primary output generated by this command are a series of html files in

|---data/
|   |---12_genome_diff_tables
|   |   |---html
|   |   |   |---REL606.html
|   |   |   |---REL606_evolved_1.html
|   |   |   |---REL606_evolved_2.html

Let us look at REL606_evolved_1.html as an example. This file can be opened by your web browser application. html_1

This table describes the mutations, their type, location and the genes present in them. For more information on this table, please see breseq documentation.

Predict SV mechanism

Most structural variant mutations occur through recombination between homologous sites and insertion sequences in particular are often involved. This command annotates the insertion sequences at the boundaries of structural variants (for deletions and inversions only) and predicts putative mechanisms.

snakemake --use-conda --cores 4 predict_SV_mechanism

This command generates several files: two summary files deletion_mechanism.csv and inversion_mechanism.csv, and three files for each of the three assemblies:

|---data/
|   |---11_annotated_boundaries
|   |   |---deletion_mechanism.csv
|   |   |---inversion_mechanism.csv
|   |   |---REL606_boundaries.csv
|   |   |---REL606_deletion.csv
|   |   |---REL606_inversion.csv
|   |   |---REL606_evolved_1_boundaries.csv
|   |   |---REL606_evolved_1_deletion.csv
|   |   |---REL606_evolved_1_inversion.csv
|   |   |---REL606_evolved_2_boundaries.csv
|   |   |---REL606_evolved_2_deletion.csv
|   |   |---REL606_evolved_2_inversion.csv

Let's look at the output for the assembly REL606_evolved_1_deletion.csv, which describes the putative mechanism of the deletions:

ref_start ref_stop query_start query_stop tag_3 L_ref L_ref_distance R_ref R_ref_distance L_query L_query_distance R_query R_query_distance Mechanism Evidence
590473 599656 595344 595344 DEL 0 0 0 0 other NA
2775970 2789333 2784492 2784492 DEL 0 0 IS3_61 -7 IS3_61 7 IS_mediated evolved
3002909 3008938 2998062 2998062 DEL 0 0 0 0 other NA
3894996 3901133 3885540 3885540 DEL IS3_61 0 0 IS3_61 0 IS3_61 0 IS_mediated full

We see that two deletions seem to involve IS3_61. For more information about this table, please see the output documentation.

Now let us look at the output for the assembly REL606_evolved_1_inversion.csv, which describes the putative mechanism of the inversions:

ref_start ref_stop query_start query_stop tag_3 L_ref L_ref_distance R_ref R_ref_distance L_query L_query_distance R_query R_query_distance Mechanism Evidence
699997 2100900 697133 2105267 INV 0 0 0 0 other NA

There was only a single inversion in this assembly, which occurred by an unknown mechanism seabreeze could not predict.

Looking at the summary file deletion_mechnism.csv, we can at a glance look at the mechanism for deletions in all of the assemblies:

clone total between_IS IS_mediated other
REL606 0 0 0 0
REL606_evolved_2 4 0 1 3
REL606_evolved_1 4 0 2 2

Similarly, for the inversions in inversion_mechanism.csv:

clone total between_IS IS_mediated other
REL606 0 0 0 0
REL606_evolved_2 2 1 0 1
REL606_evolved_1 1 0 0 1

Predict replichore and inversion balance

For bacterial genomes that have a single origin and terminus, the genome can be divided into two replichores (halves) demarcated by the origin-terminus axis. It can be useful to know how the lengths of these two replichores differ between the ancestor-assembly pairs. In this figure, the two replichores have been depicted in yellow and green.

inversion_replichore_1

The origin-terminus axis can also be used to classify inversions as inter-replichore (if they occur across the axis) or intra-replichore if they are contained within a single replichore. This figure depicts an inter-replichore inversion across the origin, and an intra-replichore inversion within the green replichore.

inversion_replichore_2

For inter-replichore inversions, we can describe the symmetry of the inversion across the axis. Asymmetric inversions can cause the length of the two replichores to change.

inversion_replichore_3

This step requires user input (in addition to data/data.csv) to specify the sequences of the origin and terminus. For the origin, we recommend the oriC sequence, and for the terminus, we recommend the dif sequence. seabreeze looks for these sequences (or the reverse complement) in the genome for an exact and unique match, and will cause an error if an exact match is not found, or more than one exact match is found. There is no minimum/maximum length requirements for the supplied oriC and dif sequences as long as they meet the above criteria. seabreeze requires these two sequences for each ancestor, and assumes that the sequences have not mutated in the corresponding ancestor. This information is specified in a csv file in data/ori_dif_sequences.csv.

Let us view this file for our example:

cat data/ori_dif_sequences.csv

We should be able to see this table:

ancestor ori dif
REL606 GGATCCTGGGTATTAAAA TCTTCCTTGGTTTATATT

These sequences are the first few bases of the oriC and dif loci for E. coli. More information about what sequences are acceptable in this table are on the usage page.

We can now run the command from the seabreeze root directory:

snakemake --use-conda --cores 4 predict_replichore_balance

This step generates several output files. First, let us look at the file that describes the lengths of the replichores in the assemblies data/08_reindex_genome_oric/replichore_arms.csv.

clone ori dif length arm_1 arm_2 ratio percent
REL606 0.0 2311092.0 4629812.0 2311092.0 2318720.0 1.003 50.082
REL606_evolved_1 0.0 1972601.0 4617111.0 1972601.0 2644510.0 1.341 57.276
REL606_evolved_2 0.0 1708781.0 4549910.0 1708781.0 2841129.0 1.663 62.444

For more information, about this table, please see the output page. arm_1 and arm_2 describe the length of the two replichores, and percent describes the percent of the total length of the genome that is in the longer replichore. This value is 50 for a perfectly balanced genome, and increases with increasing imbalance. We can see that both REL606_evolved_1 and REL606_evolved_2 are less balanced than REL606.

Now, let's look at the classification of the inversions in the assemblies in data/11_annotated_boundaries/inversion_replichores_long.csv. This file contains a list of all of the inversions across all of the assemblies.

clone classification length mechanism symmetry_percent
REL606_evolved_2 across_dif 1500794 other 75.52768734416581
REL606_evolved_2 across_ori 1070041 between_IS 57.700686235387245
REL606_evolved_1 across_dif 1400903 other 61.914707870566346

We can see that the assembly REL606_evolved_2 had two inversions, one across the terminus and one across the origin, while REL606_evolved_1 had only one inversion across the terminus. The symmetry of each of these inversions is described by the symmetry_percent field, which is the percent of the total length of the inversion that is in the longer arm of the inversion. This value is 50 for a perfectly symmetric inversion and becomes larger as the inversion becomes more asymmetric.