RepLoc

Installation

Download the software package from the Downloads page and uncompress it. The package includes 4 Python scripts (query.py, kcov.py, locate.py, and extract_er.py) and 2 shell scripts (reploc.sh and repem.sh). Make sure all of these files are in the same directory. All programs have been tested on Ubuntu Linux.

Requirements:

Python 3 -- all 4 scripts are written in Python 3
Jellyfish v2.3.0 -- k-mer counting tool
MACS2 v2.1.2 -- peak calling software
BEDTools v2.29.2 -- genome arithmetic tool

Optional:

pyfaidx -- a Python package (splits the genome by chromosome)
bedGraphToBigWig -- a UCSC program (converts bedGraph files to bigWig)

Note: all of these programs must be available on your system executable path (PATH).
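A quick way to confirm that the tools are reachable is to look each one up on the executable path. The sketch below is only an illustration and is not part of the RepLoc package; the executable names (jellyfish, macs2, bedtools, faidx, bedGraphToBigWig) are assumed from each tool's usual installation and may differ on your system.

```python
import shutil

# Executable names are assumptions based on each tool's usual installation.
REQUIRED = ["python3", "jellyfish", "macs2", "bedtools"]
OPTIONAL = ["faidx", "bedGraphToBigWig"]


def missing_tools(names):
    """Return the names that cannot be found on the executable path."""
    return [name for name in names if shutil.which(name) is None]


def report():
    """Print which required and optional tools are missing, if any."""
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_tools(names)
        if missing:
            print(f"Missing {label} tools: {', '.join(missing)}")
        else:
            print(f"All {label} tools found.")
```

Calling report() before starting the workflow prints any tool that still needs to be installed or added to the path.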

Run the software step by step (Note: shell scripts are recommended for basic use)

The RepLoc workflow comprises 3 main steps: k-mer counting, sequence repetitiveness computing, and repeat locating & merging.

Preparation

RepLoc achieves multiprocessing by running the program separately for every chromosome or sequence in the input genome file. The Python package pyfaidx is recommended for splitting your genome into chromosomes. Here we use the test genome file in the RepLoc package as a tutorial example.

faidx -f -x test_genome.fa

The advantage of this strategy is that users can choose which sequences to run (any sequences, such as a subset of chromosomes or even particular genes in the genome). Note that the split chromosomes or user-selected sequences should be moved to a directory for further use.
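If pyfaidx is not installed, the splitting and selection step can be approximated with the standard library. The following is a minimal sketch and not how pyfaidx works internally; the function name split_fasta and its keep argument are hypothetical, introduced here only for illustration. It writes each selected record to its own file inside a target directory.

```python
import os


def split_fasta(fasta_path, out_dir, keep=None):
    """Write each FASTA record to out_dir/<name>.fa and return the paths.

    keep is an optional set of sequence names; when given, only those
    records are written. The name is the first word of the header line.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    n_records = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous output file.
                if out is not None:
                    out.close()
                    out = None
                n_records += 1
                fields = line[1:].split()
                name = fields[0] if fields else f"seq{n_records}"
                if keep is None or name in keep:
                    path = os.path.join(out_dir, name + ".fa")
                    out = open(path, "w")
                    written.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written
```

For example, split_fasta("test_genome.fa", "sequences/") would place one file per sequence in sequences/, which can then be passed as the input directory in the steps below.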

Part 1. K-mer counting by Jellyfish

The jellyfish program reads a DNA sequence file in FASTA format and outputs a database file. The query.py program is then used to query the k-mer frequencies of the genome from this database.

Use jellyfish to count all k-mers in the input genome:

jellyfish count -m 16 -s 15M -L 2 -t 5 -C test/genome/test_genome.fa -o test_mer16.jf

Parameters: -m is the k-mer length (can be calculated as floor[log4(genome_length) + 5]); -s is the initial hash size (preferably set to a value larger than the genome length); -L is the lower count threshold (k-mers with count < 2 are not output); -t is the number of threads; -C counts k-mers in canonical representation (both strands); -o is the output file name.
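The k-mer length formula above can be evaluated with a few lines of Python. This is a small sketch; the function name recommended_k is hypothetical. It uses integer arithmetic instead of a floating-point logarithm so that the floor is exact at powers of 4.

```python
def recommended_k(genome_length):
    """Return floor(log4(genome_length) + 5) for a positive integer length."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive integer")
    # floor(log2(n)) == n.bit_length() - 1, and floor(log4(n)) is half of
    # that, rounded down.
    floor_log4 = (genome_length.bit_length() - 1) // 2
    return floor_log4 + 5


print(recommended_k(3_100_000_000))  # a genome of about 3.1 Gb gives 20
```

A genome between 4^11 (about 4.2 Mb) and 4^12 (about 16.8 Mb) gives k = 16, the value used in the commands of this tutorial.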

Use query.py to query k-mer frequencies from jellyfish database:

python3 query.py -m 16 -j test_mer16.jf -i test/sequences/ -o count_mer16/ -p 5

Parameters: -m is the k-mer length; -j is the jellyfish database file; -i is the input directory containing all or some of the chromosomes in the genome; -o is the output directory; -p is the number of CPUs to use.

Part 2. Sequence repetitiveness computing by kcov.py

Use kcov.py to compute repetitiveness for each nucleotide based on weighted k-mer coverage.

python3 kcov.py -m 16 -ct count_mer16/ -i test/sequences/ -o rmap_mer16/ -p 5

Parameters: -m is the k-mer length; -ct is the input directory of counting results; -i is the input directory of sequences in the genome; -o is the output directory; -p is the number of CPUs to use.
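To give an intuition for a per-nucleotide k-mer coverage value, the toy sketch below averages, for each base, the counts of all k-mers that overlap it. This is an assumption made for illustration only: it is not the weighting scheme used by kcov.py, and it counts k-mers within the sequence itself rather than reading the Jellyfish results. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_coverage(seq, k):
    """Return one value per base: the mean count of the k-mers covering it."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers < 1:
        return [0.0] * len(seq)
    kmers = [canonical(seq[i:i + k]) for i in range(n_kmers)]
    counts = Counter(kmers)
    total = [0] * len(seq)  # summed counts of the k-mers covering each base
    hits = [0] * len(seq)   # number of k-mers covering each base
    for i, kmer in enumerate(kmers):
        count = counts[kmer]
        for j in range(i, i + k):
            total[j] += count
            hits[j] += 1
    return [t / h for t, h in zip(total, hits)]
```

Bases inside repeated stretches receive higher values than bases in unique sequence, which is the property the repetitiveness map relies on.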

The output of kcov.py is in bedGraph format. The bedGraphToBigWig program can be used to convert the bedGraph files to bigWig format, which can be visualized in genome browsers such as IGV.
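bedGraphToBigWig needs a chromosome sizes file (one name and length per line, tab-separated) in addition to the bedGraph input. One way to produce it from the genome FASTA is sketched below with the standard library only; the function names and file names are hypothetical.

```python
def chrom_sizes(fasta_path):
    """Return a dict mapping each sequence name to its length in bases."""
    sizes = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                sizes[name] = 0
            elif name is not None:
                sizes[name] += len(line)
    return sizes


def write_chrom_sizes(fasta_path, out_path):
    """Write the name<TAB>length table expected by bedGraphToBigWig."""
    with open(out_path, "w") as out:
        for name, size in chrom_sizes(fasta_path).items():
            out.write(f"{name}\t{size}\n")
```

With the sizes file written, the conversion takes the form: bedGraphToBigWig in.bedGraph chrom.sizes out.bw. The bedGraph file must be sorted by chromosome and start position first.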

Part 3. Repeat locating and merging by locate.py

Use locate.py to determine the repetitive regions on the Rmap and merge them based on gap size and repetitiveness difference (relies on MACS2).

python3 locate.py -l 16 -i rmap_mer16/ -o repeats_mer16/ -g 50 -f 100 -s 1 -p 5

Parameters: -l is the minimum length of a peak; -i is the input directory of Rmap files; -o is the output directory; -g is the gap size threshold between two adjacent repeat peaks; -f is the fold-change threshold for the repetitiveness of two adjacent peaks; -s is the filtering threshold to control false positives; -p is the number of CPUs to use.
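The role of the gap size and fold change thresholds can be pictured with a simplified merging rule. The sketch below assumes that two adjacent peaks are merged when the gap between them is at most the gap threshold and their repetitiveness values differ by no more than the fold threshold. This rule, the function name merge_peaks, and the length-weighted score of a merged peak are illustrative assumptions, not the actual logic of locate.py.

```python
def merge_peaks(peaks, max_gap=50, max_fold=100):
    """Merge adjacent (start, end, score) peaks on one sequence.

    Assumes start < end and score > 0 for every peak. Two neighbours are
    merged when the gap is <= max_gap and the ratio of the larger score
    to the smaller one is <= max_fold.
    """
    merged = []
    for start, end, score in sorted(peaks):
        if merged:
            p_start, p_end, p_score = merged[-1]
            gap = start - p_end
            low, high = sorted((p_score, score))
            similar = low > 0 and high / low <= max_fold
            if gap <= max_gap and similar:
                p_len = p_end - p_start
                length = end - start
                # Length-weighted mean score for the merged peak.
                new_score = (p_score * p_len + score * length) / (p_len + length)
                merged[-1] = (p_start, max(p_end, end), new_score)
                continue
        merged.append((start, end, score))
    return merged
```

Under this rule, raising either threshold merges more neighbouring peaks into longer repeats, while lowering them keeps peaks separate.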

Note: according to our research, the optimal gap size and fold change may vary between species. Smaller gap sizes and fold changes perform well in smaller genomes, whereas larger values perform better in larger genomes. The default values of gap size and fold change are set to 50 and 100, which suit most situations (details are discussed in the article).

Run the shell scripts (recommended)

Use reploc.sh to detect all repeats in the input genome.

/bin/bash reploc.sh test/genome/test_genome.fa 16 test/sequences/ repeats_mer16/ 5

The usage of reploc.sh:

Usage: reploc.sh [genome] [mer_size] [sequence_dir] [output_dir] [cpu]

Use repem.sh to extract embedded repeats in segmental duplications (tested in the human genome).

/bin/bash repem.sh rmap_mer20/ repeats_mer20/ 20 400 100 er_out/

The usage of repem.sh:

Usage: repem.sh [bedgraph_dir] [repeat_dir] [mer_size] [cut_off] [ratio] [output_dir]

Note: the cut_off here is set to (k-mer length * 2 * 10), e.g. 20 * 2 * 10 = 400 for the 20-mers used above.
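The cut-off rule and the argument order of repem.sh can be combined in a small helper that builds the command line for a given k-mer length. This is a sketch only: the function names are hypothetical, and the command is assembled as a list without being run.

```python
def repem_cutoff(k):
    """Return the cut_off value for repem.sh: k-mer length * 2 * 10."""
    return k * 2 * 10


def repem_command(bedgraph_dir, repeat_dir, k, ratio, output_dir):
    """Build the repem.sh argument list in the order given by its usage line."""
    return [
        "/bin/bash", "repem.sh",
        bedgraph_dir, repeat_dir,
        str(k), str(repem_cutoff(k)), str(ratio),
        output_dir,
    ]


print(" ".join(repem_command("rmap_mer20/", "repeats_mer20/", 20, 100, "er_out/")))
```

The resulting list can be passed to subprocess.run if you want to launch the script from Python.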
