Rosalind Armory 문제풀이
생물 정보학 분석을 위해 사용할 수 있는 소프트웨어는 이미 많습니다. Rosalind_Stronghold 에서는 알고리즘을 직접 구현했다면, 여기 Rosalind_Armory 에서는 이미 존재하는 도구를 사용하여 비슷한 문제를 풀어봅니다.
Rosalind 는 프로젝트 오일러, 구글 코드 잼 에서 영감을 얻었습니다. 이 프로젝트의 이름은 DNA 이중나선을 발견하는 데 기여한 로잘린드 프랭클린 에서 따왔습니다. Rosalind 는 프로그래밍 실력을 키우고자 하는 생물학자와 분자생물학의 계산 문제를 접해본 적이 없는 프로그래머들에게 도움이 될 것입니다.
1 Introduction to the Bioinformatics Armory
This initial problem is aimed at familiarizing you with Rosalind’s task-solving pipeline. To solve it, you merely have to take a given DNA sequence and find its nucleotide counts; this problem is equivalent to “Counting DNA Nucleotides” in the Stronghold.
Of the many tools for DNA sequence analysis, one of the most popular is the Sequence Manipulation Suite. Commonly known as SMS 2, it comprises a collection of programs for generating, formatting, and analyzing short strands of DNA and polypeptides.
One of the simplest SMS 2 programs, called DNA stats
, counts the number of occurrences of each nucleotide in a given strand of DNA. An online interface for DNA stats
can be found here.
Given: A DNA string \(s\) of length at most 1000 bp.
Return: Four integers (separated by spaces) representing the respective number of times that the symbols ‘A’, ‘C’, ‘G’, and ‘T’ occur in \(s\).
Note: You must provide your answer in the format shown in the sample output below.
1.1 Sample Dataset
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
1.2 Sample Output
20 12 17 21
1.3 Solution
from collections import Counter
def count_dna_symbols(dna):
# Define the symbols we're interested in
= ["A", "C", "G", "T"]
symbols
# Use Counter to count occurrences of each symbol in the DNA
= Counter(dna)
dna_counter
# Create a dictionary with counts for each symbol of interest
= {symbol: dna_counter.get(symbol, 0) for symbol in symbols}
symbols_count
return symbols_count
= """
sample_input AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
"""
# Get the symbols count
= count_dna_symbols(sample_input)
symbols_count
# Print the counts in the desired format
print(" ".join(str(symbols_count[symbol]) for symbol in ["A", "C", "G", "T"]))
2 GenBank Introduction
GenBank comprises several subdivisions:
- Nucleotide: a collection of nucleic acid sequences from several sources.
- Genome Survey Sequence (GSS): uncharacterized short genomic sequences.
- Expressed Sequence Tags, (EST): uncharacterized short cDNA sequences.
Searching the Nucleotide database with general text queries will produce the most relevant results. You can also use a simple query based on protein name, gene name or gene symbol.
To limit your search to only certain kinds of records, you can search using GenBank’s Limits page or alternatively use the Filter your results
field to select categories of records after a search.
If you can not find what you are searching for, check how the database interpreted your query by investigating the Search details
field on the right side of the page. This field automatically translates your search into standard keywords.
For example, if you search for Drosophila
, the Search details
field will contain (Drosophila[All Fields]
), and you will obtain all entries that mention Drosophila (including all its endosymbionts). You can restrict your search to only organisms belonging to the Drosophila genus by using a search tag and searching for Drosophila[Organism]
.
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
2.1 Sample Dataset
Anthoxanthum
2003/7/25
2005/12/27
2.2 Sample Output
7
2.3 Solution
from Bio import Entrez
def get_nucleotide_genbank_entries(genus_name, start_date, end_date):
"""
Retrieve the count of Nucleotide GenBank entries for a given genus and date range.
Args:
genus_name (str): The name of the genus to search for.
start_date (str): The start date of the publication range (format: YYYY/MM/DD).
end_date (str): The end date of the publication range (format: YYYY/MM/DD).
Returns:
int: The count of entries found.
"""
= "byterube@gmail.com" # Replace with your email
Entrez.email
= f'"{genus_name}"[Organism] AND ("{start_date}"[Publication Date] : "{end_date}"[Publication Date])'
query
try:
with Entrez.esearch(db="nucleotide", term=query) as handle:
= Entrez.read(handle)
record return int(record["Count"])
except Exception as e:
print(f"An error occurred: {e}")
return 0
# Sample input data
= """
sample_input Anthoxanthum
2003/7/25
2005/12/27
""".strip().split("\n")
# Extracting input values
= sample_input[0]
genus_name = sample_input[1]
start_date = sample_input[2]
end_date
# Get the count of GenBank entries
= get_nucleotide_genbank_entries(genus_name, start_date, end_date)
count
# Print the result
print(count)
3 Data Formats
GenBank can be accessed here. A detailed description of the GenBank format can be found here. A tool, from the SMS 2 package, for converting GenBank to FASTA can be found here.
Given: A collection of \(n\) (\(n≤10\)) GenBank entry IDs.
Return: The shortest of the strings associated with the IDs in FASTA format.
3.1 Sample Dataset
FJ817486 JX069768 JX469983
3.2 Sample Output
>JX469983.1 Zea mays subsp. mays clone UT3343 G2-like transcription factor mRNA, partial cds
ATGATGTATCATGCGAAGAATTTTTCTGTGCCCTTTGCTCCGCAGAGGGCACAGGATAATGAGCATGCAA
GTAATATTGGAGGTATTGGTGGACCCAACATAAGCAACCCTGCTAATCCTGTAGGAAGTGGGAAACAACG
GCTACGGTGGACATCGGATCTTCATAATCGCTTTGTGGATGCCATCGCCCAGCTTGGTGGACCAGACAGA
GCTACACCTAAAGGGGTTCTCACTGTGATGGGTGTACCAGGGATCACAATTTATCATGTGAAGAGCCATC
TGCAGAAGTATCGCCTTGCAAAGTATATACCCGACTCTCCTGCTGAAGGTTCCAAGGACGAAAAGAAAGA
TTCGAGTGATTCCCTCTCGAACACGGATTCGGCACCAGGATTGCAAATCAATGAGGCACTAAAGATGCAA
ATGGAGGTTCAGAAGCGACTACATGAGCAACTCGAGGTTCAAAGACAACTGCAACTAAGAATTGAAGCAC
AAGGAAGATACTTGCAGATGATCATTGAGGAGCAACAAAAGCTTGGTGGATCAATTAAGGCTTCTGAGGA
TCAGAAGCTTTCTGATTCACCTCCAAGCTTAGATGACTACCCAGAGAGCATGCAACCTTCTCCCAAGAAA
CCAAGGATAGACGCATTATCACCAGATTCAGAGCGCGATACAACACAACCTGAATTCGAATCCCATTTGA
TCGGTCCGTGGGATCACGGCATTGCATTCCCAGTGGAGGAGTTCAAAGCAGGCCCTGCTATGAGCAAGTC
A
3.3 Solution
from Bio import Entrez, SeqIO
from typing import List
def get_shortest_sequence(entry_ids: List[str]) -> None:
"""
Fetch nucleotide sequences for given entry IDs and print the shortest one in FASTA format.
Args:
entry_ids (List[str]): A list of GenBank entry IDs.
Raises:
Exception: If there's an error fetching or parsing the sequences.
"""
= "byterube@gmail.com" # Replace with your email
Entrez.email
try:
# Fetch sequences
with Entrez.efetch(db="nucleotide", id=entry_ids, rettype="fasta", retmode="text") as handle:
= list(SeqIO.parse(handle, "fasta"))
records
if not records:
print("No sequences found for the given entry IDs.")
return
# Find and print the shortest sequence
= min(records, key=lambda record: len(record.seq))
shortest_record print(shortest_record.format("fasta"))
except Exception as e:
print(f"An error occurred: {e}")
# Sample input data
= """
sample_input FJ817486 JX069768 JX469983
"""
= sample_input.split()
entry_ids
# Get and print the shortest sequence
get_shortest_sequence(entry_ids)
4 New Motif Discovery
The novel-motif finding tool MEME can be found here.
Given: A set of protein strings in FASTA format that share some motif with minimum length 20.
Return: Regular expression for the best-scoring motif.
4.1 Sample Dataset
>Rosalind_7142
PFTADSMDTSNMAQCRVEDLWWCWIPVHKNPHSFLKTWSPAAGHRGWQFDHNFFVYMMGQ
FYMTKYNHGYAPARRKRFMCQTFFILTFMHFCFRRAHSMVEWCPLTTVSQFDCTPCAIFE
WGFMMEFPCFRKQMHHQSYPPQNGLMNFNMTISWYQMKRQHICHMWAEVGILPVPMPFNM
SYQIWEKGMSMGCENNQKDNEVMIMCWTSDIKKDGPEIWWMYNLPHYLTATRIGLRLALY
>Rosalind_4494
VPHRVNREGFPVLDNTFHEQEHWWKEMHVYLDALCHCPEYLDGEKVYFNLYKQQISCERY
PIDHPSQEIGFGGKQHFTRTEFHTFKADWTWFWCEPTMQAQEIKIFDEQGTSKLRYWADF
QRMCEVPSGGCVGFEDSQYYENQWQREEYQCGRIKSFNKQYEHDLWWCWIPVHKKPHSFL
KTWSPAAGHRGWQFDHNFFSTKCSCIMSNCCQPPQQCGQYLTSVCWCCPEYEYVTKREEM
>Rosalind_3636
ETCYVSQLAYCRGPLLMNDGGYGPLLMNDGGYTISWYQAEEAFPLRWIFMMFWIDGHSCF
NKESPMLVTQHALRGNFWDMDTCFMPNTLNQLPVRIVEFAKELIKKEFCMNWICAPDPMA
GNSQFIHCKNCFHNCFRQVGMDLWWCWIPVHKNPHSFLKTWSPAAGHRGWQFDHNFFQMM
GHQDWGTQTFSCMHWVGWMGWVDCNYDARAHPEFYTIREYADITWYSDTSSNFRGRIGQN
4.2 Sample Output
DLWWCWIPVHK[NK]PHSFLKTWSPAAGHRGWQFDHNFF
4.3 Solution
install MM
5 Pairwise Global Alignment
An online interface to EMBOSS’s Needle
tool for aligning DNA and RNA strings can be found here.
Use:
- The DNAfull scoring matrix; note that DNAfull uses IUPAC notation for ambiguous nucleotides.
- Gap opening penalty of 10.
- Gap extension penalty of 1.
For our purposes, the “pair” output format will work fine; this format shows the two strings aligned at the bottom of the output file beneath some statistics about the alignment.
Given: Two GenBank IDs.
Return: The maximum global alignment score between the DNA strings associated with these IDs.
5.1 Sample Dataset
JX205496.1 JX469991.1
5.2 Sample Output
257
5.3 Solution
from Bio import Entrez, SeqIO
from Bio import Align
from Bio.Seq import Seq
from typing import List, Optional
def fetch_sequences(genbank_ids: List[str]) -> List[Seq]:
"""
Fetch sequences from GenBank given a list of IDs.
Args:
genbank_ids (List[str]): List of GenBank IDs.
Returns:
List[Seq]: List of Seq objects representing the fetched sequences.
"""
= "byterube@gmail.com" # Replace with your email
Entrez.email try:
with Entrez.efetch(db="nucleotide", id=genbank_ids, rettype="fasta", retmode="text") as handle:
return [record.seq for record in SeqIO.parse(handle, "fasta")]
except Exception as e:
print(f"Error fetching sequences: {e}")
return []
def align_sequences(seq1: Seq, seq2: Seq) -> Optional[float]:
"""
Perform global alignment of two sequences using Bio.Align.PairwiseAligner.
Args:
seq1 (Seq): First sequence.
seq2 (Seq): Second sequence.
Returns:
Optional[float]: Alignment score, or None if alignment fails.
"""
try:
= Align.PairwiseAligner()
aligner = 'global'
aligner.mode = 5
aligner.match_score = -4
aligner.mismatch_score = -10
aligner.open_gap_score = -1
aligner.extend_gap_score
= aligner.align(seq1, seq2)
alignments return alignments.score
except Exception as e:
print(f"Error during alignment: {e}")
return None
# Sample input data
= """
sample_input JX462666.1 NM_001251956.1
"""
= sample_input.strip().split()
genbank_ids
# Fetch sequences
= fetch_sequences(genbank_ids)
sequences
# Perform alignment
= align_sequences(sequences[0], sequences[1])
alignment_score
if alignment_score is not None:
print(f"{alignment_score}")
6 FASTQ format introduction
Sometimes it’s necessary to convert data from FASTQ format to FASTA format. For example, you may want to perform a BLAST search using reads in FASTQ format obtained from your brand new Illumina Genome Analyzer.
Links:
A FASTQ to FASTA converter can be accessed from the Sequence conversion website
A free GUI converter developed by BlastStation is available here for download or as an add-on to Google Chrome.
There is a FASTQ to FASTA converter in the Galaxy web platform. Note that you should register in the Galaxy and upload your file prior to using this tool.
Given: FASTQ file
Return: Corresponding FASTA records
6.1 Sample Dataset
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!*((((***+))%%%++)(%%%%)***-+*****))**55CCF>>>>>>CCCCCCC65
6.2 Sample Output
>SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
6.3 Solution
from Bio import SeqIO
from io import StringIO
def convert_fastq_to_fasta(fastq_string):
"""
Convert a FASTQ string to a FASTA string.
Args:
fastq_string (str): Input FASTQ formatted string.
Returns:
str: FASTA formatted string.
"""
# Create file-like objects for input and output
= StringIO(fastq_string)
fastq_handle = StringIO()
fasta_handle
# Perform the conversion
'fastq', fasta_handle, 'fasta')
SeqIO.convert(fastq_handle,
# Get the FASTA string and reset the StringIO
0)
fasta_handle.seek(= fasta_handle.read()
fasta_string
return fasta_string
# Sample FASTQ input
= """
fastq @SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!*((((***+))%%%++)(%%%%)***-+*****))**55CCF>>>>>>CCCCCCC65
""".strip()
# Convert FASTQ to FASTA
= convert_fastq_to_fasta(fastq)
fasta
# Print the resulting FASTA
print(fasta)
7 Read Quality Distribution
A version of FastQC can be downloaded here and run locally on any operating system with a suitable Java Runtime Environment (JRE) installed.
An online version of FastQC is also available here in the “Andromeda” Galaxy instance.
Given: A quality threshold, along with FASTQ entries for multiple reads.
Return: The number of reads whose average quality is below the threshold.
7.1 Sample Dataset
28
@Rosalind_0041
GGCCGGTCTATTTACGTTCTCACCCGACGTGACGTACGGTCC
+
6.3536354;<211/0?::6/-2051)-*"40/.,+%)
@Rosalind_0041
TCGTATGCGTAGCACTTGGTACAGGAAGTGAACATCCAGGAT
+
AH@FGGGJ<GB<<9:GD=D@GG9=?A@DC=;:?>839/4856
@Rosalind_0041
ATTCGGTAATTGGCGTGAATCTGTTCTGACTGATAGAGACAA
+
@DJEJEA?JHJ@8?F?IA3=;8@C95=;=?;>D/:;74792.
7.2 Sample Output
1
7.3 Solution
from Bio import SeqIO
from io import StringIO
def phre(data):
= 0
count # Convert the string input to a file-like object
with StringIO(data) as f:
= int(f.readline().strip())
threshold # Parse the FASTQ data from the string
for record in SeqIO.parse(f, "fastq"):
= record.letter_annotations["phred_quality"]
quality = sum(quality) / len(quality)
average_quality if average_quality < threshold:
+= 1
count return count
# Sample input data
= """
sample_input 28
@Rosalind_0041
GGCCGGTCTATTTACGTTCTCACCCGACGTGACGTACGGTCC
+
6.3536354;<211/0?::6/-2051)-*"40/.,+%)
@Rosalind_0041
TCGTATGCGTAGCACTTGGTACAGGAAGTGAACATCCAGGAT
+
AH@FGGGJ<GB<<9:GD=D@GG9=?A@DC=;:?>839/4856
@Rosalind_0041
ATTCGGTAATTGGCGTGAATCTGTTCTGACTGATAGAGACAA
+
@DJEJEA?JHJ@8?F?IA3=;8@C95=;=?;>D/:;74792.
""".strip()
= phre(sample_input)
result print(result)
8 Protein Translation
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. The RNA codon table shows the encoding from each RNA codon to the amino acid alphabet.
The Translate
tool from the SMS 2 package can be found here in the SMS 2 package
A detailed list of genetic code variants (codon tables) along with indexes representing these codes (1 = standard genetic code, etc.) can be obtained here.
For now, when translating DNA and RNA strings, we will start with the first letter of the string and ignore stop codons.
Given: A DNA string ss of length at most 10 kbp, and a protein string translated by ss.
Return: The index of the genetic code variant that was used for translation. (If multiple solutions exist, you may return any one.)
8.1 Sample Dataset
ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA
MAMAPRTEINSTRING
8.2 Sample Output
1
8.3 Solution
from Bio.Seq import translate
def find_genetic_code(dna, protein):
# List of genetic code table IDs to check
= [1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14, 15,]
table_ids
for table in table_ids:
try:
# Translate DNA using the current table
= translate(dna, table=table, to_stop=True)
translated
# Check if the translated protein matches the given protein
if translated == protein:
return table
except:
# If there's an error with a particular table, skip it
continue
# If no matching table is found
return None
# Read input
= """
sample_input ATGGCCATGGCGCCCAGAACTGAGATCAATAGTACCCGTATTAACGGGTGA
MAMAPRTEINSTRING
""".strip().split("\n")
= sample_input[0]
dna = sample_input[1]
protein
# Find the genetic code
= find_genetic_code(dna, protein)
result
# Print the result
if result:
print(result)
else:
print("No matching genetic code found.")
9 Read Filtration by Quality
Poor-quality reads can be filtered out using the FASTQ Quality Filter
tool from the FASTX toolkit. A command-line version of FASTX can be downloaded for Linux or MacOS from its website. An online interface for the FASTQ Quality Filter
is also available here within the Galaxy web platform.
Given: A quality threshold value qq, percentage of bases pp, and set of FASTQ entries.
Return: Number of reads in filtered FASTQ entries
9.1 Sample Dataset
20 90
@Rosalind_0049_1
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,
@Rosalind_0049_2
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT
+
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,*3+*!
@Rosalind_0049_3
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+
9.2 Sample Output
2
9.3 Solution
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
def quality_filtration(data):
# Read the threshold and percentage from the first line
= map(int, data[0].strip().split())
t, p
= 0
count # Process the FASTQ data
for i in range(1, len(data), 4):
= data[i+1].strip()
sequence = data[i+3].strip()
quality_string
# Convert quality string to Phred scores
= [ord(char) - 33 for char in quality_string]
phred_quality
# Perform quality check
= sum(ph >= t for ph in phred_quality)
passes if (passes / len(phred_quality)) * 100 >= p:
+= 1
count
return count
# Sample input
= """
sample_input 20 90
@Rosalind_0049_1
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,
@Rosalind_0049_2
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT
+
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,*3+*!
@Rosalind_0049_3
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+
""".strip().split("\n")
# Call the function with the sample input
= quality_filtration(sample_input)
count print(count)
10 Complementing a Strand of DNA
Recall that in a DNA string ss, ‘A’ and ‘T’ are complements of each other, as are ‘C’ and ‘G’. Furthermore, the reverse complement of \(s\) is the string \(s^c\) formed by reversing the symbols of ss and then taking the complement of each symbol (e.g., the reverse complement of “GTCA” is “TGAC”).
The Reverse Complement
program from the SMS 2 package can be run online here.
Given: A collection of \(n\) (\(n≤10\)) DNA strings.
Return: The number of given strings that match their reverse complements.
10.1 Sample Dataset
>Rosalind_64
ATAT
>Rosalind_48
GCATA
10.2 Sample Output
1
10.3 Solution
from Bio import SeqIO
from Bio.Seq import Seq
from io import StringIO
def is_palindrome(seq):
"""Check if a sequence is equal to its reverse complement."""
= Seq(seq)
my_seq = my_seq.reverse_complement()
reverse_seq return my_seq == reverse_seq
def count_palindromic_sequences(seqs):
"""Count the number of palindromic sequences in the list."""
return sum(is_palindrome(seq) for seq in seqs)
def parse_fasta_input(fasta_input):
"""Parse FASTA formatted input and return a list of sequences."""
= StringIO(fasta_input)
fasta_io return [str(record.seq) for record in SeqIO.parse(fasta_io, "fasta")]
# Sample input
= """
sample_input >Rosalind_64
ATAT
>Rosalind_48
GCATA
"""
# Parse the sequences
= parse_fasta_input(sample_input)
sequences
# Count palindromic sequences
= count_palindromic_sequences(sequences)
palindrome_count
# Print the result
print(palindrome_count)
11 Base Quality Distribution
Quality of the bases can vary depends on position in read due to nature of the sequencing procedure. One can check this quality distribution using “Per Base Sequence Quality” module of the FastQC program.
Average accepted quality values is a 10 for the lower quartile and 25 for median. If the values falls below this limit, then the module returns a warning.
Note that for the reads >50bp long FastQC will group the bases. To show data for every base in the read use “–nogroup” option.
Given: FASTQ file, quality threshold \(q\)
Return: Number of positions where mean base quality falls below given threshold
11.1 Sample Dataset
26
@Rosalind_0029
GCCCCAGGGAACCCTCCGACCGAGGATCGT
+
>?F?@6<C<HF?<85486B;85:8488/2/
@Rosalind_0029
TGTGATGGCTCTCTGAATGGTTCAGGCAGT
+
@J@H@>B9:B;<D==:<;:,<::?463-,,
@Rosalind_0029
CACTCTTACTCCCTAGCCGAACTCCTTTTT
+
=88;99637@5,4664-65)/?4-2+)$)$
@Rosalind_0029
GATTATGATATCAGTTGGCTCCGAGAGCGT
+
<@BGE@8C9=B9:B<>>>7?B>7:02+33.
11.2 Sample Output
17
11.3 Solution
from Bio import SeqIO
from io import StringIO
def parse_threshold_and_fastq(data):
"""Parse the threshold and FASTQ data from the input string."""
= data.strip().split('\n')
lines = int(lines[0])
threshold = '\n'.join(lines[1:])
fastq_data return threshold, fastq_data
def extract_quality_scores(fastq_data):
"""Extract quality scores from FASTQ data."""
= StringIO(fastq_data)
fastq_io return [record.letter_annotations["phred_quality"]
for record in SeqIO.parse(fastq_io, "fastq")]
def count_below_threshold(qualities, threshold):
"""Count positions where the average quality score is below the threshold."""
= len(qualities)
num_sequences = len(qualities[0])
num_positions = 0
count
for i in range(num_positions):
= sum(q[i] for q in qualities) / num_sequences
average_quality if average_quality < threshold:
+= 1
count
return count
def bphr(data):
"""Calculate the number of positions with average quality below the threshold."""
= parse_threshold_and_fastq(data)
threshold, fastq_data = extract_quality_scores(fastq_data)
qualities return count_below_threshold(qualities, threshold)
# Sample input
= """
sample_input 26
@Rosalind_0029
GCCCCAGGGAACCCTCCGACCGAGGATCGT
+
>?F?@6<C<HF?<85486B;85:8488/2/
@Rosalind_0029
TGTGATGGCTCTCTGAATGGTTCAGGCAGT
+
@J@H@>B9:B;<D==:<;:,<::?463-,,
@Rosalind_0029
CACTCTTACTCCCTAGCCGAACTCCTTTTT
+
=88;99637@5,4664-65)/?4-2+)$)$
@Rosalind_0029
GATTATGATATCAGTTGGCTCCGAGAGCGT
+
<@BGE@8C9=B9:B<>>>7?B>7:02+33.
"""
= bphr(sample_input)
count print(count)
12 Finding Genes with ORFs
An ORF begins with a start codon and ends either at a stop codon or at the end of the string. We will assume the standard genetic code for translating an RNA string into a protein string (i.e., see the standard RNA codon table).
ORF finder
from the SMS 2 package can be run online here.
Given: A DNA string \(s\) of length at most 1 kbp.
Return: The longest protein string that can be translated from an ORF of \(s\). If more than one protein string of maximum length exists, then you may output any solution.
12.1 Sample Dataset
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
12.2 Sample Output
MLLGSFRLIPKETLIQVAGSSPCNLS
12.3 Solution
from Bio.Seq import Seq
def gene_ORFs(dna_sequence):
"""Finds and prints the longest open reading frame (ORF) from a DNA sequence."""
def translate_sequence(seq):
"""Translate a sequence in all three reading frames and split by stop codons."""
return [str(seq[i:].translate(to_stop=False)).split("*") for i in range(3)]
# Create a Seq object from the DNA sequence
= Seq(dna_sequence)
seq = seq.reverse_complement()
seq_reverse
# Translate the sequence and its reverse complement in all reading frames
= []
proteins
proteins.extend(translate_sequence(seq))
proteins.extend(translate_sequence(seq_reverse))
# Flatten the list of lists
= [protein for sublist in proteins for protein in sublist]
proteins
# Find ORFs starting with 'M' and sort by length
= sorted([protein[protein.find("M"):] for protein in proteins if "M" in protein], key=len, reverse=True)
orfs return orfs
# Sample input
= """
sample_input AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
"""
# Strip any leading/trailing whitespace
= sample_input.strip()
dna
# Find and print the longest ORF
= gene_ORFs(dna)
orfs print(orfs[0])
13 Base Filtration by Quality
Bad quality bases can be easily trimmed out using certain threshold (defined by quality plot similar to what we did in “Base Quality Distribution”) There is a lot of trimming tools, you can try one of following:
FASTQ Quality Trimmer tool on the Galaxy. It uses a “sliding window” approach so for a simple trimming of the ends you should set window size 1.
Trimmomatic. It is a command-line java-based tool, detail description and download link can be found here. For a simple trimming from both ends you should specify parameters LEADING and TRAILING.
Given: FASTQ file, quality cut-off value \(q\), Phred33 quality score assumed.
Return: FASTQ file trimmed from the both ends (removed leading and trailing bases with quality lower than \(q\))
13.1 Sample Dataset
20
@Rosalind_0049
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,
@Rosalind_0049
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT
+
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,*3+*!
@Rosalind_0049
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+
13.2 Sample Output
@Rosalind_0049
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACAC
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527
@Rosalind_0049
ATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCT
+
<<65:793967<4:92568-34:.>1;2752)24')*15;
@Rosalind_0049
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCT
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8
13.3 Solution
def convert_to_phred(quality_string):
"""Convert a quality string to Phred scores."""
return [ord(char) - 33 for char in quality_string]
def find_quality_bounds(phred_quality, threshold):
"""Find the start and end indices where quality is >= threshold."""
= 0
start while start < len(phred_quality) and phred_quality[start] < threshold:
+= 1
start
= len(phred_quality)
end while end > start and phred_quality[end-1] < threshold:
-= 1
end
return start, end
def format_fastq_record(identifier, sequence, quality_string, start, end):
"""Format a FASTQ record with the given bounds."""
return f"@{identifier}\n{sequence[start:end]}\n+\n{quality_string[start:end]}"
def Base_Filtration_Quality(data):
"""Process FASTQ data and filter based on quality threshold."""
= int(data[0])
threshold = []
filtered_records
for i in range(1, len(data), 4):
= data[i].strip()
identifier = data[i+1].strip()
sequence = data[i+3].strip()
quality_string
= convert_to_phred(quality_string)
phred_quality = find_quality_bounds(phred_quality, threshold)
start, end
if start < end:
= format_fastq_record(identifier, sequence, quality_string, start, end)
filtered_record
filtered_records.append(filtered_record)
return filtered_records
# Sample input
= """
sample_input 20
@Rosalind_0049
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,
@Rosalind_0050
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT
+
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,*3+*!
@Rosalind_0051
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+
""".strip().split("\n")
# Process the input and print the results
= Base_Filtration_Quality(sample_input)
filtered_records for record in filtered_records:
print(record)
14 Suboptimal Local Alignment
Given: Two DNA strings ss and t in FASTA format that share some short inexact repeat r of 32-40 bp. By “inexact” we mean that rr may appear with slight modifications (each repeat differ by ≤3 changes/indels).
Return: The total number of occurrences of r as a substring of s, followed by the total number of occurrences of r as a substring of t.
14.1 Sample Dataset
>Rosalind_12
GACTCCTTTGTTTGCCTTAAATAGATACATATTTACTCTTGACTCTTTTGTTGGCCTTAAATAGATACATATTTGTGCGACTCCACGAGTGATTCGTA
>Rosalind_37
ATGGACTCCTTTGTTTGCCTTAAATAGATACATATTCAACAAGTGTGCACTTAGCCTTGCCGACTCCTTTGTTTGCCTTAAATAGATACATATTTG
14.2 Sample Output
2 2
14.3 Solution
from Bio import SeqIO
from io import StringIO
def hamming_distance(s1, s2):
return sum(c1 != c2 for c1, c2 in zip(s1, s2))
def count_inexact_repeats(pattern, seq, max_mismatches=3):
= 0
count = len(pattern)
pattern_length = len(seq)
seq_length
for i in range(seq_length - pattern_length + 1):
if hamming_distance(seq[i:i+pattern_length], pattern) <= max_mismatches:
+= 1
count
return count
# Your sample input
= """
sample_input >Rosalind_12
GACTCCTTTGTTTGCCTTAAATAGATACATATTTACTCTTGACTCTTTTGTTGGCCTTAAATAGATACATATTTGTGCGACTCCACGAGTGATTCGTA
>Rosalind_37
ATGGACTCCTTTGTTTGCCTTAAATAGATACATATTCAACAAGTGTGCACTTAGCCTTGCCGACTCCTTTGTTTGCCTTAAATAGATACATATTTG
"""
# Parse the FASTA data
= StringIO(sample_input.strip())
fasta_io = list(SeqIO.parse(fasta_io, "fasta"))
sequences
# Extract the sequences as strings
= str(sequences[0].seq)
s = str(sequences[1].seq)
t
# Find the shared inexact repeat
# We'll use a simple approach: try all substrings of length 32-40 from s as potential repeats
= ""
best_repeat = 0
best_count
for length in range(32, 41):
for i in range(len(s) - length + 1):
= s[i:i+length]
potential_repeat = count_inexact_repeats(potential_repeat, s)
count_s = count_inexact_repeats(potential_repeat, t)
count_t = count_s + count_t
total_count
if total_count > best_count:
= total_count
best_count = potential_repeat
best_repeat
# Count the occurrences of the best repeat in both sequences
= count_inexact_repeats(best_repeat, s)
count_s = count_inexact_repeats(best_repeat, t)
count_t
print(f"{count_s} {count_t}")
15 Global Multiple Alignment
One of the first and commonly used programs for MSA is Clustal, developed by Des Higgins in 1988. The current version using the same approach is called ClustalW2, and it is embedded in many software packages. There is even a modification of ClustalW2 called ClustalX that provides a graphical user interface for MSA.
See the link below for a convenient online interface that runs Clustal on the EBI website:
Select “Protein” or “DNA”, then either paste your sequence in one of the listed formats or upload an entire file. To obtain a more accurate alignment, leave Alignment type: slow
selected: if you choose to run Clustal on only two sequences, then the parameter options correspond to those in Needle
(see “Pairwise Global Alignment”).
Given: Set of nucleotide strings in FASTA format.
Return: ID of the string most different from the others.
15.1 Sample Dataset
>Rosalind_18
GACATGTTTGTTTGCCTTAAACTCGTGGCGGCCTAGCCGTAAGTTAAG
>Rosalind_23
ACTCATGTTTGTTTGCCTTAAACTCTTGGCGGCTTAGCCGTAACTTAAG
>Rosalind_51
TCCTATGTTTGTTTGCCTCAAACTCTTGGCGGCCTAGCCGTAAGGTAAG
>Rosalind_7
CACGTCTGTTCGCCTAAAACTTTGATTGCCGGCCTACGCTAGTTAGTTA
>Rosalind_28
GGGGTCATGGCTGTTTGCCTTAAACCCTTGGCGGCCTAGCCGTAATGTTT
15.2 Sample Output
Rosalind_7
15.3 Solution
from Bio import SeqIO
from io import StringIO
def simple_msa(sequences):
# Find the length of the longest sequence
= max(len(seq) for seq in sequences)
max_length
# Pad shorter sequences with gaps
= [seq.ljust(max_length, '-') for seq in sequences]
aligned_sequences
return aligned_sequences
def calculate_differences(seq1, seq2):
return sum(a != b for a, b in zip(seq1, seq2))
def find_most_different_sequence(fasta_string):
# Parse the FASTA string
= StringIO(fasta_string)
fasta_io = list(SeqIO.parse(fasta_io, "fasta"))
sequences
# Perform a simple multiple sequence alignment
= simple_msa([str(seq.seq) for seq in sequences])
aligned_seqs
= len(aligned_seqs)
n_seqs = []
avg_distances
for i in range(n_seqs):
= sum(calculate_differences(aligned_seqs[i], aligned_seqs[j])
total_distance for j in range(n_seqs) if i != j)
/ (n_seqs - 1))
avg_distances.append(total_distance
= avg_distances.index(max(avg_distances))
most_different_index return sequences[most_different_index].id
# Your sample dataset
= """
sample_data >Rosalind_18
GACATGTTTGTTTGCCTTAAACTCGTGGCGGCCTAGCCGTAAGTTAAG
>Rosalind_23
ACTCATGTTTGTTTGCCTTAAACTCTTGGCGGCTTAGCCGTAACTTAAG
>Rosalind_51
TCCTATGTTTGTTTGCCTCAAACTCTTGGCGGCCTAGCCGTAAGGTAAG
>Rosalind_7
CACGTCTGTTCGCCTAAAACTTTGATTGCCGGCCTACGCTAGTTAGTTA
>Rosalind_28
GGGGTCATGGCTGTTTGCCTTAAACCCTTGGCGGCCTAGCCGTAATGTTT
"""
= find_most_different_sequence(sample_data)
result print(result)