Using Perl for Bioinformatics - Science and Technology Support Group High Performance Computing
Using Perl for Bioinformatics Science and Technology Support Group High Performance Computing Ohio Supercomputer Center 1224 Kinnear Road Columbus, OH 43212-1163
Table of Contents
• Section 1
  – Concatenate sequences
  – Transcribe DNA to RNA
  – Reverse complement of sequences
  – Read sequence data from files
  – Searching for motifs in DNA or proteins
  – Exercises 1
• Section 2
  – Subroutines
  – Mutations and Randomization
  – Translating DNA into Proteins using Libraries of Subroutines
  – BioPerl Modules
  – Exercises 2
• Section 3
  – Read FASTA Files
  – Exercises 3
• Section 4
  – GenBank Files and Libraries
  – Exercises 4
• Section 5
  – PDB
  – Exercises 5
• Section 6
  – Blast Files
  – Exercises 6
Section 1 : Sequences and Regular Expressions

Example 1-1 : Concatenation of two strings of DNA
• Concatenate two DNA sequences held in two Perl variables.
  – Two character sequences are assigned to scalar variables.
  – The two sequences are used to create a third variable.
  – The third variable holds the concatenated sequence, built with the ‘.’ operator.
• Use the ‘print’ command to print the concatenated sequence to stdout.
  – Example 1-1 shows several different ways to print the concatenated sequence.
  – Note the use of the newline character, "\n".
Example 1-1

Note that the two different assignments to $DNA3 achieve the same result:
1. $DNA3 = "$DNA1$DNA2";
2. $DNA3 = $DNA1 . $DNA2;

    #!/usr/bin/perl -w
    # Example 1-1 Concatenating DNA

    # Store two DNA fragments into two variables called $DNA1 and $DNA2
    $DNA1 = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
    $DNA2 = 'ATAGTGCCGTGAGAGTGATGTAGTA';

    # Print the DNA onto the screen
    print "Here are the original two DNA fragments:\n\n";
    print $DNA1, "\n";
    print $DNA2, "\n\n";

    # Concatenate the DNA fragments into a third variable and print them
    # Using "string interpolation"
    $DNA3 = "$DNA1$DNA2";
    print "Here is the concatenation of the first two fragments (version 1):\n\n";
    print "$DNA3\n\n";

    # An alternative way using the "dot operator":
    # Concatenate the DNA fragments into a third variable and print them
    $DNA3 = $DNA1 . $DNA2;
    print "Here is the concatenation of the first two fragments (version 2):\n\n";
    print "$DNA3\n\n";

    # Print the same thing without using the variable $DNA3
    print "Here is the concatenation of the first two fragments (version 3):\n\n";
    print $DNA1, $DNA2, "\n";

    exit;

Results of running example 1-1:

    Here are the original two DNA fragments:

    ACGGGAGGACGGGAAAATTACTACGGCATTAGC
    ATAGTGCCGTGAGAGTGATGTAGTA

    Here is the concatenation of the first two fragments (version 1):

    ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

    Here is the concatenation of the first two fragments (version 2):

    ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA

    Here is the concatenation of the first two fragments (version 3):

    ACGGGAGGACGGGAAAATTACTACGGCATTAGCATAGTGCCGTGAGAGTGATGTAGTA
Example 1-2 : Transcribing DNA to RNA
• Replace all thymine with uracil in the DNA.
  – Replace all the ‘T’ characters in the string with ‘U’.
  – Use the binding operator ‘=~’.
  – Use a global regular expression substitution, ‘s/T/U/g’.
Example 1-2

1. Copy the string in $DNA into the variable $RNA.
2. $RNA =~ s/T/U/g; substitutes every uppercase T with an uppercase U.

    #!/usr/bin/perl -w
    # Transcribing DNA into RNA

    # The DNA
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # Print the DNA onto the screen
    print "Here is the starting DNA:\n\n";
    print "$DNA\n\n";

    # Transcribe the DNA to RNA by substituting
    # all T's with U's.
    $RNA = $DNA;
    $RNA =~ s/T/U/g;

    # Print the RNA onto the screen
    print "Here is the result of transcribing the DNA to RNA:\n\n";
    print "$RNA\n";

    # Exit the program.
    exit;

Results of running example 1-2:

    Here is the starting DNA:

    ACGGGAGGACGGGAAAATTACTACGGCATTAGC

    Here is the result of transcribing the DNA to RNA:

    ACGGGAGGACGGGAAAAUUACUACGGCAUUAGC
Example 1-3 : Calculating the Reverse Complement of a DNA strand
• Find the reverse of the DNA string.
• Calculate the complement of the reversed string.
  – Substitute each base with its complement.
    • A -> T; T -> A; C -> G; G -> C.
  – We could try applying the substitution operator four times:
    • $var =~ s/A/T/g;
    • $var =~ s/T/A/g;
    • $var =~ s/C/G/g;
    • $var =~ s/G/C/g;
  – This gives the wrong answer: the second substitution turns every T, including the ones just created from A, back into A, and the fourth does the same to every G.
  – Fortunately Perl has the transliteration operator, tr///, which replaces every character in a single pass.
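The failure of the four successive substitutions is easy to see on a short string. The following is a minimal sketch; the subroutine names complement_with_s and complement_with_tr are made up for this illustration and are not part of the workshop code.

```perl
use strict;
use warnings;

# Complement a DNA string the naive way, with four successive substitutions.
# This is deliberately wrong; it is here only to show the failure.
sub complement_with_s {
    my ($dna) = @_;
    $dna =~ s/A/T/g;    # every A becomes T
    $dna =~ s/T/A/g;    # every T, old and new, becomes A
    $dna =~ s/C/G/g;    # every C becomes G
    $dna =~ s/G/C/g;    # every G, old and new, becomes C
    return $dna;
}

# Complement a DNA string with tr///, which maps each character exactly once.
sub complement_with_tr {
    my ($dna) = @_;
    $dna =~ tr/ACGTacgt/TGCAtgca/;
    return $dna;
}

my $dna = 'AACGTT';
print "naive s///: ", complement_with_s($dna),  "\n";    # AACCAA (wrong)
print "tr///     : ", complement_with_tr($dna), "\n";    # TTGCAA (correct)
```

After the naive version runs, only A and C remain in the string, because each pair of substitutions undoes the first of the pair.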
Example 1-3

Note that tr/// replaces each character in the first list with the corresponding character in the second list. In this example both uppercase and lowercase bases are translated.

    #!/usr/bin/perl -w
    # Calculating the reverse complement of a strand of DNA

    # The DNA
    $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';

    # Print the DNA onto the screen
    print "Here is the starting DNA:\n\n";
    print "$DNA\n\n";

    # Make a new, reversed copy of the DNA
    $revcom = reverse $DNA;

    # See the text for a discussion of tr///
    $revcom =~ tr/ACGTacgt/TGCAtgca/;

    # Print the reverse complement DNA onto the screen
    print "Here is the reverse complement DNA:\n\n";
    print "$revcom\n";

    exit;

Results of running example 1-3:

    Here is the starting DNA:

    ACGGGAGGACGGGAAAATTACTACGGCATTAGC

    Here is the reverse complement DNA:

    GCTAATGCCGTAGTAATTTTCCCGTCCTCCCGT
Example 1-4 : Reading protein sequences from a file
• Use ‘open’.
  – Use a character string variable for the filename.
  – open(FILEPOINTER, $filename);
• Read in the contents.
  – Use angle brackets around the filehandle, ‘<FILEPOINTER>’.
  – Need to create a loop to read in all lines.
• Read from a file named on the command line.
  – Use empty angle brackets, ‘<>’.
  – Do not need to create a filepointer.
  – Read into an array.
  – Need to create a loop to process all lines of the array.
Example 1-4

The filename is set by assigning it to the string variable $proteinfilename. The ‘while’ loop reads from the file one line at a time. Each line from the file is concatenated onto the end of the string read so far. It is good programming practice to close the filehandle when done. Note that in the output each line of the file appears on its own line, because the newlines are kept.

    #!/usr/bin/perl -w
    $longprotein = '';

    # Example 4-5 Reading protein sequence data from a file
    # Usage: perl example1-4.pl

    # The filename of the file containing the protein sequence data
    $proteinfilename = 'NM_021964fragment.pep';

    # First we have to "open" the file, and associate
    # a "filehandle" with it. We choose the filehandle
    # PROTEINFILE for readability.
    open(PROTEINFILE, $proteinfilename);

    # Now we do the actual reading of the protein sequence from the
    # file by using the angle brackets < and > to get the input from the
    # filehandle. We store the data into our variable $protein.
    while ($protein = <PROTEINFILE>) {
        $longprotein .= $protein;
    }

    # Now that we've got our data, we can close the file.
    close PROTEINFILE;

    # Print the protein onto the screen
    print "Here is the protein:\n\n";
    print $longprotein;

    exit;

Results of running example 1-4:

    Here is the protein:

    MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQD
    SVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
    GLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
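Example 1-4 does not check whether ‘open’ succeeded, so a misspelled filename produces only warnings and an empty result. The sketch below shows one common way to guard against that, using the three-argument form of open, a lexical filehandle and ‘die’. The subroutine name read_sequence_file is an assumption made for this illustration; it is not part of the workshop code.

```perl
use strict;
use warnings;

# Read a whole sequence file into one string, newlines included.
# Dies with the system error message ($!) if the file cannot be opened.
sub read_sequence_file {
    my ($filename) = @_;

    # Three-argument open with a lexical filehandle, checked for failure
    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";

    my $sequence = '';
    while (my $line = <$fh>) {
        $sequence .= $line;
    }
    close $fh;

    return $sequence;
}
```

A call such as print read_sequence_file('NM_021964fragment.pep'); would then print the same protein as Example 1-4, or stop with an explanatory message if the file is missing.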
Example 1-4b

The filename is given as an argument on the command line. This is much more convenient than writing a different Perl script for each file we need to open. The command

    @data_from_file = <>;

treats each argument on the command line as a filename, opens each file, and then reads each line of the file into the array. Creating a filehandle is not needed.

The ‘foreach’ loop then retrieves each element of the array, discards the newline at the end, and concatenates the string onto the end of the string variable $longprotein.

    #!/usr/bin/perl -w
    $longprotein = '';

    # Example 4-5 Reading protein sequence data from a file
    # Usage: perl example1-4b.pl filename

    # The filename of the file containing the protein sequence data
    # is in the command line. The '<>' is a shortcut for <ARGV>.
    # <ARGV> treats the @ARGV array as a list of
    # filenames, returning the contents
    # of those files one line at a time. The contents of those files are
    # available to the program, using the angle brackets <>,
    # without a filehandle.
    @data_from_file = <>;

    # Using the foreach loop, we access the data from the array,
    # one line at a time. Removing the 'newline' from the string,
    # concatenate to the string variable, making one long protein
    # string.
    foreach (@data_from_file) {
        # chomp removes only a trailing newline; chop would remove the last
        # character whatever it is, and so could delete a residue
        chomp $_;
        $longprotein .= $_;
    }

    # Print the protein onto the screen
    print "Here is the protein:\n";
    print $longprotein."\n";

    exit;

Results of running example 1-4b:

    Here is the protein:
    MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSGQSTVSGELQDSVLQDRSMPHQEILAADEVLQESEMRQQDMISHDELMVHEETVKNDEEQMETHERLPQGLQYALNVPISVKQEITFTDVSEQLMRDKKQIR
Example 1-5 : Searching for motifs in DNA or proteins
• Prompt the user for the filename and for the motif strings.
  – Specify a filename to open.
  – open(FILEPOINTER, $filename);
• Read in the contents.
  – Read the lines of the file into an array.
  – Concatenate all lines of the array into a scalar variable.
  – Remove all newlines and blanks from the scalar variable.
• Compare the motif entered from the terminal to the protein string.
  – Use a regular expression match.
  – Exit the program when the motif contains only whitespace.
Example 1-5

The filename is read from standard input in answer to the prompt:
    $proteinfilename = <STDIN>;
The ‘unless’ condition checks that the file can be opened, exiting if it cannot:
    unless ( open(PROTEINFILE, $proteinfilename) )
Each line of the file is then put into an array, @protein, after which the filehandle is closed:
    @protein = <PROTEINFILE>;
By using ‘join’, the lines in the array are put into one long character string, including newline characters:
    $protein = join( '', @protein);
All whitespace, including newlines, tabs and blanks, is then removed:
    $protein =~ s/\s//g;

    #!/usr/bin/perl -w
    # Example 5-3 Searching for motifs

    # Ask the user for the filename of the file containing
    # the protein sequence data, and collect it from the keyboard
    print "Please type the filename of the protein sequence data: ";
    $proteinfilename = <STDIN>;

    # Remove the newline from the protein filename
    chomp $proteinfilename;

    # open the file, or exit
    unless ( open(PROTEINFILE, $proteinfilename) ) {
        print "Cannot open file \"$proteinfilename\"\n\n";
        exit;
    }

    # Read the protein sequence data from the file, and store it
    # into the array variable @protein
    @protein = <PROTEINFILE>;

    # Close the file - we've read all the data into @protein now.
    close PROTEINFILE;

    # Put the protein sequence data into a single string, as it's easier
    # to search for a motif in a string than in an array of
    # lines (what if the motif occurs over a line break?)
    $protein = join( '', @protein);

    # Remove whitespace
    $protein =~ s/\s//g;
Example 1-5 (cont’d)

The loop controls the search for the motif in the entire protein string.
The variable $motif is assigned the character string typed in the shell:
    $motif = <STDIN>;
The newline character is removed from the end of the string:
    chomp $motif;
The character string $motif is compared to the protein string for a match:
    $protein =~ /$motif/
When the user types nothing but whitespace, the program exits:
    until ( $motif =~ /^\s*$/ );

    # In a loop, ask the user for a motif, search for the motif,
    # and report if it was found.
    # Exit if no motif is entered.
    do {
        print "Enter a motif to search for: ";
        $motif = <STDIN>;

        # Remove the newline at the end of $motif
        chomp $motif;

        # Look for the motif
        if ( $protein =~ /$motif/ ) {
            print "I found it!\n\n";
        } else {
            print "I couldn\'t find it.\n\n";
        }

    # exit on an empty user input
    } until ( $motif =~ /^\s*$/ );

    # exit the program
    exit;

Results from running example1-5.pl:

    Please type the filename of the protein sequence data: NM_021964fragment.pep
    Enter a motif to search for: SVLQ
    I found it!

    Enter a motif to search for: sqlv
    I couldn't find it.

    Enter a motif to search for: QDSV
    I found it!

    Enter a motif to search for: HERLPQGLQ
    I found it!

    Enter a motif to search for:
    I couldn't find it.
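Because $motif is interpolated into the pattern, whatever the user types is treated as a regular expression, so a ‘.’ matches any residue. The following minimal sketch shows the difference between a regular expression search and a literal search using Perl's built-in quotemeta. The subroutine name contains_motif and its third argument are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Return 1 if $motif occurs in $protein, 0 otherwise.
# With a true third argument the motif is matched literally;
# otherwise it is treated as a regular expression, as in Example 1-5.
sub contains_motif {
    my ($protein, $motif, $literal) = @_;

    # Blank input is treated as "no motif"
    return 0 if $motif =~ /^\s*$/;

    # quotemeta escapes regular-expression metacharacters such as '.'
    my $pattern = $literal ? quotemeta($motif) : $motif;

    return $protein =~ /$pattern/ ? 1 : 0;
}

my $protein = 'MVHEETVKNDEEQ';
print contains_motif($protein, 'K.D'),    "\n";    # 1: '.' matches the N in KND
print contains_motif($protein, 'K.D', 1), "\n";    # 0: no literal "K.D" present
```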
Exercises for Section 1

1. Explore the sensitivity of programming languages to errors of syntax. Try removing the semicolon from the end of any statement of one of our working programs and examining the error messages that result, if any. Try changing other syntactical items: add a parenthesis or a curly brace; misspell some command, like "print" or some other reserved word; just type in, or delete, anything. Programmers get used to seeing such errors; even after getting to know the language well, it is still common to have some syntax errors as you gradually add code to a program. Notice how one error can lead to many lines of error reporting. Is Perl accurately reporting the line where the error is?
2. Write a program that prints DNA (which could be in upper- or lowercase originally) in lowercase (acgt); write another that prints the DNA in uppercase (ACGT). Use the function tr///.
3. Do the same thing as Exercise 2, but use the string directives \U and \L for upper- and lowercase. For instance, print "\U$DNA" prints the data in $DNA in uppercase.
4. Prompt the user to enter two (short) strings of DNA. Concatenate the two strings of DNA by appending the second to the first using the .= assignment operator. Print the two strings as concatenated, and then print the second string lined up over its copy at the end of the concatenated strings. For example, if the input strings are AAAA and TTTT, print:
       AAAATTTT
           TTTT
5. Write a program to calculate the reverse complement of a strand of DNA. Do not use the s/// or the tr/// functions. Use the substr function, and examine each base one at a time in the original while you build up the reverse complement. (Hint: you might find it easier to examine the original right to left, rather than left to right, although either is possible.)
6. Write a program to report how GC-rich some sequence is. (In other words, just give the percentage of G and C in the DNA.)
7. Modify Example 1-5 to not only find motifs by regular expressions but to print out the motif that was found. For example, if you search, using regular expressions, for the motif EE.*EE, your program should print EETVKNDEE. You can use the special variable $&. After a successful pattern match, this special variable is set to hold the text that was matched.
8. Write a program that switches two bases in a DNA string at specified positions. (Hint: you can use the Perl functions substr or slice.)
Section 2 : Mutations, Randomization and Modules

Example 2-1 : Counting bases in a DNA string, using subroutines
• Subroutines are very efficient.
  – Write once, use many times.
  – Routines that are widely useful may be stored in a library for future use.
• Lexical scoping using the ‘my’ declaration
  – It is important to understand the scope of variables.
  – Use ‘my’ to declare variables within the scope of the enclosing block of code.
  – The same variable name may then be used in different code segments.
  – Declare ‘use strict’ to force variables to be declared with ‘my’.
• Use the special array @_ to pass arguments to a subroutine.
  – my($var1, $var2, $var3) = @_;
  – This assigns the values of the arguments passed to the subroutine to the named variables.
  – A common mistake is to forget the assignment from @_.
    • The variables will then not have their passed values.
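The scoping rules and the @_ mistake listed above can be sketched in a few lines. The subroutine names base_at and base_at_broken are made up for this illustration; they are not part of the workshop code.

```perl
use strict;
use warnings;

my $dna = 'acgt';                  # file-level variable

# Return the base at a 0-based position, in uppercase.
sub base_at {
    my ($dna, $position) = @_;     # new variables, visible only in this sub
    $dna = uc $dna;                # changes the local copy only
    return substr($dna, $position, 1);
}

# The mistake: the variables are declared but @_ is never read.
sub base_at_broken {
    my ($dna, $position);          # both undef, whatever the caller passed
    return defined $dna ? 'defined' : 'undefined';
}

print base_at($dna, 2), "\n";           # G
print "$dna\n";                         # acgt (the caller's copy is unchanged)
print base_at_broken($dna, 2), "\n";    # undefined
```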
Example 2-1

The command ‘use strict’ requires every variable to be declared, here with ‘my’. This limits the scope of each variable.

A string variable, $USAGE, is declared to hold the usage line.

The ‘unless’ condition makes sure there are arguments on the command line. The special array @ARGV is empty, and therefore false in this test, when no arguments are present on the command line.

The character string on the command line is assigned to the variable $dna. The first element of the argument array, in this case the only argument, is $ARGV[0]. Individual elements of an array are referenced by the syntax $array1[n].

    #!/usr/bin/perl -w
    # Example 2-1 Counting the number of G's in some DNA on the
    # command line
    use strict;

    # Collect the DNA from the arguments on the command line
    # when the user calls the program.
    # If no arguments are given, print a USAGE statement and exit.

    # $0 is a special variable that has the name of the program.
    my($USAGE) = "$0 DNA\n\n";

    # @ARGV is an array containing all command-line arguments.
    #
    # If it is empty, the test will fail and the print USAGE and exit
    # statements will be called.
    unless(@ARGV) {
        print $USAGE;
        exit;
    }

    # Read in the DNA from the argument on the command line.
    my($dna) = $ARGV[0];
Example 2-1 (cont’d)

The subroutine ‘countG’ takes a character string as an argument and returns a number.

The line "my($num_of_Gs) = countG($dna);" passes the DNA sequence to the subroutine ‘countG’ and assigns the returned number to the variable ‘$num_of_Gs’.

Inside the subroutine, a new variable $dna, lexically scoped to the subroutine only, is assigned the value passed.

The variable $count is initialized to the value ‘0’.

The operation $dna =~ tr/Gg// has an empty replacement list, so it leaves the string unchanged and simply counts every upper- or lowercase G.

The value assigned to $count is the number of characters that tr/// matched, and it is returned.

    # Call the subroutine that does the real work, and collect the result.
    my($num_of_Gs) = countG ( $dna );

    # Report the result and exit.
    print "\nThe DNA $dna has $num_of_Gs G\'s in it!\n\n";
    exit;

    ########################################
    # Subroutines for Example 2-1
    ########################################

    sub countG {
        # return a count of the number of G's in the argument $dna

        # initialize arguments and variables
        my($dna) = @_;
        my($count) = 0;

        # Use tr/// to count nucleotides in DNA
        $count = ( $dna =~ tr/Gg//);

        return $count;
    }

Results from running example2-1.pl:

    perl example2-1.pl CGGATTTAGCGCGT

    The DNA CGGATTTAGCGCGT has 5 G's in it!
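The same counting idiom extends to all four bases. The following is a minimal sketch, assuming a hypothetical subroutine count_bases that returns a hash; it also shows that the counted string is left unchanged.

```perl
use strict;
use warnings;

# Count each base in a DNA string, ignoring case.
# tr/// with an empty replacement list and no /d modifier only counts.
sub count_bases {
    my ($dna) = @_;
    my %count;
    $count{A} = ($dna =~ tr/Aa//);
    $count{C} = ($dna =~ tr/Cc//);
    $count{G} = ($dna =~ tr/Gg//);
    $count{T} = ($dna =~ tr/Tt//);
    return %count;
}

my $dna   = 'CGGATTTAGCGCGT';
my %count = count_bases($dna);
print "A=$count{A} C=$count{C} G=$count{G} T=$count{T}\n";    # A=2 C=3 G=5 T=4
print "$dna\n";                                               # CGGATTTAGCGCGT
```

One call per base is used because the search list of tr/// is fixed when the program is compiled and cannot be taken from a variable without eval.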
Example 2-2 : Creating mutant DNA using Perl’s random number generator
• Simulate mutating DNA using a random number generator.
  – Randomly pick a nucleotide position in a DNA string.
  – Randomly pick a base from the four: A, C, T, G.
  – Replace the nucleotide at the selected position of the DNA string with the randomly selected base.
• Random number algorithms produce only pseudo-random numbers.
  – With the same seed, a random number generator will produce the same series of numbers.
  – The algorithms are designed to give an even distribution of values.
• Random numbers require a ‘seed’.
  – It should be selected randomly as well.
  – Different seed values will produce different sequences of random numbers.
  – If security or privacy matters in your program (for example, with patient records), you should consult the Perl documentation and the Math::Random and Math::TrulyRandom modules from CPAN.
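The effect of the seed can be checked directly: seeding twice with the same value gives the same series. This is a minimal sketch; the subroutine name random_series and the seed value 42 are arbitrary choices made for this illustration.

```perl
use strict;
use warnings;

# Return $n pseudo-random integers in the range 0..3,
# after seeding the generator with $seed.
sub random_series {
    my ($seed, $n) = @_;
    srand($seed);
    return map { int rand 4 } 1 .. $n;
}

my @first  = random_series(42, 10);
my @second = random_series(42, 10);

print "@first\n";
print "@second\n";
print "The two series are ", ("@first" eq "@second" ? "identical" : "different"), "\n";
```

The actual numbers depend on the Perl build, so none are shown here; only the fact that the two lines match is guaranteed.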
Example 2-2

This is the main program, which seeds the random number algorithm and calls the subroutine mutate().

The call to srand() uses the seed ‘time|$$’, which bitwise-ORs the current time with the process ID so that the seed is unlikely to repeat between runs. This is not a very secure method but it will do for our purposes.

The argument to mutate() is the current DNA string.

    #!/usr/bin/perl -w
    # Example 2-2 Mutate DNA
    # using a random number generator to randomly select bases to mutate
    use strict;
    use warnings;

    # Declare the variables

    # The DNA is chosen to make it easy to see mutations:
    my $DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA';

    # $i is a common name for a counter variable, short for "integer"
    my $i;
    my $mutant;

    # Seed the random number generator.
    # time|$$ combines the current time with the current process id
    srand(time|$$);

    $mutant = mutate($DNA);

    print "\nMutate DNA\n\n";
    print "\nHere is the original DNA:\n\n";
    print "$DNA\n";
    print "\nHere is the mutant DNA:\n\n";
    print "$mutant\n";

    # Let's put it in a loop and watch that bad boy accumulate mutations:
    print "\nHere are 10 more successive mutations:\n\n";
    for ($i=0 ; $i < 10 ; ++$i) {
        $mutant = mutate($mutant);
        print "$mutant\n";
    }

    exit;
Example 2-2 (cont’d)

The subroutine mutate() takes its argument from the special array @_ and assigns it to the variable $dna.

The array @nucleotides is initialized with the four nucleotides.

The subroutine randomposition() takes the current DNA string and returns a position within the string.

The subroutine randomnucleotide() is passed our array of bases and returns a randomly selected base.

Finally, the Perl function substr() takes the DNA string, the random position, the length of the piece to replace (here 1) and the replacement string, and changes the string held in $dna, which is then returned.

    ########################################################
    # Subroutines for Example 2-2
    ########################################################

    # A subroutine to perform a mutation in a string of DNA
    #
    # WARNING: make sure you call srand to seed the
    # random number generator before you call this function.
    sub mutate {
        my($dna) = @_;
        my(@nucleotides) = ('A', 'C', 'G', 'T');

        # Pick a random position in the DNA
        my($position) = randomposition($dna);

        # Pick a random nucleotide
        my($newbase) = randomnucleotide(@nucleotides);

        # Insert the random nucleotide into the random position in the DNA
        # The substr arguments mean the following:
        # In the string $dna at position $position change 1 character to
        # the string in $newbase
        substr($dna,$position,1,$newbase);

        return $dna;
    }
Example 2-2 (cont’d)

randomnucleotide() passes an array of bases to the function randomelement(), which in turn returns the randomly chosen nucleotide.

randomelement() is given an array and returns a randomly selected element from it. How is this done? rand() expects a scalar value, so it evaluates the array @array in scalar context, which gives the size of @array. Perl was designed to use the integer part of a floating-point value as an array subscript. Here $array[rand @array] returns the element of the array at a subscript randomly chosen from 0 to n-1, where n is the length of the array.

    # A subroutine to randomly select an element from an array
    #
    # WARNING: make sure you call srand to seed the
    # random number generator before you call this function.
    sub randomelement {
        my(@array) = @_;

        # Here the code is succinctly represented rather than
        # "return $array[int rand scalar @array];"
        return $array[rand @array];
    }

    # randomnucleotide
    #
    # A subroutine to select at random one of the four nucleotides
    #
    # WARNING: make sure you call srand to seed the
    # random number generator before you call this function.
    sub randomnucleotide {
        my(@nucleotides) = ('A', 'C', 'G', 'T');

        # scalar returns the size of an array.
        # The elements of the array are numbered 0 to size-1
        return randomelement(@nucleotides);
    }
Example 2-2 (cont’d)

randomposition() takes a string argument and calculates a random position within the string. It is very concise and useful. The return command could have been written:

    return (int (rand (length $string)));

Certainly, this is more understandable, but I believe there is no loss of clarity, as in Perl we can write these as a sequence of function calls. Chaining single-argument functions is often done in Perl.

rand() takes the length as an argument and calculates a floating-point number from 0 up to, but not including, the length. int() rounds the floating-point number down, giving an integer in the range 0 to length-1.

    # randomposition
    #
    # A subroutine to randomly select a position in a string.
    #
    # WARNING: make sure you call srand to seed the
    # random number generator before you call this function.
    sub randomposition {
        my($string) = @_;

        # Notice the "nested" arguments:
        #
        # $string is the argument to length
        # length($string) is the argument to rand
        # rand(length($string)) is the argument to int
        # int(rand(length($string))) is the argument to return
        #
        # rand returns a decimal number between 0 and its argument.
        # int returns the integer portion of a decimal number.
        #
        # The whole expression returns a random number
        # between 0 and length-1,
        # which is how the positions in a string are numbered in Perl.
        #
        return int rand length $string;
    }

Results from running example2-2.pl:

    Mutate DNA


    Here is the original DNA:

    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    Here is the mutant DNA:

    AAAAAAAAAAAAAAAAAAAAAAAGAAAAAA

    Here are 10 more successive mutations:

    AAAAAAAAAAAAAAAAAAAAAAAGAAAAAG
    AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
    AAAAAAAAAAAAAAAAAAAACAAGAAAAAG
    CAAAAAAAAAAAAAAAAAAACAAGAAAAAG
    CAAAAAAAAAAAAAAAAAAACAAGATAAAG
    CAAAAAAAAAAAGAAAAAAACAAGATAAAG
    CAAAAAAAAAAAGAACAAAACAAGATAAAG
    GAAAAAAAAAAAGAACAAAACAAGATAAAG
    GAAAAAAAAAAAGAACAAAAGAAGATAAAG
    GAAAAAAAAAAAGAACAAAAGCAGATAAAG
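The two claims made about rand() and int() can be checked with a short experiment. This is a minimal sketch under the assumption of a 10-base test string; the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;

my $dna   = 'ACGTACGTAC';          # length 10, so valid positions are 0..9
my @bases = ('A', 'C', 'G', 'T');

# Draw many positions and record which ones turn up.
my %seen;
$seen{ int(rand(length $dna)) }++ for 1 .. 1000;
my @positions = sort { $a <=> $b } keys %seen;
print "Positions drawn: @positions\n";

# A fractional subscript is truncated, so both of these name element 2.
print "bases[2.9] is $bases[2.9], bases[2] is $bases[2]\n";
```

No position outside 0 to 9 can appear in the first line of output, and the second line prints G twice.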
Example 2-3 : Translating DNA into proteins … using modules
• First transcribe DNA to RNA.
• Translate RNA to amino acids.
  – Four bases: A, U, C, G
  – A codon is defined by a sequence of three bases.
  – There are 64 possible combinations, 4^3.
  – There are only 20 amino acids and a stop.
  – The code is redundant: more than one codon represents most amino acids.
  – Refer to Table 1 on page ??
• Use the subroutine defined in BeginPerlBioinfo.pm.
  – Specify the module filename in the Perl code.
  – If the module is not installed in a known library path, you need "use lib ‘pathname’" to specify where to find it.
• The subroutine codon2aa() returns a single-character amino acid from the 3-character codon input.
• We need to write a loop that grabs 3 characters at a time while stepping through the sequence.
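The figure of 64 codons can be confirmed by enumerating every three-base combination. A minimal sketch:

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'U');
my @codons;

# Three nested loops give 4 * 4 * 4 = 64 combinations.
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, "$first$second$third";
        }
    }
}

print scalar(@codons), " codons, from $codons[0] to $codons[-1]\n";
# prints: 64 codons, from AAA to UUU
```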
Example 2-3

This subroutine takes, as an argument, a three-character DNA sequence and returns the single-character representation of the amino acid. The data type used is a hash lookup. The condition ‘if (exists $genetic_code{$codon})’ searches for a match between the 3 characters of the codon and the list of keys in the hash. The value associated with the key, if found, is returned. Otherwise an error is reported and the program terminates. This subroutine is included in the module BeginPerlBioinfo.pm, which will be used, with other subroutines, throughout the rest of the workshop.

    #
    # codon2aa
    #
    # A subroutine to translate a DNA 3-character
    # codon to an amino acid
    # Using hash lookup

    sub codon2aa {
        my($codon) = @_;

        $codon = uc $codon;

        my(%genetic_code) = (
        'TCA' => 'S', # Serine
        'TCC' => 'S', # Serine
        'TCG' => 'S', # Serine
        'TCT' => 'S', # Serine
        'TTC' => 'F', # Phenylalanine
        'TTT' => 'F', # Phenylalanine
        'TTA' => 'L', # Leucine
        'TTG' => 'L', # Leucine
        'TAC' => 'Y', # Tyrosine
        'TAT' => 'Y', # Tyrosine
        'TAA' => '_', # Stop
        'TAG' => '_', # Stop
        'TGC' => 'C', # Cysteine
        'TGT' => 'C', # Cysteine
        'TGA' => '_', # Stop
        'TGG' => 'W', # Tryptophan
        'CTA' => 'L', # Leucine
        'CTC' => 'L', # Leucine
        'CTG' => 'L', # Leucine
        'CTT' => 'L', # Leucine
        'CCA' => 'P', # Proline
        'CCC' => 'P', # Proline
        'CCG' => 'P', # Proline
        'CCT' => 'P', # Proline
        'CAC' => 'H', # Histidine
        'CAT' => 'H', # Histidine
        'CAA' => 'Q', # Glutamine
        'CAG' => 'Q', # Glutamine
        'CGA' => 'R', # Arginine
        'CGC' => 'R', # Arginine
        'CGG' => 'R', # Arginine
        'CGT' => 'R', # Arginine
        'ATA' => 'I', # Isoleucine
        'ATC' => 'I', # Isoleucine
        'ATT' => 'I', # Isoleucine
        'ATG' => 'M', # Methionine
        'ACA' => 'T', # Threonine
        'ACC' => 'T', # Threonine
        'ACG' => 'T', # Threonine
        'ACT' => 'T', # Threonine
        'AAC' => 'N', # Asparagine
        'AAT' => 'N', # Asparagine
        'AAA' => 'K', # Lysine
        'AAG' => 'K', # Lysine
        'AGC' => 'S', # Serine
        'AGT' => 'S', # Serine
        'AGA' => 'R', # Arginine
        'AGG' => 'R', # Arginine
        'GTA' => 'V', # Valine
        'GTC' => 'V', # Valine
        'GTG' => 'V', # Valine
        'GTT' => 'V', # Valine
        'GCA' => 'A', # Alanine
        'GCC' => 'A', # Alanine
        'GCG' => 'A', # Alanine
        'GCT' => 'A', # Alanine
        'GAC' => 'D', # Aspartic Acid
        'GAT' => 'D', # Aspartic Acid
        'GAA' => 'E', # Glutamic Acid
        'GAG' => 'E', # Glutamic Acid
        'GGA' => 'G', # Glycine
        'GGC' => 'G', # Glycine
        'GGG' => 'G', # Glycine
        'GGT' => 'G', # Glycine
        );

        if(exists $genetic_code{$codon}) {
            return $genetic_code{$codon};
        }
        else{
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }
Example 2-3

This is the Perl code which, with only a few lines, translates DNA into a protein sequence. The command ‘use lib …’ instructs Perl to add the given directory to the search path for necessary libraries, like BeginPerlBioinfo.pm. BeginPerlBioinfo.pm is a part of the book Beginning Perl for Bioinformatics, by James Tisdall.

The ‘for’ loop steps through the DNA string three characters at a time, starting at index 0:

    Index : 0  3  6  9 …
            CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

The 3-character substring is assigned to the $codon variable by the Perl function ‘substr’. Then the amino acid returned by the subroutine codon2aa() is appended to the end of the current protein string, $protein.

    #!/usr/bin/perl -w
    # Example 2-3 : Translate DNA into protein

    use lib '../ModLib/';
    use strict;
    use warnings;
    use BeginPerlBioinfo; # This does not require the '.pm' in the 'use' command

    # Initialize variables
    my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
    my $protein = '';
    my $codon;

    # Translate each three-base codon into an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $codon = substr($dna,$i,3);
        $protein .= codon2aa($codon);
    }

    print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";

    exit;

Results from running example2-3.pl:

    I translated the DNA

    CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

     into the protein

    RRLRTGLARVGR
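The loop condition $i < (length($dna) - 2) means that one or two leftover bases at the end of the sequence are never passed to codon2aa(). The same splitting can be sketched with a global pattern match; this is an alternative technique, not the method used in the example, and the subroutine name split_codons is made up for this illustration.

```perl
use strict;
use warnings;

# Split DNA into complete three-base codons; leftover bases are dropped.
# Call this in list context: a global match then returns every capture.
sub split_codons {
    my ($dna) = @_;
    return $dna =~ /(...)/g;
}

my @codons = split_codons('CGACGTCTTCGTAC');    # 14 bases: 4 codons + 2 left over
print "@codons\n";                              # CGA CGT CTT CGT
```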
Section 2 : BioPerl and CPAN

Example 2-4 : Installing and testing bioperl
• http://bioperl.org
• The Bioperl Project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research.
• The Bioperl server provides an online resource for modules, scripts, and web links for developers of Perl-based software for life science research.
• Bioperl modules and documentation are very extensive.
• There are good examples to illustrate their use.
• We will discuss the installation of bioperl.
• We will also take a quick look at some test scripts.
• In Chapter 9 of Mastering Perl for Bioinformatics, James Tisdall gives a personal account of installing bioperl.
  – It depends on installing with the CPAN shell.
  – Linux installations vary from site to site, so it is advised that someone with administrator privileges install bioperl.
Example 2-4 : Installing and testing bioperl
• My own experiences were slightly different.
  – Download the core bioperl install file, version 1.4 (the most recent).
  – Follow the make instructions included in the INSTALL documentation.
  – Carefully follow the ‘make test’ instruction.
    • Make sure you have an internet connection.
  – Note where the test script fails.
    • You will see module names like LWP, IO::String, etc.
  – I noticed that LWP and IO::String were involved in quite a few failures.
• Here is where I installed the missing modules using the CPAN shell.
  – >> perl -MCPAN -e shell
  – At the CPAN prompt, install the missing module.
    • cpan > install LWP
  – After exiting the CPAN shell, run ‘make test’ again to see whether the number of failures has gone down.
• After concluding that the remaining failures won’t impede using bioperl, run ‘make install’.
• This usually puts the modules in /usr/lib/perl5/5.x.x/site_perl on Linux systems.
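Before or after running ‘make test’, it can help to ask Perl directly which modules it can load. The sketch below is one way to do that, assuming a hypothetical helper named module_available; the list of modules checked simply repeats the names mentioned above.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;

    # require with a variable needs a file path: Bio::Perl -> Bio/Perl.pm
    (my $file = "$module.pm") =~ s{::}{/}g;

    # eval traps the error that require raises for a missing module
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module ('Bio::Perl', 'LWP', 'IO::String') {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'found' : 'MISSING';
}
```

Any module reported as MISSING is a candidate for installation from the CPAN prompt, as described above.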
Section 2 : BioPerl and CPAN

These simple tests check whether bioperl is installed correctly.

Example bptest0.pl

Test 'bptest0.pl' simply checks if Perl can find Bio::Perl. If it doesn't complain, we are one step closer.

#!/usr/bin/perl -w
use Bio::Perl;
exit;

Example bptest1.pl

The file 'bptest1.pl' needs internet access. The Perl program retrieves a swissprot sequence and prints it to a file, 'roa1.fasta', in FASTA format.

#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
write_sequence("> roa1.fasta", 'fasta', $seq_object);
exit;

Example bptest2.pl

The last Perl script uses NCBI to BLAST a sequence and saves the results to a file. This should be used judiciously as we don't want to abuse the computing cycles of NCBI. These requests should be done for individual searches. Download the blast package locally to do large numbers of BLAST searches.

#!/usr/bin/perl -w
# Example to Test the Bioperl installation
use Bio::Perl;
# Must use this script with an internet connection
$seq_object = get_sequence('swissprot',"ROA1_HUMAN");
$blast_result = blast_sequence($seq_object);
write_blast(">roa1.blast", $blast_result);
exit;
Section 2 : Mutations, Randomization and Bioperl
Exercises for Section 2
1. Write a subroutine to concatenate two strings of DNA.
2. Write a subroutine to report the percentage of each nucleotide in DNA. Count the number of each nucleotide, divide by the total length of the DNA, then multiply by 100 to get the percentage. Your arguments should be the DNA and the nucleotide you want to report on. The int function can be used to discard digits after the decimal point, if needed.
3. Write a module that contains subroutines that report various statistics on DNA sequences, for instance length, GC content, presence or absence of poly-T sequences (long stretches of mostly T's at the 5' (left) end of many DNA sequences), or other measures of interest.
4. Write a program that asks you to pick an amino acid and then keeps (randomly) guessing which amino acid you picked.
5. Write a program to mutate protein sequence, similar to the code in Example 2-2 that mutates DNA.
6. Write a program that uses Bioperl to perform a BLAST search at the NCBI web site, then use Bioperl to parse the BLAST output.
Section 3 : Fasta Files and Frames
• Many different formats for saving sequence data and annotations in files
• Perhaps as many as 20 such formats for DNA
• Some of the most popular
  – FASTA and BLAST (Basic Local Alignment Search Tool), both using the FASTA format
  – Genetic Sequence Data Bank (GenBank)
  – European Molecular Biology Laboratory (EMBL)
• In this section we will focus on reading FASTA format
• Sample of FASTA format:

> sample dna | (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
Section 3 : Fasta Files and Frames
Example 3-1: Reading FASTA format and extracting sequence data
• Write three subroutines and rely on regular expressions
• First subroutine will get data from a file
  – Read the filename from the command-line argument (argument = filename)
  – Open the file
    • If it can't be opened, print an error message and exit
  – Read in the data
  – Return an array which contains each line of the file, @data
• Second subroutine extracts sequence data from the fasta file
  – Read in the array of file data in fasta format
  – Discard all header, blank and comment lines
  – If the first character of the first line is >, discard that line
  – Read in the rest of the file, joined in a scalar
  – Edit out non-sequence data (white space)
  – Return the sequence
• Third subroutine writes the sequence data
  – More often than not, the sequence to print is longer than most page widths
  – Need to specify a length parameter to control the output
Section 3 : Fasta Files and Frames
Example 3-1

get_file_data() takes a string argument, the filename. The unless condition attempts to open the file. If unsuccessful, it prints an error statement and exits the program.

If the file can be opened, the subroutine saves each line of the file, one by one, into the array @filedata. It returns the array to the main routine, after closing the filehandle, of course.

# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
    my($filename) = @_;
    use strict;
    use warnings;

    # Initialize variables
    my @filedata = ( );

    unless ( open (GET_FILE_DATA, $filename) ) {
        print STDERR "Cannot open file \"$filename\"\n\n";
        exit;
    }
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
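The same contract (filename in, list of lines out) can also be written with a lexical filehandle and the three-argument form of open. The sketch below is an alternative, not the version in BeginPerlBioinfo.pm, and the name get_file_data_lexical is hypothetical. It also calls die instead of exit, which is a design change: the caller can then trap the failure with eval.

```perl
use strict;
use warnings;

# Hypothetical variant of get_file_data(): lexical filehandle,
# three-argument open, and die on failure instead of exit.
sub get_file_data_lexical {
    my ($filename) = @_;

    open(my $fh, '<', $filename)
        or die "Cannot open file \"$filename\": $!\n";
    my @filedata = <$fh>;      # list context: one element per line
    close $fh;

    return @filedata;
}
```

A call such as `my @lines = get_file_data_lexical('sample.dna');` returns the lines with their trailing newlines intact, just as get_file_data() does.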
Section 3 : Fasta Files and Frames
Example 3-1

extract_sequence_from_fasta_data() takes the array that holds the contents of the fasta file. The foreach loop takes each element of the array, a complete line of the file, and assigns it to the variable $line. The different conditions help us ignore the blank, comment and header lines:
• /^\s*$/ matches lines that contain only white space from beginning to end
• /^\s*#/ matches lines whose first non-blank character is the pound character, a comment line
• /^>/ matches lines which have the 'greater-than' symbol at the beginning of the line, the fasta header line
• all other lines are concatenated together into the $sequence variable

When all is done, all white space characters are removed:

$sequence =~ s/\s//g;

The sequence is returned to the calling routine.

# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    use strict;
    use warnings;

    # Declare and initialize variables
    my $sequence = '';

    foreach my $line (@fasta_file_data) {
        # discard blank line
        if ($line =~ /^\s*$/) {
            next;
        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;
        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;
        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }

    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
Section 3 : Fasta file format
Example 3-1

Finally, the print_sequence() routine takes the cleaned string and an integer specifying the number of characters to print per line. Again notice that the variables are assigned from the special array, @_. The printing is done by the for loop and the substr function: each print command writes one substring of the complete string on a new line.

# print_sequence
#
# A subroutine to format and print sequence data
sub print_sequence {
    my($sequence, $length) = @_;
    use strict;
    use warnings;

    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}

The subroutines needed for our program have been installed in the BeginPerlBioinfo.pm module, so the program itself can be written succinctly, as in the code below. The final command prints the sequence, passing the character string and the line length to the print_sequence subroutine.

Example 3-1

#!/usr/bin/perl
# Read a fasta file and extract the sequence data
use lib '../ModLib/'; # Must point to where BeginPerlBioinfo.pm resides
use strict;
use warnings;
use BeginPerlBioinfo;

# Declare and initialize variables
my @file_data = ( );
my $dna = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Print the sequence in lines 25 characters long
print_sequence($dna, 25);
exit;

Output from example3-1

agatggcggcgctgaggggtcttgg
gggctctaggccggccacctactgg
tttgcagcggagacgacgcatgggg
cctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagt
agttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgc
aggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcgga
…
gaagttcgggggccccaacaagatc
cggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgta
caagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgc
caaggccccgccggccactgcccac
ccaacagcagccacagccatcacag
aagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagt
caaggagcctcctgaggctacagcc
acacctgagccactctcagatgagg
accta
Section 3 : Fasta Files and Frames
Example 3-2: Translate a DNA sequence in all six reading frames
• Given a sequence of DNA, it is necessary to examine all six reading frames of the DNA to find the coding regions the cell uses to make proteins
• Genes very often occur in pieces that are spliced together during the transcription/translation process
• Since the codons are three bases long, the translation happens in three "frames," starting at the first base, or the second, or perhaps the third.
• Each starting place gives a different series of codons, and, as a result, a different series of amino acids.
• The reverse complement strand has three frames of its own, which gives six frames in all
• Examine all six reading frames of a DNA sequence and look at the resulting protein translations
• Stop codons are definite breaks in the DNA => protein translation process
• If a stop codon is reached, the translation stops
• We need some code to compute the reverse complement of the DNA
• Need to break both strings into their respective frames
• Translate each frame of DNA to protein
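To make the idea of a frame concrete, the sketch below lists the codons read in each of the three forward frames of a short string. The helper name codons_in_frame is hypothetical and is not part of BeginPerlBioinfo.pm. The test string is the first 17 bases of the sample sequence used in this section. Frames 4 to 6 would be obtained the same way from the reverse complement.

```perl
use strict;
use warnings;

# Hypothetical helper: list the complete codons of $dna read in forward
# frame 1, 2 or 3. Trailing bases that do not fill a codon are dropped.
sub codons_in_frame {
    my ($dna, $frame) = @_;
    my @codons;
    for (my $i = $frame - 1; $i + 3 <= length($dna); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

my $dna = 'agatggcggcgctgagg';    # first 17 bases of the sample sequence
foreach my $frame (1 .. 3) {
    print "Frame $frame: ", join(' ', codons_in_frame($dna, $frame)), "\n";
}
# Frame 1: aga tgg cgg cgc tga
# Frame 2: gat ggc ggc gct gag
# Frame 3: atg gcg gcg ctg agg
```

Note that frame 1 ends in the stop codon tga while frame 3 begins with atg, which illustrates how shifting the start by one or two bases changes the translation completely.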
Section 3 : Fasta Files and Frames
Example 3-2

We are going to reuse our old code from Section 1, revcom(). We have to rewrite it as a subroutine.

Now we need to design the subroutine which will break the DNA strings into our frames and translate each one into protein. Our old Perl function substr() should do the trick for taking apart our frames. The unless($end) condition checks for a value in the variable $end; if there is none, it sets the end value to the length of the sequence. Positions are counted from 1 while substr() counts from 0, but the length of the desired sequence doesn't change with the change in indices, since:

(end - 1) - (start - 1) + 1 = end - start + 1

To translate to peptides we revisit our codon2aa() subroutine from Section 2. It is called from a subroutine dna2peptide(), which is already in BeginPerlBioinfo.pm.

# revcom
#
# A subroutine to compute the reverse complement of DNA sequence
sub revcom {
    my($dna) = @_;
    # First reverse the sequence
    # (scalar assignment, so reverse() reverses the string itself)
    my $revcom = reverse($dna);
    # Next, complement the sequence, dealing with upper and lower case
    # A->T, T->A, C->G, G->C
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
    my($seq, $start, $end) = @_;
    my $protein;
    # To make the subroutine easier to use, you won't need to specify
    # the end point--it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }
    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end - $start + 1) );
}
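The index arithmetic and the reverse complement can be checked on short strings without the translation step. In the sketch below, subseq_1based and revcom_sketch are hypothetical names used only for this illustration. The first isolates the substr() call made by translate_frame(), and the second mirrors revcom() with the reversal done in scalar context.

```perl
use strict;
use warnings;

# Hypothetical helper isolating the index arithmetic of translate_frame():
# $start and $end are 1-based and inclusive, substr() offsets are 0-based.
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    $end = length($seq) unless $end;     # default: run to the end
    return substr($seq, $start - 1, $end - $start + 1);
}

# Reverse complement; the scalar assignment makes reverse() act on the string
sub revcom_sketch {
    my ($dna) = @_;
    my $revcom = reverse $dna;
    $revcom =~ tr/ACGTacgt/TGCAtgca/;
    return $revcom;
}

print subseq_1based('ACGTACGT', 2, 4), "\n";   # bases 2..4: CGT
print subseq_1based('ACGTACGT', 7), "\n";      # base 7 to the end: GT
print revcom_sketch('AAGCt'), "\n";            # aGCTT
```

The scalar context matters: written as a list assignment, `my ($revcom) = reverse($dna);`, the call would reverse a one-element list and hand back the string unchanged.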
Section 3 : Fasta Files and Frames
Example 3-2

Now that we have done all that work, and it appears that our subroutines provide the function we need, these routines are supplied in BeginPerlBioinfo.pm. So, the Perl program is a short exercise and is very modular.

#!/usr/bin/perl
# Translate a DNA sequence in all six reading frames
use lib '../ModLib';
use strict;
use warnings;
use BeginPerlBioinfo;

# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';

# Read in the contents of the file "sample.dna"
@file_data = get_file_data("sample.dna");

# Extract the sequence data from the contents of the file "sample.dna"
$dna = extract_sequence_from_fasta_data(@file_data);

# Translate the DNA to protein in six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = translate_frame($dna, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 2--------\n\n";
$protein = translate_frame($dna, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 3--------\n\n";
$protein = translate_frame($dna, 3);
print_sequence($protein, 70);

# Calculate reverse complement
$revcom = revcom($dna);
print "\n -------Reading Frame 4--------\n\n";
$protein = translate_frame($revcom, 1);
print_sequence($protein, 70);
print "\n -------Reading Frame 5--------\n\n";
$protein = translate_frame($revcom, 2);
print_sequence($protein, 70);
print "\n -------Reading Frame 6--------\n\n";
$protein = translate_frame($revcom, 3);
print_sequence($protein, 70);
exit;

Output from example 3-2

-------Reading Frame 1--------

RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAWDAAE
WSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAPIYCICRKP
DINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKS
RERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSP
QPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKI
RQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRI
REDEGAVASSTVKEPPEATATPEPLSDEDL

…

-------Reading Frame 5--------

RSSSESGSGVAVASGGSLTVDDATAPSSSRMRPNFCDGCGCCWVGSGRRGLGRDS
EGVTGESEEGKYLYDSRARSWHWRSRHFCRILLGPPNFFMSRQKSQ_PQSSVRRH
ASHSPHMRADRLICCCCCW_CWLGVATKGCGEDLWGEAEPRASMAPTPVPDPARR
CRSGSGTGLLRPPPSSRGSLLSRSLPSRSRDFLCR_RISSLGSFSLHSRQYHSRM
ALAIFSVIRMQSPWNHSLQLSHPIMKQLMSGLRQMQ_MGAFSPFSDLLSSPASGG
SGSEPSPSISPLPAHSLTTPASDPRTCTDHSAASQAVAKAPTTTSASSHASQAAY
SYCAGPMRRLRCKPVGGRPRAPKTPQRRH

-------Reading Frame 6--------

GPHLRVAQVWL_PQEAP_LLMTPLPPHLHGCALTSVMAVAAVGWAVAGGALAGTL
RASLVRARKGSTCTIPGPAAGTGAAGTSAGSCWGPRTSSCPDRNHSDHSPQCADM
PHTHHTCGLTV_SAAAAAGDAGWVWPPRAAERICGAKQSPEQAWPQPLSLTLPGA
AGLDQGQASCALHPHPGAHCCPAHCHPAPVTSCADSESLAWGLSLCTPDSTTPGW
PWPSSQ_SGCSPHGTTHCSCHTRS_SS_CPVCGRCSRWAHSPHSRTCCPPRHLEA
LGLNHLPPYLRSRRTPSRPPPATREPAQTTRRRPRRLQRRPQLLPLLVTPPRQRT
PIAQAPCVVSAANQ_VAGLEPPRPLSAAI
Section 3 : FASTA file format
Exercises for Section 3
1. Add to the Perl program in Example 3-1 a translation from DNA to protein and print out the protein.
2. Write a subroutine that checks a string and returns true if it's a DNA sequence. Write another that checks for protein sequence data.
3. Write a program that can search by name for a gene in an unsorted array.
4. Write a subroutine that inserts an element into a sorted array. Hint: use the splice Perl function to insert the element.
5. Write a subroutine that checks an array of data and returns true if it's in FASTA format. Note that FASTA expects the standard IUB/IUPAC amino acid and nucleic acid codes, plus the dash (-) that represents a gap of unknown length. Also, the asterisk (*) represents a stop codon for amino acids. Be careful using an asterisk in regular expressions; use a \* to escape it to match an actual asterisk.
Section 4 : GenBank (Genetic Sequence Data Bank) Files
• International repository of known genetic sequences from a variety of organisms
• GenBank is a flat file, an ASCII text file, that is easily readable
• GenBank is referred to as a databank or data store rather than a database
  – Databases have a relational structure
  – They include associated indices
  – They provide links and a query language
• Perl modules and constructs are ideal for processing flat files
• For additional bioinformatics software, refer to these web sites
  – National Center for Biotechnology Information (NCBI)
  – National Institutes of Health (NIH), http://proxy.lib.ohio-state.edu:2224
  – European Bioinformatics Institute (EBI), http://www.ebi.ac.uk
  – European Molecular Biology Laboratory (EMBL), http://www.embl-heidelberg.de/
• Let's take a look at a short GenBank file
Section 4 : GenBank Files
Example of a short GenBank file:

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
(cont'd on next page)
Section 4 : GenBank Files
Example of a short GenBank file (cont'd):

        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
…
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

For a view of the complete file and its format, look at 'record.gb' in Section 4 of the exercises.

A typical GenBank entry is packed with information. With Perl we will be able to separate the different parts. For instance, by extracting the sequence, we can search for motifs, calculate statistics on the sequence, or compare with other sequences. Also, by separating the various parts of the data annotation, we have access to ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation includes specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on. The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
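As a first illustration of separating the parts of a record, the sketch below splits a GenBank entry into its annotation (everything before the ORIGIN line) and its sequence (everything between ORIGIN and the closing //). It is only one possible approach under simple assumptions: the record is held in a single string, it contains exactly one ORIGIN section, and the name split_genbank_record is hypothetical. The test record is shortened to a few lines of the example entry.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GenBank record (a single string) into
# its annotation and its sequence. Dies if no ORIGIN ... // section is found.
sub split_genbank_record {
    my ($record) = @_;
    my ($annotation, $sequence) =
        $record =~ m{\A(.*?)^ORIGIN[^\n]*\n(.*?)^//}ms
        or die "Record has no ORIGIN ... // section\n";
    $sequence =~ s/[\s\d]//g;     # drop position numbers and whitespace
    return ($annotation, $sequence);
}

# A shortened record built from lines of the example entry
my $record = <<'END_RECORD';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct
//
END_RECORD

my ($annotation, $dna) = split_genbank_record($record);

# Individual annotation fields can then be picked out line by line
my ($accession) = $annotation =~ /^ACCESSION\s+(\S+)/m;
print "Accession: $accession\n";
print "Sequence:  $dna\n";
```

Once the sequence is a plain string of bases, the subroutines from Section 3, such as print_sequence() and translate_frame(), can be applied to it directly.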