Here is a simple
pdb
to fasta
format conversion tool. The purpose was to extract residue sequence from
atomic coordinates (ATOM
lines).
Note that the sequence extracted this way corresponds to residues which are really present in the 3D structure. This could be different from the sequence obtained from the SEQRES field -- protein sequence without missing residues due to experiment -- or from the DBREF field -- complete sequence of the protein, provided through an UniProt identifier (see final remark PS 2 for an example).
This could have been done very quickly with a Python script but the goal was also to practise Unix tools such as awk, sed, tr, fold... (grep was not used here).
In the PDB format, the region of interest for this project are lines
starting with the ATOM
keyword. Below is an excerpt from the
1BTA structure
of Barstar.
[...]
ATOM 48 CA ALA A 3 -1.086 3.403 6.848 1.00 0.30 C
ATOM 49 C ALA A 3 0.316 4.013 6.751 1.00 0.30 C
ATOM 50 O ALA A 3 0.469 5.207 6.586 1.00 0.54 O
ATOM 51 CB ALA A 3 -1.491 2.834 5.486 1.00 0.34 C
ATOM 52 H ALA A 3 -1.991 5.353 6.827 1.00 0.25 H
ATOM 53 HA ALA A 3 -1.088 2.618 7.589 1.00 0.34 H
ATOM 54 HB1 ALA A 3 -2.257 3.457 5.050 1.00 1.08 H
ATOM 55 HB2 ALA A 3 -0.630 2.813 4.834 1.00 1.05 H
ATOM 56 HB3 ALA A 3 -1.871 1.831 5.612 1.00 1.07 H
ATOM 57 N VAL A 4 1.342 3.214 6.860 1.00 0.30 N
ATOM 58 CA VAL A 4 2.725 3.769 6.780 1.00 0.27 C
[...]
The 3rd, 4th and 5th columns correspond to the atom name, residue name
(3-letters code) and chain name, respectively. For a given protein
chain, we want the name of all residues in the structure. They all have
a carbon alpha
atom
labelled CA
(provided no atom is missing). Therefore, we have to
select lines that start with the keyword ATOM
, with column 3 equals to
CA
and column 5 equals to the targeted chain (here chain A
).
$ cat 1BTA.pdb | awk '/^ATOM/ && $3 == "CA" && $5 == "A" {print $4}'
We then get the sequence with one residue per line
LYS
LYS
ALA
VAL
ILE
ASN
[...]
To remove the newline character after each residue, there are several options:
tr '\n' ' '
is easy to understand -- replace the newline ('\n') character by a space (' ') characterpaste -s -d ' '
is more tricky- and
sed ':a;N;$!ba;s/\n/ /g'
is black magic to me (for an explanation, read this nice thread on StackOverflow)
And from this
$ cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' '
we get that
LYS LYS ALA VAL ILE ASN GLY GLU GLN ILE ARG SER ILE...
Next we want to convert the 3-letter code residue name into 1-letter
code. sed
is useful here
$ sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g'
This is a pretty long command line since we have to convert the 20 amino acids.
Eventually, we have to remove space between residue names with
sed 's/ //g'
and fold the sequence into 60-character lines with
fold -w 60
. The complete command line is
$ cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' ' | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g' | sed 's/ //g' | fold -w 60
that outputs
KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS
A more exhaustive script that scans all possible chains is available in the pdb2fasta repository from Bitbucket. You can get it with
$ git clone https://bitbucket.org/pierrepo/pdb2fasta.git
or directly through the source code
browser.
The repository also contains two PDB files in the test
directory.
$ ./pdb2fasta.sh test/1BTA.pdb
>test/1BTA | chain A | 89 aa
KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS
and
$ ./pdb2fasta.sh test/3FCS.pdb
>test/3FCS | chain A | 934 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
LKVDWDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHA
AWBWFNVSSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain B | 683 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPESIEFPVSE
[...]
CTFKKECVECKKFDRGALHDENTCNRYCRDEIESVKELKDTGKDAVNCTYKNEDDCVVRF
QYYEDSSGKSILYVVEEPECCKG
>test/3FCS | chain C | 927 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
KDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHAWFNV
SSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain D | 612 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPAEBESIEFP
[...]
QCSCGDCLCDSDWTGYYCNCTTRTDTCMSSNGLLCSGRGKCECGSCVCIQPGSYGDTCEK
CPTCPDACTFKK
Have fun.
PS 1: beware that, in a PDB file, extracting data from the ATOM
lines
using column numbers (the awk
command line above) is not recommended.
Extracting a range of characters directly from the ATOM
line is a much
more robust alternative. Although such manipulation is trivial with
Python, I had no idea how to do this in Bash :-)
PS 2: to illustrate differences between the sequence pointed in DBREF, contained in SEQRES and contained in the real structure, the structure of chain B of PDB 3FCS has 683 residues resolved experimentally, the sequence from SEQRES has 690 residues and the corresponding UniProt sequence (from DBREF) counts 788 residues.