Here is a simple pdb to fasta format conversion tool. The purpose was to extract residue sequence from atomic coordinates (ATOM lines).
Note that the sequence extracted this way corresponds to residues which are really present in the 3D structure. This could be different from the sequence obtained from the SEQRES field — protein sequence without missing residues due to experiment — or from the DBREF field — complete sequence of the protein, provided through an UniProt identifier (see final remark PS 2 for an example).
This could have been done very quickly with a Python script but the goal was also to practise Unix tools such as awk, sed, tr, fold… (grep was not used here).
In the PDB format, the region of interest for this project are lines starting with the ATOM keyword. Below is an excerpt from the 1BTA structure of Barstar.
[...] ATOM 48 CA ALA A 3 -1.086 3.403 6.848 1.00 0.30 C ATOM 49 C ALA A 3 0.316 4.013 6.751 1.00 0.30 C ATOM 50 O ALA A 3 0.469 5.207 6.586 1.00 0.54 O ATOM 51 CB ALA A 3 -1.491 2.834 5.486 1.00 0.34 C ATOM 52 H ALA A 3 -1.991 5.353 6.827 1.00 0.25 H ATOM 53 HA ALA A 3 -1.088 2.618 7.589 1.00 0.34 H ATOM 54 HB1 ALA A 3 -2.257 3.457 5.050 1.00 1.08 H ATOM 55 HB2 ALA A 3 -0.630 2.813 4.834 1.00 1.05 H ATOM 56 HB3 ALA A 3 -1.871 1.831 5.612 1.00 1.07 H ATOM 57 N VAL A 4 1.342 3.214 6.860 1.00 0.30 N ATOM 58 CA VAL A 4 2.725 3.769 6.780 1.00 0.27 C [...]
The 3rd, 4th and 5th columns correspond to the atom name, residue name (3-letters code) and chain name, respectively. For a given protein chain, we want the name of all residues in the structure. They all have a carbon alpha atom labelled CA (provided no atom is missing). Therefore, we have to select lines that start with the keyword ATOM, with column 3 equals to CA and column 5 equals to the targeted chain (here chain A).
cat 1BTA.pdb | awk '/^ATOM/ && $3 == "CA" && $5 == "A" {print $4}'
We then get the sequence with one residue per line
LYS LYS ALA VAL ILE ASN [...]
To remove the newline character after each residue, there are several options:
tr '\n' ' 'is easy to understand — replace the newline (‘\n’) character by a space (‘ ‘) characterpaste -s -d ' 'is more tricky- and
sed ':a;N;$!ba;s/\n/ /g'is black magic to me (for an explanation, read this nice thread on StackOverflow)
And from this
cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' '
we get that
LYS LYS ALA VAL ILE ASN GLY GLU GLN ILE ARG SER ILE...
Next we want to convert the 3-letter code residue name into 1-letter code. sed is useful here
sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g'
This is a pretty long command line since we have to convert the 20 amino acids.
Eventually, we have to remove space between residue names with sed 's/ //g' and fold the sequence into 60-character lines with fold -w 60. The complete command line is
cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' ' | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g' | sed 's/ //g' | fold -w 60
that outputs
KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK QLTENGAESVLQVFREAKAEGCDITIILS
A more exhaustive script that scans all possible chains is available in the pdb2fasta repository from Bitbucket. You can get it with
git clone https://bitbucket.org/pierrepo/pdb2fasta.git
or directly through the source code browser. The repository also contains two PDB files in the test directory.
poulain@badiane> ./pdb2fasta.sh test/1BTA.pdb >test/1BTA | chain A | 89 aa KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK QLTENGAESVLQVFREAKAEGCDITIILS
and
poulain@badiane> ./pdb2fasta.sh test/3FCS.pdb >test/3FCS | chain A | 934 aa LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA [...] LKVDWDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHA AWBWFNVSSLPYAVPPLSLPRGEAQVWTQLLRAC >test/3FCS | chain B | 683 aa GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPESIEFPVSE [...] CTFKKECVECKKFDRGALHDENTCNRYCRDEIESVKELKDTGKDAVNCTYKNEDDCVVRF QYYEDSSGKSILYVVEEPECCKG >test/3FCS | chain C | 927 aa LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA [...] KDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHAWFNV SSLPYAVPPLSLPRGEAQVWTQLLRAC >test/3FCS | chain D | 612 aa GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPAEBESIEFP [...] QCSCGDCLCDSDWTGYYCNCTTRTDTCMSSNGLLCSGRGKCECGSCVCIQPGSYGDTCEK CPTCPDACTFKK
Have fun.
PS 1: beware that, in a PDB file, extracting data from the ATOM lines using column numbers (the awk command line above) is not recommended. Extracting a range of characters directly from the ATOM line is a much more robust alternative. Although such manipulation is trivial with Python, I had no idea how to do this in Bash :-)
PS 2: to illustrate differences between the sequence pointed in DBREF, contained in SEQRES and contained in the real structure, the structure of chain B of PDB 3FCS has 683 residues resolved experimentally, the sequence from SEQRES has 690 residues and the corresponding UniProt sequence (from DBREF) counts 788 residues.
Nice article! Because I had to do a lot of those conversion, I wrote a web app now. If you are interested: http://bluewoodtree.zxq.net/pdb_to_fasta.html
i am doing virtual screening for 80000 peptide sequence.so can u give me the script in any language that convert fasta format into pdb format.
thank you
Hi. There is no simple script that can convert a 1D sequence in fasta format into a 3D structure in pdb format.
Please read further this thread on BioStar.
Hello,
If I may add my 2 cents in the game…
I suggest using cut as a faster tool to isolate dedicated columns from the greped ATOM lines.
grep ‘^ATOM’ 1BTA.pdb | cut -c14-22 | awk ‘$1 == “CA” && $3 == “A” {print $2}’ | tr ‘\n’YS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g’ | sed ‘s/ //g’ | fold -w 60
Cheers!
oops something wrong copy/paste in the editor…
Even faster, yo ucan remove the translitteration of newlines before sed since this newline is already a good delimiter for sed. Just translitterate after sed to remove newlines before folding output.
finally I suggest:
grep ‘^ATOM’ 1BTA.pdb | cut -c14-22 | awk ‘$1 == “CA” && $3 == “A” {print $2}’ | sed ‘s/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g’ | tr -d ‘\n’ | fold -w 60
Great script!
About the first post scriptum, getting a range of character in a string can be done with the substr function.
This way, line 40 of the script for example become:
awk -v ch=$1 ‘/^ATOM/ && substr($0, 13, 4) == ” CA ” && substr($0, 22, 1) == ch {print substr($0, 18, 3)}’