pdb to fasta conversion with Bash

Here is a simple pdb to fasta format conversion tool. The purpose was to extract residue sequence from atomic coordinates (ATOM lines).

Note that the sequence extracted this way corresponds to residues which are really present in the 3D structure. This could be different from the sequence obtained from the SEQRES field — protein sequence without missing residues due to experiment — or from the DBREF field — complete sequence of the protein, provided through an UniProt identifier (see final remark PS 2 for an example).

This could have been done very quickly with a Python script but the goal was also to practise Unix tools such as awk, sed, tr, fold… (grep was not used here).

In the PDB format, the region of interest for this project are lines starting with the ATOM keyword. Below is an excerpt from the 1BTA structure of Barstar.

[...]
ATOM     48  CA  ALA A   3      -1.086   3.403   6.848  1.00  0.30           C
ATOM     49  C   ALA A   3       0.316   4.013   6.751  1.00  0.30           C
ATOM     50  O   ALA A   3       0.469   5.207   6.586  1.00  0.54           O
ATOM     51  CB  ALA A   3      -1.491   2.834   5.486  1.00  0.34           C
ATOM     52  H   ALA A   3      -1.991   5.353   6.827  1.00  0.25           H
ATOM     53  HA  ALA A   3      -1.088   2.618   7.589  1.00  0.34           H
ATOM     54  HB1 ALA A   3      -2.257   3.457   5.050  1.00  1.08           H
ATOM     55  HB2 ALA A   3      -0.630   2.813   4.834  1.00  1.05           H
ATOM     56  HB3 ALA A   3      -1.871   1.831   5.612  1.00  1.07           H
ATOM     57  N   VAL A   4       1.342   3.214   6.860  1.00  0.30           N
ATOM     58  CA  VAL A   4       2.725   3.769   6.780  1.00  0.27           C
[...]

The 3rd, 4th and 5th columns correspond to the atom name, residue name (3-letters code) and chain name, respectively. For a given protein chain, we want the name of all residues in the structure. They all have a carbon alpha atom labelled CA (provided no atom is missing). Therefore, we have to select lines that start with the keyword ATOM, with column 3 equals to CA and column 5 equals to the targeted chain (here chain A).

cat 1BTA.pdb | awk '/^ATOM/ && $3 == "CA" && $5 == "A" {print $4}'

We then get the sequence with one residue per line

LYS
LYS
ALA
VAL
ILE
ASN
[...]

To remove the newline character after each residue, there are several options:

  • tr '\n' ' ' is easy to understand — replace the newline (‘\n’) character by a space (‘ ‘) character
  • paste -s -d ' ' is more tricky
  • and sed ':a;N;$!ba;s/\n/ /g' is black magic to me (for an explanation, read this nice thread on StackOverflow)

And from this

cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' '

we get that

LYS LYS ALA VAL ILE ASN GLY GLU GLN ILE ARG SER ILE...

Next we want to convert the 3-letter code residue name into 1-letter code. sed is useful here

 sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g'

This is a pretty long command line since we have to convert the 20 amino acids.

Eventually, we have to remove space between residue names with sed 's/ //g' and fold the sequence into 60-character lines with fold -w 60. The complete command line is

cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' ' | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g' | sed 's/ //g' | fold -w 60

that outputs

KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS

A more exhaustive script that scans all possible chains is available in the pdb2fasta repository from Bitbucket. You can get it with

git clone https://bitbucket.org/pierrepo/pdb2fasta.git

or directly through the source code browser. The repository also contains two PDB files in the test directory.

poulain@badiane> ./pdb2fasta.sh test/1BTA.pdb
>test/1BTA | chain A | 89 aa
KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS

and

poulain@badiane> ./pdb2fasta.sh test/3FCS.pdb
>test/3FCS | chain A | 934 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
LKVDWDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHA
AWBWFNVSSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain B | 683 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPESIEFPVSE
[...]
CTFKKECVECKKFDRGALHDENTCNRYCRDEIESVKELKDTGKDAVNCTYKNEDDCVVRF
QYYEDSSGKSILYVVEEPECCKG
>test/3FCS | chain C | 927 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
KDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHAWFNV
SSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain D | 612 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPAEBESIEFP
[...]
QCSCGDCLCDSDWTGYYCNCTTRTDTCMSSNGLLCSGRGKCECGSCVCIQPGSYGDTCEK
CPTCPDACTFKK

Have fun.

PS 1: beware that, in a PDB file, extracting data from the ATOM lines using column numbers (the awk command line above) is not recommended. Extracting a range of characters directly from the ATOM line is a much more robust alternative. Although such manipulation is trivial with Python, I had no idea how to do this in Bash  :-)

PS 2: to illustrate differences between the sequence pointed in DBREF, contained in SEQRES and contained in the real structure, the structure of chain B of PDB 3FCS has 683 residues resolved experimentally, the sequence from SEQRES has 690 residues and the corresponding UniProt sequence (from DBREF) counts 788 residues.

  • http://scientific-ocean.com Sebastian

    Nice article! Because I had to do a lot of those conversion, I wrote a web app now. If you are interested: http://bluewoodtree.zxq.net/pdb_to_fasta.html

  • alisha parveen

    i am doing virtual screening for 80000 peptide sequence.so can u give me the script in any language that convert fasta format into pdb format.
    thank you

    • PP

      Hi. There is no simple script that can convert a 1D sequence in fasta format into a 3D structure in pdb format.
      Please read further this thread on BioStar.

  • Philippe Label

    Hello,

    If I may add my 2 cents in the game…
    I suggest using cut as a faster tool to isolate dedicated columns from the greped ATOM lines.

    grep ‘^ATOM’ 1BTA.pdb | cut -c14-22 | awk ‘$1 == “CA” && $3 == “A” {print $2}’ | tr ‘\n’YS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g’ | sed ‘s/ //g’ | fold -w 60

    Cheers!

    • Philippe Label

      oops something wrong copy/paste in the editor…

      Even faster, yo ucan remove the translitteration of newlines before sed since this newline is already a good delimiter for sed. Just translitterate after sed to remove newlines before folding output.

      finally I suggest:

      grep ‘^ATOM’ 1BTA.pdb | cut -c14-22 | awk ‘$1 == “CA” && $3 == “A” {print $2}’ | sed ‘s/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g’ | tr -d ‘\n’ | fold -w 60

  • Jonathan Barnoud

    Great script!

    About the first post scriptum, getting a range of character in a string can be done with the substr function.
    This way, line 40 of the script for example become:
    awk -v ch=$1 ‘/^ATOM/ && substr($0, 13, 4) == ” CA ” && substr($0, 22, 1) == ch {print substr($0, 18, 3)}’