pdb to fasta conversion with Bash

Here is a simple pdb to fasta format conversion tool. The purpose was to extract residue sequence from atomic coordinates (ATOM lines).

Note that the sequence extracted this way corresponds to residues which are really present in the 3D structure. This could be different from the sequence obtained from the SEQRES field -- protein sequence without missing residues due to experiment -- or from the DBREF field -- complete sequence of the protein, provided through an UniProt identifier (see final remark PS 2 for an example).

This could have been done very quickly with a Python script but the goal was also to practise Unix tools such as awk, sed, tr, fold... (grep was not used here).

In the PDB format, the region of interest for this project are lines starting with the ATOM keyword. Below is an excerpt from the 1BTA structure of Barstar.

[...]
ATOM     48  CA  ALA A   3      -1.086   3.403   6.848  1.00  0.30           C
ATOM     49  C   ALA A   3       0.316   4.013   6.751  1.00  0.30           C
ATOM     50  O   ALA A   3       0.469   5.207   6.586  1.00  0.54           O
ATOM     51  CB  ALA A   3      -1.491   2.834   5.486  1.00  0.34           C
ATOM     52  H   ALA A   3      -1.991   5.353   6.827  1.00  0.25           H
ATOM     53  HA  ALA A   3      -1.088   2.618   7.589  1.00  0.34           H
ATOM     54  HB1 ALA A   3      -2.257   3.457   5.050  1.00  1.08           H
ATOM     55  HB2 ALA A   3      -0.630   2.813   4.834  1.00  1.05           H
ATOM     56  HB3 ALA A   3      -1.871   1.831   5.612  1.00  1.07           H
ATOM     57  N   VAL A   4       1.342   3.214   6.860  1.00  0.30           N
ATOM     58  CA  VAL A   4       2.725   3.769   6.780  1.00  0.27           C
[...]

The 3rd, 4th and 5th columns correspond to the atom name, residue name (3-letters code) and chain name, respectively. For a given protein chain, we want the name of all residues in the structure. They all have a carbon alpha atom labelled CA (provided no atom is missing). Therefore, we have to select lines that start with the keyword ATOM, with column 3 equals to CA and column 5 equals to the targeted chain (here chain A).

$ cat 1BTA.pdb | awk '/^ATOM/ && $3 == "CA" && $5 == "A" {print $4}'

We then get the sequence with one residue per line

LYS
LYS
ALA
VAL
ILE
ASN
[...]

To remove the newline character after each residue, there are several options:

tr '\n' ' ' is easy to understand -- replace the newline ('\n') character by a space (' ') character
paste -s -d ' ' is more tricky
and sed ':a;N;$!ba;s/\n/ /g' is black magic to me (for an explanation, read this nice thread on StackOverflow)

And from this

$ cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' '

we get that

LYS LYS ALA VAL ILE ASN GLY GLU GLN ILE ARG SER ILE...

Next we want to convert the 3-letter code residue name into 1-letter code. sed is useful here

$ sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g'

This is a pretty long command line since we have to convert the 20 amino acids.

Eventually, we have to remove space between residue names with sed 's/ //g' and fold the sequence into 60-character lines with fold -w 60. The complete command line is

$ cat 1BTA.pdb | awk '/ATOM/ && $3 == "CA" && $5 == "A" {print $4}' | tr '\n' ' ' | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g' | sed 's/ //g' | fold -w 60

that outputs

KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS

A more exhaustive script that scans all possible chains is available in the pdb2fasta repository from Bitbucket. You can get it with

$ git clone https://bitbucket.org/pierrepo/pdb2fasta.git

or directly through the source code browser. The repository also contains two PDB files in the test directory.

$ ./pdb2fasta.sh test/1BTA.pdb
>test/1BTA | chain A | 89 aa
KKAVINGEQIRSISDLHQTLKKELALPEYYGENLDALWDCLTGWVEYPLVLEWRQFEQSK
QLTENGAESVLQVFREAKAEGCDITIILS

and

$ ./pdb2fasta.sh test/3FCS.pdb
>test/3FCS | chain A | 934 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
LKVDWDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHA
AWBWFNVSSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain B | 683 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPESIEFPVSE
[...]
CTFKKECVECKKFDRGALHDENTCNRYCRDEIESVKELKDTGKDAVNCTYKNEDDCVVRF
QYYEDSSGKSILYVVEEPECCKG
>test/3FCS | chain C | 927 aa
LNLDPVQLTFYAGPNGSQFGFSLDFHKDSHGRVAIVVGAPRTLGPSQEETGGVFLCPWRA
[...]
KDPVLVSCDSAPCTVVQCDLQEMARGQRAMVTVLAFLWLPSLYQRPLDQFVLQSHAWFNV
SSLPYAVPPLSLPRGEAQVWTQLLRAC
>test/3FCS | chain D | 612 aa
GPNICTTRGVSSCQQCLAVSPMCAWCSDEALPLGSPRCDLKENLLKDNCAPAEBESIEFP
[...]
QCSCGDCLCDSDWTGYYCNCTTRTDTCMSSNGLLCSGRGKCECGSCVCIQPGSYGDTCEK
CPTCPDACTFKK

Have fun.

PS 1: beware that, in a PDB file, extracting data from the ATOM lines using column numbers (the awk command line above) is not recommended. Extracting a range of characters directly from the ATOM line is a much more robust alternative. Although such manipulation is trivial with Python, I had no idea how to do this in Bash :-)

PS 2: to illustrate differences between the sequence pointed in DBREF, contained in SEQRES and contained in the real structure, the structure of chain B of PDB 3FCS has 683 residues resolved experimentally, the sequence from SEQRES has 690 residues and the corresponding UniProt sequence (from DBREF) counts 788 residues.

Related Posts: