Parallel on-chip gene synthesis and application to optimization of protein expression

xyli83
Nov 17, 2017

Gene synthesis is the process of chemically synthesizing double-stranded DNA molecules in vitro. The main concept is to assemble custom oligos into long DNA molecules, which makes it an efficient and cost-effective alternative to molecular cloning for custom gene production. For more detail, see discovery biology services or contact us. Email: marketing@medicilon.com.cn Web: www.medicilon.com

A number of regulatory elements, such as promoters [4,5] and ribosomal binding sites [6,7], have been used to modulate protein expression. However, if the protein-coding DNA sequence itself is poorly translatable in a given host, modifying these elements may have a limited effect. This suggests that recoding the sequence with synonymous codons may be required. Moreover, it has not yet been possible to determine the full expression potential of a given protein in a given host or to use computer algorithms to reliably modify the coding sequence to achieve desired levels of protein expression [3,8]. Existing methods of optimizing codon usage for protein production in a heterologous host are slow, costly and unreliable. This problem has become a bottleneck for biomedical research and pharmaceutical development and, if not addressed, could hamper efforts to design and construct synthetic biological systems [9,10,11,12,13,14,15]. A key technical barrier to optimizing codon usage is the inability to synthesize genes at sufficiently low cost and high throughput. Such a capability would enable many gene and genome variants to be synthesized to explore the vast protein coding space [3].

High-throughput gene synthesis technology has been driven by recent advances in DNA microarrays that can produce pools of up to a million oligonucleotides for gene assembly [1,16,17,18,19], albeit in minute quantities (∼10⁵–10⁶ molecules per sequence). The presence of too many oligo sequences in a pool makes it difficult to effectively use the entire oligo pool for gene assembly, as similar sequences can cross-hybridize. Practical solutions include more efficient assembly strategies [19,20], selective amplification of oligos [20] or, as we do here, physical division of the oligo pool.

To effectively use all the oligos synthesized on a microarray, we divided the whole microarray into subarrays, each containing only the oligos that are needed to assemble a longer DNA molecule of about 0.5–1 kb in total length. Subarrays are physically isolated from the rest of the chip by being located in individual wells, eliminating the need for post-synthesis partitioning of the oligo pool. Oligos are synthesized on an embossed plastic microchip using a custom-made inkjet DNA microchip synthesizer [21]. The printing area in each subarray is patterned with 150-μm spots of silica thin film to reduce 'edge effects', which could lead to poor oligo synthesis [22]. Our design allows a standard 1′′ × 3′′ chip surface to be divided into as many as 30 subarrays, each containing 361 silica spots, with each spot used to synthesize a unique DNA oligonucleotide sequence. With the setup used in this study, 10,830 different 85-mer oligo sequences can be synthesized on a single chip, providing a capacity to produce up to ∼30 kb of assembled DNA.
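The capacity figures in this paragraph follow from simple multiplication. The sketch below reproduces them in Python using only the numbers quoted above; the function and constant names are illustrative, and it assumes one assembled fragment per subarray, as the text describes.

```python
# Back-of-the-envelope chip capacity, using only figures quoted in the text.
SUBARRAYS_PER_CHIP = 30          # wells on a standard 1" x 3" chip
SPOTS_PER_SUBARRAY = 361         # silica spots, one oligo sequence per spot
FRAGMENT_KB_RANGE = (0.5, 1.0)   # size of the DNA assembled in one subarray


def oligos_per_chip(subarrays=SUBARRAYS_PER_CHIP, spots=SPOTS_PER_SUBARRAY):
    """Number of distinct oligo sequences that fit on one chip."""
    return subarrays * spots


def assembled_kb_range(subarrays=SUBARRAYS_PER_CHIP,
                       fragment_kb_range=FRAGMENT_KB_RANGE):
    """Total assembled DNA per chip, assuming one fragment per subarray."""
    low, high = fragment_kb_range
    return subarrays * low, subarrays * high


print(oligos_per_chip())       # 10830
print(assembled_kb_range())    # (15.0, 30.0)
```

The upper bound of 30 kb corresponds to every subarray yielding a full 1-kb fragment.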

We next sought to achieve additional increases in throughput by integrating oligo synthesis with amplification and gene assembly on the same chip. In previous work, chemical methods, such as NH₄OH treatment, have been used to cleave oligos from the chip for subsequent off-chip gene assembly reactions [16]. Progress toward automating and miniaturizing the subsequent gene assembly reactions has been reported using microfluidics, resulting in reduced costs and reagent consumption [23]. Here we first use isothermal nicking and strand displacement amplification (nSDA) to amplify oligos from the microarray surface, followed by a polymerase cycling assembly (PCA) reaction in the same chamber. Briefly, 60-mer gene construction oligo sequences are synthesized with a 25-mer universal adaptor added at the 3′ end, which is anchored on the chip surface. This adaptor contains a nicking endonuclease recognition site (Supplementary Sequences). After array synthesis, a universal primer (Supplementary Sequences) hybridizes to the adaptor and initiates continuous elongation and nicking on the extending strand. This is catalyzed by a combination of a strand-displacing polymerase and a nicking endonuclease. The amplification is linear so as to keep the ratios constant among amplified oligos. The extent of the amplification is adjusted by controlling the reaction time. We estimate that a ∼2 h reaction time results in an approximately fourfold amplification.
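To illustrate why linear amplification helps keep oligo ratios constant, here is a toy Python model. It is not the authors' kinetic model. The rate constants, the 10% efficiency difference between two oligos, and the reading of "fourfold" as four released copies per chip-bound template are all assumptions made for illustration.

```python
def linear_yield(templates, copies_per_template_per_hour, hours):
    """Linear amplification: only chip-bound templates are copied,
    so released product grows in proportion to reaction time."""
    return templates * copies_per_template_per_hour * hours


def exponential_yield(templates, efficiency, cycles):
    """Exponential amplification: products are themselves copied each cycle."""
    return templates * (1.0 + efficiency) ** cycles


templates = 1e5  # lower end of the per-sequence quantity quoted in the text

# Assumed calibration: 2 copies per template per hour gives ~4x after 2 h.
print(linear_yield(templates, 2.0, 2.0) / templates)

# Two oligos whose copying efficiency differs by an assumed 10%.
for hours in (1.0, 2.0, 4.0):
    ratio = linear_yield(templates, 2.2, hours) / linear_yield(templates, 2.0, hours)
    print(f"linear, {hours} h: ratio = {ratio:.2f}")

for cycles in (10, 20, 30):
    ratio = exponential_yield(templates, 1.0, cycles) / exponential_yield(templates, 0.9, cycles)
    print(f"exponential, {cycles} cycles: ratio = {ratio:.2f}")
```

In this toy model the bias between the two oligos stays fixed under linear amplification regardless of reaction time, whereas under exponential amplification it compounds with every cycle.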

To avoid complex microfluidic manipulations that would otherwise be required to collect and purify the amplified oligos for downstream gene assembly reactions, we designed the gene-assembly reaction cocktail to allow the PCA reaction to take place immediately after nSDA without a buffer change. Once appropriate concentrations of the amplified oligos have accumulated by nSDA, the reaction mode is switched from isothermal amplification to thermal cycling, which results in assembly of the amplified oligos into gene fragments in the same reaction chamber. The gene products are further amplified off-chip by PCR (Supplementary Fig. 1). The size range of the combined nSDA-PCA reaction products is currently set at 0.5–1 kb to balance overall throughput and assembly efficiency. Longer sequences can be hierarchically assembled from these 0.5–1 kb building blocks.

To reduce gene synthesis errors, we developed a simple yet effective error-correction method using the plant CEL family of mismatch-specific endonucleases, which have been shown to recognize and cleave all types of mismatches arising from base substitutions or from small insertions or deletions. A commercial source of a subtype of the CEL enzymes is the Surveyor nuclease, which has been used primarily for mutation detection [24]. To use it for error correction, we first heat-denature and reanneal the synthetic genes, and then treat them with Surveyor nuclease to cleave error-containing heteroduplexes at the mismatch sites. The error-free DNA duplexes remain intact and are amplified by overlap-extension PCR.

To test the effectiveness of this approach, we cloned chip-synthesized genes encoding RFP into an expression vector with and without Surveyor nuclease treatment. We performed sequencing and automated fluorescent colony-counting experiments to determine and compare error frequencies. By Sanger sequencing 470 randomly selected clones, we observed error frequencies of 1/526 bp (or ∼1.9 errors per kb) and 1/5,392 bp (or ∼0.19 errors per kb) before and after Surveyor nuclease treatment, respectively. Automated counting of thousands of colonies showed that ∼50% and 84% of the RFP colonies were fluorescent in the untreated and Surveyor nuclease–treated populations, respectively (Supplementary Fig. 2a). The results of the sequencing and the colony counting experiments correlated well according to statistical analysis. Another study published while this manuscript was being revised reported comparable error frequencies using the commercial ErrASE kit [20].
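The conversion between the two ways of quoting error frequency can be checked with a few lines of Python. The second function is an added illustration rather than part of the original analysis: it estimates the fraction of error-free clones under the assumption of independent, uniformly distributed per-base errors, and the 700-bp gene length in the example is a hypothetical value, not a figure from the text.

```python
def errors_per_kb(bp_per_error):
    """Convert '1 error per N bp' into errors per kilobase."""
    return 1000.0 / bp_per_error


def error_free_fraction(bp_per_error, gene_length_bp):
    """Expected fraction of clones with no error at all, assuming
    independent errors with the same probability at every base."""
    per_base_error = 1.0 / bp_per_error
    return (1.0 - per_base_error) ** gene_length_bp


print(round(errors_per_kb(526), 2))    # 1.9
print(round(errors_per_kb(5392), 2))   # 0.19

# Hypothetical 700-bp gene, for illustration only.
for label, bp_per_error in (("untreated", 526), ("Surveyor-treated", 5392)):
    fraction = error_free_fraction(bp_per_error, 700)
    print(f"{label}: {fraction:.0%} of clones expected to be error-free")
```

The error-free fraction is not the same quantity as the fraction of fluorescent colonies, since some errors may leave the protein functional.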

To apply high-throughput gene synthesis to optimize protein expression, we studied the distribution of protein expression levels of a large number of synthetic genes, called 'codon variants', that all encode the same protein. LacZα was used as an example in this study. Expression of lacZα makes the host E. coli cells turn blue in the presence of isopropyl-β-D-thiogalactopyranoside (IPTG). First, we designed synthetic codon variants using an unbiased codon usage table, in which codons representing an amino acid were used with equal frequency (Supplementary Sequences). Then, we constructed a library of lacZα codon variants and transformed the variants into E. coli competent cells. We plated a small fraction of the library on solid agar and measured the blue color intensity of the individual colonies in real time by automated image analysis. Clones representing a full spectrum of protein translation levels could be readily identified, with fine gradations in protein expression between them. Notably, we observed a bell-shaped distribution of the maximum protein expression levels of random codon variants growing on the plate. Approximately one-third of the variants showed higher expression levels than wild-type lacZα. The expression level of the wild-type gene was slightly above the median level of all the clones with measurable expression. Although understanding the causes and implications of this distribution requires further study, the distribution allowed us to estimate the translational potential of the lacZα gene in E. coli, which is indicated by the upper boundary in the quantile box plot (Fig. 2b). These observations suggest the feasibility of an experimental approach to reliably obtain gene sequences with the desired protein expression levels in a given expression system.
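As a minimal sketch of how codon variants might be generated from an unbiased codon usage table, the Python below picks each synonymous codon with equal probability. It uses the standard genetic code; the peptide is a short placeholder rather than the lacZα sequence, and the function names are illustrative. The actual library design (Supplementary Sequences) may include constraints not modeled here.

```python
import random
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse lookup: amino acid -> list of synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def random_codon_variant(protein, rng):
    """Recode a protein, choosing each synonymous codon with equal probability."""
    return "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)


def translate(dna):
    """Translate an in-frame DNA sequence back into protein."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


rng = random.Random(0)   # seeded so the run is repeatable
peptide = "MKTAYIAK"     # placeholder peptide, not from the paper
variants = [random_codon_variant(peptide, rng) for _ in range(5)]
for dna in variants:
    assert translate(dna) == peptide   # every variant encodes the same protein
    print(dna)
```

Replacing the uniform choice with a weighted one (for example `rng.choices` with host-specific codon frequencies) would give a biased design of the kind based on an E. coli codon-usage table.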

Next we describe the successful development of such an optimization approach in E. coli, which has been a workhorse for expressing a variety of proteins for research and industrial applications. To allow direct measurement of protein expression levels, we tag each target gene with a GFP reporter gene. Proteins expressed at higher levels will result in colonies with brighter fluorescence.

We applied this strategy to optimize the expression of 74 Drosophila transcription factor protein domains to be used for generating antibodies for the ENCODE (ENCyclopedia Of DNA Elements) Project [25]. We first tested the approach on 15 candidates that were not expressed in E. coli (N.N. & K.P.W., unpublished data). Libraries of synthetic codon variants were designed based on an E. coli codon-usage table [26] (Supplementary Sequences) and constructed using our on-chip gene synthesis technology. The enzymatic error correction procedure was not performed here because heteroduplexes might form between closely related codon variants. The synthetic genes were fused to the N terminus of GFP and cloned into the pAcGFP expression vector using the sequence-independent circular polymerase extension cloning (CPEC) method [27]. E. coli cells were transformed with the plasmid libraries and cultured on agar plates. GFP fluorescence from all colonies was monitored continuously and a small number of highly fluorescent colonies were selected from each pool for sequencing. All colonies contained plasmids with different codon usages throughout the sequence of the candidate proteins.

The sequence-confirmed, highly fluorescent colonies were cultured individually in liquid media and the expression of the protein domains was measured by running the total protein extracts on polyacrylamide gels. High-expression clones were identified for all 15 candidates using this strategy (Fig. 3 and Supplementary Sequences). In comparison, the wild-type controls cloned into the same vector and cultured under the same conditions showed undetectable protein expression. This result indicates that the method can reliably increase protein expression from an undetectable level to as much as ∼50–60% of the total cell protein mass.