What my lab would do

The longstanding dream of many academic trainees is to start a lab. But unlike other professions, where seniority is earned with experience, a lab is earned by brilliance. Unfortunately, the sacrifice necessary for achievement makes starting a lab a choice that few people are able to make.

But it makes sense.

The ‘tired’ vision is that you constantly churn out grant applications, mentor students and postdocs (90+% of whom can’t hold a candle to your own talent), and appease the endless requests of administrators, reviewers, and employees. Being a PI can be the ultimate client-facing operation, where every day you need to do something for someone.

The ‘greedy’ vision is a never-ending assault on hairy scientific problems, enabling you as a PI to win grants, publish papers, and build companies that solve the world’s most difficult problems. In the right environment, flywheels spin and your career sprouts wings.

Related to this dichotomy, the key theme I learned this year is the importance of problem selection. Choosing the right problems can turn feeling tired into feeling greedy. Finding your scientific personality and the set of problems that let you splash color and energize a field of study should be the key focus of graduate school. Indeed, for a certain phenotype of individual, the conventional wisdom that the most important decision you make as a PhD student is picking a kind and thoughtful graduate mentor is BS. What is important is picking an area of research where the field has vision and ambitious talent is abundant.

For me?

My research focuses on pancreatic cancer. The ~1.5 years I’ve spent working in this field have given me the freedom to explore how to use cutting-edge technologies that I found cool two years ago. Spatial -omics, pooled screening, organoids, and genetically engineered mouse models are objectively helpful tools for exploring disease.

But they aren’t groundbreaking.

I spend my days distributing myself across many projects: performing maintenance tasks like passaging cells or titrating reagents, and piloting or optimizing new assays and experiments. Data comes in, you analyze it, and you make a figure. Most of my projects don’t have serious therapeutic or diagnostic applications yet; they are simply arguments to try to convince people to think a certain way or study a certain topic more deeply. I’m a story farmer harvesting evidence and…

Studying biology is a constant search down rabbit holes, which is fun but can get exhausting without a community. The downside of this field, frankly, is that there is simply very little ambition. Survival outcomes are horrible. The core infrastructure has more or less been built. The main scientific questions are well defined and easily answered by straightforward experiments, but are blocked by structural problems.

Drugging KRAS was the watershed moment, and now we are all waiting on the same clinical trial specimens to be analyzed. There isn’t a lot of creative freedom, so the ‘top’ researchers resort to grinding through expensive experiments. The formula is very simple: profile the patient samples, execute the CRISPR screens, and do the mouse study. Cancer research is largely a sample acquisition game, and what separates top researchers is their ability to secure samples, money, or human resources. Rigorous evaluation of new targets is a whole different type of investment — you need to buy or invent reagents, build new models and assays, and still do the expensive mouse experiments in the end. It is not as fast or as cheap to iterate as in synthetic biology, chemistry, physics, or computational domains.

Poor-quality evidence is everywhere. There have been several painful experiences in my own training (email me), and it’s clearly a common enough experience that people are writing papers about it. When data cannot be cheaply and quickly replicated, anything and everything can be published.

In a way, this conditions you so that your gut reaction every time you see something new is to throw shade. If nothing works and the task you are working on is too hard, there shouldn’t be any way that someone less talented than you can come up with something, right? The default reaction is jealousy, and that is deeply broken.

The biggest benefit of working specifically on a disease is that once you learn an indication, you know it. It’s like riding a bike; it honestly doesn’t change that much. You could study pancreatic cancer for a month and more or less be on par with the so-called experts.

The next best skill that training in biology (in a therapeutic sense) gives you is clarity about strengths and limitations. No one has cured cancer yet, and so there is always a next experiment to do, always a clear benchmark to beat, and always caveats to experimental results.

If I were to summarize the strengths and limitations of my PhD training, I would say the following:

Skills gained:

  1. How to think about biology from a therapeutic angle
  2. How to think about the utility and applications of new technologies
  3. How to think creatively and think through conflicting results

Lacking in:

  1. Engineering ability, i.e., development of drugs or antibodies
  2. (a real) Computational skill set
  3. Visionary big picture thinking (slightly buffered by extracurricular experiences)

My lab

In science, and probably in most other aspects of life too, there is a rubric. The rubric exists for fairness, and while the intentions are good, it generates a lot of frivolous work. For example, the industry standard of collecting only as much information as you need to make a decision runs against the ethos of the modern peer review system.

Your lab is arguably your first opportunity to protest against the rubric. In fact, you design your own rubric according to your own values and ambition. Your lab is earned by brilliance, and your brilliance will set you free. Being a PI enables you to direct focus towards the big hairy problems in the world that you care about. Being a PI lets you be greedy without a limit on time or money.

For myself, I want to study recurrent, refractory, and resistant disease.

The most frustrating thing is having a drug but no cure. I want to study why there are no cures, starting in cancer and spanning different drug modalities. My lab would design new therapeutic strategies to subvert the primary and secondary emergence of resistance, as well as biomarkers to more confidently identify optimal therapeutic strategies. I have five verticals:

1. Development of new models for minimal residual disease or refractory/recurrent disease

One of the best things I’ve learned in my PhD is the success of genetically engineered mouse models in recapitulating outcomes in human disease. In pancreatic cancer, the KP mouse model (carrying pancreas-specific KRAS G12D and p53 mutations) is the gold-standard preclinical tool for understanding tumor biology and evaluating experimental therapeutics. Another model, the iKRAS model, carries a doxycycline-inducible KRAS G12D allele, enabling the study of what happens when you ablate KRAS G12D. These models are remarkably accurate at recapitulating the effects of KRAS inhibitors, down to the specific genetic alterations that emerge at resistance and the changes in transcriptional state.

Yet these inducible oncogene models have not been used broadly to study resistance, across cancers and across the diversity of driver oncogenes. In some instances, the primary utility of inducible expression has perhaps only been to prevent the embryonic lethality caused by constitutive oncogene expression. Outside of KRAS, what if we could generate inducible ALK fusions or BRAF V600E? Or any oncogenic driver gene, even those without chemical inhibitors (like the many transcription factors nominated as top selective dependencies in DepMap)?

What about more broadly applicable medications like chemotherapy? Achieving absolute cure with intensive chemotherapy is often not possible, and there is inevitably some residual disease left over, both in patients and in mouse models. For mice, could we think about ways of standardizing the induction of complete responses, partial responses, or stable disease depending on dose? Such a paradigm would allow us to model and understand what the residual disease looks like and how it could potentially be targeted.

Reproducible mouse models of treatment resistance require technological (and likely regulatory) improvements to the current stack. We need things like robotic automation of mouse experiments. Doxycycline chow with inducible systems can help relieve the strain of dosing for some studies. But what if we could automate weight measurements, tumor ultrasound/CT, or use other digital biomarkers? Could cages endowed with sensors automate many of the mundane tasks of in vivo research? A company called Vium was doing this but was acquired by Recursion in 2020. More broadly accessible tools for mouse experiments would accelerate the vision for new and improved models of disease.

I would begin with the following concrete steps:

  1. Generate inducible ablation models for oncogenic transcription factors enriched at the top of DepMap (e.g. SOX10 in melanoma, PAX8 in RCC/ovarian, CTNNB1 in colon). These can most likely be built on top of the gold-standard, disease-specific models used currently.
  2. Generate reproducible models of chemotherapy resistance by optimizing dosing for complete, partial, and minimal response
  3. Characterize the kinetics of each of these models for tumor growth, remission, and recurrence

2. Characterization of residual or refractory/recurrent disease

Using these models, I want to perform in vivo and in vitro time course experiments to understand exactly how pathogenic cells change in response to drug. You can imagine establishing exact time points of when resistance occurs and when recurrence occurs. Using large cohorts of mice, one could profile the cells at these time points to try and discover aspects of tumor evolution through therapy.

There are many open questions here that improved models could answer:

  1. Why do some tumors develop resistance through genetic mechanisms while others rely on non-genetic ones?
  2. Are there shared mechanisms of resistance following oncogene ablation? Why do cells keep growing despite lack of the original growth signal?
  3. What is the relative strength of an oncogene across tumor contexts? In which contexts does tumor maintenance require other factors?
  4. What molecules do drug resistant cells secrete, and how does this impact the microenvironment around them?
  5. How plastic are cells and what are the cell states along the continuum of resistance and at specific timepoints?

3. High throughput model development for functional genomics

DepMap is the most revolutionary thing that has happened for cancer research since sequencing, and the logical next step is scale and diversity.

One key ingredient missing from DepMap is the impact of therapy, and being able to characterize drug-resistant cells would greatly enrich the kinds of computational analyses we could do to model how patient tumors respond to therapy. If the goal of therapy is to kill cells, generate kill-resistant cell lines. Not just 3-5 of them, but hundreds per drug and cancer type, including multiple independent derivations from the same cell line. Just like the current version of DepMap, we would RNAseq them all and do CRISPR screens in the representative lines.

These experiments are highly amenable to lab automation. Using an Opentrons or any other pipetting machine, we can install it inside an incubator and use it to passage cells, administer drug, and take images of cells as they change to monitor growth and other behaviors. Generating drug-resistant lines is a pretty brain-dead activity where all you need to do is grow cells in increasing concentrations of drug.
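
To make this concrete, here is a minimal sketch of one media-change step in a dose-escalation protocol, written against the Opentrons Python API. The labware choices, deck positions, and escalation scheme are assumptions for illustration, not a validated protocol:

```python
# Sketch: swap spent media for drug-containing media at escalating doses.
# Labware, slots, and the dose layout are hypothetical.
from opentrons import protocol_api

metadata = {"protocolName": "Drug escalation media change", "apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_24_wellplate_3.4ml_flat", 1)
    reservoir = protocol.load_labware("nest_12_reservoir_15ml", 2)
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", 3)
    p300 = protocol.load_instrument("p300_single_gen2", "right", tip_racks=[tips])

    # Hypothetical escalation: each reservoir well holds media premixed with
    # an increasing drug concentration (e.g. 0.5x, 1x, 2x IC50).
    doses = [reservoir["A1"], reservoir["A2"], reservoir["A3"]]
    waste = reservoir["A12"]

    for col, dose in enumerate(doses, start=1):
        for row in "ABCD":                      # one plate column per dose level
            well = plate[f"{row}{col}"]
            p300.pick_up_tip()
            p300.aspirate(300, well)            # remove spent media
            p300.dispense(300, waste)
            p300.aspirate(300, dose)            # replace with drug-containing media
            p300.dispense(300, well)
            p300.drop_tip()
```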

CRISPR screens are more difficult to automate, but AI can help plan experiments, providing a daily to-do list of the tasks machines can’t do. Technicians could get an email each day with their tasks, or even a schedule for when to do them. Cell line profiling experiments (e.g. RNAseq, WGS, proteomics) are automatable, or at least not incredibly labor intensive.
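
As a toy illustration, the planner could be as simple as mapping day offsets in a screen’s timeline to tasks; the timeline contents and screen name below are invented:

```python
# Sketch: generate today's technician task list from a screen timeline.
import datetime

# Hypothetical timeline for one pooled CRISPR screen, keyed by day offset.
SCREEN_PLAN = {
    0: ["Thaw cells", "Seed flasks"],
    3: ["Transduce with library at MOI ~0.3"],
    5: ["Begin puromycin selection"],
    7: ["Passage; bank early time point pellet"],
    21: ["Harvest final pellets; extract gDNA for sequencing"],
}

def tasks_for(start: datetime.date, today: datetime.date) -> list[str]:
    """Tasks due today for a screen started on `start`."""
    return SCREEN_PLAN.get((today - start).days, [])

# Example: one active screen that started five days ago.
today = datetime.date.today()
screens = {"screen-001": today - datetime.timedelta(days=5)}
for name, start in screens.items():
    for task in tasks_for(start, today):
        print(f"[{name}] {task}")   # in practice, collected into an email digest
```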

The end goal should be to characterize hundreds of resistant cell lines for any given blockbuster therapy (e.g. RMC-6236, osimertinib, lorlatinib, most forms of chemo, etc.). The matched pre/post profiling data could then be used to understand what molecular alterations make cells resistant and what the dependencies of the new cells are. People have done plenty of drug-anchored screens, where CRISPR screens are conducted in the context of drug, but these are perhaps a better measure of primary resistance; secondary resistance is a reprogramming effect that may be better measured by isogenic matched lines grown out to resistance. Still, I do think that expanding the set of drug-anchored screens to study primary resistance should be done in tandem with the resistant cell line bank approach.

Doing so, we can answer several key questions about drug resistance:

  1. Are there conserved and reproducible mechanisms of resistance within the same cell line and across panels of cell lines?
  2. Can a DepMap of resistant lines provide the data corpus needed for machine learning?
  3. Are primary and secondary resistance mechanisms related? Which best recapitulates the patient experience?
  4. Are there unique dependencies for drug resistant cell lines?

4. Collaborations with precision medicine companies (Tempus AI, Caris Life Sciences, etc.) to train AI models

Precision medicine companies with CLIA-certified labs that regularly process tissue have an abundance of data and will continue to grow their data corpus. Such datasets are annotated to be machine readable and are (and will continue to be) more or less the only large source of longitudinal molecular profiling of tumors through the course of therapy. Foundation models are genuinely exciting for biomedical datasets because of their readiness to ingest large multimodal data, and they will readily be applied to these databases. However, biomedical -omics datasets are far smaller in scale than the corpora behind other successful foundation models.

The ESM protein language models were trained on up to 250 million protein sequences, whereas TCGA has only ~10k patients with matched tumor and normal sequenced specimens. While there are roughly 2 million cases of cancer in the U.S. annually, there are probably still only on the order of 100k cancer transcriptomes that have been collected.

How expensive are 1 million profiled patients? You likely want the genome (via ctDNA or tissue biopsy), transcriptome, H&E, complete pathology report, and imaging studies, collected across the longitudinal span of care. This is without even thinking about more complicated -omics measurements like scRNAseq, spatial profiling, or personalized cell line avatars. Just the cost of data collection might be roughly $1,000 across 3 visits (~$315 per visit: $100 genome, $10 transcriptome, $5 H&E, and say $200 for a heavily discounted suite of imaging studies). These datasets are only starting to be collected now by Tempus, and it will take maybe a decade or two plus $1 billion to reach an adequate number of profiled patients to start training robust foundation models.
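
To make the arithmetic explicit, a quick sanity check of the figures above:

```python
# Back-of-the-envelope check of the data collection cost quoted above.
per_visit = 100 + 10 + 5 + 200      # genome + transcriptome + H&E + imaging, in $
visits = 3
per_patient = per_visit * visits     # $945, i.e. roughly $1,000
cohort = 1_000_000
total = per_patient * cohort
print(f"${total / 1e9:.2f}B for {cohort:,} patients")  # -> $0.94B, ~$1 billion
```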

If Tempus is allowed to partner and share data with biopharma partners, it should be able to share with academics. The issue is that academics can’t pay the purse that a Novartis can. Tempus could offer a grant program or an academic partner program, but alas, nothing like this exists. I think the key, as an academic, is to be prolific enough to make this something Tempus would want to do, and to have a reasonable plan for how to execute on a visionary training run.

My lab would help develop models by developing benchmarks and datasets. Here is exactly what the inputs and outputs of my models would be:

  1. I’d like to supplement efforts to build an “AI oncologist” (see the sketch after this list). Such a model would be capable of understanding the entirety of the NCCN guideline treatment algorithms, as well as ingesting embeddings from H&E and imaging foundation models. My lab’s role would be on the molecular biomarker end, providing embeddings based upon the -omics datasets that are collected. The AI oncologist would thus be able to synthesize multimodal input and provide recommendations for targeted therapy, additional profiling to collect, or clinical trials to enroll in. Just as deep learning models in radiology can help radiologists not ‘miss’ certain findings, an AI oncologist could help clinicians parse ever larger patient datasets.
  2. I’d also like to specifically train models to predict how resistance emerges based upon the baseline state of cells. This is only possible with matched data across thousands of cell line backgrounds (from my high-throughput model development). To some extent, we can also evaluate the performance of these models with real-world RNAseq from matched pre/post-treatment patient samples.
  3. I also want to provide more accurate assessments of how effective a therapy might be. Imagine being able to say with a degree of certainty how long a therapy could extend your life, or how much a combination therapy would help. Imagine being able to say with certainty what the projected molecular cause of death is, and suggesting therapies to overcome it.
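
For the AI oncologist referenced in item 1, here is a minimal sketch of the kind of late-fusion architecture implied: precomputed per-modality embeddings are projected into a shared space and fused to score treatment options. Every dimension, name, and the scoring head is an assumption for illustration, not a validated design:

```python
# Toy late-fusion model over precomputed modality embeddings.
# All dimensions and the therapy-scoring head are hypothetical.
import torch
import torch.nn as nn

class OncologyFusion(nn.Module):
    def __init__(self, dims=None, hidden=256, n_therapies=32):
        super().__init__()
        dims = dims or {"he": 768, "imaging": 512, "omics": 256}
        # One linear projection per modality into a shared embedding space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hidden, n_therapies))

    def forward(self, embeddings: dict[str, torch.Tensor]) -> torch.Tensor:
        # Mean-pool whichever modalities are present for this patient,
        # so missing modalities degrade gracefully instead of failing.
        fused = torch.stack(
            [self.proj[m](x) for m, x in embeddings.items()]
        ).mean(dim=0)
        return self.head(fused)  # scores over candidate therapies

model = OncologyFusion()
scores = model({"he": torch.randn(1, 768), "omics": torch.randn(1, 256)})
```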

5. Partnerships with translational medicine groups

I think if you are to become an expert at understanding drug resistance, you need to be the go-to academic lab for translational medicine groups at all the major drug developers running clinical trials.

I don’t think the reliability of an academic lab will ever match that of a service provider, and the type of lab I want to build won’t be developing new assays for being resourceful with limited tissue input. Partnerships shouldn’t be a fee-for-experiment exchange.

Instead, there is a lot of mileage to be gained by being the data analysis czar. Fine-tuning models on partners’ proprietary datasets and coming up with informative ways of extracting biological insights is valuable.

Groups can ask you:

  1. What is the mechanism of action for my drug?
  2. What are the resistance mechanisms for my drug?
  3. What kinds of other drugs might have synergy, or at least strong additivity with my drug?
  4. What biomarker strategy should we try to develop?

Funding

My lab will be expensive, but it will be expensive due to scale rather than specialty assays. Annually, we will probably spend ~$30k on DNA/RNA sequencing, another $20k on proteomics, and another $50k on consumables. The most expensive assay we will do is probably scRNAseq. Maintenance of a mouse colony could be another $50k annually, and personnel costs perhaps another $700k for a 10-15 person lab, accounting for some success with fellowships. Altogether, the lab might cost $1 million each year.

HHMI funding is roughly $11 million over 7 years, and R01 grants are roughly $500k annually. With various foundation grants and smaller government grants layered on, a $1 million annual budget is reasonable. With lab automation, the goal is for the lab not to be limited by human resources.
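
The budget arithmetic, spelled out (figures taken from the estimates above, in $k per year):

```python
# Annual cost estimate vs. plausible funding, in $k (figures from the text).
costs = {"sequencing": 30, "proteomics": 20, "consumables": 50,
         "mouse colony": 50, "personnel": 700}
print(sum(costs.values()))            # 850 -> ~$1M/yr with overhead and scRNAseq

funding = {"HHMI": 11_000 / 7,        # ~1,571/yr over the 7-year term
           "R01": 500}                # per year
print(round(sum(funding.values())))   # 2071 -> comfortably covers ~$1M/yr
```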

Closing

Cures are possible. Studying treatment resistance is ultimately a diversity problem: there simply aren’t enough training examples (patient samples) to fully understand edge cases and context specificity. Scaling the diversity of samples by generating new mouse models and deriving new treatment-resistant human models is what I would propose as a mechanism for overcoming this problem.

A big thesis for me over the past couple of years has been that our understanding of biology is the biggest bottleneck for progress in new therapeutics. The commoditization of bioengineering and chemistry means that target discovery and biomarker development become high leverage. Once a target is discovered, there are plentiful collaborations to be had with scientists using AI for antibody development, small molecule screening and optimization, and other tasks in developing drugs against nominated targets.

The upside case for a lab focused on biology is that principles of resistance are shared across cancers and potentially across indications. If we are successful, we could even think about working on infectious diseases or I&I indications with clear examples of treatment resistance.

I think that even in the age of AI we are still fundamentally data limited. This is especially true in cancer where heterogeneity makes it almost impossible to reason from first principles.

Published Nov 16, 2024

Harvard-MIT PhD Student