AI/LLM applications in biotech regulation and finance

I am looking for a student with strong computational skills and an interest in drug development and/or finance to assist with scraping, processing, and analyzing publicly available documents from the FDA and SEC. The role involves automating the collection, download, and processing of PDF files and applying machine learning techniques, including large language models, to enable large-scale text analysis.

Background:

I am interested in how biotech and pharma companies decide whether and how to develop new medicines in particular clinical areas. I study questions with practical relevance, where the answers will help industry professionals make smarter choices about how to balance cost, time, risk, and reward in drug R&D.

My interest in these sorts of questions stems from my work as an advisor to biopharma companies and investors. Recent projects, whitepapers, and opinion pieces have focused on drug regulation, drug pricing, corporate and investor decision-making, and clinical trial strategy. (See here for details on these projects and others, and see here for more details on my professional background and expertise.)

The purpose of this project is to build the data infrastructure needed to ask and answer these sorts of questions more efficiently.

Responsibilities:

  • Write Python scripts to scrape relevant PDFs from public websites (e.g., drug approval documents from the FDA, 10-K reports from the SEC)
  • Download, store, and organize large numbers of PDF files
  • Extract text from PDFs
  • Clean and preprocess text for advanced natural language processing (NLP)
  • Set up a vector database for semantic search and text retrieval
  • Implement LLM-based analyses (e.g., summarization, topic modeling, sentiment analysis) on the extracted text
  • Deploy a simple interface (e.g., using FastAPI or Streamlit) for interacting with the LLM
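
As a rough illustration of the text cleaning and preprocessing steps listed above, a minimal Python sketch might look like the following. It uses only the standard library and assumes the text has already been extracted from a PDF. The function names `clean_pdf_text` and `chunk_words`, and the default chunk sizes, are hypothetical choices for illustration rather than part of any existing project code.

```python
import re
import unicodedata


def clean_pdf_text(raw: str) -> str:
    """Normalize text extracted from a PDF (hypothetical helper)."""
    # Fold compatibility characters, e.g. the "fi" ligature, into plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across line breaks. Caveat: this also merges
    # genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse newlines and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows, e.g. for later embedding."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


if __name__ == "__main__":
    sample = "The \ufb01nal approv-\nal   letter\nwas issued."
    cleaned = clean_pdf_text(sample)
    print(cleaned)
    print(chunk_words(cleaned, size=4, overlap=1))
```

In practice, chunks like these would be passed to an embedding model and stored in a vector database; the right chunk size and overlap would depend on the model and documents used.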

Work would be independent and remote, with regular videoconference check-ins (approx. once per week) and additional ad hoc interactions as needed by video and/or email.

Name of research group, project, or lab
David group
Why join this research group or lab?

This work is well-suited to students with computational skills who are interested in future careers in biopharma (R&D, regulatory, project management, etc.), biotech consulting, or biotech finance/investing. Students will learn about key questions and issues in drug development, pharmaceutical regulation, drug commercialization, and biopharma investment. I provide students with 1:1 mentorship on projects as well as individualized career advice.

Logistics Information:
Project categories
Biology
Student ranks applicable
1st year undergraduate
2nd year undergraduate
3rd year undergraduate
4th year undergraduate
Masters
Student qualifications

Required qualifications:

  • Proficiency in Python
  • At least some of the following (see note):
    • Experience with web scraping (e.g., BeautifulSoup, requests, Scrapy)
    • Experience with cloud storage solutions (AWS, Google Drive API)
    • Familiarity with PDF processing (e.g., PyMuPDF, PDFMiner, Tesseract OCR)
    • Understanding of natural language processing (NLP) and LLMs (e.g., OpenAI's GPT, LangChain)
    • Experience with vector databases (e.g., FAISS, Pinecone, ChromaDB) for document search and retrieval
  • Interest in pharmaceutical R&D, drug regulation, and/or finance
  • Ability to work independently and document work clearly

Note: Please consider applying even if you have only some of the desired technical experience, provided you are eager to fill your knowledge gaps as part of this project.

To apply for a research position, please submit:

  • Your resume.
  • A one-page cover letter describing your interest in this position and fit with the required qualifications.
  • A brief description of relevant projects you have completed and/or a description of how you would approach at least one of the components listed under "Responsibilities" in the project description.

As part of compiling your materials and preparing for an interview, I strongly encourage you to review the materials at this link under "1. High-level overview of biopharma".

Preference will be given to students who can commit to continuing their work in the fall (10+ hours per week).

Hours per week
10-15hrs/wk
Compensation
Unpaid - Volunteer
Number of openings
1
Project start
May 2025
Contact Information:
Mentor
Frank.David@tufts.edu
Principal Investigator
Name of project director or principal investigator
Frank David
Email address of project director or principal investigator
frank.david@tufts.edu