Generative AI for Healthcare (Part 1): Demystifying Large Language Models
The "Generative AI for Healthcare (Part 1): Demystifying Large Language Models" video from Stanford Online explains generative AI and large language models (LLMs) for healthcare professionals.
The hosts, clinical informaticists and emergency/internal medicine physicians, acknowledge the lack of accessible educational material and aim to empower viewers with knowledge for safe and effective implementation of these tools.
The video provides an intuitive understanding of LLMs, including their anatomy, training (pre-training and post-training), and how they generate responses. It traces the evolution of AI in healthcare through three epochs: symbolic AI, deep learning (traditional machine learning), and finally, generative AI and LLMs, offering heuristics to distinguish between them.
The discussion emphasizes how LLMs process information through tokenization, embeddings, and self-attention to create context-aware responses, and the importance of post-training techniques such as supervised fine-tuning and reinforcement learning from human feedback in improving model performance and alignment.
The speakers in the YouTube video are Dong and Shivam. Both are clinical informaticists and physicians at Stanford — Dong specializing in emergency medicine and Shivam in internal medicine. Their work centers on deploying generative AI in clinical settings at Stanford Medicine and improving model safety for OpenAI as independent contractors with Greenlight. Dong was also a consultant for Glass Health.
Together, they aim to empower healthcare professionals with accessible knowledge for safely and effectively implementing generative AI.
Concepts Explained by Dong
Dong focused on the challenges clinicians face when using AI models and presented a framework for understanding the evolution and inner workings of large language models (LLMs) in healthcare.
Challenges of Prompting or Prompt Engineering:
Difficulty understanding AI literature: Seminal papers (e.g., "Attention Is All You Need", 2017) are highly technical for clinicians lacking computer science backgrounds.
Lack of healthcare-specific prompting resources: Many online guides skip practical, in-depth examples suitable for clinicians.
Exponential pace of AI progress: AI publications increased from 272 (2014) to over 20,000 (2024) on PubMed, making it hard to keep up.
Three Epochs of AI in Healthcare (framework by Michael Howell and Karen DeSalvo of Google):
Epoch 1: Symbolic AI / Probabilistic Models (Rules-Based AI)
Timeline: ~1970 onward
Features: Logic-based, non-adaptive, do not learn new data
Examples: Clinical decision support tools, risk calculators, automated billing
Epoch 2: Deep Learning (Traditional Machine Learning)
Timeline: ~2010 onward
Features: Pattern recognition learned from millions of examples; typically single-task, black-box models
Examples: Automated EKG/STEMI detection, patient-deterioration prediction models, radiology AI
Epoch 3: Large Language Models & Generative AI
Timeline: Underlying transformer architecture described in 2017; broad public attention in 2022 with ChatGPT
Features: General-purpose, generative, multimodal, with limited interpretability
How LLMs Generate Responses:
Tokenization: Breaks input into tokens (roughly words or word fragments).
Static Embeddings: Tokens become vectors encoding meaning.
Context-Aware Embeddings (Self-Attention): The model updates each token's embedding based on the full surrounding context.
Next Token Prediction: Uses the contextualized embeddings to predict the next token; a "temperature" setting controls how random or creative the sampling is.
Iterative Generation: Output is generated one token at a time, enabling complex reasoning strategies such as chain-of-thought prompting (a minimal end-to-end sketch of this pipeline follows below).
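To make these steps concrete, here is a minimal sketch of the pipeline in Python/NumPy. It is a toy under stated assumptions, not the implementation of any real model: the vocabulary, weights, and single attention head are made up, and positional encodings, multiple layers, and training are all omitted.

```python
# Toy sketch of the generation pipeline described above: tokenize -> embed ->
# self-attention -> next-token prediction with temperature -> iterate.
# All weights are random and the "vocabulary" is tiny, so the output is
# nonsense; the point is only to make each step concrete.
import numpy as np

rng = np.random.default_rng(0)

# 1) Tokenization: map text to integer token ids (real tokenizers use
#    sub-word pieces; here we simply split on whitespace).
vocab = ["<pad>", "chest", "pain", "radiating", "to", "the", "left", "arm", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def tokenize(text):
    return [token_to_id[w] for w in text.lower().split() if w in token_to_id]

# 2) Static embeddings: one vector of "meaning" per token in the vocabulary.
d_model = 16
embedding_table = rng.normal(size=(len(vocab), d_model))

# 3) Self-attention: each token's embedding is updated using every other
#    token in the context (single head, no positional encoding, for brevity).
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)          # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # context-aware embeddings

# 4) Next-token prediction with temperature: turn the last token's contextual
#    embedding into a probability distribution over the vocabulary.
W_out = rng.normal(size=(d_model, len(vocab)))

def next_token(ids, temperature=0.7):
    x = embedding_table[ids]
    h = self_attention(x)
    logits = h[-1] @ W_out / max(temperature, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(vocab), p=probs)

# 5) Iterative generation: append each sampled token and repeat.
ids = tokenize("chest pain radiating to the left arm")
for _ in range(5):
    ids.append(int(next_token(ids)))
print([vocab[i] for i in ids])
```

Raising the temperature flattens the probability distribution and makes sampling more varied; lowering it makes the output more deterministic.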
Concepts Explained by Shivam
Shivam described the evolution of OpenAI's models and crucial training techniques leading to ChatGPT's capabilities.
Evolution of OpenAI Models:
GPT-1 (2018): 117 million parameters, trained on unpublished books. Output resembled book-like prose, not clinically useful.
GPT-2 (2019): 1.5 billion parameters; trained on internet text (Reddit-linked pages). Improved, but still not satisfactory for clinical answers.
Scaling Laws (2020): Optimal performance comes from scaling compute, dataset size, and parameter count together (a toy illustration of this power-law relationship follows after this list).
GPT-3 (2020): 175 billion parameters, trained on the refined WebText2 dataset. Responses were more coherent, yet still lacking in expertise.
GPT-3.5/ChatGPT (2022): The breakthrough came from post-training rather than further scaling; the model finally delivered accurate medical information.
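The scaling-laws result referenced above (Kaplan et al., 2020, "Scaling Laws for Neural Language Models") found that test loss falls as a smooth power law of model size, data, and compute. The snippet below is only a toy illustration of that functional form; the constants are placeholders loosely based on the paper's parameter-count law, not a faithful reproduction of its fits.

```python
# Toy illustration of the power-law shape behind the 2020 scaling-laws result:
# loss(N) ~ (N_c / N) ** alpha, i.e. loss falls smoothly as parameters grow.
# The constants are illustrative placeholders, not the paper's exact fits.
def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Approximate test loss as a function of parameter count alone."""
    return (n_c / n_params) ** alpha

# Parameter counts mentioned in the video: GPT-1, GPT-2, GPT-3.
for name, n in [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: {n:.0e} params -> predicted loss ~ {power_law_loss(n):.2f}")
```

The takeaway is the one the video draws: bigger models trained on more data with more compute improve predictably, until dataset quality becomes the bottleneck.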
Post-Training Techniques:
Supervised Fine-Tuning (SFT): Trains on curated input-output pairs, improving instruction following (e.g., summarization); see the first sketch after this list.
Reinforcement Learning from Human Feedback (RLHF):
Humans rank candidate outputs by quality; the model learns to prioritize better answers.
Domain experts (e.g., physicians and lawyers) are increasingly involved in these reviews.
Reward models and LLM "judges" further automate and scale alignment; see the second sketch after this list.
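At its core, SFT is ordinary next-token training restricted to curated prompt-response pairs, with the loss computed only on the response tokens. Below is a hedged sketch of that objective using PyTorch and the Hugging Face transformers library; the "gpt2" checkpoint and the clinical example pair are placeholders, and real SFT pipelines add batching, padding, and careful data curation omitted here.

```python
# Minimal sketch of the supervised fine-tuning (SFT) objective: standard
# next-token cross-entropy, computed only on the curated response tokens
# (prompt tokens are masked out of the loss with label -100).
# "gpt2" and the example pair are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Summarize: 62-year-old with chest pain, troponin negative, EKG normal.\n"
response = "Low-risk chest pain; troponin and EKG unremarkable."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
response_ids = tok(response + tok.eos_token, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels: ignore the prompt (-100), learn to reproduce the curated response.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()          # one gradient step on one example
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss on this pair: {loss.item():.3f}")
```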
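The reward model at the center of RLHF is typically trained with a pairwise preference loss: given a human ranking of two candidate answers to the same prompt, the model is penalized when it scores the rejected answer above the chosen one. The sketch below shows that Bradley-Terry-style objective with a small stand-in network and random placeholder features; real reward models are fine-tuned LLMs scoring full prompt-plus-response sequences, and the subsequent policy-optimization step (e.g., PPO) is omitted.

```python
# Sketch of the pairwise preference loss used to train an RLHF reward model:
# loss = -log(sigmoid(r(chosen) - r(rejected))). The tiny MLP standing in for
# the reward model and the random "embeddings" are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 64

reward_model = nn.Sequential(      # stand-in for a fine-tuned LLM reward head
    nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Pretend these are features of (prompt + answer) pairs, where a human ranked
# the first answer of each pair above the second.
chosen = torch.randn(8, embed_dim)     # 8 preferred answers
rejected = torch.randn(8, embed_dim)   # 8 dispreferred answers

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Bradley-Terry pairwise loss: push the chosen score above the rejected one.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"preference loss: {loss.item():.3f}")
```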
New Paradigm Shift (post-Sept 2024):
Limits of Pre-training: Curating ever-larger high-quality datasets is a bottleneck; the new focus is on giving models more compute at inference time (test-time scaling; see the sketch after this list).
Reasoning Models: Spend more compute reasoning during response generation, greatly improving performance on difficult benchmarks (e.g., AIME, ARC-AGI).
Future: GPT-4.5 will be the last non-reasoning model; focus shifts to scaling test-time reasoning.
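How reasoning models allocate their extra inference compute is largely proprietary, but one simple, widely used form of test-time scaling is self-consistency: sample several independent reasoning chains at a nonzero temperature and return the majority answer, trading more compute per query for accuracy. The sketch below illustrates only that general principle; generate() is a hypothetical placeholder for any LLM call, and this is not how o-series models work internally.

```python
# Illustration of one simple form of test-time scaling ("self-consistency"):
# sample several reasoning chains at a nonzero temperature and return the most
# common final answer. generate() is a hypothetical placeholder for any LLM
# call that returns (reasoning_text, final_answer); plug in a real API.
from collections import Counter
from typing import Callable, Tuple

def self_consistent_answer(
    prompt: str,
    generate: Callable[[str, float], Tuple[str, str]],
    n_samples: int = 8,
    temperature: float = 0.8,
) -> str:
    """Spend more inference compute (n_samples calls) to get a better answer."""
    answers = []
    for _ in range(n_samples):
        _reasoning, answer = generate(prompt, temperature)
        answers.append(answer.strip().lower())
    # Majority vote over the sampled final answers.
    return Counter(answers).most_common(1)[0][0]

# Example with a fake generator that is right about two thirds of the time:
import random
def fake_generate(prompt: str, temperature: float) -> Tuple[str, str]:
    return ("...", "42" if random.random() < 0.66 else "41")

print(self_consistent_answer("What is 6 x 7?", fake_generate))
```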
Summary
Dong and Shivam explained that LLMs are highly compressed numerical representations of collective human knowledge and reasoning, trained at enormous expense and compact enough to fit on small devices. This compressed "understanding" underpins new transformative technologies.