The Clinical Trials Assistant is a multi-agent generative AI application that helps users find completed, published results of clinical trials for the treatment or prevention of a disease they are interested in. Large Language Models (LLMs) are used to validate user input, summarize complex text for people without a medical background, and evaluate the summarizations.

This is my capstone project for the Google Kaggle 5-day Generative AI Course. The application demonstrates a few uses of LLMs in a pipeline to produce easy-to-understand summarizations of trial results. My Kaggle application is here.

The agents created to validate input data and summarize text communicate with LLMs hosted by Google.

This blog (1) introduces LLMs, (2) highlights some of the application's lifecycle, and (3) presents highlights from the application. References used in the blog are at the bottom of the article.


(1) LLMs

LLMs accomplish complex goals through capabilities built on top of a single underlying skill: predicting the next word in a sequence of words. Predicting the next word is a difficult task. The space of possible events for all combinations of words is far larger than the amount of data that could ever be given to a model, and the model itself has far fewer parameters than that space of possible events, so the problem of learning the probabilities of word combinations is severely under-determined.

Note that in these algorithms a "word" might be a partial word (a subword token) or even a byte pair.

To predict a combination of words, we define the joint probability of the words and, for some tasks, their marginal probabilities. Let x = [x_1, x_2, …, x_n]. By the chain rule of factorization, the joint probability is p(x) = p(x_1) · p(x_2 | x_1) · … · p(x_n | x_{n-1}, x_{n-2}, …, x_1). Predicting the last word given all the words preceding it, that is, p(x_n | x_{n-1}, x_{n-2}, …, x_1), requires on the order of V^(n-1) possible contexts, where V is the vocabulary size, so a direct calculation is not tractable. To make the problem tractable while still preserving the autoregressive property of using the previous steps to predict the current step, many different algorithms have been devised. The transformer, and the ways LLMs use transformers, is the latest of these.
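
To make the factorization concrete, here is a toy Python sketch (the vocabulary and probabilities are invented for illustration, not taken from any real model) that scores a short sequence by multiplying the conditional probabilities of each word given its prefix:

```python
# Toy illustration of p(x) = p(x1) * p(x2|x1) * ... using an invented
# conditional-probability table; a real LLM learns these conditionals.
cond_prob = {
    (): {"the": 0.5, "a": 0.5},                       # p(x1)
    ("the",): {"trial": 0.6, "dog": 0.4},             # p(x2 | x1)
    ("the", "trial"): {"ended": 0.7, "began": 0.3},   # p(x3 | x1, x2)
}

def sequence_probability(tokens):
    """Multiply the conditional probability of each token given its prefix."""
    p = 1.0
    for i, token in enumerate(tokens):
        prefix = tuple(tokens[:i])
        p *= cond_prob[prefix][token]
    return p

print(sequence_probability(["the", "trial", "ended"]))  # 0.5 * 0.6 * 0.7 = 0.21
```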

LLMs learn the probabilities of sequences of words by training on a very large corpus to predict the next word. They are built from transformers, which are generative models that combine autoregressive structure with non-linear layers. The transformer was created by researchers at Google in 2017. Their network architecture was less complex than the state-of-the-art (SOTA) RNN and CNN models, was parallelizable, required less training time, and demonstrated higher-quality results on language translation.

Milestones in LLM components

Year    Construct
2015    Attention
2017    Transformer Architecture
2018    Contextual Word Embeddings and Pretraining
2019    Prompting

Details of Transformers and their components.

Briefly, a few more constructs needed to understand the Transformer architecture are defined below.

Vector embeddings

are ways to represent data in a compact, binned (grouped) manner. For instance, if a language had a million words and each word were its own element, a vector would be one million elements long (the length of a vector is the number of indices it has). Words can instead be grouped by meaning, or by some unknown aspects, to produce far fewer than a million indices. For example, instead of a vector of [dog, cat, fence, sidewalk] with 4 indices, one could make a vector of [[dog or cat], fence, sidewalk] with 3 indices, where each element is a placeholder for the existence of that object. Text grouped into such a "compressed" vector is an embedding. Efficient embeddings may have latent (hidden) groupings that are not easy to interpret, and they may share the same vector space with objects of different modalities (e.g. image, text, audio, or measurement embeddings that live in the same vector embedding space).
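
As a toy illustration (the vectors below are invented, and real embeddings have hundreds of dimensions), words with related meanings end up near each other in the embedding space:

```python
import numpy as np

# Invented 3-dimensional embeddings; real models learn these vectors during training.
embeddings = {
    "dog":      np.array([0.9, 0.1, 0.0]),
    "cat":      np.array([0.8, 0.2, 0.1]),
    "fence":    np.array([0.1, 0.9, 0.3]),
    "sidewalk": np.array([0.0, 0.8, 0.5]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1 means similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["dog"], embeddings["cat"]))    # high: related meanings
print(cosine_similarity(embeddings["dog"], embeddings["fence"]))  # low: unrelated meanings
```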

Distances

are differences between a model's output representation and a representation of the ground truth (or a proxy for it), and they are part of what is needed to calculate model objectives efficiently. In the training stage, the model is an optimization problem whose objective is a function to minimize. Objectives built on losses such as cross-entropy, K-L divergence, or Jensen-Shannon divergence, as well as contrastive learning, can be used.
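
For example, a minimal cross-entropy calculation between a model's predicted next-word distribution and a one-hot ground-truth distribution (the numbers are invented for illustration):

```python
import numpy as np

def cross_entropy(predicted, target):
    """H(target, predicted) = -sum(target * log(predicted)); lower is better."""
    return float(-np.sum(target * np.log(predicted + 1e-12)))

# Model's predicted probabilities over a 4-word vocabulary (invented numbers).
predicted = np.array([0.7, 0.1, 0.1, 0.1])
# Ground truth: the correct next word is the first one.
target = np.array([1.0, 0.0, 0.0, 0.0])

print(cross_entropy(predicted, target))  # ~0.357
```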

Sequence

is an ordered collection of elements.

Encoder-Decoder Architecture

The encoder maps the input representation sequence { x_1, x_2, …, x_n } to a latent sequence { z_1, z_2, …, z_n } of continuous representations. The sequence z is the input to the decoder, which generates output symbols { y_1, y_2, …, y_m } one at a time.

Feed Forward Network

is a network composed of a series of one or more layers that accept input and output a result. A layer multiplies the input by a weight matrix, adds a bias vector, and then may pass the result through an activation function to produce the layer's output. The activation function makes it possible to express non-linearity. The output of one layer is the input of the next layer.

Position-wise Feed Forward Network

is a Feed Forward Network with 2 layers. For each position in the sequence separately and identically, the network layers are defined as FFN(x)= max( 0, x· W1 + b1 ) · W2 + b2
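
A small numpy sketch of that formula, applied identically at every position (the dimensions here are toy sizes, not the d_model=512 and d_ff=2048 of the original paper):

```python
import numpy as np

d_model, d_ff, seq_len = 8, 32, 5          # toy sizes for illustration
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(x):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))    # one toy input vector per position
print(position_wise_ffn(x).shape)          # (5, 8): same shape as the input
```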

Attention

determines the importance of each component in a sequence relative to the other components in another sequence or that same sequence (self-attention). In context of a transformer, the components of the sequence are tokens.

Scaled Dot Product Attention

is the attention computation used in the Transformer (when the queries, keys, and values all come from the same sequence, it is self-attention). An attention output is calculated from query, key, and value vectors: the query-key dot products are scaled by the square root of the key dimension, passed through a softmax, and used to weight the values, i.e. Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V.
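
A minimal numpy sketch of scaled dot-product attention over a toy sequence (the sizes are invented for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores)            # each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                      # toy sizes
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```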

Multi-Head Attention

is an extension of the attention mechanism. The "heads" are multiple, independent attention calculations performed in parallel. The input is partitioned into several parts, each processed through a different attention head. In addition to being an efficient mechanism, Multi-Head Attention allows each head to learn a different aspect of the relationships. For example, one head might learn to attend to grammatical relationships while another attends to semantic relationships.
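
A toy, self-contained sketch of the idea: split the model dimension into heads, run the same attention per head, and concatenate the results (the learned per-head projection matrices of the real Transformer are omitted for brevity):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, as in the previous sketch."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, num_heads):
    """Split the last dimension into heads, attend per head, then concatenate."""
    d_head = Q.shape[-1] // num_heads
    heads = [attention(Q[:, h * d_head:(h + 1) * d_head],
                       K[:, h * d_head:(h + 1) * d_head],
                       V[:, h * d_head:(h + 1) * d_head]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(multi_head_attention(Q, K, V, num_heads=2).shape)   # (4, 8)
```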

Residual Network Block

is a block that uses shortcut connections, a.k.a. skip connections, to allow a model to essentially skip a layer without creating a discontinuity in the gradients during back propagation. The block's output is F(x) + x, where F(x) is the result of the block's own layers and x is the original input to the block. If a layer is not helpful, the model learns to make F(x) close to 0. The shortcut connection creates a direct path for gradients to backpropagate to earlier layers, helping to prevent vanishing gradients.

Weight Tying (sharing)

is the sharing of a matrix between different components of the model in order to reduce the number of free parameters that the model must learn.

Linear Layer

in a neural network is a layer that multiplies the input by a weight matrix and adds a bias vector.

Softmax Layer

is a layer that applies the softmax operator to its input, producing outputs that are normalized (non-negative and summing to 1) and so can be interpreted as probabilities.

The Transformer

is a sequence-to-sequence model, and the original transformer is composed of the above components.
Figure 1 (from Vaswani et al., NIPS 2017) shows that the canonicalized, tokenized inputs are transformed into embeddings and given positional encodings. The inputs then pass through an encoder-decoder model. Both the encoder and decoder use stacks of blocks composed of multi-head attention, feed-forward, and skip-connection layers, with normalization between blocks.
In the figure, Nx denotes N=6 identical layers in the encoder and N=6 identical layers in the decoder.

Encoder:

Each of the N=6 identical layers has a multi-head self-attention sub-layer followed by a simple position-wise fully connected feed-forward network sub-layer. Each sub-layer is wrapped in a residual block whose output is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by that specific sub-layer and LayerNorm is a normalization function. All sub-layers and embedding layers produce outputs of the same dimension.
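
A schematic numpy sketch of this composition, showing only the LayerNorm(x + Sublayer(x)) wiring; the lambda sub-layers below are stand-ins for the real learned attention and feed-forward modules:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention_sublayer, feed_forward_sublayer):
    """One encoder layer: each sub-layer is wrapped as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention_sublayer(x))
    x = layer_norm(x + feed_forward_sublayer(x))
    return x

# Stand-in sub-layers just to show the wiring; real sub-layers are learned modules.
x = np.random.default_rng(0).normal(size=(4, 8))
out = encoder_layer(x, lambda h: h * 0.5, lambda h: h * 0.5)
print(out.shape)  # (4, 8): same shape in and out, as required for stacking layers
```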

Decoder:

Each of the N=6 identical layers contains the same two sub-layers as an encoder layer, including the residual blocks and normalization, plus a third sub-layer that performs multi-head attention over the encoder stack's output. The decoder's self-attention sub-layer is masked: masking prevents positions from attending to subsequent positions. The output embeddings are also offset by one position, so, combined with masking, the predictions for position i can depend only on the known outputs at positions less than i.

The original transformer uses Multi-head Attention in three ways: in encoder-decoder attention, where the queries come from the previous decoder layer and the keys and values come from the encoder output; in the encoder's self-attention, where queries, keys, and values all come from the previous encoder layer; and in the decoder's masked self-attention, where each position may attend only to earlier positions.

Characteristics that Vaswani et al. improved with their Transformer architecture:

  1. reduced the total computational complexity per model layer
  2. increased the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required
  3. shortened the path length between long-range dependencies in the network, measured by the maximum path length between any two input and output positions, which makes those dependencies easier to learn.

LLMs

A list of modern LLMs.

There are encoder, decoder, and encoder-decoder variants of LLM architectures, but the most recent are based on decoder-only transformers, often in a Mixture of Experts architecture.
In addition to the large LLM foundation models, each of the cloud vendors and major LLM providers makes task-specific derivations of their models too. From Google there are the Med-PaLM 2, Codey, Imagen, Veo, AudioLM, and AudioPaLM models and/or APIs. From Amazon there are task-specific derivations of the Nova and Titan models, including APIs like Amazon Comprehend Medical and Amazon Q. From Microsoft there are task-specific derivations of Copilot, including APIs such as Medical Diagnostic AI, AI for Legal, Copilot for Finance, and more. From Meta AI there are Thinking LLMs and the LLM Compiler. From Anthropic there are task-specific derivations of Claude. There continue to be many other new LLM models, some findable on Hugging Face.

LLMs in general handle tasks such as text classification, question answering, document summarization, and text generation (including language translation). Increasingly, more task-specific LLMs are being made; task-specific agents follow from those, and multi-agent systems, in which agents interact with one another to accomplish tasks, follow in turn.

(2) Lifecycle

The first steps in building a Gen AI application are defining the goal of the project, finding a stable, robust source of data for it, and building the smallest working version to explore its feasibility. By the way, the contest requirement was to use generative AI as a uniquely valuable tool in an application that would otherwise be less capable of accomplishing its real-world goals. For this I created a clinical trials assistant from multiple gen-AI agents.
To create the prototype, I first hand-picked published clinical trial articles that contained complex medical terms. Using Google AI Studio for text summarization with the Gemini-2-*, Gemini-1-* and Gemma-3-* models gave very good results that did not require a medical background to understand. The prompts I used were simple, instructive zero-shot prompts. I used deterministic generation settings (temperature=0) with all agents. Neither model fine-tuning nor parameter-efficient tuning was deemed necessary for this prototype.

Standard software requirements, analysis, design and implementation were followed with additional elements added for generative AI architecture.

Details of the gen AI lifecycle are here (also in docs/lifecycle in the github repository).

  1. Requirements (functional and non-functional):
    • Capstone capabilities: At least 3 of the gen AI capstone capabilities must be included.
    • Agents: text-capable LLMs. All need to be able to reason; one needs to be able to perform document summarization; one (preferably a different model than the summarizer) needs to be able to evaluate the summarization; and one needs to be able to recognize valid disease names.
    • Data: robust, stable API sources
    • Deployment/Serving/Client:
      • Cloud hosted LLMs supplied by Google.
      • Stubs for cloud based logging.
      • The code must run from start to finish within a Kaggle notebook. The assistant must be an interactive application, and it also needs an option to run automatically (non-interactively).
      • Ease of porting to mobile environments is kept in mind.
    • Prompts
      • Zero-shot instructions for all tasks
    • Logging, Monitoring:
      • latencies for all API and LLM invocations
      • size of data: number of input and output tokens
      • drift of data: watching for APIs returning data different than expected
      • errors
      • user feedback
    • Protect User from harmful content
    • Ensure regulations for privacy and other guidelines are followed
    • Latencies: Each response to the user should be less than 3 seconds ideally.
    • QPS:
      • Rate limiting for API requests. The requests go directly from the client Kaggle notebook to the APIs, so rate limiting should be implemented client-side. Model choices that handle scale well within the budget should be made.
  2. Analysis and Design
    • Data I/O assessment
      • data to and from the NIH APIs is small
      • data to and from the Google LLM APIs is small
    • Choice of LLMs w/ preference for smaller models, no need for image or audio in this prototype
      • Gemini-1.5-flash and smaller models: SOTA; handles long context well; able to follow complex instructions and complete complex tasks; document understanding; can take system instructions specifically; can output results in a structured format; scales well; use is free up to rate and data limits, after which costs apply.
      • Gemma: 128k context window constraint; able to follow complex instructions and complete complex tasks; document understanding. It is an open model that performs very well, though it has fewer abilities than the Gemini models. Rates and data sizes are free, but the model might have scale limits.
    • Function Calling: client side methods to build and test
      • query to user for disease name w/ option to exit
        • Google LLM agent to validate the disease name
      • retrieve clinical trials
        • API request to clinicaltrials.gov, parsing, logging
      • query to user for trial selection w/ option to exit
      • query to user for citation selection w/ option to exit
        • API request to the NIH NLM's PubMed, parsing, logging
      • article results summarization
        • Google LLM agent to summarize text
        • Google LLM agent to evaluate summarization
        • parsing, logging
    • Gen AI orchestration layer: langgraph
    • Integrity of data: best practices are followed in using APIs. The APIs themselves and Kaggle follow secure practices.
    • Integrity of logs: Kaggle environment is session based. cloud logging stubs are made but not implemented so no concerns there.
    • Protection of User from Harmful content
      • User Feedback is requested and logged. Additionally, the Kaggle notebooks have a messaging environment where users can ask questions or leave comments.
    • Regulations: GDPR, CCPA, and other guidelines are implicitly followed because no PII is requested nor stored.
    • Prompts
      • Versions: prompts are stored in language and version directories to allow mixing of components while experimenting with improvements for the application.
    • Logging, Monitoring
      • implemented in client locally w/ stubs for remote aggregation in the cloud
    • Source version control in github
    • Development tools were the Kaggle notebook and the JetBrains Pycharm IDE
  3. Implementation (see The Gen AI Application below)

The Gen AI Application

The application is hosted in a Kaggle notebook here. LangGraph was used to orchestrate the function calls (a.k.a. tools) via client-side invocations (client-side rather than LLM-initiated invocations, to reduce token use). A sequential planner pattern with conditional cycles was used for the workflow. A node was created for each function, with conditional edges from each node either to the next node in the sequence or to an exit when the user requests it; a minimal sketch of this graph appears after the node list below.
  1. node: user_input_disease
  2. node: fetch_trials
  3. node: user_choose_trial_number
  4. node: user_choose_citation_number
  5. node: fetch_abstract
  6. node: llm_summarization
  7. node: feedback_query
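
A minimal LangGraph sketch of this pattern, wiring up the first two nodes and a conditional "quit" edge; the state fields and helper functions here are illustrative, not the exact ones used in the notebook:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AssistantState(TypedDict, total=False):
    disease_name: str
    quit: bool

def user_input_disease(state: AssistantState) -> AssistantState:
    text = input("Please enter a disease to search for (q to quit): ").strip()
    return {"disease_name": text, "quit": text.lower() == "q"}

def fetch_trials(state: AssistantState) -> AssistantState:
    # placeholder for the ClinicalTrials.gov request, parsing, and logging
    print(f"fetching completed trials for {state['disease_name']} ...")
    return state

def continue_or_exit(state: AssistantState) -> str:
    return "exit" if state.get("quit") else "continue"

graph = StateGraph(AssistantState)
graph.add_node("user_input_disease", user_input_disease)
graph.add_node("fetch_trials", fetch_trials)
graph.set_entry_point("user_input_disease")
graph.add_conditional_edges("user_input_disease", continue_or_exit,
                            {"continue": "fetch_trials", "exit": END})
graph.add_edge("fetch_trials", END)   # the real graph continues through the remaining nodes

app = graph.compile()
app.invoke({})
```
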
When the application starts, the user is presented with a welcome message and then a query for the disease name.

[to user]
"This librarian searches clinical trials for completed, published results and summarizes the results for you." "Please enter a disease to search for (q to quit at any time):"


The disease name checker is an instance of ChatGoogleGenerativeAI with a chosen LLM and a deterministic temperature of 0. The disease name checker is given the prompt below along with the disease name, and returns a response that includes metadata such as the number of input and output tokens:

[to LLM agent]
"You are a librarian at the National Institutes of Health. Do you recognize the words {disease_name} as a valid disease name? Answer yes or no."


The trials are retrieved from the US National Library of Medicine's Clinical Trials database and presented to the user.
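
As a rough sketch of the retrieval steps described in this and the following paragraph (assuming the ClinicalTrials.gov v2 REST API and the NLM E-utilities; the parameter names and response fields shown here are illustrative, not the exact ones used in the notebook):

```python
import requests

def fetch_completed_trials(disease_name, page_size=10):
    """Query ClinicalTrials.gov (v2 API) for completed trials about a disease."""
    resp = requests.get(
        "https://clinicaltrials.gov/api/v2/studies",
        params={"query.cond": disease_name,
                "filter.overallStatus": "COMPLETED",
                "pageSize": page_size},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("studies", [])

def fetch_pubmed_abstract(pmid):
    """Fetch a plain-text abstract for one PubMed citation via the E-utilities."""
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
        params={"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.text
```
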
The user selects a trial, and citations of published trial results are then presented to them. The user selects a result from the list, and the abstract of that result is retrieved from the US National Library of Medicine's PubMed database. A prompt with text-summarization instructions is given to the document summary agent, which has been configured for a deterministic response.

[to LLM agent]
"Summarize this text in simple terms in a serious tone."


The agent's response is presented to the user. The response is also asynchronously evaluated by another agent, which is given a somewhat lengthy prompt of instructions, and the evaluation is logged. The user is asked if they would like to submit feedback; if so, they are presented with a couple of questions with itemized choices, and their responses are logged. Lastly, the user is asked if they would like to make another query.

And that is the prototype. Thanks to Google for sponsoring this 5-day intensive course on Generative AI and providing great examples and resources for us to use!

References