Natural Language Processing Aimed at Improving Cancer Care


Objectives


To build computational framework that leverages NLP

- Structure unstructured notes

- Foundation to build Analytics Platform

Per Patient

- Extract the timeline of clinical events from notes using NLP

- Stage Info, Time of Tumor Progression and other clinical information

Unstructured Data

Workflow


Header Identification:

Section Identification:

Event Extraction:

Data Visualization:

Project Wrap-up:

Clustering & structure-based method

Rule-based method & classification

Obtain clinical events from sections

Flask web framework

Build an easy-to-use python library

Workflow

Note Clustering


Utilize the similarity in notes to build up a DBSCAN model


Measuring coherence in notes of each cluster

- Plotting number of sections in a given class

- Average Overlap Count with sections in class regex

Plot

Note Classification


Notes from same institute and department have similar

  (1) structure, (2) sections, (3) format


Structure-based Classification

  Objective

  - Group notes having similar structure

  - Classified 52% of notes into 19 classes

  Methodology

  - Strategically extract tokens from top section of notes

    (e.g. SURGEON)

  - Have chance to implicitly reveal department

Unstructured Note

Rule-Based NLP Method


Methodology

- Manually explore notes of a class and find possible sections

- Build sections dictionary, regular expressions

- Parse notes to extract section names

- Append section contents following each section name


Advantages

- Extract predictable sections with full accuracy from well-structured notes

- Successfully processed ~35% notes

Disadvantages

- Requires huge Initial Human Effort

- Fails to capture unpredictable sections

Unstructured Note

Machine Learning Method


Methodology


(1) Define output values

  - Ruled-based identified section headers: True

  - Other section candidates: False


(2) Extract section candidates

  - N-Grams: Too many candidates

    1,2,3-Gram tokens for a note of 1000 tokens, would yield 3000 n-grams

  - N-Grams beginning at each sentence: Still too many candidates

    1,2,3-Gram tokens for a note of 1000 tokens and 300 sentences would yield 300 n-grams

  - POS Tag Chunking: Computationally slow

    Extracts chunks of known sections and yields 30% of tokens of above method


(3) Conduct feature selection

  - Syntactic analysis

  - Semantic analysis


(4) Modeling

  - Section Identification -> Binary Classification

  - Model selection and evaluation

Unstructured Note

Pos Tag for Sections


Plot
Table

Feature Engineering


Note Features1 Features2 Features3

Model Selection and Estimation


Models: Decision Tree, Random Forest, Gradient Boosting and Logistic Regression

Imbalanced Class Problem

- Around 4 million observations

- About 4.3% observations are sections

- Down-sampling non-sections


Results for models using 70-30 train-test split (Training set accounts for 70% of the whole dataset).

Table

Clinical Event Extraction


Convert medical records to a JSON file

- Each patient has multiple medical records

- Each record has record type, header, date, and multiple sections

- Record type: Progress Notes, Discharge Summaries, H&P

- Most frequent section names: INDICATION, SUBJECTIVE, OBJECTIVE, PROCEDURE, Assessment/Plan, Care Management


Extract keywords from section contents

- e.g. extract Staging from Assessment; extract drug names from MEDICATIONS

- Append the events to the JSON file under each note accordingly

Dashboard


Dashboard Screenshot

Dashboard


Dashboard Screenshot

Dashboard


Dashboard Screenshot

Python Library


- Wrap up analysis modules


- Document the project using Sphinx


- Deploy the dashboard on server

Dashboard Screenshot

Use Case Diagram