For a while, I’ll be analyzing a healthcare dataset here, one that I believe holds valuable insights waiting to be uncovered. But this process isn’t just about building powerful machine learning or deep learning models. Of course, those will be part of it, but my main goal is to start with raw data that can address a real-life problem, understand it, process it, explore it, create both simple and visually appealing plots, perhaps generate synthetic data, engineer new features, build various models, evaluate them, and make them explainable with additional analyses. The process should also serve as a helpful template for approaching any health dataset.
Beyond that, I want to experiment with parameter tuning; not just using common methods, but also exploring alternative techniques like Bayesian optimization. A simple interactive page for playing around with the data and visualizations might even be developed, but I don’t promise that.
I see this as a multifaceted healthcare data journey, blending exploration, visualization, modeling, and interactive experimentation.
In this journey, I will walk you through the analysis of the dataset ‘Diabetes 130-US Hospitals for Years 1999-2008’. It includes a comprehensive collection of clinical records spanning a decade (1999–2008) from 130 U.S. hospitals and integrated delivery networks.
For a long time, I’ve been really interested in Electronic Health Records (EHRs), Real-World Data (RWD), the evidence that can be derived from them, and how they shape modern healthcare. That’s why I decided to give this dataset a shot. However, it caught my attention not only for its RWD relevance but also for its simplicity, making it an excellent resource for learning and practicing.
This dataset is actually a subset of the Health Facts National Database (Cerner Corporation, Kansas City, MO), which contains vast and diverse clinical records collected from hundreds of hospitals. Each row in the dataset represents an encounter (a hospital admission / visit) of a patient diagnosed with diabetes, along with details such as demographics, diagnoses, procedures, laboratory test results, and prescribed medications. We can also see whether a patient was readmitted to the hospital and, if so, when. You can find further details regarding the dataset here.
If you’d like to gain more insights into the dataset and explore possible analytical approaches before we dive in, you can check out the following two papers:
As I mentioned, this dataset is a subset of a larger database, filtered based on the following five criteria:
First of all, this dataset is valuable for understanding hospitalization trends among patients diagnosed with diabetes. By asking the right questions and exploring the data, we can extract meaningful insights; this is my primary goal.
Later, we can attempt to develop machine learning or deep learning models to predict readmissions or hospitalization outcomes, particularly in terms of mortality risk.
In most cases, datasets like this are used to predict readmissions. If we can successfully predict them, it could help reduce readmission costs, optimize resource allocation, or enable early interventions. Interestingly, readmission prediction also somewhat overlaps with mortality prediction.
This is an exciting and meaningful problem to work on. However, my first priority is to uncover insights from the data. I’ll revisit the possible business contexts and the readmission and mortality prediction challenges in greater detail before diving into machine learning model development.
A common mistake I’ve observed among data scientists is diving straight into model development without fully understanding the data. At first glance, a dataset may seem simple, but studying the domain behind the data is invaluable and may reveal interesting data transformation ideas that have never been thought of before. Each field has its own hidden complexities, established best practices, or standard procedures. Being aware of these domain-specific nuances allows for more meaningful and efficient analyses.
Examining each attribute (in each table, if applicable) might seem tedious, but its contribution to overall awareness is immensely valuable. That’s why I’ll go through each attribute one by one and categorize them even before looking at the data, considering possible insights and making simple inferences, even if they later prove incorrect. This approach doesn’t just reinforce understanding; it deepens it.
In the rest of this article, I’ll be examining each feature in detail. Writing the post may take a full day, but once this initial analysis becomes a habit, it can be completed in just 10–15 minutes before starting any project.
That’s it. Let’s dive in.
Here is the summary of the features:
The dataset includes two key identification features:
Brainstorming: I expect to see the same patient_nbr appear in multiple rows. Why? Because a patient can visit the hospital more than once. This ID must be unique to each patient but not unique across the entire dataset since the same patient may have multiple hospital visits.
On the other hand, encounter_id should always be unique because each hospital visit happens only once in our space-time continuum (unless we crack time travel). This makes encounter_id a great choice for the main index of the dataset. We could use it as the primary key of this table if this were a database.
Or, if I see that encounter_id values are assigned in ascending order (or descending, or in some other pattern) over time, we can use them to determine the chronological order of a patient’s visits, especially since the dataset doesn’t include a ‘date’ feature. For machine learning purposes, it may not be a very useful feature, but exploring how patients’ diagnoses, procedures, or medications change over time might be a good exercise in the exploratory analysis phase when we create stratified bioprofiles.
I’m not sure yet; I need to analyze the data first to see whether those assumptions hold.
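Once the data is loaded, a few lines of pandas can test these assumptions. Here is a minimal sketch (the file name diabetic_data.csv is my assumption; adjust it to wherever you saved the download):

```python
import pandas as pd

# Assumed file name; adjust the path to your local copy of the dataset.
df = pd.read_csv("diabetic_data.csv")

# encounter_id should be unique: one row per hospital visit.
assert df["encounter_id"].is_unique, "duplicate encounter_id values found"

# patient_nbr should repeat whenever a patient had multiple encounters.
visits_per_patient = df["patient_nbr"].value_counts()
print(f"{(visits_per_patient > 1).sum()} patients have more than one encounter")

# If encounter_id turns out to be assigned chronologically (still to be verified),
# sorting by it gives a rough visit order per patient despite the missing date field.
df = df.sort_values("encounter_id").set_index("encounter_id")
```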
The dataset includes race, gender, and age as demographic features.
Brainstorming: I expect to see different age values for the same patient_nbr across multiple rows. Why? Because a patient may visit the hospital in different years, say in 2001, 2005, and 2008, and their age should naturally increase over time, say from 65 to 72. But what about race and gender? From a data quality perspective, these values should remain consistent across all rows for the same patient. Of course, gender changes due to transition surgeries are a separate and complex topic, which I won’t dive into here, but it’s definitely worth exploring another time. Unknown gender values may also be related to that.
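Here’s how such a consistency check might look, continuing with the df loaded above (treating ‘?’ as a missing value is an assumption to verify against the raw file):

```python
import pandas as pd

# For each patient, race and gender should take exactly one distinct value
# across all encounters; age is allowed (and expected) to change between visits.
demo = df.replace("?", pd.NA).groupby("patient_nbr")[["race", "gender"]].nunique()
inconsistent = demo[(demo > 1).any(axis=1)]
print(f"{len(inconsistent)} patients with conflicting race/gender records")

age_variation = df.groupby("patient_nbr")["age"].nunique()
print(f"{(age_variation > 1).sum()} patients whose recorded age changed across visits")
```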
The only physical feature in the dataset is weight.
In addition to encounter_id and patient_nbr, the dataset includes three other types of IDs: admission_type_id, admission_source_id, and discharge_disposition_id. Despite having ‘ID’ in their names, these are not unique identifiers for observations (rows), or patients. Instead, they serve as practical codes representing different admission types, sources, and discharge categories. So, again, if this were a table in a relational database, those IDs would be foreign keys coming from other tables.
If you download the dataset from the source I shared above, you’ll find a file called mapping.csv in the folder. This file provides a clear description of what each code represents, making it easier to interpret these features.
Brainstorming: Since the dataset focuses on patients with diabetes, I would expect to see a very low number of newborn admissions. Instead, emergency and urgent admissions probably dominate the dataset as admission types, due to the wide range of diabetes-related complications. Elective admissions may follow as the second most common type.
Because of the factors above, most patients are probably admitted through the emergency room. For elective admissions, physician referrals must be the most common route.
As for discharge outcomes, I expect that most patients are sent home after recovery. However, for elderly patients with urgent conditions, mortality or emergency transfer to another hospital as an outcome is also a possibility.
Of course, these are just my assumptions as a bioinformatician and data scientist. I’m not a doctor or a healthcare facility manager, and I’m not exactly aware of the trends. I may be completely wrong, but honestly I don’t care, because brainstorming is really fun and enjoyable.
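These guesses are easy to check with simple frequency counts once the data is in memory; a quick sketch, reusing the df from earlier:

```python
# Distribution of the coded admission/discharge features; the numeric codes
# are decoded in the mapping file shipped with the dataset.
for col in ["admission_type_id", "admission_source_id", "discharge_disposition_id"]:
    print(f"\n--- {col} ---")
    print(df[col].value_counts().head(10))
```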
There are three interesting features that represent a patient’s admission history leading up to the observed admission.
Brainstorming: Based on my previous assumptions, I expect many patients to have high values for number_emergency, as emergency visits are common among individuals with diabetes. In other words, if a patient’s admission_type_id is 1 (emergency), it’s highly likely that their number_emergency value will also be high in the same row.
We can apply similar reasoning to other admission types. For example, if admission_type_id is 3 (elective), then we might expect higher previous outpatient visits (number_outpatient) compared to inpatient visits (number_inpatient). However, the reality is probably more complicated than this, and I have my doubts about that assumption.
Another interesting pattern might emerge when considering age. If a patient is over 60 (you could argue for 50, or someone else might prefer 70 as the threshold), their emergency visit history (number_emergency) is likely to be higher, and this may also lead to a higher number of hospitalizations (number_inpatient) due to age-related complications. For the same patients, we’re also likely to see mortality more frequently (discharge_disposition_id = 11, 19, 20, or 21).
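A rough way to test these visit-history hypotheses is to average the prior-visit counters per admission type and per age bracket; a sketch along these lines (the code-to-label mapping comes from the mapping file):

```python
# Do emergency admissions (admission_type_id == 1 per the mapping file) come
# with a heavier history of prior emergency visits?
history_cols = ["number_outpatient", "number_emergency", "number_inpatient"]
print(df.groupby("admission_type_id")[history_cols].mean().round(2))

# And the age hypothesis: average prior emergency visits per age bracket.
print(df.groupby("age")["number_emergency"].mean().round(2))
```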
These features are the ones generated while a patient is being treated during the visit / encounter.
Brainstorming: If a row has admission_type_id = 3 (elective), I would typically expect time_in_hospital to be 1 day. Why? Because elective admissions are usually for daily treatments or procedures that don’t require an overnight stay. However, if the admission type is emergency or urgent, the patient is more likely to be hospitalized for a longer period, meaning time_in_hospital should be greater than 1 day. One more guess from me: I think trauma-center-related admissions (admission_type_id = 7) probably have the longest hospital stays. We will be able to find out when we load and start exploring the data. And as you might have guessed, time_in_hospital should never be less than 1 day; otherwise it would indicate a data quality issue. The maximum value can be any number greater than 1 day.
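Both the data quality rule and the length-of-stay guesses translate into one-liners; a sketch, assuming the df loaded earlier:

```python
# time_in_hospital should never drop below 1 day; anything else is a quality flag.
assert df["time_in_hospital"].min() >= 1, "time_in_hospital below 1 day found"

# Median stay per admission type: are elective stays really the shortest,
# and do trauma-center admissions (admission_type_id == 7) really run longest?
print(df.groupby("admission_type_id")["time_in_hospital"].median())
```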
Brainstorming: It’s possible that patients who frequently see certain specialists have more hospital admissions and longer hospital stays. I’m not sure what other specialties are included in the dataset beyond those listed in the documentation, but ‘surgeon’ should definitely be one of them. If a patient undergoes surgery, it’s very likely that they will stay in the hospital overnight or longer.
Additionally, this feature might be associated not only with hospitalization length, but also with admission type. For example, if I visited a general practitioner, my admission type probably wouldn’t be emergency or urgent; it would more likely be elective. If the medical specialty is ‘emergency doctor’ (I don’t know yet whether it exists in the dataset), the admission type and source should definitely be classified as emergency or urgent. All of these would be valuable checks to include in a data quality framework.
Here we are. Diagnosis codes are a crucial part of any Electronic Health Record (EHR). In this dataset, diagnosis values come from the ICD-9 (International Classification of Diseases, 9th Revision) code set, which is one of the standard international diagnosis classification systems and is mostly used in the U.S. Other countries may use different coding systems.
Today, ICD-11 has superseded ICD-9, but since this dataset was created years ago, we’ll be going with ICD-9. Mapping all the ICD-9 codes in this dataset to ICD-10 or ICD-11 could actually be an interesting exercise. Ah, but that’s a story for another day.
Normally, ICD-9 codes consist of three to five numeric or alphanumeric characters. However, in this dataset, only the first three digits (the category level) are used. This means that diagnoses are recorded in broader categories, rather than in their most specific forms.
Let me give an example: the ICD-9 category 250 covers diabetes mellitus, while more specific codes such as 250.1 (diabetes with ketoacidosis) or 250.2 (diabetes with hyperosmolarity) add digits after the decimal point. When only the first three digits are kept, all of these collapse into the single category 250.
Since all patients in this dataset are diagnosed with diabetes, you might assume that every diagnosis code would be 250. However, keep in mind that these individuals can have additional conditions, not only diabetes-related issues. That’s why the dataset contains hundreds of unique ICD-9 codes spread across the following three columns:
Brainstorming: Of course, these diagnosis codes must be associated with specific patterns within other features (admission type and source, discharge type, time in hospital, etc) as I discussed above for the others. For instance, if an ICD-9 code is related to a heart condition, I would expect the admission to be an emergency, likely leading to hospitalization and a high potential for readmission. On the other hand, if the code corresponds to a minor illness, such as a common cold, the admission is more likely to be elective, resulting in a home discharge with a low likelihood of readmission.
However, to me, the most valuable aspect of having three separate diagnosis columns is the opportunity to analyze comorbidities. In real-life RWE datasets and medical databases, a single patient encounter can have more than 10 different diagnosis codes recorded in the same row. Analyzing how such different disease diagnoses co-occur allows for deeper insights into health trends, risk factors, and disease progression. Since this is a small dataset, we only have three diagnosis fields, which is still workable. While this limits some analyses, it may still provide valuable opportunities to explore hospitalization patterns, readmission trends, and key statistical insights related to patient care. It would be nice to try and showcase some basic trends. Ultimately, this dataset wasn’t designed to store complete patient histories. It mainly focuses on hospitalization and readmission trends for patients with diabetes.
When it comes to number_diagnoses, I have some doubts, since this is such a small dataset. This feature is supposed to represent the total number of non-empty diag_x fields recorded for an encounter. For example, if diag_1 and diag_2 contain ICD-9 codes while diag_3 is NULL, then number_diagnoses would be 2, since two diagnoses were entered into the system for that visit. Logically, the minimum value should be 1, occurring when only one diagnosis is recorded and the other two fields are NULL. What about the maximum value? Is it 3? Maybe, but not necessarily. In this dataset, we only see three diagnoses, but in reality a single encounter could have more than three recorded diagnoses, and the ones beyond the first three may simply not be represented here. Therefore, the maximum value in this field could be capped at 3 or could go well beyond it, depending on how the dataset was structured. We’ll find out when we examine the data.
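The consistency between number_diagnoses and the three diag_x fields is easy to probe; a sketch, assuming missing diagnoses are coded as NaN or ‘?’ (to be confirmed against the raw data):

```python
import pandas as pd

diag_cols = ["diag_1", "diag_2", "diag_3"]

# How many distinct category codes appear across the three diagnosis fields?
unique_codes = pd.unique(df[diag_cols].astype(str).values.ravel())
print(f"{len(unique_codes)} distinct codes across diag_1..diag_3")

# Count the non-empty diag_x fields per encounter and compare with number_diagnoses.
diag_filled = df[diag_cols].replace("?", pd.NA).notna().sum(axis=1)
exceeds = (df["number_diagnoses"] > diag_filled).mean()
print(f"number_diagnoses exceeds the visible diag fields in {exceeds:.1%} of rows")
print(df["number_diagnoses"].describe())
```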
In RWE datasets, we typically see procedure codes similar to diagnosis codes. It’s common to find ICD-9 and ICD-10 used for procedures as well. Additionally, the CPT (Current Procedural Terminology) code set is another widely used standard for documenting medical procedures.
For laboratory tests, LOINC (Logical Observation Identifiers Names and Codes) is a common standard.
However, in this dataset, procedure and lab test codes are not included. Instead, we only have the number of procedures.
Brainstorming: I’m really curious whether the total number of procedures is higher for urgent cases or elective (scheduled) ones.
My guess is that when a patient is admitted with urgent complaints, doctors may order multiple procedures to quickly diagnose the issue; such as blood and urine tests, CT scans, echocardiograms, EKGs, or X-rays. The number may be reasonably high.
However, for elective admissions with major surgeries, hospitalization tends to be longer, meaning that routine procedures may be repeated daily throughout the stay on top of the initial batch of procedures. On the other hand, for elective admissions involving minor treatments, fewer procedures are likely, since the patient is discharged on the same day.
Based on these assumptions, I would expect the number of procedures to follow this descending order:
A similar brainstorming approach can also be applied to the diagnosis codes and the specialty of the admitting physicians. Certain departments within a facility may be likely to order many procedures while others order fewer, right? Additionally, who pays the bills? Does it influence the number of procedures? And overall, how might the number of procedures affect the readmission trend?
What do you think? Let’s see the bets!
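Once the data is loaded, the bet is a single groupby away; a sketch (admission type labels taken from the mapping file):

```python
# Mean procedure counts per admission type (1 = emergency, 2 = urgent,
# 3 = elective, 7 = trauma center, per the mapping file).
print(
    df.groupby("admission_type_id")[["num_lab_procedures", "num_procedures"]]
      .mean()
      .round(2)
      .sort_values("num_procedures", ascending=False)
)
```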
Hah, here’s the answer to the big question: Who pays the bills?
Brainstorming: When we examine the data, we may well find that length of stay and number of procedures are influenced by the payer. Depending on what a patient’s insurance covers, these values may increase or decrease.
If you’re paying out of pocket, you might try to avoid non-essential procedures or even shorten your hospital stay to reduce costs. But if your insurance policy is generous, you’re probably more willing to say yes to everything the hospital recommends, right?
Let’s find out which payer is the most generous when we load the data!
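Here’s a sketch for that question too, grouping by payer_code and again treating ‘?’ as missing (an assumption to verify against the data):

```python
import pandas as pd

# Does the payer go hand in hand with longer stays and more interventions?
by_payer = (
    df.replace("?", pd.NA)
      .groupby("payer_code")[["time_in_hospital", "num_procedures", "num_medications"]]
      .mean()
      .round(2)
      .sort_values("time_in_hospital", ascending=False)
)
print(by_payer)
```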
Medications can be administered during any encounter, and this brings us to these key features related to that:
Brainstorming: I’d expect the number of medications to increase with certain diagnosis codes, physician specialties, and length of stay (time_in_hospital). For example, hypertension, cardiovascular disease, or heart failure may be associated with a higher number of administered medications.
For the features above, a change can only occur if diabetesMed is yes and at least one of the medications has a value of either up or down. Otherwise, it should always be recorded as no change. In the same way, diabetesMed can only be yes if at least one of the medications has a value other than no.
For instance, if we see a change entry while diabetesMed is no, that indicates a data inconsistency or error. Let me illustrate with an example table; let’s suppose we have only two different diabetes medications: insulin and metformin. Below, the rows for id_3 and id_4 have something wrong. I hope there aren’t such problems in the actual data.
| encounter_id | change | diabetesMed | insulin | metformin | num_medications |
|---|---|---|---|---|---|
| id_1 | no change | yes | steady | no | 1 |
| id_2 | change | yes | no | down | 1 |
| id_3 | change | no | no | no | 0 |
| id_4 | no change | yes | no | no | 0 |
| id_5 | change | yes | steady | steady | 2 |
What about id_5? At some point, the medication may have changed from insulin to metformin, triggering a change entry in the change column. However, in that case, I would expect the insulin column to be marked as down instead of steady, since its dosage effectively dropped to zero. Therefore, I suspect that id_5 is also an invalid record.
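The rules above can be turned into an automated consistency check. Here is a sketch for just insulin and metformin (the real dataset has more medication columns, and the exact value labels should be confirmed against the data; everything is lowercased here to stay label-agnostic):

```python
# Flag rows whose 'change' / 'diabetesMed' flags contradict the medication columns.
med_cols = ["insulin", "metformin"]  # extend with the remaining medication columns

meds = df[med_cols].apply(lambda s: s.str.lower())
any_active = (meds != "no").any(axis=1)                  # at least one drug prescribed
any_dose_change = meds.isin(["up", "down"]).any(axis=1)  # at least one dosage moved

diabetes_med = df["diabetesMed"].str.lower().eq("yes")
change_flag = ~df["change"].str.lower().isin(["no", "no change"])

# Rule 1: a change entry requires diabetesMed == yes and some up/down medication.
bad_change = change_flag & ~(diabetes_med & any_dose_change)
# Rule 2: diabetesMed == yes requires at least one medication other than 'no'.
bad_dmed = diabetes_med & ~any_active

print(f"suspicious 'change' rows: {bad_change.sum()}")
print(f"suspicious 'diabetesMed' rows: {bad_dmed.sum()}")
```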
Beyond that data quality perspective, as an ordinary person who is not a physician, I find myself asking a basic yet important question: What does a medication change (such as a dosage adjustment) really mean? Is it generally a bad sign, or could it sometimes indicate a positive outcome?
For example, if a patient is improving, a doctor might reduce their medication dosage. However, when I searched this topic on Google, I got the impression that medication changes are often associated with negative factors, as they are highly correlated with unstable or uncontrolled blood glucose levels. If blood sugar levels are not stable, frequent medication adjustments are needed, right? And the direction of the change may be either up or down in this case.
Additionally, if a patient experiences adverse effects from a medication, it might be switched to another type. Based on this, it seems that a change entry in the dataset is more likely to indicate a negative scenario. If that’s the case, it could be correlated with longer hospital stays, increased mortality rates, and frequent readmissions.
Again, if we had the number of medication changes recorded for each encounter, it would be easier to interpret the severity of the situation. I believe that a higher number of changes likely indicates a more severe condition, while fewer changes might suggest a greater chance of positive improvement. However, in this dataset, we only have a coarse indicator per medication (up, down, steady, or no) and a binary flag showing whether at least one change occurred during the encounter. Despite this limitation, I can still explore the data and test for correlations to see whether the sum of medication changes (downs or ups) is linked to factors such as hospital stay duration, readmission rates, or mortality risk.
Another interesting exploration could be testing the correlation between medication changes and lab test results. It’s possible that we consistently see a change when blood glucose levels are high (max_glu_serum), as elevated values likely indicate unstable glucose control. If that’s the case, medication adjustments might be a direct response to worsening metabolic stability.
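A simple crosstab can test that hunch once the data is loaded; a sketch (the exact labels of the glucose result categories should be checked in the data):

```python
import pandas as pd

# Share of encounters with a medication change, broken down by the glucose
# serum test result (row-normalised proportions).
print(pd.crosstab(df["max_glu_serum"], df["change"], normalize="index").round(2))
```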
This dataset focuses on diabetes-related hospitalizations, and glucose levels and A1C results are critical indicators for diabetes management. For this reason, we have the results of those tests.
Brainstorming: I suspect that higher Hemoglobin A1C and blood glucose levels may be associated with more severe cases requiring longer hospital stays. This could also lead to higher readmission rates, as diabetes-related complications may recur. Patients with elevated glucose levels likely receive more medications and procedures during their hospital stay.
I also have a strong feeling that higher A1C and glucose levels correlate with increased mortality rates.
And finally, another interesting question that comes to mind: What percentage of total admissions for patients with high A1C and glucose values are classified as emergency, urgent, or trauma center admissions? I feel this percentage is also high.
In the end, as usual, elevated blood values indicate that something isn’t going well.
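These guesses translate into a couple of crosstabs and groupbys; a sketch, reusing the expired-discharge codes mentioned earlier (11, 19, 20, 21, to be verified against the mapping file):

```python
import pandas as pd

# Readmission distribution per A1C result category (row proportions).
print(pd.crosstab(df["A1Cresult"], df["readmitted"], normalize="index").round(2))

# Length of stay and a rough mortality proxy per A1C result category.
expired = df["discharge_disposition_id"].isin([11, 19, 20, 21])
print(df.groupby("A1Cresult")["time_in_hospital"].mean().round(2))
print(expired.groupby(df["A1Cresult"]).mean().round(3))
```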
And finally, we arrive at the feature most commonly used as the target variable in classification models: readmission.
Brainstorming: Frequent readmissions are problematic for both healthcare facilities and patients, especially readmissions that happen within 30 days.
From a patient’s perspective, frequent readmission often indicates a more severe condition or that their previous care was inadequate.
From a facility’s perspective, it leads to higher resource consumption, reduced efficiency, and potential financial strain.
This dataset could be valuable for developing a model to predict readmissions in advance. As we explore the data, I will analyze the relationship between each feature and readmission, helping to identify the most valuable features heuristically before developing any models.
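As a heuristic first pass before any modelling, something like the following sketch could show how each feature moves with the readmission label (the column choices here are illustrative):

```python
import pandas as pd

target = df["readmitted"]

# Numeric features: compare group means across the readmission classes.
numeric_cols = df.select_dtypes("number").columns.drop(
    ["encounter_id", "patient_nbr"], errors="ignore"
)
print(df.groupby(target)[list(numeric_cols)].mean().round(2).T)

# A few categorical features: row-normalised crosstabs against readmission.
for col in ["admission_type_id", "A1Cresult", "change", "diabetesMed"]:
    print(f"\n--- {col} vs readmitted ---")
    print(pd.crosstab(df[col], target, normalize="index").round(2))
```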
This exercise was purely about thinking through the features using the data documentation without even seeing the actual data. Yet, we were still able to make estimations about relationships and trends. During this process, we gained a deeper understanding of the dataset’s structure, how it might have been built, potential data issues, and the strengths and weaknesses of our initial theories.
It might seem like overkill for a dataset like this, but I believe this practice is highly valuable before diving into hands-on data analysis.
In the next part, I’ll begin exploring the data, performing meaningful transformations, manipulations, and imputations, etc., all of which will help ensure a cleaner and more effective exploratory data analysis.
If you have any additional theories about the features, I’d love to hear your thoughts and extend the brainstorming further!
Featured Image: Clay Banks