For a while, I’ll be analyzing a healthcare dataset here, one that I believe holds valuable insights waiting to be uncovered. But this process isn’t just about building powerful machine learning or deep learning models. Of course, those will be part of it, but my main goal is to start with raw data that can address a real-life problem, understand it, process it, explore it, create both simple and visually appealing plots, perhaps generate synthetic data, engineer new features, build various models, evaluate them, and make them explainable with additional analyses. The process should also serve as a helpful template for approaching any health dataset.
Beyond that, I want to experiment with parameter tuning; not just using common methods, but also exploring alternative techniques like Bayesian optimization. A simple interactive page for playing around with the data and visualizations might even be developed, but I don’t promise that.
I see this as a multifaceted healthcare data journey, blending exploration, visualization, modeling, and interactive experimentation.
In this journey, I will walk you through the analysis of the dataset ‘Diabetes 130-US Hospitals for Years 1999-2008’. It includes a comprehensive collection of clinical records spanning a decade (1999–2008) from 130 U.S. hospitals and integrated delivery networks.
For a long time, I’ve been really interested in Electronic Health Records (EHRs), Real-World Data (RWD), the evidence that can be derived from them, and how they shape modern healthcare. That’s why I decided to give this dataset a shot. However, it caught my attention not only for its RWD relevance but also for its simplicity, making it an excellent resource for learning and practicing.
This dataset is actually a subset of the Health Facts National Database (Cerner Corporation, Kansas City, MO), which contains vast and diverse clinical records collected from hundreds of hospitals. Each row in the dataset represents an encounter (a hospital admission / visit) of a patient diagnosed with diabetes, along with details such as demographics, diagnoses, procedures, laboratory test results, and prescribed medications. We can also see whether a patient was readmitted to the hospital and, if so, when. You can find further details regarding the dataset here.
If you’d like to gain more insights into the dataset and explore possible analytical approaches before we dive in, you can check out the following two papers:
As I mentioned, this dataset is a subset of a larger database, filtered based on the following five criteria:
First of all, this dataset is valuable for understanding hospitalization trends among patients diagnosed with diabetes. By asking the right questions and exploring the data, we can extract meaningful insights; this is my primary goal.
Later, we can attempt to develop machine learning or deep learning models to predict readmissions or hospitalization outcomes, particularly in terms of mortality risk.
In most cases, datasets like this are used to predict readmissions. If we can successfully predict them, it could help reduce readmission costs, optimize resource allocation, or enable early interventions. Interestingly, readmission prediction also somewhat overlaps with mortality prediction.
This is an exciting and meaningful problem to work on. However, my first priority is to uncover insights from the data. I’ll revisit the possible business contexts and the readmission and mortality prediction challenges in greater detail before diving into machine learning model development.
A common mistake I’ve observed among data scientists is diving straight into model development without fully understanding the data. At first glance, a dataset may seem simple, but studying the domain behind the data is invaluable and may reveal interesting data transformation ideas that have never been thought of before. Each field has its own hidden complexities, established best practices, or standard procedures. Being aware of these domain-specific nuances allows for more meaningful and efficient analyses.
Examining each attribute (in each table, if applicable) might seem tedious, but its contribution to overall awareness is immensely valuable. That’s why I’ll go through each attribute one by one and categorize them even before looking at the data, considering possible insights and making simple inferences, even if they later prove incorrect. This approach doesn’t just reinforce understanding; it deepens it.
In the rest of this article, I’ll be examining each feature in detail. Writing the post may take a full day, but once this initial analysis becomes a habit, it can be completed in just 10–15 minutes before starting any project.
That’s it. Let’s dive in.
Here is the summary of the features:
The dataset includes two key identification features:
Brainstorming: I expect to see the same patient_nbr appear in multiple rows. Why? Because a patient can visit the hospital more than once. This ID must be unique to each patient but not unique across the entire dataset since the same patient may have multiple hospital visits.
On the other hand, encounter_id should always be unique because each hospital visit happens only once in our space-time continuum (unless we crack time travel). This makes encounter_id a great choice for the main index of the dataset. We could use it as the primary key of this table if this were a database.
Or, if I see that encounter_id values are assigned in ascending order (or descending, or in some other pattern) over time, we can use them to determine the chronological order of a patient’s visits, especially since the dataset doesn’t include a ‘date’ feature. For machine learning purposes, it may not be a very useful feature, but exploring how patients’ diagnoses, procedures, or medications change over time might be a good exercise in the exploratory analysis phase when we create stratified bioprofiles.
I’m not sure yet; I need to analyze the data first to see whether those assumptions hold.
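Once the data is loaded, a few lines of pandas can test these assumptions. Here is a minimal sketch (the file name diabetic_data.csv is my assumption; adjust it to wherever you saved the download):

```python
import pandas as pd

# Assumed file name; adjust the path to your local copy of the dataset.
df = pd.read_csv("diabetic_data.csv")

# encounter_id should be unique: one row per hospital visit.
assert df["encounter_id"].is_unique, "duplicate encounter_id values found"

# patient_nbr should repeat whenever a patient had multiple encounters.
visits_per_patient = df["patient_nbr"].value_counts()
print(f"{(visits_per_patient > 1).sum()} patients have more than one encounter")

# If encounter_id turns out to be assigned chronologically (still to be verified),
# sorting by it gives a rough visit order per patient despite the missing date field.
df = df.sort_values("encounter_id").set_index("encounter_id")
```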
The dataset includes race, gender, and age as demographic features.
Brainstorming: I expect to see different age values for the same patient_nbr across multiple rows. Why? Because a patient may visit the hospital in different years, say in 2001, 2005, and 2008, and their age should naturally increase over time, say from 65 to 72. But what about race and gender? From a data quality perspective, these values should remain consistent across all rows for the same patient. Of course, gender changes due to transition surgeries are a separate and complex topic, which I won’t dive into here, but it’s definitely worth exploring another time. Unknown gender values may also be related to that.
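Here’s how such a consistency check might look, continuing with the df loaded above (treating ‘?’ as a missing value is an assumption to verify against the raw file):

```python
import pandas as pd

# For each patient, race and gender should take exactly one distinct value
# across all encounters; age is allowed (and expected) to change between visits.
demo = df.replace("?", pd.NA).groupby("patient_nbr")[["race", "gender"]].nunique()
inconsistent = demo[(demo > 1).any(axis=1)]
print(f"{len(inconsistent)} patients with conflicting race/gender records")

age_variation = df.groupby("patient_nbr")["age"].nunique()
print(f"{(age_variation > 1).sum()} patients whose recorded age changed across visits")
```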
The only physical feature in the dataset is weight.
In addition to encounter_id and patient_nbr, the dataset includes three other types of IDs: admission_type_id, admission_source_id, and discharge_disposition_id. Despite having ‘ID’ in their names, these are not unique identifiers for observations (rows), or patients. Instead, they serve as practical codes representing different admission types, sources, and discharge categories. So, again, if this were a table in a relational database, those IDs would be foreign keys coming from other tables.
If you download the dataset from the source I shared above, you’ll find a file called mapping.csv in the folder. This file provides a clear description of what each code represents, making it easier to interpret these features.
Brainstorming: Since the dataset focuses on patients with diabetes, I would expect to see a very low number of newborn admissions. Instead, emergency and urgent admissions probably dominate the dataset as admission types, due to the wide range of diabetes-related complications. Elective admissions may follow as the second most common type.
Because of the factors above, most patients are probably admitted through the emergency room. For elective admissions, physician referrals must be the most common route.
As for discharge outcomes, I expect that most patients are sent home after recovery. However, for elderly patients with urgent conditions, mortality or emergency transfer to another hospital as an outcome is also a possibility.
Of course, these are just my assumptions as a bioinformatician and data scientist. I’m not a doctor or a healthcare facility manager, and I’m not exactly aware of the trends. I may be completely wrong, but honestly I don’t care, because brainstorming is really fun and enjoyable.
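These guesses are easy to check with simple frequency counts once the data is in memory; a quick sketch, reusing the df from earlier:

```python
# Distribution of the coded admission/discharge features; the numeric codes
# are decoded in the mapping file shipped with the dataset.
for col in ["admission_type_id", "admission_source_id", "discharge_disposition_id"]:
    print(f"\n--- {col} ---")
    print(df[col].value_counts().head(10))
```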
There are three interesting features that represent a patient’s admission history leading up to the observed admission.
Brainstorming: Based on my previous assumptions, I expect many patients to have high values for number_emergency, as emergency visits are common among individuals with diabetes. In other words, if a patient’s admission_type_id is 1 (emergency), it’s highly likely that their number_emergency value will also be high in the same row.
We can apply similar reasoning to other admission types. For example, if admission_type_id is 3 (elective), then we might expect higher previous outpatient visits (number_outpatient) compared to inpatient visits (number_inpatient). However, the reality is probably more complicated than this, and I have my doubts about that assumption.
Another interesting pattern might emerge when considering age. If a patient is over 60 (you could argue for 50, or someone else might prefer 70 as the threshold), their emergency visit history (number_emergency) is likely to be higher, and this may also lead to a higher number of hospitalizations (number_inpatient) due to age-related complications. For the same patients, we’re also likely to see mortality more frequently (discharge_disposition_id = 11, 19, 20, or 21).
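A rough way to test these visit-history hypotheses is to average the prior-visit counters per admission type and per age bracket; a sketch along these lines (the code-to-label mapping comes from the mapping file):

```python
# Do emergency admissions (admission_type_id == 1 per the mapping file) come
# with a heavier history of prior emergency visits?
history_cols = ["number_outpatient", "number_emergency", "number_inpatient"]
print(df.groupby("admission_type_id")[history_cols].mean().round(2))

# And the age hypothesis: average prior emergency visits per age bracket.
print(df.groupby("age")["number_emergency"].mean().round(2))
```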
These features are the ones generated while a patient is being treated during the visit / encounter.
Brainstorming: If a row has admission_type_id = 3 (elective), I would typically expect time_in_hospital to be 1 day. Why? Because elective admissions are usually for daily treatments or procedures that don’t require an overnight stay. However, if the admission type is emergency or urgent, the patient is more likely to be hospitalized for a longer period, meaning time_in_hospital should be greater than 1 day. One more guess from me: I think trauma-center-related admissions (admission_type_id = 7) probably have the longest hospital stays. We will be able to find out when we load and start exploring the data. And as you might have guessed, time_in_hospital should never be less than 1 day; otherwise it would indicate a data quality issue. The maximum value can be any number greater than 1 day.
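Both the data quality rule and the length-of-stay guesses translate into one-liners; a sketch, assuming the df loaded earlier:

```python
# time_in_hospital should never drop below 1 day; anything else is a quality flag.
assert df["time_in_hospital"].min() >= 1, "time_in_hospital below 1 day found"

# Median stay per admission type: are elective stays really the shortest,
# and do trauma-center admissions (admission_type_id == 7) really run longest?
print(df.groupby("admission_type_id")["time_in_hospital"].median())
```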
Brainstorming: It’s possible that patients who frequently see certain specialists have more hospital admissions and longer hospital stays. I’m not sure what other specialties are included in the dataset beyond those listed in the documentation, but ‘surgeon’ should definitely be one of them. If a patient undergoes surgery, it’s very likely that they will stay in the hospital overnight or longer.
Additionally, this feature might be associated not only with hospitalization length, but also with admission type. For example, if I visited a general practitioner, my admission type probably wouldn’t be emergency or urgent; it would more likely be elective. If the medical specialty is ‘emergency doctor’ (I don’t know yet whether it exists in the dataset), the admission type and source should definitely be classified as emergency or urgent. All of these would be valuable checks to include in a data quality framework.
Here we are. Diagnosis codes are a crucial part of any Electronic Health Record (EHR). In this dataset, diagnosis values come from the ICD-9 (International Classification of Diseases, 9th Revision) code set, which is one of the standard international diagnosis classification systems and is mostly used in the U.S. Other countries may use different coding systems.
Today, ICD-11 has superseded ICD-9, but since this dataset was created years ago, we’ll be going with ICD-9. Mapping all the ICD-9 codes in this dataset to ICD-10 or ICD-11 could actually be an interesting exercise. Ah, but that’s a story for another day.
Normally, ICD-9 codes consist of three to five numeric or alphanumeric characters. However, in this dataset, only the first three digits (the category level) are used. This means that diagnoses are recorded in broader categories, rather than in their most specific forms.
Let me give an example: the ICD-9 category 250 covers diabetes mellitus, while more specific codes such as 250.1 (diabetes with ketoacidosis) or 250.2 (diabetes with hyperosmolarity) add digits after the decimal point. When only the first three digits are kept, all of these collapse into the single category 250.
Since all patients in this dataset are diagnosed with diabetes, you might assume that every diagnosis code would be 250. However, keep in mind that these individuals can have additional conditions, not only diabetes-related issues. That’s why the dataset contains hundreds of unique ICD-9 codes spread across the following three columns:
Brainstorming: Of course, these diagnosis codes must be associated with specific patterns within other features (admission type and source, discharge type, time in hospital, etc) as I discussed above for the others. For instance, if an ICD-9 code is related to a heart condition, I would expect the admission to be an emergency, likely leading to hospitalization and a high potential for readmission. On the other hand, if the code corresponds to a minor illness, such as a common cold, the admission is more likely to be elective, resulting in a home discharge with a low likelihood of readmission.
However, to me, the most valuable aspect of having three separate diagnosis columns is the opportunity to analyze comorbidities. In real-life RWE datasets and medical databases, a single patient encounter can have more than 10 different diagnosis codes recorded in the same row. Analyzing how such different disease diagnoses co-occur allows for deeper insights into health trends, risk factors, and disease progression. Since this is a small dataset, we only have three diagnosis fields, which is still workable. While this limits some analyses, it may still provide valuable opportunities to explore hospitalization patterns, readmission trends, and key statistical insights related to patient care. It would be nice to try and showcase some basic trends. Ultimately, this dataset wasn’t designed to store complete patient histories. It mainly focuses on hospitalization and readmission trends for patients with diabetes.
When it comes to number_diagnoses, I have some doubts, since this is such a small dataset. This feature is supposed to represent the total number of non-empty diag_x fields recorded for an encounter. For example, if diag_1 and diag_2 contain ICD-9 codes while diag_3 is NULL, then number_diagnoses would be 2, since two diagnoses were entered into the system for that visit. Logically, the minimum value should be 1, occurring when only one diagnosis is recorded and the other two fields are NULL. What about the maximum value? Is it 3? Maybe, but not necessarily. In this dataset, we only see three diagnoses, but in reality a single encounter could have more than three recorded diagnoses, and the ones beyond the first three may simply not be represented here. Therefore, the maximum value in this field could be capped at 3 or could go well beyond it, depending on how the dataset was structured. We’ll find out when we examine the data.
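The consistency between number_diagnoses and the three diag_x fields is easy to probe; a sketch, assuming missing diagnoses are coded as NaN or ‘?’ (to be confirmed against the raw data):

```python
import pandas as pd

diag_cols = ["diag_1", "diag_2", "diag_3"]

# How many distinct category codes appear across the three diagnosis fields?
unique_codes = pd.unique(df[diag_cols].astype(str).values.ravel())
print(f"{len(unique_codes)} distinct codes across diag_1..diag_3")

# Count the non-empty diag_x fields per encounter and compare with number_diagnoses.
diag_filled = df[diag_cols].replace("?", pd.NA).notna().sum(axis=1)
exceeds = (df["number_diagnoses"] > diag_filled).mean()
print(f"number_diagnoses exceeds the visible diag fields in {exceeds:.1%} of rows")
print(df["number_diagnoses"].describe())
```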
In RWE datasets, we typically see procedure codes similar to diagnosis codes. It’s common to find ICD-9 and ICD-10 used for procedures as well. Additionally, the CPT (Current Procedural Terminology) code set is another widely used standard for documenting medical procedures.
For laboratory tests, LOINC (Logical Observation Identifiers Names and Codes) is a common standard.
However, in this dataset, procedure and lab test codes are not included. Instead, we only have the number of procedures.
Brainstorming: I’m really curious whether the total number of procedures is higher for urgent cases or elective (scheduled) ones.
My guess is that when a patient is admitted with urgent complaints, doctors may order multiple procedures to quickly diagnose the issue; such as blood and urine tests, CT scans, echocardiograms, EKGs, or X-rays. The number may be reasonably high.
However, for elective admissions with major surgeries, hospitalization tends to be longer, meaning that routine procedures may be repeated daily throughout the stay on top of the initial batch of procedures. On the other hand, for elective admissions involving minor treatments, fewer procedures are likely, since the patient is discharged on the same day.
Based on these assumptions, I would expect the number of procedures to follow this descending order:
A similar brainstorming approach can also be applied to the diagnosis codes and the specialty of the admitting physicians. Certain departments within a facility may be likely to order many procedures while others order fewer, right? Additionally, who pays the bills? Does it influence the number of procedures? And overall, how might the number of procedures affect the readmission trend?
What do you think? Let’s see the bets!
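Once the data is loaded, the bet is a single groupby away; a sketch (admission type labels taken from the mapping file):

```python
# Mean procedure counts per admission type (1 = emergency, 2 = urgent,
# 3 = elective, 7 = trauma center, per the mapping file).
print(
    df.groupby("admission_type_id")[["num_lab_procedures", "num_procedures"]]
      .mean()
      .round(2)
      .sort_values("num_procedures", ascending=False)
)
```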
Hah, here’s the answer to the big question: Who pays the bills?
Brainstorming: When we examine the data, we may well find that length of stay and number of procedures are influenced by the payer. Depending on what a patient’s insurance covers, these values may increase or decrease.
If you’re paying out of pocket, you might try to avoid non-essential procedures or even shorten your hospital stay to reduce costs. But if your insurance policy is generous, you’re probably more willing to say yes to everything the hospital recommends, right?
Let’s find out which payer is the most generous when we load the data!
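Here’s a sketch for that question too, grouping by payer_code and again treating ‘?’ as missing (an assumption to verify against the data):

```python
import pandas as pd

# Does the payer go hand in hand with longer stays and more interventions?
by_payer = (
    df.replace("?", pd.NA)
      .groupby("payer_code")[["time_in_hospital", "num_procedures", "num_medications"]]
      .mean()
      .round(2)
      .sort_values("time_in_hospital", ascending=False)
)
print(by_payer)
```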
Medications can be administered during any encounter, and this brings us to these key features related to that:
Brainstorming: I’d expect the number of medications to increase with certain diagnosis codes, physician specialties, and length of stay (time_in_hospital). For example, hypertension, cardiovascular disease, or heart failure may be associated with a higher number of administered medications.
For the features above, a change can only occur if diabetesMed is yes and at least one of the medications has a value of either up or down. Otherwise, it should always be recorded as no change. In the same way, diabetesMed can only be yes if at least one of the medications has a value other than no.
For instance, if we see a change entry while diabetesMed is no, that indicates a data inconsistency or error. Let me illustrate with an example table; let’s suppose we have only two different diabetes medications: insulin and metformin. Below, the rows for id_3 and id_4 have something wrong. I hope there aren’t such problems in the actual data.
| encounter_id | change | diabetesMed | insulin | metformin | num_medications |
|---|---|---|---|---|---|
| id_1 | no change | yes | steady | no | 1 |
| id_2 | change | yes | no | down | 1 |
| id_3 | change | no | no | no | 0 |
| id_4 | no change | yes | no | no | 0 |
| id_5 | change | yes | steady | steady | 2 |
What about id_5? At some point, the medication may have changed from insulin to metformin, triggering a change entry in the change column. However, in that case, I would expect the insulin column to be marked as down instead of steady, since its dosage effectively dropped to zero. Therefore, I suspect that id_5 is also an invalid record.
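The rules above can be turned into an automated consistency check. Here is a sketch for just insulin and metformin (the real dataset has more medication columns, and the exact value labels should be confirmed against the data; everything is lowercased here to stay label-agnostic):

```python
# Flag rows whose 'change' / 'diabetesMed' flags contradict the medication columns.
med_cols = ["insulin", "metformin"]  # extend with the remaining medication columns

meds = df[med_cols].apply(lambda s: s.str.lower())
any_active = (meds != "no").any(axis=1)                  # at least one drug prescribed
any_dose_change = meds.isin(["up", "down"]).any(axis=1)  # at least one dosage moved

diabetes_med = df["diabetesMed"].str.lower().eq("yes")
change_flag = ~df["change"].str.lower().isin(["no", "no change"])

# Rule 1: a change entry requires diabetesMed == yes and some up/down medication.
bad_change = change_flag & ~(diabetes_med & any_dose_change)
# Rule 2: diabetesMed == yes requires at least one medication other than 'no'.
bad_dmed = diabetes_med & ~any_active

print(f"suspicious 'change' rows: {bad_change.sum()}")
print(f"suspicious 'diabetesMed' rows: {bad_dmed.sum()}")
```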
Beyond that data quality perspective, as an ordinary person who is not a physician, I find myself asking a basic yet important question: What does a medication change (such as a dosage adjustment) really mean? Is it generally a bad sign, or could it sometimes indicate a positive outcome?
For example, if a patient is improving, a doctor might reduce their medication dosage. However, when I searched this topic on Google, I got the impression that medication changes are often associated with negative factors, as they are highly correlated with unstable or uncontrolled blood glucose levels. If blood sugar levels are not stable, frequent medication adjustments are needed, right? And the direction of the change may be either up or down in this case.
Additionally, if a patient experiences adverse effects from a medication, it might be switched to another type. Based on this, it seems that a change entry in the dataset is more likely to indicate a negative scenario. If that’s the case, it could be correlated with longer hospital stays, increased mortality rates, and frequent readmissions.
Again, if we had the number of medication changes recorded for each encounter, it would be easier to interpret the severity of the situation. I believe that a higher number of changes likely indicates a more severe condition, while fewer changes might suggest a greater chance of positive improvement. However, in this dataset, we only have a coarse indicator per medication (up, down, steady, or no) and a binary flag showing whether at least one change occurred during the encounter. Despite this limitation, I can still explore the data and test for correlations to see whether the sum of medication changes (downs or ups) is linked to factors such as hospital stay duration, readmission rates, or mortality risk.
Another interesting exploration could be testing the correlation between medication changes and lab test results. It’s possible that we consistently see a change when blood glucose levels are high (max_glu_serum), as elevated values likely indicate unstable glucose control. If that’s the case, medication adjustments might be a direct response to worsening metabolic stability.
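A simple crosstab can test that hunch once the data is loaded; a sketch (the exact labels of the glucose result categories should be checked in the data):

```python
import pandas as pd

# Share of encounters with a medication change, broken down by the glucose
# serum test result (row-normalised proportions).
print(pd.crosstab(df["max_glu_serum"], df["change"], normalize="index").round(2))
```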
This dataset focuses on diabetes-related hospitalizations, and glucose levels and A1C results are critical indicators for diabetes management. For this reason, we have the results of those tests.
Brainstorming: I suspect that higher Hemoglobin A1C and blood glucose levels may be associated with more severe cases requiring longer hospital stays. This could also lead to higher readmission rates, as diabetes-related complications may recur. Patients with elevated glucose levels likely receive more medications and procedures during their hospital stay.
I also have a strong feeling that higher A1C and glucose levels correlate with increased mortality rates.
And finally, another interesting question that comes to mind: What percentage of total admissions for patients with high A1C and glucose values are classified as emergency, urgent, or trauma center admissions? I feel this percentage is also high.
In the end, as usual, elevated blood values indicate that something isn’t going well.
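These guesses translate into a couple of crosstabs and groupbys; a sketch, reusing the expired-discharge codes mentioned earlier (11, 19, 20, 21, to be verified against the mapping file):

```python
import pandas as pd

# Readmission distribution per A1C result category (row proportions).
print(pd.crosstab(df["A1Cresult"], df["readmitted"], normalize="index").round(2))

# Length of stay and a rough mortality proxy per A1C result category.
expired = df["discharge_disposition_id"].isin([11, 19, 20, 21])
print(df.groupby("A1Cresult")["time_in_hospital"].mean().round(2))
print(expired.groupby(df["A1Cresult"]).mean().round(3))
```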
And finally, we arrive at the feature most commonly used as the target variable in classification models: readmission.
Brainstorming: Frequent readmissions are problematic for both healthcare facilities and patients, especially readmissions that happen within 30 days.
From a patient’s perspective, frequent readmission often indicates a more severe condition or that their previous care was inadequate.
From a facility’s perspective, it leads to higher resource consumption, reduced efficiency, and potential financial strain.
This dataset could be valuable for developing a model to predict readmissions in advance. As we explore the data, I will analyze the relationship between each feature and readmission, helping to identify the most valuable features heuristically before developing any models.
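As a heuristic first pass before any modelling, something like the following sketch could show how each feature moves with the readmission label (the column choices here are illustrative):

```python
import pandas as pd

target = df["readmitted"]

# Numeric features: compare group means across the readmission classes.
numeric_cols = df.select_dtypes("number").columns.drop(
    ["encounter_id", "patient_nbr"], errors="ignore"
)
print(df.groupby(target)[list(numeric_cols)].mean().round(2).T)

# A few categorical features: row-normalised crosstabs against readmission.
for col in ["admission_type_id", "A1Cresult", "change", "diabetesMed"]:
    print(f"\n--- {col} vs readmitted ---")
    print(pd.crosstab(df[col], target, normalize="index").round(2))
```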
This exercise was purely about thinking through the features using the data documentation without even seeing the actual data. Yet, we were still able to make estimations about relationships and trends. During this process, we gained a deeper understanding of the dataset’s structure, how it might have been built, potential data issues, and the strengths and weaknesses of our initial theories.
It might seem like overkill for a dataset like this, but I believe this practice is highly valuable before diving into hands-on data analysis.
In the next part, I’ll begin exploring the data, performing meaningful transformations, manipulations, and imputations, etc., all of which will help ensure a cleaner and more effective exploratory data analysis.
If you have any additional theories about the features, I’d love to hear your thoughts and extend the brainstorming further!
Featured Image: Clay Banks