Heart Disease Data Analysis
Data analysis is the process of extracting, presenting, and modeling information retrieved from raw sources. This example walks through a data analysis workflow in the Wolfram Language. The dataset comes from the UCI Machine Learning Repository and consists of heart disease diagnosis data from 1,541 patients.
Import the heart disease diagnosis data and parse it so that the rows correspond to patients and the columns correspond to attributes.
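(* Download the raw file as a single string *)
rawdata =
Import["https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/new.data", "Text"];
(* Runs of letters separate one patient record from the next *)
data = StringSplit[rawdata, LetterCharacter ..];
(* Split each record on runs of spaces or newlines and convert the fields to numbers *)
data = Table[
ToExpression[StringSplit[dat, (" " | "\n") ..]], {dat, data}];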
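To see what the two splitting steps do, here is a minimal sketch on a made-up two-record string. The string `sample` and its values are illustrative and not taken from the dataset; the sketch assumes, as the parsing code implies, that each record ends with a letter-only token that can serve as the record separator.

```wolfram
(* Made-up miniature input: two records, each ending in a letter-only token *)
sample = "1 0 63\n1 -9 name\n2 0 67\n1 -9 name";

(* Step 1: runs of letters separate one record from the next *)
records = StringSplit[sample, LetterCharacter ..];

(* Step 2: split each record on runs of spaces/newlines, then convert to numbers *)
parsed = Table[
   ToExpression[StringSplit[rec, (" " | "\n") ..]], {rec, records}]
(* {{1, 0, 63, 1, -9}, {2, 0, 67, 1, -9}} *)
```

Each inner list corresponds to one record, so `parsed[[All, k]]` picks out the k-th attribute across records.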
Extract the relevant attributes into "labels" and "features". Unitize maps every nonzero diagnosis value to 1, so the values stored in "labels" are 0 and 1, which correspond to absence and presence of heart disease, respectively.
labels = Unitize[data[[All, 58]]];
features =
data[[All, {3, 4, 9, 10, 12, 16, 19, 32, 38, 40, 41, 44, 51}]];
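As a small illustration of the labeling step, the sketch below applies `Unitize` to a made-up list of multi-level codes. The list `codes` is illustrative only; it assumes the diagnosis attribute takes integer values from 0 upward, as the UCI description of this dataset indicates.

```wolfram
(* Made-up diagnosis codes; 0 stays 0, every nonzero code becomes 1 *)
codes = {0, 2, 1, 0, 4, 3};

binary = Unitize[codes]
(* {0, 1, 1, 0, 1, 1} *)

(* Count how many records fall in each class *)
Tally[binary]
(* {{0, 2}, {1, 4}} *)
```

Running `Tally[labels]` on the real labels would give the class balance in the same form.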
Take[labels, 10]
For each patient, the feature vector is a list of numerical values. However, the data is incomplete: missing fields are stored as -9.
features[[-3]]
Replace each missing value with the average of the available values of the corresponding attribute, then compute the correlation between different attributes for visualization.
features = Transpose[Table[
N[attribute /. {-9 -> Mean[N[DeleteCases[attribute, -9]]]}]
, {attribute, Transpose[features]}]];
cormat = Correlation[features];
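The mean-imputation step and a possible way to draw the correlation matrix can be sketched on a tiny made-up table. The matrix `demo`, the name `filled`, and the choice of `MatrixPlot` with the "TemperatureMap" color scheme are assumptions for illustration; the plotting code used for the original figure is not shown in this example.

```wolfram
(* Made-up 4x3 table; -9 marks a missing entry, as in the dataset *)
demo = {{63, 145, 233}, {67, -9, 286}, {41, 130, -9}, {56, 120, 236}};

(* Column by column, replace -9 with the mean of the remaining entries *)
filled = Transpose[Table[
    N[col /. {-9 -> Mean[N[DeleteCases[col, -9]]]}],
    {col, Transpose[demo]}]];

filled[[2, 2]]
(* 131.667, the mean of 145, 130, and 120 *)

(* Heat map of the correlation matrix of the completed table *)
MatrixPlot[Correlation[filled], ColorFunction -> "TemperatureMap"]
```

For the real data, `MatrixPlot[cormat, ColorFunction -> "TemperatureMap"]` would draw the 13-by-13 matrix computed above in the same way.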
To visualize the distribution of the data, principal component analysis (PCA) is used to extract the first two principal components, and the projected data is then shown in a scatter plot.
pcs2 = Take[PrincipalComponents[features, Method -> "Correlation"],
All, 2];
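One way to draw the scatter plot is sketched below. The helper name `scatterByLabel` and the stand-in data are hypothetical and exist only so the snippet runs on its own; the original plotting code is not shown in this example, so this is not necessarily how the figure was produced.

```wolfram
(* Hypothetical helper: scatter plot of 2D points, one color per binary label *)
scatterByLabel[points_, labs_] := ListPlot[
   {Pick[points, labs, 0], Pick[points, labs, 1]},
   PlotLegends -> {"label 0", "label 1"},
   AxesLabel -> {"PC 1", "PC 2"}];

(* Stand-in data: 40 random records with 5 attributes and random 0/1 labels *)
SeedRandom[1];
demoFeatures = RandomVariate[NormalDistribution[], {40, 5}];
demoLabels = RandomInteger[1, 40];

(* Project onto the first two principal components, as in the step above *)
demoPcs = Take[
   PrincipalComponents[demoFeatures, Method -> "Correlation"], All, 2];

scatterByLabel[demoPcs, demoLabels]
```

With the variables defined in this example, the corresponding call would be `scatterByLabel[pcs2, labels]`.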
To distinguish the two classes, the projected data is fitted to a two-component Gaussian mixture model.
edist = EstimatedDistribution[pcs2,
MixtureDistribution[{p1,
p2}, {BinormalDistribution[{m11, m12}, {s11, s12}, r1],
BinormalDistribution[{m21, m22}, {s21, s22}, r2]}]];
Based on the mixture model, plot the decision boundary (black curve) and the probability density contours (red curves), and show them together with the scatter plot. Inside the decision boundary, the first component of the Gaussian mixture has the higher probability.
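A plot of this kind can be sketched as follows. The helper `mixturePlot` and the stand-in mixture with hand-picked parameters are hypothetical, and the decision boundary is taken to be the curve where the two weighted component densities are equal, which is one common definition and an assumption here rather than the confirmed method behind the original figure.

```wolfram
(* Hypothetical helper: scatter plot, density contours (red), decision boundary (black) *)
mixturePlot[mix_, points_] := Module[{w, c, xr, yr, x, y},
   {w, c} = List @@ mix;            (* weights and component distributions *)
   xr = MinMax[points[[All, 1]]];   (* plot range taken from the data *)
   yr = MinMax[points[[All, 2]]];
   Show[
    ListPlot[points, PlotRange -> {xr, yr}],
    ContourPlot[PDF[mix, {x, y}],
     {x, xr[[1]], xr[[2]]}, {y, yr[[1]], yr[[2]]},
     ContourShading -> None, ContourStyle -> Red],
    ContourPlot[
     w[[1]] PDF[c[[1]], {x, y}] == w[[2]] PDF[c[[2]], {x, y}],
     {x, xr[[1]], xr[[2]]}, {y, yr[[1]], yr[[2]]},
     ContourStyle -> Black]]];

(* Stand-in mixture with made-up parameters, plus points sampled from it *)
demoMix = MixtureDistribution[{0.6, 0.4},
   {BinormalDistribution[{-1, 0}, {1, 1}, 0.2],
    BinormalDistribution[{2, 1}, {1.5, 1}, -0.1]}];
SeedRandom[1];
demoPoints = RandomVariate[demoMix, 200];

mixturePlot[demoMix, demoPoints]
```

With the fitted model from this example, the corresponding call would be `mixturePlot[edist, pcs2]`.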