Introduction
The ID3 algorithm is a powerful older algorithm that allows one to build discrete
decision trees. It's a greedy machine learning algorithm that generates a decision
tree out of a training dataset. ID3 has many different areas of application in
data mining - for example it can be used to build a decision tree for medical
diagnosis, medical therapy selection, discovering hidden classification rules
in teaching out of students' mass evaluations, determining the factors that influence
purchasing behavior of customers, performing coarse weather predictions, or deciding
in agriculture if a system should water plants or if one should plant at a given
time under given conditions. The iterative dichotomiser - as the name suggests - tries
to determine iteratively which feature drives a decision most (i.e. which leads to
the best segmentation of one's target set).
So what is a decision tree anyway? Basically it's a tree that one can traverse
from top to bottom to try to answer a given question based on usually binary or
discrete attributes. Building it is a supervised learning task that requires a training
data set that one can first build a model out of. Usually one splits the available
data into a training and a verification set (for example by random
partitioning) to gain information on how well the trained tree works when
applied to known situations with a known outcome.
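The random partitioning mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the default fraction of 30% and the `seed` parameter are assumptions made for this example and not part of the implementation described later.

```python
import random


def split_dataset(rows, verification_fraction=0.3, seed=None):
    """Randomly partition rows into a (training, verification) pair of lists."""
    if not 0.0 <= verification_fraction <= 1.0:
        raise ValueError("verification_fraction must be within [0, 1]")
    shuffled = list(rows)                  # copy, the caller's data stays untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split reproducible
    cut = int(round(len(shuffled) * verification_fraction))
    return shuffled[cut:], shuffled[:cut]


training, verification = split_dataset(range(10), verification_fraction=0.3, seed=42)
print(len(training), len(verification))
```

The tree is then built from `training` only, while `verification` is used to check how often the tree's conclusion agrees with the known outcome.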
Building the tree is a pretty expensive operation in which one iteratively reduces
the attribute space. In each iteration one first separates the remaining training
data of the given subtree into all result classes and calculates the information
gain of all remaining attributes (i.e. one calculates how well each attribute separates /
correlates with the class association in the given subtree). This of course
means that a full unpruned tree over $n$ binary attributes has up to $n$ levels and
up to $2^n$ leaves - for non binary attributes the tree grows even faster.
To select the attribute to split on, ID3 uses two basic metrics:
Shannon entropy measures the amount of uncertainty or information content in
a set. In case $S$ is the dataset for which entropy is calculated, $X$ is the set
of classes in $S$ and $p(x)$ is the proportion of elements of $S$ that belong to
class $x$:
\[
H(S) = \sum_{x \in X} - p(x) * \log_2 p(x) \\
p(x) = \frac{n(x)}{N}
\]
Note that I've used $n(x)$ as the number of elements that are assigned to class $x$
and $N$ as the total number of elements in $S$. To perform this calculation
in a meaningful way one uses the convention $0 * \log_2 0 = 0$ (the limit of
$p * \log_2 p$ for $p \to 0$) which basically just means that one ignores empty
classes - which is also already implied by summing only over the classes
$x \in X$ that actually occur in $S$.
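As a minimal sketch, the entropy definition can be written down directly in Python. The function name `shannon_entropy` is an assumption for this example. Since a `Counter` only contains classes that occur at least once, the empty class case never has to be handled explicitly.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy (in bits) of an iterable of class labels."""
    counts = Counter(labels)       # n(x) for every class x that occurs
    total = sum(counts.values())   # N
    if total == 0:
        return 0.0
    # p(x) = n(x) / N is always > 0 here, so log2 is well defined
    return sum(-(n / total) * log2(n / total) for n in counts.values())


# 9 positive and 5 negative outcomes, as in the weather example of this article
decisions = ["Yes"] * 9 + ["No"] * 5
print(shannon_entropy(decisions))
```

For this 9 to 5 distribution the result is roughly 0.94 bits; a set that contains only a single class yields an entropy of zero.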
Information gain. The information gain describes how well a given attribute $A$
separates the classes. The (discrete) attribute values will be denoted as $a \in A$.
The set $S_a \subseteq S$ contains all elements of $S$ that have the attribute value $a$
assigned. The term $\mid S \mid$ denotes the number of elements in a given set.
So the term
\[
\frac{\mid S_a \mid}{\mid S \mid}
\]
is the relative frequency of the attribute value $a$ in the set $S$. The
information gain is then defined as:
\[
G(S, A) = H(S) - \sum_{a \in A} \frac{\mid S_a \mid}{\mid S \mid} * H(S_a)
\]
Thus one basically calculates the information gain as the difference between
the entropy of the full data set and the sum of the entropies of the split data sets
weighted by their relative frequency. This is done for each remaining attribute in each
iteration while building the tree (thus along one path from the root to a leaf up
to $n + (n-1) + \ldots + 1$, i.e. $O(n^2)$, gain calculations are required for $n$
attributes - the number of candidates only decreases linearly). One then selects the
attribute with maximum information gain. If there are attributes with equal information
gain one usually randomly selects one of those attributes and performs the split.
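The following Python sketch evaluates the gain formula for every attribute of the small weather dataset that is used in the example section of this article. The helper names `entropy` and `information_gain` are assumptions made for this illustration. Ties are resolved by taking the first maximum instead of a random one.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "Decision")
DATA = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [tuple(line.split()) for line in DATA.splitlines()]


def entropy(labels):
    """H(S) for a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """G(S, A) = H(S) - sum_a |S_a| / |S| * H(S_a); both arguments are column indices."""
    subsets = defaultdict(list)        # attribute value a -> class labels of S_a
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


target = COLUMNS.index("Decision")
gains = {name: information_gain(ROWS, index, target)
         for index, name in enumerate(COLUMNS) if index != target}
best = max(gains, key=gains.get)       # first maximum wins on ties
for name, gain in gains.items():
    print(f"Possible gain for {name}: {gain:.6f}")
print("Selected:", best)
```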
In case the information gain falls below a chosen minimum one might also simply prune
the tree at this position and record the probabilities of all result classes. In this
case one can also determine which attributes do not contribute to the decision when
looking at the resulting tree.
The tree built this way specifies at each node which attribute is split on, with
as many branches as attribute values are present. If one wants to extend the algorithm
to continuous attributes one can still use ID3 by first sorting the values into bins
(i.e. building classes - for example using an approach such as k-means or
other clustering algorithms - or simply separating into classes of equal size).
There are better methods to tackle problems with continuous attributes though.
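A minimal sketch of the "classes of equal size" variant could look like the following. The function names, the temperature readings and the three labels are made up for this illustration; clustering based binning would replace `equal_frequency_edges` by a different way of choosing the edges.

```python
from bisect import bisect_left


def equal_frequency_edges(values, bin_count):
    """Upper edges of bin_count bins that each hold roughly equally many values."""
    ordered = sorted(values)
    if bin_count < 1 or not ordered:
        raise ValueError("need at least one bin and one value")
    n = len(ordered)
    # the last edge is always the maximum so every training value falls into a bin
    return [ordered[max(0, (i * n) // bin_count - 1)] for i in range(1, bin_count + 1)]


def to_bin(value, edges):
    """Index of the first bin whose upper edge is >= value; larger values go to the last bin."""
    return min(bisect_left(edges, value), len(edges) - 1)


temperatures = [12.5, 30.1, 18.0, 25.4, 21.7, 9.8]   # hypothetical readings
edges = equal_frequency_edges(temperatures, 3)
labels = ["Cool", "Mild", "Hot"]
print(edges)
print([labels[to_bin(t, edges)] for t in temperatures])
```

After this step the continuous column has been replaced by a discrete one and can be fed to ID3 like any other attribute.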
Popular extensions and relatives of the ID3 algorithm are:
- C4.5 is the successor of ID3 by the same author (Ross Quinlan). Like my implementation
described below it allows non binary branches. It additionally handles continuous
attributes and unknown attribute values (unknown values are simply ignored when
calculating gain). A further extension is the tailoring by adding a cost weight to different
properties so that it's more likely to first branch at simple to calculate
attributes instead of on many expensive ones. It also introduces tree pruning
to reduce the size of the resulting tree and to prevent over-fitting.
- CART has been published in 1984 - it only builds binary
trees and also covers regression problems where the splits are fitted using least
squares. In the end the tree is usually also pruned.
- CHAID - published 1980 - is similar to C4.5 and CART but limits the growth
of the tree using a $\chi^2$ based metric.
Many software implementations currently implement either ID3, CART or minor variations
of those algorithms.
Confidence intervals used
The Wilson interval used for the confidence intervals is an improvement over the
simple Wald interval. The Wald interval builds a confidence interval by a simple
normal approximation, assuming a binomial distribution between the outcome of
interest and all other outcomes. With the observed proportion $\hat{p}$, the number
of samples $n$ and the significance level $\alpha$ the Wald interval is:
\[
P_{l,u} = \hat{p} \pm z * \sqrt{\frac{\hat{p} * (1 - \hat{p})}{n}} \\
z = \Phi^{-1}(1 - \frac{\alpha}{2})
\]
Using the Wald interval requires one to clamp the results to the $[0,1]$ interval
at the end and it usually undershoots the real confidence interval (i.e. results in an
interval with a smaller confidence level than requested). A simple extension to this
interval would be the Agresti-Coull interval - but a way better alternative is the
Wilson score interval. It does not yield limits that are symmetric around $\hat{p}$ -
but it largely solves the problem of over- and undershooting and works well for small
samples and skewed observations. Unlike more elaborate methods like the Clopper-Pearson
interval it can be directly calculated by a simple formula. The derivation of this
interval is nicely described on Wikipedia so I'm going to skip this here. The main
result is the used formula:
\[
p_{l,u} = \frac{1}{1 + \frac{z^2}{n}} \left(\hat{p} + \frac{z^2}{2n}\right) \pm \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}
\]
This is what I'm going to use to determine the confidence intervals.
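Both formulas translate directly into a few lines of Python. The following is a sketch, not the code of the implementation described later: the function names are assumptions and the 95% confidence level is chosen only for illustration. `statistics.NormalDist().inv_cdf` from the standard library provides $\Phi^{-1}$.

```python
from math import sqrt
from statistics import NormalDist


def z_value(alpha):
    """z = Phi^-1(1 - alpha / 2) for a two sided interval."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald_interval(successes, n, z):
    """Normal approximation, clamped to [0, 1]."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(successes, n, z):
    """Wilson score interval, always stays within [0, 1] up to rounding."""
    p = successes / n
    denominator = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denominator
    half = z / denominator * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half


z = z_value(0.05)                 # roughly 1.96
print(wald_interval(5, 14, z))    # 5 of 14 observations
print(wilson_interval(5, 14, z))
print(wald_interval(4, 4, z))     # collapses to a single point
print(wilson_interval(4, 4, z))   # still has a finite width
```

The last two lines show the main practical difference: when all observations agree, the Wald interval collapses to a point while the Wilson interval still reflects the small sample size.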
Simple example data
For demonstration purposes I'm using a pretty simple dataset - that's often found
on the net when one discusses ID3 trees - that describes if someone went outside
depending on environmental conditions:
| Outlook | Temperature | Humidity | Wind | Decision |
| --- | --- | --- | --- | --- |
| Sunny | Hot | High | Weak | No |
| Sunny | Hot | High | Strong | No |
| Overcast | Hot | High | Weak | Yes |
| Rain | Mild | High | Weak | Yes |
| Rain | Cool | Normal | Weak | Yes |
| Rain | Cool | Normal | Strong | No |
| Overcast | Cool | Normal | Strong | Yes |
| Sunny | Mild | High | Weak | No |
| Sunny | Cool | Normal | Weak | Yes |
| Rain | Mild | Normal | Weak | Yes |
| Sunny | Mild | Normal | Strong | Yes |
| Overcast | Mild | High | Strong | Yes |
| Overcast | Hot | Normal | Weak | Yes |
| Rain | Mild | High | Strong | No |
Building the decision tree for the test dataset is pretty straightforward. First
one has to determine the Shannon entropy for the unfiltered data. I'm also calculating
the target probabilities and their confidence intervals at every level. The idea
behind the latter is that one might only be interested in a result with a given
probability and thus might want to terminate the tree search or prune the tree
later on.
Entropy: 0.9402859586706309, gain: 0.2467498197744391
Decision = No: 35.71429% [12.73086%, 67.90469%]
Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]
As one can see the example dataset is somewhat balanced and the confidence intervals
are also massively overlapping.
The next step is to calculate the branching ratios as well as the entropy for each
subtree in case one would split for each of the attributes:
Possible gain for Outlook: 0.2467498197744391
Possible gain for Temperature: 0.029222565658954647
Possible gain for Humidity: 0.15183550136234142
Possible gain for Wind: 0.04812703040826932
Selected maximum gain 0.2467498197744391 for candidate Outlook
As one can see the information gain differs radically between the various
attributes. The largest gain is achieved by selecting the outlook (0.246750)
followed by the humidity (0.151836). Wind condition (0.048127)
and temperature (0.029223) provide the least information. The first
branch will thus be formed at the outlook attribute.
The algorithm has decided to branch at the outlook column
since its information gain is the maximum encountered. Then the algorithm recurses
into the possible attribute values of the outlook one after another. The first value
shown here is overcast:
Inside subtree Outlook = Overcast
Branching candidates Temperature, Humidity, Wind
Our entropy is already 0 - finishing up
Outlook = Overcast:
| | Terminal:
| | Decision = No: 0.0% [0.0%, 62.46387%]
| | Decision = Yes: 100.0% [37.53613%, 100.0%]
As one can see the entropy is zero - the decision is always yes. The width
of the confidence interval is determined by the number of cases that support
this decision, even though we only encountered records containing this decision.
Since entropy is zero and thus the information gain of any further step would
also vanish this will create a terminal and stop further splitting.
The case is different for an outlook of rain. The algorithm again calculates
entropy and information gain for all possible remaining attributes:
Inside subtree Outlook = Rain
Branching candidates Temperature, Humidity, Wind
Element count: 5, Shanon entropy: 0.9709505944546686
Possible gain for Temperature: 0.01997309402197489
Possible gain for Humidity: 0.01997309402197489
Possible gain for Wind: 0.9709505944546686
Selected maximum gain 0.9709505944546686 for candidate Wind
At the end it selects branching at the wind column.
Outlook = Rain:
| | Entropy: 0.9709505944546686, gain: 0.9709505944546686
| | Decision = No: 40.0% [8.2521%, 83.16892%]
| | Decision = Yes: 60.0% [16.83108%, 91.7479%]
| | Branching on Wind
| | Wind = Weak:
| | | | Terminal:
| | | | Decision = No: 0.0% [0.0%, 68.93252%]
| | | | Decision = Yes: 100.0% [31.067479999999996%, 100.0%]
| | Wind = Strong:
| | | | Terminal:
| | | | Decision = No: 100.0% [23.10429%, 100.0%]
| | | | Decision = Yes: 0.0% [-0.0%, 76.89571%]
Recursing into weak and strong subtrees will yield nodes with zero entropy
again - thus there will be terminal nodes.
At the end the algorithm will have generated a simple decision tree only two levels
deep. The topmost level will split at the outlook column - for overcast no
more recursion will be required. In case of rainy outlook the algorithm will
choose to further discriminate based on wind condition and produce terminals
for strong and weak wind predictions. In case of sunny outlook on the other
hand it will discriminate based on humidity and produce terminals for high
and normal humidity.
| Entropy: 0.9402859586706309, gain: 0.2467498197744391
| Decision = No: 35.71429% [12.73086%, 67.90469%]
| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]
| Branching on Outlook
| Outlook = Sunny:
| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686
| | | Decision = No: 60.0% [16.83108%, 91.7479%]
| | | Decision = Yes: 40.0% [8.2521%, 83.16892%]
| | | Branching on Humidity
| | | Humidity = High:
| | | | | Terminal:
| | | | | Decision = No: 100.0% [31.067479999999996%, 100.0%]
| | | | | Decision = Yes: 0.0% [0.0%, 68.93252%]
| | | Humidity = Normal:
| | | | | Terminal:
| | | | | Decision = No: 0.0% [-0.0%, 76.89571%]
| | | | | Decision = Yes: 100.0% [23.10429%, 100.0%]
| Outlook = Overcast:
| | | Terminal:
| | | Decision = No: 0.0% [0.0%, 62.46387%]
| | | Decision = Yes: 100.0% [37.53613%, 100.0%]
| Outlook = Rain:
| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686
| | | Decision = No: 40.0% [8.2521%, 83.16892%]
| | | Decision = Yes: 60.0% [16.83108%, 91.7479%]
| | | Branching on Wind
| | | Wind = Weak:
| | | | | Terminal:
| | | | | Decision = No: 0.0% [0.0%, 68.93252%]
| | | | | Decision = Yes: 100.0% [31.067479999999996%, 100.0%]
| | | Wind = Strong:
| | | | | Terminal:
| | | | | Decision = No: 100.0% [23.10429%, 100.0%]
| | | | | Decision = Yes: 0.0% [-0.0%, 76.89571%]
In case of this example the algorithm built a tree that leads to decisions with
probability 1.0 - i.e. it provides a decision whose point estimator looks like
a sure conclusion. Note that the confidence intervals are still far from point-like
due to the limited amount of data that has been used to build the tree. In the general
case there might be different probabilities for both outcomes - one can then decide
if one wants to work with the point estimators, the confidence intervals or wants
to specify any threshold, add an inconclusive outcome, etc.
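The recursion walked through above can be condensed into a short Python sketch. It is not the implementation described later in this article: the nested dictionary layout, the names `build_tree` and `classify` and the tie handling (first maximum wins) are assumptions for illustration, and confidence intervals are left out. Unlike the output shown above, branches are only created for attribute values that actually occur in the respective subset.

```python
from collections import Counter
from math import log2

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "Decision"]
DATA = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]


def entropy(rows, target):
    counts = Counter(row[target] for row in rows)
    return sum(-(n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def gain(rows, attribute, target):
    remainder = 0.0
    for value in sorted({row[attribute] for row in rows}):
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder


def build_tree(rows, attributes, target, min_gain=0.0):
    """Return {"leaf": probabilities} or {"split": attribute, "branches": {...}}."""
    counts = Counter(row[target] for row in rows)
    probabilities = {label: n / len(rows) for label, n in counts.items()}
    if len(counts) == 1 or not attributes:      # zero entropy or nothing left to split on
        return {"leaf": probabilities}
    best = max(attributes, key=lambda a: gain(rows, a, target))
    if gain(rows, best, target) <= min_gain:    # prune: no usable gain
        return {"leaf": probabilities}
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target, min_gain)
    return {"split": best, "branches": branches}


def classify(tree, record):
    """Walk from the root to a leaf; raises KeyError for attribute values never seen."""
    while "split" in tree:
        tree = tree["branches"][record[tree["split"]]]
    return tree["leaf"]


tree = build_tree(ROWS, COLUMNS[:-1], "Decision")
print(tree["split"], {value: sub.get("split") for value, sub in tree["branches"].items()})
print(classify(tree, {"Outlook": "Rain", "Wind": "Strong"}))
```

Raising `min_gain` makes the builder stop earlier and return leaves with mixed probabilities instead of splitting on attributes that contribute little.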
The tree looks different when one does not include Humidity:
| Entropy: 0.9402859586706309, gain: 0.2467498197744391
| Decision = No: 35.71429% [12.73086%, 67.90469%]
| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]
| Branching on Outlook
| Outlook = Sunny:
| | | Entropy: 0.9709505944546686, gain: 0.5709505944546686
| | | Decision = No: 60.0% [16.83108%, 91.7479%]
| | | Decision = Yes: 40.0% [8.2521%, 83.16892%]
| | | Branching on Temperature
| | | Temperature = Hot:
| | | | | Terminal:
| | | | | Decision = No: 100.0% [23.10429%, 100.0%]
| | | | | Decision = Yes: 0.0% [-0.0%, 76.89571%]
| | | Temperature = Mild:
| | | | | Entropy: 1.0, gain: 1.0
| | | | | Decision = No: 50.0% [6.1549%, 93.8451%]
| | | | | Decision = Yes: 50.0% [6.1549%, 93.8451%]
| | | | | Branching on Wind
| | | | | Wind = Weak:
| | | | | | | Terminal:
| | | | | | | Decision = No: 100.0% [13.06097%, 100.0%]
| | | | | | | Decision = Yes: 0.0% [-0.0%, 86.93902999999999%]
| | | | | Wind = Strong:
| | | | | | | Terminal:
| | | | | | | Decision = No: 0.0% [-0.0%, 86.93902999999999%]
| | | | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]
| | | Temperature = Cool:
| | | | | Terminal:
| | | | | Decision = No: 0.0% [-0.0%, 86.93902999999999%]
| | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]
| Outlook = Overcast:
| | | Terminal:
| | | Decision = No: 0.0% [0.0%, 62.46387%]
| | | Decision = Yes: 100.0% [37.53613%, 100.0%]
| Outlook = Rain:
| | | Entropy: 0.9709505944546686, gain: 0.9709505944546686
| | | Decision = No: 40.0% [8.2521%, 83.16892%]
| | | Decision = Yes: 60.0% [16.83108%, 91.7479%]
| | | Branching on Wind
| | | Wind = Weak:
| | | | | Terminal:
| | | | | Decision = No: 0.0% [0.0%, 68.93252%]
| | | | | Decision = Yes: 100.0% [31.067479999999996%, 100.0%]
| | | Wind = Strong:
| | | | | Terminal:
| | | | | Decision = No: 100.0% [23.10429%, 100.0%]
| | | | | Decision = Yes: 0.0% [-0.0%, 76.89571%]
In case one decides to remove the Outlook attribute, which had the highest
initial gain, it gets more interesting - then one can see that not every case
leads to a 100% conclusion. For example in case one has high humidity but only
mild temperature there is a 50% chance for yes and a 50% chance for no.
The same is the case for high humidity, hot temperature and weak wind and
in a third case for normal humidity. Such inconclusive outcomes are of course
much more likely when one builds trees over complex datasets with enough entries.
For complex situations a tree with only exact conclusions is usually a sign of
overfitting (though of course it is also a possible correct outcome):
| Entropy: 0.9402859586706309, gain: 0.15183550136234142
| Decision = No: 35.71429% [12.73086%, 67.90469%]
| Decision = Yes: 64.28571% [32.09531%, 87.26914000000001%]
| Branching on Humidity
| Humidity = High:
| | | Terminal:
| | | Decision = No: 57.14286% [18.93662%, 88.38595000000001%]
| | | Decision = Yes: 42.85714% [11.614049999999999%, 81.06338%]
| Humidity = Normal:
| | | Entropy: 0.5916727785823275, gain: 0.19811742113040343
| | | Decision = No: 14.285709999999998% [1.6956700000000002%, 61.69146%]
| | | Decision = Yes: 85.71429% [38.30854%, 98.30433%]
| | | Branching on Wind
| | | Wind = Weak:
| | | | | Terminal:
| | | | | Decision = No: 0.0% [0.0%, 62.46387%]
| | | | | Decision = Yes: 100.0% [37.53613%, 100.0%]
| | | Wind = Strong:
| | | | | Entropy: 0.9182958340544896, gain: 0.2516291673878229
| | | | | Decision = No: 33.33333% [4.03207%, 85.6121%]
| | | | | Decision = Yes: 66.66667% [14.3879%, 95.96793%]
| | | | | Branching on Temperature
| | | | | Temperature = Hot:
| | | | | | | Terminal:
| | | | | | | Not possible according to known data
| | | | | Temperature = Mild:
| | | | | | | Terminal:
| | | | | | | Decision = No: 0.0% [-0.0%, 86.93902999999999%]
| | | | | | | Decision = Yes: 100.0% [13.06097%, 100.0%]
| | | | | Temperature = Cool:
| | | | | | | Terminal:
| | | | | | | Decision = No: 50.0% [6.1549%, 93.8451%]
| | | | | | | Decision = Yes: 50.0% [6.1549%, 93.8451%]
Description of the implementation
So what does one require for a simple implementation of ID3? Note this will be one
of the simplest implementations that I came up with; it offers little in terms of
performance optimization. It has been designed to tackle some specific problems
though, so the code has been designed to process data stored in CSV files as well
as in various database backends (these have been excluded from the open sourced
version for a myriad of different reasons).
Datastore: CSV datastore
First one will need a way to access the data. This is what I'm going to call the
datastore layer. The datastore layer will provide access to the subsets of
data - and might be implemented either as a simple iterator or using some
SQL database logic. The methods required are:
- Providing a list of attributes that will be used or are available. This will
  have to be tailored by the application.
- For each attribute:
  - Discrete labels (or discrete numerical attributes):
    - Possible distinct labels
    - Assigned numerical values for each label
    - The label itself
- Iterating over all rows (allowing access to all attributes by attribute index
  returning the numerical values) specifying attribute filters (by index and value)
- Optionally for performance enhancement (since databases are really good at this)
  one might support functions for:
  - counting rows having a specific attribute with a specific value
    in a given subset (filtered by specific values for specific attributes).
For the sake of simplicity I've created a simple CSV/TSV datastore at first
for which one has to set attribute types manually by selecting the specific attribute
descriptor class. The datastore can be created by specifying a filename as well as
the column index and the type (discrete labels or continuous attributes) of each
attribute. For continuous attributes one has to specify the classes since they won't be
automatically determined.
The CSV datastore also allows one to automatically build the list of attributes
in case one really wants to consume all fields.
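To make the required interface a bit more tangible, here is a heavily reduced sketch of such a datastore. The class name `CsvDatastore` and its methods are hypothetical and do not mirror the actual implementation: it keeps everything in memory, treats every column as discrete labels and derives the attribute list from the header row.

```python
import csv
import io


class CsvDatastore:
    """Hypothetical minimal datastore: discrete columns only, all rows in memory."""

    def __init__(self, handle, delimiter=","):
        reader = csv.reader(handle, delimiter=delimiter)
        self.attributes = next(reader)              # header row = attribute names
        self._rows = [row for row in reader if row]
        # distinct labels per attribute; the list position is the numerical value
        self.labels = [sorted({row[i] for row in self._rows})
                       for i in range(len(self.attributes))]

    def value_of(self, attribute, label):
        """Numerical value assigned to a label of the attribute with the given index."""
        return self.labels[attribute].index(label)

    def rows(self, filters=None):
        """Yield rows as tuples of numerical values; filters maps attribute index to value."""
        filters = filters or {}
        for row in self._rows:
            encoded = tuple(self.labels[i].index(cell) for i, cell in enumerate(row))
            if all(encoded[i] == value for i, value in filters.items()):
                yield encoded

    def count(self, attribute, value, filters=None):
        """Number of rows in the filtered subset that carry the given attribute value."""
        return sum(1 for row in self.rows(filters) if row[attribute] == value)


SAMPLE = """Outlook,Wind,Decision
Sunny,Weak,No
Overcast,Weak,Yes
Rain,Weak,Yes
Rain,Strong,No
"""
store = CsvDatastore(io.StringIO(SAMPLE))
rain = store.value_of(0, "Rain")
yes = store.value_of(2, "Yes")
print(store.attributes)
print(store.count(2, yes, filters={0: rain}))   # "Yes" decisions among the rainy rows
```

A database backed variant would implement `count` with a single `SELECT COUNT(*)` style query instead of iterating over the rows, which is where the optional counting function pays off.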
The ID3 tree
Now on to the ID3 tree itself. This is basically a direct implementation
of the algorithm described above. It only needs to know which
attributes it should use as input and which attribute contains the target classes.
Since attributes are already selected by the datastore configuration the ID3
tree builder is not required to receive any additional information.
The output then is a tree that splits at each level by one attribute - for each
of the discrete values there exists one subtree. On each level the tree records:
- For each internal (split) node:
  - The attribute that is split on this tree level
  - The information gain that is achieved by this split. During building of the
    tree the builder function will calculate all gains for all candidates as well
    as the probabilities for all target attribute values.
  - For each decision subtree:
    - A list of probabilities for all target results (i.e. for a yes/no decision
      the probability that an effect occurs vs. the probability that it does not
      occur). This is simply the conditional probability. Note that this
      is technically not required for ID3 but it might be interesting for the
      user of the ID3 tree later on.
    - The new attribute filter configuration
    - The selected attribute value
- For each leaf node (that lists the target):
  - The number of possible target values
  - The probability for each of the target values (sorted by descending probability)
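One possible way to hold the information listed above is a pair of small dataclasses. This is only a sketch of a data layout: the names `Estimate` and `Node` and all field names are assumptions, not the structures used by the actual implementation. The numbers in the usage part are taken from the weather example output shown earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Estimate:
    probability: float                 # conditional probability of one target value
    interval: Tuple[float, float]      # lower and upper confidence limit


@dataclass
class Node:
    filters: Dict[str, str]                     # attribute -> value selected on the way down
    estimates: Dict[str, Estimate]              # target value -> estimate
    split_attribute: Optional[str] = None       # None marks a leaf
    gain: Optional[float] = None                # information gain of the split
    children: Dict[str, "Node"] = field(default_factory=dict)

    @property
    def is_leaf(self):
        return self.split_attribute is None

    def ranked(self):
        """Target values sorted by descending probability."""
        return sorted(self.estimates.items(),
                      key=lambda item: item[1].probability, reverse=True)


overcast = Node(
    filters={"Outlook": "Overcast"},
    estimates={"No": Estimate(0.0, (0.0, 0.6246387)),
               "Yes": Estimate(1.0, (0.3753613, 1.0))},
)
root = Node(
    filters={},
    estimates={"No": Estimate(0.3571429, (0.1273086, 0.6790469)),
               "Yes": Estimate(0.6428571, (0.3209531, 0.8726914))},
    split_attribute="Outlook",
    gain=0.2467498197744391,
    children={"Overcast": overcast},
)
print(root.is_leaf, overcast.is_leaf, overcast.ranked()[0][0])
```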
The ID3 tree is built by a simple utility function that only requires some
minimal arguments:
- The pre configured data source. The data source already contains attribute
definitions and distinct values as well as the ability to filter and count
values.
- The minimal gain. If set to 0 all attributes that are possible will be included
as long as there is a split between different target states. A larger value allows
one to remove attributes that do not provide any usable gain from the tree.
- The confidence level that will be used during confidence interval calculation.
This has no influence on the building of the tree.
So to take the decision at a given level the task can be split into multiple
steps:
- First one has to be able to calculate target probabilities and Shannon entropy
for a given list of filters. This will lead to a list of probabilities (exactly
the number of possible target values) including their Wilson interval to provide
a coarse measure for the quality of the response and resulting probabilities.
- The second large task will be done by a function that iterates over all splitting
candidates - i.e. all remaining attributes that have not been used for
splitting until now. For each possible candidate it will first iterate over
all possible values and calculate again the Shannon entropy for that given
potential subtree as well as the relative branching probability into the
given subtree.
Some nice output for the ID3 tree
Since I also wanted to further process the tree I've decided to support various
outputs on the small CLI utility or inside the Jupyter notebook:
- ASCII output in a structured way
Applying to real world data
Now that I've described the algorithm and its implementation let's apply it to
some real world data. There are a number of nice datasets that can be used as an
example. I've decided on the following ones:
- The Audubon Society Field Guide descriptions of various mushrooms from the
Agaricus and Lepiota family. The dataset dates back to 1987 and contains 8124
samples with 22 of their attributes.
- In the end a set of medical diagnoses and a list of the major symptoms that
indicated the given illness.
Classifying edible and poisonous mushrooms
There is a section about Agaricus and Lepiota inside the Audubon Society Field Guide
from which Jeff Schlimmer has extracted 22 different characteristics
of 8124 mushroom samples from those two families to allow one to train decision trees
to decide if mushrooms are edible or not. Please note that this should not be used
as a guide. There is no guarantee this decision tree won't kill you.
Training the decision tree takes a while due to the large number of
attributes and attribute values. The attributes provided are:
- Is the mushroom edible or poisonous
- Cap shape
- Cap surface
- Cap color
- Does the mushroom have bruises or not
- Odor
- Gill attachment
- Gill spacing
- Gill size
- Gill color
- Stalk shape
- Stalk root
- Stalk surface above ring
- Stalk surface below ring
- Stalk color above ring
- Stalk color below ring
- Veil type
- Veil color
- Ring number
- Ring type
- Spore print color
- Population
- Habitat
As one can see the algorithm recurses only into a maximum of four attributes
and then terminates since all other properties do not provide any more information
gain. Even though the problem is pretty simple the calculation took nearly a
minute to come to a conclusion:
| Entropy: 0.9968038285222955, gain: 0.9054400254210326
| Edible = EDIBLE: 53.327000000000005% [51.921870000000006%, 54.726870000000005%]
| Edible = POISONOUS: 46.672999999999995% [45.27313%, 48.07813%]
| Branching on Odor
| Odor = ALMOND:
| | | Terminal:
| | | Edible = EDIBLE: 100.0% [98.36314%, 100.0%]
| | | Edible = POISONOUS: 0.0% [0.0%, 1.63686%]
| Odor = ANISE:
| | | Terminal:
| | | Edible = EDIBLE: 100.0% [98.36314%, 100.0%]
| | | Edible = POISONOUS: 0.0% [0.0%, 1.63686%]
| Odor = NONE:
| | | Entropy: 0.20192168248430362, gain: 0.1370967382781343
| | | Edible = EDIBLE: 96.84874% [96.03266%, 97.50132%]
| | | Edible = POISONOUS: 3.15126% [2.49868%, 3.9673399999999996%]
| | | Branching on Spore print color
| | | Spore print color = PURPLE:
| | | | | Terminal:
| | | | | Not possible according to known data
| | | Spore print color = BROWN:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [99.54983%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 0.45017%]
| | | Spore print color = BLACK:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [99.53473000000001%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 0.46527%]
| | | Spore print color = CHOCOLATE:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 12.17863%]
| | | Spore print color = GREEN:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 0.0% [0.0%, 8.46263%]
| | | | | Edible = POISONOUS: 100.0% [91.53737%, 100.0%]
| | | Spore print color = WHITE:
| | | | | Entropy: 0.3809465857053901, gain: 0.2434890145144485
| | | | | Edible = EDIBLE: 92.59259% [89.48345%, 94.83559%]
| | | | | Edible = POISONOUS: 7.4074100000000005% [5.16441%, 10.516549999999999%]
| | | | | Branching on Habitat
| | | | | Habitat = WOODS:
| | | | | | | Entropy: 0.7219280948873623, gain: 0.7219280948873623
| | | | | | | Edible = EDIBLE: 20.0% [8.576920000000001%, 39.98319%]
| | | | | | | Edible = POISONOUS: 80.0% [60.01681%, 91.42308%]
| | | | | | | Branching on Gill size
| | | | | | | Gill size = NARROW:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 0.0% [-0.0%, 17.2194%]
| | | | | | | | | Edible = POISONOUS: 100.0% [82.7806%, 100.0%]
| | | | | | | Gill size = BROAD:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 100.0% [54.58366%, 100.0%]
| | | | | | | | | Edible = POISONOUS: 0.0% [0.0%, 45.41634%]
| | | | | Habitat = MEADOWS:
| | | | | | | Terminal:
| | | | | | | Not possible according to known data
| | | | | Habitat = GRASSES:
| | | | | | | Terminal:
| | | | | | | Edible = EDIBLE: 100.0% [97.74096%, 100.0%]
| | | | | | | Edible = POISONOUS: 0.0% [-0.0%, 2.25904%]
| | | | | Habitat = PATHS:
| | | | | | | Terminal:
| | | | | | | Edible = EDIBLE: 100.0% [85.73315000000001%, 100.0%]
| | | | | | | Edible = POISONOUS: 0.0% [-0.0%, 14.26685%]
| | | | | Habitat = URBAN:
| | | | | | | Terminal:
| | | | | | | Not possible according to known data
| | | | | Habitat = LEAVES:
| | | | | | | Entropy: 0.6840384356390417, gain: 0.6840384356390417
| | | | | | | Edible = EDIBLE: 81.81818% [69.11085%, 90.0505%]
| | | | | | | Edible = POISONOUS: 18.181820000000002% [9.9495%, 30.889149999999997%]
| | | | | | | Branching on Cap Color
| | | | | | | Cap Color = WHITE:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 0.0% [0.0%, 45.41634%]
| | | | | | | | | Edible = POISONOUS: 100.0% [54.58366%, 100.0%]
| | | | | | | Cap Color = YELLOW:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 0.0% [0.0%, 45.41634%]
| | | | | | | | | Edible = POISONOUS: 100.0% [54.58366%, 100.0%]
| | | | | | | Cap Color = BROWN:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]
| | | | | | | | | Edible = POISONOUS: 0.0% [0.0%, 12.17863%]
| | | | | | | Cap Color = GRAY:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = RED:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = PINK:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = PURPLE:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = GREEN:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = BUFF:
| | | | | | | | | Terminal:
| | | | | | | | | Not possible according to known data
| | | | | | | Cap Color = CINNAMON:
| | | | | | | | | Terminal:
| | | | | | | | | Edible = EDIBLE: 100.0% [78.28708%, 100.0%]
| | | | | | | | | Edible = POISONOUS: 0.0% [0.0%, 21.71292%]
| | | | | Habitat = WASTE:
| | | | | | | Terminal:
| | | | | | | Edible = EDIBLE: 100.0% [96.64929%, 100.0%]
| | | | | | | Edible = POISONOUS: 0.0% [0.0%, 3.35071%]
| | | Spore print color = YELLOW:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 12.17863%]
| | | Spore print color = ORANGE:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 12.17863%]
| | | Spore print color = BUFF:
| | | | | Terminal:
| | | | | Edible = EDIBLE: 100.0% [87.82137%, 100.0%]
| | | | | Edible = POISONOUS: 0.0% [0.0%, 12.17863%]
| Odor = PUNGENT:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [-0.0%, 2.53426%]
| | | Edible = POISONOUS: 100.0% [97.46574%, 100.0%]
| Odor = CREOSOTE:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [0.0%, 3.35071%]
| | | Edible = POISONOUS: 100.0% [96.64929%, 100.0%]
| Odor = FOUL:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [0.0%, 0.30722%]
| | | Edible = POISONOUS: 100.0% [99.69278%, 100.0%]
| Odor = FISHY:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [0.0%, 1.14242%]
| | | Edible = POISONOUS: 100.0% [98.85758%, 100.0%]
| Odor = SPICY:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [0.0%, 1.14242%]
| | | Edible = POISONOUS: 100.0% [98.85758%, 100.0%]
| Odor = MUSTY:
| | | Terminal:
| | | Edible = EDIBLE: 0.0% [0.0%, 12.17863%]
| | | Edible = POISONOUS: 100.0% [87.82137%, 100.0%]
Classifying symptoms and disease
To get a little bit more into the medical regime and to illustrate what the idea of
applying decision trees in diagnosis might look like (still keep in mind this
is just an example, this is not handling the data in any way complete enough)
I've also gathered another dataset from Kaggle.
This dataset contains a list of diseases as diagnosed by a medical professional
as well as a list of symptoms. The dataset would also allow one to weight the symptoms
according to severity - I've not used this here for demonstration purposes. The
same decision tree algorithm as above can be applied to a simply modified dataset
that just contains binary attributes - has a patient shown a given symptom or
not.
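The conversion into binary attributes can be sketched as follows. The input layout (one pair of disease and symptom list per record) is an assumption about how the raw data might be arranged, the function name `to_binary_rows` is made up, and the two sample records are purely illustrative; only the symptom and disease names are borrowed from the tree output in this section.

```python
def to_binary_rows(records, yes="YES", no="NO"):
    """Turn (disease, symptoms) pairs into rows with one YES/NO column per symptom."""
    cleaned = [(disease, {s.strip() for s in symptoms if s and s.strip()})
               for disease, symptoms in records]
    # every symptom that occurs anywhere becomes one binary attribute
    columns = sorted(set().union(*(symptoms for _, symptoms in cleaned)))
    rows = [
        {"Disease": disease, **{c: (yes if c in symptoms else no) for c in columns}}
        for disease, symptoms in cleaned
    ]
    return ["Disease"] + columns, rows


# illustrative records, not taken from the real dataset
records = [
    ("Fungal infection", ["itching", " skin_rash", "nodal_skin_eruptions"]),
    ("Drug Reaction", ["itching", "skin_rash", "stomach_pain", "burning_micturition"]),
]
columns, rows = to_binary_rows(records)
print(columns)
print(rows[0])
```

Each resulting row can be handed to the tree builder with `Disease` as the target attribute and all remaining columns as binary candidates.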
The datasource currently contains only around 4900 entries for 131 symptoms
which of course feels pretty limited - and indeed it is. But it should illustrate
the idea behind applying decision trees in the world of medicine pretty well - and
also some limits when training ID3 trees directly. The calculation using this
direct and inefficient implementation takes around 19 hours …
| Entropy: 5.357552004618081, gain: 0.8458372108970891
| fatigue = NO:
| | | Entropy: 4.786255977156568, gain: 0.8175104414185426
| | | vomiting = NO:
| | | | | Entropy: 4.2634667101159796, gain: 0.7724397067592355
| | | | | skin_rash = YES:
| | | | | | | Entropy: 2.3817253589307477, gain: 0.7884441273192315
| | | | | | | itching = YES:
| | | | | | | | | Entropy: 1.161378479448699, gain: 0.7287131042890482
| | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | Entropy: 0.7742433029172697, gain: 0.48546076074591343
| | | | | | | | | | | burning_micturition = NO:
| | | | | | | | | | | | | Entropy: 0.3227569588973982, gain: 0.3227569588973982
| | | | | | | | | | | | | loss_of_appetite = NO:
| | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [93.51585%, 100.0%]
| | | | | | | | | | | | | loss_of_appetite = YES:
| | | | | | | | | | | | | | | Disease = Chicken pox: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | burning_micturition = YES:
| | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [64.32109%, 100.0%]
| | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | Disease = Drug Reaction: 100.0% [93.11334%, 100.0%]
| | | | | | | itching = NO:
| | | | | | | | | Entropy: 1.838026124503779, gain: 0.7870913537395843
| | | | | | | | | joint_pain = NO:
| | | | | | | | | | | Entropy: 1.5013353868059924, gain: 0.8506573567612395
| | | | | | | | | | | blister = NO:
| | | | | | | | | | | | | Entropy: 1.1386865525783176, gain: 0.48654136697818307
| | | | | | | | | | | | | pus_filled_pimples = NO:
| | | | | | | | | | | | | | | Entropy: 2.2359263506290326, gain: 0.863120568566631
| | | | | | | | | | | | | | | nodal_skin_eruptions = YES:
| | | | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | nodal_skin_eruptions = NO:
| | | | | | | | | | | | | | | | | Entropy: 1.9219280948873623, gain: 0.9709505944546687
| | | | | | | | | | | | | | | | | blackheads = NO:
| | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894
| | | | | | | | | | | | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | high_fever = NO:
| | | | | | | | | | | | | | | | | | | | | | | Disease = Psoriasis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | high_fever = YES:
| | | | | | | | | | | | | | | | | | | | | | | Disease = Impetigo: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | blackheads = YES:
| | | | | | | | | | | | | | | | | | | Disease = Acne: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | pus_filled_pimples = YES:
| | | | | | | | | | | | | | | Disease = Acne: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | blister = YES:
| | | | | | | | | | | | | Disease = Impetigo: 100.0% [94.19448%, 100.0%]
| | | | | | | | | joint_pain = YES:
| | | | | | | | | | | Disease = Psoriasis: 100.0% [94.19448%, 100.0%]
| | | | | skin_rash = NO:
| | | | | | | Entropy: 3.9828871664512895, gain: 0.6468749738357373
| | | | | | | headache = NO:
| | | | | | | | | Entropy: 3.7560383874069343, gain: 0.6991724211329374
| | | | | | | | | swelling_joints = NO:
| | | | | | | | | | | Entropy: 3.648994047474087, gain: 0.6239247592651546
| | | | | | | | | | | dizziness = NO:
| | | | | | | | | | | | | Entropy: 3.4860352643091366, gain: 0.6908574161896694
| | | | | | | | | | | | | high_fever = NO:
| | | | | | | | | | | | | | | Entropy: 3.2946870140830082, gain: 0.6952585028600295
| | | | | | | | | | | | | | | constipation = NO:
| | | | | | | | | | | | | | | | | Entropy: 3.336579880077256, gain: 0.7747942362434017
| | | | | | | | | | | | | | | | | bladder_discomfort = NO:
| | | | | | | | | | | | | | | | | | | Entropy: 3.5758257945180882, gain: 0.7590191722627639
| | | | | | | | | | | | | | | | | | | continuous_sneezing = NO:
| | | | | | | | | | | | | | | | | | | | | Entropy: 4.506890595608519, gain: 0.7219280948873625
| | | | | | | | | | | | | | | | | | | | | itching = YES:
| | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.9182958340544893, gain: 0.9182958340544893
| | | | | | | | | | | | | | | | | | | | | | | nodal_skin_eruptions = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Fungal infection: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | nodal_skin_eruptions = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hepatitis B: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Drug Reaction: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | itching = NO:
| | | | | | | | | | | | | | | | | | | | | | | Entropy: 4.251629167387823, gain: 0.6500224216483548
| | | | | | | | | | | | | | | | | | | | | | | diarrhoea = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 4.021928094887362, gain: 0.7219280948873619
| | | | | | | | | | | | | | | | | | | | | | | | | chest_pain = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.875, gain: 0.5435644431995956
| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.6644977792004623, gain: 0.5916727785823288
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | shivering = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.584962500721156, gain: 0.6500224216483542
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indigestion = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.321928094887362, gain: 0.7219280948873619
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | obesity = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 3.0, gain: 0.8112781244591329
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | neck_pain = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.584962500721156, gain: 0.6500224216483541
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | burning_micturition = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.321928094887362, gain: 0.7219280948873621
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | muscle_wasting = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 2.0, gain: 0.8112781244591329
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stiff_neck = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | joint_pain = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pain_during_bowel_movements = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Acne: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pain_during_bowel_movements = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Dimorphic hemmorhoids(piles): 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | joint_pain = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Psoriasis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | stiff_neck = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Arthritis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | muscle_wasting = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = AIDS: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | burning_micturition = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Urinary tract infection: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | neck_pain = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | loss_of_balance = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Osteoarthristis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | loss_of_balance = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | obesity = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | weight_loss = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Varicose veins: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | weight_loss = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Diabetes: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indigestion = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | acidity = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | acidity = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Migraine: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | shivering = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Allergy: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | nausea = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | chest_pain = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = GERD: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | diarrhoea = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Hyperthyroidism: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = hepatitis A: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | | | continuous_sneezing = YES:
| | | | | | | | | | | | | | | | | | | | | Disease = Allergy: 100.0% [94.19448%, 100.0%]
| | | | | | | | | | | | | | | | | bladder_discomfort = YES:
| | | | | | | | | | | | | | | | | | | Disease = Urinary tract infection: 100.0% [94.48318%, 100.0%]
| | | | | | | | | | | | | | | constipation = YES:
| | | | | | | | | | | | | | | | | Disease = Dimorphic hemmorhoids(piles): 100.0% [94.48318%, 100.0%]
| | | | | | | | | | | | | high_fever = YES:
| | | | | | | | | | | | | | | Entropy: 0.9274479232123118, gain: 0.558629373452199
| | | | | | | | | | | | | | | cough = NO:
| | | | | | | | | | | | | | | | | Entropy: 0.28639695711595625, gain: 0.28639695711595625
| | | | | | | | | | | | | | | | | blister = NO:
| | | | | | | | | | | | | | | | | | | Disease = AIDS: 100.0% [94.48318%, 100.0%]
| | | | | | | | | | | | | | | | | blister = YES:
| | | | | | | | | | | | | | | | | | | Disease = Impetigo: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | cough = YES:
| | | | | | | | | | | | | | | | | Entropy: 0.9182958340544896, gain: 0.9182958340544896
| | | | | | | | | | | | | | | | | chills = NO:
| | | | | | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | | | chills = YES:
| | | | | | | | | | | | | | | | | | | Disease = Pneumonia: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | dizziness = YES:
| | | | | | | | | | | | | Entropy: 0.8404914014731815, gain: 0.5096374678020158
| | | | | | | | | | | | | neck_pain = NO:
| | | | | | | | | | | | | | | Entropy: 1.5219280948873621, gain: 0.9709505944546685
| | | | | | | | | | | | | | | chest_pain = NO:
| | | | | | | | | | | | | | | | | Entropy: 0.9182958340544896, gain: 0.9182958340544896
| | | | | | | | | | | | | | | | | lethargy = NO:
| | | | | | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | lethargy = YES:
| | | | | | | | | | | | | | | | | | | Disease = Hypothyroidism: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | | | chest_pain = YES:
| | | | | | | | | | | | | | | | | Disease = Hypertension: 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | neck_pain = YES:
| | | | | | | | | | | | | | | Disease = Cervical spondylosis: 100.0% [94.19448%, 100.0%]
| | | | | | | | | swelling_joints = YES:
| | | | | | | | | | | Entropy: 1.0, gain: 0.8492647594126546
| | | | | | | | | | | stiff_neck = NO:
| | | | | | | | | | | | | Entropy: 0.28639695711595625, gain: 0.28639695711595625
| | | | | | | | | | | | | muscle_weakness = NO:
| | | | | | | | | | | | | | | Disease = Osteoarthristis: 100.0% [94.48318%, 100.0%]
| | | | | | | | | | | | | muscle_weakness = YES:
| | | | | | | | | | | | | | | Disease = Arthritis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | stiff_neck = YES:
| | | | | | | | | | | | | Disease = Arthritis: 100.0% [94.19448%, 100.0%]
| | | | | | | headache = YES:
| | | | | | | | | Entropy: 1.6359061660790049, gain: 0.8525666663983983
| | | | | | | | | loss_of_balance = NO:
| | | | | | | | | | | Entropy: 1.1386865525783176, gain: 0.575779260731362
| | | | | | | | | | | acidity = NO:
| | | | | | | | | | | | | Entropy: 2.2516291673878226, gain: 0.9182958340544893
| | | | | | | | | | | | | chills = NO:
| | | | | | | | | | | | | | | Entropy: 1.5, gain: 1.0
| | | | | | | | | | | | | | | weakness_of_one_body_side = NO:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | chest_pain = NO:
| | | | | | | | | | | | | | | | | | | Disease = Migraine: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | chest_pain = YES:
| | | | | | | | | | | | | | | | | | | Disease = Hypertension: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | weakness_of_one_body_side = YES:
| | | | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [64.32109%, 100.0%]
| | | | | | | | | | | | | chills = YES:
| | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | continuous_sneezing = NO:
| | | | | | | | | | | | | | | | | Disease = Malaria: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | continuous_sneezing = YES:
| | | | | | | | | | | | | | | | | Disease = Common Cold: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | acidity = YES:
| | | | | | | | | | | | | Disease = Migraine: 100.0% [94.19448%, 100.0%]
| | | | | | | | | loss_of_balance = YES:
| | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | nausea = NO:
| | | | | | | | | | | | | Disease = Hypertension: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | nausea = YES:
| | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]
| | | vomiting = YES:
| | | | | Entropy: 3.4990336640731607, gain: 0.8507115768962774
| | | | | nausea = NO:
| | | | | | | Entropy: 2.8781892225870314, gain: 0.8236948259200888
| | | | | | | abdominal_pain = NO:
| | | | | | | | | Entropy: 2.367635889995596, gain: 0.849308608237843
| | | | | | | | | chest_pain = NO:
| | | | | | | | | | | Entropy: 1.8180959929710643, gain: 0.8525666663983981
| | | | | | | | | | | diarrhoea = NO:
| | | | | | | | | | | | | Entropy: 1.457518749639422, gain: 0.6387068973726207
| | | | | | | | | | | | | altered_sensorium = NO:
| | | | | | | | | | | | | | | Entropy: 2.807354922057604, gain: 0.8631205685666311
| | | | | | | | | | | | | | | headache = NO:
| | | | | | | | | | | | | | | | | Entropy: 2.321928094887362, gain: 0.7219280948873621
| | | | | | | | | | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | | | | | | | | | Entropy: 2.0, gain: 0.8112781244591329
| | | | | | | | | | | | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894
| | | | | | | | | | | | | | | | | | | | | loss_of_appetite = NO:
| | | | | | | | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = NO:
| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | | | sunken_eyes = YES:
| | | | | | | | | | | | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | | | loss_of_appetite = YES:
| | | | | | | | | | | | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | | | | | | | | | Disease = GERD: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | headache = YES:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | loss_of_balance = NO:
| | | | | | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | loss_of_balance = YES:
| | | | | | | | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | altered_sensorium = YES:
| | | | | | | | | | | | | | | Disease = Paralysis (brain hemorrhage): 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | diarrhoea = YES:
| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | | | chills = NO:
| | | | | | | | | | | | | | | Disease = Gastroenteritis: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | | | chills = YES:
| | | | | | | | | | | | | | | Disease = Malaria: 100.0% [47.40685%, 100.0%]
| | | | | | | | | chest_pain = YES:
| | | | | | | | | | | Entropy: 1.1586048283017796, gain: 0.8426433989885903
| | | | | | | | | | | cough = NO:
| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | | | stomach_pain = NO:
| | | | | | | | | | | | | | | Disease = Heart attack: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | | | stomach_pain = YES:
| | | | | | | | | | | | | | | Disease = GERD: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | cough = YES:
| | | | | | | | | | | | | Entropy: 0.3227569588973982, gain: 0.3227569588973982
| | | | | | | | | | | | | chills = NO:
| | | | | | | | | | | | | | | Disease = GERD: 100.0% [93.51585%, 100.0%]
| | | | | | | | | | | | | chills = YES:
| | | | | | | | | | | | | | | Disease = Tuberculosis: 100.0% [47.40685%, 100.0%]
| | | | | | | abdominal_pain = YES:
| | | | | | | | | Entropy: 1.4362406790693445, gain: 0.8566594912242682
| | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896
| | | | | | | | | | | swelling_of_stomach = NO:
| | | | | | | | | | | | | Disease = Peptic ulcer diseae: 100.0% [94.19448%, 100.0%]
| | | | | | | | | | | swelling_of_stomach = YES:
| | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | Entropy: 0.847584679824574, gain: 0.46899559358928133
| | | | | | | | | | | itching = YES:
| | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | loss_of_appetite = NO:
| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | loss_of_appetite = YES:
| | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | itching = NO:
| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | | | loss_of_appetite = NO:
| | | | | | | | | | | | | | | Disease = Alcoholic hepatitis: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | | | loss_of_appetite = YES:
| | | | | | | | | | | | | | | Disease = hepatitis A: 100.0% [47.40685%, 100.0%]
| | | | | nausea = YES:
| | | | | | | Entropy: 2.297472248919289, gain: 0.9995003941817583
| | | | | | | muscle_pain = NO:
| | | | | | | | | Entropy: 1.4362406790693445, gain: 0.8566594912242681
| | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | Entropy: 0.5689955935892812, gain: 0.33125121848110783
| | | | | | | | | | | loss_of_balance = NO:
| | | | | | | | | | | | | Entropy: 1.584962500721156, gain: 0.9182958340544894
| | | | | | | | | | | | | itching = YES:
| | | | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | itching = NO:
| | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | blurred_and_distorted_vision = NO:
| | | | | | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | blurred_and_distorted_vision = YES:
| | | | | | | | | | | | | | | | | Disease = Hypoglycemia: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | loss_of_balance = YES:
| | | | | | | | | | | | | Disease = (vertigo) Paroymsal Positional Vertigo: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | Entropy: 0.5907239186406502, gain: 0.4854607607459134
| | | | | | | | | | | dark_urine = NO:
| | | | | | | | | | | | | Disease = Chronic cholestasis: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | dark_urine = YES:
| | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | high_fever = NO:
| | | | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | high_fever = YES:
| | | | | | | | | | | | | | | Disease = Hepatitis E: 100.0% [47.40685%, 100.0%]
| | | | | | | muscle_pain = YES:
| | | | | | | | | Entropy: 1.1522290399012944, gain: 0.9994730201859836
| | | | | | | | | yellowing_of_eyes = NO:
| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896
| | | | | | | | | | | skin_rash = YES:
| | | | | | | | | | | | | Disease = Dengue: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | skin_rash = NO:
| | | | | | | | | | | | | Disease = Malaria: 100.0% [94.19448%, 100.0%]
| | | | | | | | | yellowing_of_eyes = YES:
| | | | | | | | | | | Disease = hepatitis A: 100.0% [94.19448%, 100.0%]
| fatigue = YES:
| | | Entropy: 4.087113833003859, gain: 0.9011019245114962
| | | loss_of_appetite = NO:
| | | | | Entropy: 3.4394507096117723, gain: 0.8817069873806092
| | | | | high_fever = NO:
| | | | | | | Entropy: 2.7185962248540316, gain: 0.9914266810680207
| | | | | | | irritability = NO:
| | | | | | | | | Entropy: 1.9047143071995363, gain: 0.9824740868386415
| | | | | | | | | increased_appetite = NO:
| | | | | | | | | | | Entropy: 1.596184996778472, gain: 0.6731080737015489
| | | | | | | | | | | obesity = NO:
| | | | | | | | | | | | | Entropy: 3.0, gain: 1.0
| | | | | | | | | | | | | yellowish_skin = NO:
| | | | | | | | | | | | | | | Entropy: 2.0, gain: 1.0
| | | | | | | | | | | | | | | chills = NO:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | cough = NO:
| | | | | | | | | | | | | | | | | | | Disease = Varicose veins: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | cough = YES:
| | | | | | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | chills = YES:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | continuous_sneezing = NO:
| | | | | | | | | | | | | | | | | | | Disease = Pneumonia: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | continuous_sneezing = YES:
| | | | | | | | | | | | | | | | | | | Disease = Common Cold: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | yellowish_skin = YES:
| | | | | | | | | | | | | | | Entropy: 2.0, gain: 1.0
| | | | | | | | | | | | | | | itching = YES:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | vomiting = NO:
| | | | | | | | | | | | | | | | | | | Disease = Hepatitis B: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | vomiting = YES:
| | | | | | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | itching = NO:
| | | | | | | | | | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | | | | | | | | | vomiting = NO:
| | | | | | | | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | | | | | vomiting = YES:
| | | | | | | | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | obesity = YES:
| | | | | | | | | | | | | Disease = Varicose veins: 100.0% [94.19448%, 100.0%]
| | | | | | | | | increased_appetite = YES:
| | | | | | | | | | | Disease = Diabetes: 100.0% [94.48318%, 100.0%]
| | | | | | | irritability = YES:
| | | | | | | | | Entropy: 1.5844996446144277, gain: 0.9241335419915457
| | | | | | | | | abnormal_menstruation = NO:
| | | | | | | | | | | Disease = Hypoglycemia: 100.0% [94.48318%, 100.0%]
| | | | | | | | | abnormal_menstruation = YES:
| | | | | | | | | | | Entropy: 0.9994730201859836, gain: 0.9994730201859836
| | | | | | | | | | | depression = NO:
| | | | | | | | | | | | | Disease = Hyperthyroidism: 100.0% [94.48318%, 100.0%]
| | | | | | | | | | | depression = YES:
| | | | | | | | | | | | | Disease = Hypothyroidism: 100.0% [94.19448%, 100.0%]
| | | | | high_fever = YES:
| | | | | | | Entropy: 2.381155648699536, gain: 0.9656361333706103
| | | | | | | chest_pain = NO:
| | | | | | | | | Entropy: 1.6826392037546638, gain: 0.9402859586706308
| | | | | | | | | chills = NO:
| | | | | | | | | | | Entropy: 1.1547717145751624, gain: 0.8452282854248372
| | | | | | | | | | | itching = YES:
| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | | | skin_rash = YES:
| | | | | | | | | | | | | | | Disease = Chicken pox: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | | | skin_rash = NO:
| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | itching = NO:
| | | | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | | | vomiting = NO:
| | | | | | | | | | | | | | | Disease = Bronchial Asthma: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | | | | | | | vomiting = YES:
| | | | | | | | | | | | | | | Disease = Jaundice: 100.0% [47.40685%, 100.0%]
| | | | | | | | | chills = YES:
| | | | | | | | | | | Disease = Typhoid: 100.0% [94.74452%, 100.0%]
| | | | | | | chest_pain = YES:
| | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | muscle_pain = NO:
| | | | | | | | | | | Disease = Pneumonia: 100.0% [94.19448%, 100.0%]
| | | | | | | | | muscle_pain = YES:
| | | | | | | | | | | Disease = Common Cold: 100.0% [94.19448%, 100.0%]
| | | loss_of_appetite = YES:
| | | | | Entropy: 2.806836027747821, gain: 0.943622285167955
| | | | | malaise = NO:
| | | | | | | Entropy: 1.6854277290691868, gain: 0.9241335419915458
| | | | | | | coma = NO:
| | | | | | | | | Entropy: 1.1522290399012944, gain: 0.8488843249236633
| | | | | | | | | vomiting = NO:
| | | | | | | | | | | Entropy: 0.2974722489192896, gain: 0.2974722489192896
| | | | | | | | | | | abdominal_pain = NO:
| | | | | | | | | | | | | Disease = Hepatitis C: 100.0% [94.19448%, 100.0%]
| | | | | | | | | | | abdominal_pain = YES:
| | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [47.40685%, 100.0%]
| | | | | | | | | vomiting = YES:
| | | | | | | | | | | Entropy: 0.3095434291503252, gain: 0.3095434291503252
| | | | | | | | | | | skin_rash = YES:
| | | | | | | | | | | | | Disease = Dengue: 100.0% [47.40685%, 100.0%]
| | | | | | | | | | | skin_rash = NO:
| | | | | | | | | | | | | Disease = Hepatitis D: 100.0% [93.87389999999999%, 100.0%]
| | | | | | | coma = YES:
| | | | | | | | | Disease = Hepatitis E: 100.0% [94.48318%, 100.0%]
| | | | | malaise = YES:
| | | | | | | Entropy: 1.9995975337661407, gain: 0.9998646331239298
| | | | | | | yellowing_of_eyes = NO:
| | | | | | | | | Entropy: 1.0, gain: 1.0
| | | | | | | | | nausea = NO:
| | | | | | | | | | | Disease = Chicken pox: 100.0% [94.19448%, 100.0%]
| | | | | | | | | nausea = YES:
| | | | | | | | | | | Disease = Dengue: 100.0% [94.19448%, 100.0%]
| | | | | | | yellowing_of_eyes = YES:
| | | | | | | | | Entropy: 0.9994730201859836, gain: 0.9994730201859836
| | | | | | | | | chest_pain = NO:
| | | | | | | | | | | Disease = Hepatitis B: 100.0% [94.19448%, 100.0%]
| | | | | | | | | chest_pain = YES:
| | | | | | | | | | | Disease = Tuberculosis: 100.0% [94.48318%, 100.0%]
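The entropy and gain values printed in the tree above can be sanity-checked with a few lines of code. The sketch below uses the standard definitions of Shannon entropy and information gain; the function names and the tiny sample data are hypothetical and this is not the code from the notebook. The root entropy of about 5.3576 bits equals $\log_2 41$, which is what one would get for 41 equally frequent diseases. That reading is an inference from the printed number, not a documented property of the dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, attribute):
    """Entropy reduction achieved by splitting on a single attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder

# 41 equally frequent classes give log2(41) bits, close to the root value.
uniform_entropy = entropy(list(range(41)))
print(uniform_entropy)

# Two classes that one attribute separates perfectly, as in the
# "Entropy: 1.0, gain: 1.0" nodes just above the leaves.
rows = [{"high_fever": "NO"}, {"high_fever": "YES"}]
labels = ["Psoriasis", "Impetigo"]
print(entropy(labels), information_gain(rows, labels, "high_fever"))
```

A node that reports an entropy of 1.0 and a gain of 1.0 is therefore one where two equally frequent diseases remain and a single symptom tells them apart completely.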
Short demonstration in Python
The code used to generate the trees on this page is available as a Jupyter
Notebook in a GitHub Gist as well as in PDF format.
This article is tagged: Programming, Statistics, Math, Data Mining