Practical Machine Learning in Healthcare

Praveen Prakash - Co-Founder & CTO, mTatva

mTatva is a healthcare solutions company that combines the latest communication technologies and medical industry expertise to offer a hassle-free platform for patients to connect with doctors and other healthcare professionals.

The Landscape
Machine Learning, or ML, is a subset of the much broader idea of Artificial Intelligence. ML has a variety of mature, established algorithms and frameworks. Cloud computing makes it easy to try, perfect and then scale the deployment of AI-based systems. The real problem is understanding the data that you have and the inferences that can be drawn from it. Without an understanding of the data, it is simply Garbage In, Garbage Out. In other words, sure, AI/ML works, but you have to know how to put it to use!

One way to categorize the algorithms is by whether they are statistics-based or neural-network-based. Each has its strengths and shortcomings, and generally the nature of the data can help decide which one to choose. For example, training neural nets typically requires a huge amount of training data. What if the hospital operations are still growing and the training sets are not that rich? Maybe there are statistical algorithms that can give insights to begin with, and one can then slowly graduate to more sophisticated neural nets. Neural-network models are often not intuitive to debug or visualize, so it helps if statistical models and their results can be used to validate the initial results from them.

One of the most common tasks in AI is pattern recognition: patterns in data, images, sound, video and other signals. In the healthcare context, all of these are relevant. In this article we focus on patterns in data.

Sources of Data
One of the obvious interesting cases is finding patterns across clinical and demographic data. There can be multiple sources of data, but with consistent tagging (or labeling) it is possible to create datasets (feature vectors) that become the input to the analytics and ML algorithms. Examples of data sources: outpatient (OP) prescriptions, discharge summaries, request forms for admission or investigation, billing data for various services, and patient-provided feedback. Some of these are sources of clinical data.

Cleaning up Data
Data can exist as structured information (in databases, for example) or as unstructured text. For example, an Electronic Medical Record will have many pre-set fields or labeled attributes for recording the contents of a prescription. Demographic details of the patient may be structured as name, contact number, address, or health history. Unstructured data may exist inside structured data, for example an address without specific fields for postal code, city name, area and so on. A discharge summary may capture clinical history as an unstructured paragraph. Parsing addresses can be done using fuzzy matching and keyword lookups, which are quite rudimentary, but parsing clinical history needs NLP techniques to determine intent, negation, tense and so on in order to extract the relevant pieces of data correctly.
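
The address case can be sketched with nothing more than Python's standard library. This is a minimal illustration, not a production parser: the city list, the 6-digit postal-code pattern and the function name parse_address are all assumptions made for the example.

```python
import difflib
import re

# Hypothetical lookup table; a real deployment would load its own list.
KNOWN_CITIES = ["Bengaluru", "Mumbai", "Chennai", "Hyderabad"]

# Assumption: postal codes are 6-digit numbers (as with Indian PIN codes).
POSTAL_CODE = re.compile(r"\b\d{6}\b")


def parse_address(raw):
    """Pull a postal code and a best-guess city out of a free-text address."""
    code_match = POSTAL_CODE.search(raw)
    postal_code = code_match.group(0) if code_match else None

    by_lowercase = {c.lower(): c for c in KNOWN_CITIES}
    city = None
    # Compare every comma- or space-separated token with the known cities.
    for token in re.split(r"[,\s]+", raw):
        hits = difflib.get_close_matches(
            token.lower(), list(by_lowercase), n=1, cutoff=0.8
        )
        if hits:
            city = by_lowercase[hits[0]]  # tolerate small misspellings
            break
    return {"postal_code": postal_code, "city": city}


print(parse_address("Flat 4, 12 MG Road, Chenai 600001"))
```

The cutoff of 0.8 is an arbitrary starting point; lowering it catches more misspellings at the cost of more false matches, so it should be tuned against real addresses. Multi-word city names would need a different tokenization than the one used here.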

Working with Tagged Data
With the data neatly tagged, Association Mining is a useful first step toward understanding the nature of the data, general trends and the associations among items and concepts. For example, what kinds of drugs are prescribed together, by which type of doctor, to which demographic? Two important ways in which this information can be used are Anomaly Detection and Predictive Input, both of which contribute to quality control and efficiency in general.
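
As a rough sketch of what association mining computes, the snippet below counts how often pairs of items appear together and derives the two usual measures, support (the share of prescriptions containing both items) and confidence (how often the second item appears given the first). It handles pairs only and uses placeholder drug names; the function name pair_rules and the threshold are assumptions for illustration, and a real project would more likely use an Apriori or FP-Growth implementation from a library.

```python
from collections import Counter
from itertools import combinations


def pair_rules(prescriptions, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    total = len(prescriptions)
    item_counts = Counter()
    pair_counts = Counter()
    for items in prescriptions:
        unique = sorted(set(items))  # ignore repeats within one prescription
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total
        if support < min_support:
            continue  # too rare to say anything useful
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Strongest rules first; names break ties so the order is stable.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Placeholder data for illustration only, not real prescriptions.
prescriptions = [
    ["drug_a", "drug_b"],
    ["drug_a", "drug_b", "drug_c"],
    ["drug_a", "drug_c"],
    ["drug_d"],
]

for antecedent, consequent, support, confidence in pair_rules(prescriptions):
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

The same counts serve both uses named above: a high-confidence rule can suggest the likely next item while a prescription is being entered, and a combination with very low support can be flagged for review.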

The other major activity with this data is finding patterns that might predict interesting outcomes. For example, are there any patterns leading to the hospitalization of a patient? Are there any patterns in medicines (or investigations, or procedures) leaking out, meaning the patient consults here but gets the medicines (or investigations, or procedures) from elsewhere?

The Mechanics of Machine Learning
Once the data is collated and labeled, one can set up a framework to run the data through algorithms. Visualizing the data often gives intuitive insight and even helps debug the setup used for the algorithms or framework. The input criteria from the data are called features. Initial exploration is about finding ‘relevant’ features. A relevant set of features is one that yields a high probability of true positives and a low probability of false positives. The output value may be more sensitive to certain features than to others.
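
One simple way to start this exploration, assuming binary features and a binary outcome, is to treat each feature on its own as a predictor and compare its true-positive and false-positive rates. The sketch below ranks features by the difference between the two (known as Youden's J statistic). The feature names and values are placeholders, and this single-feature screen is only one possible approach; it ignores interactions between features.

```python
def feature_rates(feature_flags, outcomes):
    """True-positive and false-positive rates of one binary feature."""
    pairs = list(zip(feature_flags, outcomes))
    tp = sum(1 for f, y in pairs if f and y)
    fp = sum(1 for f, y in pairs if f and not y)
    positives = sum(1 for _, y in pairs if y)
    negatives = len(pairs) - positives
    # Assumption: a rate is reported as 0.0 when its denominator is empty.
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def rank_features(features, outcomes):
    """Sort features by TPR - FPR, most relevant first."""
    scored = []
    for name, flags in features.items():
        tpr, fpr = feature_rates(flags, outcomes)
        scored.append((name, tpr, fpr, tpr - fpr))
    return sorted(scored, key=lambda row: (-row[3], row[0]))


# Placeholder data: one row per patient, 1 = outcome occurred.
outcomes = [1, 1, 0, 0]
features = {
    "feature_a": [1, 1, 0, 0],  # fires exactly when the outcome occurs
    "feature_b": [1, 0, 1, 0],  # fires independently of the outcome
}

for name, tpr, fpr, score in rank_features(features, outcomes):
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} score={score:.2f}")
```

A feature whose score is near zero carries little information about the outcome on its own and is a candidate for dropping.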

Finding the minimal set of relevant features is important for limiting the time and computing resources required when deployed. The initial exploration and setup may take a good amount of effort and time before the framework can be set up for deployment, but it is critical to tune the algorithm and the shape of your data for effectiveness. This groundwork also helps plan the computing resources (and therefore the costs) needed in the medium term for actual deployment.

Special Case: Temporal Features
If you are using an RNN/LSTM, these architectures have their own conventions for temporal data, so there is no ambiguity there. But with generic neural nets, temporal features can sometimes be tricky to input. Say you have the dates of visits for a patient. There are quite a few ways to encode this information into features. One could take the average gap between visits and the number of visits. One could encode it as a bit vector, say 52 bits for a year, one per week, and set the bit for each week in which a visit occurred.
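
Both encodings can be sketched in a few lines. The function name visit_features and the handling of edge cases are assumptions for this example: weeks are counted from 1 January in 7-day blocks, the last day or two of the year fold into week 52, and the average gap is reported as 0.0 when there are fewer than two visits.

```python
from datetime import date


def visit_features(visit_dates, year):
    """Encode one patient's visits in a year as counts plus a 52-bit vector."""
    in_year = sorted(d for d in visit_dates if d.year == year)

    week_bits = [0] * 52
    for d in in_year:
        day_of_year = d.timetuple().tm_yday  # 1..366
        # Days 365 and 366 would land in a 53rd week; fold them into the last.
        week = min((day_of_year - 1) // 7, 51)
        week_bits[week] = 1

    gaps = [(later - earlier).days
            for earlier, later in zip(in_year, in_year[1:])]
    avg_gap_days = sum(gaps) / len(gaps) if gaps else 0.0

    return {
        "visit_count": len(in_year),
        "avg_gap_days": avg_gap_days,
        "week_bits": week_bits,
    }


# Placeholder dates for illustration.
visits = [date(2023, 1, 3), date(2023, 1, 17), date(2023, 12, 31)]
encoded = visit_features(visits, 2023)
print(encoded["visit_count"], encoded["avg_gap_days"])
print("".join(str(b) for b in encoded["week_bits"]))
```

The two encodings trade off differently: the summary statistics are compact but lose the timing of individual visits, while the bit vector keeps weekly timing at the cost of 52 inputs per year and no record of multiple visits in the same week.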

A lot of experimentation goes into learning about the nature of the data and leveraging the right algorithms. AI is no magic! Once set up, the return on the investment of time and effort is huge, making it possible to deliver new and unique insights.