Hey Alexa, What’s My Diagnosis?

Pavi Dhiman
16 min read · Oct 15, 2022

“Hey Alexa, play Adore You by Harry Styles”

“Playing Adore You by Harry Styles”

This is a common interaction between Amazon Alexa and its user, potentially you. Decades ago, we didn't think this kind of communication between humans and algorithms was possible. Today, having a casual conversation with Alexa or Siri about the weather, or asking them to play your favourite song, is the new norm.

As we know, the opportunities with this new form of communication are endless, from asking about your day to having a real conversation about an idea in your head. So let's think bigger:

“Hey Alexa, play Adore You by Harry Styles”

"Playing Adore You by Harry Styles in 30 seconds. First, I want to consult you on something. Considering your recent doctor visits, I would like to run a test to see if you're at risk for coronary artery disease. Can you repeat the following phrases?"

*Repeats phrases*

"Based on your breathing patterns and acoustic gaps, there's an 89% chance you have plaque buildup in your arteries; I recommend meeting with your medical practitioner in the coming week. Now playing Adore You by Harry Styles."

So, Alexa could predict your likelihood of getting a disease, and tell you when to see an authorized medical practitioner, solely by recording your voice from the comfort of your home.

Welcome to the world where this is normal, where Artificial Intelligence (AI) meets preventative medicine.

The Current Problem

Our current healthcare system is broken, and we need preventative and remote care innovations. Three main factors lead to this diminished healthcare system: a lack of medical practitioners, a lack of resources and a lack of knowledge.

The lack of knowledge affects the few medical practitioners who are available: they may not know enough to treat these deadly diseases. The lack of resources, in turn, limits even well-informed practitioners, who may be missing proper medication, equipment and a clean environment.

A breakdown of the three main contributing factors to a broken healthcare system.

Roughly 2.5 billion people live in communities that lack both the education to stop preventable illnesses and basic sanitation, compounding the overall healthcare access problem. Ethiopia, Malawi, Somalia, Liberia, Tanzania and the Congo are among the countries with the least access to healthcare, and their growing populations mean growing demand for it.

Even in developed countries like the U.S., access is a concern: in a 2022 survey of 4,000 adults, 73% said they wanted greater access to care everywhere.

It is important to understand that this barrier to accessing care is a multi-fold problem. It is not all about the lack of resources but also about the cultural mindset of residents in these countries. For example, people with blindness rarely receive medical attention, even when it is available. More specifically, two-thirds of adults in rural India with low vision or glaucoma have never received eye care, even when the care is available.

This multi-fold problem has layers: a lack of awareness about treatments, cost, cultural beliefs and remedies, and even simple distance from the facility. Cultural beliefs about medicine play a large role, as they can influence a patient's willingness to seek or accept medical care. On the logistical side, the distance to clinics and hospitals poses a large physical barrier to obtaining care, made worse by the lack of transportation services in rural areas.

It all comes down to accessibility: of diagnosis, of transportation, and of receiving physical medication and treatment.

Breakthroughs in Healthcare Accessibility

Telemedicine is a huge buzzword in healthcare, and it's what Alexa has the potential to do: diagnose remotely. Telemedicine uses technology to bring clinical services to remote locations, essentially delivering healthcare outside of traditional facilities. It sounds great, but it has yet to be implemented at scale. In fact, despite over two decades of telemedicine adoption efforts, countries have not achieved significant success in improving access to care.

And while we know that people in poor countries have less access to healthcare, and the poor within those countries have even less, telemedicine is said to be "finally becoming a viable option for developing countries."

Many believe telemedicine is the solution; we might just be implementing it incorrectly.

It has proven it can work: when COVID-19 hit, telemedicine provided life-saving services and essentially helped stop the collapse of our healthcare system.

Plus, it improved patient care and experience, specifically through accessibility. For example, consider the cycle below. A coronary artery disease patient returns home and has to maintain certain lifestyle changes and medications to improve their health. To do so, they must collect data such as their weight, blood pressure, etc. This gives the physician the insight to adjust the care without the patient ever entering the facility.

It’s all about making remote healthcare scalable.

Problems with Telemedicine.

However, the pace of development and new breakthroughs is slow, and acceptance of the technology is not picking up, even though telemedicine is expected to reduce the burden on hospitals and suffering patients, cut the need for transportation, and save the public time and money.

One of the main barriers is policy: privacy concerns, mainly, and little consensus on how the technology should work.

There's also a lack of organizational structure: poor communication between stakeholders at each level of the healthcare delivery system decreases the chances of implementation.

Low internet connectivity is another persistent factor. Many telemedicine services require high-speed, reliable internet bandwidth to run smoothly; with unreliable, low-bandwidth internet, connectivity becomes a constant underlying issue. Rural areas do not have the financial capital to independently invest in a broadband network that would provide high-speed internet.

The large upfront cost of ICT (information communication technologies) and its infrastructure can be a major barrier to implementation.

But, with new advancements in the field, telemedicine will surge, just as it did during COVID.

A New Healthcare Advancement: Voice-Based Biomarkers.

Diseases are unpredictable; they can affect all sorts of organs through different "forms of attack." Each disease might target one primary organ, but it also disrupts other body functions. Those organs might be the heart, lungs, brain, muscles or even the vocal folds, altering someone's voice.

By analyzing these voice changes with AI's help, we unlock incredible new opportunities, from diagnosis to risk prediction to fully remote monitoring.

Voice-based biomarkers (VBBs) are built on the human voice, a rich medium of complex sound produced by our vocal cords. Surprisingly enough, that sound carries crucial information for diagnosing diseases.

The shift is already underway: nearly all phones and home devices contain some form of voice assistant and have enabled considerable voice-controlled search; in fact, 31% of smartphone users already use voice technologies at least once a week. Combining this technology with audio-signal analysis and Natural Language Processing (NLP) opens up a new application: diagnosis, classification and remote monitoring through VBBs, which will in turn increase the number of people using their voice assistants.

With the help of AI, this is achievable. The human voice produces many frequencies, but the human ear can only hear and interpret a narrow band of them. Through AI, we can register frequencies across the entire spectrum.

All of these diagnoses rest on biomarkers. In traditional medicine, a biomarker is any measurable factor that evaluates a biological process or a response to treatment, hinting at problems in the body. Vocal biomarkers are any signature, feature or combination of features from the voice signal associated with a clinical outcome, used to detect a condition and gauge the severity of a disease.

My Hypothesis.

After weighing the positives, negatives, barriers and implications of telemedicine combined with voice-based biomarkers, I hypothesize that:

If we can close the disparities between a doctor and a voice-based assistant, then we can create the world’s most accessible doctor.

However, diagnosing particular diseases with close-to-perfect accuracy might take 2–4 years to master. At the moment, the industry lacks the data to improve any model. And it's not solely about classic voice recordings: longitudinal data is where the gap between voice doctors and human doctors widens. Referring to a patient's history, age and experiences is a skill every doctor must exercise. VBBs, however, simply analyze the recording without knowing whether the patient is 13 or 65 years old; to the algorithm, it's all just strings of data.

On an implementation level, scaling VBBs to developed countries, where there is demand for greater healthcare accessibility, has a high prospect of wide-scale adoption. In developing countries, however, the issue might not be a lack of access to diagnosis but rather cultural disparities or a lack of information. And the lack of telemedicine adoption so far suggests that simply offering an online platform for tracking a patient's health does not provide much benefit to consumers. Instead, building a platform with longitudinal data, one that replicates the methods of human doctors with voice-based assistants, is what will propel implementation forward exponentially.

The Breakdown: How Your Vocal Cords Work.

But before we can predict the future of voice-based biomarkers, let’s understand each aspect, starting with the indicator, your vocal cords.

Consider your vocal cords to be a ballet dancer.

Okay, wait, let’s backtrack a bit.

We all know dancing is one of the human activities requiring immense training, skill and expertise. Even after hours of practice and numerous recitals, the top ballet dancers must constantly perfect their technique.

For all its upsides, dancing has its downsides: dancers frequently get injured.

From broken bones to neurological disorders, injuries can end a dancer's career in a split second. Neurological disorders like Parkinson's can especially destroy this extraordinary talent. Millions of people worldwide suffer from this terrible, incurable disease and its weakness and tremors, and we haven't progressed in the right direction.

The issue we face with Parkinson's is the lack of early-stage biomarkers detectable by humans. The closest thing we have is a 20-minute neurologist test that costs $300. But what if we could do this at home?

Now, it’s the same concept with your voice. Again, consider your vocal cords to be a ballet dancer. A “vocal cord dancer” must coordinate all of their vocal cords to make sounds.

It takes a lot of training, but from sound alone we can track the position of the vocal folds as they vibrate, which matters for Parkinson's: just as the limbs are affected by the disease, so are the vocal cords. Speaking is the byproduct of a very complex system. We use our lungs, vocal cords, tongue, lips, nasal passages and brain every time we speak, so any disease, injury or medical event in those systems leaves diagnostic clues, biomarkers.

In this case, the telltale sign is irregular vocal fold tremor: the voice becomes quieter and breathier.

By combining any digital microphone, precision voice-analysis software and the latest machine learning advancements, we can quantify exactly where someone lies between health and disease.
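
To make that concrete, here is a minimal sketch of how such a measurement could start, using the open-source librosa library (my choice for illustration, not something the article prescribes) and a hypothetical "recording.wav". A quieter, breathier voice shows up as lower frame-level energy and less stable pitch.

```python
import librosa
import numpy as np

# Load a hypothetical voice sample at a 16 kHz sampling rate.
y, sr = librosa.load("recording.wav", sr=16000)

# Frame-level loudness: a fading voice trends quieter over the recording.
rms = librosa.feature.rms(y=y)[0]

# Fundamental frequency (pitch) per frame, via the YIN estimator.
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)

print(f"mean loudness: {rms.mean():.4f}")
print(f"pitch variability: {np.nanstd(f0):.1f} Hz")
```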

Current Scope of Diseases Being Diagnosed.

From neurodegenerative diseases to cardiovascular, various diseases have already begun their journey to being diagnosed through acoustic and linguistic changes in tone and intonation.

Parkinson's can be detected because voice changes are expected to serve as early diagnostic biomarkers or markers of disease progression. These biomarkers primarily relate to phonation (producing any type of sound) and articulation (speaking clearly overall): specifically, pitch variations, decreased energy in the harmonic spectrum and imprecise articulation of vowels and consonants. These changes appear in up to 78% of patients with early-stage Parkinson's.

Alzheimer's and mild cognitive impairments cause subtle voice and language changes affecting verbal fluency. Typically, patients hesitate to speak, have a slow speech rate, have trouble finding words and use filler sounds ("uh," "um"), etc.

Cardiometabolic and cardiovascular diseases have voice features of their own; coronary artery disease, in particular, has several.

Every disease has its own vocal biomarkers, and focusing on those parameters allows us to diagnose practically anything.

The assessment can be anywhere from highly structured to very unstructured, depending on the elements of speech being analyzed, and we can split this into three categories.

  1. Basic vocal tone: determined through exercises such as sustained phonation (saying "ahhh") or a structured speech task.
  2. Fluency of phonetics: the speed of recall or fluency, typically measured through semi-structured tasks such as repeating set phrases.
  3. Multifactorial assessment: commonly gauged through unstructured, spontaneous speech or free-form conversation.

After determining the category for each disease, it's simple to identify the type of structure it falls into and choose the test accordingly.

Analyzing the Correct Parameters.

Once the structure is determined, selecting the corresponding parameters is vital. There are three main categories for analyzing diseases.

  1. Verbal: isolated words, short sentence repetition, reading a passage and producing running speech.
  2. Vowel/syllable: sustained vowel phonation and diadochokinetic tasks (repeating syllables at a maximum production rate).
  3. Nonverbal vocalizations: coughing and breathing, some of the most common modes of detection during the pandemic.

After determining the structure and parameters, selecting the correct mode of data collection is vital. Some, like smartphones, might be the most accessible, but they are also the most complex to work with, considering the background noise associated with this method.

Specifically, there are four main categories of data collection techniques, some more accessible than others, though the more accessible ones introduce more factors to account for.

  • Studio-based recording: the speech is recorded in a controlled environment, yielding a clean recording with no unwanted acoustics.
  • Telephone-based recording: the data comes from various speakers and handsets, but with multiple disadvantages, including handset noise, no control over the speaker's environment and bandwidth limitations (mainly in developing countries).
  • Web-based recording: popular for large-scale data collection but reliant on internet access.
  • Smartphone-based recording: high broadband quality is necessary, which is, again, a disparity for developing countries with unreliable connections.

This is the overall pipeline for identifying the biomarker.

The Process.

After choosing the method of recording, we move on to the audio pre-processing.

This step occurs before any analysis of the data. Preprocessing includes resampling, normalization, noise reduction, and framing and windowing the data.

But let’s break this down.

We begin with normalization. This improves the performance of the overall feature detection by reducing the amount of varying information without distorting differences in the values. For example, if we were to be analyzing an image, this step would include examining every pixel to see a feature present at that pixel. So, it’s the same concept but with audio. We’re examining every signal to identify the feature presentation.
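
Here is a minimal sketch of what peak normalization might look like, assuming the audio is already loaded as a NumPy array of samples; the function name is my own, not a standard API.

```python
import numpy as np

def normalize_peak(y: np.ndarray, target_peak: float = 0.95) -> np.ndarray:
    """Scale the signal so its largest absolute sample equals target_peak."""
    peak = np.max(np.abs(y))
    if peak == 0:
        return y  # silent clip: nothing to scale
    return y * (target_peak / peak)
```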

Then we have noise detection and reduction. Through non-machine learning approaches, we obtain a clear audio signal by passing noisy audio through a linear filter. However, with recent advancements, we can define mapping functions between clean and noisy voice signals through neural networks. In short, the process removes noise from a signal, but it might distort the signal to some degree.
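
As a sketch of the classic linear-filter route (the pre-neural-network approach described above), a Butterworth band-pass from SciPy can keep roughly the frequency range of speech and attenuate rumble and hiss outside it; the cutoff values here are illustrative assumptions.

```python
from scipy.signal import butter, filtfilt

def bandpass_speech(y, sr, low_hz=80, high_hz=4000, order=4):
    nyquist = sr / 2
    b, a = butter(order, [low_hz / nyquist, high_hz / nyquist], btype="band")
    # filtfilt applies the filter forwards and backwards, avoiding phase shift.
    return filtfilt(b, a, y)
```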

Moving on to framing and windowing the data: the voice signal is divided into short samples, which are then multiplied by a window function to reduce spectral leakage, artifacts that smear a signal's energy across neighbouring frequencies.
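
A minimal sketch of these two steps, assuming a 16 kHz signal: slice the waveform into short overlapping frames, then taper each frame with a Hann window.

```python
import numpy as np

def frame_and_window(y, frame_len=400, hop_len=160):
    """25 ms frames every 10 ms at 16 kHz, each tapered by a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop_len
    frames = np.stack([y[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window  # broadcasts the window over every frame
```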

And finally, we have feature extraction using MFCCs (Mel-frequency cepstral coefficients). Together these coefficients make up the Mel-frequency cepstrum (MFC), which represents the short-term power spectrum of a sound, basically the amount of vibration at each individual frequency.
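
In practice, this whole step can be a single call in librosa (again, one library choice among several; "recording.wav" is a hypothetical file):

```python
import librosa

y, sr = librosa.load("recording.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13 coefficients, n_frames)
```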

Now through this audio pre-processing, there are two ways we can approach the data: linguistically or acoustically.

For example, consider the phrase:

Luxembourg is a resolutely multilingual environment.

The linguistic approach runs dependency and constituency parses (breaking the sentence down into sub-phrases) and then sense tagging, assigning the appropriate sense and meaning to each word in context.
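
As a small illustration of the parsing half, here is the example sentence run through spaCy, a common NLP library (my choice for the sketch; the small English model must be downloaded separately with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Luxembourg is a resolutely multilingual environment.")

for token in doc:
    # Each word, its grammatical role, and the word it depends on.
    print(f"{token.text:<12} {token.dep_:<10} -> {token.head.text}")
```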

We also have the acoustic features, which are extracted using MFCCs. This reuses some of the steps previously mentioned, with some extras:

  • Framing: segmentation of the signal into n samples.
  • Minimizing discontinuous signals: this reduces the noise going into the next step, the FFT (Fast Fourier Transform). The FFT step is crucial, as it converts each frame into its individual spectral components, essentially providing information about the frequency content of each signal.
  • Principal Component Analysis (PCA): this step performs dimensionality reduction, increasing interpretability while minimizing information loss as the dimensionality of the dataset decreases.
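
Here is a minimal sketch of the FFT and PCA steps, reusing the signal `y` and the `frame_and_window` helper from the earlier framing sketch; the component count is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

frames = frame_and_window(y)           # windowed frames from the sketch above
spectra = np.abs(np.fft.rfft(frames))  # magnitude of each frequency bin per frame

pca = PCA(n_components=10)             # keep the 10 strongest spectral directions
reduced = pca.fit_transform(spectra)
print(reduced.shape)                   # (n_frames, 10)
```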

Audio Feature Extraction.

Audio feature extraction is the transition step between pre-processing and analysis: we must convert the audio signal into distinct features before we analyze the data. These are the most dominant characteristics of the signal, the ones that will feed the algorithms.

The methods for identifying acoustic features break down into the following modes:

  • Prosodic: the pitch, energy and jitter of someone's voice.
  • Voice quality: the zero-crossing rate and the harmonic-to-noise ratio.
  • Phonation: the fundamental frequency and the pitch period entropy.

As for segment-level features, MFCCs are the most frequently used in speech analysis.
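
A minimal sketch computing one feature from each mode with librosa, on a hypothetical "recording.wav":

```python
import librosa
import numpy as np

y, sr = librosa.load("recording.wav", sr=16000)

energy = librosa.feature.rms(y=y)[0]            # prosodic: frame energy
zcr = librosa.feature.zero_crossing_rate(y)[0]  # voice quality: zero-crossing rate
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)   # phonation: fundamental frequency

features = [energy.mean(), zcr.mean(), np.nanmean(f0)]
```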

Feature Selection and Dimensionality Reduction.

Feature selection (like mRMR, minimum redundancy maximum relevance) selects part of the original feature set without changing the features themselves. It removes features with missing values or low variance, leaving the most relevant set of features for the prediction task.

Dimensionality reduction is a machine learning technique for reducing the number of random variables in a problem by obtaining a set of principal variables important for the task at hand. Its uses range from data visualization to feeding models such as random forests.
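
A minimal sketch of both steps with scikit-learn, assuming a hypothetical feature matrix `X` (recordings by features) and a label vector `labels`. Mutual information stands in here as a simpler relevance filter; true mRMR needs a dedicated package.

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif

X_var = VarianceThreshold(threshold=0.01).fit_transform(X)  # drop near-constant features
X_rel = SelectKBest(mutual_info_classif, k=20).fit_transform(X_var, labels)  # keep the 20 most relevant
X_low = PCA(n_components=5).fit_transform(X_rel)            # compress to 5 dimensions
```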

Training the Algorithms.

Neural networks, convolutional neural networks and classic machine learning models can all be used to classify the signals alone, and then combine that classification with other health-related data. As usual, the algorithm is trained on one dataset and tested on a separate one; the main problem is the lack of data.

We can build the algorithm out through supervised learning, but another option is transfer learning, where the knowledge gained from one task helps solve a different but related problem. In practice, transfer learning lets us start from a pre-trained model rather than training from scratch.
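
A minimal sketch of the supervised route with scikit-learn, on the same hypothetical `X` and a label vector `y_labels`; the model choice is illustrative, not the one any particular study used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out 20% of the data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_labels, test_size=0.2, stratify=y_labels, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```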

Testing the Algorithms.

As mentioned, obtaining large-scale datasets is not feasible, so we must have reliable estimates of the performance based on the limited data available. This can, again, be done through various methods.

In cross-validation, the data is randomly partitioned into equally sized subsets (called folds); each fold takes a turn as the test set while the rest are used for training, and the performance is averaged across all of the folds.

Another option is bootstrap validation. Data instances are sampled with replacement from the original dataset, producing different datasets of the same size that contain repeated instances and omit others. The unsampled instances are then used for testing.
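
A minimal sketch of both estimates with scikit-learn, assuming `X` and `y_labels` are the NumPy arrays from the training sketch above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

# 5-fold cross-validation: each fold serves as the test set exactly once.
scores = cross_val_score(model, X, y_labels, cv=5)
print(f"cross-validation accuracy: {scores.mean():.2f}")

# One bootstrap round: sample indices with replacement, test on the leftovers.
idx = np.arange(len(X))
train_idx = resample(idx, replace=True, random_state=0)
test_idx = np.setdiff1d(idx, train_idx)
model.fit(X[train_idx], y_labels[train_idx])
print(f"bootstrap accuracy: {model.score(X[test_idx], y_labels[test_idx]):.2f}")
```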

The Future of Voice Health.

VBBs have been marked “green,” meaning the technology has lots to offer and will grow significantly. Plus, the entire basis of the technology is that it can detect diseases earlier than the average check-up.

However, it’s not intended to provide a definitive diagnosis.

Voice-based technology is like a thermometer — the thermometer doesn’t diagnose…but it does give you a cue about how serious a temperature deviation from normal is and, in this way, educates and informs you about what the appropriate next step would be. That’s the right way to think about the kind of information voice-based biomarkers provide.

Diagnosis through a voice-based biomarker is not the be-all, end-all diagnosis; instead, it can help healthcare systems and providers triage potential cases by identifying the people with a higher probability of having a given disease.

This also enables a streamlined system: physicians remotely assess the health of their patients, get immediate results, and make fast, educated decisions, reducing the rate of medical errors and making cheap, reliable healthcare more scalable.

Pros of a voice-based test.

VBBs allow for self-administered tests at home, but there’s much more to them.

  • These tests take roughly 30 seconds to complete.
  • Voice-based tests are ultra low cost, meaning they can be massively scalable.
  • They enable high-frequency monitoring right from home, removing the need for expensive routine checkups.
  • They also allow mass data collection for clinical trials and even make population-scale screening possible for the first time.

Challenges to Creating, Implementing and Using VBBs.

Although VBBs propose exceptional positives to implementation, we tend to face several problems when looking at the creation and scalability of this technology. Specifically:

The quality of the speech data depends on the absence of background noise, which increases as we use less expensive equipment. Since smartphones are the basis for increasing accessibility, relying on them raises the chances of background noise, a direct conflict.

Extracting audio features is another task at hand. Structured speech is much less complex to unwind but limits the accuracy of the diagnosis; semi-structured and unstructured speech is quite complex to analyze but allows for greater accuracy.

However, one of the largest issues we come across is that there is no baseline for comparison. Quick as the screening is, today's tools don't look at an individual's changes over time, leaving no baseline to refer back to. Building longitudinal databases can be even more demanding than collecting population-scale data, and relating a patient-history database to diagnosed diseases poses yet another layer of questions. Where no baseline exists, we might misinterpret individual characteristics that affect voice quality, like poor sleep or mood swings, resulting in false positives.

Case Studies

Although we must consider all of the pros and cons of implementation and creation, companies are currently working on building VBBs out.

Pfizer and IBM, specifically, used sensors and mobile devices to measure Parkinson's speech patterns. Their goal was to capture a subject's naturally occurring speech, analyze it, and then monitor the subject's status.

The Mayo Clinic and Beyond Verbal partnered up to investigate how the voice might hold clues to detecting coronary artery disease. The organizations conducted a double-blind study in which patients recorded themselves in 30-second intervals reading a neutral text and then recounting a positive and a negative experience. Researchers found voice patterns indicating a 19-fold increased likelihood of having coronary artery disease, and concluded that the strongest correlation between voice and the disease came from the negative-experience recordings.

The next time you ask Siri to check the weather or ask Alexa to play a song, consider how, in the coming years, through greater advancements in AI and NLP, Alexa might become more than just a companion and assistant; it might even become your uncertified doctor.


Pavi Dhiman

A 16 y/o inquirer constantly working on new projects, content and personal growth. https://pavis-website-a1e976.webflow.io