Creating game flow using speech recognition

By: Wiandi Vreeswijk
A potential user testing the final prototype.


My name is Wiandi Vreeswijk and I am a Game Development student at the University of Applied Sciences in Amsterdam. Speech recognition has always been a mystery to me. It is widely used on phones, smartwatches and even television remotes these days, mostly to execute simple commands that require no physical action. After evaluating these common uses, I came up with the idea of using speech recognition to execute simple commands in a game.

The game would be for people who are not able to use their keyboard and mouse. An example of where this could happen would be in a hospital. 

In this R&D report I will research the use of speech recognition in a game and will discuss the following subject:

Using speech recognition input to deliver the same game flow as physical input.

I will address this subject by first explaining how speech recognition works in the first chapter. In the second chapter we will take a look at the speech recognition software that was considered for use during the research. Once a package was chosen, I started integrating it into Unity, which is described in chapter three. After speech recognition was successfully integrated, I had to think about a way to utilize the software in a game using game design.

The thought process behind this game design will be discussed in chapter four. In chapter five I will showcase the results of my research in the form of play testing and valuable data from the speech recognition model. The sixth chapter will contain my conclusion, where I explain my findings on using speech recognition input to deliver the same game flow as physical input. Finally, I will discuss future plans for the game I made in the seventh and final chapter.

Table of contents

  1. The working of speech recognition
    1. Convert speech into digital output
    2. Classifying into recognizable sounds
    3. Determining the input
  2. Considering speech recognition software
    1. Unity Engine namespace
    2. Microsoft Speech SDK
    3. Google Cloud Speech Recognition
  3. Prototype set up
    1. The implementation
    2. Creating the game
    3. Improving speech recognition and game-play
    4. Creating the final prototype
  4. The use of game design
    1. “State of flow”
    2. Input types
    3. Creating game flow by speech recognition
  5. Results
    1. Speech recognition analysis
    2. Play tests
  6. Conclusion
  7. Future plans
  8. Sources

1. The working of speech recognition


To understand what I am dealing with, I had to research how speech recognition works. I could write a whole separate report on this subject, because to me it is fairly complicated. Therefore, in this chapter I will give a very general explanation of how human speech is recognized and translated into text.

Convert speech into digital output

The first step of speech recognition is the conversion of human speech into digital output. It is helpful to understand how a microphone works to understand this conversion from analog to digital.

Illustration 1: The working of a microphone (Filipe M.Cross, 2018)

A dynamic microphone is basically a thin diaphragm attached to a coil of wire that sits around a small magnet. Sound waves make the diaphragm and the coil vibrate, and this movement creates an electric current in the wire by electromagnetic induction. The amplitude of the sound wave is thereby converted into a voltage, which can be read by a computer.

The conversion of an analog signal to a digital signal is done by an Analog-to-Digital Converter (ADC). Put simply, the ADC measures the voltage produced by the microphone many times per second and stores each measurement as binary data which the computer can understand. In a speech recognition pipeline this conversion is followed by removing unnecessary noise and normalizing the volume and speed of speech (different people speak differently).
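
To make the conversion step a bit more concrete, here is a minimal C# sketch of sampling and quantization. It is not how any particular microphone or sound card is implemented. The 16 kHz sample rate and 16-bit sample size are assumptions (both are common for speech, but they are not taken from this research), and the class and method names are made up for illustration. A pure sine tone stands in for the microphone voltage.

```csharp
using System;

public static class AdcSketch
{
    // Maps an analog amplitude in [-1.0, 1.0] to a signed 16-bit sample.
    // Values outside that range are clipped, as a real converter would do.
    public static short Quantize(double amplitude)
    {
        double clamped = Math.Max(-1.0, Math.Min(1.0, amplitude));
        return (short)Math.Round(clamped * short.MaxValue);
    }

    // Samples a pure sine tone at a fixed rate and quantizes every sample.
    public static short[] SampleTone(double frequencyHz, int sampleRateHz, int sampleCount)
    {
        var samples = new short[sampleCount];
        for (int n = 0; n < sampleCount; n++)
        {
            double t = (double)n / sampleRateHz;                      // time of sample n in seconds
            double voltage = Math.Sin(2 * Math.PI * frequencyHz * t); // simulated analog signal
            samples[n] = Quantize(voltage);
        }
        return samples;
    }

    public static void Main()
    {
        // Eight samples of a 440 Hz tone, sampled at the assumed 16 kHz.
        short[] samples = SampleTone(440.0, 16000, 8);
        Console.WriteLine(string.Join(", ", samples));
    }
}
```

The resulting array of integers is the "digital output" that the rest of the recognition process works with.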

Illustration 2: The block diagram of an ADC

How Does Speech Recognition Work? Learn about Speech to Text, Voice Recognition and Speech Synthesis. (2020, August 22). [Video]. YouTube.

Classifying into recognizable sounds

Once the computer has this digital output, the classification into recognizable sounds starts. First, the data is separated into different frequency bands, which are then analyzed further with a spectrogram.

Illustration 3: Spectrogram of a speech signal with breath sound (marked as Breath), whose bounds are denoted by vertical dotted lines. (Sri Harsha Dumpala, 2017)

Let’s have a look at the spectrogram. The X-axis shows time in seconds and the Y-axis plots the frequency of a sound (as in high pitches versus low pitches). All words are made up of distinct sounds, such as vowels, that each have their own frequency pattern. These frequencies are recorded and visualized on a spectrogram. The brightness of an area shows how much energy is present at that frequency at that moment, not how high the frequency is.
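
One column of a spectrogram can be thought of as the strength of each frequency within a short slice of audio. The C# sketch below computes this with a naive discrete Fourier transform. It is only an illustration under stated assumptions: real recognizers use a fast Fourier transform and overlapping, windowed frames, and the frame size (64 samples), sample rate (8 kHz), test tone (1000 Hz) and all names are invented for the example.

```csharp
using System;

public static class SpectrogramSketch
{
    // Returns the magnitude of each frequency bin for one frame (naive O(N^2) DFT).
    public static double[] MagnitudeSpectrum(double[] frame)
    {
        int n = frame.Length;
        var magnitudes = new double[n / 2]; // for real input the upper half mirrors the lower half
        for (int k = 0; k < magnitudes.Length; k++)
        {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++)
            {
                double angle = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.Cos(angle);
                im -= frame[t] * Math.Sin(angle);
            }
            magnitudes[k] = Math.Sqrt(re * re + im * im);
        }
        return magnitudes;
    }

    // Index of the strongest bin, i.e. the brightest point in this spectrogram column.
    public static int LoudestBin(double[] magnitudes)
    {
        int best = 0;
        for (int k = 1; k < magnitudes.Length; k++)
        {
            if (magnitudes[k] > magnitudes[best]) best = k;
        }
        return best;
    }

    // Generates a pure sine tone to use as test input.
    public static double[] Tone(double frequencyHz, int sampleRateHz, int count)
    {
        var samples = new double[count];
        for (int n = 0; n < count; n++)
        {
            samples[n] = Math.Sin(2 * Math.PI * frequencyHz * n / sampleRateHz);
        }
        return samples;
    }

    public static void Main()
    {
        const int sampleRate = 8000, frameSize = 64;
        double[] frame = Tone(1000.0, sampleRate, frameSize);
        int bin = LoudestBin(MagnitudeSpectrum(frame));
        // Each bin covers sampleRate / frameSize = 125 Hz.
        Console.WriteLine($"Loudest bin: {bin} (about {bin * sampleRate / frameSize} Hz)");
    }
}
```

Repeating this for every consecutive frame and stacking the columns next to each other gives the time-versus-frequency picture described above.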

Earlier I mentioned that the sounds that make up words each have their own frequency pattern. These frequency patterns can be pre-programmed into a computer, which allows the computer to recognize when a spoken sound matches a specific vowel sound. Phonemes are the smallest units of sound in a language, and they can act as these frequency patterns. The computer can run phonemes through models that compare them to words in the computer’s built-in dictionary. Such models include neural networks and the hidden Markov model.

Illustration 4: An example showing phonemes of the English language. (Brian MacWhinney, 2002)

Authot. (2018, October 10). Phoneme detection, a key step in speech recognition. Authôt.

Determining the input

In the early stages of speech recognition, most phoneme recognition was done with the hidden Markov model. The hidden Markov model is a statistical model consisting of states and corresponding evidence (observations). It changes between these states, but this process is hidden: only the evidence can be observed.

Illustration 5: A simple example of the hidden Markov model to predict weather from clothing patterns. (Ravindranath, 2019)
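
The weather example can be worked through in code. The C# sketch below uses the Viterbi algorithm, a standard way to find the most likely sequence of hidden states (the weather) given a sequence of evidence (the clothing). All probabilities, the state and clothing names, and the class itself are invented for illustration and are not taken from the illustration or from any speech recognizer.

```csharp
using System;

public static class WeatherHmm
{
    // Hidden states: 0 = Sunny, 1 = Rainy.
    // Evidence:      0 = Shorts, 1 = Coat, 2 = Umbrella.
    public static readonly string[] StateNames = { "Sunny", "Rainy" };

    // Made-up probabilities, for illustration only.
    static readonly double[] Start = { 0.6, 0.4 };
    static readonly double[,] Transition = { { 0.7, 0.3 }, { 0.4, 0.6 } };
    static readonly double[,] Emission = { { 0.6, 0.3, 0.1 }, { 0.1, 0.4, 0.5 } };

    // Viterbi: returns the most likely hidden state for every observation.
    // Expects at least one observation.
    public static int[] MostLikelyStates(int[] observations)
    {
        int steps = observations.Length, states = Start.Length;
        var score = new double[steps, states];  // best path probability ending in each state
        var previous = new int[steps, states];  // which state that best path came from

        for (int s = 0; s < states; s++)
        {
            score[0, s] = Start[s] * Emission[s, observations[0]];
        }

        for (int t = 1; t < steps; t++)
        {
            for (int s = 0; s < states; s++)
            {
                int bestPrev = 0;
                double best = -1.0;
                for (int p = 0; p < states; p++)
                {
                    double candidate = score[t - 1, p] * Transition[p, s];
                    if (candidate > best) { best = candidate; bestPrev = p; }
                }
                score[t, s] = best * Emission[s, observations[t]];
                previous[t, s] = bestPrev;
            }
        }

        // Pick the best final state and walk backwards through the stored choices.
        int last = 0;
        for (int s = 1; s < states; s++)
        {
            if (score[steps - 1, s] > score[steps - 1, last]) last = s;
        }
        var path = new int[steps];
        path[steps - 1] = last;
        for (int t = steps - 1; t > 0; t--)
        {
            path[t - 1] = previous[t, path[t]];
        }
        return path;
    }

    public static void Main()
    {
        int[] clothing = { 0, 1, 2 }; // shorts, coat, umbrella
        string[] weather = Array.ConvertAll(MostLikelyStates(clothing), s => StateNames[s]);
        Console.WriteLine(string.Join(" -> ", weather));
    }
}
```

In speech recognition the same idea applies with phonemes as the hidden states and short slices of audio as the evidence.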

The downside of the hidden Markov model is that every phoneme must be predefined. This is a problem because accents and mispronunciations influence human speech greatly. A solution to this problem would be the use of neural networks. Hidden Markov models are still used, however, because neural networks require a huge amount of data. Most of the time a hidden Markov model is used in combination with a neural network. I won’t go too deep into the hidden Markov models and neural networks that are used to identify sounds, because that is not the focus of this research.

Phonemes that are identified by the previously mentioned models are then processed by a language model. It analyses the probability of one phoneme appearing after another. It then checks whether the resulting sequence of phonemes exists in the language model’s dictionary.

Once the language model has identified the words of a sentence, it evaluates the structure of the sentence to see if it makes sense. This evaluation is done with a parse tree.

A parse tree is a technique in which the sentence is broken down into smaller and smaller parts until only elementary words remain and the sentence is valid. An elementary word can’t be broken down into separate words. For example: the word “walking” can be broken down into “walk” (the elementary word) and “ing”.

Illustration 6: A phonetic decision tree (Steve Young, 2000)

how speech recognition works in under 4 minutes. (2020, October 25). [Video]. YouTube.


This simplified explanation of the speech recognition process summarizes the knowledge I acquired during the Think phase of this R&D research: from sound waves to binary digital signals, to the classification of phonemes with advanced models, and finally to breaking down words and sentences to get a final output.

2. Considering speech recognition software


In this chapter I will discuss different speech recognition techniques that can be used in combination with Unity. I will explain the functionality of each technique. At the end of each explanation I wrote down a hypothesis on a possible solution. These hypotheses were written before I started working on a prototype, so the possible solutions are purely theoretical.

Unity Engine namespace

UnityEngine.Windows.Speech is a namespace available in Unity when using Windows 10. The namespace can be used for a couple of techniques:

  • DictationRecognizer: listens to speech input and attempts to determine what phrase was uttered.
  • KeywordRecognizer: listens to speech input and attempts to match uttered phrases to a list of registered keywords.
  • GrammarRecognizer: reads an XML file and works the same way as the KeywordRecognizer. A specific format needs to be used to specify words and phrases for the speech recognition system.

This is the most straightforward way of using speech recognition in Unity. Nothing has to be imported and I could use this straight away with an existing project. I experimented with the KeywordRecognizer by creating a dictionary of words that the recognizer should recognize. If a word is recognized, the transform of a game object is manipulated.
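
The dictionary idea can be sketched without Unity. The C# example below maps keywords to actions that move a position. It is an assumption-laden stand-in, not the code from my experiment: the X and Y properties replace a real game object's transform, the four keywords are invented, and in Unity the Handle method would typically be called from the KeywordRecognizer's OnPhraseRecognized callback with the recognized text.

```csharp
using System;
using System.Collections.Generic;

public class KeywordDispatcher
{
    // Hypothetical stand-in for the position of a game object's transform.
    public float X { get; private set; }
    public float Y { get; private set; }

    private readonly Dictionary<string, Action> actions;

    public KeywordDispatcher()
    {
        // Case-insensitive lookup, so "Left" and "left" hit the same entry.
        actions = new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase)
        {
            { "left",  () => X -= 1f },
            { "right", () => X += 1f },
            { "up",    () => Y += 1f },
            { "down",  () => Y -= 1f },
        };
    }

    // The keywords that would be registered with the recognizer.
    public IEnumerable<string> Keywords => actions.Keys;

    // Runs the action for a recognized keyword; returns false for unknown words.
    public bool Handle(string recognizedText)
    {
        if (recognizedText != null && actions.TryGetValue(recognizedText.Trim(), out Action action))
        {
            action();
            return true;
        }
        return false;
    }

    public static void Main()
    {
        var dispatcher = new KeywordDispatcher();
        dispatcher.Handle("right");
        dispatcher.Handle("up");
        Console.WriteLine($"Position: ({dispatcher.X}, {dispatcher.Y})");
    }
}
```

Keeping the keyword list and the actions in one dictionary means a new command only has to be added in one place.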

The possible solution

This technique only works when the capability for voice (using the microphone) is activated in Unity. The downside of using voice input directly in Unity is that the accuracy can’t be tweaked, so recognition can be very inaccurate. I noticed that the speech recognition system only recognized a word about 50% of the times I spoke it. This lack of options can be a critical problem when quick and accurate voice recognition is necessary for strong game flow.

Unity Technologies. (n.d.). Unity – Scripting API: KeywordRecognizer. Unity Docs.

Microsoft Speech SDK

When looking at the UnityEngine.Windows.Speech namespace in the Microsoft documentation a suggestion is given by Microsoft: “Consider using the Unity plug-in for the cognitive Speech Services SDK. The plugin has better Speech Accuracy results and easy access to speech-text decode, as well as advanced speech features like dialog, intent based interaction, translation, text-to-speech synthesis, and natural language speech recognition.” This would be the most logical step to take after experimenting with the UnityEngine.Windows.Speech namespace. It seems that the Microsoft Speech SDK has way more functionality to tweak the accuracy of speech recognition.

I had to create an Azure account and create a resource on this account so I could use the Speech SDK in Unity. The subscription is free for the first 12 months, so I could experiment during my research time for free. The Speech SDK can then be imported into a Unity project and once again the capability for voice (using the microphone) has to be activated in Unity.

The first thing you do when utilizing the SDK is creating a configuration for your speech recognition where you type in your credentials and your region. 

The documentation gives three example options from here:

  • Recognizing from microphone: recognizes input from the microphone and stops recognizing when it hears a silence or when 15 seconds have passed.
  • Recognizing from file: recognizes .wav files and stops recognizing after the whole file is read.
  • Continuous recognition: used when you want to control when to stop recognizing. It’s a bit more complicated, but gives you more freedom in controlling the speech recognition, since it listens and returns information only when you tell the system to.

There are multiple things to tweak when using this SDK:

  • Specifying the input (or source) language
  • Using phrase lists. These identify known phrases in audio data, like a person’s name or a specific location. By providing a list of phrases, you can improve the accuracy of speech recognition.
  • Custom Speech. Custom Speech is a UI-model that allows you to evaluate and improve Microsoft speech-to-text accuracy. An Azure account is needed for this.
  • Creating a tenant model. This generates a custom speech recognition model for an organization’s Microsoft 365 data. This model is optimized for technical terms, jargon and people’s names.
  • Using Speech Studio to test, compare, improve, and deploy speech recognition models using related text, audio with human-labeled transcripts, and pronunciation guidance you provide. 
  • Training and deploying a custom model. Human-labeled transcripts are used to train the model, which can improve accuracy.

The possible solution

If I were to use the Microsoft Speech SDK, the best solution for reaching my goal would be continuous recognition in combination with my own trained and deployed custom model. By specifying the input (or source) language, this model can recognize both Dutch and English (selectable in an in-game menu). I can use Speech Studio and Custom Speech to get visual feedback on the performance of my speech recognition model.

Google Cloud Speech Recognition

This was the original technique that I wanted to use for my research. The Speech-To-Text API from Google uses Google’s AI technologies (TensorFlow) to convert speech into text. It has one incredible advantage over the Microsoft SDK: it is available on almost every platform. This would be very beneficial for porting my game to different devices in the future. There are a couple of key features that Google’s API provides and they could all be very beneficial for my speech recognition game:

  • Speech adaptation: Transcribe domain-specific terms and rare words by providing hints, and boost the transcription accuracy of specific words and phrases.
  • Domain-specific models: Choose from a selection of trained models for voice control optimized for domain-specific quality requirements.
  • Streaming speech recognition: Receive real-time speech recognition results as the API processes the audio input stream from your application’s microphone or from a prerecorded audio file (which could be sent through Google Cloud Storage).
  • Global vocabulary for creating options for different languages.

The API is used the same way as a resource in the Microsoft Azure environment. Prices, however, are higher than in the Microsoft Azure environment.

The possible solution

The use of Google Cloud speech recognition would mean the integration of TensorFlow into Unity. I learnt that TensorFlow is already used in Unity in the form of Unity ML-Agents. This, however, mainly focuses on creating responsive and intelligent virtual players and non-playable game characters. The process for speech recognition would be to create a TensorFlow graph for use in Unity, then set up the ML-Agents and somehow write a C# script that processes the input at runtime. I could see results in the Google Cloud to showcase in my conclusion.

Google. (n.d.). Speech-to-Text: Automatic Speech Recognition. Google Cloud.


The three speech recognition techniques I could use are all viable for my research. After writing down my possible solutions I made a final choice for the software that I would use to create my prototype game. The chosen software is the Microsoft SDK. There are multiple reasons why I chose the Microsoft SDK:

  • Because of the options in tweaking to get a better speech recognition accuracy
  • Visual results of the model that is used
  • Options that are given for multiple languages
  • First 12 months of Microsoft Azure (which is needed to create a model) are free
  • Documentation by Microsoft about implementation in Unity.

3. Prototype set up


The setup of my prototype consists of various steps. The first step I had to take was implementing the chosen speech recognition technique, the Microsoft SDK. After implementing this technique, I made various prototypes with the help of the documentation to assess what I needed for the creation of my game. The second step was the creation of a game that would give me valuable results for the research. This involved game design and thinking about the speech recognition technique I chose.

Once there was a simple version of the game I could go to the third step which was to start play testing to increase the performance of my speech recognition model. The final step was to create a final prototype for which I created art, music and polished game-play. The final prototype was made with the goal to release a fully playable version of the game.

The implementation

Microsoft wrote documentation on the implementation of the Speech service SDK in Unity. One of the core features of this SDK is the ability to recognize and transcribe human speech (speech-to-text). To start using the Microsoft SDK there are a couple of prerequisites:

  • An Azure account with a resource that contains a Speech service subscription (free of charge for the first twelve months).
  • The Speech SDK Unity package, installed and imported into Unity.

Once the SDK is imported in Unity a configuration script needs to be written in C# to call the Speech service. This configuration includes my subscription key and associated region.

Declaring the API Key and Region to configure the Speech service.

public string SpeechServiceAPIKey = string.Empty;
public string SpeechServiceRegion = "westeurope";

Setting up the configuration for the Speech Service

SpeechConfig config = SpeechConfig.FromSubscription(SpeechServiceAPIKey, SpeechServiceRegion);

There are a couple of ways to use the SDK from here:

  • Recognize from microphone: This is the easiest one to use. Specify the audio input you want to use and initialize your microphone. This option uses the RecognizeOnceAsync() method, which means that the recording stops when a silence is recognized or when a maximum of 15 seconds has passed.
  • Recognize from file: To recognize speech from an audio file instead of a microphone you need to specify a file path. The code is very similar to recognizing from a microphone and also uses the RecognizeOnceAsync() method.

The above examples return the output of the recognized text in a text file. To handle errors and other feedback (debugging) I needed some debugging code such as giving information on a recognition match, when a recognition session starts/stops, what the result is and possible errors when the SDK crashes.

Gives back a recognized command or tells the player that the speech could not be recognized. A very basic demonstration of the recognize from microphone input method.

public static async Task RecognizeFromMicrophoneInput()
{
    var config = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    using (var recognizer = new SpeechRecognizer(config))
    {
        Console.WriteLine("Say: JUMP to make the character jump");
        var result = await recognizer.RecognizeOnceAsync();
        if (result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"Recognized command: {result.Text}");
        }
        else if (result.Reason == ResultReason.NoMatch)
        {
            Console.WriteLine("NOMATCH: Speech could not be recognized.");
        }
    }
}

As mentioned before, both the microphone and the file examples use single-shot recognition, which recognizes a single utterance. For my game, however, I want to control exactly when to start and stop recognizing. This requires continuous recognition.

Instead of the RecognizeOnceAsync() method, the StartContinuousRecognitionAsync() method is called.

Continuous recognition exposes a number of events to get recognition results. I will explain the most important ones to show how continuous recognition works:

  • Recognizing event: Containing intermediate recognition results. This can also be described as a hypothesis for the final recognized result.
  • Recognized event: Containing the final recognition results and indicates a successful recognition attempt.
  • SessionStopped event: Indicates the end of a recognition session.
  • Canceled event: Containing canceled recognition results.

The event handlers that are required for continuous recognition. The recognizedString variable is what I end up comparing to the pre-programmed commands in the game.

#region event handlers
private void SessionStartedHandler(object sender, SessionEventArgs e)
{
    UnityEngine.Debug.LogFormat($"\n    Session started event. Event: {e.ToString()}.");
}
private void SessionStoppedHandler(object sender, SessionEventArgs e)
{
    UnityEngine.Debug.LogFormat($"\n    Session event. Event: {e.ToString()}.");
    UnityEngine.Debug.LogFormat($"Session Stop detected. Stop the recognition.");
}
private void SpeechStartDetectedHandler(object sender, RecognitionEventArgs e)
{
    UnityEngine.Debug.LogFormat($"SpeechStartDetected received: offset: {e.Offset}.");
}
private void SpeechEndDetectedHandler(object sender, RecognitionEventArgs e)
{
    UnityEngine.Debug.LogFormat($"SpeechEndDetected received: offset: {e.Offset}.");
    UnityEngine.Debug.LogFormat($"Speech end detected.");
}
private void RecognizingHandler(object sender, SpeechRecognitionEventArgs e)
{
    if (e.Result.Reason == ResultReason.RecognizingSpeech)
    {
        UnityEngine.Debug.LogFormat($"HYPOTHESIS: Text={e.Result.Text}");
        lock (threadLocker)
        {
            recognizedString = e.Result.Text + ".";
        }
    }
}
private void RecognizedHandler(object sender, SpeechRecognitionEventArgs e)
{
    if (e.Result.Reason == ResultReason.RecognizedSpeech)
    {
        UnityEngine.Debug.LogFormat($"RECOGNIZED: Text={e.Result.Text}");
        lock (threadLocker)
        {
            recognizedString = e.Result.Text;
        }
    }
    else if (e.Result.Reason == ResultReason.NoMatch)
    {
        UnityEngine.Debug.LogFormat($"NOMATCH: Speech could not be recognized.");
    }
}
private void CanceledHandler(object sender, SpeechRecognitionCanceledEventArgs e)
{
    UnityEngine.Debug.LogFormat($"CANCELED: Reason={e.Reason}");
    if (e.Reason == CancellationReason.Error)
    {
        UnityEngine.Debug.LogFormat($"CANCELED: ErrorDetails={e.ErrorDetails}");
        UnityEngine.Debug.LogFormat($"CANCELED: Did you update the subscription info?");
    }
}
#endregion

Once continuous recognition was set up in Unity, I could start thinking about a game that would work well around it. Using continuous recognition, I created a system where I could control exactly when speech should be recognized and, more importantly, when it shouldn’t be.
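
The idea of only accepting speech during a designated window can be sketched in plain C#. The example below is a simplified illustration, not the code from my project: the RecognitionMailbox class and its method names are hypothetical, and a Task stands in for the recognizer, on the assumption that the SDK raises its events on a different thread than the game loop (which is why the event handlers use a lock).

```csharp
using System;
using System.Threading.Tasks;

public class RecognitionMailbox
{
    private readonly object threadLocker = new object();
    private string recognizedString = "";
    private bool listening;

    // Called by the game when the player is allowed to speak.
    public void OpenWindow()
    {
        lock (threadLocker)
        {
            recognizedString = ""; // forget anything said before the window opened
            listening = true;
        }
    }

    // Called by the game when the time to speak is over.
    public void CloseWindow()
    {
        lock (threadLocker) { listening = false; }
    }

    // Called from the recognizer's thread; results outside the window are dropped.
    public void Publish(string text)
    {
        lock (threadLocker)
        {
            if (listening) recognizedString = text;
        }
    }

    // Called from the game loop to read the latest accepted result.
    public string Read()
    {
        lock (threadLocker) { return recognizedString; }
    }

    public static void Main()
    {
        var mailbox = new RecognitionMailbox();
        mailbox.Publish("jump");                           // ignored: window is closed
        mailbox.OpenWindow();
        Task.Run(() => mailbox.Publish("circle")).Wait();  // simulated recognizer thread
        mailbox.CloseWindow();
        mailbox.Publish("square");                         // ignored: window is closed again
        Console.WriteLine(mailbox.Read());
    }
}
```

Every read and write of the shared string goes through the same lock, so the game loop never sees a half-updated value.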

Checks if microphone permission is granted, which is done manually in the Unity player preferences.

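public void StartContinuous()
{
    // Assumption: recognition is started here once permission is granted; the original body is not shown.
    if (micPermissionGranted) StartContinuousRecognition();
}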

Creating the config of the continuous speech service and setting up a new recognizer with its speech events.

void CreateSpeechRecognizer()
{
    if (recognizer == null)
    {
        SpeechConfig config = SpeechConfig.FromSubscription(SpeechServiceAPIKey, SpeechServiceRegion);
        config.SpeechRecognitionLanguage = fromLanguage;
        recognizer = new SpeechRecognizer(config);
        var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
        if (recognizer != null)
        {
            // Subscribes to speech events.
            recognizer.Recognizing += RecognizingHandler;
            recognizer.Recognized += RecognizedHandler;
            recognizer.SpeechStartDetected += SpeechStartDetectedHandler;
            recognizer.SpeechEndDetected += SpeechEndDetectedHandler;
            recognizer.Canceled += CanceledHandler;
            recognizer.SessionStarted += SessionStartedHandler;
            recognizer.SessionStopped += SessionStoppedHandler;
        }
    }
}

Calls the StartContinuousRecognitionAsync() method after a new speech recognizer is configured.

private async void StartContinuousRecognition()
{
    if (recognizer != null)
        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
}

Microsoft. (2018, March 21). Voice input in Unity – Mixed Reality. Microsoft Docs.

Creating the game

The difficult thing about using continuous recognition is that the process of receiving a result takes a fair amount of time. After all, the recognizer has to go through different stages every time it is started and starts recognizing. This was without a doubt the major challenge in my game design. My teachers advised me to create a turn-based game in which timing wouldn’t be important and the player could repeat commands several times if there was a mismatch with the recognizer. This would be a logical choice, but in my opinion not a very effective one. To really see if physical input can be matched by speech recognition input, I needed to create a game where fast thinking and acting are important. This way I could get results on the difference in response times.

With this thought I created an endless runner in which the character would be controlled by speech commands. The first version should have a 2D world with procedurally generated platforms. Every time a platform comes up, the player should speak a specific command to make the character jump. The purpose of this first version of the prototype would be to test the current accuracy and speed of the speech recognition service.

The first version of the game contained different elements required for play testing and feedback:

  • A character which could be controlled by voice commands.
  • A continuous speech recognizer that only recognizes speech within a designated time window.
  • A world in which the character can navigate.
  • Debugging to test the speech recognition software and get feedback on game-play.

Improving speech recognition and game-play

After getting feedback from students and teachers on the first version, I collected valuable advice for the next steps I should take:

  • Instead of trying to increase the speed of recognition, adjust game-play to the speech recognition system. For example by using a quick time event system.
  • Add visuals to the game which clarify the moment when the player can speak a command.
  • Port the game to mobile, because every mobile has a microphone nowadays. Do this especially if you plan to release.
  • Focus more on game design and less on using different speech recognition software. Comparing several packages is not possible in the time given, and a focus on game design is more in line with your research.

What I realized from the feedback and advice I got is that teachers, fellow students and play-testers all advised me to work more on strong game design instead of improving the speech recognition software. This is why I started by writing down a whole new design for the game. The new idea would be to keep the endless runner but combine it with quick time events. Every time the character should perform a certain action, the player has to speak certain commands based on visuals they see on the screen. If they speak the commands correctly, the character will perform the action correctly. This way, I can combine quick reflexes with interesting combinations of visuals and thus speech commands.

The speech recognition system could still use a lot of improvement and I thought it was interesting to experiment with in this research.

The first element to change was the source language for input. I did some research in the Microsoft documentation to see if the speech recognition SDK would perform worse with the source language set to Dutch. This was indeed the case in my tests: after switching to the US English version, commands were recognized significantly faster.

Declare a variable for the source language and add this to the configuration of the speech service before it is created.

string fromLanguage = "en-US";
config.SpeechRecognitionLanguage = fromLanguage;

T. (2021a, January 7). Taal ondersteuning-spraak service – Azure Cognitive Services. Microsoft Docs.

The second element I wanted to add was phrase lists. Phrase lists are used to identify known phrases in audio data, like a person’s name, a specific location or, in this case, a specific command. Even single words can be added to a phrase list. The speech recognition SDK then boosts the recognition of the words that are added to the phrase list.

Creates a phrase list to which phrases can be added for the given recognizer.

var phraseList = PhraseListGrammar.FromRecognizer(recognizer);

The third and final method I used to improve the accuracy of speech recognition was with something called Custom Speech and Speech Studio.

Custom Speech is a set of tools which helped me evaluate and improve my Microsoft speech recognition SDK. To use it, all I needed were some audio files I recorded while play testing. There are a couple of steps to consider when using Custom Speech to test and evaluate my audio data.

  1. Upload test data in the form of mono .wav recordings. I created these recordings in Audacity and uploaded them to Speech Studio, an environment where uploaded data can be inspected.
  2. Inspect the recognition quality of the uploaded audio in Speech Studio by comparing how different recognition models perform on it.
  3. Finally, Speech Studio gives a word error rate, which I used to determine whether additional training would be required.

Illustration 7: For this project, I used standard speech-to-text and therefore only evaluated the speech recognition model instead of also customizing it. (Microsoft Docs)

Surprisingly, the word error rate was very low, so I didn’t deploy a new model for my speech recognition. A new model would need human-labeled transcripts and related text, which would be too much for the scope of this research.
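
For readers unfamiliar with the metric: word error rate is conventionally the number of substituted, deleted and inserted words divided by the number of words in the human reference transcript. The C# sketch below computes it with a word-level edit distance. It is an illustration of the standard definition, not Speech Studio's implementation, and the example sentences and names are made up.

```csharp
using System;

public static class WordErrorRate
{
    // WER = (substitutions + deletions + insertions) / words in the reference.
    public static double Compute(string reference, string hypothesis)
    {
        string[] r = SplitWords(reference);
        string[] h = SplitWords(hypothesis);
        if (r.Length == 0)
        {
            throw new ArgumentException("Reference must contain at least one word.");
        }

        // d[i, j] = minimum number of edits to turn the first i reference words
        // into the first j hypothesis words (Levenshtein distance on words).
        var d = new int[r.Length + 1, h.Length + 1];
        for (int i = 0; i <= r.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= h.Length; j++) d[0, j] = j;

        for (int i = 1; i <= r.Length; i++)
        {
            for (int j = 1; j <= h.Length; j++)
            {
                int substitution = d[i - 1, j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
                int deletion = d[i - 1, j] + 1;
                int insertion = d[i, j - 1] + 1;
                d[i, j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
        }
        return (double)d[r.Length, h.Length] / r.Length;
    }

    private static string[] SplitWords(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // One of four reference words is missing from the recognized text.
        double wer = Compute("jump over the gap", "jump over gap");
        Console.WriteLine($"WER: {wer:P0}");
    }
}
```

A score of 0 means the transcript matches the reference exactly. The score can exceed 1 when the recognizer inserts many extra words.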

T. (2021b, February 12). Overzicht van Custom Speech: spraak service – Azure Cognitive Services. Microsoft Docs.

Creating the final prototype

Screenshot of the game’s start screen.

Now that I successfully created my first prototype and evaluated and improved my speech recognition, it was time to start creating the final prototype. I implemented the feedback that I got from the first prototype as much as possible. The focus in the final prototype wouldn’t only be on the speech recognition and game-play, but also on sound and the art style.

The first note of feedback I wanted to implement was the quick time event system. I wanted to adjust the game-play to the speech recognition speed. The quick time event system in the final prototype consists of a number of simple shapes (circle, triangle, square). One of these shapes appears randomly on the screen whenever a quick time event starts. When a shape appears, the player has to speak the corresponding command to successfully complete the quick time event. During the quick time event, a slow-motion effect occurs so the recognizer has time to process the given response.

Screenshot of the quick time event system UI.

When called, the StartQTE function shows a random command icon and starts recognizing speech. It also starts various coroutines that are used to continuously check the recognized string.

public static void StartQTE(QTETrigger trigger) {
private void _StartQTE(QTETrigger trigger) {
    this.activeTrigger = trigger;
    //pick a random QTE and show it in the display for a certain amount of time
    QTEgen = UnityEngine.Random.Range(0, commands.Count);

Resets the recognized string variable and slows time. Starts the continuous recognition and waits for given time until the recognition stops. Checks if command was correct. Finally speeds up time to normal speed.

private IEnumerator RecognizingSpeech() {
    recognizer.recognizedString = "";
    microphoneStatus.color = micActiveColor;
    commandIcon.enabled = true;
    yield return StartCoroutine(WaitForRealTime(timeGiven));
    shouldCheck = false;
    if (!checkSuccesful) {
        incorrect.color = incorrectCommandColor;
    }
    checkSuccesful = false;
    microphoneStatus.color = micInActiveColor;
}

Coroutine used to wait a specific amount of time without being influenced by the Unity timescale.

public static IEnumerator WaitForRealTime(float delay) {
    // Time.realtimeSinceStartup ignores Time.timeScale, so slow motion does not stretch the wait.
    float pauseEndTime = Time.realtimeSinceStartup + delay;
    while (Time.realtimeSinceStartup < pauseEndTime) {
        yield return null;
    }
}

Resets the quick time event system.

private IEnumerator CommandReset() {
    // Leave the result visible for two seconds before clearing the UI.
    yield return new WaitForSeconds(2f);
    correct.color = inActiveCommandColor;
    incorrect.color = inActiveCommandColor;
    commandIcon.enabled = false;
}

While the quick time event is active, constantly checks if the spoken command matches the command generated by the quick time event system.

private IEnumerator ContinuousCheck() {
    shouldCheck = true;
    while (shouldCheck) {
        // Compare the expected command with the latest recognized text.
        if (commands[QTEgen] == recognizer.recognizedString) {
            correct.color = correctCommandColor;
            checkSuccesful = true;
            shouldCheck = false;
        }
        // Poll every 0.1 seconds of real time.
        yield return StartCoroutine(WaitForRealTime(0.1f));
    }
}
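
The equality check in ContinuousCheck only succeeds when the recognized string matches the command exactly. Below is a minimal sketch of a more forgiving comparison, assuming the recognizer may return different casing, surrounding whitespace or trailing punctuation (for example "Triangle." instead of "triangle"). The CommandMatcher class and its method names are hypothetical and not part of the original project.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original project.
public static class CommandMatcher
{
    // Characters stripped from both ends of the text (an assumption about recognizer output).
    private static readonly char[] Trimmed = { ' ', '\t', '\r', '\n', '.', ',', '!', '?' };

    // Lower-cases the text and removes surrounding whitespace and punctuation.
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        return text.Trim(Trimmed).ToLowerInvariant();
    }

    // True when the recognized text equals the expected command after normalization.
    // An empty expected command never matches.
    public static bool Matches(string expected, string recognized)
    {
        string normalizedExpected = Normalize(expected);
        return normalizedExpected.Length > 0
            && normalizedExpected == Normalize(recognized);
    }
}
```

With such a helper, the condition in ContinuousCheck could read CommandMatcher.Matches(commands[QTEgen], recognizer.recognizedString).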

The next element I added in the final prototype was a characteristic art style. The art style could add a lot to the clarity of when to speak a command and when not to. It would also be important to give the character and the world around it more story in case I would want to release this game someday.

The first thing I did was create simple textures in Adobe Photoshop for the background, the player and the platforms. I didn’t want to spend a lot of time on this, so I created textures in a minimal futuristic art style.

The background of the game created in Adobe Photoshop

After I created the textures and added them into the game, I wanted to use post processing in combination with a custom shader to control bloom on different objects. I also added particle systems to highlight areas where speech recognition was activated and to make the world more alive.

A particle system that highlights the area where speech recognition is active.

A custom shader used to multiply the color of sprites by an intensity value. This is used to control the amount of bloom emitted from sprites.

Shader "2D/Main2DShader"
{
    Properties
    {
        _MainTex ("Texture", 2D) = "white" {}
        _Intensity ("Intensity", Range(0, 10)) = 1
    }
    SubShader
    {
        Tags { "Queue"="Transparent" "RenderType"="Transparent" }
        ZWrite Off
        Cull Off
        Blend SrcAlpha OneMinusSrcAlpha
        LOD 100

        Pass
        {
            CGPROGRAM
            #pragma vertex vert alpha
            #pragma fragment frag alpha
            // make fog work
            #pragma multi_compile_fog
            #include "UnityCG.cginc"

            struct appdata
            {
                float4 vertex : POSITION;
                float2 uv : TEXCOORD0;
                float4 color : COLOR;
            };

            struct v2f
            {
                float2 uv : TEXCOORD0;
                // fog coordinate read by UNITY_APPLY_FOG (not shown in the source excerpt)
                UNITY_FOG_COORDS(1)
                float4 vertex : SV_POSITION;
                float4 color : COLOR;
            };

            sampler2D _MainTex;
            float4 _MainTex_ST;
            float _Intensity;

            v2f vert (appdata v)
            {
                v2f o;
                o.vertex = UnityObjectToClipPos(v.vertex);
                o.uv = TRANSFORM_TEX(v.uv, _MainTex);
                o.color = v.color;
                // fill the fog coordinate (not shown in the source excerpt)
                UNITY_TRANSFER_FOG(o, o.vertex);
                return o;
            }

            fixed4 frag (v2f i) : SV_Target
            {
                // sample the texture
                fixed4 col = tex2D(_MainTex, i.uv);
                // apply fog
                UNITY_APPLY_FOG(i.fogCoord, col);
                // tint with the sprite's vertex color
                col *= i.color;
                float alpha = col.a;
                // drop nearly transparent pixels
                if (alpha < 0.1) discard;
                // scale the color to drive bloom, but keep the original alpha
                col *= _Intensity;
                col.a = alpha;
                return col;
            }
            ENDCG
        }
    }
}

After I added bloom, particles, chromatic aberration and some UI (created in Adobe Photoshop), the game was a lot more fun to play and easier to understand. At least, that was my perspective; I still had to play-test it.

I also used a library called DOTween, an object-oriented animation engine for Unity. I used DOTween to animate the character when jumping and when landing after a jump. I also used it to create the slow-motion effect and the day-night cycle.

Using DOTween to create a camera shake and DOPunchScale to create a squish effect when the character lands.

public void Land() {
    // Start the camera shake at full strength and tween it back to zero in 0.5 seconds.
    camShake.m_AmplitudeGain = cameraShake;
    if (shakeTween != null) shakeTween.Complete();
    shakeTween = DOTween.To(() => camShake.m_AmplitudeGain, x => camShake.m_AmplitudeGain = x, 0.0f, 0.5f);
    // Reset the scale before punching it, so repeated landings do not accumulate.
    if (tween != null) tween.Complete();
    maxHead.transform.localScale = new Vector3(1.0f, 1.0f, 1.0f);
    tween = maxHead.transform.DOPunchScale(new Vector3(0.0f, amount, 0.0f), duration, bounces, elasticity);
}

Using DOTween to alter the time scale and chromatic aberration.

private void _StartSlowMotion() {
    // Tween the time scale down to 0.3 in 0.1 seconds and keep the physics step in sync.
    DOTween.To(() => Time.timeScale, x => Time.timeScale = x, 0.3f, 0.1f).OnUpdate(() => {
        Time.fixedDeltaTime = fixedDeltaTime * Time.timeScale;
    });
    DOTween.To(() => chromaticAberration.intensity.value, x => chromaticAberration.intensity.value = x, 0.0f, 0.3f);
}

private void _StopSlowMotion() {
    // Tween the time scale back to 1 in 0.1 seconds and keep the physics step in sync.
    DOTween.To(() => Time.timeScale, x => Time.timeScale = x, 1f, 0.1f).OnUpdate(() => {
        Time.fixedDeltaTime = fixedDeltaTime * Time.timeScale;
    });
    DOTween.To(() => chromaticAberration.intensity.value, x => chromaticAberration.intensity.value = x, 0.5f, 0.3f);
}

I have been the sound designer on a couple of games in the past, and becoming a sound designer is a great ambition of mine. That’s why it was interesting to me to edit some SFX and compose a music piece for this game. I mastered everything in Unity and tested it while speaking commands. The hardest challenge with speech recognition is that the player has to speak certain commands while other sounds are playing in the background, which can be very distracting.

DoTween. (n.d.). DOTween (HOTween v2).

The music piece I composed for the game
Screenshot of LMMS; the environment in which I composed a music piece for the game.

Click the button below to download the build of the final prototype. The build is currently only available for Windows, which was one of the downsides of using the Microsoft SDK.

4. The use of game design

Illustration 7: A quote by Jane McGonigal from one of my favourite books: Reality is broken
SuperBetter (ed. Penguin, 2015)


Now that I have a final prototype and an accurate speech recognition service, I want to talk about the use of game design in the game. It is important to know how game flow works and which design decisions can improve it. Because this research focuses heavily on the input of the player, I want to give an overview of the input methods currently used in games and how they compare to speech recognition input. How speech recognition input is used to create game flow will be discussed in the final part of this chapter.

“State of flow”

Illustration 8: Flow, boredom, and anxiety as they relate to task difficulty and user skill level. (Csikszentmihalyi, 1990)

The diagram above shows that when the skill is too low and the given challenge is too hard, people become anxious. If the skill is too high and the challenge is far too easy, people become bored. To enter a so-called “state of flow”, skill and difficulty should be roughly at the same level. In a “state of flow” people can experience extreme focus on a challenge, loss of self-awareness and (most important for this research) a sense of active control.
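
The relationship in the diagram can be sketched as a small classification function. This is only an illustration of the idea, assuming skill and challenge are expressed on the same arbitrary scale; the FlowModel class, the PlayerState names and the tolerance value are hypothetical and do not come from the cited research.

```csharp
using System;

// Hypothetical illustration of the flow channel; names and the tolerance are assumptions.
public enum PlayerState { Anxiety, Flow, Boredom }

public static class FlowModel
{
    // skill and challenge share one arbitrary scale (for example 0 to 10).
    // tolerance is how far apart they may be while still counting as "roughly equal".
    public static PlayerState Classify(float skill, float challenge, float tolerance = 1.0f)
    {
        float difference = challenge - skill;
        if (difference > tolerance) return PlayerState.Anxiety;   // challenge too hard for the skill
        if (difference < -tolerance) return PlayerState.Boredom;  // challenge too easy for the skill
        return PlayerState.Flow;
    }
}
```

A game could use a check like this to decide whether to give the player more or less time during a quick time event.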

Csikszentmihalyi, the psychologist who researched this “state of flow” in the 1970s, found that a couple of elements in a challenge increase the probability of flow states:

  • Show the player concrete goals with manageable rules
    • People need clear goals to process information more effectively. People have limitations in the amount of information that can be processed on a computer screen.
  • Adjust to the person’s capabilities by demanding actions to achieve goals
    • Not being capable of achieving goals makes the player stressed and anxious. A certain commitment is needed to achieve difficult goals. This commitment disappears if a person feels like they are not up to the task.
  • Clear feedback on performance and accomplishment
    • The association between action and outcome makes the person feel engaged in achieving a goal.
  • Facilitating concentration by zoning out distractions
    • As mentioned before, people have limitations in the amount of information that can be processed. Mechanics, visuals, UI, etc. should only be visible on the screen when they have a very clear purpose.

Illustration 9: Performance as a function of Arousal/Stress. (Yerkes & Dodson, 1908, and Hanin, 2007)

The research on game flow has interested me since my first year of game development, and it is striking how valuable it is for almost every project I work on. Thinking about how I want to reach game flow with any game I make helps me to see things from the perspective of a random player.

Baron, S. (2012, March 22). Cognitive Flow: The Psychology of Great Game Design. Gamasutra.

McGonigal, J. (2011). Reality Is Broken: Why Games Make Us Better and How They Can Change the World (Illustrated ed.). Penguin Books.

Rogers, S. (2014). Level Up! The Guide to Great Video Game Design (2nd ed.). Wiley.

Input types

In this part of the chapter I will give a few examples of input types used in games. I think it is important for my research to evaluate multiple input types in order to have a clear comparison with speech recognition input. Many of these input types suit a specific type of game. The most widely used input types in games include the following:

  • Keyboard: Physical input by hand that offers a very wide array of controls. Mostly used in PC gaming. Because it is so common, players learn to use it very easily.
  • Mouse: Accompanies a keyboard and detects the movement of the player’s hand. Because it is so common, players learn to use it very easily.
  • Joystick: Offers movement input in the form of direction and angle. A joystick commonly comes with a couple of buttons. It is widely used in simulations and racing games.
  • Camera: Visual input using image recognition. Needs access to RGB data and is mostly used for small interactive games because of its very minimal input options.
  • Touchscreen: Can provide multiple inputs at once from the different fingers of the player. Used in mobile games and in controllers that make use of touch input (such as the DualShock controller from PlayStation).
  • GPS: Uses the geographical location of a device in game-play. Input is mostly provided by physical movement in the real world. An example would be Pokémon GO.
  • Gyroscope: The orientation of the device acts as the input for the player. Gestures like tilting and shaking can deliver input for a game. Very immersive game-play with quite natural and simple input. A great example would be a VR headset.
  • Motion controller: The player’s motion as input. Very natural input which can be used in many games. Downside is the space needed for it to work properly.

After evaluating these input types, I made some assumptions. The game flow is not necessarily directly influenced by the number of buttons and options. A game using GPS as its major input could be more immersive than a game that uses twenty different keys on the keyboard. More physical movement (with the motion controller, GPS and joystick) can result in more immersive game-play. The input feels more natural when you use a gyroscope to balance a ball on a small surface than when you use a keyboard or mouse. These are all assumptions based on my own experience and are not facts.

Input Types. (n.d.). PACKT.

Creating game flow by speech recognition

Game design in itself is already very complex, since so many elements can influence the game flow. Player movement, camera movement, item systems and world building are examples of elements that take a lot of experience to design.

In this research I add another layer of complexity to game design by changing the input type to speech recognition, which is not a popular input type for games. A reason for this could be the limitations game designers face when creating a game around the existing speech recognition technologies. The biggest downside of using speech recognition is the enormous amount of data that is needed to create a speech recognition service that is accurate enough. A correct command that is spoken by the player but not recognized would be catastrophic for the game flow. As mentioned in a previous section of this chapter, the “state of flow” can be achieved by giving the player clear feedback. The worst kind of feedback would be the system making a mistake and then punishing the player for it.

However, voice controls could have a potentially lower learning curve for beginners, since they don’t have to figure out any abstract controls but can simply speak commands into their microphone to control their character or navigate through a menu. Games controlled by voice would make gaming more accessible for visually and/or physically impaired people.


Because the integration of speech recognition is fairly new in game design, the focus is more on improving speech recognition than game design. Game design is heavily restricted by the quality of the speech recognition system when thinking about game flow. With the current speech recognition technology it is hard to create the same open-world PC games that are controlled with keyboard and mouse. But maybe that is not the right scope for speech recognition. Maybe the right scope would be leaning more towards the socially oriented side of game design.

5. Results


This is maybe the most important chapter of my research document: the results. In this chapter I will discuss the results from my speech recognition analysis and will try to give a clear overview of the accuracy of the speech recognition service I used. The accuracy of speech input is also measured in play-tests, which are a vital part of the analysis. The play-tests I performed had two purposes: the first was to test the accuracy and speed of my speech recognition, and the second was to test the game design.

Speech recognition analysis

The first part of the speech recognition analysis was done in the Microsoft Azure environment, where I can track different metrics on my speech recognition service. The environment holds a resource I created, which shows information about, for example, its properties, metrics, alerts and key tokens.

These are examples of metrics I tracked which I used to analyze the speech recognition service.

Metric showing the total number of calls that retrieved the authentication token of the speech recognition service resource.
Metric showing the total number of calls with client-side errors (HTTP response 4xx).
Metric showing the number of successful calls that the speech recognition service received.

These metrics give me a good comparison of how many of the token calls are successful and when errors happen. Most calls happened when I started setting up my recognizer and testing different small prototypes (the beginning of March). We can see that token calls are heavily reduced near the end of March. Token calls take a lot of time and this has an impact on game-play, so I tried to minimize these calls as much as possible by pausing a recognition session instead of restarting it (restarting it means starting a new token call).

I also analyzed the speech recognition service by logging the speech recognizer within Unity and sending the output to a text file. This was the second part of my analysis.

Adds logging to the configuration of the speech recognition service.

config.SetProperty(PropertyId.Speech_LogFilename, "C:/Users/Wiandi Vreeswijk/Downloads/EndlessRunnerSpeechRecognition/LogResults.txt");

The log file showed me information about the Speech SDK’s core components. I used it mostly to check how long it took to recognize a command.

Log about the time it took for the speech hypothesis to recognize the word: “Triangle”

Response: Speech.Hypothesis message. Starts at offset 6800000, with duration 1100000 (100ns). Text: Triangle
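
The offset and duration in this log line are expressed in units of 100 nanoseconds, which is the same unit as a .NET tick. Below is a minimal sketch of how such a line could be parsed and converted to readable times, assuming the log format shown above. The HypothesisLog class is hypothetical and was not part of the original project.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; assumes the log line format shown above.
public static class HypothesisLog
{
    private static readonly Regex Pattern = new Regex(
        @"offset (\d+), with duration (\d+) \(100ns\)\. Text: (.*)$");

    // Extracts offset, duration and text; returns false when the line has another format.
    public static bool TryParse(string line, out TimeSpan offset, out TimeSpan duration, out string text)
    {
        offset = TimeSpan.Zero;
        duration = TimeSpan.Zero;
        text = "";

        Match match = Pattern.Match(line ?? "");
        if (!match.Success) return false;

        long offsetTicks;
        long durationTicks;
        if (!long.TryParse(match.Groups[1].Value, NumberStyles.None, CultureInfo.InvariantCulture, out offsetTicks)) return false;
        if (!long.TryParse(match.Groups[2].Value, NumberStyles.None, CultureInfo.InvariantCulture, out durationTicks)) return false;

        // One tick is 100 ns, so the logged values can be used as ticks directly.
        offset = TimeSpan.FromTicks(offsetTicks);
        duration = TimeSpan.FromTicks(durationTicks);
        text = match.Groups[3].Value.Trim();
        return true;
    }
}
```

For the line above, 6800000 × 100 ns corresponds to an offset of 0.68 seconds and 1100000 × 100 ns to a duration of 0.11 seconds.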

Both the analyzing tools in Microsoft Azure and the logging in Unity helped me to boost the speed and accuracy of my speech recognition service. I must say that it was very difficult to influence these variables, but it helped me to understand what calculations are going on during the continuous recognition. I learned how to code the recognizer properly to make it as efficient as possible by looking at the metrics in Azure. By understanding the hypothesis phase of continuous recognition I could adjust my code accordingly. The game could even predict commands by comparing the hypothesis to the presented quick time event command.
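
The prediction idea mentioned above can be sketched as a prefix comparison: while the recognizer is still producing hypotheses, the game checks whether the partial text is already consistent with the expected command. This is a minimal sketch under the assumption that hypotheses arrive as growing partial strings. The HypothesisPredictor class and the minimum length of three characters are hypothetical choices, not the original implementation.

```csharp
using System;

// Hypothetical helper for illustration; not the original implementation.
public static class HypothesisPredictor
{
    // True when the partial hypothesis is a prefix of the expected command
    // and is long enough to count as a meaningful prediction.
    public static bool IsLikely(string expectedCommand, string hypothesis, int minLength = 3)
    {
        if (string.IsNullOrWhiteSpace(expectedCommand) || string.IsNullOrWhiteSpace(hypothesis))
            return false;

        string expected = expectedCommand.Trim();
        string partial = hypothesis.Trim();
        if (partial.Length < minLength) return false;

        // Ignore casing, since hypotheses and commands may be capitalized differently.
        return expected.StartsWith(partial, StringComparison.OrdinalIgnoreCase);
    }
}
```

A higher minimum length makes the prediction safer but slower, because the game then waits for more of the word before accepting it.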

Play tests

I prepared my play-tests by creating a step-by-step plan:

  1. Introduce myself and the research to the player and give them an opportunity to introduce themselves.
  2. Set up the build for the player and give the following instructions:
    Only use speech to control the game. Speak clear commands into the microphone.
    Don’t tell the player when to speak a command; this is part of the play-test. The game should be designed in a way that makes this clear.
    Risks in this step:
    • Instructions are not clear to the player
  3. Observe the player while play-testing:
    Don’t give advice to the player on how to play, except that they need to use speech. Tell the player to get a couple of commands right before quitting.
    Risks in this step:
    • Player gets distracted and speaks too late
    • Character doesn’t jump as intended
    • Microphone doesn’t work as intended
  4. Ask the player questions about the play-through:
    • Did the game-play feel fair?
    • Were the controls easy to use? If not, what could be an improvement in your opinion?
    • What did the visuals and sound in the game add for you?
    • What would you definitely add to this game that would make it more appealing for you?

Footage of one of the play-testers: Sammie Vries

This video is a recording of a play-test with a person from my target audience. Her name is Sammie Vries and she was willing to help me get feedback on my prototype. Sammie is 24 years old and is a journalism graduate from the university of applied sciences in Utrecht.

Play-test executed by Sammie Vries and monitored by me.

Sammie’s answers:

  • Did the game-play feel fair?
    • Yes, it felt fair. The one time I failed was because I had to think too long about the answer, which made sense. The game responded to my commands as I expected it would.
  • Were the controls easy to use? If not, what could be an improvement in your opinion?
    • I am not very good at English, so it is hard to recognize a certain shape and then translate it into English in your head. The time it took me to do this was sometimes too long, which made the game difficult in the beginning.
  • What did the visuals and sound in the game add for you?
    • The jumps were in sync with the music and this created synergy in the game.
  • What would you definitely add to this game that would make it more appealing for you?
    • Add more mechanics. The game should have more variety and create more challenge by adding more speech commands.

6. Conclusion

In this R&D report I researched the use of speech recognition in a game and discussed the following subject:

Using speech recognition input to deliver the same game flow as physical input.

I answered this question by explaining that speech recognition works by converting analog audio waves to digital binary output. This conversion happens in the microphone. When the digital binary output is read by the computer, it can be translated using spectrograms. These spectrograms are used to define phonemes, which are the smallest elements of a language. Using complex models, these phonemes are used to create words and sentences.

After explaining how speech recognition works, I had to choose a type of speech recognition software to integrate into Unity for making a prototype. I chose the Microsoft SDK speech recognition service mainly because it offers many options for tweaking. It was also free and had a lot of documentation available.

After I decided what software to use for creating a prototype, I started with the implementation and experimentation in Unity. Different techniques, such as recognizing directly from the microphone and from a file, were used in small prototypes to test the capabilities of the SDK speech service. I finally decided to use continuous recognition for my prototype. Continuous recognition is a type of recognition that can be easily controlled during game-play. When the first prototype using continuous recognition was working, I started improving the accuracy and speed of the speech recognition service. Examples of techniques I used to achieve this are phrase lists and a UI-based tool called Custom Speech. When I was ready to create the final prototype, I started thinking about art, sound and the overall game flow.

Game flow depends on the balance between the difficulty of the game and the skill of the player. I explained this by using different game design models and examples. Game flow is also influenced by the type of input that is used in a game. Multiple examples were given and compared to speech recognition input. After explaining how game flow can be achieved, I explained how this could be done with speech recognition input. I discussed the pros and cons of using speech recognition and gave my personal opinion on the matter.

Finally, I showed the results of the Microsoft speech recognition service and of the executed play-tests. These results helped me to get a better understanding of the effect of speech recognition on game flow.

7. Future plans

Illustration: The WAEM model, describing the company’s end goal and how WAEM will achieve this. (WAEM, 2021)

At the moment I am setting up my own company with a fellow game development student. The company is called WAEM and focuses on passing knowledge and creating awareness with the use of gamification and innovative game design. I think that the current research would be very valuable for WAEM, because speech recognition input could be used by visually and/or physically impaired people.

I will definitely continue working on this game and port it to mobile. Looking at the simplicity of the game, this could be a very successful mobile game with a socially oriented touch. Besides porting the game to mobile, I want to continue improving the continuous recognition and adding more mechanics to achieve an even higher “state of flow”.

I really loved working on this project and it gave me a much broader perspective on the possibilities of speech recognition input in game design.

8. Sources

Authot. (2018, October 10). Phoneme detection, a key step in speech recognition. Authôt.

Baron, S. (2012, March 22). Cognitive Flow: The Psychology of Great Game Design. Gamasutra.

Bye, T. (2020, September 15). Quickstart voor spraak-naar-tekst – Speech-service – Azure Cognitive Services. Microsoft Docs.

DoTween. (n.d.). DOTween (HOTween v2).

Google. (n.d.). Speech-to-Text: Automatic Speech Recognition. Google Cloud.

How Does Speech Recognition Work? Learn about Speech to Text, Voice Recognition and Speech Synthesis. (2020, August 22). [Video]. YouTube.

how speech recognition works in under 4 minutes. (2020, October 25). [Video]. YouTube.

Input Types. (n.d.). PACKT.

McGonigal, J. (2011). Reality Is Broken: Why Games Make Us Better and How They Can Change the World (Illustrated ed.). Penguin Books.

Microsoft. (2018, March 21). Voice input in Unity – Mixed Reality. Microsoft Docs.

Mouncer, B. (2020, April 1). Speech Recognition, translation and intent recognition using C# & Unity. Github.

Rogers, S. (2014). Level Up! The Guide to Great Video Game Design (2nd ed.). Wiley.

T. (2021a, January 7). Taal ondersteuning-spraak service – Azure Cognitive Services. Microsoft Docs.

T. (2021b, February 12). Overzicht van Custom Speech: spraak service – Azure Cognitive Services. Microsoft Docs.

Unity Technologies. (n.d.). Unity – Scripting API: KeywordRecognizer. Unity Docs.
