1940s and 1950s – Crawling:
This period, which began in the 1940s and ran through the latter part of the 1950s, saw essential groundwork laid on two key paradigms: the automaton model and the probabilistic model. The automaton arose in the 1950s out of Alan Turing's 1936 model of algorithmic computation, now known as the "Turing machine". Turing's model, considered by many to be the foundation of modern computer science, led first to the McCulloch-Pitts neuron model (1943) and later to mathematician Stephen Cole Kleene's work on finite automata and regular expressions (1951 and 1956). Another major contribution to automata theory came from Claude Elwood Shannon, who in 1948 applied probabilistic models of discrete Markov processes to automata for language. Shannon's work, and in particular his idea of a finite-state Markov process, was drawn upon by Noam Chomsky in 1956, when he first considered finite-state machines as a way to characterize a grammar and defined a finite-state language as a language generated by a finite-state grammar. Taken together, these early models led to the field of formal language theory, which uses algebra and set theory to define formal languages as sequences of symbols; this includes the context-free grammar, first defined by Noam Chomsky and independently by John Backus in 1959 and Peter Naur in 1960. This same period also saw the development of probabilistic algorithms for speech and language processing, which brought about Claude Elwood Shannon's next important contribution: the metaphor of the noisy channel and decoding for the transmission of language through media such as communication channels and speech acoustics.
By borrowing the concept of entropy from thermodynamics, Shannon was also able to measure the information capacity of a channel and the information content of a language. Around the same time, the foundations for future work in speech recognition were laid: first with the development of the sound spectrograph in 1946, and then with foundational research in instrumental phonetics. This groundwork led to the first machine speech recognisers in the early 1950s, the highlight being a statistical system built by a team of researchers at Bell Labs that could recognise any of the 10 digits from a single speaker with 97–99% accuracy.
1957 to 1970 – Baby Steps:
By the late 1950s and early 1960s, speech and language processing had split into two distinct paradigms: the symbolic and the stochastic. The symbolic paradigm took off from two lines of research. The first was the work of Noam Chomsky and others on formal language theory and generative syntax, together with the work of many linguists and computer scientists on parsing algorithms; a clear example is one of the earliest complete working parsing systems, Zelig Harris's "Transformations and Discourse Analysis" project of 1958 to 1959. The second line of research was the new field of artificial intelligence, born in the summer of 1956 when a group of researchers, including the aforementioned Claude Elwood Shannon, came together for a two-month workshop on what they called "artificial intelligence". The main emphasis of this new field was work on reasoning and logic, exemplified by Allen Newell, Herbert A. Simon, and J. C. Shaw's "Logic Theorist" and "General Problem Solver". One consequence of this work was the use of simple heuristics in the early natural language understanding systems built around this time; by the late 1960s, more formal logical systems were in use. The stochastic paradigm, by contrast, took hold mainly in departments of statistics and electrical engineering. In the late 1950s, Bayesian methods began to be applied to the problem of optical character recognition: in 1959, Woodrow Wilson Bledsoe and his fellow Sandia Laboratories employee Iben Browning built a system that used Bayesian methods for optical character recognition.
Later, in 1964, Charles Frederick Mosteller and David Wallace applied Bayesian methods to the historical problem of authorship attribution: determining who wrote each of the disputed Federalist papers. The 1960s also brought the first testable psychological models of human language processing, based on transformational grammar, as well as the first online text collections (newspapers, novels, articles, dictionaries, etc.): the "Brown University Standard Corpus of Present-Day American English", compiled by Henry Kučera and W. Nelson Francis in 1963 to 1964, and William Wang's "Chinese Dictionary on Computer" in 1967.
1970 to 1983 – Giant Leaps:
This third and key period saw research in speech and language processing reach new heights, and it produced several research paradigms that still lead the field today. The stochastic paradigm played a key role in advancing speech recognition during this period: the Hidden Markov Model and the metaphors of the noisy channel and decoding were developed independently by Frederick Jelinek, Lalit R. Bahl, Robert L. Mercer, and colleagues at the IBM Thomas J. Watson Research Center, and by James K. Baker at Carnegie Mellon University. The period also saw the beginnings of the logic-based paradigm, through the work of Alain Colmerauer and his colleagues on Q-systems and metamorphosis grammars (1970 and 1975), the forerunners of Prolog; through Fernando C. N. Pereira and David H. D. Warren's work on Definite Clause Grammars (1980); and through Martin Kay's work on functional grammar (1979) and, shortly afterward, Joan Bresnan and Ronald Kaplan's work on Lexical-Functional Grammar (LFG) (1982). Natural language understanding also emerged during this period, beginning with the "SHRDLU" system created by the American computer scientist Terry Winograd, which simulated a robot embedded in a world of toy blocks. The program accepted natural-language text commands in English of a complexity and sophistication not seen before.
At Yale, computer science chairman Roger Schank and his colleagues and students (Robert P. Abelson, 1977; Christopher K. Riesbeck, 1981; Cullingford, 1981; Robert Wilensky, 1983; Wendy Lehnert, 1977) built a series of language understanding programs focused on human conceptual knowledge, such as scripts, plans and goals, and memory organisation. These systems often used network-based semantics and incorporated Charles J. Fillmore's notion of case roles into their representations (M. Ross Quillian, 1968; D. Rumelhart and D. Norman, 1975; Roger C. Schank, 1972; Y. Wilks, 1975; Walter Kintsch, 1974). This period also produced the discourse modeling paradigm, which focused on four key areas of discourse: Barbara J. Grosz and her colleagues introduced the study of discourse substructure and discourse focus; researchers began work on automatic reference resolution; and the Belief-Desire-Intention (BDI) framework for logic-based work on speech acts was developed (Jerry R. Hobbs, 1978; J.F. Allen and C.R. Perrault, 1980; P. R. Cohen and C.R. Perrault, 1979).
1983 to 1993 – Further Progress:
This decade saw the comeback of two classes of models that had lost popularity in the late 1950s and early 1960s. The first was finite-state models, whose revival began with Ronald M. Kaplan and Martin Kay's 1981 work on finite-state phonology and morphology and K. Church's 1980 work on finite-state models of syntax. The second was the rise of probabilistic models throughout speech and language processing, strongly influenced by the work at the IBM Thomas J. Watson Research Center on probabilistic models of speech recognition; the resulting data-driven methods, along with connectionist approaches, spread from speech recognition into part-of-speech tagging, parsing and attachment ambiguities, and semantics. A considerable amount of work on natural language generation also took place during this period.
1994 to 1999 – Smooth Riding:
As the millennium approached, it became clear that the field of speech and language processing was changing. First, probabilistic and data-driven models had become commonplace throughout natural language processing: algorithms for parsing, part-of-speech tagging, reference resolution, and discourse processing all began to incorporate probabilities and to employ evaluation methodologies borrowed from speech recognition and information retrieval. Second, increases in computer speed and memory made commercial exploitation of several sub-areas of speech and language processing possible, such as speech recognition and spelling and grammar checking. Finally, the rise of the web created a pressing need for language-based information retrieval and information extraction.
Speech is one of the simplest yet most essential human abilities: through it, people can express information at will without any external tool. Although we take in more information through the eyes than through the ears, communicating with one another visually is far less efficient than communicating through speech. The speech wave conveys not only linguistic information in a known language but also the speaker's vocal characteristics and emotion. Moreover, because the acoustic and linguistic structures of speech are closely related to our intellectual ability, and therefore to our cultural and social development, speech clearly plays a significant role in our everyday lives.
The Speech Wave - Information Retrieval by Machine:
Taking all of this into account, it is easy to see why automatic speech recognition has been a sought-after goal of artificial intelligence research for more than sixty years: its foundations were laid in the 1940s, and it went on to inspire science fiction wonders of their time such as the computer HAL in Stanley Kubrick's "2001: A Space Odyssey" and the beloved robot R2-D2 in George Lucas's "Star Wars" series. Despite enormous efforts to design intelligent machines that can recognise the spoken word and comprehend its meaning, we are still far from the goal set at the outset: a machine that can recognise and comprehend all spoken speech, regardless of subject, phrasing, or environment. So where do automatic speech recognition systems currently stand, and how far have they come? These questions have no simple answer. What can be said is that the answer lies in the ultimate goal described above, and the key word when analysing it is "accuracy". A system's accuracy depends heavily on the conditions under which it is evaluated: a system evaluated under very narrow conditions can easily attain human-like accuracy, but as the conditions are broadened, human-like accuracy becomes much harder to achieve.
These conditions of evaluation and thus the overall accuracy of automatic speech recognition vary depending upon the following factors:
Vocabulary size and confusability.
Speaker dependence vs. independence.
Isolated, discontinuous, or continuous speech.
Task and language constraints.
Read speech vs. spontaneous speech.
However, even if we take all of the above factors into account and use them as a general blueprint for research into machine speech recognition, with the ultimate goal of a machine that can recognise and comprehend all spoken speech regardless of subject, discourse, or environment, we must not overlook the interdisciplinary nature of automatic speech recognition, nor the fact that most researchers tend to apply a monolithic approach to individual problems that actually require one or more of the following disciplines:
Communication and information theory.
SPREX Expert System:
The SPREX system, designed by Mizoguchi, Tanaka, Fukuda, Tsujino, and Kakusho in 1987, is a continuous speech recognition system that uses techniques derived from knowledge engineering. The system first describes its feature parameters symbolically and then applies forward-chaining rules in the recognition phase. The parameters used are formant frequencies, energy, zero crossing rate, and the ratio of low-frequency energy to total energy (L/A). Variations in the feature parameters are described through descriptors, and the states of the features are labeled with a number of different labels. Using these state descriptions, the system offers a graphical interface to the human experts whose job is to write the rules; the experts write rules according to their own knowledge, the rules are stored in a rule database, and they are applied during recognition. In the recognition phase, the state descriptions and the rules are used for segmentation, consonant recognition, vowel recognition, and re-recognition.
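The SPREX pipeline described above, symbolic state labeling followed by forward-chaining rules, can be sketched in a few lines of Python. This is a minimal sketch, not SPREX's actual implementation: the feature names, threshold values, labels, and rules are all hypothetical.

```python
# Hypothetical sketch of SPREX-style processing: crisp symbolic labeling of
# per-frame features, then forward chaining over if-then rules.
# Thresholds, labels, and rules are invented for illustration.

ENERGY_THRESHOLD_DB = 40.0  # hypothetical frame-energy threshold (dB)
ZCR_THRESHOLD = 0.3         # hypothetical zero crossing rate threshold


def describe_state(frame):
    """Map numeric feature values to symbolic facts using crisp thresholds."""
    facts = set()
    facts.add("energy_high" if frame["energy_db"] >= ENERGY_THRESHOLD_DB else "energy_low")
    facts.add("zcr_high" if frame["zcr"] >= ZCR_THRESHOLD else "zcr_low")
    return facts


# Each rule is (premises, conclusion); all premises must hold for it to fire.
RULES = [
    ({"energy_low", "zcr_low"}, "silence"),
    ({"energy_low", "zcr_high"}, "unvoiced_fricative"),
    ({"energy_high", "zcr_low"}, "voiced"),
    ({"voiced"}, "vowel_candidate"),
]


def forward_chain(facts, rules):
    """Fire rules whose premises are all known facts until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts


print(sorted(forward_chain(describe_state({"energy_db": 55.0, "zcr": 0.1}), RULES)))
# prints ['energy_high', 'voiced', 'vowel_candidate', 'zcr_low']
```

Because the labels are crisp, frames with energies of 39.9 dB and 40.0 dB receive opposite labels, which is the kind of threshold brittleness raised in the following paragraph.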
Although numerous approaches to automatic speech recognition have been developed, such as template matching, statistical methods, and neural network based approaches, none has achieved the ultimate goal of speaker-independent continuous automatic speech recognition. Most of these approaches depend on heuristic knowledge in addition to their own methodology. Across automatic speech recognition systems, the segmentation problem remains unsolved, and it matters because segmentation is usually the first step in the recognition process: errors made during segmentation propagate to later stages, so the overall performance of the system can never exceed that of the segmenter. Furthermore, although most of today's speech recognition expert systems provide powerful tools for transforming expert knowledge into rules and maintaining them effectively, they share an obvious drawback: labeling relies on thresholds, so rules cannot be flexible for values near a threshold, and the exact threshold values are difficult to determine.
The Proposed Solution:
As mentioned above, the segmentation problem is common to automatic speech recognition systems. To avoid the problems faced by the SPREX expert system, the proposed approach uses irregular units based on the spectral transition measure. Frames are proposed as the structure of the speech recognition rules; this structure makes rule creation easier for the user and allows an automatic rule generator to be built. Fuzzy linguistic variables are also proposed for representing the rules, together with a rule-generation cycle made up of a rule generator, a rule tester with an error reporter, and a state describer, plus additional modules; such a cycle could save considerable time in rule writing and thus help produce a high-performance speech recognition system. In short, fuzziness would be introduced into the system to give the rules the flexibility that, as pointed out previously, threshold-based labeling lacks and that the human experts who create the rules require: linguistic variables would be used instead of thresholds to describe the state trajectories of the parameters. The figure on the next page shows the definition of the rule structure; the figure on page * shows the overall structure of the automatic speech recognition expert system and the flow of rule generation; and figure 6, on page *, shows the structure of the speech database and its relations with the other modules.
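The idea of replacing a crisp threshold with a linguistic variable can be sketched with simple fuzzy membership functions. This is an illustration under stated assumptions, not the proposed system's design: the shoulder-shaped memberships, the 35 to 45 dB and 0.2 to 0.4 breakpoints, the term names, and the use of min for AND are all hypothetical choices.

```python
# Hypothetical sketch: linguistic variables replace crisp thresholds.
# Breakpoints and term names are invented; the text does not specify them.

def falling(x, lo, hi):
    """Left-shoulder membership: 1 below lo, 0 above hi, linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def rising(x, lo, hi):
    """Right-shoulder membership: 0 below lo, 1 above hi, linear in between."""
    return 1.0 - falling(x, lo, hi)


def energy_high(e):
    """Term HIGH of the variable 'energy' (dB), ramping up across the old 40 dB threshold."""
    return rising(e, 35.0, 45.0)


def zcr_low(z):
    """Term LOW of the variable 'zero crossing rate', fading out between 0.2 and 0.4."""
    return falling(z, 0.2, 0.4)


def voiced_degree(energy_db, zcr):
    """Rule 'IF energy is HIGH AND zcr is LOW THEN voiced', with AND taken as min."""
    return min(energy_high(energy_db), zcr_low(zcr))


for e in (39.9, 40.0, 40.1):
    print(e, round(voiced_degree(e, 0.1), 3))
```

Instead of flipping from one label to another at 40 dB, the three energies give degrees of about 0.49, 0.5, and 0.51, so a rule's conclusion changes gradually near what used to be the threshold.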
The Modern Day Market:
If we were to extend the history of automatic speech recognition at the beginning of this article to the present day, what would it say? For a start, it would note that automatic speech recognition is now built into a very wide range of technology in one form or another, whether in the gadgets we use every day (mobile phones, televisions, automobiles, etc.) or in the systems we use for important everyday tasks (telephone banking, secure telephone payment gateways, telephone customer service queries, etc.). Since that list would be a very long one, let us focus on the most popular applications in today's society:
Automatic speech recognition was introduced into the telecommunications network in the early 1990s for two main reasons: (1) to reduce company costs by automating customer service attendant functions, and (2) to provide revenue-generating services that had previously been impractical because of the high cost of customer service attendants. Examples include:
Automation of Operator Services – for example, the "Voice Recognition Call Processing" system by AT&T and the "Automated Alternate Billing" system by Nortel.
Automation of Directory Assistance – systems that assist operators in determining telephone numbers in response to customers' voice queries.
Voice Dialling – systems that let customers dial by name or by number, completing calls without having to use the keypad.
Voice Banking Services – previously unavailable systems that give customers access to their account balances and transactions.
Voice Prompter – "Interactive Voice Response" systems that replace touch-tone input with voice input, allowing customers to speak the touch-tone option; these systems have since evolved to let customers speak the name of the service associated with a touch-tone position.
Directory Assistance Call Completion – systems that complete calls made through requests to "Directory Assistance".
Reverse Directory Assistance – systems that provide information such as the name and address associated with a spoken telephone number.
Information Services – systems that give customers access to information lines covering sporting events, weather and traffic reports, theatre bookings, restaurant reservations, etc.
In healthcare, automatic speech recognition can be deployed at either the front end or the back end of the medical documentation process. In front-end recognition, the speaker dictates into a speech recognition engine and the recognised words are displayed on screen almost immediately; the speaker is responsible for editing the text if necessary and then signing off the document. In back-end recognition, the speaker dictates into a digital dictation system, and the voice file is routed to a transcriptionist along with the recognised draft document; the transcriptionist is then responsible for editing and finalising the draft.
In the military, automatic speech recognition can be used in high-performance fighter aircraft, helicopters, battle management command centres, and air traffic controller training. In fighter aircraft and helicopters, the pilot works under a heavy workload in an environment that constantly occupies the eyes and hands. Here a speech recogniser can substantially lighten the pilot's load by handling common tasks such as setting a radio frequency or selecting a weapon, without the pilot having to take his hands off the controls or look down into the cockpit.
Battle management command centres require fast access to, and control of, large and rapidly changing information databases, which need to be queried as conveniently as possible. Like the cockpit, this is an eyes-busy environment in which most of the information is presented on displays, making it another example of where speech-based human-machine interaction is already useful and has the potential to become even more so.
In air traffic controller training, speech recognition is used in training applications that reduce the cost of training and support personnel while maintaining the quality of training for all trainee controllers.
Other Popular Areas:
Some other key areas where this technology is used are navigation systems, the automotive industry, smart homes, and the video gaming industry.
b) A Peek Under the Bonnet:
The Decision Making Process:
There are three broad approaches to automatic speech recognition by machine:
The acoustic phonetic approach.
The pattern recognition approach.
The artificial intelligence approach.
The first, and the simplest and most straightforward, approach to speech recognition by machine is the acoustic phonetic approach, in which the machine decodes the speech signal sequentially, based entirely on the observed acoustic characteristics of the signal and the known relations between these characteristics and phonetic symbols. Although it is a practical approach, for a number of reasons it has not achieved the same success in working systems as other, more modern approaches. It is based on the theory of acoustic phonetics, which assumes that there exist fixed, distinctive phonetic units in all spoken language, and that these units are broadly characterised by sets of properties that appear in the speech signal, or its spectrum, over time. Although the acoustic properties of phonetic units vary widely, both across speakers and with neighbouring phonetic units, it is assumed that the rules governing this variability are straightforward and can readily be learned and applied in real-world situations. The first step of the acoustic phonetic approach is therefore a segmentation and labeling phase: the speech is segmented into discrete regions in which the acoustic properties of the signal represent one or more phonetic units or classes, and then one or more phonetic labels are attached to each segmented region according to its acoustic properties. A second step then performs the actual recognition, attempting to determine a valid word or string of words from the sequence of phonetic labels produced in the first step.
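The two steps of the acoustic phonetic approach, segmentation and labeling followed by lexical decoding, can be sketched as follows. This is a deliberately simplified illustration: the per-frame broad-class labels, the merging of identical neighbouring labels into segments, and the three-word class-level lexicon are all hypothetical, and a real system would label with phonetic units and have to cope with the variability described above.

```python
from itertools import groupby

# Hypothetical per-frame broad phonetic-class labels, e.g. produced by
# acoustic rules on energy and zero crossing rate.
frame_labels = ["sil", "sil", "fric", "fric", "vowel", "vowel", "vowel", "stop", "sil"]


def segment_and_label(labels):
    """Step 1: merge runs of identical frame labels into (label, start_frame, n_frames)."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, start, n))
        start += n
    return segments


# Hypothetical lexicon mapping each word to its broad-class label string.
LEXICON = {
    "sit": ("fric", "vowel", "stop"),
    "see": ("fric", "vowel"),
    "at": ("vowel", "stop"),
}


def lexical_decode(segments, lexicon):
    """Step 2: return the words whose label string matches the non-silence segments."""
    labels = tuple(label for label, _, _ in segments if label != "sil")
    return [word for word, pattern in lexicon.items() if pattern == labels]


segs = segment_and_label(frame_labels)
print(segs)
print(lexical_decode(segs, LEXICON))  # ['sit']
```

Note that any labeling error in step 1 (say, the stop frame mislabeled as silence) would change the decoded word in step 2, which illustrates why segmentation errors propagate through the system.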
The second approach, the pattern recognition approach, uses the speech patterns directly, without explicit feature determination (in the acoustic phonetic sense) or segmentation. As in most pattern recognition approaches, it involves two steps: training on speech patterns, and recognising speech patterns by pattern comparison.
The first step, training, is where speech knowledge is brought into the system; it should be able to characterise the acoustic properties of a pattern adequately without knowledge of any other pattern. The second step, recognition, is where the pattern comparison takes place: the unknown speech is compared with each possible pattern learned in the training phase and is classified according to the goodness of match. It is worth noting that the pattern recognition approach has become the method of choice for automatic speech recognition systems, for three reasons:
Simplicity of use.
Robustness and invariance to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.
Proven high performance.
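The training and pattern comparison steps can be illustrated with template matching under dynamic time warping (DTW), one classic way of comparing patterns that differ in length and speaking rate. The text does not prescribe DTW specifically; the one-dimensional toy "feature" sequences, the word names, and the absolute difference used as the local distance are assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j]: smallest cumulative distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # frame-to-frame distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b's frame is repeated
                cost[i][j - 1],      # b advances, a's frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Training (simplified): store one reference template per word.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [2.0, 2.0, 1.0, 0.0, 0.0],
}


def recognise(unknown, templates):
    """Recognition: choose the template with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))


# A slower rendition of "yes": the same contour spread over more frames.
print(recognise([1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0], templates))  # yes
```

In practice each frame would be a multi-dimensional feature vector with a vector distance as the local measure, and each word would usually be represented by several templates derived from many training tokens rather than a single one.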
The third approach, the artificial intelligence approach, is a hybrid of the acoustic phonetic and pattern recognition approaches, in that it draws on concepts from both (this will be discussed later in more detail). It aims to automate the recognition procedure according to the way a person applies intelligence in visualising and analysing the measured acoustic features and, finally and most importantly, in making a decision about them.
As the discussion above shows, pattern recognition and decision operations lie at the heart of any automatic speech recognition system. We expand on these important topics below:
Mathematical Formulation of the ASR Problem:
The ASR problem is formulated as a statistical decision problem; more specifically, as a Bayes maximum a posteriori probability (MAP) decision process, in which we seek the word string Ŵ (in the task language) that maximises the a posteriori probability P(W|X) of that string, given the measured feature vector X, i.e.,

$$\hat{W} = \arg\max_{W} P(W \mid X) \tag{9.1}$$

Using Bayes' rule, this can be rewritten as

$$\hat{W} = \arg\max_{W} \frac{P(X \mid W)\, P(W)}{P(X)} \tag{9.2}$$
Equation (9.2) shows that the calculation of the a posteriori probability decomposes into two terms: one that expresses the a priori probability of the word sequence W, namely P(W), and one that expresses the likelihood that the word string W produced the feature vector X, namely P(X|W). In all subsequent calculations we ignore the denominator P(X), since it does not depend on the word sequence W being optimised. The term P(X|W) is known as the "acoustic model" and is usually written P_A(X|W) to emphasise its acoustic nature. The term P(W) is known as the "language model" and is generally written P_L(W) to emphasise its linguistic nature. The probabilities P_A(X|W) and P_L(W) are estimated, or learned, from a set of training data labelled by a knowledge source, usually a human expert, where the training set is as large as is practically possible. The recognition decoding process of (9.2) is often written as a three-step process, i.e.,

$$\hat{W} = \underbrace{\arg\max_{W}}_{\text{Step 3}} \; \underbrace{P_A(X \mid W)}_{\text{Step 1}} \; \underbrace{P_L(W)}_{\text{Step 2}} \tag{9.3}$$
Here, Step 1 is the computation of the probability associated with the acoustic model of the speech sounds in the sentence W; Step 2 is the computation of the probability associated with the linguistic model of the words in the utterance; and Step 3 is the search through all valid sentences in the task language for the most likely sentence. To be clearer about the signal processing and computations associated with each of the three steps of (9.3), we need to look more closely at the relationship between the feature vector X and the word sequence W. As discussed above, the feature vector X is a sequence of acoustic observations corresponding to each of the T frames of speech, of the form

$$X = \{x_1, x_2, \ldots, x_T\}$$

where the speech signal duration is T frames (i.e., T times the frame shift in ms), and each frame x_t, t = 1, 2, ..., T, is an acoustic feature vector of the form

$$x_t = (x_{t1}, x_{t2}, \ldots, x_{tD})$$

that characterises the spectral/temporal properties of the speech signal at time t, where D is the number of acoustic features in each frame. Similarly, the optimally decoded word sequence W can be expressed as

$$W = w_1 w_2 \cdots w_M$$

where there are assumed to be exactly M words in the decoded string.
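As a toy numerical illustration of the three steps of (9.3), the sketch below scores a handful of candidate word strings by adding an acoustic log-likelihood (Step 1) and a language-model log-probability (Step 2), then searches for the maximum (Step 3). The candidate strings and probabilities are invented; a real decoder searches an enormous hypothesis space rather than a small enumerated list, and working with logarithms is simply a common practical choice to avoid numerical underflow.

```python
import math

# Hypothetical candidate word strings W with invented probabilities:
#   p_acoustic = P_A(X|W) for the fixed observation sequence X
#   p_language = P_L(W), the language model prior
candidates = {
    "recognise speech": {"p_acoustic": 1e-40, "p_language": 1e-4},
    "wreck a nice beach": {"p_acoustic": 3e-40, "p_language": 1e-7},
    "recognised peach": {"p_acoustic": 2e-40, "p_language": 1e-6},
}


def log_score(entry):
    """Step 1 + Step 2 in the log domain: log P_A(X|W) + log P_L(W).

    P(X) is omitted because it is the same for every candidate W.
    """
    return math.log(entry["p_acoustic"]) + math.log(entry["p_language"])


# Step 3: search over the (here tiny, enumerated) set of candidate strings.
best = max(candidates, key=lambda w: log_score(candidates[w]))
print(best)  # recognise speech
```

The acoustically best hypothesis, "wreck a nice beach", loses once the language model term P_L(W) is included, which is exactly the role the prior plays in (9.2).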
A Universal Paradigm:
The above is a universal paradigm for speech recognition, a model that depicts the whole process of automatic speech recognition from beginning to end. Reading from left to right: a user generates a speech signal by speaking; the speech signal is decoded into a sequence of words that are meaningful according to the syntax, semantics, and pragmatics of the recognition task; the meaning of the recognised words is then obtained by a higher-level processor, which uses a dynamic knowledge representation to modify the syntax, semantics, and pragmatics according to the context of what has already been recognised. In this way, inputs that make no sense in context are eliminated from consideration, at the risk of misunderstanding but with the benefit of minimising errors for sequentially meaningful inputs. The feedback from the higher-level processing box also reduces the complexity of the recognition model by limiting the search for valid input sentences from the user. Finally, the recognition system responds to the user by performing the requested action and prompting for further input.
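The feedback loop from the higher-level processor can be sketched as a dialogue state that restricts which sentences the recogniser is allowed to return. Everything here (the states, the allowed sentence sets, the scores, and the tiny state-update function) is hypothetical; it only illustrates how limiting the search space also stops a contextually meaningless hypothesis from winning.

```python
# Hypothetical combined acoustic + language scores (log domain) per sentence.
scores = {
    "book a flight": -12.0,
    "cancel": -15.0,
    "yes": -11.0,  # acoustically best, but only meaningful after a question
    "no": -14.0,
}

# Hypothetical dialogue states and the sentences considered valid in each.
ALLOWED = {
    "main_menu": {"book a flight", "cancel"},
    "confirm": {"yes", "no"},
}


def decode(scores, state):
    """Return the best-scoring sentence among those valid in the current state."""
    valid = [s for s in scores if s in ALLOWED[state]]
    return max(valid, key=scores.get)


def next_state(state, sentence):
    """Minimal stand-in for the higher-level processor: update context from meaning."""
    if state == "main_menu" and sentence == "book a flight":
        return "confirm"
    return "main_menu"


state = "main_menu"
heard = decode(scores, state)     # "yes" is excluded here, so "book a flight" wins
state = next_state(state, heard)  # the context now expects a confirmation
print(heard, "->", state)         # book a flight -> confirm
```

Restricting each turn to a small set of valid sentences both shrinks the search and reflects the trade-off described above: a genuine but unexpected input would be forced onto one of the allowed sentences.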
"The explicit and systematic management of vital knowledge and its associated processes of creating, gathering, organizing, diffusion, use and exploitation, in pursuit of organizational objectives." (2 p. 95)
Knowledge greatly affects the problems that businesses worldwide face every day. What is knowledge? Put simply, it is the understanding that people in all walks of life have developed while using information and data to solve a variety of problems in such an environment. By handling knowledge correctly, businesses can gain strategic and competitive advantages over their competitors, and an organisation can succeed if it continuously meets its users' changing demands through its products, processes, and people; the main resource for capturing, using, and sharing knowledge is the organisation's employees. The assets managed by a knowledge management system are either explicit or tacit. Documents, reports, policies, files, and databases usually make up explicit knowledge, whereas tacit knowledge is usually embedded in organisational practices, insights, and norms, and in the key people responsible for them. Explicit knowledge can easily be represented and communicated through words and symbols, whereas tacit knowledge must first be represented in a form the system can read so that it can be operated on further during the knowledge management process; this process incorporates sub-processes such as identification, acquisition, organisation, representation, use, sharing, and conversion in order to manage this vital knowledge optimally. All knowledge management activities culminate in recording the knowledge in a centrally available repository containing knowledge assets such as business processes, customers, key individuals within the organisation, organisational memory, and relationships.
The Integration of Knowledge:
Automatic speech recognition systems require knowledge gathered and assembled from a vast range of disciplines; only in this way can they succeed.
As mentioned previously, pattern recognition is a key part of the automatic speech recognition decision process. As also stated earlier, pattern recognition consists of two steps: the training of speech patterns, and the recognition of patterns via pattern comparison. It is in the first step, training, that speech knowledge is brought into the system. The idea is that if enough versions of a pattern to be recognised (whether a sound, a word, a phrase, etc.) are included in the training set given to the algorithm, the training procedure should be able to characterise the acoustic properties of that pattern adequately, without regard to any other pattern presented to the training procedure. As mentioned earlier, this type of characterisation of speech via training is known as pattern classification, because the machine learns which acoustic properties of the pattern are reliable and repeatable across all training tokens.
The usefulness of this method lies in the pattern comparison stage, in which the unknown speech still to be recognised is compared directly with each possible pattern learned in the training phase; the unknown speech is then classified according to how well it matches each pattern. Another area where knowledge must be incorporated is segmentation and labelling (also touched upon earlier); the AI approach here supplements the commonly used acoustic knowledge with phonemic, lexical, syntactic, semantic, and even pragmatic knowledge. These types of knowledge can be defined as follows:
Acoustic knowledge - evidence of which sounds are spoken, based on spectral measurements and on the presence or absence of features.
Lexical knowledge - the combining of acoustic evidence so as to hypothesise words as specified by a lexicon; a lexicon maps sounds into words or, conversely, breaks words down into sounds.
Syntactic knowledge - the combining of words to form grammatically correct strings (according to a language model), such as sentences or phrases.
Semantic knowledge - understanding of the task domain, so as to validate sentences or phrases that are consistent with the task being performed, or with previously decoded sentences.
Pragmatic knowledge - the inference ability required to resolve ambiguity of meaning, based on the ways in which words are generally used.
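To make the lexical level concrete, the following Python sketch shows a lexicon used in both directions described above: breaking words down into sounds, and mapping a sound sequence back to candidate words. The entries, the ARPAbet-style phoneme symbols, and the function names are illustrative assumptions, not taken from any particular recogniser.

```python
# Hypothetical toy lexicon; ARPAbet-style symbols are used for illustration only.
LEXICON = {
    "open": ("OW", "P", "AH", "N"),
    "the": ("DH", "AH"),
    "door": ("D", "AO", "R"),
    "doors": ("D", "AO", "R", "Z"),
    "two": ("T", "UW"),
    "too": ("T", "UW"),
}

def word_to_sounds(word):
    """Break a word down into its phoneme sequence."""
    return LEXICON[word]

def build_reverse_index(lexicon):
    """Map each phoneme sequence back to every word pronounced that way."""
    index = {}
    for word, phones in lexicon.items():
        # Homophones share a key, so each key holds a list of words.
        index.setdefault(phones, []).append(word)
    return index

def sounds_to_words(phones, index):
    """Return all words whose pronunciation matches the phoneme sequence."""
    return index.get(tuple(phones), [])

index = build_reverse_index(LEXICON)
print(word_to_sounds("doors"))
print(sounds_to_words(["T", "UW"], index))
```

Because "two" and "too" share a pronunciation, the reverse lookup returns both; choosing between them is left to the syntactic and semantic levels listed above.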
c) Knowledge Management:
The Universally Accepted Benefits:
Knowledge management allows organisations to increase their understanding, sharing, and use of all existing knowledge, and adopting it systematically can lead organisations towards higher-level innovation. The concerns that might drive an organisation down this path include the following (each of which, once knowledge management techniques are applied, can yield much-desired benefits):
Size and dispersion of an organisation – Large organisations with many employees and complicated business procedures can find it difficult to manage and control their intellectual assets, especially when knowledge is distributed across numerous geographical locations. In such situations, a virtual store of organisational knowledge can be a valuable asset, one that can also support the globalisation of the business.
Reducing risk and uncertainty – Depending on the particular people who hold important knowledge is risky; this risk can be reduced by collecting and managing that knowledge as a whole, so that it is always available on demand and in a timely fashion.
Improving the quality of decisions – Readily accessible knowledge helps management make better-informed decisions, creating new opportunities and new knowledge that can later support innovative practices.
Improving customer relations – Every successful organisation must be able to predict, and therefore satisfy, its customers’ needs. Storing the relevant knowledge and making it accessible on demand improves customer service, because the customer no longer has to rely on a single person whose absence or lack of knowledge could damage the relationship between customer and organisation. Instead, knowledge management techniques allow new ways of carrying out these tasks more cheaply, simply, and quickly.
Technocentric support – Because knowledge management relies heavily on technology, each technological advance makes it more possible to serve both customers and employees with better services and higher-quality, timely decisions.
Intellectual asset management and prevention of knowledge loss – Knowledge held by the organisation can be accessed without the usual barriers of geographical location, time, and availability of experts. Once the required information is documented, a backup can always be kept in a safe location as a further precaution. All of this makes it easier to manage, share, test, or even clone the available resources.
Future use of knowledge – Beyond the procedural benefits already mentioned, such as the discovery, use, and sharing of knowledge, the stored knowledge can later be used to train future experts, or to backtrack to a particular event in order to analyse mistakes, learn from them, and take measures so as not to repeat them.
Increasing market value and enhancing the organisation’s brand image – The cheaper, better, and quicker solutions that a knowledge management system can provide satisfy all of the system’s beneficiaries. Customers come to feel that they can rely on the organisation, which goes a long way towards improving the company’s brand image. The knowledge gained from customer feedback and expert opinion improves the overall quality of the company’s products and services, which in turn increases their market value.
Shorter product cycles – Because the organisation does not depend on any particular expert, and because business decisions can be made in a knowledge-oriented manner from any location, the business transaction cycle is shorter. As a result, the organisation satisfies the needs of customers and employees better and more quickly.
Restricted access and added security – Owing to the technocentric approach described earlier, knowledge can be made available only to selected people, via passwords, access rights, or other utilities. Such controlled access makes the information more secure and also gives a clearer picture by hiding details that are not required.
The Beneficial Impact Within Automatic Speech Recognition Systems:
For an automatic speech recognition system to succeed in its intended task, it requires knowledge and expertise from a great number of disciplines, a range far broader than any single individual can reasonably be expected to possess. For this reason it is vital that a researcher has a strong understanding of the essentials of automatic speech recognition, so that the many available techniques can be applied to a range of problems. One way of understanding the real impact knowledge has on an ASR system is to compare a language processing system with a simple data processing system: what distinguishes language processing applications is their use of knowledge of language. Consider the Unix wc program, which counts the bytes, words, and lines in a text file. When counting bytes and lines, wc is an ordinary data processing application; when counting words, however, it requires knowledge of what it means to be a word, and to that extent it becomes a language processing system. More advanced language agents such as HAL (briefly mentioned in 1a) require a much broader and deeper knowledge of language. HAL must be able to analyse an incoming audio signal and recover the exact sequence of words the speaker used to produce it; in response, HAL must also be able to take a sequence of words and generate an audio signal that the human speaker can in turn recognise. Undertaking either of these tasks successfully requires knowledge and expertise from a great number of disciplines.
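The wc comparison can be sketched in a few lines of Python. This is a hypothetical re-implementation, not wc itself; it shows that counting bytes and newlines needs no knowledge of language, while the word count depends entirely on what one decides a word is.

```python
import re

def wc(text):
    """Return (lines, words, bytes) for a string, in the order the Unix wc command prints them."""
    data = text.encode("utf-8")
    lines = data.count(b"\n")    # wc counts newline characters, nothing linguistic
    words = len(text.split())    # a "word" is just a maximal run of non-whitespace characters
    return lines, words, len(data)

def letter_run_words(text):
    """An alternative, equally naive definition: a word is a maximal run of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

text = "I'm sorry, Dave. I'm afraid I can't do that.\n"
print(wc(text))                      # (1, 9, 45)
print(len(letter_run_words(text)))   # 12
```

For this sentence (typed with plain ASCII apostrophes), the whitespace definition finds 9 words, while the letter-run definition finds 12, because "I'm" and "can't" are each split in two. Neither is "correct"; deciding requires knowledge of language.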
The Process of Knowledge Management Within ASR Systems:
Earlier we made a brief comparison between a simple program, wc, and a more sophisticated language agent, HAL. To explain the process of knowledge management within an automatic speech recognition system, I will again use HAL. As stated earlier, HAL requires a much broader and deeper knowledge of language than a simple program such as wc; let us briefly consider some of what HAL would need to know in order to engage in the following dialogue scenario:
To determine what Dave is saying, HAL must be capable of analysing an incoming audio signal and recovering the exact sequence of words Dave used to produce it. Similarly, HAL must be capable of taking a sequence of words and generating an audio signal in response that Dave can in turn recognise. Both tasks require knowledge of phonetics and phonology, which helps model how words are pronounced in everyday speech. Note also that HAL is capable of producing contractions such as "I’m" and "can’t"; producing and recognising these and other variations of individual words (for example the plural forms "doors" and "handles") requires knowledge of morphology, which captures information about the shape and behaviour of words in context. Moving beyond individual words, HAL must also be able to analyse the structure underlying Dave’s request. Among other reasons, such an analysis is needed so that HAL can determine that Dave’s utterance is a request for action, as opposed to a simple statement about the world or a question about the door, as in the following variations:
HAL, the pod bay door is open.
HAL, is the pod bay door open?
In addition, HAL must use similar structural knowledge to string together the words that will form its response. For example, HAL must know that the following sequence will make no sense to Dave, even though it contains exactly the same words as the original reply:
I’m I do, sorry that afraid Dave I’m can’t.
The knowledge needed to order and group words correctly comes under the heading of syntax. Knowing the nature of Dave’s request (that it is a command for HAL to open the pod bay door, rather than an inquiry about the day’s lunch menu) requires knowledge of the meanings of the component words, the domain of lexical semantics, and of how these components combine to form larger meanings, the domain of compositional semantics. Furthermore, instead of replying with "No" or "No, I won’t open the door", HAL shows that it knows enough to be polite: it first says "I’m sorry" and "I’m afraid", and then signals its refusal indirectly with "I can’t", rather than the more direct and truthful "I won’t". The knowledge needed to use such appropriately polite and indirect language comes under the heading of pragmatics. Finally, instead of simply ignoring Dave’s command and leaving the door closed, HAL engages in a structured conversation about Dave’s request; its correct use of the word "that" in its answer is a simple illustration of the kind of between-utterance device common in such conversations. This reflects knowledge of discourse conventions, which any automatic speech recognition system needs in order to structure such conversations correctly. To recall them, the six distinct categories of knowledge of language are summarised below:
Phonetics and Phonology – The study of linguistic sounds.
Morphology – The study of the meaningful components of words.
Syntax – The study of the structural relationships between words.
Semantics – The study of meaning.
Pragmatics – The study of how language is used to accomplish goals.
Discourse – The study of linguistic units larger than a single utterance.
The Key Components Within the ASR System:
Taking the diagram on page 53 as an illustration of the key components of which any automatic speech recognition system consists, we can now explain the role of each component and its contribution to the system as a whole:
Any raw audio signal (received through a microphone, for example) is in too complex a form for the task of speech recognition and must be converted into a more manageable form. This is the main role of the front end.
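As an illustration of what the front end does, the sketch below (Python with NumPy) splits a signal into overlapping frames, applies a Hamming window, and computes each frame’s log energy and magnitude spectrum. The frame length and hop (25 ms and 10 ms at an assumed 16 kHz sampling rate) are common choices rather than values from the diagram, and a real front end would go further, typically producing features such as mel-frequency cepstral coefficients.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames; a trailing partial frame is dropped.
    Assumes len(signal) >= frame_len."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row k holds samples [k*hop, k*hop + frame_len).
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return signal[idx]

def front_end(signal, frame_len=400, hop=160):
    """Return per-frame log energy and magnitude spectrum of Hamming-windowed frames."""
    frames = frame_signal(signal, frame_len, hop) * np.hamming(frame_len)
    log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # small floor avoids log(0)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))            # frame_len // 2 + 1 bins
    return log_energy, spectrum

# One second of a 440 Hz tone sampled at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
energy, spec = front_end(tone)
print(energy.shape, spec.shape)
```

A one-second tone yields 98 frames of 201 spectral bins; with 400-sample frames the bins are 40 Hz apart, so the 440 Hz tone peaks in bin 11.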
An acoustic model holds the data describing the acoustic nature of all the phonemes understood by the system. Acoustic models are built through a training process using large quantities of transcribed audio. An acoustic model is usually specific to one language, and can either be tailored to a specific accent or trained to cope with a broader range of accents. Because a phoneme usually sounds different depending on the phonemes before and after it, the acoustic model uses context-dependent phonemes: phonemes in the context of a preceding and a following phoneme, known as triphones. Each triphone is represented by a hidden Markov model (HMM), which describes how the sound of the phoneme develops from beginning to end by splitting it into a number of HMM states. The acoustic model assigns parameters to each triphone state, and these parameters describe a probability density function; this function is used to calculate the probability that the triphone state matches a given feature vector from the front end.
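Two of the acoustic model’s ideas can be sketched briefly in Python: expanding a phoneme sequence into triphones, and scoring a feature vector against one HMM state’s probability density function. The sketch assumes a single diagonal-covariance Gaussian per state (production systems commonly use Gaussian mixtures or neural networks), pads word boundaries with a hypothetical "sil" context, and borrows the left-centre+right triphone notation used by toolkits such as HTK.

```python
import math

def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into 'left-centre+right' triphone names."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

def diag_gaussian_logpdf(x, mean, var):
    """Log density of feature vector x under one state's diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

print(to_triphones(["D", "AO", "R"]))
# A feature vector close to the state's mean scores higher than a distant one.
print(diag_gaussian_logpdf([0.1, 0.0], [0.0, 0.0], [1.0, 1.0]) >
      diag_gaussian_logpdf([2.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
```

Working in the log domain, as here, is the usual design choice because multiplying many small probabilities across frames would otherwise underflow.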
The dictionary holds the set of words and their pronunciations, expressed using the set of phonemes understood by the acoustic model.
A language model specifies how individual words can be combined into longer sequences, such as sentences, sub-sentences, and commands. Just as a human uses knowledge of language to decipher words that are ambiguous or unclear, a decoder uses the language model to do the same. Two main forms of language model exist: statistical language models (SLMs) and constrained grammars.
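A statistical language model can be illustrated with a tiny bigram model. The Python sketch below estimates P(word | previous word) from a hypothetical toy corpus with add-one smoothing (one simple choice among many) and scores whole sentences in the log domain; the corpus and function names are assumptions for illustration.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count history words and word pairs, with sentence-start/end markers."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])                   # counts of each word as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))   # counts of each adjacent pair
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Sum of log P(b | a) over the sentence, with add-one (Laplace) smoothing
    so that unseen word pairs receive a small, non-zero probability."""
    unigrams, bigrams, v = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
               for a, b in zip(tokens[:-1], tokens[1:]))

model = train_bigram(["open the pod bay door", "open the door", "close the door"])
print(log_prob("open the door", model) > log_prob("door the open", model))
```

Trained on three short commands, the model gives "open the door" a much higher log probability than the scrambled "door the open", which is the kind of knowledge a decoder draws on.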
The decoder is the heart of any automatic speech recognition system: it decodes the sequence of features received from the front end to reveal which words were spoken, or at least its best hypothesis. Decoding is a layered process, built upon the processing of phonemes (the acoustic model), then words (the dictionary), then sequences of words (the language model). Phoneme recognition is error-prone for a number of reasons, so the phoneme level cannot succeed on its own; it must be informed by the language model.
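The way the language model informs the decoder can be sketched as a simple rescoring step. Assuming each candidate word sequence W already has an acoustic log-likelihood log P(features | W) and a language-model log-probability log P(W), the decoder picks the sequence that maximises their weighted sum. The scores below are invented for illustration, and lm_weight is a hypothetical name for the language-model scale factor that real decoders usually tune.

```python
# Invented scores for two acoustically confusable hypotheses (log domain).
hypotheses = {
    "open the pod bay door": {"acoustic": -120.0, "lm": -9.0},
    "open the pot bay door": {"acoustic": -119.0, "lm": -16.0},
}

def best_hypothesis(hyps, lm_weight=1.0):
    """Return argmax over W of log P(features | W) + lm_weight * log P(W)."""
    return max(hyps, key=lambda w: hyps[w]["acoustic"] + lm_weight * hyps[w]["lm"])

print(best_hypothesis(hypotheses, lm_weight=0.0))  # acoustics alone
print(best_hypothesis(hypotheses))                 # acoustics informed by the language model
```

With lm_weight set to 0 the acoustically preferred "pot" wins; once the language model contributes, "pod" is chosen, mirroring the point that phoneme evidence alone is not enough.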
Tools and Techniques:
An important area within speech recognition is the development of techniques, tools, and systems for speech input to machines. While it would be impractical to cover every tool and technique available in speech recognition, I will describe the most common in detail:
The Acoustic Phonetic Approach:
The first technique shown in the table above, the acoustic phonetic approach, has already been touched upon earlier in this discussion. This early approach assumes that spoken language contains finite, distinctive phonetic units (phonemes), and that these are broadly characterised by a set of acoustic properties that become manifest in the speech signal over time. Although the acoustic properties of phonetic units are highly variable, both across speakers and with neighbouring sounds (the latter known as the coarticulation effect), the approach hypothesises that the rules governing this variability are straightforward and can readily be learned by a machine. The first step is a spectral analysis of the speech, combined with feature detection that converts the spectral measurements into a set of features describing the broad acoustic properties of the different phonetic units. The next step, also briefly mentioned earlier, is segmentation and labelling: the speech signal is segmented into stable acoustic regions, and one or more phonetic labels are attached to each region, resulting in a phoneme lattice characterisation of the speech. The final step attempts to determine a valid word, or string of words, from the phonetic label sequences produced by segmentation and labelling. During this validation process, linguistic constraints on the task, such as the vocabulary, the syntax, and other semantic rules, are used to access the lexicon for word decoding based on the phoneme lattice.
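The segmentation and labelling step can be pictured with a deliberately simplified Python sketch: a hypothetical energy-threshold detector labels each frame, and consecutive frames with the same label are merged into regions. A real acoustic phonetic system would use many features and broad phonetic classes (vowel, fricative, stop, and so on), and would keep alternative labels per region to build the phoneme lattice; here each region gets a single label.

```python
def label_frames(energies, threshold):
    """Hypothetical feature detector: label each frame 'speech' or 'sil' by its energy."""
    return ["speech" if e >= threshold else "sil" for e in energies]

def segment(labels):
    """Merge runs of identical frame labels into (label, start_frame, end_frame_exclusive)."""
    segments = []
    for i, lab in enumerate(labels):
        if segments and segments[-1][0] == lab:
            segments[-1][2] = i + 1          # extend the current region
        else:
            segments.append([lab, i, i + 1])  # start a new region
    return [tuple(s) for s in segments]

print(segment(label_frames([0.1, 0.2, 5.0, 6.0, 7.0, 0.1], threshold=1.0)))
```

The six frames above become three regions: silence (frames 0 to 1), speech (frames 2 to 4), and silence again (frame 5).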
The Pattern Recognition Approach:
This second approach, also touched upon earlier in this paper, consists of two essential steps: pattern training and pattern comparison. Its crucial feature is that it uses a well-formulated mathematical framework (described earlier in detail) and establishes consistent speech pattern representations for reliable pattern comparison. A speech pattern representation can take the form of a speech template or a statistical model such as a hidden Markov model, and can be applied to a sound smaller than a word, to a word, or to a phrase. In the pattern comparison stage, the unknown speech still to be recognised is compared directly with each possible pattern learned in the training stage, so that its identity can be determined according to the goodness of match.
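A minimal Python sketch of the two steps, assuming every token has already been reduced to a fixed-length feature vector: training averages the tokens of each class into one reference pattern, and comparison picks the reference nearest to the unknown input. The Euclidean distance, the data, and the function names are illustrative choices, not the framework referred to above.

```python
import numpy as np

def train_patterns(tokens_by_class):
    """Training: average each class's feature vectors into one reference pattern."""
    return {label: np.mean(np.asarray(tokens, dtype=float), axis=0)
            for label, tokens in tokens_by_class.items()}

def recognise(unknown, patterns):
    """Comparison: return the class whose reference pattern is closest to the input."""
    x = np.asarray(unknown, dtype=float)
    return min(patterns, key=lambda label: np.linalg.norm(x - patterns[label]))

patterns = train_patterns({
    "yes": [[1.0, 0.0], [1.2, 0.1]],
    "no": [[0.0, 1.0], [0.1, 1.1]],
})
print(recognise([0.9, 0.2], patterns))
```

Storing every training token instead of the class average would turn this into a nearest-neighbour template matcher, which is closer to the template-based approach described next.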
The Template Based Approach:
This approach has led to a whole host of techniques that have contributed greatly to advancing the field over the last six decades. The idea behind it is simple: a collection of prototypical speech patterns is stored as reference patterns representing the dictionary of candidate words; recognition is then carried out by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. Two key ideas in the template method are: 1) to derive a typical sequence of speech frames for a word (in this context, a pattern) via some averaging procedure, and to rely on local spectral distance measures to compare patterns; and 2) to use a form of dynamic programming to align patterns temporally, thereby accounting for differences in speaking rate between speakers and across repetitions of the same word by the same speaker.
The Stochastic Approach:
Stochastic modelling involves the use of probabilistic models to handle uncertain or incomplete information. In speech recognition, uncertainty and incompleteness arise from many sources, such as confusable sounds, speaker variability, contextual effects, and homophones; because it can handle such information, the stochastic approach is particularly suitable for speech recognition. The most popular stochastic approach today is undoubtedly the hidden Markov model, which is characterised by a finite-state Markov chain and a set of output distributions. The transition parameters of the Markov chain model temporal variability, while the parameters of the output distributions model spectral variability; these two types of variability are at the heart of speech recognition.
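The forward algorithm, which computes the probability that an HMM generated a given observation sequence, shows how the two kinds of parameters work together: transition probabilities (temporal variability) and output distributions (spectral variability). The Python sketch assumes a discrete-output HMM over quantised observation symbols for simplicity; continuous-density HMMs replace the table B with probability density functions.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: P(obs | HMM) for a discrete-output HMM.
    pi: (N,) initial state probabilities
    A:  (N, N) transitions, A[i, j] = P(next state j | state i)   -> temporal variability
    B:  (N, M) outputs,     B[j, k] = P(symbol k | state j)       -> spectral variability
    obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        # Sum over all predecessor states, then emit the next symbol.
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# A two-state left-to-right model, the usual topology for speech units.
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5], [0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1]))
```

For this model the sequence [0, 1] has probability 0.9 × 0.5 × 0.1 + 0.9 × 0.5 × 0.8 = 0.405. Long sequences are normally handled with scaling or log arithmetic to avoid underflow.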
Dynamic Time Warping (DTW):
Dynamic time warping (DTW) is an algorithm for measuring the similarity between two sequences that may differ in timing or speed. It has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear sequence can be analysed with it. The method allows a computer to find the optimal match between two given sequences (e.g. time series) under certain restrictions: the sequences are "warped" non-linearly in the time dimension to obtain a measure of their similarity that is independent of certain non-linear variations in time. This sequence alignment method is often used in the context of hidden Markov models. A well-known application of DTW has been automatic speech recognition, where it copes with different speaking speeds; the algorithm is also well suited to matching sequences with missing information, provided the existing segments are long enough for a match to occur.
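The classic dynamic programming recursion for DTW can be sketched in Python as follows. Each cell holds the cheapest cumulative cost of aligning two prefixes, and the three allowed moves enforce the usual restrictions: the path starts and ends at the sequence ends, never goes backwards in time, and never skips a frame. Scalar sequences and an absolute-difference local distance are assumed for brevity; for speech, each element would be a feature vector compared with a spectral distance.

```python
import math

def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Minimal cumulative local distance over all monotonic alignments of a and b."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Come from: a frame of a repeated, a frame of b repeated, or both advancing.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # same shape, spoken "slower"
print(dtw_distance([1, 2, 3], [3, 2, 1]))           # different shape
```

Because [1, 2, 3] and its slowed-down version [1, 1, 2, 2, 3, 3] can be aligned with zero local cost, their DTW distance is 0, even though a frame-by-frame comparison is impossible at different lengths. Practical recognisers often add a band constraint (such as a Sakoe-Chiba window) to limit how far the path may stray from the diagonal.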
Artificial Intelligence Approach (Knowledge Based):
This approach is (as touched upon earlier) a hybrid of the acoustic phonetic approach and the pattern recognition approach, drawing on ideas and concepts from both. The knowledge-based approach uses spectrogram, linguistic, and phonetic information (as shown earlier in text and diagrams). Some researchers in automatic speech recognition have developed recognition systems that use acoustic phonetic knowledge to derive classification rules for speech sounds, while others have used a template-based approach and succeeded in designing effective speech recognition systems. The template-based systems, however, provided little insight into human speech processing, which made tasks such as error analysis and knowledge-based system enhancement very difficult; on the other hand, the large body of linguistic and phonetic literature did provide some insight into, and understanding of, human speech processing. This leads naturally to knowledge engineering design in its purest form: the direct and explicit incorporation of experts’ speech knowledge into an automatic speech recognition system. This knowledge is mainly derived from careful study of spectrograms and is incorporated into the system using rules or procedures. In a more indirect manner, knowledge has also been used to guide the design of the models and algorithms of other techniques, such as template matching and stochastic modelling. This use of knowledge draws an essential distinction between knowledge and algorithms: algorithms enable us to solve problems, whereas knowledge enables the algorithms to work better.
This form of knowledge-based system has contributed considerably to the design of all successful strategies known today, since it plays a key role in the selection of a suitable input representation, the definition of units of speech, and the design of the recognition algorithm itself.
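Explicit knowledge incorporated through rules might look like the following Python sketch, which classifies a segment from two simple measurements: its energy and its zero-crossing rate. The rules reflect well-known phonetic tendencies (fricatives such as /s/ show many zero crossings, vowels few), but the thresholds, labels, and function names are hypothetical, chosen for illustration rather than taken from any published system.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(samples) - 1, 1)

def classify_segment(energy_db, zcr, silence_db=20.0, high_zcr=0.3, low_zcr=0.1):
    """Hypothetical expert rules; thresholds are illustrative only."""
    if energy_db < silence_db:
        return "silence"          # too little energy to be speech
    if zcr > high_zcr:
        return "fricative-like"   # noisy, high-frequency energy (e.g. /s/, /f/)
    if zcr < low_zcr:
        return "vowel-like"       # strong, periodic, low-frequency energy
    return "undecided"            # defer to other knowledge sources

print(zero_crossing_rate([1, -1, 1, -1]))
print(classify_segment(60.0, zero_crossing_rate([1, -1, 1, -1])))
```

The final "undecided" branch reflects the design choice discussed above: a rule base rarely covers every case, so its output is normally combined with evidence from other knowledge sources.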
Connectionist Approaches (Artificial Neural Networks):
This artificial intelligence approach attempts to mechanise the recognition procedure according to the way a person applies intelligence in visualising, analysing, and characterising speech based on a set of measured acoustic features. Among the techniques used in this class of methods is an expert system (e.g. a neural network) that integrates phonemic, lexical, syntactic, semantic, and even pragmatic knowledge for the segmentation and labelling phase, and that uses tools such as artificial neural networks to learn the relationships among phonetic events. The emphasis in this approach has mainly been on the representation of knowledge and the integration of knowledge sources. In connectionist models, knowledge or constraints are not encoded in individual units, rules, or procedures, but are distributed across many simple computing units. Uncertainty is modelled not as likelihoods or probability density functions of a single unit, but by the pattern of activity across many units. Because the computing units are simple, knowledge is not programmed into any individual unit’s function; instead, it lies in the connections and interactions between linked processing elements. Like stochastic models, connectionist models rely heavily on good training or learning strategies, which seek to optimise or organise a network of processing elements. Unlike the stochastic approach, however, no assumptions need to be made about the underlying probability distributions, and multilayer neural networks can be trained to produce fairly complex non-linear classifiers or mapping functions.
It is the simplicity and uniformity of the underlying processing elements that make connectionist models an attractive choice for hardware implementation.
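The idea that knowledge lies in the connections rather than in any single unit can be seen in a tiny network of threshold units computing XOR, a mapping that no single linear unit can represent. In this Python sketch the weights are set by hand to make the point visible; in a connectionist recogniser they would instead be learned by a training strategy such as back-propagation.

```python
def step(z):
    """The simple, uniform processing element: a threshold unit."""
    return 1 if z > 0 else 0

# Hand-chosen (not learned) weights for a 2-2-1 network; each tuple is (w1, w2, bias).
W_HIDDEN = [(1.0, 1.0, -0.5),   # fires if at least one input is on (OR)
            (1.0, 1.0, -1.5)]   # fires only if both inputs are on (AND)
W_OUT = (1.0, -1.0, -0.5)       # OR and not AND

def unit(weights, inputs):
    *w, bias = weights
    return step(sum(wi * xi for wi, xi in zip(w, inputs)) + bias)

def xor_net(x1, x2):
    hidden = [unit(w, (x1, x2)) for w in W_HIDDEN]
    return unit(W_OUT, hidden)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
```

Every unit runs the same trivial function; what makes the network compute XOR is entirely the pattern of weights connecting the units.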
Support Vector Machine (SVM):
The SVM (97) is one of the most powerful tools in pattern recognition; it uses a discriminative approach, classifying data with linear and non-linear separating hyperplanes. However, because SVMs can only classify fixed-length data vectors, the method cannot easily be applied to tasks involving variable-length data; such data must first be transformed into fixed-length vectors before SVMs can be used. The SVM is a generalised linear classifier with a maximum-margin fitting function. This fitting function provides regularisation, which helps the classifier generalise better. The method is largely free of dimensionality problems and can work in spaces of very high dimension, which allows a very large number of non-linear features to be constructed and "adaptive feature selection" to be performed during training; by shifting all the non-linearity into the features, the SVM can use a linear model whose VC (Vapnik-Chervonenkis) dimension is known.
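The fixed-length requirement and the maximum-margin idea can be sketched together in Python with NumPy. The sketch assumes one simple transform, pooling a variable-length sequence of feature frames into its per-dimension mean and standard deviation, and trains a linear SVM by Pegasos-style subgradient descent on the regularised hinge loss. For simplicity it cycles through the examples in order rather than sampling them at random, and it regularises the bias along with the weights. Function names and data are hypothetical; a practical system would use an established SVM library and, usually, a kernel.

```python
import numpy as np

def to_fixed_length(frames):
    """Map a variable-length (T, d) sequence of frames to a fixed 2d vector: mean and std."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def train_linear_svm(X, y, lam=0.01, epochs=50):
    """Subgradient descent on lam/2 * |w|^2 + mean hinge loss; labels y in {-1, +1}.
    A constant 1 is appended to each x so the last weight acts as a bias."""
    X = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)           # decreasing step size
            margin = yi * (w @ xi)
            w *= (1 - eta * lam)            # shrink: the regularisation (max-margin) term
            if margin < 1:                  # inside the margin: hinge-loss subgradient step
                w += eta * yi * xi
    return w

def predict(w, x):
    return 1 if w @ np.append(x, 1.0) >= 0 else -1

# Utterances of different lengths (1-D features per frame) and their labels.
utterances = [[[2.0], [2.1], [1.9]], [[2.2], [1.8]],
              [[-2.0], [-2.1]], [[-1.9], [-2.2], [-2.0], [-1.8]]]
labels = [1, 1, -1, -1]
X = np.array([to_fixed_length(u) for u in utterances])
w = train_linear_svm(X, labels)
print([predict(w, x) for x in X])
```

Mean and standard deviation pooling discards the temporal order of the frames; that loss of information is the price of meeting the SVM’s fixed-length requirement with such a simple transform.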
"Intelligent systems for computer-aided software engineering (CASE), are another type of KBS. These systems guide the development of information/intelligent systems, - for better quality and effectiveness." (2)
Case-based reasoning (CBR) is a technology well suited to making use of historical knowledge. The CBR paradigm covers a host of techniques and methods for organising, indexing, retrieving, and utilising the knowledge retained in past cases. Cases may be kept as concrete experiences, or a set of similar cases may be combined into a generalised case. Cases may be stored as separate knowledge units, or split into sub-units distributed within the knowledge structure. They may be indexed by a fixed or an open vocabulary, within a flat or hierarchical index structure. The solution from a past case may be applied directly to a current problem, or modified according to the differences between the two cases. CBR provides quick answers to new problems, even when relevant information is lacking, provided that a large number of previously solved cases is available. In CBR, the initial description of a problem defines a new case; this new case is used to retrieve a case from the existing library of past cases; the retrieved case is then combined with the new case and revised into a solved case. Two further processes are evident in this cycle: revision and retention. During revision, the proposed solution is tested for success and repaired if it fails; during retention, useful experience is kept for future use, and the case base is updated with a newly learned case or by modifying existing cases.
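The retrieve, reuse, revise, and retain cycle described above can be sketched as a small Python class. The sketch assumes that a problem is described by a tuple of numbers, retrieves the nearest stored case, reuses its solution unchanged, and delegates revision to a caller-supplied check function, a hypothetical stand-in for whatever test-and-repair procedure a real system would apply.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    problem: tuple   # numeric feature description of the problem
    solution: str

@dataclass
class CaseBase:
    cases: list = field(default_factory=list)

    def retrieve(self, problem):
        """Retrieve: nearest stored case by squared Euclidean distance."""
        return min(self.cases,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(c.problem, problem)))

    def solve(self, problem, check):
        """Retrieve -> reuse -> revise -> retain.
        check(problem, solution) returns the solution if it succeeds, or a repaired one."""
        retrieved = self.retrieve(problem)
        proposed = retrieved.solution                       # reuse the old solution
        confirmed = check(problem, proposed)                # revise: test, repair on failure
        self.cases.append(Case(tuple(problem), confirmed))  # retain the new experience
        return confirmed

cb = CaseBase([Case((0.0, 0.0), "yes"), Case((5.0, 5.0), "no")])
print(cb.solve((0.5, 0.2), check=lambda p, s: s))        # accepted as is
print(cb.solve((4.0, 4.0), check=lambda p, s: "maybe"))  # repaired during revision
print(len(cb.cases))
```

Because every solved problem is retained, the second query’s repaired answer becomes the nearest case for similar future problems, which is how the case base learns.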
The Integration of CBR Within Speech Recognition:
As stated above, CBR is a learning and problem-solving paradigm that solves new problems by recalling and reusing specific knowledge obtained from past experience. CBR algorithms have been applied to speech recognition in the past, and despite its success in real-world applications, CBR suffers from some obvious limitations: first, it does not produce concise representations of concepts that humans, or even machines, can easily understand and reason with; and second, CBR systems are highly sensitive to noise and to irrelevant features. Rule induction systems, on the other hand, can learn general domain-specific knowledge from a set of training data and represent it as understandable condition-action rules; they are also very successful at identifying small sets of highly predictive features and, most importantly, can use statistical measures to combat noise. Like CBR, however, induction systems have their weaknesses: they have been criticised for forming only axis-parallel frontiers in the instance space, and for having great difficulty recognising exceptions in small, low-frequency regions of the space. A more obvious problem can be added: rules are symbolic by nature and are therefore poor at representing continuous functions. When the two are combined appropriately, case-based reasoning and rule induction can take automatic speech recognition a step further, solving problems in systems where, when one technique fails, the other can provide a satisfactory solution (this is only one of many ways of integrating case-based knowledge into automatic speech recognition machines). Some examples of systems similar to the one described above are:
Pros and Cons of Knowledge Management in ASR systems:
Although knowledge management is widely used in automatic speech recognition systems today, it is not a practice that brings benefits alone; it also has a number of disadvantages:
Compatibility: Through the integration of knowledge within speech recognition, it has become possible to type a letter, translate a memo, and even store a meeting's information, all through voice commands. Applications such as these are even flexible enough to work across different types of software and systems.
Convenience: When typing on a keyboard, we tend at times to pause and think, then type, then pause and think again. Such interruptions disrupt the creative process; it is a much more convenient and effective experience simply to talk, have your words converted into text, and see your ideas appear before you. For people who often record meetings and memos, this is a perfect solution, and none of it would be possible without the integration of knowledge.
Speed: Typing requires practice, and typing fast requires even more practice and plenty of experience. There is little doubt, however, that speaking your thoughts out loud is much faster, and it requires little, if any, experience.
Hands Free: Multi-tasking becomes possible, because while the speech recognition system processes your speech, you can perform other necessary tasks.
Easy to Learn: Once the commands have been mastered, an ASR system is simple to use. This makes computing easier for those who are new to computers or have little technical understanding. It is also ideal for people with disabilities, the elderly, and those who are unable to type effectively.
Accuracy: It is not all smooth riding; there is always a certain margin of error, and speech recognition systems vary in their degree of accuracy. If, for example, a paragraph contains 50 words, a system with a high degree of accuracy might transcribe 46 or more of them correctly; a system with low accuracy, however, might leave much of the text to be retyped and corrected manually. The same applies to voice commands: speaking the command "Open Word", for example, might result in the computer mistakenly opening an entirely different program.
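Accuracy is commonly measured with the word error rate (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the recognised text into the reference text, divided by the number of reference words. With 46 of 50 words correct and 4 substitutions, WER is 4/50 = 8%, i.e. 92% word accuracy. The sketch below computes this with a word-level edit distance; it assumes whitespace-separated words and a non-empty reference, and the function names are illustrative.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum substitutions + deletions + insertions (word-level Levenshtein distance)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion of a reference word
                          curr[j - 1] + 1,     # insertion of an extra word
                          prev[j - 1] + cost)  # substitution (or match if cost is 0)
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    # Assumes the reference contains at least one word
    return word_errors(reference, hypothesis) / len(reference.split())


def word_accuracy(reference: str, hypothesis: str) -> float:
    return 1.0 - word_error_rate(reference, hypothesis)


print(word_errors("open word", "open world"))  # 1
```

Because insertions count as errors, WER can exceed 100% for very poor output, so "accuracy" in this sense can even be negative.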
Sound Quality: The manner of speech and the tone and quality of the voice are all important factors. We humans find it difficult to understand others when they slur, mumble, or speak in a low tone; why should the application be any different? A clear, crisp tone is required. Background noise is also a major factor: a crowded or noisy environment will almost certainly affect the recognition process, because the application may be unable to separate the unwanted noise from the meaningful speech.
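One common way to quantify how strongly background noise competes with speech is the signal-to-noise ratio (SNR) in decibels, 10 · log10(P_speech / P_noise), where P is the mean power of the samples; lower SNR generally makes recognition harder. This minimal sketch assumes the speech and noise can be measured separately (for example, noise taken from a pause in speaking); the function names are illustrative.

```python
import math


def mean_power(samples):
    """Average power (mean squared amplitude) of a non-empty sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


# Speech at amplitude 1.0 against noise at amplitude 0.1 gives a power ratio of 100
print(round(snr_db([1, -1, 1, -1], [0.1, -0.1, 0.1, -0.1]), 6))  # 20.0
```

Each 10 dB drop means the noise power has grown tenfold relative to the speech, which is why a crowded room degrades results so quickly.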
Learning: The way we speak can vary; a word may be pronounced slightly differently on occasion, and the software must either learn or adjust to such quirks in your speech. Because it is a progressive application, it learns from its mistakes over time; while you wait for this progression, however, you will most likely find yourself editing and correcting the end product.
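Real systems use far more elaborate speaker-adaptation techniques, but the general idea of gradually adjusting to a user's pronunciation can be illustrated with a simple template update. This sketch assumes, hypothetically, that each word is stored as a fixed-length feature template and that every corrected recognition moves the template a fraction `rate` of the way towards the user's actual pronunciation (an exponential moving average).

```python
def adapt_template(template, observation, rate=0.2):
    """Move a stored word template a fraction `rate` of the way towards a new observation.

    Exponential moving average: new = (1 - rate) * old + rate * observed.
    """
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    return [(1.0 - rate) * t + rate * o for t, o in zip(template, observation)]


print(adapt_template([0.0, 0.0], [1.0, 2.0], rate=0.5))  # [0.5, 1.0]
```

A small `rate` adapts slowly but is robust to a single odd pronunciation; a large one adapts quickly but can be thrown off by one mistake, which mirrors the trade-off the user experiences while the system "learns".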
Environment: In a relatively quiet environment, people around you may be disturbed by your speech, or may even mistakenly think that you are talking to them, which can lead to a range of problems.
Price: Typing on a keyboard is cheaper than using a speech recognition system, since no third-party tools or applications are required aside from the keyboard itself; a high-end speech recognition system, on the other hand, one that is worth your time and attention, is anything but cheap.
"Repeated or continuous observations or measurements of the patient, his or her physiological function, and the function of life support equipment, for the purpose of guiding management decisions, including when to make therapeutic interventions, and assessment of those interventions". (3)
In environments such as the intensive care unit and the operating room, patient monitoring is an extremely intricate process involving clinicians, nurses, and vast amounts of information, ranging from clinical observations and patient data acquired from bedside monitors to laboratory results. These highly complex processes can place excessive demands on the cognitive skills of the clinician. Failing to recognise that a problem exists, or failing to identify it in a timely manner, could result in discomfort, disability, or, in the worst case, even death for the patient.
Integrating knowledge and intelligence:
Artificial intelligence techniques can be introduced into these environments to:
alleviate some of the common problems encountered with information management;
better reflect the possibility that a problem with the patient exists;
try to correctly identify the cause of the problem.
The vast majority of intelligent monitoring systems developed for the intensive care unit and the operating room are organised into five distinct functional levels:
Signal acquisition: the system receives raw data from clinical sensors, which usually involves analogue-to-digital conversion and some low-level signal processing.
Signal validation: data validity checks and artifact removal take place.
Feature extraction and trend analysis: the numerical features that characterise the signals are transformed into a symbolic representation.
Inference and smart alarms: different reasoning elements are used to arrive at diagnoses, explanations, or predictions of events, and this information is then used to decide whether or not to raise an alarm.
Presentation: the information supporting the final decision, including visual alarms, is presented to the clinician through the operator interface.
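The five levels can be pictured as a pipeline in which each stage feeds the next. The toy sketch below follows that structure for a single hypothetical heart-rate signal; the valid range, the alarm threshold, the trend window, and the function names are invented for illustration and are not clinical guidance. Real monitors combine many signals and far richer reasoning at the inference level.

```python
from statistics import mean

# Illustrative values only: hypothetical, NOT clinical guidance
VALID_RANGE = (20, 250)  # plausible heart-rate range (beats/min) used for artifact removal
HIGH_RATE = 120          # hypothetical alarm threshold (beats/min)


def acquire(raw_counts, scale=1.0):
    # Level 1, signal acquisition: convert digitised sensor counts to physical units
    return [c * scale for c in raw_counts]


def validate(samples):
    # Level 2, signal validation: drop values outside a plausible range as artifacts
    lo, hi = VALID_RANGE
    return [s for s in samples if lo <= s <= hi]


def extract_trend(samples, window=3, tolerance=5.0):
    # Level 3, feature extraction and trend analysis: numerical features -> symbol
    early, late = mean(samples[:window]), mean(samples[-window:])
    if late - early > tolerance:
        return "rising", late
    if early - late > tolerance:
        return "falling", late
    return "stable", late


def infer(trend, level):
    # Level 4, inference and smart alarm: decide whether an alarm is warranted
    if level > HIGH_RATE and trend == "rising":
        return True, "heart rate high and rising"
    return False, "no alarm"


def present(alarm, explanation):
    # Level 5, presentation: the message shown on the operator interface
    return ("ALARM: " if alarm else "OK: ") + explanation


def monitor(raw_counts):
    samples = validate(acquire(raw_counts))
    trend, level = extract_trend(samples)
    return present(*infer(trend, level))


# The spurious reading of 400 is removed at the validation level
print(monitor([100, 105, 400, 110, 118, 125, 130, 135]))  # ALARM: heart rate high and rising
```

Passing an explanation string alongside the alarm decision reflects the point that decision support is more useful when the clinician can see why an alarm was raised.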
Most of the work on developing intelligent monitors for the ICU has so far concentrated on two particular applications: cardiovascular monitoring, and ventilator management for the post-operative care of cardiac surgery patients. Examples of intelligent monitoring systems, applications, and methods include:
Automatic control of drug infusion to regulate a patient's haemodynamic state (also used in the operating room, and related to the development of smart sensors).
The design of application interfaces that present information from intelligent decision support systems to the clinician in a way that helps explain how a decision was reached.
Temporal pattern recognition.
Template based methods.
Signal processing of the ECG, and many more.
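Of the methods listed, template-based methods are the easiest to illustrate briefly: a detected signal segment is compared against stored templates and assigned to the best match. This minimal sketch uses the Pearson correlation coefficient as the similarity measure and assumes that each beat has already been detected and aligned to the same length as the templates. The template names and shapes are hypothetical toy values, not real ECG morphologies.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den else 0.0  # a flat segment matches nothing


def best_template(segment, templates):
    """Return the name and correlation score of the best-matching template."""
    name = max(templates, key=lambda k: correlation(segment, templates[k]))
    return name, correlation(segment, templates[name])


# Hypothetical toy templates (equal length, already aligned)
templates = {
    "normal": [0, 1, 5, 1, 0, -1, 0],
    "inverted": [0, -1, -5, -1, 0, 1, 0],
}
name, score = best_template([0, 2, 10, 2, 0, -2, 0], templates)
print(name)  # normal
```

Correlation ignores overall amplitude and offset, so a beat recorded with a different gain still matches its template; in practice a minimum score would also be required before accepting a match.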