Monday 31 December 2012

XVoice: speech control of Linux desktop applications


XVoice

An open source speech control project up for adoption

The Early Days

Around the end of the 1990s, IBM released a Linux version of their ViaVoice speech recognition engine. It was always a beta product and never had the full set of features of the original Windows program, but at the time it was the only good recognition engine for Linux, so I started playing around with it.

Soon I discovered an open source project, XVoice. XVoice was an application that used the ViaVoice engine for speech-to-text, but then used the resulting text to control the Linux desktop. It was a hack that used a bunch of programs in ways they were never designed for, and it achieved something rather exciting: you could speak to Linux and it would do your bidding. 

One of the great features of the ViaVoice engine was that it allowed you to define a grammar, and then the engine would match whatever speech was input against that grammar. This meant that without training, recognition rates were near-perfect for a domain-specific grammar (nicely defined in BNF).

Progress and Success

After a few months of regular development in early 2000, XVoice had proper support for user-defined command grammars. These grammars mapped spoken commands to keystrokes, and you could have multiple grammars, one for each application. XVoice had some (hacky) heuristics: you could specify a regex that would match against the window title, which would then automatically load the right grammar file. You could control the mouse too: XVoice split the screen into ever smaller 3x3 grids that you navigated until the mouse was where you wanted it. The grammars were hierarchical, so you could include the grammar for spelling out numbers in your emacs control grammar, and they supported pattern substitution, so the command sent to an application could include some of the words you said.
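To make the grid navigation concrete, here is a minimal sketch of the idea in Haskell. It is not taken from the XVoice source (which is C++). The names Rect, subCell, refine and centre are made up for illustration, and the row-by-row numbering of cells from 1 to 9 is an assumption.

```haskell
import Control.Monad (foldM)

-- Hypothetical sketch of 3x3 grid refinement; not the XVoice implementation.

-- A screen region in pixels: left edge, top edge, width, height.
data Rect = Rect { left :: Int, top :: Int, width :: Int, height :: Int }
  deriving (Show, Eq)

-- Pick one of the nine cells of a region. Cells are numbered 1..9,
-- row by row from the top left (an assumed convention).
-- Numbers outside 1..9 give Nothing.
subCell :: Rect -> Int -> Maybe Rect
subCell (Rect l t w h) n
  | n < 1 || n > 9 = Nothing
  | otherwise      = Just (Rect (l + col * cw) (t + row * ch) cw ch)
  where
    (row, col) = (n - 1) `divMod` 3
    cw = w `div` 3
    ch = h `div` 3

-- Apply a sequence of spoken cell numbers, narrowing the region each time.
refine :: Rect -> [Int] -> Maybe Rect
refine = foldM subCell

-- The pointer target is the centre of the final region.
centre :: Rect -> (Int, Int)
centre (Rect l t w h) = (l + w `div` 2, t + h `div` 2)
```

On a 900x900 screen, saying "one" then "nine" would narrow the region to the 100x100 square whose top-left corner is at (200, 200). Each spoken number shrinks the target area by a factor of nine, so a handful of words is enough to reach any spot on the screen.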

There were some quite motivated users who contributed a lot to the development. One was a programmer who used Vi but had severe RSI that was making it difficult for him to work. He defined a comprehensive Vi grammar that allowed him to program, and interestingly, claimed he was more efficient because he was using higher level Vi commands than he would normally. Some Emacs users had huge grammars that let them read news, send emails, program in Lisp and who knows what else.

As an aside, my experience working on XVoice left me in no doubt that for regular people, voice control of your computer is a fun trick, but the only people who would use this on an ongoing basis are those who have no other choice. Talking for several hours a day is physically difficult, and even with the clever grammars some people designed, it's not something you'd choose to do unless you had to. We had at least a few quadriplegic users who were starting to use the system with some success; for them, XVoice was the only way they could operate a Unix machine. This realization made the future of the project clearer: we'd focus on features that helped people who couldn't type at all or only with great difficulty.

As a programming project, working on XVoice was just great, and I learned a lot from the other programmers who were much more experienced and capable than I was. By the end (version 0.9.6 I think...) it was a really cool program, being used by people who really needed it to work. We encountered and solved some of the problems of voice control of graphical user interfaces. The command grammar was a pretty elegant and extensible system, and I recall that the sense at the time was that most of the interesting work ahead lay in defining bigger and better libraries of grammars. 

An Abrupt End

Unfortunately IBM didn't seem very interested in ViaVoice on Linux. We had some contact with the developers, who were quite helpful, but the "official" IBM people would never tell anyone what the plans for continuing Linux support were. The only contact we had with them was when they demanded that we make it clear on the XVoice related websites that we didn't distribute ViaVoice with XVoice, that people had to buy it separately (which we always did). 

Then one day IBM discontinued ViaVoice for Linux. It just disappeared from their website. At the time, CMU Sphinx was the only plausible candidate for an open-source speech-to-text engine that we could use instead of ViaVoice, but it wasn't very mature and had some issues that would have been tough to work with. The main coders on the project had personal or work issues that meant they couldn't work on XVoice for a while, and so the project lost momentum. 

Wistful Thinking

Every once in a while when I read an article about speech to text, particularly about command and control systems, I wish that we'd had the time to rework XVoice to work against an open source engine. It's disappointing that there's still no clear (open source or other) solution for people who can only interact using speech. Many big tech companies pay lip service to accessibility, but in this case the big boys didn't do anyone any favors.

The Future

I think it's important to distinguish between projects that focus on building better general recognition accuracy for dictation, and accessibility-oriented command and control systems. Complete control with speech is a difficult problem, and I don't think you solve it as part of a larger generic command and control platform. You have to focus on accessibility and talk to users who have real accessibility issues, and get them to work with you to overcome them. If you are working on command and control, it's worth remembering that these people are the only ones who will still be using your software when the novelty wears off and their throat is sore. In the final few months, that's where XVoice was focusing. There were a bunch of awkward problems we'd need to fix, but it was pretty exciting.

The key feature that XVoice or any other command and control system relies on is the ability to feed context-specific grammars to the recognition engine on the fly. The underlying accuracy of the engine isn't very critical if the grammar is sufficiently constrained. All modern engines are likely good enough. But as far as I know, the current HTML5 implementations of speech input don't yet support setting grammars. CMU Sphinx appears to, but it's not clear how well it works in practice; their configuration files seem quite complex.

The XVoice code is all GPL. It's on SourceForge and now on GitHub, so please feel free to go nuts. Before today it had been about 10 years since I looked at it, but the docs are actually pretty good and the code isn't as bad to read as I expected. It's mostly pre-RAII C++, so it would need a cleanup and a dose of smart pointers to bring it up to modern standards. Even if the project isn't resurrected, the ideas around how command grammars are structured and used might be useful to another project, or the code for generating X events could be reused. There's a sample set of grammars in the modules subdirectory that make for interesting reading - there's even one for "netscape", how quaint.


Sunday 23 December 2012

Lessons learning Haskell



It's often claimed that learning Haskell will make you a better programmer in other languages. I like the idea that there's no such thing as a good programmer, just a programmer who follows good practices. As soon as we stop following good practices, we suck again. So, Haskell must introduce and indoctrinate better practices that we carry back to our other languages. Right? I think it's true but it's not obvious, so I've written this article to outline some of the habits and practices that I think changed after I used Haskell for a while.

Purity

Haskell makes your functions pure by default. You have to change the type signature if you want to do IO, and mark every function between your function and main() as tainted with IO. This forces you to be conscious of IO. It encourages you to keep functions that do IO as high up the stack, as close to main, as possible. Purity also means you can't read or write global variables; that's just another type of IO. If your function needs some data, you pass it as a parameter. So the type signature of a pure function is a complete inventory of everything it can access, and therefore is a very good spec for what the function does in most cases.

Experienced programmers who pay attention already know that IO and global vars mustn't be taken lightly. Every IO operation is a potential source of errors, exceptions, and failures. Functions that do IO are difficult or impossible to test. Programmers know this, but Haskell makes sure you never forget it when it matters. It incessantly shunts you in the direction of keeping the call-stack of IO-doers as small as possible.

When I go back to my other languages, I now put all my IO in top-level functions that are called directly from main or the event loop. I gather every scrap of data I need to do the computation. I marshal the data into typed structures and pass it all into a pure function that does the work. Then the structure that's returned is demarshalled and transmitted, displayed or stored as required. If I need to do some computation to determine what data I need to fetch, I make sure this is not commingled with the IO functions, so my code fetches data, calculates what else needs to be fetched, fetches that data, and so on.

This has some nice effects. The IO doers are isolated and distinct. Error handling and exception catching is clearer and simpler. The compute code is pure. This makes it much easier to test, debug, and understand.
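Here is a minimal sketch of that shape, written in Haskell since that's where the habit came from. The task (summarising a file of numbers, one per line) and all the names are made up for illustration. The point is only that the types show where the IO lives.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Hypothetical example: summarise a file of numbers, one per line.

-- Typed structure handed to the pure core.
newtype Readings = Readings [Double]

-- Typed result returned by the pure core.
data Summary = Summary { count :: Int, mean :: Double }
  deriving (Show, Eq)

-- Marshal raw text into a typed structure. Lines that don't parse
-- as numbers are skipped (an assumed policy for this sketch).
parseReadings :: String -> Readings
parseReadings = Readings . mapMaybe readMaybe . lines

-- The pure core: all of the computation, none of the IO.
summarise :: Readings -> Summary
summarise (Readings xs)
  | null xs   = Summary 0 0
  | otherwise = Summary n (sum xs / fromIntegral n)
  where
    n = length xs

-- Demarshal the result for display.
render :: Summary -> String
render (Summary n m) = show n ++ " readings, mean " ++ show m

-- The IO shell: fetch the data, hand it to the pure code, emit the result.
-- Intended to be called directly from main.
report :: FilePath -> IO ()
report path = do
  raw <- readFile path
  putStrLn (render (summarise (parseReadings raw)))
```

Only report has IO in its signature, so it is the only function that needs a real file to exercise. Everything else can be tested by passing in a string or a list.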


Clean Syntax for Static Types


boo :: Map Integer String -> String -> Integer

I have no idea what the word "boo" is supposed to mean when used as a function name. But I can be almost certain what this "boo" does. It takes a map that has Integer keys and String values, and a second argument that is just a String, and it returns an Integer. So I'm fairly sure that this function does a reverse mapping - you give it a String value and a Map, and it finds the Integer key for that value.

A lot of this depends on the fact that the signature tells me that there is no IO going on. If the bit at the end of the line was "-> IO Integer" instead of "-> Integer", all bets are off. The function could be sending the Map and String to launch control, and -> IO Integer could be the number of seconds it took to get a response, or the price of a gallon of gas in pennies (hence "boo", perhaps). The point is, you can't confidently reason about a function from its signature if IO is involved.
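For what it's worth, here is one pure function that fits both the signature and my guess. The body is entirely my assumption, since all we were given is the type. Because the result is a plain Integer rather than a Maybe Integer, the function needs some answer for a value that isn't in the map. This sketch returns -1, which is an arbitrary choice.

```haskell
import Data.Map (Map)
import qualified Data.Map as Map

-- One possible body for the signature above: a reverse lookup.
-- This is a guess at what "boo" might do, not a known definition.
boo :: Map Integer String -> String -> Integer
boo m v =
  case [k | (k, s) <- Map.toList m, s == v] of
    (k : _) -> k     -- toList is in ascending key order, so this is the smallest matching key
    []      -> (-1)  -- assumed sentinel for "value not found"
```

The need for a sentinel is itself a hint from the type: a careful author would probably have written Maybe Integer if the lookup could fail.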

The Haskell type signature of a function is particularly clear and easy to follow. Functions just map one type to another: "foo :: Author -> DateOfBirth". Parameterized types just list the parameter types: "Map Integer String". There are very few boilerplate tokens for the eye to scan.

But how has this changed what I do in other languages? I now sketch out the design for larger components in this Haskell signature notation. Particularly if I'm writing a library with a public API. I've shown these sketches to other developers as we discuss the design of a program, and they get it. Most of the time, I don't mention that the notation is Haskell. The only slight oddity for them is the use of -> between function "inputs". They expect foo :: A,B -> C instead of foo :: A -> B -> C. But they get over it immediately, and I have never had to mention currying or partial application, since they're usually just pleased that the notation is clearer than anything else we've ever used.

Container Operations


I think one of the reasons I started using Lisp, then Erlang and then Haskell was that I must have typed "for (size_t i=0; i<..." just about a million times and I was sick of it. C++ teases with approximations to map, filter, fold, scan, just enough so that you'll try them for a few months until you eventually give up or your colleagues smack you. When I want to filter items from a container, I don't want to start by saying "for(size_t i=0...". I want to say "filter f xs" and I want my colleagues to read that too.

It might seem like an overreaction. But even in big class-heavy C++ projects, where I was senior developer, I spent my days writing functions. Functions consisting of loops and branches, because C++ didn't do a great job of accommodating "operations on containers". Despite all the guff written about the STL separating iterators from algorithms from containers (from allocators...ahem), nobody provided a simple set of primitive container operations that regular programmers would use.

Using Haskell for a while, the effect goes further. It forced me to think of every such problem as a chain of the primitive list operations, maps, folds, filters and scans. Now I always think in these terms. I "see" the transformation of a container as a simple sequence of these operations. Before, I would have thought in terms of munging multiple actions into the body of a single loop.
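As a small illustration, here is a made-up task (the squares of the even numbers in a list, as a running total and as a final total) written as a chain of those primitives. The function names are hypothetical.

```haskell
-- Running total of the squares of the even numbers:
-- filter, then map, then scan. Read the chain right to left.
runningEvenSquares :: [Int] -> [Int]
runningEvenSquares = scanl1 (+) . map (^ 2) . filter even

-- The same pipeline ending in a fold gives just the final total.
sumEvenSquares :: [Int] -> Int
sumEvenSquares = foldr (+) 0 . map (^ 2) . filter even
```

For the list 1 to 6, the first function gives [4, 20, 56] and the second gives 56. The loop version would do the test, the squaring and the accumulation in one body. Here each step is a separate, named operation.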

I see things using higher level concepts, and I write my comments and code with these in mind. I usually still reach for the trusty for loop in C++, but I'll factor together common container operations where appropriate into higher level functions. Filtering is a really common example that seems to come up all the time.

What's Changed


Using Haskell definitely gives you a lot of warm fuzzy feelings (until your filehandle is closed before you've actually read the data because you didn't ask for a result, silly). Part of the joy of the language is that it forces you to take a new approach to problems you've solved conventionally before. When the answer clicks and you see that the new approach is more elegant and powerful and general than what you've been using all these years, it's hard not to smile with sheer pleasure.

In real-world commercial software projects, if you don't properly test your code and do code reviews, it doesn't matter what language you use: you're leaving the big wins on the table. A team that consistently does these things well will beat any team that does not, regardless of what language or technology stack they're using.

Using Haskell changed my practices so that the code I write is easier to test and easier to code review. There's a bunch of other stuff too; some of it can be articulated, and some will probably always be just a "warm fuzzy".