Richard Socher, Chief Scientist, Salesforce - December 14, 2017
Language is a multifaceted AI challenge: understanding it requires simultaneously grasping tone, context, sentiment, logic, world knowledge, cultural references, and much more. Teaching a computer to understand language and its inherent nuances requires a solution that goes beyond algorithms that decipher text alone. There must be an understanding of how speech relates to text; visual and linguistic concepts must intertwine and be grounded in order to provide proper contextualization; and computers have to learn entirely new languages, such as SQL, in order to deliver precision and reliability. In addition, architectures have to scale to support these different tasks, and models have to learn how the tasks relate to one another. These requirements make natural language understanding one of the most difficult challenges facing AI research, because it demands a single solution that addresses every one of these facets.
Over the course of the last year, Salesforce Research has made significant progress in deep learning and natural language processing (NLP) that addresses these challenges. We’ve built faster, more scalable model architectures, developed a reinforcement learning agent that programs new neural network architectures, and improved training methods in order to take full advantage of the wealth of data currently available and to increase model performance on each individual NLP task. Today, I’m excited to announce new breakthroughs that bring us closer to a unified approach to tackling the many facets of natural language, paving the way for humans and machines to communicate more effectively and work side by side.
The best machine learning model is frequently dictated by requirements such as the size of the available dataset, how quickly the model must run, and the type of data it will process. This means the optimal machine learning model should be tailored to each dataset and each use case, but there are not enough machine learning engineers and researchers to cover every possibility. We’re introducing a novel approach for automating architecture search to produce recurrent neural networks (RNNs) that are more flexible and take non-standard components into account. By using a domain-specific language (DSL) to describe candidate architectures, the automated search can produce novel RNNs of arbitrary depth and width. The DSL is flexible enough to define standard architectures such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), and it also allows the introduction of non-standard RNN components such as trigonometric curves and layer normalization. By improving the efficiency of these architecture generation methods, Salesforce Research is helping move toward a future where the grunt work of testing many different architectures is left to a machine, leaving humans to focus on more challenging aspects of NLP.
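To make this concrete, below is a minimal sketch of what such a DSL might look like, with an RNN cell represented as a tree of operator nodes. The operator names and the GRU-like encoding are illustrative assumptions, not the exact grammar used in our work.

```python
# Sketch of an architecture-search DSL: an RNN cell is an expression
# tree of operators. Operator names here are assumptions.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    op: str                        # e.g. "MM", "Add", "Mult", "Sigmoid"
    children: Tuple["Node", ...] = ()

def MM(x):      return Node("MM", (x,))        # learned linear map
def Add(a, b):  return Node("Add", (a, b))
def Mult(a, b): return Node("Mult", (a, b))    # elementwise product
def Sigmoid(x): return Node("Sigmoid", (x,))
def Tanh(x):    return Node("Tanh", (x,))
def Sub1(x):    return Node("Sub1", (x,))      # 1 - x, used for gating

X, H = Node("x_t"), Node("h_prev")             # the cell's inputs

# A GRU-like cell expressed in the DSL: gates are sigmoid-activated
# linear maps of the input and the previous hidden state.
z = Sigmoid(Add(MM(X), MM(H)))                 # update gate
r = Sigmoid(Add(MM(X), MM(H)))                 # reset gate
h_tilde = Tanh(Add(MM(X), MM(Mult(r, H))))     # candidate state
gru = Add(Mult(z, H), Mult(Sub1(z), h_tilde))  # next hidden state

def depth(node: Node) -> int:
    """Tree depth: deeper trees are deeper architectures, so the
    search space naturally covers RNNs of arbitrary depth."""
    return 1 + max((depth(c) for c in node.children), default=0)

print(depth(gru))  # depth of the GRU expression tree
```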
A generator produces candidate architectures by iteratively sampling the next node (either randomly or using an RL agent trained with REINFORCE). Full architectures are processed by a ranking function and the most promising candidates are evaluated. The results from running the model against a baseline experiment are then used to improve the generator and the ranking function. Learn more here.
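In code, that loop might look like the sketch below; `generator`, `ranker`, and `train_and_evaluate` are hypothetical stand-ins for the learned components described above, not our exact implementation.

```python
def architecture_search_step(generator, ranker, train_and_evaluate,
                             n_candidates=100, k_best=5):
    """One iteration of the generate -> rank -> evaluate -> update loop.
    All three arguments are hypothetical stand-ins for the components
    described in the text."""
    # 1. The generator samples candidate architectures node by node,
    #    either randomly or with a REINFORCE-trained agent.
    candidates = [generator.sample() for _ in range(n_candidates)]

    # 2. The ranking function cheaply estimates each candidate's promise,
    #    so only a handful are actually trained.
    candidates.sort(key=ranker.score, reverse=True)
    promising = candidates[:k_best]

    # 3. Promising candidates are trained against a baseline experiment.
    results = [(arch, train_and_evaluate(arch)) for arch in promising]

    # 4. Measured performance becomes the training signal for both the
    #    generator (policy gradient) and the ranker (regression target).
    generator.update(results)
    ranker.update(results)
    return results
```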
When humans learn a new task, we leverage our existing skill set whenever possible. For instance, learning long division requires us to first know how to do simple division. Similarly, enabling models to combine simple, previously learned tasks, alongside and as subcomponents of new skills, allows them to take on more complex tasks. To do this, we introduce a novel framework for efficient multi-task reinforcement learning (RL) that identifies underlying relationships between skills. As the RL agent executes a task, it breaks the task into smaller actions using a hierarchy of policy networks (neural networks that predict when to use a previously learned policy and when to learn a new skill). This enables agents to continually acquire new skills more efficiently during different stages of training. To represent the learned task in an interpretable way, we trained an RL agent to communicate its plans and decisions using human language instructions such as "put down." We validate our approach on Minecraft games, which are designed to explicitly test the ability to reuse previously learned skills while simultaneously learning new ones. In addition to automating chatbot conversations, these results could make it possible to automate a bot’s actions, such as tracking down inventory and placing an order on behalf of a customer.
Example of our multi-level hierarchical policy for a given task, in this case stacking two blue blocks. Each arrow represents one step generated by a certain policy and the colors of arrows indicate the source policies. Note that at each step, a policy either utters an instruction for the lower-level policy or directly takes an action. Learn more here.
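The sketch below illustrates the control flow of such a hierarchy in a toy two-level setup; the skill names and the random decision rule stand in for learned networks and are assumptions for illustration only.

```python
import random

class HierarchicalPolicy:
    """Toy sketch of a multi-level policy: at each step a policy either
    takes a primitive action itself or utters a human-readable
    instruction that hands control to a lower-level, previously
    learned policy."""

    def __init__(self, name, skills=None):
        self.name = name
        self.skills = skills or {}   # instruction -> lower-level policy

    def act(self, state):
        # A learned "switch" would decide here; random.random() is a
        # placeholder for that decision network.
        if self.skills and random.random() < 0.5:
            instruction, sub_policy = random.choice(list(self.skills.items()))
            print(f"{self.name}: utters instruction '{instruction}'")
            return sub_policy.act(state)
        action = random.choice(["move", "place_block", "put_down"])
        print(f"{self.name}: primitive action '{action}'")
        return action

# Lower-level skills are learned first, then reused by higher levels.
find_block = HierarchicalPolicy("find_block")
put_down = HierarchicalPolicy("put_down")
stack_blocks = HierarchicalPolicy(
    "stack_two_blue_blocks",
    skills={"find blue block": find_block, "put down": put_down},
)
stack_blocks.act(state={})
```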
Questions that require counting a variety of objects within an image remain a major challenge in visual question answering (VQA). The most common approaches to VQA classify answers based on fixed-length representations of both the image and the question, and interpret individual components from each section of the image. We treat counting, an important yet challenging subtask of VQA, as a sequential decision process that relies heavily on the interpretability of the discrete visual representations within an image. We introduce the Interpretable Reinforcement Learning Counter (IRLC), a model that learns to count by enumerating relevant objects in the scene and sequentially deciding which objects to add to the count. Specifically, the model sequentially selects from detected objects and uses inferred relationships between objects to influence subsequent selections. This approach has the advantage of intuitive and interpretable output, as discrete counts are automatically grounded in the image. Our method outperforms the state-of-the-art architecture for VQA on multiple metrics that evaluate counting.
The IRLC model takes a counting question and an image as input. Detected objects are added to the returned count through a sequential decision process. Learn more here.
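A toy version of this sequential decision process is sketched below. The score and interaction arrays stand in for the learned, question-conditioned networks, and the stopping rule is a simplifying assumption rather than the paper’s exact formulation.

```python
import numpy as np

def sequential_count(scores, interactions, stop_threshold=0.0):
    """Toy sketch of IRLC-style counting. `scores[i]` is an initial
    question-conditioned logit for detected object i; `interactions[i, j]`
    adjusts object j's score once object i has been added to the count.
    Both would come from learned networks; here they are plain arrays."""
    scores = scores.astype(float)
    counted = []
    while True:
        scores[counted] = -np.inf           # never count an object twice
        best = int(np.argmax(scores))
        if scores[best] <= stop_threshold:  # terminal action: stop counting
            break
        counted.append(best)
        # Inferred relationships between objects influence later choices,
        # e.g. suppressing duplicate detections of the same object.
        scores += interactions[best]
    return len(counted), counted            # count grounded in object indices

# Three detections of two real objects: detections 0 and 1 overlap,
# so counting one strongly suppresses the other.
scores = np.array([2.0, 1.8, 1.5])
interactions = np.array([[0.0, -5.0, 0.0],
                         [-5.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0]])
count, picked = sequential_count(scores, interactions)
print(count, picked)  # 2 objects counted, indices [0, 2]
```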
Speech recognition is fundamentally changing the way we interact with smart devices. Traditional phonetic-based recognition techniques require training individual models for separate components, such as the pronunciation, acoustic, and language models, each with its own training objective. As a result, it is difficult to improve these systems, because updating one component does not necessarily improve the performance of the whole. We propose a novel solution using end-to-end models that jointly train all speech components with a single objective. This, however, presents its own set of challenges. First, such models can have several million training parameters, which makes them prone to overfitting. Second, the training objective commonly differs from the testing metric because the metric is difficult to optimize directly, which may lead to inferior models. We tackle these challenges by improving the regularization of the model during training and by using policy learning to optimize the performance metric directly. Both approaches are highly effective and significantly improve the performance of the end-to-end speech model.
Model architecture of our end-to-end speech model. Different colored blocks represent different layers, as shown on the right; the triangle indicates that dropout is applied immediately before the layer it points to.
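As a conceptual illustration of the second idea, the sketch below applies a REINFORCE-style update that optimizes edit distance, the kind of quantity measured at test time, directly. The CTC-style decoding and the absence of a variance-reducing baseline are simplifying assumptions, not our exact training procedure.

```python
import torch

def ctc_collapse(token_ids, blank=0):
    """Greedy CTC-style collapse: merge repeated tokens, drop blanks.
    (Assumes a CTC-like model; an attentional decoder would differ.)"""
    out, prev = [], None
    for t in token_ids.tolist():
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def edit_distance(a, b):
    """Levenshtein distance between token sequences; its negation
    serves as the reward."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def policy_learning_step(model, optimizer, audio, reference):
    """One REINFORCE-style update toward the test metric. `model` is any
    module mapping audio to per-frame vocabulary logits; in practice a
    baseline would be subtracted from the reward to reduce variance."""
    logits = model(audio)                           # (time, vocab)
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                         # sample, don't argmax
    hypothesis = ctc_collapse(sampled)
    reward = -edit_distance(hypothesis, reference)  # higher is better
    loss = -reward * dist.log_prob(sampled).sum()   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```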
Today’s advancements hold significant potential to change how we move forward as an industry, building a more productive and intuitive relationship between humans and machines. There is still more work to be done, and our focus for the coming year is to continue pushing the boundaries of research in deep learning and NLP.