“Two pizzas sitting on top of a stove top oven”
“A group of people shopping at an outdoor market”
“Best seats in the house”

People can summarize a complex scene in a few words without thinking twice. It’s much more difficult for computers. But we’ve just gotten a bit closer -- we’ve developed a machine-learning system that can automatically produce captions (like the three above) to accurately describe images the first time it sees them. This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images.

Recent research has greatly improved object detection, classification, and labeling. But accurately describing a complex scene requires a deeper representation of what’s going on in the scene, capturing how the various objects relate to one another and translating it all into natural-sounding language.
Automatically captioned: “Two pizzas sitting on top of a stove top oven”
Many efforts to construct computer-generated natural descriptions of images propose combining current state-of-the-art techniques in both computer vision and natural language processing to form a complete image description approach. But what if we instead merged recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human-readable sequence of words to describe it?

This idea comes from recent advances in machine translation between languages, where a Recurrent Neural Network (RNN) transforms, say, a French sentence into a vector representation, and a second RNN uses that vector representation to generate a target sentence in German.

Now, what if we replaced that first RNN and its input words with a deep Convolutional Neural Network (CNN) trained to classify objects in images? Normally, the CNN’s last layer is used in a final Softmax among known classes of objects, assigning a probability that each object might be in the image. But if we remove that final layer, we can instead feed the CNN’s rich encoding of the image into an RNN designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that descriptions it produces best match the training descriptions for each image.
The model combines a vision CNN with a language-generating RNN so it can take in an image and generate a fitting natural-language caption.
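One way to write down the training goal described above is as a maximum-likelihood objective. This is a sketch in notation introduced here for illustration, not taken from the post: I is an image, S = (S_0, ..., S_N) is its caption as a sequence of words, and θ stands for all the parameters of the CNN and the RNN.

```latex
\theta^{\star} = \arg\max_{\theta} \sum_{(I,\,S)} \log p(S \mid I;\, \theta),
\qquad
\log p(S \mid I;\, \theta) = \sum_{t=0}^{N} \log p\left(S_t \mid I,\, S_0, \ldots, S_{t-1};\, \theta\right)
```

The second equality is just the chain rule of probability applied word by word. It matches the way an RNN generates a caption: each word is predicted from the image encoding and the words produced so far.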
Our experiments with this system on several openly published datasets, including Pascal, Flickr8k, Flickr30k and SBU, show how robust the qualitative results are -- the generated sentences are quite reasonable. It also performs well in quantitative evaluations with the Bilingual Evaluation Understudy (BLEU), a metric used in machine translation to evaluate the quality of generated sentences.
A selection of evaluation results, grouped by human rating.
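To give a feel for what BLEU measures, here is a minimal Python sketch of its core ingredient, the clipped (or "modified") n-gram precision. It is not the full metric, which also combines several n-gram sizes and applies a brevity penalty, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) that occur in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also appear in a reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in any single reference, so repeating a common word does not help.
    """
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    # Largest count of each n-gram over all reference sentences.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


reference = ["Two pizzas sitting on top of a stove top oven"]
unigram_score = clipped_precision("two pizzas sitting on a stove", reference, n=1)
bigram_score = clipped_precision("two pizzas sitting on a stove", reference, n=2)
```

In this example every word of the candidate appears in the reference, so the unigram precision is 1.0, while the bigram "on a" does not, so the bigram precision is 4 out of 5.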
A picture may be worth a thousand words, but sometimes it’s the words that are most useful -- so it’s important we figure out ways to translate from images to words automatically and accurately. As the datasets suited to learning image descriptions grow and mature, so will the performance of end-to-end approaches like this. We look forward to continuing developments in systems that can read images and generate good natural-language descriptions. To get more details about the framework used to generate descriptions from images, as well as the model evaluation, read the full paper here.


(Cross posted on the Official Google Australia Blog)

This week, thousands of people from more than 160 countries will gather in Sydney for the once-in-a-decade IUCN World Parks Congress to discuss the governance and management of protected areas. The Google Earth Outreach and Google Earth Engine teams will be at the event to showcase exemplars of how technology can help protect our environment.

Here are a few of the workshops and events happening in Sydney this week:

  • Monday, November 10th - Tuesday, November 11th: Over the last couple of days, the Google Earth Outreach and Earth Engine teams delivered a 2-day hands-on workshop to develop the technical capacity of park managers, researchers, and communities. At this workshop, participants were introduced to Google mapping tools to help them with their conservation programs. 
  • November 13 - 19: Google will be at the Oceans Pavilion inside the World Parks Congress to demonstrate how Trekker, Street View and Open Data Kit on Android mobile devices can assist with parks monitoring and management. 
  • Friday, November 14, 9:30-10:30am: Join a Live Sydney Seahorse Hunt in Sydney Harbour, via Google Hangout, with Catlin Seaview Survey and Sydney Institute of Marine Science. Richard Vevers, Director of the Catlin Seaview Survey, will venture underwater to his favorite dive site and talk with experts about the unique marine life (including seahorses!) that explorers can expect to find around Sydney. Tune in here at 10:30am to catch all the action. 
  • Saturday, November 15th, 8:30am: Networking for nature: the future is cool. Hear about how technology-driven ocean initiatives can help us better understand and strengthen our connection with our natural environments. WPCA-Marine’s plenary session will include presentations by Sylvia Earle and Mission Blue, Catlin Seaview Survey, Google, Oceana, and SkyTruth. The session will also feature leading young marine professionals Mariasole Bianco and Rebecca Koss. 
  • Saturday, November 15th, 12:15pm: We’ll be hosting a panel discussion on using Global Forest Watch to monitor protected areas in near-real-time. Global Forest Watch is a dynamic online alert system to help park rangers monitor and preserve vast stretches of parkland.
  • Saturday, November 15th, 1:30 - 3:00pm: At the Biodiversity Pavilion join Walter Jetz from Yale and Dave Thau from Google for a presentation on Google Earth Engine and The Map of Life. The presentation will showcase how Google Earth Engine is being used in a variety of conservation efforts - including monitoring water resources, the health of the world's forests, and measuring the impact of protected areas on biodiversity preservation. We will also announce a new global resource from The Map of Life for mapping and monitoring biodiverse ecosystems. 

We believe that technology can help address some of our world’s most pressing environmental challenges and we look forward to working with Australian conservationists to integrate technology into their work.

You can find us at the Oceans Pavilion inside the World Parks Congress, where we will be joined by our environmental partners including The Jane Goodall Institute, The World Resources Institute and The Map of Life.

We hope to see you at one of our events this week!


Recently, at the 27th ACM User Interface Software and Technology Symposium (UIST’14), Google Senior Research Scientist Shumin Zhai and University of Cambridge Lecturer Per Ola Kristensson received the 2014 Lasting Impact Award for their seminal paper SHARK2: a large vocabulary shorthand writing system for pen-based computers. Most simply put, this is one of those rare works that is responsible for fundamental and lasting advances in the industry, and is the basis for the rapidly growing number of keyboards that use gesture typing, including products such as ShapeWriter, Swype, SwiftKey, SlideIT, TouchPal, and Google Keyboard.

First presented 10 years ago at UIST’04, Shumin and Per Ola’s paper is a pioneering work on word-gesture keyboard interaction that described the architecture, algorithms and interfaces of a high-capacity multi-channel gesture recognition system, SHARK2. SHARK2 increased recognition accuracy and relaxed precision requirements by using the shape and location of gestures in addition to context-based language models. In doing so, Shumin and Per Ola delivered a paradigm of touch screen gesture typing as an efficient method for text entry that has continued to drive the development of mobile text entry across the industry.
"Awarded for its scientific contribution of algorithms, insights, and user interface considerations essential to the practical realization of large-vocabulary shape-writing systems for graphical keyboards, laying the groundwork for new research, industrial applications, and widespread user benefit."
Prior to joining Google in 2011, Shumin worked at the IBM Almaden Research Center for 15 years, where he originated and led the SHARK project, further developing and refining it to include a low latency recognition engine that introduced the ability to accurately recognize a large vocabulary of words based upon the patterns (sokgraphs) drawn on a touchscreen device. SHARK and SHARK2 subsequently continued further development as ShapeWriter. During his tenure at IBM, Shumin additionally pursued a wide variety of HCI research areas including, but not limited to, studying the ease and efficiency of HCI interfaces, camera phone based motion sensing, and cross-device user experience.

At Google, Shumin has continued to inspire the Human-Computer Interaction research community, publishing prolifically and leading a group that incorporates HCI research, machine learning, statistical language modeling and mobile computing to advance the state of the art of text input for smart touchscreen keyboards. Building on his earlier work with SHARK/ShapeWriter, Gesture Typing is just one of the innovations that make things like typing messages on mobile devices easier for hundreds of millions of people each day, and remains one of the most prominent features on Android keyboards.

Shumin has been highly active in academia during his career, as both visiting professor and lecturer at world-class universities, and is currently the Editor-in-Chief of ACM Transactions on Computer-Human Interaction, a Fellow of the ACM and a Member of the CHI Academy. We’re proud to congratulate Shumin and Per Ola on receiving one of the most prestigious honors in the Human-Computer Interaction (HCI) research community, and look forward to their future contributions.


Each year the flu kills thousands of people and affects millions around the world. So it’s important that public health officials and health professionals learn about outbreaks as quickly as possible. In 2008 we launched Google Flu Trends in the U.S., using aggregate web searches to indicate when and where influenza was striking in real time. These models nicely complement other survey systems—they’re more fine-grained geographically, and they’re typically more immediate, up to 1-2 weeks ahead of traditional methods such as the CDC’s official reports. They can also be incredibly helpful for countries that don’t have official flu tracking. Since launching, we’ve expanded Flu Trends to cover 29 countries, and launched Dengue Trends in 10 countries.

The original model performed surprisingly well despite its simplicity. It was retrained just once per year, and typically used only the 50 to 300 queries that produced the best estimates for prior seasons. We then left it to perform through the new season and evaluated it at the end. It didn’t use the official CDC data for estimation during the season—only in the initial training.

In the 2012/2013 season, we significantly overpredicted compared to the CDC’s reported U.S. flu levels. We investigated and in the 2013/2014 season launched a retrained model (still using the original method). It performed within the historic range, but we wondered: could we do even better? Could we improve the accuracy significantly with a more robust model that learns continuously from official flu data?

So for the 2014/2015 season, we’re launching a new Flu Trends model in the U.S. that—like many of the best performing methods [1, 2, 3] in the literature—takes official CDC flu data into account as the flu season progresses. We’ll publish the details in a technical paper soon. We look forward to seeing how the new model performs in 2014/2015 and whether this method could be extended to other countries.

As we’ve said since 2009, "This system is not designed to be a replacement for traditional surveillance networks or supplant the need for laboratory-based diagnoses and surveillance." But we do hope it can help alert health professionals to outbreaks early, and in areas without traditional monitoring, and give us all better odds against the flu.

Stay healthy this season!


(Cross-posted on the Chromium Blog and the Google Online Security Blog)

At Google, we are constantly trying to improve the techniques we use to protect our users' security and privacy. One such project, RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response), provides a new state-of-the-art, privacy-preserving way to learn software statistics that we can use to better safeguard our users’ security, find bugs, and improve the overall user experience.

Building on the concept of randomized response, RAPPOR enables learning statistics about the behavior of users’ software while guaranteeing client privacy. The guarantees of differential privacy, which are widely accepted as being the strongest form of privacy, have almost never been used in practice despite intense research in academia. RAPPOR introduces a practical method to achieve those guarantees.

To understand RAPPOR, consider the following example. Let’s say you wanted to count how many of your online friends were dogs, while respecting the maxim that, on the Internet, nobody should know you’re a dog. To do this, you could ask each friend to answer the question “Are you a dog?” in the following way. Each friend should flip a coin in secret, and answer the question truthfully if the coin came up heads; but, if the coin came up tails, that friend should always say “Yes” regardless. Then you could get a good estimate of the true count from the greater-than-half fraction of your friends that answered “Yes”. However, you still wouldn’t know which of your friends was a dog: each answer “Yes” would most likely be due to that friend’s coin flip coming up tails.
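The coin-flip scheme above can be simulated in a few lines of Python. This is a sketch of the simple example only, not of RAPPOR itself, and the function names, the population size, and the 20% share of dogs are made up for illustration. Since a friend says “Yes” with probability one half (tails) plus one half times the true fraction of dogs (heads), the true fraction can be estimated as twice the observed “Yes” fraction minus one.

```python
import random


def randomized_answer(is_dog, rng):
    """Answer "Are you a dog?" using the secret coin flip.

    Heads: answer truthfully. Tails: always answer yes (True).
    """
    heads = rng.random() < 0.5
    return is_dog if heads else True


def estimate_true_fraction(answers):
    """Recover the fraction of dogs from the noisy answers.

    P(yes) = 0.5 + 0.5 * p, so p = 2 * P(yes) - 1 (floored at zero).
    """
    yes_fraction = sum(answers) / len(answers)
    return max(0.0, 2.0 * yes_fraction - 1.0)


rng = random.Random(0)                      # fixed seed so the run is repeatable
friends = [i < 2000 for i in range(10000)]  # hypothetical: 20% of friends are dogs
answers = [randomized_answer(is_dog, rng) for is_dog in friends]
estimate = estimate_true_fraction(answers)
print(f"estimated fraction of dogs: {estimate:.3f}")
```

The estimate lands close to the true 20% even though no individual “Yes” reveals much: about five out of every six “Yes” answers in this simulated population come from a friend whose coin came up tails.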

RAPPOR builds on the above concept, allowing software to send reports that are effectively indistinguishable from the results of random coin flips and are free of any unique identifiers. However, by aggregating the reports we can learn the common statistics that are shared by many users. We’re currently testing the use of RAPPOR in Chrome, to learn statistics about how unwanted software is hijacking users’ settings.

We believe that RAPPOR has the potential to be applied for a number of different purposes, so we're making it freely available for all to use. We'll continue development of RAPPOR as a standalone open-source project so that anybody can inspect and test its reporting and analysis mechanisms, and help develop the technology. We’ve written up the technical details of RAPPOR in a report that will be published next week at the ACM Conference on Computer and Communications Security.

We’re encouraged by the feedback we’ve received so far from academics and other stakeholders, and we’re looking forward to additional comments from the community. We hope that everybody interested in preserving user privacy will review the technology and share their feedback at


As anybody who has tried to use a smartphone to photograph a dimly lit scene knows, the resulting pictures are often blurry or full of random variations in brightness from pixel to pixel, known as image noise. Equally frustrating are smartphone photographs of scenes where there is a large range of brightness levels, such as a family photo backlit by a bright sky. In high dynamic range (HDR) situations like this, photographs will either come out with an overexposed sky (turning it white) or an underexposed family (turning them into silhouettes).

HDR+ is a feature in the Google Camera app for Nexus 5 and Nexus 6 that uses computational photography to help you take better pictures in these common situations. When you press the shutter button, HDR+ actually captures a rapid burst of pictures, then quickly combines them into one. This improves results in both low-light and high dynamic range situations. Below we delve into each case and describe how HDR+ works to produce a better picture.

Capturing low-light scenes

The camera on a smartphone has a small lens, meaning that it doesn't gather much light. If a scene is dimly lit, the resulting photograph will contain image noise. One solution is to lengthen the exposure time - how long the sensor chip collects light. This reduces noise, but since it's hard to hold a smartphone perfectly steady, long exposures have the unwanted side effect of blurring the shot. Devices with optical image stabilization (OIS) sense this “camera shake” and shift the lens rapidly to compensate. This allows longer exposures with less blur, but it can’t help with really dark scenes.

HDR+ addresses this problem by taking a burst of shots with short exposure times, aligning them algorithmically, and replacing each pixel with the average color at that position across all the shots. Averaging multiple shots reduces noise, and using short exposures reduces blur. HDR+ also begins the alignment process by choosing the sharpest single shot from the burst. Astronomers call this lucky imaging, a technique used to reduce the blurring of images caused by Earth's shimmering atmosphere.
A low light example is captured at dusk. The picture at left was taken with HDR+ off and the picture at right with HDR+ on. The HDR+ image is brighter, cleaner, and sharper, with much more detail seen in the subject’s hair and eyelashes. Photos by Florian Kainz
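The noise benefit of averaging a burst can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions, not the HDR+ implementation: the frames are assumed to be perfectly aligned already, the noise is independent Gaussian noise, and the scene, noise level, and burst size of 8 are made-up values.

```python
import random
import statistics


def capture_frame(scene, noise_sigma, rng):
    """Simulate one short exposure: true pixel values plus random noise."""
    return [value + rng.gauss(0.0, noise_sigma) for value in scene]


def average_frames(frames):
    """Replace each pixel with its mean across all (already aligned) frames."""
    return [statistics.fmean(pixels) for pixels in zip(*frames)]


def rms_error(image, scene):
    """Root-mean-square difference between an image and the true scene."""
    return (sum((a - b) ** 2 for a, b in zip(image, scene)) / len(scene)) ** 0.5


rng = random.Random(42)
scene = [rng.uniform(10.0, 50.0) for _ in range(2000)]  # a dim, made-up scene
burst = [capture_frame(scene, noise_sigma=8.0, rng=rng) for _ in range(8)]

single_error = rms_error(burst[0], scene)
merged_error = rms_error(average_frames(burst), scene)
print(f"single shot error: {single_error:.2f}, averaged burst error: {merged_error:.2f}")
```

With independent noise, averaging N frames shrinks the noise by roughly a factor of the square root of N, which is why the averaged burst is much cleaner than any single short exposure.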
Capturing high dynamic range scenes

Another limitation of smartphone cameras is that their sensor chips have small pixels. This limits the camera's dynamic range, which refers to the span between the brightest highlight that doesn't blow out (turn white) and the darkest shadow that doesn't look black. One solution is to capture a sequence of pictures with different exposure times (sometimes called bracketing), then align and blend the images together. Unfortunately, bracketing causes parts of the long-exposure image to blow out and parts of the short-exposure image to be noisy. This makes alignment hard, leading to ghosts, double images, and other artifacts.

However, bracketing is not actually necessary; one can use the same exposure time in every shot. By using a short exposure HDR+ avoids blowing out highlights, and by combining enough shots it reduces noise in the shadows. This enables the software to boost the brightness of shadows, saving both the subject and the sky, as shown in the example below. And since all the shots look similar, alignment is robust; you won’t see ghosts or double images in HDR+ images, as one sometimes sees with other HDR software.
A classic high dynamic range situation. With HDR+ off (left), the camera exposes for the subjects’ faces, causing the landscape and sky to blow out. With HDR+ on (right), the picture successfully captures the subjects, the landscape, and the sky. Photos by Ryan Geiss
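The trade-off between one long exposure and a burst of equal short exposures can be sketched with a toy sensor model in Python. Everything here is an assumption made for illustration: the saturation level, the two-pixel scene, the exposure times, the noise level, and the choice to rescale by exposure time as a stand-in for brightening. It is not how HDR+ itself is implemented.

```python
import random

SENSOR_MAX = 255.0  # hypothetical saturation level of the sensor


def expose(scene, exposure, noise_sigma, rng):
    """Simulate one shot: signal grows with exposure time, noise is added,
    and anything above SENSOR_MAX is clipped (blown out)."""
    return [min(SENSOR_MAX, brightness * exposure + rng.gauss(0.0, noise_sigma))
            for brightness in scene]


def merge_burst(frames, exposure):
    """Average equally exposed frames, then rescale by the exposure time so
    shadows and highlights are back on the scene's brightness scale."""
    count = len(frames)
    return [sum(pixels) / count / exposure for pixels in zip(*frames)]


rng = random.Random(7)
scene = [20.0, 900.0]  # made-up values: a shadowed subject and a bright sky

# One long exposure: the sky pixel saturates and its true brightness is lost.
long_shot = expose(scene, exposure=1.0, noise_sigma=2.0, rng=rng)

# A burst of short exposures: nothing saturates, and averaging tames the noise.
burst = [expose(scene, exposure=0.25, noise_sigma=2.0, rng=rng) for _ in range(16)]
merged = merge_burst(burst, exposure=0.25)
print("long exposure:", long_shot, "merged burst:", merged)
```

In the long exposure the sky pixel is pinned at the saturation level, while the merged burst keeps distinct values for both the subject and the sky, each close to its true brightness.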
Our last example illustrates all three of the problems we’ve talked about - high dynamic range, low light, and camera shake. With HDR+ off, a photo of Princeton University Chapel (shown below) taken with Nexus 6 chooses a relatively long 1/12 second exposure. Although optical image stabilization reduces camera shake, this is a long time to hold a camera still, so the image is slightly blurry. Since the scene was very dark, the walls are noisy despite the long exposure. Therefore, strong denoising is applied, causing smearing (below, left inset image). Finally, because the scene also has high dynamic range, the window at the end of the nave is blown out (below, right inset image), and the side arches are lost in darkness.
Click here to see the full resolution image. Photo by Marc Levoy
HDR+ mode performs better on all three problems, as seen in the image below: the chandelier at left is cleaner and sharper, the window is no longer blown out, there is more detail in the side arches, and since a burst of shots is captured and the software begins alignment by choosing the sharpest shot in the burst (lucky imaging), the resulting picture is sharp.
Click here to see the full resolution image. Photo by Marc Levoy
Here's an album containing these comparisons and others as high-resolution images. For each scene in the album there is a pair of images captured by Nexus 6; the first was taken with HDR+ off, and the second with HDR+ on.

Tips on using HDR+

Capturing a burst in HDR+ mode takes between 1/3 second and 1 second, depending on how dark the scene is. During this time you'll see a circle animating on the screen (left image below). Try to hold still until it finishes. The combining step also takes time, so if you scroll to the camera roll right after taking the shot, you'll see a thumbnail image and a progress bar (right image below). When the bar reaches 100%, your HDR+ picture is ready.
Should you leave HDR+ mode on? We do. The only times we turn it off are for fast-moving sports, because HDR+ pictures take longer to capture than a single shot, or for scenes that are so dark we need the flash. But before you turn off HDR+ for these action shots or super-dark scenes, give it a try; we think you'll be surprised how well it works!

At this time HDR+ is available only on Nexus 5 and Nexus 6, as part of the Google Camera app.

Posted by Karen Parker, Education Program Manager and Jason Ravitz, Education Evaluation Manager

(Cross-posted on the Google for Education Blog)

Since 2009, Google’s CS4HS (Computer Science for High School) grant program has connected more than 12,000 computer science (CS) teachers with skills and resources to teach CS in fun and relevant ways. An estimated 600,000 students have been impacted by the teachers who have completed CS4HS professional development workshops so far. Through annual grants, nearly 230 colleges and universities have hosted professional development workshops worldwide.

Grantees use the funds to develop CS curriculum and professional development workshops tailored for local middle and high school teachers. These workshops expose teachers to CS curriculum using real-world applications that spark students’ curiosity. As feedback from those teachers rolls in, we want to share some highlights from what we’ve learned so far.

What went well:
  • 89% of participants reported they would recommend their workshop to others
  • 44% more participants reported a “high” or “very high knowledge” of CS after their workshop vs. before
  • More than half of participants said they would use “most” or “all” of the activities or resources presented during their workshop.
  • In 2014 the number of teachers who took part in a CS4HS professional development workshop increased by 50%, primarily due to the funding of multiple MOOCs.

Ways to make a bigger impact:

  • Just 53% of participants said they felt a sense of community among the other workshop participants. Research by Joyce & Showers (2002) and Wiske, Stone, & Levinson (1993) shows that peer-to-peer professional development, along with ongoing support, helps teachers implement new content, retain skills, and create lasting change. We’ll explore new ways to build community among participants as we plan future workshops.
  • 83% of participants reported being Caucasian, which is consistent with the current demographics of CS educators. This indicates a need to increase efforts in diversifying the CS teacher population.
  • Outcome measures show us that the most knowledge gains were among teachers who had no prior experience teaching CS or participating in CS professional development -- a population that made up just 30% of participants. While we see that the workshops are meeting a need, there remains an opportunity to develop materials geared toward more experienced CS teachers while also encouraging more new teachers to participate.

We know there are many challenges to overcome to improve the state of CS teacher professional development. We look forward to sharing new ideas for working in partnership with the CS education community to help address those challenges, in particular by helping more teachers teach computer science.

At the University of Sydney CS4HS workshop, teachers are learning how to teach Computer Science without a computer during a CS Unplugged activity.