Google Cloud Search: First Impressions

Chad Johnson February 13, 2017

On February 14th, Google G Suite Business and Enterprise edition customers will begin to have access to Google Cloud Search - the first step in Google’s journey to replace the Google Search Appliance. You can read the official announcement here: https://blog.google/products/g-suite/introducing-google-cloud-search-g-suite/

SADA Systems has activated Google Cloud Search on our domain, so I would like to show you some details of what is included in the first release of Google Cloud Search. I have run a series of tests to show you how it works on our own content.

Can Search Get Too Personal?

Chad Johnson April 19, 2016

The following article caught my attention this afternoon:

The UX Of Ethics: Should Google Tell You If You Have Cancer?

“If I’m on a park bench, and I’m next to someone, and I hear them talking about symptoms of cancer, am I obligated to turn around and tell them they might have cancer?”

… artificial intelligence products are rapidly approaching the same diagnostic power. Google Search can already predict coming flu trends with some level of success. It’s not hard to imagine a system that can track my searches over a year—an ache, a cough, a rash—recognizing a cascade of symptoms that point to a disease with surprising accuracy.

source: http://www.fastcodesign.com/3058943/the-ux-of-ethics-should-google-tell-you-if-you-have-cancer

I have not often considered the ethical implications of enterprise search. The article above reminded me that the information people type into those small search boxes can be extremely sensitive and revealing. Do you recall the 2006 release of anonymized search queries by AOL? Some very simple detective work made it possible to identify real people and real situations, even with usernames removed. For example, a vanity search for someone’s own name followed later in the day by a search for a rare medical condition or a financial problem. And it got more personal and embarrassing from there.

My Initial Impressions of Google’s Cloud Vision API

Chad Johnson April 6, 2016

When Google released the Cloud Vision API last month I started to think of the interesting ways that artificially intelligent image recognition could be used. One idea that jumped to mind was enhancing the ability to search for marketing images and stock photography in a digital asset management system.

As much as companies encourage individuals to tag and annotate images accurately, invariably the metadata is poor or missing altogether. What if we could use artificial intelligence to add metadata to images automatically at the time they are uploaded to the system? Could this new metadata make it easier or faster to find images?

Google’s Cloud Vision API and IBM Watson’s AlchemyVision API both provide this service. And Adobe is incorporating this feature natively into their AEM DAM, calling it Smart Tags. I have tested the Google and IBM services and initially found that the Google Cloud Vision API returned more metadata for each image, and was slightly more precise with the labels it suggested. I will continue to test these services as they advance, but for this post, I am going to focus on giving my impressions of the Google Cloud Vision API.

Google’s DeepMind Artificial Intelligence Wins!

Chad Johnson March 9, 2016

For the first time, a computer has beaten a world champion in the game of Go. Google’s DeepMind program defeated grandmaster Lee Sedol after a three and a half hour battle. This accomplishment was once considered impossible, given the complexity of the game and the sheer number of permutations possible in the game.

What I find most interesting is the change in approach from IBM’s Deep Blue computer that beat chess champion Garry Kasparov in 1997. Deep Blue could analyze billions of potential future scenarios and pick the most logical and advantageous move at each step of the game. It essentially used brute force and statistics to pick the best path and stay ahead of Kasparov.

Google Releases Cloud Vision API. I Eat Crow.

Chad Johnson March 3, 2016

Last year Google introduced groundbreaking image recognition technology as part of their release of Google Photos. Deep neural networks could identify objects, people and places inside of photographs with astonishing accuracy, completely autonomously.

As fascinating as this technology is, I was certain that Google would keep it locked away inside their own products like Google Photos and Google Drive for the foreseeable future. I did not anticipate it being sold standalone as an API.

I was wrong. Two weeks ago, Google announced a beta version of their Cloud Vision API, allowing anyone to process images and identify things like everyday objects, faces (with sentiment) and text or logos.

For example, processing the picture below with the Cloud Vision API that identifies common objects …

Google Search Appliance Sunsetting: Week One Reactions

Chad Johnson February 11, 2016

It has been exactly one week since Google announced that it would be sunsetting the Google Search Appliance over the next three years. And what a week it has been. I have had the opportunity to speak with media representatives at Fortune and CIODive, I have spoken with several current and prospective customers of the Google Search Appliance, I have spoken with numerous other search companies, and I have chatted informally with our consulting team and friends in the enterprise search ecosystem.

I would like to share some of my initial thoughts on the news, and the rollercoaster of reactions that I, and others, have had.

Did the news surprise me?

The most common question I have been asked is if the news surprised me. Was I surprised that Google would focus more energy on a cloud-based service than a physical yellow box? Not at all. It makes good sense for Google to develop a technology that does not require shipping GSAs around the world and installing them in data centers. I have been a partner with Google for 7 years and I have come to expect periodic hard-left turns from them. They usually turn out to change an industry in retrospect.

What is the Timeline for a Typical GSA Implementation?

Chad Johnson September 4, 2015

I’ll get this out of the way quickly — there is no such thing as a typical GSA implementation.

With that said, I have been involved in scoping and delivering around 250 GSA implementations, and some averages and trends have emerged. Most projects involve the same five core phases: Design and Requirements, Hardware and Security, Content Acquisition, User Interface, and Testing and Deployment. I will walk through how we estimate the duration of these phases, with the understanding that we always do a thorough scoping workshop with a customer to refine this timeline. Thanks to our large volume of projects, our estimates have gotten very accurate. While certain tasks might not always be completed in the order we predict, the overall effort is usually correct.

Machine Learning Pt. 2: Applications in Enterprise Search

Chad Johnson August 5, 2015

In last week’s article, What is Machine Learning?, I gave an overview of Machine Learning and provided a link to a very informative visual guide that describes how the technology works. I also hinted at some useful applications within Enterprise Search. I would like to cover a few of those that I have seen in recent months.

Query Analysis with Clustering

There are often many ways to search for the same concept or idea. When we are reviewing search logs and trying to draw conclusions from what people are searching for, slight variations from query to query can make it difficult to do a comprehensive analysis. For example, we recently worked with a major airline and one of their top queries was “bereavement fares”, i.e. people looking for discounted fares when attending a sudden funeral. We knew that this particular query was popular, and that it needed some tuning to best serve the customers, but further analysis revealed other similar but different queries suffering from the same problem. The other forms of the query were less common, and if treated individually, they would have flown under the radar (no pun intended) in our analysis.
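To make the idea concrete, here is a minimal sketch of grouping query variants so their counts can be analyzed together. It is not the method we used on the airline project; it assumes a simple token-overlap (Jaccard) similarity with a crude plural-stripping step, and the function names, the 0.5 threshold, and the query counts are all invented for illustration.

```python
import re
from collections import Counter

def tokens(query):
    """Lowercase, split on non-letters, and crudely strip a trailing 's'."""
    words = re.findall(r"[a-z]+", query.lower())
    return frozenset(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)

def jaccard(a, b):
    """Share of tokens the two sets have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_queries(counts, threshold=0.5):
    """Greedy single pass: the most frequent queries seed the clusters."""
    clusters = []  # list of (seed_tokens, [(query, count), ...])
    for query, count in counts.most_common():
        toks = tokens(query)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append((query, count))
                break
        else:
            clusters.append((toks, [(query, count)]))
    return [members for _, members in clusters]

# Invented sample log: query -> number of times it was searched.
log = Counter({
    "bereavement fares": 120,
    "bereavement fare": 35,
    "fares for bereavement": 12,
    "funeral fare discount": 9,
    "baggage fees": 80,
    "baggage fee": 20,
})

clusters = cluster_queries(log)
for members in clusters:
    print(sum(c for _, c in members), [q for q, _ in members])
```

With this sample data the three “bereavement” variants collapse into one group, while “funeral fare discount” stays on its own. Word overlap cannot catch synonyms, which is exactly where a learned clustering model would be expected to do better.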

Spam is dead! Well, at least for Google Mail users

Chad Johnson July 20, 2015

Last week, Google announced some astounding statistics about the success rate of their spam filtering technology in Google Mail. Google says that less than 0.1% of email in the average user’s Gmail inbox is spam, and the rate for legitimate email ending up in the spam folder is even lower at less than 0.05%.

Despite these superb results, Google is continuing to innovate and find new ways of detecting and blocking spam. This year, Google has announced several innovations across their product line that utilize their machine learning technology, affectionately referred to as Google Brain. Using their artificial neural network software, Google Photos can identify objects in your photographs, like dogs or cats or bridges or birthday cakes. Google Maps can automatically detect new businesses or speed limit signs in the imagery collected by their entire fleet of Street View cars. Google Earth Engine can detect new refugee camps or deforestation boundaries by scanning for visual patterns in petabytes of satellite images.

Putting your Google Search Appliance index on a diet

Chad Johnson July 16, 2015

Setting up a web crawler in the Google Search Appliance is a piece of cake: you enter a starting URL and some boundaries and let it rip. The GSA will spider its way around until it finds every reachable page in the site. For a well-structured site, this usually produces very good results, but not all sites are created equal. While the GSA does have features for detecting cyclical loops and excessively redundant pages, it will often find significantly more pages than you expect. This can cause a reduction in the quality of your search results. The extra pages can dilute the relevancy of higher-quality pages, making it difficult to find the desired results.

In the worst-case scenario, the GSA index size will reach the licensed limit, resulting in only part of your site being indexed and searchable. In this situation, the GSA starts evicting pages to make room for presumably better pages, but in practice, the eviction algorithm is not perfect and can result in essentially random pages being removed. Regardless, eviction or truncation is not a happy thing and you will want to take action to fix the problem.

What do Google’s I/O announcements mean for the enterprise?

Chad Johnson May 29, 2015

Google announced a wide variety of new products and features during the Google I/O 2015 Keynote yesterday. At first glance, most of them are consumer focused: Google Photos, Brillo for smart home automation, virtual reality headsets, etc. However, several of the announcements or themes have an impact on the enterprise world.

Globalization and Mobile First

Google emphasized many times during the keynote how little of the world is currently covered with fast broadband. They are now deploying special low-bandwidth versions of their most important sites like google.com, Maps and YouTube to help the rest of the world access them reliably. Those emerging markets are almost entirely “mobile first”, or really “mobile only”. Most new customers in India and China will *never* own a desktop computer. They rely exclusively on mobile computing devices.

When building enterprise applications that could possibly be accessed outside of the US or a few other select countries, we should follow Google’s lead. Don’t overlook the need for low-bandwidth, offline-capable, mobile-ready configurations.

Ascending the Enterprise Search Value Map

Chad Johnson May 5, 2015

When I wake up in the morning I have a couple of choices for what to do first. I could brush my teeth. I could take a shower. Or I could fix that broken shelf in the garage.

Come again? Two of those sound very typical and easy. But the last one is less obvious, much harder, and not what most people would do first in the morning. That is, until I tell you the story about how a hammer fell off the shelf last night and nearly crashed through the windshield of my car. It sounds more important now, doesn’t it?

All work has value, but not all work has the same value. I could spend an hour with my kids drawing pictures on the sidewalk with chalk, or I could spend an hour with them sorting a box of beads and explaining how they can be grouped by size or color or shape. Both tasks would be fun (theoretically), but working on basic math skills is arguably more valuable than drawing with chalk. I’m not saying they don’t both have value; one just happens to be more valuable than the other. Since I have limited free time, I should try to maximize the value of the work I do with the kids. The challenge is that drawing with sidewalk chalk is so much fun!

Much-maligned Google Glass will get a second chance

Chad Johnson March 24, 2015

I tried on Google Glass a few times. I wasn’t impressed. Does that mean it’s an awful product? No. Does that mean it is not useful? No. So what happened?

This week the Wall St. Journal reported that Google is not giving up on Google Glass. Google acknowledges that it was a first generation product with some rough edges, but they still see a future for wearable computing.

[Eric Schmidt, Google Executive Chairman] said Glass, like Google’s self-driving car, is a long-term project. “That’s like saying the self-driving car is a disappointment because it’s not driving me around now,” he said. “These things take time.”

I kept my initial impressions in check. I was certainly impressed with the execution of such a radical new product, but I knew the first version was not meant to be the Nintendo 3DS or Sony Walkman – some wildly popular consumer electronics device that everyone will own. It was a proof of concept that would allow Google to learn what worked and what didn’t work in the wearable computing space.

Google Search Appliance Connector Development: Discovery Phase

Chad Johnson March 10, 2015

Sorry… this post is not going to teach you how to build a connector. Maybe next time.

Instead, I am going back a little earlier in the process. I want to describe the process for qualifying our ability to build a new connector in the first place. How do we estimate the level of effort? How do we ensure that we will have sufficient access to the content or that the necessary APIs exist?

First, not all integrations or connections are alike – there can be wide variations in complexity or level of effort. When we run through this qualification process, we always evaluate our options in order, from easiest to most difficult. For each option, there is a set of criteria that must be met in order to be successful. If any criteria cannot be met, we move on to the next-most-complex option.

Zen and the Art of Search Statistics

Chad Johnson October 10, 2014

Yesterday a client asked me what Key Performance Indicators (KPIs) they should be tracking on their public site search. At first I was afraid I would not be able to answer this question – I don’t know their business or website nearly as well as they do. I know the wide variety of statistics that are possible to capture, but what should I recommend to this particular client?

I thought for a moment and realized they needed to know their goals for search before diving headfirst into any analysis. There is no point capturing and analyzing search statistics unless you need them. They should have a clear reason or objective behind each KPI they track. Second, they needed to think in terms of deltas or comparisons instead of raw numbers or simple orders of magnitude. Statistics like quantity of searches or frequency of keywords are unlikely to reveal anything by themselves, but observing something increasing or decreasing over time can tell a very interesting story. Likewise, comparing any certain statistic between users that ran a search vs. users that did not run a search can be very revealing.
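As a rough illustration of thinking in deltas and comparisons, the sketch below computes a period-over-period change and compares an outcome rate between visitors who searched and visitors who did not. The field names (`searched`, `converted`) and all of the numbers are made up for the example; they are not taken from any client's data or any particular analytics tool.

```python
def pct_change(previous, current):
    """Relative change between two periods, as a percentage."""
    if previous == 0:
        raise ValueError("previous period has no data to compare against")
    return (current - previous) / previous * 100.0

def rate_by_group(sessions, flag, outcome):
    """Outcome rate for sessions where `flag` is true vs. false."""
    totals = {True: [0, 0], False: [0, 0]}  # flag value -> [outcomes, sessions]
    for s in sessions:
        bucket = totals[bool(s[flag])]
        bucket[0] += 1 if s[outcome] else 0
        bucket[1] += 1
    return {k: (hits / n if n else None) for k, (hits, n) in totals.items()}

# Invented numbers: zero-result searches fell from 400 to 300 month over month.
change = pct_change(400, 300)

# Invented sessions: did the visitor search, and did they convert?
sessions = (
    [{"searched": True, "converted": True}]
    + [{"searched": True, "converted": False}]
    + [{"searched": False, "converted": True}]
    + [{"searched": False, "converted": False}] * 3
)
rates = rate_by_group(sessions, "searched", "converted")
print(change, rates)
```

Neither raw figure says much alone, but a 25% drop in zero-result searches, or a conversion rate twice as high among searchers, is the kind of comparison that can be tied back to a stated goal.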

Is your Google Search Appliance platform secure?

Chad Johnson August 11, 2014

If you have read Google’s product literature, you know that the Google Search Appliance is a very secure device. The bright yellow appliance runs a hardened version of CentOS, and the inner workings are safely hidden behind root login.

So, assuming we are dealing with an appliance with Fort Knox-level protection, what risks remain? Below are several potential vulnerabilities that could jeopardize the security of your GSA platform. Some of these risks can be mitigated easily, while others (particularly those involving human beings) may never be 100% avoidable. I am not trying to cause panic. I only hope to better educate the community so that simple risks can be avoided, and more complex risks can be appropriately understood and mitigated.

Administrator access

The GSA admin console offers two levels of accounts, Administrator and Manager. In a perfect world, we would issue most accounts at the Manager-level. But in practice, Manager-level accounts are not typically powerful enough to satisfy the needs of most users. Manager accounts cannot adjust crawl URL patterns, manage query expansion files, or even set up dynamic navigation facets. Because of this, I find that most accounts in the GSA admin console are created at the Administrator level.

Are my GSA search results good?

Chad Johnson July 28, 2014

“Are my GSA search results good?” Well, if you don’t know, then I don’t know. This is obviously a subjective question, and you likely know your own content much better than I do. While I may not be able to answer the question for you, I would like to offer some tips to help objectively measure the quality of your GSA search results.

Step 1: What are we measuring?

In order to make any conclusions about search result quality, we must first establish a set of queries to evaluate. There are infinite possibilities, but we can use some smarts to pick a good list of queries to measure. Here are a few suggestions:

- Most popular queries
- Worst-performing queries
  - Queries that returned no results
  - Queries where the user paginated very deep into the results
  - Queries where the user did not open any results
  - Queries where the user opened a bunch of different results in quick succession
- Trending queries
  - Queries with a significant increase over a short period of time
  - Seasonal or event-related queries
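Here is a minimal sketch of pulling a few of these categories out of a search log. It assumes each logged search has already been reduced to a record with hypothetical fields `query`, `results`, `max_page` and `clicks`; that layout is my own invention for the example and not a GSA report format. Trending queries are left out because they need counts from two time periods.

```python
from collections import Counter

def pick_queries(events, top_n=2, deep_page=3):
    """Bucket logged searches into the evaluation categories listed above."""
    popularity = Counter(e["query"] for e in events)
    return {
        "popular": [q for q, _ in popularity.most_common(top_n)],
        "no_results": sorted({e["query"] for e in events if e["results"] == 0}),
        "deep_pagination": sorted({e["query"] for e in events
                                   if e["max_page"] >= deep_page}),
        "no_clicks": sorted({e["query"] for e in events
                             if e["results"] > 0 and e["clicks"] == 0}),
    }

def search(query, results, max_page=1, clicks=0):
    """Build one hypothetical log record."""
    return {"query": query, "results": results, "max_page": max_page, "clicks": clicks}

# Invented sample log.
events = (
    [search("vacation policy", 12, clicks=1)] * 3
    + [search("expense report", 40, max_page=4)] * 2
    + [search("holiday calender", 0)]
)

picked = pick_queries(events)
for category, queries in picked.items():
    print(category, queries)
```

The thresholds (`top_n`, `deep_page`) are arbitrary starting points; in practice I would tune them until the combined list is short enough to review by hand.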

Reverse engineering the GSA Query Suggestions feature

Chad Johnson July 7, 2014

Google often quotes that the average query submitted to their website is 1.7 words long. That means that most queries contain only two words, and a fair number contain only one word. Google can be magical at times, but using one or two words to search through trillions of web pages seems at best hit or miss. In practice, longer and more specific queries tend to produce better results.

Query Suggestions to the Rescue

How do we encourage users to use more than 1.7 words in their query? The Query Suggestions feature is one popular and effective solution. By presenting the user with a list of suggested queries below the search box, they only have to type a few letters before seeing a list of queries to pick from.
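The basic mechanics can be sketched in a few lines. This is an assumption-laden toy, not a description of how the GSA builds its suggestions: it simply matches the typed prefix against past queries and ranks the matches by how often they were searched. The `suggest` function and the sample counts are invented for illustration.

```python
from collections import Counter

def suggest(prefix, query_counts, limit=5):
    """Return past queries starting with `prefix`, most frequent first."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    matches = [(q, n) for q, n in query_counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically to break ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]

# Invented query history: query -> number of times it was searched.
history = Counter({
    "expense report": 50,
    "expense policy": 30,
    "exit interview": 5,
    "vacation policy": 40,
})

suggestions = suggest("ex", history)
print(suggestions)  # ['expense report', 'expense policy', 'exit interview']
```

After typing only two letters, the user can pick a two-word query instead of finishing a one-word one, which is the whole point of the feature.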

Plexi (Part 2) – Adaptors are evil!

Chad Johnson June 25, 2014

Previously I discussed the Plexi Adaptor framework for the Google Search Appliance. Adaptors can provide a simple and elegant way to index a content repository. An Adaptor sits in front of a repository, making it behave like a web site from the GSA’s perspective. This off-loads the state-management and queue-management to the GSA’s built-in web crawler, simplifying the implementation.

But I mentioned that I was leery of Adaptors. I have written dozens of connectors and I like the control that the Connector Manager framework affords. I have written Connectors that do not seem transferable to the Adaptor methodology. I would like to discuss a few reasons that I am not completely letting go of Connectors.

Change Detection

When I implement connectors with complex hierarchical ACLs or that require joining multiple database tables, I often have to track the state of multiple objects to do change detection. For example, the ACLs in our Atlassian JIRA Connector take into consideration various objects, including the Project, Permission Schemes, Issue Schemes and Custom Attributes, to compute the ACL for a single Issue. Changes could occur in any of these objects, with ripple effects throughout the entire repository. Implementing stateless change-detection with a Retriever would be very difficult because these permission objects, and the complex interactions between them, do not have timestamps to reveal modifications. Instead, we store a snapshot of the permissions for each object in our Connector, and that allows us to quickly check for even subtle changes to the permissions. There is nothing that prevents an Adaptor from storing this kind of state information, but it goes against the recommended design.
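The snapshot idea can be sketched as follows. This is a simplified illustration and not the code from our JIRA Connector, and it does not use any Connector Manager or Adaptor API. It assumes the effective ACL for each item has already been computed into a plain dictionary of principals, and it stores one hash per item so the next pass can spot changes without relying on timestamps. All names here are hypothetical.

```python
import hashlib
import json

def fingerprint(acl):
    """Stable hash of an ACL dict; key order and list order do not matter."""
    canonical = json.dumps({k: sorted(v) for k, v in acl.items()}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def detect_changes(snapshot, current_acls):
    """Compare current ACLs to the stored snapshot and update it in place.

    Returns (changed_ids, deleted_ids); new items count as changed.
    """
    changed = []
    for doc_id, acl in current_acls.items():
        digest = fingerprint(acl)
        if snapshot.get(doc_id) != digest:
            changed.append(doc_id)
            snapshot[doc_id] = digest
    deleted = [doc_id for doc_id in snapshot if doc_id not in current_acls]
    for doc_id in deleted:
        del snapshot[doc_id]
    return changed, deleted

# Invented example: two traversals of a tiny repository.
snapshot = {}
first_pass = detect_changes(snapshot, {
    "JIRA-1": {"users": ["alice"], "groups": ["dev"]},
    "JIRA-2": {"users": ["bob"], "groups": []},
})
second_pass = detect_changes(snapshot, {
    # A permission scheme change rippled down: "qa" can now see JIRA-1.
    "JIRA-1": {"users": ["alice"], "groups": ["qa", "dev"]},
})
print(first_pass, second_pass)
```

Only the items whose computed ACL actually differs need to be re-fed, no matter which upstream permission object caused the change. The cost is that the snapshot has to be persisted somewhere, which is the state an Adaptor is normally designed to avoid.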

Sequential Iteration