The Google appliance, which we use on campus to index web content, can index content in several different ways.

1. Crawling. This is the traditional Google indexing method: you give the appliance a URL and it hunts down any content under that location. This is how we currently index the IS&T web site and other sites at MIT.

2. Database access. Content hidden away in a database (Oracle, for example) can be indexed: you tell the appliance how to connect to the database and supply queries for it to execute.

3. Feed API. This method allows you to push content to Google for indexing. You can push either URLs or full content. In this way, content that is neither in a URL-accessible file system nor in a network-enabled database can be indexed. Content in a run-time Alfresco CMS could be indexed this way by the Google appliance.
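As a rough illustration of option 3, the appliance's feed protocol accepts an XML document listing records to index, posted to the appliance's feed port. The sketch below builds such a feed and posts it; the hostname `gsa.mit.edu`, the datasource name, and the exact POST encoding are assumptions and would need to be checked against the appliance's feed protocol documentation.

```python
import urllib.parse
import urllib.request

# Hypothetical appliance host; 19900 is the conventional GSA feed port.
FEED_URL = "http://gsa.mit.edu:19900/xmlfeed"

def build_feed(datasource, records):
    """Build a minimal GSA-style content feed, one <record> per document."""
    body = "".join(
        '<record url="%s" mimetype="text/html" action="add">'
        "<content>%s</content></record>" % (r["url"], r["content"])
        for r in records
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<gsafeed><header>"
        "<datasource>%s</datasource>"
        "<feedtype>incremental</feedtype>"
        "</header><group>%s</group></gsafeed>" % (datasource, body)
    )

def push_feed(feed_xml, datasource):
    """POST the feed to the appliance. (The real protocol may require
    multipart/form-data; verify against the feed documentation.)"""
    data = urllib.parse.urlencode({
        "feedtype": "incremental",
        "datasource": datasource,
        "data": feed_xml,
    }).encode("utf-8")
    return urllib.request.urlopen(FEED_URL, data)
```

For Alfresco content, a small script like this could walk the CMS and push each page's rendered HTML as a record, so the appliance never needs direct access to the repository.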

To summarize, these options give us considerable flexibility in how we design our web app while satisfying the requirement that Google provide the search capability.

MIT's license for the Google appliance limits us to 500,000 pages. In talking to Dave Conlon of IS&T, it appears we are well within that limit. Since the IS&T web site work is generally not adding many new pages, merely moving them from one location to another, I don't think we will make an impact on the license limit at this point.

Google search and keywords

MIT's Google appliance does index meta tag content, and that content is taken into account for searches, but by default no particular weight is given to meta tag content in a search.

However, the meta tag fields can be specifically queried by adding arguments to the search command. See https://web.mit.edu/google/v4/mit/xml_reference.html#request_meta_filter

In short, we can search on keywords stored in meta tags as long as we construct the search query correctly.
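For example, the XML reference above describes a `requiredfields` argument that restricts results to pages whose named meta tag matches a value. The snippet below builds such a query URL; the hostname `search.mit.edu`, the collection name, and the front-end name are placeholders, and the meta tag name `keywords` is an assumption about how we would tag our pages.

```python
from urllib.parse import urlencode

# Hypothetical search front-end host.
SEARCH_URL = "http://search.mit.edu/search"

def keyword_search_url(query, keyword):
    """Build a search URL that filters on a 'keywords' meta tag."""
    params = {
        "q": query,
        "site": "default_collection",   # assumed collection name
        "client": "default_frontend",   # assumed front-end name
        "output": "xml_no_dtd",
        # Only return pages whose keywords meta tag matches this value.
        "requiredfields": "keywords:" + keyword,
    }
    return SEARCH_URL + "?" + urlencode(params)
```

A search front end or web app would generate URLs like this behind the scenes, so end users never see the extra arguments.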
