Background

The Edgerton Digital Collections (EDC) project will involve browsing and searching material from more than one content management system. While managing a connection between a user-facing application and a single content management system may not be complicated, as the number of systems grows and as the protocols for browsing or searching and the formats of results vary across systems, the integration work becomes complex and burdensome to the application. Add into the mix dynamic discovery and incorporation of new systems, and the case for insulating the application from managing and using this federation of content providers becomes compelling.

The architectural design of EDC is informed by lessons learned from similar projects that have had to address this issue. Specifically, EDC designers intend to employ a federator module that handles the tasks of multi-system search and encapsulates heterogeneity and many protocol and format details. This approach limits the scope of work in the application to connecting with a single federating service, allowing the application developers to focus their efforts on user-facing features.

High-Level Functions of the Federator

Within the overall design of EDC, the federator is responsible for two functional areas: (1) managing the working set of content connectors and (2) dispatching federated search requests and organizing search results.

The connection manager is responsible for providing details about each content source in the working set, discovering new sources that can be added to the working set, removing sources from the working set, and providing an instance of a connector to a specific content source.

The search manager is responsible for accepting a search query, passing queries to content sources via their connector, handling responses from sources, and organizing responses into a federated result set. A query consists of the set of content sources to which to direct a search, the search criteria, the kind of search to perform, and any search options.

Connection Management in Detail

EDC provides a user-facing application with access to content from several sources. If the specific sources are known and unlikely to change over the short- or medium-term, it does not make sense to invest much, if any, development effort in code to handle adding and removing sources, discovering new candidate sources, supporting source connector updates in place, etc. On the other hand, when the number of sources is not known and is likely to change over time and when connector implementations are likely to be improved and substituted, it is wise to make an investment in support for dynamic conditions. The EDC connection manager can provide just this kind of support for a dynamic working set. The connection manager can support the following:

Connector Availability Management

Return the name and other attributes of any connectors available for inclusion in EDC.
Fulfill a request to install / uninstall a specific connector.
Determine if a specific connector has an update.
Fulfill a request to update an installed connector.
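
The availability operations listed above could be sketched as follows. This is only an illustration: the class names, fields, and the use of a plain string comparison for versions are assumptions, not part of the existing registry infrastructure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectorInfo:
    """Descriptive attributes of a connector (hypothetical fields)."""
    connector_id: str
    name: str
    version: str


@dataclass
class ConnectionManager:
    """Tracks which connectors the registry offers and which are installed."""
    available: dict = field(default_factory=dict)  # id -> ConnectorInfo offered by the registry
    installed: dict = field(default_factory=dict)  # id -> ConnectorInfo installed in EDC

    def list_available(self):
        """Return the attributes of every connector available for inclusion."""
        return sorted(self.available.values(), key=lambda info: info.name)

    def install(self, connector_id):
        self.installed[connector_id] = self.available[connector_id]

    def uninstall(self, connector_id):
        self.installed.pop(connector_id, None)

    def has_update(self, connector_id):
        # Assumption: any version difference counts as an update.
        return self.installed[connector_id].version != self.available[connector_id].version

    def update(self, connector_id):
        if self.has_update(connector_id):
            self.install(connector_id)
```

A real implementation would consult the registry rather than an in-memory dictionary, but the four operations map one-to-one onto the list above.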

Connector Instance Management

Return the name and other attributes of any connectors available for inclusion in the next browse or search operation.
Preserve and retrieve the subset of connectors used in the most recent search operation.
Return a configured instance of a specific connector.

EDC has the option of adopting the existing registry infrastructure for dynamic discovery, installation, and use of implementations of the Repository OSID. This is work that was done for other projects at MIT and for the VUE project at Tufts. One way forward is to define the EDC requirements in more detail and see if the existing system works or can be modified to meet the need.

Let's look at the requirements in more detail. EDC needs to decide who should be authorized and equipped to review candidate content sources and install, uninstall, or update them. Is this an end-user capability or a system administration function? EDC has a scholarly and academic brand; it is not a grab-bag. EDC is likely to want to exercise a good deal of control over which content systems are included as potential search targets. EDC should have a centrally administered set of content sources and not let users have a direct role in this function. Note that we are talking about the pool of content sources that could be browsed or searched, not the subset of this pool that might be involved in a particular browse or search operation. EDC administrators can use a command-line tool to discover, install, uninstall, and update connectors. While a graphical user interface (GUI) might be nicer to look at, a system tool should suffice, and one has already been built for the existing registry.

With the existing command-line tool suggested for managing connector availability, the next issue is instance availability. At run-time, EDC will need to obtain an instance of a connector to a content source. This connector may require configuration. A simplifying assumption is that all users of EDC have the same authorization regarding available content sources and results. This means that configuration is not user-specific; it is application-specific, and EDC is the application. Configuration data should reside outside the connector implementation. The configuration is suitable for storing in an XML file or a properties file. At EDC start-up, the connection manager can load the appropriate configuration and program files and make available a set of configured connector instances. Consumers of the connection manager can ask which content sources are available and obtain the name, description, and other attributes of each. This data can be used to create a list of content sources for the user to review and optionally select a subset. Storing and retrieving this subset should be the responsibility of the application.
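
As a rough sketch of the start-up step described above, configuration kept outside the connector could be read from a properties-style file and turned into configured instances. The section names, keys, and the Connector class here are hypothetical; Python's standard configparser stands in for whatever XML or properties format EDC settles on.

```python
import configparser

# Hypothetical configuration; in practice this would live in a file outside the connector.
SAMPLE_CONFIG = """
[archive-a]
name = Archive A
description = Hypothetical image archive

[archive-b]
name = Archive B
description = Hypothetical notebook archive
"""


class Connector:
    """A configured connector instance (illustrative stand-in)."""

    def __init__(self, source_id, settings):
        self.source_id = source_id
        self.settings = dict(settings)

    def describe(self):
        """Return the attributes a consumer would show in a list of sources."""
        return {
            "id": self.source_id,
            "name": self.settings.get("name", self.source_id),
            "description": self.settings.get("description", ""),
        }


def load_connectors(config_text):
    """Build one configured connector per section of the configuration."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {section: Connector(section, parser[section]) for section in parser.sections()}
```

A consumer could then call describe() on each instance to build the list of content sources presented to the user.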

Browsing

The connection manager provides instances of configured connectors. A browser user interface can request the name, description, and other attributes of each content source. Browsing is assumed to start from a single root and descend as a parent (folder) / children (item) hierarchy, with no particular limit on the number of levels. A browser user interface can progressively disclose successive sets of parents and children through successive calls. Pseudo-code to create a "browse tree" might be:

  1. get list of connectors
  2. for each connector, get its name and description
    1. prepare a UI control for each connector
  3. when a connector is selected, get a list of its top-level containers and items
    1. represent a container as a folder and items directly, perhaps by name or thumbnail
  4. when a folder is selected, get a list of its containers and items
    1. represent a container as a folder and items directly, perhaps by name or thumbnail

Search Management in Detail

A search manager is responsible for taking search requests, delivering them appropriately to content sources, awaiting results from each source, and organizing results across sources. A search request consists of the following, expressed in a format supported by both the application and the search manager:

  • search criteria
  • the kind of search to perform
  • search options or properties
  • the subset of content systems (targets) to query
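
One possible shape for such a request, shown only as a sketch: the class and field names are assumptions, as is the convention that an empty target list means "all targets".

```python
from dataclasses import dataclass, field
from enum import Enum


class SearchKind(Enum):
    BASIC = "basic"
    ADVANCED = "advanced"


@dataclass(frozen=True)
class SearchRequest:
    """The four parts of a search request (hypothetical field names)."""
    criteria: str                                  # search criteria
    kind: SearchKind = SearchKind.BASIC            # the kind of search to perform
    options: dict = field(default_factory=dict)    # search options or properties
    targets: tuple = ()                            # ids of content systems; empty = all (assumption)


# Example: a basic search limited to two hypothetical targets.
request = SearchRequest("orange juice", targets=("archive-a", "archive-b"))
```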

Basic vs Advanced Searches

Search criteria appear in two common forms: "basic" and "advanced". The syntax for a basic search is often space-delimited strings. Systems use different algorithms for determining what constitutes a match. Typically, a matching test will look at a title, description, perhaps a keywords field, perhaps other fields. Usually, each additional string in the criteria makes the search more restrictive; this implies an "and" requirement. For example, the criteria "orange juice" will rank content that includes both "orange" and "juice", in that order and adjacent, above other patterns. These rules are not universally applied across implementations, but represent a popular interpretation.

When a federation, i.e. more than one target repository, is involved, it is common for the interpretation of a basic search to be slightly different. The user knows they are using a lowest common denominator and expects a procedure that makes its best effort to find appropriate matches. Users expect an advanced search to be more exacting and precise, and also to require more care to use. A basic search is a scattershot approach, while an advanced search is more targeted. A basic search may find extraneous material (false positives). An advanced search carries the risk of being too narrow and missing useful results.

Performing a basic search across a federation is practical. The same may not be true for an advanced search. As the number and variety of repositories increases, the effectiveness of an advanced search may decline. For example, an advanced search typically matches named fields to values. Often the user is presented with a list of fields and fills in values as desired. There may also be kinds of matching tests to pick from, such as case sensitivity or insensitivity. It is unlikely that the list of fields will be the same across repositories. This means the user might have to compose a repository-specific query, which makes for an unwieldy interface, or the user might be presented with the union or intersection of fields. No matter the model, the system is more complex.

An additional element is that advanced searches tend to have more complicated query syntax. A user-facing application should be insulated from all this "bookkeeping detail". One approach is to use a basic search and an advanced search abstraction. This keeps the application's integration task simple.

For basic searching, the syntax is a space-delimited string, interpreted as a case-insensitive, increasingly restrictive search that prefers exact matches and matching order. For advanced searching, the syntax is an XML document with elements for field names, field values, boolean operators, etc. A unit of code can stand between the application (consumer) and any search interface (provider) and translate from these formats to whatever is appropriate.
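
To make the basic-search interpretation concrete, here is a minimal scoring sketch. The function name and the three-level score are assumptions chosen for illustration; real systems use their own matching algorithms, and this version matches substrings rather than whole words.

```python
def basic_match_score(criteria, text):
    """Score text against space-delimited criteria, ignoring case.

    0 = at least one term is missing (each added term is more restrictive),
    1 = all terms present, 2 = also first seen in the given order,
    3 = also adjacent as an exact phrase.
    """
    terms = criteria.lower().split()
    haystack = text.lower()
    if not terms or not all(term in haystack for term in terms):
        return 0
    score = 1
    positions = [haystack.find(term) for term in terms]
    if positions == sorted(positions):
        score += 1  # terms first appear in the order the user typed them
    if " ".join(terms) in haystack:
        score += 1  # terms appear adjacent, as typed
    return score
```

With the criteria "orange juice", the text "Fresh orange juice" scores highest, "Juice of an orange" scores lower, and "Apple juice" does not match at all.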

Kinds of Searches and Result Sets

Targets

  • Apply to a single, specific repository.
  • Apply to one or more repositories.

Kinds of matching

  • Match case.
  • Match order.
  • Match all criteria (vs one or more criteria).
  • Match exactly (vs a substring for string-valued criteria).
  • Match within a range, specified by end-points and inclusive / exclusive.

Kinds of search

  • Search for any assets that match one or more fields. Fields matching criteria can be joined using AND, OR, NOT operators.

Convenience searches

  • Search for a single asset using a unique id. This is an exact, case-sensitive match for a single field.
  • Search for all assets (browse).
  • Search for any assets that match keyword(s). This is a case-insensitive, exact, order-neutral, multi-field search over a specific set of fields.

Filters

  • Limit to specific asset type (or types).

Search Properties

There are a number of options that a search might respond to:

For the result set overall

  • limit the total number of responses
  • return the next "n" results using a cursor
  • apply a particular sorting rule
  • set a maximum wait-time for results

For each repository

  • suggest a particular ranking rule to employ
  • limit the number of responses
  • set a maximum wait-time for results

Ranking

Each content management system will have its own rules for ranking results and thereby setting the result order. Some systems may be willing to accept caller parameters in this area. The application user may want results ranked across repositories. One option is to configure the system to accept a pluggable ranking engine. The engine would accept a result (asset and metadata) and return a ranking value. A convention could be to normalize ranking values to the range 0.0 to 1.0, with 1.0 the most relevant.
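
A pluggable ranking engine under this convention might look like the sketch below. The ranker shown (fraction of query terms found in a title) and the dictionary layout of a result are assumptions made purely for illustration.

```python
def keyword_overlap_ranker(query):
    """Build a ranker that returns the fraction of query terms found in a title."""
    terms = set(query.lower().split())

    def rank(result):
        words = set(result["title"].lower().split())
        return len(terms & words) / len(terms) if terms else 0.0

    return rank


def rank_results(results, ranker):
    """Order results from any number of repositories by a shared ranking value."""
    scored = [(ranker(result), result) for result in results]
    for value, _ in scored:
        if not 0.0 <= value <= 1.0:
            raise ValueError("ranking values must be normalized to 0.0..1.0")
    # sorted() is stable, so equally ranked results keep their incoming order.
    return [result for _, result in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Because every engine returns a value on the same scale, results from different repositories can be merged into one ordering without knowing how each repository ranks internally.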

Search Targets

Repositories will have descriptive metadata. One value should be a unique identifier enforced by the content manager. This allows search targets to be identified unambiguously. The search manager should be able to accept one or more targets, specified by id.

Federated Search Flow

The first goal is to gather the criteria, search type, search properties, and search targets. Either the user will be able to select a subset of targets, or searching all targets is the only option. If the targets are fixed, then there will be a basic search, an advanced search, or both. If there are both, the user will have to pick which one they want. For the advanced search, the fields are perhaps the intersection of all fields, the union, or something else. If the targets are not fixed, the fields in an advanced search may be determined by which targets are included, or may be fixed.

In general, an intersection approach does not let the user enter a field value unless the field is universally supported, which is a limitation. A union approach runs the risk that the user enters a field value expecting all targets to support the field, and so misses results. For example, if there is an author field in only some targets, the user may not realize that a target without that field will not get that input.
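
The trade-off can be illustrated with a small sketch. The target ids and field names below are made up; the point is only that the union exposes fields some targets cannot honor, which a UI could flag.

```python
def field_sets(target_fields):
    """Return (union, intersection) of the advanced-search fields across targets."""
    sets = [set(fields) for fields in target_fields.values()]
    union = set().union(*sets)
    intersection = set.intersection(*sets) if sets else set()
    return union, intersection


def targets_missing(target_fields, field_name):
    """List the targets that would silently ignore input for field_name."""
    return sorted(t for t, fields in target_fields.items() if field_name not in fields)


# Hypothetical targets: only some of them support an author field.
TARGET_FIELDS = {
    "archive-a": ["title", "author", "date"],
    "archive-b": ["title", "date"],
    "archive-c": ["title", "author"],
}
```

Here the intersection offers only the title field, while the union offers title, author, and date but leaves archive-b unable to apply an author value.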

At the end of this process, the criteria and type are in hand. The UI could present search properties for user input, or these could be fixed. As with advanced search fields, properties are not guaranteed to be the same across targets and this needs to be handled.

As a final wrinkle, there could be search edit controls that are valid only for a single target or set of targets. These could be dynamically loaded, if the UI supports such things.

The second goal is to deliver the search to each search target. It can be a requirement of the consuming application that all connectors support the search criteria, type, and properties in the same format. Alternatively, the search manager could "fix up" these inputs on a target-by-target basis. Yet another option is to delegate any fixing up to a helper class that is loaded as needed. Such classes are associated with the target's id.

Each target receives the search inputs on a separate thread. There can be an overall timeout, or not. The manager can track when each thread completes and report completion for each target.
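
A minimal sketch of this dispatch step, using Python's standard concurrent.futures module. The function name, the status strings, and the idea that each connector is a callable taking the criteria are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(connectors, criteria, timeout=None):
    """Send criteria to every target on its own thread.

    connectors maps a target id to a callable that takes the criteria and
    returns a list of results. Returns (results by target, status by target).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(connectors)))
    futures = {pool.submit(search, criteria): target_id
               for target_id, search in connectors.items()}
    # timeout=None waits for every target; otherwise it is an overall limit in seconds.
    done, not_done = wait(futures, timeout=timeout)
    results, status = {}, {}
    for future in done:
        target_id = futures[future]
        if future.exception() is not None:
            status[target_id] = "failed"
        else:
            status[target_id] = "complete"
            results[target_id] = future.result()
    for future in not_done:
        future.cancel()  # has no effect on a search that is already running
        status[futures[future]] = "timed out"
    pool.shutdown(wait=False)  # do not block the caller on slow targets
    return results, status
```

Note that a timed-out search is only abandoned, not interrupted: its thread keeps running until the connector returns, so connectors would still need their own per-target time limits.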

The third goal is to provide an integrated result set. The search manager should be able to provide the following for any target and all targets:

  • has the search completed
  • how many results were returned - note this may be expensive
  • were there any results
  • return the results in an iterator or cache
  • sort the results
  • rank the results, perhaps using a pluggable ranking helper class

Note the use of an iterator. Some systems may provide a very large number of results, and it may take some time to fetch each one. Under either condition, it may not be efficient to get a count of total results.
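
One way to expose results lazily is to interleave the per-target iterators and hand them out a page at a time, so that no total count is ever required. The round-robin order and the function names below are assumptions; a ranked merge would be an equally valid choice.

```python
from itertools import islice


def interleave(result_iterators):
    """Yield results round-robin from each target, fetching only on demand."""
    active = [iter(it) for it in result_iterators]
    while active:
        still_active = []
        for it in active:
            try:
                yield next(it)
            except StopIteration:
                continue  # this target is exhausted; drop it from later rounds
            still_active.append(it)
        active = still_active


def next_page(cursor, n):
    """Return the next n results from the cursor (fewer if it runs out)."""
    return list(islice(cursor, n))
```

The generator returned by interleave acts as the cursor: each call to next_page pulls only as many results from the targets as are needed to fill the page.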
