Mårten Berggren, Anna Brümmer

both at Lund University Library Development Department NetLab
Marten.Berggren@lub.lu.se, Anna.Brummer@lub.lu.se

Design Considerations for the EULER project

Introduction

The goal of the EU-funded project EULER (EUropean Libraries and Electronic Resources in Mathematical Sciences) is to create a one-stop shopping site for locating information in the field of mathematics. The resulting service will integrate a wide range of material, such as full text databases covering preprints, electronic journal articles and (other) web pages, as well as (more traditional) bibliographic databases, e.g. online public access catalogues (OPACs) and abstract and index databases. It will use the bibliographical information from these different databases both to perform its functions and to present material, i.e. in the hit lists. The project will design an open, scalable and extensible architecture, which means that the system will be capable of adjusting to future needs. In particular, the system will be designed to cope with data providers joining or leaving as time goes on.

The project partner group consists of Cellule de Coordination Documentaire Nationale pour les Mathématiques; Centrum voor Wiskunde en Informatica in Amsterdam; the European Mathematical Society; Fachinformationszentrum Karlsruhe (Dept. Math. & Comput. Sci., Berlin); Lund University Library development department NetLab; Niedersächsische Staats- und Universitätsbibliothek Göttingen; Technische Universität Berlin; and Università degli Studi di Firenze.

As in all EU-funded projects, the end-user perspective is very important. The EULER service is meant to be easy to use for mathematicians, for people from related sciences (who work with mathematical methods) and for librarians, i.e. user groups with different information retrieval skills.

The current paper is based on an internal EULER deliverable on architectural and interface design. Some of the technical details given there are left out in this edited version since they may change in some aspects in the ongoing iterative design, implementation and test process.

The foremost requirement of an open architecture is that it uses suitable standards, formats and protocols. EULER will comply with this by using the de facto metadata standard Dublin Core v1.0 for resource description, Z39.50 as the search protocol and other standard Internet technology. Briefly, the EULER architecture consists of a number of independent databases through which different partners provide their data. All these databases can be queried simultaneously over the Internet through a single interface. This is displayed in the picture below and described in further detail in the following sections.



Architectural scheme of the EULER service.

Distributed architecture

The main aim of EULER is to bring several kinds of mathematical bibliographical and full text data from a wide range of suppliers into one single user interface. In order to achieve a common way of searching a range of different databases, the project has to make the data, i.e. the resource descriptions, uniform. Participating institutions will however not give up their independence, i.e. there is no wish to merge the databases into one large pooled database. Instead the EULER architecture consists of a number of databases that are searchable simultaneously over the Internet. The chosen approach, to build a distributed system where every data supplier is autonomous, has several advantages:

Distributed costs
Each provider is responsible for their own data and is compelled to cover the costs for that, as opposed to a centralised solution where the financing of the conversions has to be agreed upon. Still, it is possible for two or more providers to cooperate and make their data available through a common server.
Distributed quality
Each partner is responsible for the quality of their own service. A provider who puts a lot of effort into maintaining high quality of data and fast network connection will probably be used much more than suppliers with low quality of data and slow network connection.
Freedom of software
The protocol used to search the distributed databases is Z39.50. Since this is a standard protocol which has been available since 1984 and is popular within the library community, there is a lot of different software to choose from. Data providers can either configure existing Z39.50 servers to comply with the EULER Z39.50 profile or set up separate databases containing information from the original database. The partners can also benefit from software developments outside the EULER project.
Re-usability
Since the Z39.50 protocol is a standard, data suppliers can reuse the databases for other purposes, such as having a local copy of the EULER gateway for faster access or providing other kinds of access.
Wide audience
Using the World Wide Web as the distribution channel makes the service reachable by a worldwide audience.

Dublin Core as switching language

Searching various databases concurrently is a complicated matter for several reasons. Their records differ, the covered fields differ (e.g. not all bibliographic databases provide information on the location of a conference in a record for proceedings), the syntax in the fields differs and the access protocols differ (if the database is accessible over a network at all). The means chosen to realize the project goal was to develop a joint database record profile. Individual fields in each original database were mapped to this common profile. As the next step each partner will extract the relevant fields from their database and convert their data into a local EULER database. This creates similar front end databases at several sites, containing similar resource descriptions (but describing different material).

To create the initial common record profile, the EULER partners needed a known resource description method, which could serve as a "switching language" between individual database records. For this purpose the project decided to use the Dublin Core Metadata Element Set (DC) (the Dublin Core home page is available at http://purl.org/DC/). Dublin Core is

a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries. (1)

This decision was based partly on the fact that DC has become a widely used de facto standard for resource description on the Internet (especially in the library community) and partly on the outcome of the evaluation of existing formats done in the EU Telematics for Research (4th Framework) project DESIRE (http://www.ukoln.ac.uk/metadata/desire/overview/). Finally, since DC is quickly becoming a widely used format, it gives the project semantic interoperability between EULER and other services on a European and global level.

Mapping to Dublin Core

In the mapping process the partners tried to find fields that semantically matched the fifteen basic elements of DC in each of the databases that were to be converted. The fifteen DC elements were however not sufficient for the functionality that the end-user was believed to appreciate. In a few cases some needed (and available) information could not be matched with any of the 15 basic DC elements (and their qualifiers), thus they were put in a EULER-specific hierarchy.
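
The mapping step can be pictured with a minimal sketch. It assumes a partner database whose records are simple key/value structures; the local field names on the left (ti, au, kw, ab, conf_place) are invented for illustration, and only the target names on the right are taken from the EULER specification. Fields without a counterpart in the profile are collected separately instead of being silently dropped.

```python
# Hypothetical mapping from one partner's local field names to the common
# profile. Only the right-hand names come from the EULER specification;
# the left-hand names are invented for this example.
LOCAL_TO_EULER = {
    "ti": "DC.Title",
    "au": "DC.Creator",
    "kw": "DC.Subject",
    "ab": "DC.Description",
    "conf_place": "EULER.Event.Location",
}


def to_euler_record(local_record):
    """Return (euler_record, unmapped) for one local database record.

    Every value in the EULER record is stored as a list, so repeatable
    fields such as DC.Creator need no special handling.
    """
    euler, unmapped = {}, {}
    for field, value in local_record.items():
        target = LOCAL_TO_EULER.get(field)
        if target is None:
            # No semantic match in the profile: keep it aside for review.
            unmapped[field] = value
            continue
        values = value if isinstance(value, list) else [value]
        euler.setdefault(target, []).extend(values)
    return euler, unmapped


SAMPLE = {"ti": "On primes", "au": ["Doe, Jane", "Roe, Richard"], "shelf": "A12"}
EULER_RECORD, UNMAPPED = to_euler_record(SAMPLE)
```

Keeping the unmapped fields visible mirrors the situation described above, where some available information could not be matched with any of the 15 basic DC elements.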

In the selection process efforts were made to exclude fields that only a few of the current EULER partners provide data for. There are however exceptions to the rule. The full text of web pages in a robot-generated mathematical web index ("All mathematics") will be given a field in order to make it searchable in the EULER service. The reason for this is that many web pages still do not contain metadata for the robot to extract, and if they are not made searchable in full text, they cannot be searched at all. It is also expected that future partners in EULER will want to make their full text searchable. Furthermore, fields judged likely to become useful in the foreseeable future will not be discarded.

There were two important reasons for these selection rules. First, an architecture should be as generic and simple as possible, avoiding the complexity of special solutions for each supplier. Even if such additional complexity could be managed during the lifetime of the project, it could easily get out of hand if EULER becomes a success and is turned into a regular service with many data providers after the project is finished. Secondly, users would not be aware that when they select a certain field they search only those records that have data in that field, not all records. If EULER provided fields that only a few providers can fill in, users might be led to think that EULER covers less than it actually does, because they would only search a fraction of all records.

Example: Assume that we include the Swedish SAB classification system and that only 1% of all the resources in EULER are classified according to that system. A user who restricts his/her search to classification and a SAB code will thus only search 1% of all the resources in EULER. S/he will not find any resources among the other 99% that would have been given the classification s/he searches for. The user may conclude that there are only very few resources on the subject s/he is interested in.
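
The coverage argument can be made concrete with a small sketch. The data and the field name x-SAB are invented for illustration; the function simply measures what fraction of records a field-restricted search could ever reach.

```python
def field_coverage(records, field):
    """Fraction of records that have a non-empty value in `field`."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field))
    return filled / len(records)


# Invented data: 100 records, only one of which carries a SAB code.
# "x-SAB" is a hypothetical field name and "T" a placeholder code.
RECORDS = [{"DC.Title": f"Resource {i}"} for i in range(100)]
RECORDS[0]["x-SAB"] = "T"

SAB_COVERAGE = field_coverage(RECORDS, "x-SAB")       # 0.01, i.e. 1%
TITLE_COVERAGE = field_coverage(RECORDS, "DC.Title")  # 1.0, i.e. 100%
```

A search limited to the hypothetical SAB field can return at most the 1% of records counted here, whatever the other 99% are about.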

Specific schemes have been assigned to most (EULER) DC fields, so the EULER participants will make their data conform both in which fields are covered and in the structure/format of the information in those fields. For example, the field Date has different formats in different databases: in a Swedish database 11 January 1999 would normally be written 11/1-99, whereas in DC it should follow ISO 8601 (YYYY-MM-DD) and thus be written 1999-01-11.
For some DC fields it is not appropriate to use a scheme, i.e. freetext entries are allowed (e.g. Title).
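
The date example can be sketched as a small conversion routine. This is a minimal illustration, not the converter used by any partner: the function name is invented, it only handles the day/month-year shape shown above, and it assumes that a two-digit year belongs to the 1900s unless another century is passed in.

```python
import re
from datetime import date


def swedish_short_date_to_iso(text, century=1900):
    """Convert a date written as D/M-YY (e.g. 11/1-99) to ISO 8601.

    Assumption: two-digit years belong to `century` (1900 by default).
    """
    match = re.fullmatch(r"\s*(\d{1,2})/(\d{1,2})-(\d{2})\s*", text)
    if match is None:
        raise ValueError(f"not a D/M-YY date: {text!r}")
    day, month, short_year = (int(group) for group in match.groups())
    # date() validates the values, so 31/2-99 raises ValueError.
    return date(century + short_year, month, day).isoformat()


print(swedish_short_date_to_iso("11/1-99"))  # prints 1999-01-11
```

Rejecting unrecognised input instead of guessing keeps malformed dates out of the converted EULER database, where they would be hard to spot later.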

The following EULER DC specification is a working draft and almost certain to be subject to future changes. Further, some EULER specific fields may be migrated to the DC hierarchy, if DC development makes it possible.

Field name | DC/EULER mapping | Semantics | Scheme | Repeatable
Title | DC.Title | Title | None (i.e. freetext) | No
Alternative title | DC.Title.Alternative | Any titles other than the main title; including subtitle, translated title, series title, vernacular name, etc. | None (i.e. freetext) | Yes
Creator | DC.Creator | The creator of the resource that the record is describing. | Family name, first name (MARC) | Yes
Uncontrolled keyword | DC.Subject | Any keyword (NOT full text). | None (i.e. freetext) | Yes
Library of Congress Subject Headings (LCSH) keyword | DC.Subject | Controlled keyword from the LCSH. | LCSH | Yes
Mathematics Subject Classification Scheme (MSC) classification | DC.Subject | MSC code, not the explanatory text. | MSC | Yes
Dewey Decimal Classification (DDC) classification | DC.Subject | DDC code, not the explanatory text. | DDC | Yes
Description | DC.Description | Abstract and other freetext describing the resource. NOT the full text of a web-resource or such like. | None (i.e. freetext) | Yes
Publisher | DC.Publisher | Name identifying the publisher of the resource the record is describing. | For the time being freetext will be used but the project partners are investigating if there is a suitable standard scheme that could be used. | Yes (Some publications are published as a joint effort by several publishers.)
Date | DC.Date | Date of resource publication or availability. | YYYY-MM-DD (ISO 8601) | Yes (For limited use in exceptional cases)
Type | DC.Type | Intellectual content type of the resource described in the record. | All of the suggested types in the "Dublin Core Resource Types, Structuralist DRAFT: July 24, 1997" (2) are accepted, but some of them are more relevant for the EULER project (and might be used as search help, e.g. selectable from a list of resource types), which is why they are pointed out in the list of "standard" types that follows this row. Further, the EULER partners have decided to describe a few additional types not represented in the above-mentioned draft, and have therefore created an EULER specific list of types, also given after this row. | Yes

  The Frequent/Appropriate Types from "Standard"
    • Text.Abstract
    • Text.Article
    • Text.Homepage
    • Text.Monograph
    • Text.Preprints
    • Text.Proceedings
    • Text.Serial
    • Text.TechReport
    • Text.Thesis
    • Image.Moving.Film
    • Software.Executable
    • Software.Source
    • Data.Numeric
  EULER Specific Types
    • Text.x-Separatum (Separatum)
    • Text.x-Patentspec (Patent Specification)
    • Text.x-Bibliography
    • Text.x-LectureNotes
    • Text.x-Review

Format | DC.Format | IANA MIME-type of file | Internet Media Types IMT (3) | Yes
Physical carrier | DC.Format.x-physical | Physical carrier of information. The reason for applying this EULER-invented sub-field is that the end-user should be able to conclude if the resource described in the bibliographic record (displayed in the hitlist) is available online or not. Example: "book" (= paper) is a physical description, whereas "monograph" is an entity irrespective of how it is "delivered", whether in a printed version (paper) or in a file. | The project's draft list of relevant EULER specific physical carriers, given after this row. | No

  EULER specific physical carriers (draft)
    • printed material (= paper) (includes notes, Braille material)
    • hand-written material
    • letters
    • Internet file (= file)
    • film (when not file)
    • computer-readable material (but not Internet files or other formats described)
    • object
    • microfilm
    • microfiche
    • CD-ROM (= file/cdrom)

URN | DC.Identifier | URN of resource described in the record | URN (4) | No
ISSN | DC.Identifier | ISSN of resource described in the record | ISSN | No
ISBN | DC.Identifier | ISBN of resource described in the record | ISBN | No
URL | DC.Identifier | URL of resource (described in the record) or where it can be acquired | URL (http://.., ftp://... etc) | Yes
EULER specific identifier | DC.Identifier | EULER-own resource descriptions (in order to find duplicates) | The project partners have decided to use Soundex in order to produce this identifier. | No
Language | DC.Language | Language of the resource the record is describing | ISO 639 | Yes

EULER specific fields

EULER identifier | EULER.Identifier | The purpose of this field is to identify the resource in other ways than those provided by the other fields. This can be page-, issue- or volume-numbers in serials or similar. (Can be used differently in different databases, e.g. ISO 4-1984) | None (i.e. freetext) | No
Full text | EULER.Fulltext | The fulltext of web-pages and other resources available as a whole. This does NOT mean "all fields" in a search! | None (i.e. freetext) | No
Event location | EULER.Event.Location | Location of event for/at which the resource described in the record was created | Getty Thesaurus of Geographic Names (TGN) (5) | No
Event date | EULER.Event.Date | Date of event for/at which the resource described in the record was created | YYYY-MM-DD (ISO 8601) | No
Event name | EULER.Event.Name | Name of event where document was created | None (i.e. freetext) | No
Record source | EULER.Record.Source | The source for the record, i.e. describes which database has delivered the record | EULER will produce a list of participating databases, based on the DNS system. | No
Record source id | EULER.Record.Sourceidentifier | Identifier of source record for the description delivered in EULER | The scheme/list used in EULER.Record.Source, with the record number in the database (i.e. original record) added | No
Record creator | EULER.Record.Creator | Creator of the record (describing the resource), e.g. a reviewer | Family name, first name (MARC) | No
Address for delivery information | EULER.Delivery | Meant to give the URL to the library where the resource described in the record can be acquired. | URL | No
Additional retrieve/delivery information | EULER.Delivery.Description | Additional information that a local library needs to retrieve/deliver the resource described in the record. This is useful information provided the end-user is at the same location as the library holding the database from which the record came. | None (i.e. freetext) | Yes
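
The specification above states that Soundex will be used to produce the EULER specific identifier for finding duplicates, but it does not say which fields are encoded or how the codes are combined. The following sketch therefore rests on assumptions: it implements the common American Soundex variant, and the function duplicate_key (family name of the first creator, first word of the title, year) is a hypothetical composition chosen only for illustration.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no digit.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character American Soundex code of `word`.

    Characters outside A-Z are ignored, so national characters should be
    mapped to ASCII before calling this function.
    """
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so repeats after it count
    return (result + "000")[:4]


def duplicate_key(family_name, title, year):
    """Hypothetical duplicate-detection key; not the EULER definition."""
    words = title.split()
    first_word = words[0] if words else ""
    return f"{soundex(family_name)}-{soundex(first_word)}-{year}"


print(soundex("Robert"), soundex("Rupert"))  # prints R163 R163
```

Two records whose creator is spelled Berggren in one database and Bergren in another would receive the same key under this scheme, which is the kind of near-match a duplicate check has to catch.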

Z39.50

In order to make all databases accessible from a single interface, they are all connected to the Internet through Z39.50. This is a standard client/server search and retrieve protocol (http://lcweb.loc.gov/z3950/agency/1995doce.html) maintained by the Library of Congress (see http://lcweb.loc.gov/z3950/agency/). Since it was originally proposed in 1984 it has grown in popularity in the library community, becoming one of the most used standards for achieving interoperability in search and retrieval.
As Z39.50 is supposed to cover all possible needs for searches in bibliographic databases, its design is quite flexible, allowing the implementor to choose which features of the protocol to implement and how to use them to achieve the desired functionality. In order to make it easier to create software that can work with several databases, some sub-standards to Z39.50 have been developed. These standards determine what features a Z39.50 server should support and how it should respond to certain requests. Two of these standards, Bib-1 and GILS, are used within EULER; a third, the Cross-Domain Attribute Set, is not yet finished but is described here because it was designed with Dublin Core in mind:

Bib-1
This is not really a separate sub-standard but rather a part of the Z39.50 standard. It contains definitions that may be used to express a query to a server, such as which fields to search and how the matching against the fields should be performed. More information is available in the standard and at ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt. For the latest additions not yet in the standard document, see http://lcweb.loc.gov/z3950/agency/defns/bib1.html.
GILS (version 2)
GILS (Government Information Locator Service) is a standard for providing information about governmental documents. Among other things, it defines a set of searchable fields, some of which are mandatory and some of which are optional, and how a record should be transported back to the client. The latter is expressed in GRS-1, a hierarchical format defined in Z39.50. More information about GILS can be found at http://www.gils.net/prof_v2.html and http://www.gils.net/elements.html.
Cross-Domain Attribute Set
Similar to Bib-1 above, this is a standard containing definitions which clients may use when expressing queries to a server. This standard has the benefit of being designed with Dublin Core in mind, hence it would be a lot easier to use within the EULER project. Unfortunately, this standard is not quite finished yet, so it cannot be used right now. See http://www.oclc.org/~levan/docs/crossdomainattributeset.html for further information.

Compliance with these kinds of standards also makes it easy to search several databases simultaneously through a single user interface, saving time and effort for the end user. This technology will be used in EULER in order to let end-users search all front end databases within the project from a single interface. This interface could be any Z39.50 client, but since most Z39.50 clients require fairly deep knowledge of the protocol, we will use a web gateway for accessing the databases.

Mapping to Z39.50

After the selection of fields included in the project specific DC profile, a EULER Z39.50 profile was created by mapping the fields into the Z39.50 protocol. EULER uses DC as a switching language, thus it would have been natural to use a DC Z39.50 profile. However, no such profile is yet available. In the meantime, some other profile must be used as a vehicle for the EULER profile. GILSv2 was singled out as the interim solution since it is a fairly generic profile intended to describe documents. Again, this is merely a temporary solution for the communication over Z39.50. EULER is not trying to achieve full GILSv2 compliance. This is reflected by the fact that the interim EULER solution does not conform to GILSv2 semantic or syntactic standards (for all fields). Furthermore, Bib-1 use-attributes were used when these were deemed more appropriate for the contents of our fields. EULER partners will monitor the standard developments during the project and try to switch to a DC profile, if/when possible.
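
To illustrate what mapping a field onto Bib-1 use-attributes amounts to, here is a minimal sketch that builds a query string from field names. The attribute numbers are those registered in the Bib-1 attribute set, but the choice of which search field goes with which attribute is an assumption made for this example and is not the EULER profile. The output notation is the Prefix Query Format used by some Z39.50 toolkits; it is only a convenient way to write a query down and is not part of the protocol itself.

```python
# Bib-1 use attributes (attribute type 1). The short field names on the
# left are invented for this example.
BIB1_USE = {
    "title": 4,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
    "date": 31,
    "abstract": 62,
    "author": 1003,
    "any": 1016,
}


def pqf_term(field, term):
    """Return one search clause such as: @attr 1=4 "prime numbers"."""
    if '"' in term or "\\" in term:
        # Kept simple on purpose: quoting rules are outside this sketch.
        raise ValueError("quotes and backslashes are not supported here")
    return f'@attr 1={BIB1_USE[field]} "{term}"'


def pqf_and(*clauses):
    """Combine clauses with the binary prefix operator @and."""
    if not clauses:
        raise ValueError("at least one clause is required")
    query = clauses[0]
    for clause in clauses[1:]:
        query = f"@and {query} {clause}"
    return query


QUERY = pqf_and(pqf_term("title", "prime numbers"), pqf_term("author", "Euler"))
print(QUERY)  # prints @and @attr 1=4 "prime numbers" @attr 1=1003 "Euler"
```

Because every front end database answers the same attribute numbers in the same way, a gateway can send one such query unchanged to all of them.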

Since the architecture of EULER is supposed to handle resources from all over Europe, we must consider the differences in the alphabets used in different countries. UNICODE was considered for the project but was ruled out for three reasons. First of all, current input devices (i.e. keyboards) do not cater very well for UNICODE. It is hard to enter characters outside the ISO 8859-1 range, so most users would not be able to enter special characters of foreign languages. Secondly, most fonts provided by popular operating systems do not contain glyphs for generic UNICODE. This means that browser programs are not able to display characters not provided by the font. Finally, the software used in the project cannot handle UNICODE. Instead, we use ISO 8859-1 as the basic searching mechanism and also provide a mapping to 7-bit ASCII for those fields that contain national characters, which can be used to search for records containing characters that cannot be entered on all keyboards. For display purposes we use HTML entities, which makes it possible to display national characters today and will make it easy to use new entities when they become widely available in future versions of HTML.
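
The two character mappings described above can be sketched as follows. The actual EULER mapping tables are not given in this paper, so this is an assumption-laden illustration: diacritics are removed by Unicode decomposition, a small hand-written table (EXTRA_ASCII, invented here) covers letters that do not decompose, and display uses the named entities known to Python's standard library.

```python
import html
import unicodedata
from html.entities import codepoint2name

# Letters that have no base letter plus accent decomposition. The chosen
# replacements are an assumption, not the EULER mapping.
EXTRA_ASCII = {"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
               "ð": "d", "Ð": "D", "þ": "th", "Þ": "Th"}


def to_ascii(text):
    """Map national characters to 7-bit ASCII for searching."""
    replaced = "".join(EXTRA_ASCII.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Dropping non-ASCII removes the combining accents left by NFKD.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def to_html_entities(text):
    """Return pure-ASCII HTML that displays the national characters."""
    parts = []
    for ch in html.escape(text, quote=False):
        codepoint = ord(ch)
        if codepoint < 128:
            parts.append(ch)
        elif codepoint in codepoint2name:
            parts.append(f"&{codepoint2name[codepoint]};")
        else:
            parts.append(f"&#{codepoint};")  # numeric fallback
    return "".join(parts)


print(to_ascii("Mårten Brümmer"))         # prints Marten Brummer
print(to_html_entities("Mårten Brümmer"))  # prints M&aring;rten Br&uuml;mmer
```

Indexing the ASCII form next to the original lets a user who cannot type å or ü still find the record, while the entity form keeps the hit list correct in any browser.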

The EULER Engine

The next step in the development of the EULER service will be to make it possible for end-users to search (all of) these databases simultaneously from one single easy to use interface. This will be done through a Z39.50 to WWW gateway. Through this gateway, users will be able to search the EULER databases and view the results. Along with the gateway itself there will be utilities, such as thesauri and classification browsers. Together, the gateway and the utilities are called the EULER engine.
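
The core job of such a gateway, sending one query to every database at the same time and merging the answers, can be sketched without any Z39.50 code. Everything below is hypothetical: the back ends are in-memory stand-ins for real Z39.50 connections, and the provider names are invented. The sketch also shows one way to tolerate a provider that is temporarily unreachable, in line with the requirement that providers may join or leave.

```python
from concurrent.futures import ThreadPoolExecutor


def make_backend(name, records):
    """Return a stand-in search function for one provider's database."""
    def search(term):
        needle = term.lower()
        # Tag each hit with its origin, as EULER.Record.Source does.
        return [dict(record, **{"EULER.Record.Source": name})
                for record in records
                if needle in record["DC.Title"].lower()]
    return search


def failing_backend(term):
    """Stand-in for a provider that cannot be reached."""
    raise ConnectionError("provider unreachable")


def search_all(backends, term, timeout=5.0):
    """Query all back ends concurrently; return (hits, errors)."""
    hits, errors = [], {}
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(search, term)
                   for name, search in backends.items()}
        for name, future in futures.items():
            try:
                hits.extend(future.result(timeout=timeout))
            except Exception as exc:  # one failure must not stop the rest
                errors[name] = str(exc)
    return hits, errors


BACKENDS = {
    "provider-a": make_backend("provider-a", [{"DC.Title": "Euler's formula"},
                                              {"DC.Title": "Graph theory"}]),
    "provider-b": make_backend("provider-b", [{"DC.Title": "Euler characteristic"}]),
    "provider-c": failing_backend,
}
HITS, ERRORS = search_all(BACKENDS, "euler")
```

Returning the errors alongside the hits lets the interface tell the user which databases did not answer, instead of silently presenting an incomplete hit list.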

To implement the user interface to the EULER engine, widely available standard web technologies will be used. The basic functionality will only require support for HTTP (Hyper Text Transfer Protocol, the standard used to transfer files over the World Wide Web (WWW)) and HTML (Hyper Text Markup Language). Additional functionality may be achieved if the browser program supports JavaScript.

User Interface Design

In this paper the word design has two meanings. It refers to the technical architecture of the service, which makes it possible to achieve the goals of the project. But it also refers to design considerations in the sense of the future functions and layout of the service. An end-user study has been performed within the framework of the project and will be considered the most important source of input for the actual user interface design.

The gateway should provide:

Access control to the gateway will use regular HTTP methods, making it possible to use whatever standard package for user administration is available. Access to the databases will be controlled by IP number, allowing each provider to select who should have access to their data.
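
Control by IP number can be sketched as a per-provider list of allowed networks. The provider names and the table are hypothetical, and the addresses come from the ranges reserved for documentation; the check itself uses Python's standard ipaddress module.

```python
import ipaddress

# Hypothetical access table: each provider lists the networks that may
# query its database. Addresses are from documentation-only ranges.
ALLOWED_NETWORKS = {
    "provider-a": ["192.0.2.0/24"],
    "provider-b": ["198.51.100.0/25", "203.0.113.7/32"],
}


def may_access(provider, client_ip):
    """Return True if `client_ip` lies in one of the provider's networks."""
    address = ipaddress.ip_address(client_ip)
    networks = ALLOWED_NETWORKS.get(provider, [])
    return any(address in ipaddress.ip_network(network)
               for network in networks)


print(may_access("provider-a", "192.0.2.55"))      # prints True
print(may_access("provider-b", "198.51.100.200"))  # prints False
```

Keeping one list per provider matches the autonomy described earlier: each data supplier decides alone who may reach its database, and an unknown provider name grants no access at all.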

Helper applications will consist of at least the following two utilities:

There are several reasons why the final user interface of the service, i.e. what it will look like in terms of layout etc., is not covered in this paper. The most obvious one is that the World Wide Web is changing constantly, so the end-user interface is preferably developed during the project instead of being laid down from the start. Thus only some functions and utilities that might be of use (from an end-user perspective) are presented. The starting point for the described considerations is that, potentially, many EULER users are found outside the library community, i.e. end-users of the service cannot be assumed to be advanced information searchers. Hence, the design will aim at ease of use rather than at features that can only be used by information retrieval experts.

What comes next

EULER started in April 1998 and has funding until autumn 2000. The partners hope to present an alpha version of the service during summer 1999. More information on the project (and later the alpha version of the service) is available at the EULER homepage http://www.emis.de/projects/EULER/. The authors of this article work at Lund University Library development department NetLab (http://www.lub.lu.se/netlab/) and their e-mail addresses are anna.brummer@lub.lu.se and marten.berggren@lub.lu.se.

References

(1) Dublin Core home page
(http://purl.oclc.org/metadata/dublin_core/) Visited October 1998, this page has now disappeared (January 1999)
(2) Dublin Core Resource Types, Structuralist DRAFT: July 24, 1997
(http://sunsite.berkeley.edu/Metadata/structuralist.html) Visited October 1998
(3) IMT, Internet Media Types
(http://www.isi.edu/in-notes/iana/assignments/media-types/media-types) Visited October 1998
(4) URN, Uniform Resource Names
(http://www.ietf.org/html.charters/urn-charter.html) Visited October 1998
(5) TGN, Getty Thesaurus of Geographic Names
(http://www.gii.getty.edu/vocabulary/tgn.html) Visited October 1998


This article was written January 1999