both at Lund University Library Development Department NetLab,
Marten.Berggren@lub.lu.se , Anna.Brummer@lub.lu.se,
The goal of the EU-funded project EULER, EUropean Libraries and Electronic Resources in Mathematical Sciences, is to create a one-stop shopping site for locating information in the field of mathematics. The resulting service will integrate a wide range of material, such as full text databases covering preprints, electronic journal articles and (other) web pages, as well as (more traditional) bibliographic databases, e.g. online public access catalogues (OPACs) and abstract and index databases. It will use the bibliographical information from these different databases to perform its functions and also in its presentations of material, i.e. the hit lists. The project will design an open, scalable and extensible architecture, which means that the system will be capable of adjusting to future needs. In particular, the system will be designed to cope with data providers joining or leaving it as time goes on.
The project partner group consists of Cellule de Coordination Documentaire Nationale pour les Mathématiques, Centrum voor Wiskunde en Informatica in Amsterdam, The European Mathematical Society, Fachinformationszentrum Karlsruhe (Dept. Math. & Comput. Sci., Berlin), Lund University Library development department NetLab, Niedersächsische Staats- und Universitätsbibliothek Göttingen, Technische Universität Berlin, and Università degli Studi di Firenze.
As in all EU-funded projects, the end-user perspective is very important. The EULER service is meant to be easy to use for mathematicians, persons from related sciences (who work with mathematical methods) and librarians, i.e. user groups with different information retrieval skills.
The current paper is based on an internal EULER deliverable on architectural and interface design. Some of the technical details given there are left out in this edited version since they may change in some aspects in the ongoing iterative design, implementation and test process.
The foremost requirement of an open architecture is that it uses suitable standards, formats and protocols. EULER will comply with this by using the de facto metadata standard Dublin Core v1.0 for resource description, Z39.50 as search protocol and other standard Internet technology. Briefly, the EULER architecture consists of a number of independent databases through which different partners provide their data. All these databases can be queried simultaneously over the Internet through a single interface. This is displayed in the picture below and described in further detail in the following sections.
The main aim of EULER is to bring several kinds of mathematical bibliographical and full text data from a wide range of suppliers into one single user interface. In order to achieve a common way of searching a range of different databases, the project has to make the data, i.e. the resource descriptions, uniform. Participating institutions will however not give up their independence, i.e. there is no wish to merge the databases into one large pooled database. Instead the EULER architecture consists of a number of databases that can be searched simultaneously over the Internet. The chosen approach, to build a distributed system where every data supplier is autonomous, has several advantages:
Searching various databases concurrently is a complicated matter for several reasons. Their records differ, the fields covered differ (e.g. not all bibliographic databases provide information on the location of a conference in a record for proceedings), the syntax in the fields differs and the access protocols differ (if the databases are accessible over a network at all). The means chosen to realize the project goal was to develop a joint database record profile. Individual fields in each original database were mapped to this common profile. As the next step, each partner will extract the relevant fields from their database and convert their data into a local EULER database. This creates similar front end databases, containing similar resource descriptions (but describing different material), at several sites.
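The mapping step described above can be illustrated with a minimal sketch. It assumes that a partner's record is available as a simple dictionary; the local field tags (`ti`, `au`, `py`, `conf_place`) and the function name are hypothetical, while the target names are taken from the DC/EULER mapping column of the specification table in this paper. It is not the conversion software used by any partner.

```python
# Hypothetical local field tags on the left; target names on the right
# follow the "DC/EULER mapping" column of the specification table.
EXAMPLE_FIELD_MAP = {
    "ti": "DC.Title",
    "au": "DC.Creator",
    "py": "DC.Date",
    "conf_place": "EULER.Event.Location",
}


def to_profile_record(local_record, field_map):
    """Copy mapped, non-empty fields of a local record into the common profile.

    Local fields without a mapping are dropped. Every target field holds a
    list, so that repeatable fields need no special handling.
    """
    profile_record = {}
    for local_field, value in local_record.items():
        target = field_map.get(local_field)
        if target is None or value in (None, "", []):
            continue
        values = value if isinstance(value, list) else [value]
        profile_record.setdefault(target, []).extend(values)
    return profile_record


example = to_profile_record(
    {"ti": "On the zeta function", "au": ["Doe, Jane", "Roe, Richard"],
     "internal_id": 17, "py": ""},
    EXAMPLE_FIELD_MAP,
)
```

Keeping the mapping in a table per supplier, rather than in code, fits the requirement that suppliers may join or leave: a new supplier only contributes a new mapping table.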
To create the initial common record profile, the EULER partners needed a
known resource description method, which could serve as a "switching language"
between individual database records.
For this purpose the project decided to use the Dublin
Core Metadata Element Set (DC) (the Dublin Core home page is available at
http://purl.org/DC/).
Dublin Core is
a 15-element metadata element set intended to facilitate discovery
of electronic resources. Originally conceived for author-generated
description of Web resources, it has also attracted the attention of formal
resource description communities such as museums and libraries.
(1)
This decision was based partly on the fact that DC has become a widely used de facto standard for resource description on the Internet (especially in the library community), and partly on the outcome of the evaluation of existing formats done in the EU Telematics for Research (4th Framework) project DESIRE (http://www.ukoln.ac.uk/metadata/desire/overview/). Finally, since DC is quickly becoming a widely used format, it gives the project semantic interoperability between EULER and other services on a European and global level.
In the selection process, efforts were made to exclude fields that only a few of the partners involved in EULER today provide data for. There are however exceptions to the rule. The full text of web pages in a robot generated mathematical web index ("All mathematics") will be given a field in order to make it searchable in the EULER service. The reason for this is that many web pages still do not contain metadata for the robot to extract, and if they are not made searchable in full text, it will not be possible to search them at all. It is also expected that future partners in EULER will want to make their full text searchable. Furthermore, fields judged likely to become useful in the foreseeable future will not be discarded. There were two important reasons for these selection rules. First, an architecture should be as generic and simple as possible, avoiding the complexity of special solutions for each supplier. Even if such additional complexity could be managed during the lifetime of the project, it could easily get out of hand if EULER becomes a success as a service, with many data providers after the project is finished (and EULER has turned into a regular service). Secondly, users would not be aware that, when they select a certain field, they do not search all records but only those that have data in that field. If fields that only a few providers can fill in are offered, users may be led to think that EULER covers less than it actually does, because they would only search a fraction of all records. Example: Assume that we include the Swedish SAB classification system and that only 1% of all the resources in EULER are classified according to that system. A user restricting his/her search to classification and a SAB code will thus only search 1% of all the resources in EULER. S/He will not find any resources among the other 99% that would have been given the classification s/he searches for. The user may conclude that there are only very few resources in the subject s/he is interested in.
Specific schemes have been assigned to most (EULER) DC-fields. The EULER
participants will thus make their data conform, both with respect to the
fields covered and with respect to the structure and format of the
information in those fields.
For example, the field Date has different formats in different databases:
in a Swedish database 11 January 1999 would normally be written 11/1-99,
whereas in DC it should follow ISO 8601 (YYYY-MM-DD) and thus
be written 1999-01-11.
In some DC-fields it is not appropriate to use a scheme, i.e.
freetext entries are allowed (e.g. Title).
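The Date example can be sketched as a small conversion routine. This is only an illustration under stated assumptions: the function name is hypothetical, the input is assumed to be exactly of the form day/month-year as in the example above, and the rule for expanding two-digit years (the `pivot` parameter) is an assumption, not something specified by the project.

```python
import re
from datetime import date

# Day/month-year as in the example "11/1-99"; the year has 2 or 4 digits.
_DAY_MONTH_YEAR = re.compile(r"^(\d{1,2})/(\d{1,2})-(\d{2}|\d{4})$")


def swedish_to_iso(text, pivot=30):
    """Convert a date written as D/M-YY (or D/M-YYYY) to ISO 8601 YYYY-MM-DD.

    Two-digit years at or above `pivot` are read as 19YY, others as 20YY.
    The pivot rule is an assumption made for this sketch.
    """
    match = _DAY_MONTH_YEAR.match(text.strip())
    if match is None:
        raise ValueError("unrecognised date format: %r" % text)
    day, month, year = (int(part) for part in match.groups())
    if len(match.group(3)) == 2:
        year += 1900 if year >= pivot else 2000
    # date() validates the combination, e.g. it rejects 31/2-99.
    return date(year, month, day).isoformat()
```

With this sketch, `swedish_to_iso("11/1-99")` yields `1999-01-11`, the form required by the Date scheme.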
The following EULER DC specification is a working draft and almost certain to be subject to future changes. Further, some EULER specific fields may be migrated to the DC hierarchy, if DC development makes it possible.
| Field name | DC/EULER mapping | Semantics | Scheme | Repeatable |
|---|---|---|---|---|
| Title | DC.Title | Title | None (i.e. freetext) | No |
| Alternative title | DC.Title.Alternative | Any titles other than the main title; including subtitle, translated title, series title, vernacular name, etc. | None (i.e. freetext) | Yes |
| Creator | DC.Creator | The creator of the resource that the record is describing. | Family name, first name (MARC) | Yes |
| Uncontrolled keyword | DC.Subject | Any keyword (NOT full text). | None (i.e. freetext) | Yes |
| Library of Congress Subject Headings (LCSH) keyword | DC.Subject | Controlled keyword from the LCSH. | LCSH | Yes |
| Mathematics Subject Classification Scheme (MSC) classification | DC.Subject | MSC code, not the explanatory text. | MSC | Yes |
| Dewey Decimal Classification (DDC) classification | DC.Subject | DDC code, not the explanatory text. | DDC | Yes |
| Description | DC.Description | Abstract and other freetext describing the resource. NOT the full text of a web-resource or such like. | None (i.e. freetext) | Yes |
| Publisher | DC.Publisher | Name identifying the publisher of the resource the record is describing. | For the time being freetext will be used, but the project partners are investigating whether there is a suitable standard scheme that could be used. | Yes (Some publications are published as a joint effort by several publishers.) |
| Date | DC.Date | Date of resource publication or availability. | YYYY-MM-DD (ISO 8601) | Yes (For limited use in exceptional cases) |
| Type | DC.Type | Intellectual content type of the resource described in the record. | All of the suggested types in the "Dublin Core Resource Types, Structuralist DRAFT: July 24, 1997" (2) are accepted. Some of them are more relevant for the EULER project (and might be used as search help, e.g. selectable from a list of resource types), which is why they are pointed out in this list of "standard" types. Further, the EULER partners have decided to describe a few additional types, not represented in the above-mentioned draft, and have therefore created an EULER specific list of types. The Frequent/Appropriate Types from "Standard" | Yes |
| Format | DC.Format | IANA MIME-type of file | Internet Media Types IMT (3) | Yes |
| Physical carrier | DC.Format.x-physical | Physical carrier of information. The reason for applying this EULER-invented sub-field is that the end-user should be able to conclude whether the resource described in the bibliographic record (displayed in the hit list) is available online or not. Example: book (= paper) is the physical description (compared to monograph, which is an entity irrespective of how it is "delivered", in a printed version (paper) or in a file). | The project's draft list of relevant EULER specific physical carriers: | No |
| URN | DC.Identifier | URN of resource described in the record | URN (4) | No |
| ISSN | DC.Identifier | ISSN of resource described in the record | ISSN | No |
| ISBN | DC.Identifier | ISBN of resource described in the record | ISBN | No |
| URL | DC.Identifier | URL of resource (described in the record) or where it can be acquired | URL (http://..,ftp://... etc) | Yes |
| EULER specific identifier | DC.Identifier | EULER-own resource descriptions (in order to find duplicates) | The project partners have decided to use Soundex in order to produce this identifier. | No |
| Language | DC.Language | Language of the resource the record is describing | ISO 639 | Yes |
| EULER specific fields | | | | |
| EULER identifier | EULER.Identifier | The purpose of this field is to identify the resource in other ways than those provided by the other fields. This can be page-, issue- or volume-numbers in serials or similar. (Can be used differently in different databases, e.g. ISO 4-1984) | None (i.e. freetext) | No |
| Full text | EULER.Fulltext | The fulltext of web-pages and other resources available as a whole. This does NOT mean "all fields" in a search! | None (i.e. freetext) | No |
| Event location | EULER.Event.Location | Location of event for/at which the resource described in the record was created | Getty Thesaurus of Geographic Names (TGN) (5) | No |
| Event date | EULER.Event.Date | Date of event for/at which the resource described in the record was created | YYYY-MM-DD (ISO 8601) | No |
| Event name | EULER.Event.Name | Name of event where document was created | None (i.e. freetext) | No |
| Record source | EULER.Record.Source | The source for the record i.e. describes which database has delivered the record | EULER will produce a list over participating databases, based on the DNS-system. | No |
| Record source id | EULER.Record.Sourceidentifier | Identifier of source record for the description delivered in EULER | The scheme/list used in EULER.Record.Source and add the record number in the database (i.e. original record) | No |
| Record creator | EULER.Record.Creator | Creator of the record (describing the resource), e.g. a reviewer | Family name, first name (MARC) | No |
| Address for delivery information | EULER.Delivery | Meant to give the URL to the library where the resource described in the record can be acquired. | URL | No |
| Additional retrieve/delivery information | EULER.Delivery.Description | Additional information that a local library needs to retrieve/deliver the resource described in the record. This is useful information provided the end-user is at the same location as the library holding the database from which the record came. | None (i.e. freetext) | Yes |
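
The specification above states that Soundex will be used to produce the EULER specific identifier for finding duplicates, but does not say how. The following is a minimal sketch of the classic (American) Soundex code, followed by one hypothetical way of combining codes into a duplicate key. The function `duplicate_key` and its choice of inputs (family name, first title words, year) are assumptions made for illustration, not the project's actual identifier.

```python
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(word):
    """Return the four-character Soundex code of a word (letters A-Z only)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
            if len(result) == 4:
                break
        previous = code  # vowels reset the previous code to ""
    return result.ljust(4, "0")


def duplicate_key(family_name, title, year):
    """Hypothetical duplicate key: Soundex of name and first long title words."""
    title_words = [w for w in title.split() if len(w) > 3][:3]
    parts = [soundex(family_name)] + [soundex(w) for w in title_words]
    return "-".join(parts + [str(year)])
```

Because Soundex ignores vowels and merges similar consonants, two records whose creator names differ only by a spelling variant receive the same key. The price is false positives, so candidate duplicates found this way would still need a closer comparison.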
In order to make all databases accessible from a single interface, they are all
connected to the Internet through Z39.50.
This is a standard client/server search and retrieve protocol
(http://lcweb.loc.gov/z3950/agency/1995doce.html)
maintained by the Library of Congress
(see http://lcweb.loc.gov/z3950/agency/).
Since it was originally proposed in 1984 it has increased in popularity in
the library community, becoming one of the most used standards to achieve
interoperability in search and retrieval.
As Z39.50 is supposed to cover all possible needs for searches in
bibliographic databases, its design is quite flexible, allowing the
implementor to choose which features of the protocol to implement and how
to use these to achieve the desired functionality. In order to make it
easier to create software that can work with several databases, some
sub-standards to Z39.50 have been developed. These standards determine what features
a Z39.50 server should support and how it should respond to certain requests.
Two of these standards are used within EULER;
Compliance with these kinds of standards also makes it easy to search several databases simultaneously through a single user interface, saving time and effort for the end user. This technology will be used in EULER in order to let end users search all front end databases within the project from a single interface. This interface could be any Z39.50 client, but since most Z39.50 clients require fairly deep knowledge of the protocol, we will use a web gateway for accessing the databases.
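The idea of querying several autonomous databases at once and merging the answers can be sketched as follows. This is not the EULER gateway: the function name is hypothetical, and each database is represented by an ordinary Python callable standing in for a Z39.50 search, so that the sketch stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(backends, query, timeout=10.0):
    """Send one query to every backend in parallel and merge the hits.

    `backends` maps a database name to a callable taking the query and
    returning a list of records (a stand-in for a real Z39.50 search).
    A backend that fails or does not answer in time is reported in
    `failed` instead of aborting the whole search.
    """
    hits, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        futures = {name: pool.submit(search, query)
                   for name, search in backends.items()}
        for name, future in futures.items():
            try:
                for record in future.result(timeout=timeout):
                    hits.append((name, record))
            except Exception:
                failed.append(name)
    return hits, failed


def _demo_catalogue(query):
    return [query + " (record 1)", query + " (record 2)"]


def _demo_unreachable(query):
    raise ConnectionError("database not reachable")


demo_hits, demo_failed = search_all(
    {"catalogue": _demo_catalogue, "preprints": _demo_unreachable}, "zeta")
```

Treating a failing database as a partial result rather than an error matches the distributed design, in which every supplier is autonomous and may be temporarily unavailable.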
After the selection of fields included in the project specific DC profile, a EULER Z39.50 profile was created by mapping the fields into the Z39.50 protocol. EULER uses DC as a switching language, thus it would have been natural to use a DC Z39.50 profile. However, no such profile is yet available. In the meantime, some other profile must be used as a vehicle for the EULER profile. GILSv2 was singled out as the interim solution since it is a fairly generic profile intended to describe documents. Again, this is merely a temporary solution for the communication over Z39.50. EULER is not trying to achieve full GILSv2 compliance. This is reflected by the fact that the interim EULER solution does not conform to GILSv2 semantic or syntactic standards (for all fields). Furthermore, Bib-1 use-attributes were used when these were deemed more appropriate for the contents of our fields. EULER partners will monitor the standard developments during the project and try to switch to a DC profile, if/when possible.
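To illustrate what mapping fields onto Bib-1 use-attributes means in practice, the sketch below builds a query in Prefix Query Format (PQF), a textual notation for Z39.50 queries used by some toolkits. PQF is not mentioned in the EULER deliverable, and the field-to-attribute table is an assumption chosen from well-known Bib-1 use-attribute values, not the actual EULER profile.

```python
# Assumed mapping for illustration only; not the EULER profile.
# The numbers are Bib-1 use attributes (4 = Title, 1003 = Author,
# 21 = Subject heading, 31 = Date of publication, 1018 = Publisher).
EXAMPLE_USE_ATTRIBUTES = {
    "DC.Title": 4,
    "DC.Creator": 1003,
    "DC.Subject": 21,
    "DC.Date": 31,
    "DC.Publisher": 1018,
}


def pqf_term(field, term):
    """Return one PQF clause searching `term` in the given profile field."""
    if '"' in term or "\\" in term:
        # Quoting rules are left out of this sketch on purpose.
        raise ValueError("quotes and backslashes are not supported here")
    return '@attr 1=%d "%s"' % (EXAMPLE_USE_ATTRIBUTES[field], term)


def pqf_and(clauses):
    """Combine clauses with AND; PQF is prefix notation, so operators nest."""
    if not clauses:
        raise ValueError("at least one clause is required")
    query = clauses[-1]
    for clause in reversed(clauses[:-1]):
        query = "@and %s %s" % (clause, query)
    return query


example_query = pqf_and([
    pqf_term("DC.Title", "zeta function"),
    pqf_term("DC.Creator", "Riemann"),
])
```

The point of the table is that the user interface and the stored records speak in terms of the DC-based profile, while only this thin layer needs to change if the project later switches from the interim profile to a DC profile for Z39.50.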
Since the architecture of EULER is supposed to handle resources from all over Europe, we must consider the differences in the alphabets used in different countries. UNICODE was considered for the project but was ruled out for three reasons. First of all, current input devices (i.e. keyboards) do not cater very well for UNICODE. It is hard to enter characters outside the ISO 8859-1 range. Most users would thus not be able to enter special characters of foreign languages. Secondly, most fonts provided by popular operating systems do not contain glyphs for generic UNICODE. This means that browser programs are not able to display characters not provided by the font. Finally, the software used in the project cannot handle UNICODE. Instead, we use ISO 8859-1 as the basic searching mechanism and also provide a mapping to 7-bit ASCII for those fields that contain national characters. This mapping can be used to search for records containing characters which cannot be entered on all keyboards. For display purposes we use HTML entities, which makes it possible to display national characters today and will make it easy to use new entities when they become widely available in new versions of HTML.
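The two mappings described above, a 7-bit ASCII form for searching and HTML entities for display, can be sketched with Python's standard library. The function names and the small table of special cases are assumptions made for this illustration; the actual EULER mapping tables are not given in this paper.

```python
import unicodedata
from html.entities import codepoint2name

# Letters without a canonical decomposition need explicit replacements.
# This table is an assumption for the sketch, not the EULER mapping.
_SPECIAL_CASES = {"ß": "ss", "ø": "o", "Ø": "O", "æ": "ae", "Æ": "AE",
                  "ð": "d", "Ð": "D", "þ": "th", "Þ": "Th"}


def to_ascii_search_form(text):
    """Fold national characters to 7-bit ASCII, e.g. for a search index."""
    folded = []
    for char in text:
        if char in _SPECIAL_CASES:
            folded.append(_SPECIAL_CASES[char])
            continue
        # Split an accented letter into base letter plus combining marks
        # and keep only the ASCII part.
        decomposed = unicodedata.normalize("NFKD", char)
        folded.append("".join(c for c in decomposed if ord(c) < 128))
    return "".join(folded)


def to_html_entities(text):
    """Replace characters by named HTML entities where one exists."""
    encoded = []
    for char in text:
        codepoint = ord(char)
        if codepoint in codepoint2name:  # also covers &, <, > and "
            encoded.append("&%s;" % codepoint2name[codepoint])
        elif codepoint < 128:
            encoded.append(char)
        else:
            encoded.append("&#%d;" % codepoint)  # numeric fallback
    return "".join(encoded)
```

A record field would then be stored once in its original form and once in the folded form, so that a user without the relevant keys can still type `Gottingen` and find `Göttingen`.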
The next step in the development of the EULER service will be to make it possible for end-users to search (all of) these databases simultaneously from one single easy to use interface. This will be done through a Z39.50 to WWW gateway. Through this gateway, users will be able to search the EULER databases and view the results. Along with the gateway itself there will be utilities, such as thesauri and classification browsers. Together, the gateway and the utilities are called the EULER engine.
To implement the user interface to the EULER engine, widely available standard web technologies will be used. The basic functionality will only require support for HTTP (Hyper Text Transfer Protocol, the standard used to transfer files over the World Wide Web (WWW)) and HTML (Hyper Text Markup Language). Additional functionality may be achieved if the browser program supports JavaScript.
In this paper the word design has two meanings. It refers to the technical architecture of the service, which makes it possible to achieve the goals of the project. It also refers to design considerations concerning the future functions and layout of the service. An end-user study has been performed within the framework of the project and will be considered the most important source of input for the actual user interface design.
The gateway should provide:
Access control to the gateway will use regular HTTP methods, making it possible to use whatever standard package for user administration is available. Access to the databases will be controlled by IP number, allowing each provider to select who should have access to their data.
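Access control by IP number on the provider side could look roughly like the sketch below. The network list and the function name are hypothetical (the addresses come from ranges reserved for documentation), and nothing here is taken from the actual EULER configuration.

```python
import ipaddress

# Hypothetical allow list of one provider: a whole network and a single host.
EXAMPLE_ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_allowed(client_ip, networks=EXAMPLE_ALLOWED_NETWORKS):
    """Return True if the client address lies in one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)
```

In such a setup a provider would typically list the address of the gateway, and possibly of trusted Z39.50 clients, and refuse all other connections.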
Helper applications will consist of at least the following two utilities:
There are several reasons why the final user interface of the service, i.e. what it will look like in terms of layout etc., is not covered in this paper. The most obvious one is that the World Wide Web is changing constantly, so the end-user interface is better developed during the project than laid down from the start. Thus only some functions and utilities that might be of use (from an end-user perspective) are presented. The starting point for the described considerations is that, potentially, many EULER users are found outside the library community, i.e. end-users of the service cannot be assumed to be advanced information searchers. Hence, the design will be aimed at ease of use rather than at features that can only be used by information retrieval experts.
EULER started in April 1998 and has funding until the autumn of 2000. The partners hope to present an alpha version of the service during the summer of 1999. More information on the project (and later the alpha version of the service) is available at the EULER homepage http://www.emis.de/projects/EULER/. The authors of this article work at Lund University Library development department NetLab (http://www.lub.lu.se/netlab/) and their e-mail addresses are anna.brummer@lub.lu.se, marten.berggren@lub.lu.se.
(1) Dublin Core home page
(http://purl.oclc.org/metadata/dublin_core/) Visited October 1998, this page has now disappeared (January 1999)
(2) Dublin Core Resource Types, Structuralist DRAFT: July 24, 1997
(http://sunsite.berkeley.edu/Metadata/structuralist.html) Visited October 1998
(3) IMT, Internet Media Types
(http://www.isi.edu/in-notes/iana/assignments/media-types/media-types) Visited October 1998
(4) URN, Uniform Resource Names
(http://www.ietf.org/html.charters/urn-charter.html) Visited October 1998
(5) TGN, Getty Thesaurus of Geographic Names
(http://www.gii.getty.edu/vocabulary/tgn.html) Visited October 1998