EULER: European Libraries and Electronic Resources in Mathematical Sciences
Telematics for Libraries Project LB-5609

 

Resource Adaptation in EULER

Michael Jost

May 19, 2000

Version 1.0

EULER Project Deliverable
Project Name: European Libraries and Electronic Resources in Mathematical Sciences
Project Acronym: EULER
Project Number: LB-5609
Deliverable Title: Resource Adaptation in EULER
Deliverable Number: General introduction to all D2.* Deliverables
Version Number: 1.0
Date: May 19, 2000
Principal Author(s): Michael Jost 
FIZ Karlsruhe, Dept. Math & Comput. Sci 
Franklinstr. 11, D-10587 Berlin, Germany 
e-mail: jo@zblmath.fiz-karlsruhe.de 
Tel: +49 30 3999340 
Fax: +49 30 3927009
Other Author(s): with contributions from other D2.* authors
Deliverable Kind: Report
Deliverable Type: Public
Abstract: This document is a general introduction to the five Task documentations that are available as separate documents. It is meant as an overview for the whole resource adaptation work package of the EULER project and states those facts that are of importance for all single Tasks.
It consists of a short overview of EULER aims in general and of the resource adaptation work package in particular, including an overview of the information providers in the project and the data(bases) they have provided. This is followed by a high-level description of the methods used, including specifications that are relevant to all tasks. Ongoing and further work after the completion of the resource adaptation tasks is described. 

 

Table of Contents

1. Executive Summary
2. Introduction
3. Information Providers and Their Metadata
4. Description of Method Used / Work Done
4.1 EULER's Dublin Core based resource description format
4.2 Overview of Adaptation Tasks
4.3 Deduplication
4.4 Diacritics conversion
5. Ongoing and further work
6. References

A. Annexes
A.1. EULER-DC specifications
A.2. The Metadata Postprocessor

1. Executive Summary

This document is a general introduction to the five Task documentations that are available as separate documents (see links below). It is meant as an overview for the whole resource adaptation work package of the EULER project and states those facts that are of importance for all single Tasks. For more information on the EULER project see the project homepage at http://www.emis.de/projects/EULER/.

The following sections give a short overview of EULER aims in general and of the resource adaptation work package in particular, including an overview of the information providers in the project and the data(bases) they have provided. This is followed by a high-level description of the methods used, including specifications that are relevant to all tasks (such as the EULER specific specification of Dublin Core Element Set usage, indexing strategy, and provisions for de-duplication). Ongoing and further work after the completion of the resource adaptation tasks is described.
 

2. Introduction

Aims of EULER

Since April 1998 the European Commission has been funding the EULER project in the framework of the `Telematics for Libraries' sector of the Telematics Applications programme. The main goal of EULER is to integrate different, electronically available information resources in the field of mathematics. EULER aims to construct a digital library in mathematics from existing heterogeneous sources.

There's a rapid increase in the number of networked resources with information on scientific results and ongoing developments in the field of mathematics. Today, the user has to switch between a growing number of systems with heterogeneous user interfaces:

These resource types are considered to be the most frequently used when conducting searches for scientific results. They are rarely interconnected and users have to search them one by one.

The aim of the EULER project is to offer a one-stop-shopping site for users interested in mathematics. One single integrated network-based access point has been developed, covering the mentioned publication-related information resources on mathematics. A common user interface, available on the World Wide Web, allows homogeneous access to all integrated information types. The interface was developed in close cooperation with the mathematical user community. Only one search is necessary to generate a broad range of (mixed) hits, irrespective of resource type and information provider. The EULER service was developed starting with selected important information sources from the consortium partners. The goal is to design an open architecture, so that new sources of data from other information providers and libraries can easily be added later.

The integration approach makes use of common resource descriptions based on the Dublin Core (DC) element set and access to those descriptions via the Z39.50 protocol. Technically, all information providers produce DC metadata for their resources and offer them as distributed databases, which are located at the providers' sites. The central EULER Engine queries these databases in parallel via a common Z39.50 profile and performs result set merging and presentation formatting. The integration approach takes into consideration the requirements of the user community and the different information providers. Participating institutions are still autonomous in deciding on their scientific and organisational policies, while at the same time providing a common access strategy to their information services. The foremost requirement to achieve such an aim was to choose and apply suitable standards, formats and protocols.
 

Aims of the Resource Adaptation Work Package

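The aim of this Work Package was to build the basic set of EULER Metadata Databases that are finally accessible from the EULER Engine. The Work Package was subdivided into five Tasks covering the five initial resource types that are part of EULER:

* Task 1: Bibliographic Databases
* Task 2: OPACs
* Task 3: Preprint Servers
* Task 4: E-journals
* Task 5: Internet Mathematical Resources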

Bibliographic Databases and OPACs cover the broader scenario of automatic metadata to metadata conversion.

Refereed Electronic Journals, Preprint Servers, and Mathematical Internet Resources cover the broader scenario of resources harvesting, metadata creation (automatically or manually), and access to networked resources.

From a technical point of view, Tasks 3, 4 and 5 on the one hand, and Tasks 1 and 2 on the other, share similar approaches and tools.

All resources were made accessible as standardised EULER Z39.50 metadata databases as specified in the documentation of the EULER gateway engine (to be published).
 
 

3. Information Providers and Their Metadata


Zentralblatt MATH

Zentralblatt für Mathematik und ihre Grenzgebiete was founded in 1931 by O. Neugebauer and is today the longest-running abstracting and reviewing service in the field of mathematics. It covers the entire spectrum of mathematics incl. applications in computer science, mechanics, physics, etc. Citations are classified according to the internationally accepted Mathematics Subject Classification (MSC). It contains references to the worldwide literature drawn from currently about 2000 journals and serials, from conference proceedings, collections of papers and books. In the course of the European extension of Zentralblatt operations, the service was recently renamed Zentralblatt MATH, to shorten the lengthy German title. Zentralblatt MATH publishes about 60.000 abstracts and reviews per year produced by more than 5000 scientists; the reviews are mainly written in English, but some also in French and German. Published by Springer-Verlag, Zentralblatt MATH is edited by the European Mathematical Society (EMS), Fachinformationszentrum (FIZ) Karlsruhe, and the Heidelberger Akademie der Wissenschaften.
The data set from MATH that was finally available in the EULER demonstrator service consists of 264176 records, namely the full Zentralblatt MATH production of the period January 1996 - March 1999 (196406 records) plus all monographs (67770 records, including dissertations and similar publications) that are referenced earlier in Zentralblatt MATH since its beginning (1931).
 

Niedersächsische Staats- und Universitätsbibliothek Göttingen

The Niedersächsische Staats- und Universitätsbibliothek (SUB) Göttingen, founded in 1734, is one of the five biggest libraries in Germany. Its stock amounts to approximately four million books. About 16.000 journals are subscribed to. In addition, the library holds more than 12.000 manuscripts, 350 literary bequests and 3.100 incunabula. Depending on the time period covered and the subject, internationally accepted classifications (MSC, CCS etc.) are used. With the financial support of the Deutsche Forschungsgemeinschaft (DFG) the library collects comprehensively in more than twenty special fields - one of these is the field of Pure Mathematics ('Sammelschwerpunkt Reine Mathematik 17.1').

The data set from the Niedersächsische Staats- und Universitätsbibliothek is hosted on three servers of the EULER demonstrator. The OPAC server contains about 89.000 bibliographic records with a wide spectrum of resource types (this number will change if the extraction is adapted again during the demonstration phase). The preprint server holds about 42.500 preprints and research reports, and the quality controlled information gateway server offers about 1.100 EULER metadata sets (which were the basis for the EULER mathematical web index).
 

Centrum voor Wiskunde en Informatica

The CWI (carrying its present name since the early eighties, but founded in 1946 as Mathematical Centre) is the National Research Institute for Mathematics and Computer Science in the Netherlands, largely sponsored by the Dutch national research organization NWO (see http://www.cwi.nl/). CWI performs frontier research in mathematics and computer science and transfers new knowledge in these fields to society in general and trade and industry in particular. It is one of the founding members of the European Research Consortium for Informatics and Mathematics (ERCIM; see http://www.ercim.org/).
The CWI library is a supporting department of the CWI (http://www.cwi.nl/cwi/departments/BIBL.html). Historically it has a large and extensive collection of literature in the fields of mathematics and (theoretical) computer science. As a consequence of this large collection, the CWI library not only provides services to CWI's research staff (approx. 180 scientists), but also has a central supportive position for mathematical and computer science research in The Netherlands. The literature is mainly postgraduate and research material.  All the data (approx. 185,000 records) from the OPAC are presently being provided to the EULER service:

The CWI library is also involved in making available the electronic preprints from CWI. These are provided to the EULER service as well: the CWI reports/preprints that are available electronically as full text are provided by the CWI reports/preprints server, but their paper copy equivalents are described in the OPAC as well. There is currently no link between the OPAC and the CWI reports/preprints server.

The full text electronic CWI reports/preprints have been available online since 1989. Until 1995 the files were submitted by the CWI scientists to an ftp server on a voluntary basis. Since 1995 the reports/preprints have been submitted by the CWI Publication Dept. as part of a quality assurance and clearing procedure related to the production of paper based preprints. As a consequence of the voluntary submission in the early period of online availability, the CWI electronic reports/preprints available online do not entirely match the actual paper based reports/preprints production from 1989 - 1994. In that period not every CWI scientist cared to put their reports/preprints on the CWI ftp server.

In the EULER project the CWI Library also contributed to the selection and input of relevant resources from the Internet for a quality controlled service.
 

University of Florence

The University of Florence (http://www.unifi.it) has promoted the creation of the Italian National Bibliographic Service (SBN) since the beginning of the eighties. In this context, a centralized library management system became operational in 1985. The system now supports the complete life cycle of documents, from purchasing to circulation, for all the University libraries (5 wide thematic libraries and over 30 funds belonging to departments and centers). Cataloguing functions and rules operate on-line with the national union catalogue (SBN Index), so that about 65% of local catalogue records are imported from the SBN Index. Libraries create bibliographic records for documents, monographs and serials pertinent to all areas of interest of the existing Faculties (11) and Courses (71). At the end of April 2000 the catalogue comprised over 350.000 bibliographic records (including 19.000 serials), over 520.000 holdings and about 224.500 authors. The OPAC (http://opac.unifi.it) has been operational since 1995 and provides catalogue searching and browsing and holdings localisation through a WWW interface and Z39.50 access. Records are exported every night from the library management system to the OPAC database in UNIMARC format. The OPAC server for EULER contains a subset of the central catalogue (OPAC) records concerning mathematics and computer science. At the end of March 2000, 17691 records had been extracted (16634 monographs and 1057 serials) with 27278 holdings.
 

Cellule de Coordination Documentaire Nationale pour les Mathématiques

La Cellule de Coordination Documentaire Nationale pour les mathématiques (MDC, http://www-mathdoc.ujf-grenoble.fr) is a national team for the coordination of French mathematical documentation and its access via the web. In the EULER project, two important mathematical research libraries are associated with MDC:

Orsay Mathematical Library
The Orsay mathematical library (Bibliothèque mathématique d'Orsay, BMO) is a department library, situated in the mathematics department of Université Paris Sud. BMO also plays a regional and national role in mathematical documentation for researchers. The domain is exclusively pure and applied mathematics. These fields are largely covered. Important collections: complete works, history of mathematics. The holdings are: Books, theses, proceedings: 55000; Serials: 690 collections (460 current subscriptions). Metadata for all these are being provided to the EULER system.

Strasbourg Mathematical Library
The Strasbourg mathematical library is also a department library, situated in the mathematics department of Université Louis Pasteur, Strasbourg. The library has existed since the end of the 19th century. The domain is exclusively pure and applied mathematics. The library is for researchers only. Important collections: ancient monographs and journals from the 19th and early 20th century. 1600 documents are from before 1860, half in German, a third in French, the rest in Latin, English or Italian. The oldest monograph is from 1499, and the oldest journal is from 1665. In the last twenty years, the most important field covered has been pure mathematics. Associated domains such as applied mathematics and computer science are partially covered. The holdings are: Books, theses, proceedings: 36750; Serials: 378 current subscriptions. Metadata for all these are being provided to the EULER system.

Online Preprints and Theses
In addition to the library OPAC data, MDC makes available to EULER all records from its national online grey literature index (http://www-mathdoc.ujf-grenoble.fr/Harvest/brokers/prepub/query.html).
As elsewhere, French mathematicians make their scientific results public through articles, books and preprints. The preprints are generally made available by the institutes as paper copies and have also been available on the Internet for a few years in electronic form (usually PostScript or DVI files, sometimes compressed). In 1997, MDC initiated the French online grey literature project (sometimes known as "math-prepub"). The idea was to use an internet harvesting tool to gather metadata for mathematical grey literature into one national index. Currently (April 2000) 19 institutions participate in the project, and end users can also enter their metadata directly on MDC's web site. Metadata for 1667 online preprints and theses has been added to EULER. This figure grows by a varying amount every month.
 

NetLab, Lund University Library

NetLab is the Research and Development Department at Lund University Library. The department primarily runs developmental projects focusing on digital library and netbased information discovery and retrieval, mainly in Internet/WWW environments. NetLab has, for the EULER-project, created an automatically collected subject limited web-index.
A web-index is a searchable collection of documents found on the World Wide Web. Creating such an index involves using a harvesting program (often called a spider or a robot) to collect pages, using another program to index the data and, thirdly, a gateway and search program to display the index and make it searchable.
Creating a web-index can be done automatically by feeding a list of URLs into a robot program that will try to fetch the corresponding documents from the web. The robot is able to extract new URLs from the documents it finds. The new URLs can be used to fetch more documents. This procedure, a harvesting step, can be repeated again and again. The quality of such a service depends on a number of factors, notably the relevance and quality of the URLs that are used as the starting point and the number of harvesting steps decided upon. The starting point for the harvesting procedure for the EULER web-index was MathGuide's quality assessed mathematical web pages. The present index contains ca 103,500 records collected in a two step process from the original URLs.
The harvested records essentially contain document text, title, headers, links to other documents and possibly metadata, though in practice very few automatically collected web documents contain metadata. Thus the web-index will rarely offer more than fulltext and title searches, or combinations of these.
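
The repeated harvesting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the robot software actually used for the EULER web-index: the function name `harvest` and the caller-supplied `fetch` callable (which returns the URLs found in a document) are hypothetical, and real harvesting would add politeness rules, error handling and subject filtering.

```python
def harvest(seed_urls, fetch, steps=2):
    """Fetch documents in a fixed number of harvesting steps.

    `fetch(url)` is a hypothetical callable returning the list of URLs
    found in the document at `url`. Returns a dict mapping every fetched
    URL to the links extracted from it.
    """
    fetched = {}
    frontier = list(dict.fromkeys(seed_urls))  # keep order, drop repeats
    for _ in range(steps):
        queued = set()
        next_frontier = []
        for url in frontier:
            if url in fetched:
                continue  # never fetch the same document twice
            links = list(fetch(url))
            fetched[url] = links
            for link in links:
                # New URLs feed the next harvesting step.
                if link not in fetched and link not in queued:
                    queued.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


# Toy link graph standing in for the web (illustrative data only).
toy_web = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
pages = harvest(["a"], lambda url: toy_web.get(url, []), steps=2)
print(sorted(pages))
```

With two steps, the sketch fetches the seed pages and the pages they link to, but not pages that are two links away from a seed.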
 

European Mathematical Society / Technical University of Berlin

The Electronic Library of Mathematics (ELibM) was founded in 1995 as part of the European Mathematical Information Service (EMIS), a service by the European Mathematical Society (EMS). All material carried in mathematical journals hosted by ELibM is peer-reviewed and the quality supervised by the Electronic Publishing Committee of the EMS. ELibM covers journals, monographs, collections and classical works in electronic form. The scope of the collection encompasses all of mathematics.
Journals are either produced/hosted exclusively by ELibM or mirrored from the journal's own web site. ELibM itself is mirrored via the mirror system of EMIS which encompasses more than 30 mirrors around the world.
ELibM currently carries the full texts of 36 electronic journals. The total number of full texts is currently 5071 (as of 25 April 2000). The projected growth rate in terms of full texts is about 20% for 2001. About 95% of these texts are indexed in EULER.
 

4. Description of Method Used / Work Done

In this chapter we describe the definition of the common EULER Dublin Core based resource description format. This is followed by a short overview of the resource adaptation tasks, which contains links to the full task documentations. Finally, the deduplication method used in the project and the handling of diacritics for search and display are described.
 

4.1 EULER's Dublin Core based resource description format

Searching various databases concurrently is a complicated matter for several reasons. Their records differ, the fields covered differ, the syntax within the fields differs, and access protocols and query languages differ. The means chosen to realize the project goals was to develop a joint database record profile. Individual fields in each original database were mapped to this common profile.

To create the common record profile, the EULER partners needed a known resource description method, which could serve as a "switching language" between individual database records. For this purpose the project decided to use the Dublin Core Metadata Element Set (DC), with qualifiers (the Dublin Core home page is available at http://purl.org/DC/). Dublin Core is

a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries.

This decision was based partly on the fact that DC has become a widely used de facto standard for resource description on the Internet (with special interest from the library community), and partly on the outcome of the evaluation of existing formats done in the EU Telematics for Research (4th Framework) project DESIRE. Finally, since DC is quickly becoming a widely used format, it helps to ensure semantic interoperability between EULER and other services on a European and global level.

In the mapping process the partners tried to find fields that semantically matched the fifteen basic elements of DC in each of the databases that were to be converted. The fifteen DC elements were however not sufficient for the functionality that the end-user was believed to appreciate. In a few cases some needed (and available) information could not be matched with any of the 15 basic DC elements (and their qualifiers), thus they were put in a EULER-specific hierarchy.

In the selection process efforts were made to exclude fields that only a few partners (involved in EULER today) provide data for. There are however exceptions to the rule. For example, fields judged to become useful (in a foreseeable future), were not discarded. There were two important reasons for these selection rules. First, an architecture should be as generic and simple as possible, avoiding complexity of special solutions for each supplier. Secondly, users would not be aware that they do not search all records when they select a certain field but only those that have data in that field.

Specific schemes have been assigned to most EULER DC fields, so that the EULER participants make their data conform both in the fields covered and in the structure and format of the information in those fields.

The EULER DC specifications are detailed in Annex A.1; the Z39.50 mappings for those EULER DC fields will be given in the EULER Gateway Engine documentation.
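
The mapping of provider-specific fields onto the common profile can be pictured with a small Python sketch. It is purely illustrative: the source field tags (`au`, `ti`, `py`, `xx`) and the target labels are assumptions made for this example, not the actual EULER DC field names, which are those of Annex A.1. Fields without a counterpart in the common profile are simply left out, in line with the selection rules described above.

```python
# Hypothetical mapping from one provider's field tags to labels based on
# Dublin Core elements; neither side is taken from the EULER specification.
FIELD_MAP = {
    "au": "DC.Creator",
    "ti": "DC.Title",
    "py": "DC.Date",
}


def to_common_profile(record, field_map):
    """Convert one source record (a dict) to the common profile.

    Values are collected in lists because an element may be repeated;
    source fields that are not mapped are dropped.
    """
    converted = {}
    for field, value in record.items():
        target = field_map.get(field)
        if target is None:
            continue  # no counterpart in the common profile
        converted.setdefault(target, []).append(value)
    return converted


source_record = {
    "au": "Hinrichsen, Diederich",
    "ti": "A canonical form for multinomial systems",
    "py": "1997",
    "xx": "local shelf mark",  # provider-specific field, not mapped
}
print(to_common_profile(source_record, FIELD_MAP))
```

Each provider would maintain its own mapping table, while the conversion routine and the target profile stay the same for everyone.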
 

4.2 Overview of Adaptation Tasks


Task 1: Bibliographic Databases

This task has implemented automatic format conversion of a selected set of the data of MATH Database (Zentralblatt MATH, more than 1.500.000 entries covering the world-wide mathematical literature from 1931 to present) from the proprietary MATH format to the common EULER Dublin Core based format. Additional data from the "Jahrbuch über die Fortschritte der Mathematik" may be included in this conversion process if resources permit. "Jahrbuch" covers the most important literature in mathematics from 1868-1943. The original data were enriched with necessary means of identification (e.g. internationally accepted standard document identifiers for single articles) to enable the effective exchange of data between different systems.

Deliverable: D2.1 A Frontend Dublin Core Database for Zentralblatt MATH
 
 

Task 2: OPACs

This task has produced routines for the automatic extraction of (subsets of) relevant OPAC entries and conversion routines for the format conversion to the EULER Dublin Core based format, including automatic update procedures. Standardised EULER metadata databases with these data were made accessible to be queried by the central EULER Engine. Connections to existing online document delivery services (e.g. at SUB Göttingen) are included, based on forwarding relevant data directly to the ordering systems.

Deliverable: D2.2 The Creation of OPAC Metadata Databases for the EULER Service
 
 

Task 3: Preprint Servers

This task has produced frontend DC metadata databases for preprint series and other grey literature of scientific institutions that are electronically available through the Internet. The metadata databases were produced by automatically gathering/harvesting the original sources and automatically converting them to Dublin Core. The databases integrate preprint referencing, metadata search and full-text retrieval.

It was sufficient to include only preprints of relatively few institutes in this task. Other (national) initiatives such as the German DFN project MathNet, the MPRESS initiative, and the collections of French preprints have been informed of EULER results and requirements, and asked to contribute to a common development. For the purpose of proof-of-concept, trials with importing data from MPRESS have been carried out. Likewise, EULER has monitored activities in this sector in WP-1, and has refined its specifications to enable concerted international actions.

Deliverable: D2.3 A Front-end Dublin Core Database for Preprints and Research Reports
 
 

Task 4: E-journals

The goal of this task was to systematically add metadata descriptions to a carefully selected set of high-quality peer reviewed electronic mathematical journals and other publications. Metadata descriptions for these electronic publications were

* used to generate a EULER DC metadata database, thus enabling users to effectively search for electronic publications in the same way as in all other EULER metadata databases.

* used in the production process of the electronic journal issues themselves, by means of fully automated procedures that generate all the necessary index and table-of-contents pages as well as the homepages of individual journal articles out of the metadata descriptions. Costly and time-consuming manual preparation (as it was done before) was eliminated. Metadata have substantially facilitated and sped up the preparation of the final electronic product, and enhanced its quality and usability.

* used to provide the basic means of enabling Online Delivery of the publications irrespective of protocols and formats. This is an important point when it comes to technology changes.

The collection of electronic publications covered in this task of the EULER project was taken from the Electronic Library of Mathematics, that is distributed through EMS's system of Internet servers, EMIS, http://www.emis.de/.

Deliverable: D2.4  Metadata for Electronic Journals in Mathematics
 
 

Task 5: Internet Mathematical Resources

The goal of this task was to comprehensively collect publications, information, resources and services in Mathematics published on the Internet, to offer them as a searchable and browseable service and to prepare the integration with the other "more traditional" bibliographic databases and fulltext publications in project EULER by creating DC metadata records for them.

Sub-task a has developed a quality controlled information gateway for Mathematics, carefully selecting, describing and organizing Internet resources in this subject area. For this purpose the data, approach and solutions of the DFG project MathGuide (http://www.MathGuide.de/) were used and adapted for use in EULER. In addition, the participants co-operated by selecting further relevant mathematical resources from the Internet.

Sub-task b has used a harvesting robot to systematically and automatically gather "all" mathematical resources on the Internet into a "Mathematical Web Index". A robot generated "Mathematical Web Index" consisting of "all" mathematical Web pages and resources on the Internet with focus on HTML pages was installed. This builds upon the robot software developed for the project DESIRE and methodologies first tested in the "Engineering Electronic Library, Sweden (EELS)" (http://www.ub2.lu.se/eel/eelhome.html) project. To increase the quality of this database, a Dublin Core Metadata creation and support site for publishers of European mathematical Web pages was offered, connected to the database.

Deliverables:
D2.5.1 The Creation of a Quality Controlled Information Gateway for the EULER Service
D2.5.2 The creation of a web-index for the EULER service
D2.5.3 A public Dublin Core Metadata creation and support site for Mathematicians
 
 

4.3 Deduplication

In a distributed repository like the EULER service, information about many documents will be present in more than one of the participating databases. It is a well-known and difficult challenge to recognize these duplicates and present them to the user in a consolidated form. This was solved in the EULER service by extracting from each metadata record a `deduplication key' containing elements from title, author and year of publication.

In this chapter we describe how the issue of merging duplicate entries in one or more connected databases was addressed in EULER. This is done in a way that combines identification, sorting of result lists, and elimination of duplicates of two types: local duplicates that arise in a single participating database (e.g. an on-line preprint and a report in the same partner's paper library), and nonlocal duplicates - items that appear in more than one database.

In a truly distributed environment, it is not possible to recognize duplicates in batch, i.e., build or maintain lists of duplicate entries.  Such an action would require the complete database to pass through a central point that would collect the data and recognize the duplicates.  Therefore, a method is required that can be performed at each of the partners' sites individually.
 

4.3.1 Description of Method Used

Some library systems such as PICA are known to have a form of `record identifier' that is built by taking a few letters/digits from elements such as the author and title of a publication.  EULER's deduplication key is similar, but unlike library systems, does not rely on authorized stop word lists; the latter fits less well in a multilingual context like the EULER service.
 

Construction of the deduplication key

The deduplication key is built in the metadata postprocessor, also known as ``the ISO-tool'' (described in more detail in annex 2).  The metadata postprocessor is a C program `iso' that runs at each partner's site, and receives raw metadata records built by a converter specific to the EULER partner (see EULER deliverables for task 2).

The key is stored in the IDE field (EULER.Identifier) and consists of 5x4 characters in the following format:
 

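YYYYAAAATTTTUUUUVVVV
where YYYY is the year, AAAA is taken from the author name, and TTTT, UUUU and VVVV are taken from three words of the title (see the examples in section 4.3.2).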
 

Deduplication at Zebra Server/EULER engine level

Actual deduplication is done by the EULER Engine. Deduplication is activated technically by using the IDE field (the deduplication key) for sorting. The queried partners' Zebra servers send the data matching the query to the EULER Engine in sorted order. The engine then merges the sorted result lists into a single list and asks the Zebra servers for more data if deduplication resulted in fewer displayable `hits' than the user has asked for (typically 20).

Since the deduplication key starts with the publication year (or other year), the resulting sorting order after deduplication is necessarily by year.  It was chosen to use reverse sorting, so most recent items are shown first.
 

Display of de-duplicated entries

After de-duplication by the EULER Engine, only one entry of each set of entries with the same deduplication key is shown to the user in ``consolidated form'', followed, as usual, by links the user can click to inspect the various sources (full records).
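
The merging and consolidation described in the two preceding subsections can be sketched as follows. This is a minimal Python illustration, not the EULER Engine's implementation: the function name, the `(key, source)` tuples and the source labels are assumptions made for the example. Each result list is assumed to arrive already sorted by deduplication key in reverse order, and the step of asking the servers for more data when too few hits remain is not modelled.

```python
import heapq
from itertools import groupby


def merge_and_deduplicate(result_lists, wanted=20):
    """Merge per-server hit lists and collapse equal deduplication keys.

    Each list holds (key, source) tuples sorted by key in descending
    order, so the most recent years come first. Returns at most `wanted`
    consolidated entries, each listing the sources that hold the item.
    """
    merged = heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True)
    consolidated = []
    for key, hits in groupby(merged, key=lambda hit: hit[0]):
        consolidated.append({"key": key, "sources": [src for _, src in hits]})
        if len(consolidated) == wanted:
            break
    return consolidated


# Illustrative hit lists from two hypothetical servers.
opac_hits = [("1997hinrcanoformmult", "opac"), ("1995deakextefamimero", "opac")]
math_hits = [("1997hinrcanoformmult", "math"), ("1995haussingoptistoc", "math")]
for entry in merge_and_deduplicate([opac_hits, math_hits]):
    print(entry["key"], entry["sources"])
```

Because all lists share the same sort order, duplicates become adjacent in the merged stream and can be collapsed in a single pass without holding the full result sets in memory.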
 

4.3.2 Analysis and/or Findings

After the outlines of the deduplication mechanism were drawn and some options for the precise way of constructing the deduplication key were discussed, two types of `crash test' were performed:
 
  1. The metadata postprocessor was extended with a feature that lists all the deduplication keys generated in a single database.  This list is then fed through a pipe of UNIX commands to produce a list of keys that appear more than once.
  2. The pre-alpha and alpha versions of the Engine were used to check duplicates across different databases.  The examples were chosen by common sense, so this form of analysis is not exhaustive.
Below, some of the more interesting findings are presented.
 
 

Similar titles by the same author

Some authors manage to produce up to 4 different works in the same year whose titles begin with the same 4 or 5 words; some of these words are short.  The following two records would get the same identifier if short words were copied into the deduplication key without further ado:
<CR>Deak, J.</CR>
<TI>Extending a family of merotopies in a screen space.</TI>

<CR>Deak, J.</CR>
<TI>Extending a family of screens in a contiguity space.</TI>

<IDE>1995deakextea---fami</IDE>

When short words are postponed, as is the case in the EULER Alpha system, the keys become
<IDE>1995deakextefamimero</IDE>
<IDE>1995deakextefamiscre</IDE>
A similar example is:
<CR>Hinrichsen, Diederich</CR>
<TI>A canonical form for multinomial systems</TI>
<IDE>1997hinrcanoformmult</IDE>

<CR>Hinrichsen, D.</CR>
<TI>A canonical form for static linear output feedback</TI>
<IDE>1997hinrcanoformstat</IDE>
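
The keys shown above can be reproduced with the following Python sketch. It is a reconstruction from the examples only, not the postprocessor's C code. The assumptions are: a word counts as short if it has fewer than 4 characters, short words are moved behind the other words in their original order, each part is cut to 4 characters and padded with `-', and the surname is the text before the first comma. The function name and the `postpone_short' switch are hypothetical, and the special handling of difficult author names is not modelled.

```python
import re


def _chunk(word, width=4):
    """Lower-case a word, cut it to `width` characters and pad with '-'."""
    return word.lower()[:width].ljust(width, "-")


def dedup_key(year, author, title, postpone_short=True):
    """Build a 5x4 character key: year, surname, three title words."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    if postpone_short:
        # Assumption: "short" means fewer than 4 characters.
        long_words = [w for w in words if len(w) >= 4]
        short_words = [w for w in words if len(w) < 4]
        words = long_words + short_words
    words = (words + ["", "", ""])[:3]
    # Assumption: surname is everything before the first comma.
    surname = re.sub(r"[^A-Za-z0-9]", "", (author or "").split(",")[0])
    return _chunk(year or "") + _chunk(surname) + "".join(_chunk(w) for w in words)
```

With `postpone_short=False' the two Deak titles collapse to the same key; with the default they differ in the last four characters.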

Statistically, the remaining cases usually concern multi-part works of the same author:
<TI>Singular optimal stochastic controls I: Existence.</TI>
<IDE>1995haussingoptistoc</IDE>

<TI>Singular optimal stochastic controls II: Dynamic programming.</TI>
<IDE>1995haussingoptistoc</IDE>

However, some authors manage to produce essentially different works whose titles begin with the same 3 nontrivial words (here, it doesn't even matter whether `over' is treated as a stop word):
<TI>Factoring multivariate polynomials over finite fields</TI>
<TI>Factoring multivariate polynomials over algebraic number fields</TI>

Difficult author names

Difficult author names are those that have non-alphanumeric characters, such as the following:
<CR>O'Hara, Jun</CR>
<CR>mac_Donald, Fred</CR>
and surnames shorter than 4 letters, such as the following 6 spellings of the same Chinese name found in various databases:
<CR>He, Xue-Zhong</CR>
<CR>He, Xuezhong</CR>
<CR>He, X.-Z.</CR>
<CR>He, X.Z.</CR>
<CR>He, X[ue] Z[hong]</CR>
<CR>He, X.Z. [He, Xue Zhong]</CR>
The following two properties of the key were motivated by these examples:
 

Journals and Proceedings

Journals and proceedings volumes are the source of a wealth of incorrectly detected duplicates.  Many have no year and no author. In keys generated by early versions of the postprocessor, hundreds of proceedings volumes were assigned the key `--------procof--the-' ("Proceedings of the...").

After adding event locations to replace an empty author field, many false duplicates still remain, and these are mostly journals.  The following are the 5 most frequently occurring journal entry keys of the CWI OPAC when short words are included (350 duplicates in total):

34 --------intejourof--
27 ----ieeeieeetranon--
12 --------bullof--the-
11 --------jourof--the-
11 --------jourof--math
After short words are shifted to the end of the title, the final list becomes a total of 150 duplicates, the worst 5 being
10 ----akadizveakadnauk
 8 --------intejourcomp
 6 --------manasciejour
 6 --------commstatin--
 6 --------buleinstpoli

Different title spellings

In some cases, titles have subtly different spellings (mostly in abbreviations) that make the deduplication mechanism fail to detect a duplicate.
<TI>Hazewinkel, M.(ed.)  1996  Handbook of algebra. Volume 1.</TI>
<IDE>1996hazehandalgevolu</IDE>

<TI>Hazewinkel, M. Ed.  1996  Handbook of algebra. Vol. 1</TI>
<IDE>1996hazehandalgeof--</IDE>

It has been suggested, but not investigated, to solve this by building lists of abbreviations.
 

4.3.3 Unresolved Points

4.3.4 Conclusions and recommendations

The construction of the deduplication key has thus far been relatively complex, where a simple and straightforward computation would have been preferable, but it is carried out by the postprocessor, which is identical for each of the partner sites.  As the postprocessor does not depend on lists of abbreviations and stop words, this system is relatively easy to maintain once the algorithm to construct the key has stabilized.

A downside of the approach taken is however that during development of the deduplication system, all partners have had to simultaneously change to a new postprocessor and re-index their data a few times.

It is worth investigating the possibility of using centrally maintained lists of stop words for titles (from, ueber, voor) to replace the 4-letter system, together with similar lists for authors (``van der'') and for abbreviations in the title (such as the problematic `vol.' vs `volume').  Considering the current performance of the deduplication system, however, the pros and cons of such an approach should be weighed very carefully.
 
 

4.3.5 References

[1] Soundex - phonetic coding algorithm, described by Donald Knuth in The Art of Computer Programming, Vol. 3, and on various web pages, e.g. http://monitor.nara.gov/genealogy/coding.html, http://www.l-ags.org/sndxhow.html
 
 

4.4 Diacritics conversion

Bibliographic resources use a variety of ways to encode and represent special characters: accented characters, ligatures, mathematical symbols and so forth.  In the EULER system, these representations are normalized to a number of standardized forms for indexing and display at the partners' sites.  This system lets users formulate queries in three different ways.

In the EULER partners' databases, diacritics are found represented in ISO-Latin-1, HTML and LaTeX encodings.  Users typically enter queries for non-straightforward words and names using plain ASCII, ISO-Latin accented characters, or the German convention for writing letters with umlauts (oe, ue, ae).

Even in a local textual database system, it is difficult to answer differently formulated queries effectively.  In the distributed EULER system, the user must choose to follow one of three conventions. Each database is treated locally by the metadata postprocessor, also known as ``the ISO tool'' (see annex 2).  The postprocessor replaces entries such as title, author and publisher by up to four different forms: three for indexing, and one for display.
 

4.4.1 Description of Method Used

One of the tasks of the metadata postprocessor is to inspect the values of the elements
DC.Creator.PersonalName (CR field)
DC.Creator.CorporateName (CRC field)
DC.Contributor.PersonalName (COP field)
DC.Contributor.CorporateName (COC field)
DC.Title (TI field)
DC.Title.Alternative (TIA field)
DC.Subject (SU field)
DC.Description (DE field),
and replace these elements, encoded in ISO-Latin, LaTeX or HTML encoding, by an HTML display form, and add up to three indexed alternatives:
CRI, COI, TII, SUI, DEI.
The first alternative is a normalized 7-bit ASCII form; the second alternative, if different, is in ISO-Latin-1 encoding; the third alternative, if different from both others, is a 7-bit ASCII form using the German ae/oe/ue spelling for umlauts and ss for ß.  The following example illustrates this procedure for the creator element:
Input:

<CR>Berggren, Mårten</CR>
<CR>Br\"{u}mmer, Anna</CR>

Output:

<CR>Berggren, M&aring;rten</CR>
<CRI>Berggren, Marten</CRI>
<CRI>Berggren, Mårten</CRI>
<CR>Br&uuml;mmer, Anna</CR>
<CRI>Brummer, Anna</CRI>
<CRI>Brümmer, Anna</CRI>
<CRI>Bruemmer, Anna</CRI>

The translations for the various representations are stored in a `dictionary file' that is read by the metadata postprocessor.
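
The derivation of the display form and the index forms can be sketched in Python as follows. This is an illustration only, not the postprocessor's implementation: the two small tables stand in for the dictionary file and are hypothetical excerpts, all function names are made up, and stripping accents via Unicode decomposition is an assumption about how the plain ASCII form could be obtained. Characters without a decomposition are left untouched by `ascii_form' in this sketch.

```python
import unicodedata
from html.entities import codepoint2name

# Hypothetical excerpts; the real rules live in the dictionary file.
LATEX_TO_CHAR = {'\\"{a}': "\u00e4", '\\"{o}': "\u00f6", '\\"{u}': "\u00fc"}
GERMAN = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",   # ä ö ü
          "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",   # Ä Ö Ü
          "\u00df": "ss"}                                   # ß


def decode_latex(raw):
    """Replace the known LaTeX sequences by the characters they denote."""
    for seq, char in LATEX_TO_CHAR.items():
        raw = raw.replace(seq, char)
    return raw


def display_form(text):
    """HTML form: entities for non-ASCII characters and for <, > and &."""
    out = []
    for ch in text:
        if ord(ch) < 128 and ch not in "<>&":
            out.append(ch)
        elif ord(ch) in codepoint2name:
            out.append("&%s;" % codepoint2name[ord(ch)])
        else:
            out.append("&#%d;" % ord(ch))
    return "".join(out)


def ascii_form(text):
    """Drop accents by decomposing characters and removing combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(text):
    """Up to three forms: plain ASCII, the input itself, German spelling."""
    german = ascii_form("".join(GERMAN.get(ch, ch) for ch in text))
    forms = []
    for form in (ascii_form(text), text, german):
        if form not in forms:  # an alternative is added only if it differs
            forms.append(form)
    return forms
```

Applied to the two creators of the example, `index_forms' yields two forms for the first name and three for the second, matching the number of CRI lines in the output shown above.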
 

4.4.2 Unresolved Points

5. Ongoing and further work


Although the resource adaptation work package of the EULER project has been finished at the time of publication of these reports, the work in other work packages is still ongoing. In the following sections we describe the status of these work packages and how their results might affect the resource adaptation reports.

Work Package 3: EULER Engine

The project has released the Alpha version of the EULER service for intermediate evaluation in July 1999. The EULER service provides the user with an intermediate version of the central EULER Engine that queries the databases described here in parallel and performs the necessary processing for result presentation. Currently, the beta version of the software is being developed, based on results from an intermediate evaluation by users and experts. The final demonstrator service can be expected to be available in June 2000. Last minute changes in profiles and other conventions might lead to minor modifications of the resource adaptation procedures and specifications described here.
 

Work Package 4: Evaluation and Demonstration

After the release of the EULER Engine beta version, selected groups of users will start system exploitation and evaluation. The work package intends to measure the suitability and scalability of the system and the satisfaction level of users with the service.

System test will evaluate the following parameters:

The task will analyse and describe system flaws and possible improvements concerning:

Results of this work package will have an impact on the exploitation of the project's results in a continued service and will show needs for further development that might lead to a revision of the current resource adaptation principles and procedures.
 

Work Package 5: Information Dissemination and Exploitation Preparations

The final exploitation plan for EULER services and other project results will be prepared, based on the results of work package 4. Commercial exploitation for future operation of EULER services and transfer of EULER results to other subject domains will be considered. Contracts within the consortium (and beyond) will ensure the continuation of EULER services after the project comes to an end.
 
 

6. Acknowledgements, References, Bibliography


Deliverables of the Resource Adaptation Work Package:

D2.1 A Frontend Dublin Core Database for Zentralblatt MATH
D2.2 The Creation of OPAC Metadata Databases for the EULER Service
D2.3 A Front-end Dublin Core Database for Preprints and Research Reports
D2.4  Metadata for Electronic Journals in Mathematics
D2.5.1 The Creation of a Quality Controlled Information Gateway for the EULER Service
D2.5.2 The creation of a web-index for the EULER service
D2.5.3 A public Dublin Core Metadata creation and support site for Mathematicians
 

Dublin Core Metadata Element Set, Version 1.1: Reference Description:
http://purl.oclc.org/dc/documents/rec-dces-19990702.htm
 
 

A. Annexes

 

A.1 EULER-DC specifications

These specifications are partially based on Dublin Core Metadata Element Set, Version 1.1: Reference Description, http://purl.oclc.org/dc/documents/rec-dces-19990702.htm . Relevant portions appear in italics.

Format of entries:
Field name
Qualified DC name - EULER shorthand
Scheme
Semantics

Title
DC.Title - TI
Scheme: None (i.e. freetext)
Title
Typically, a Title will be a name by which the resource is formally known.

Alternative title
DC.Title.Alternative - TIA
Scheme: None (i.e. freetext)
Any titles other than the main title; including subtitle, translated title, vernacular name, etc.

Personal Author
DC.Creator.PersonalName - CR
Scheme: Family name, first name(s) or initials (MARC)
A person primarily responsible for making the content of the resource.

Corporate Author
DC.Creator.CorporateName - CA
Scheme: None
A corporate entity primarily responsible for making the content of the resource.

Personal Contributor
DC.Contributor.PersonalName - COP
Scheme: Family name, first name(s) or initials (MARC)
A person responsible for making contributions to the content of the resource; including editors, translators, etc.

Corporate Contributor
DC.Contributor.CorporateName - COC
Scheme: None
A corporate entity responsible for making contributions to the content of the resource; including editors, translators, etc.

Uncontrolled keyword
DC.Subject - SU
Scheme: None (i.e. freetext)
The topic of the content of the resource: Any keyword (NOT full text).

Library of Congress Subject Headings (LCSH) keyword
DC.Subject - SUL
Scheme: LCSH
The topic of the content of the resource: Controlled keyword from the LCSH.

Mathematics Subject Classification Scheme (MSC) classification
DC.Subject - SUM
Scheme: MSC
The topic of the content of the resource: MSC code, not the explanatory text.

Dewey Decimal Classification (DDC) classification
DC.Subject - SUD
Scheme: DDC
The topic of the content of the resource: DDC code, not the explanatory text.

Computing Classification System
DC.Subject - SUC
Scheme: CCS
The topic of the content of the resource: CCS code, not the explanatory text.

Description
DC.Description - DE
Scheme: None (i.e. freetext)
Abstract, review and other freetext describing the resource. NOT the full text of a web-resource or such like.
An account of the content of the resource. Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.

Publisher
DC.Publisher - PU
Scheme: City [(Country)]: Name
An entity responsible for making the resource available. Typically the publisher of the resource.
 

Date
DC.Date - DA
Scheme: YYYY[-MM[-DD]] (ISO 8601)
Date of resource publication or availability.
A date associated with an event in the life cycle of the resource. Typically, Date will be associated with the creation or  availability of the resource.

Type
DC.Type - TY
Scheme: All of the suggested types in the "Dublin Core Resource Types, Structuralist DRAFT: July 24, 1997" (2). Some of them are more relevant for the EULER project (and might be used as a search aid, e.g. selectable from a list of resource types)

Intellectual content type of the resource described in the record.
The nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content.

Format
DC.Format - FO
Scheme: IMT (3)
IANA MIME type of the file (Internet Media Types)
The [...] digital manifestation of the resource. Typically, Format may include the media-type [...] of the resource. Format may be used to determine the software,  hardware or other equipment needed to display or operate the resource.

Physical carrier
DC.Format.x-carrier - FOP

The project's draft list of relevant EULER-specific physical carriers:

Physical carrier of information. The reason for applying this EULER-invented sub-field is that the end-user should be able to tell whether the resource described in the bibliographic record (displayed in the hitlist) is available online or not. Example: book (= paper) is the physical description, whereas monograph is an entity irrespective of how it is "delivered": in a printed version (paper) or in a file.
The physical [...] manifestation of the resource. Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software,  hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration.

URN
DC.Identifier - IDN
Scheme: URN (4)
An unambiguous reference to the resource within a given context: URN of resource described in the record.

ISSN
DC.Identifier - IDS
Scheme: ISSN
An unambiguous reference to the resource within a given context: ISSN of resource described in the record.

ISBN
DC.Identifier - IDB
Scheme: ISBN
An unambiguous reference to the resource within a given context: ISBN of resource described in the record.

URL
DC.Identifier - IDL
Scheme: URL (http://..,ftp://... etc)
An unambiguous reference to the resource within a given context: URL of resource (described in the record) or where it can be acquired

De-duplication Identifier
DC.Identifier - IDE
Scheme: EULER-specific scheme, generated uniformly by an automated procedure
An unambiguous reference to the resource within a given context: EULER's own resource identifier (in order to find duplicates)

Language
DC.Language - LA
Scheme: ISO 639-1 (2 letter codes)
A language of the intellectual content of the resource.

Terms and Conditions
DC.Rights - TC
Scheme: None (yet)
Information about rights held in and over the resource. Typically, a Rights element will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions can be made about the status of these and other rights with respect to the resource.

Metadata Creation Date
DC.Date.x-metadata-created - DMC
Scheme: numeric: YYYYMMDD
Date of the creation of the original metadata record. YYYY=Year, MM=month, DD=day. Use "01" for unknown MM or DD. Useful for SDI services.
 

EULER specific fields

EULER identifier
EULER.Identifier - IDF
Scheme: None (i.e. freetext)
An unambiguous reference to the resource within a given context:
The purpose of this field is to identify the resource in other ways than those provided by the other fields. This can be serial name, page-, issue- or volume-numbers for journal articles or similar. (Can be used differently in different databases, e.g. ISO 4-1984)

Full text
EULER.Fulltext - FT
Scheme: None (i.e. freetext)
The fulltext of web-pages and other resources available as a whole.

Event location
EULER.Event.Location - EL
Scheme: None
Location of event for/at which the resource described in the record was created.

Event date
EULER.Event.Date - ED
Scheme: YYYY-MM-DD (ISO 8601)
Date of event for/at which the resource described in the record was created.

Event name
EULER.Event.Name - EN
Scheme: None (i.e. freetext)
Name of event where document was created.

Record source
EULER.Record.Source - RS
Scheme: <Name of information provider>: <internal id>
The source for the record i.e. describes which information provider has delivered the record.

Record source URL
EULER.Record.Sourceidentifier - OI
Scheme: URL
Identifier of source record for the description delivered in EULER. URL pointing back to the original record at information providers' site.

Record creator
EULER.Record.Creator - RC
Scheme: Family name, first name (MARC)
Creator of the record (describing the resource), e.g. a reviewer.

Address for delivery information
EULER.Delivery - DI
Scheme: URL
Meant to give the URL to the library where the resource described in the record can be acquired. (Pointer to online-order forms etc.)

Additional retrieve/delivery information
EULER.Delivery.Description - DID
Scheme: None (i.e. freetext)
Additional information that a user and a local library need to retrieve/deliver the resource described in the record.
 
 

All fields are repeatable, except for Title (TI), De-duplication Identifier (IDE) and Record Source (RS).

Planned further work: shift elements from EULER specific hierarchy to DC hierarchy whenever possible.

The EULER Metadata Postprocessor ("iso-tool") generates the following five special index fields from the listed original fields for indexing and retrieval purposes. The corresponding original fields (those without "I" at the end) are normalized for HTML display.

CRI: CR, CA, COP, COC
PUI: PU
TII: TI, TIA
SUI: SU
DEI: DE




 
 

A2. The Metadata Postprocessor (ISO)

A2.1 Introduction

The postprocessor or ``ISO tool'', developed at CWI, runs at each of the partners' sites.  It reads raw metadata records produced by the partner and performs those modifications that are independent of the database's particular properties.  Its tasks include
  • Constructing the deduplication key (section 4.3)
  • Diacritics conversion (section 4.4)
  • Producing field usage statistics
ISO-tool was written in C using standard UNIX libraries and was installed without problems at the various partner sites.  With some minor modifications it can be ported to other operating systems.
     

    A2.2 Prior to use

    At the point where you use `iso', you should have prepared one or more files containing bibliographic records in a format that differs from the format proposed by NetLab in EULER Deliverable 3.1 in the following ways:
     
    1. Records begin/end with <XREC>, </XREC>, respectively
    2. The fields <CRI>, <TII>, <SUI>, <DEI> and <IDE> are omitted
    3. The fields <CR>, <TI>, <SU>, <DE> may contain either LaTeX or ISO8859-1 encoded representations (and `iso' has provisions for adding support for other representations).
    The iso tool modifies your record files.  Although it automatically recognizes records it has previously processed, it cannot re-process records once they have been translated.  Unless you can quickly re-generate the source records, it is a good idea to make an explicit backup of the original record files.
     
     

    A2.3 Synopsis and description

    Synopsis

    The postprocessor is invoked as

       iso [-b] [-d dict-file] [-kd key-dumpfile] [-sd stats-dumpfile] <files/directories>
     

    Description

    You give `iso' a number of names of files to process.  These are modified in place, generating backup files if the `-b' flag is given. If you don't specify any file names, `iso' will read standard input, and write to standard output.  Directories are traversed recursively.

    The program reports how many of the records in the processed files needed postprocessing (that is, the XREC records), and doesn't write to files unless it actually had to make modifications.

    The following modifications are made:

    1. Records begin/end with <REC>, </REC>
    2. The fields <CR>, <TI>, <SU>, <DE> are replaced by HTML equivalents, containing HTML entities for non-ASCII characters in Latin-1 and the signs <, > and &.
    3. Up to three index versions <CRI>, <TII>, etc. are created: one in plain ASCII, one in ISO-Latin-1, and one following the German conventions for umlauts.
    4. A line <IDE> [20 character document id code] </IDE> is appended to each previously unindexed record.

    Files

    The `-kd <dump-file>' option lets you specify a file to which the deduplication keys will be dumped for analysis.  This file can be further processed by the UNIX pipe commands `sort | uniq -c | sort -n -r' to get a ranked list of multiply occurring deduplication keys.
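
For sites without the UNIX tools at hand, the same ranking can be approximated with a few lines of Python. This is a sketch under the assumption that the dump file holds one deduplication key per line, which the text above does not state explicitly; the function name is hypothetical.

```python
from collections import Counter


def duplicate_keys(lines):
    """Return (key, count) pairs for keys occurring more than once,
    most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [(key, n) for key, n in counts.most_common() if n > 1]
```

Passing an open file object for the key dump to `duplicate_keys' gives the list of multiply occurring keys, comparable to the output of the pipe but without the keys that occur only once.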

    The file `iso.dict' is expected to be in the directory from which `iso' is started.  The `-d' option allows you to read it from elsewhere, or to work with different dictionary files specifying alternative record formats and diacritics translation rules.

    The file `iso.dict' defines the record tag syntax and the translation rules for diacritics.  It consists of a number of records; blank lines are ignored, and can be used to separate records.

    If all you want `iso' to do is to convert ISO or LaTeX formatted records using the Zebra profile provided by NetLab, you may not need to do anything about `iso.dict', although it currently does not contain a lot of LaTeX sequences, so you may have to add sequences you encounter in your bibliographic records.

    Records consist of two or three lines.  The first line is of the form

    TagPair <tab> [tag pair identifier]
    or
    Display <tab> [display form]
    This is optionally followed by a second line
    Index <tab> [tab separated list of index forms]
    The last (second or third) line always reads
    Match <tab> [tab separated list of tag strings]
    A `TagPair' record must be present for the following bibliographical record items.
    unindexed record       (including indexed form)
    indexed record        (match only)
    personal author        (including indexed form)
    corporate author        (including indexed form)
    personal contributor       (including indexed form)
    corporate contributor      (including indexed form)
    publisher         (including indexed form)
    title           (including indexed form)
    alternative title        (including indexed form)
    subject          (including indexed form)
    description         (including indexed form)
    date-iso-8601         (match only)
    event-date-iso-8601        (match only)
    euler identifier        (match only)
    event location        (match only)
    free identifier        (including indexed form)
    Both Index and Match contain precisely two strings: the opening and the closing tag.

    For `Display' records, there can be at most two index forms; none if the display form is already in the US-ASCII range; one plain ASCII form when there are ISO-Latin characters in the display form, and one extra form for the German `oe for ö' convention.  The number of matched input strings is unrestricted.
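
The record layout described above can be read with a small parser such as the following Python sketch. It is not part of the `iso' tool; the function name and the dictionary keys of the result are hypothetical, and the sketch only follows the rules stated in the text: a record starts with a `TagPair' or `Display' line, may contain an `Index' line, ends with a `Match' line, and blank lines are ignored.

```python
def parse_dict(text):
    """Parse dictionary records into a list of plain dictionaries."""
    records, current = [], None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are ignored
        keyword, *values = line.split("\t")
        if keyword in ("TagPair", "Display"):
            # First line of a record: identifier or display form.
            current = {"kind": keyword, "name": values[0] if values else "",
                       "index": [], "match": []}
        elif keyword == "Index" and current is not None:
            current["index"] = values
        elif keyword == "Match" and current is not None:
            # The Match line always closes the record.
            current["match"] = values
            records.append(current)
            current = None
        else:
            raise ValueError("unexpected line: %r" % line)
    return records


# Made-up sample for illustration; not taken from the real iso.dict.
SAMPLE = (
    "TagPair\ttitle\n"
    "Index\t<TII>\t</TII>\n"
    "Match\t<TI>\t</TI>\n"
    "\n"
    "Display\t&uuml;\n"
    "Index\tu\tue\n"
    "Match\t\\\"{u}\t\u00fc\n"
)
```

Calling `parse_dict(SAMPLE)' returns two records, one of kind `TagPair' and one of kind `Display'.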
     

    Statistics

    At the end of a run, ISO prints statistics to stderr in a format that can be uploaded to a project-internal statistics repository (currently operated by NetLab) through a CGI script.

    If you want the statistics to be written to a file immediately, use the option -sd stats-file.