EULER Project Deliverable | |
Project Name: | European Libraries and Electronic Resources in Mathematical Sciences |
Project Acronym: | EULER |
Project Number: | LB-5609 |
Deliverable Title: | Resource Adaptation in EULER |
Deliverable Number: | General introduction to all D2.* Deliverables |
Version Number: | 1.0 |
Date: | May 19, 2000 |
Principal Author(s): | Michael Jost
FIZ Karlsruhe, Dept. Math & Comput. Sci Franklinstr. 11, D-10587 Berlin, Germany e-mail: jo@zblmath.fiz-karlsruhe.de Tel: +49 30 3999340 Fax: +49 30 3927009 |
Other Author(s): | with contributions from other D2.* authors |
Deliverable Kind: | Report |
Deliverable Type: | Public |
Abstract: | This document is a general introduction to the five Task documentations
that are available as separate documents. It is meant as an overview for
the whole resource adaptation work package of the EULER project and states
those facts that are of importance for all single Tasks.
It consists of a short overview of EULER aims in general and of the resource adaptation work package in particular, including an overview of the information providers in the project and the data(bases) they have provided. This is followed by a high-level description of the methods used, including specifications relevant to all tasks. Ongoing and further work after the completion of the resource adaptation tasks is described. |
1. Executive Summary
2. Introduction
3. Information Providers and Their Metadata
4. Description of Method Used / Work Done
4.1 EULER's Dublin Core based resource description format
4.2 Overview of Adaptation Tasks
4.3 Deduplication
4.4 Diacritics conversion
5. Ongoing and further work
6. References
A. Annexes
A.1. EULER-DC specifications
A.2. The Metadata Postprocessor
The following sections give a short overview of EULER aims in general
and of the resource adaptation work package in particular, including an
overview of information providers in the project and their respective data(bases)
they have provided. This is followed by a high-level description of the
methods used, including relevant specifications that are of relevance to
all tasks (such as the EULER specific specification of Dublin Core Element
Set usage, indexing strategy, and provisions for de-duplication). Ongoing
and further work after the completion of the resource adaptation tasks
is described.
Since April 1998 the European Commission has been funding the EULER project in the framework of the `Telematics for Libraries' sector of the Telematics Applications programme. The main goal of EULER is to integrate different, electronically available information resources in the field of mathematics. EULER aims to construct a digital library in mathematics from existing heterogeneous sources.
There is a rapid increase in the number of networked resources with information on scientific results and ongoing developments in the field of mathematics. Today, the user has to switch between a growing number of systems with heterogeneous user interfaces.
The aim of the EULER project is to offer a one-stop-shopping site for users interested in mathematics. A single integrated network-based access point has been developed, covering the publication-related information resources on mathematics mentioned above. A common user interface, available on the World Wide Web, allows homogeneous access to all integrated information types. The interface was developed in close cooperation with the mathematical user community. Only one search is necessary to generate a broad range of (mixed) hits, irrespective of resource type and information provider. The EULER service was developed starting with selected important information sources from the consortium partners. The goal is to design an open architecture, so that new sources of data from other information providers and libraries can easily be added later.
The integration approach makes use of common resource descriptions based
on the Dublin Core (DC) element set and access to those descriptions via
the Z39.50 protocol. Technically, all information providers produce DC
metadata for their resources and offer them as distributed databases, which
are located at the providers' sites. The central EULER Engine queries these
databases in parallel via a common Z39.50 profile and performs result set
merging and presentation formatting. The integration approach takes into
consideration the requirements of the user community and the different
information providers. Participating institutions are still autonomous
in deciding on their scientific and organisational policies, while at the
same time providing a common access strategy to their information services.
The foremost requirement to achieve such an aim was to choose and apply
suitable standards, formats and protocols.
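The query flow described above (parallel searches against the providers' databases, followed by result set merging) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the provider names, the in-memory sample records, and the helper functions `search_provider` and `search_all` are hypothetical, and a plain substring match stands in for a real Z39.50 search against a remote database.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the distributed provider databases.
# A real engine would send a Z39.50 search to each provider's site instead.
SAMPLE_DATABASES = {
    "provider-a": [{"TI": "Handbook of algebra. Volume 1."}],
    "provider-b": [
        {"TI": "A canonical form for multinomial systems"},
        {"TI": "Handbook of algebra. Vol. 1"},
    ],
}


def search_provider(name, term):
    """Return the records of one provider whose title contains the term."""
    term = term.lower()
    return [r for r in SAMPLE_DATABASES[name] if term in r["TI"].lower()]


def search_all(term, providers=tuple(SAMPLE_DATABASES)):
    """Query every provider in parallel and merge the result sets."""
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        # pool.map keeps the order of `providers`, so results line up below.
        result_sets = list(pool.map(lambda name: search_provider(name, term), providers))
    merged = []
    for name, records in zip(providers, result_sets):
        # Tag each hit with its origin so the presentation layer can show it.
        merged.extend({**record, "provider": name} for record in records)
    return merged


hits = search_all("algebra")
```

The sketch keeps each provider's data behind its own search function, which mirrors the idea that the databases stay at the providers' sites and only result sets travel to the central engine.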
Aims of the Resource Adaptation Work Package
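The aim of this Work Package was to build the basic set of EULER Metadata Databases that are finally accessible from the EULER Engine. The Work Package was subdivided into five Tasks covering the five initial resource types that are part of EULER: bibliographic databases, library OPACs, preprint servers, refereed electronic journals, and mathematical Internet resources.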
Refereed Electronic Journals, Preprint Servers, and Mathematical Internet Resources cover the broader scenario of resources harvesting, metadata creation (automatically or manually), and access to networked resources.
From a technical point of view, Tasks 3, 4 and 5 on the one side, and Tasks 1 and 2 on the other side, share similar approaches and tools.
All resources were made accessible as standardised EULER Z39.50 metadata
databases as specified in the documentation of the EULER gateway engine
(to be published).
Zentralblatt MATH
Zentralblatt für Mathematik und ihre Grenzgebiete was founded in
1931 by O. Neugebauer and is today the longest-running abstracting
and reviewing service in the field of mathematics. It covers the entire
spectrum of mathematics including applications in computer science, mechanics,
physics, etc. Citations are classified according to the worldwide accepted
Mathematics Subject Classification (MSC). It contains references to the
worldwide literature drawn from currently about 2000 journals and serials,
from conference proceedings, collections of papers and books. In the course
of the European extension of Zentralblatt operations, the service was recently
renamed to Zentralblatt MATH, to shorten the lengthy German title. Zentralblatt
MATH publishes about 60.000 abstracts and reviews per year produced by
more than 5000 scientists; the reviews are mainly written in English, but
some also in French and German. Published by Springer-Verlag, Zentralblatt
MATH is edited by the European Mathematical Society (EMS), Fachinformationszentrum
(FIZ) Karlsruhe, and the Heidelberger Akademie der Wissenschaften.
The data set from MATH that was finally available in the EULER demonstrator
service consists of 264176 records, namely the full Zentralblatt MATH production
of the period January 1996 - March 1999 (196406 records) plus all monographs
(67770 records, including dissertations and similar publications) that
were referenced earlier in Zentralblatt MATH, going back to its beginning (1931).
Niedersächsische Staats- und Universitätsbibliothek Göttingen
The Niedersächsische Staats- und Universitätsbibliothek (SUB) Göttingen, founded in 1734, is one of the five biggest libraries in Germany. Its stock amounts to approximately four million books. About 16.000 journals are received regularly. In addition, the library holds more than 12.000 manuscripts, 350 literary bequests and 3.100 incunabula. Depending on the time period covered and the subject, world-wide accepted classifications (MSC, CCS etc.) are used. With the financial support of the Deutsche Forschungsgemeinschaft (DFG) the library collects comprehensively in more than twenty special fields; one of these is the field of Pure Mathematics ('Sammelschwerpunkt Reine Mathematik 17.1').
The data set from the Niedersächsische Staats- und Universitätsbibliothek
is uploaded on three servers of the EULER demonstrator. The OPAC server
contains about 89.000 bibliographic records with a wide spectrum of resource
types (this number will change if the extraction is adapted again during the
demonstration phase). On the preprint server about 42.500 preprints
and research reports can be found, and on the quality controlled information
gateway server about 1.100 EULER metadata sets are available (which were
the basis for the EULER mathematical web index).
Centrum voor Wiskunde en Informatica
The CWI (carrying its present name since the early eighties, but founded
in 1946 as Mathematical Centre) is the National Research Institute for
Mathematics and Computer Science in the Netherlands, largely sponsored
by the Dutch national research organization NWO (see http://www.cwi.nl/).
CWI performs frontier research in mathematics and computer science and
transfers new knowledge in these fields to society in general and trade
and industry in particular. It is one of the founding members of the European
Research Consortium for Informatics and Mathematics (ERCIM; see http://www.ercim.org/).
The CWI library is a supporting department of the CWI (http://www.cwi.nl/cwi/departments/BIBL.html).
Historically it has a large and extensive collection of literature in the
fields of mathematics and (theoretical) computer science. As a consequence
of this large collection, the CWI library not only provides services to
CWI's research staff (approx. 180 scientists), but also has a central supportive
position for mathematical and computer science research in The Netherlands.
The literature is mainly postgraduate and research material. All
the data (approx. 185,000 records) from the OPAC are presently being provided
to the EULER service.
The full text electronic CWI reports/preprints have been available online since 1989. Until 1995 the files were submitted by the CWI scientists to an ftp server on a voluntary basis. Since 1995 the reports/preprints have been submitted by the CWI Publication Dept. as part of a quality assurance and clearing procedure related to the production of paper based preprints. As a consequence of the voluntary aspect in the early period of online availability, the CWI electronic reports/preprints available online do not entirely match the actual paper based reports/preprints production from 1989 - 1994. In that period not every CWI scientist took the trouble to put their reports/preprints on the CWI ftp server.
In the EULER project the CWI Library also contributed to the selection
and input of relevant resources from the Internet for a quality controlled
service.
University of Florence
The University of Florence (http://www.unifi.it)
has promoted the creation of the Italian National Bibliographic Service (SBN)
since the beginning of the eighties. In this context, a centralized
library management system became operational in 1985. The system now supports the complete
life cycle of documents, from purchasing to circulation, for all the University
libraries (5 wide thematic libraries and over 30 funds belonging to departments
and centers). Cataloguing functions and rules operate on-line with the
national union catalogue (SBN Index), so that about 65% of local catalogue
records are imported from the SBN Index. Libraries create bibliographic
records for documents, monographs and serials pertinent to all the
areas of interest of the existing Faculties (11) and Courses (71). At the end of April
2000 the catalogue comprised over 350.000 bibliographic records (including
19.000 serials), over 520.000 holdings and about 224.500 authors. The OPAC
(http://opac.unifi.it) has been operational
since 1995 and provides catalogue searching and browsing and holdings
localisation through a WWW interface and Z39.50 access. Records are exported
every night from the library management system to the OPAC database in UNIMARC format.
The OPAC server for EULER contains a subset of the central
catalogue (OPAC) records concerning mathematics and computer science. At
the end of March 2000, 17691 records had been extracted (16634 monographs
and 1057 serials) with 27278 holdings.
Cellule de Coordination Documentaire Nationale pour les Mathématiques
La Cellule de Coordination Documentaire Nationale pour les mathématiques (MDC, http://www-mathdoc.ujf-grenoble.fr) is a national team for the coordination of French mathematical documentation and its access via the web. In the EULER project, two important mathematical research libraries are associated with MDC:
Orsay Mathematical Library
The Orsay mathematical library (Bibliothèque mathématique
d'Orsay, BMO) is a department library, situated in the mathematics department
of Université Paris Sud. BMO also plays a regional and national
role in mathematical documentation for researchers. The domain is exclusively
pure and applied mathematics. These fields are largely covered. Important
collections: complete works, history of mathematics. The holdings are:
Books, theses, proceedings: 55000; Serials: 690 collections (460 current
subscriptions). Metadata for all these are being provided to the EULER
system.
Strasbourg Mathematical Library
The Strasbourg mathematical library is also a department library, situated
in the mathematics department of Université Louis Pasteur, Strasbourg.
The library has existed since the end of the 19th century. The domain
is exclusively pure and applied mathematics. The library is for researchers
only. Important collections: ancient monographs and journals from the 19th
and early 20th century. 1600 documents are from before 1860, half in German,
a third in French, the rest in Latin, English or Italian. The oldest monograph
is from 1499, and the oldest journal is from 1665. In the last twenty years,
the most important field covered is pure mathematics. Associated domains
such as applied mathematics and computer science are partially covered.
The holdings are: Books, theses, proceedings: 36750; Serials: 378 current
subscriptions. Metadata for all these are being provided to the EULER system.
Online Preprints and Theses
In addition to the library OPAC data, MDC makes available for EULER
all records from its national online grey literature index (http://www-mathdoc.ujf-grenoble.fr/Harvest/brokers/prepub/query.html).
As elsewhere, French mathematicians make their scientific results public
through articles, books and preprints. The preprints are generally
made available by the institutes as paper copies and have also been available
on the Internet for a few years in electronic form (usually PostScript
or DVI files, sometimes compressed). In 1997, MDC initiated the French
online grey literature project (sometimes known as "math-prepub"). The
idea was to use an Internet harvesting tool to gather metadata for
mathematical grey literature into one national index. Currently (April
2000) 19 institutions participate in the project, and end users can
also enter their metadata directly on MDC's web site. Metadata for 1667
online preprints and theses have been added to EULER. This figure grows
at a varying rate every month.
NetLab, Lund University Library
NetLab is the Research and Development Department at Lund University
Library. The department primarily runs development projects focusing
on digital libraries and net-based information discovery and retrieval, mainly
in Internet/WWW environments. For the EULER project, NetLab has created
an automatically collected, subject-limited web-index.
A web-index is a searchable collection of documents found on the World
Wide Web. Creating such an index involves using a harvesting program (often
called a spider or a robot) to collect pages, another program
to index the data and, thirdly, a gateway and search program to display
the index and make it searchable.
Creating a web-index can be done automatically by feeding a list of
URLs into a robot program, that will try to fetch the corresponding documents
from the web. The robot is able to extract new URLs from the documents
it finds. The new URLs can be used to fetch more documents. This procedure,
a harvesting step, can be repeated again and again. The quality of such
a service depends on a number of factors, such as the relevance and quality of
the URLs that are used as the starting point for the service and the number
of harvesting steps decided upon. The starting point for the harvesting
procedure for the EULER web-index was MathGuide's quality assessed mathematical
web pages. The present index contains ca. 103,500 records collected in a
two step process from the original URLs.
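The harvesting procedure described above (start from a list of URLs, fetch the documents, extract new URLs, and repeat for a fixed number of steps) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `harvest`, the `fetch_links` callback and the toy link graph `TOY_WEB` with its example.org addresses are hypothetical, and no network access takes place. It is not the robot software used in the project.

```python
def harvest(seed_urls, fetch_links, steps):
    """Collect URLs reachable from the seeds in at most `steps` harvesting steps.

    `fetch_links(url)` is assumed to fetch one document and return the URLs
    found in it; here it is supplied by the caller.
    """
    seen = set(seed_urls)
    frontier = list(seed_urls)
    for _ in range(steps):
        next_frontier = []
        for url in frontier:
            for link in fetch_links(url):
                if link not in seen:  # never queue a document twice
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier  # the next step starts from the new URLs only
    return seen


# Hypothetical link graph standing in for the web.
TOY_WEB = {
    "http://example.org/seed": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/d"],
}

collected = harvest(["http://example.org/seed"], lambda u: TOY_WEB.get(u, []), steps=2)
```

With two steps the page `http://example.org/d` is not reached, which shows how the number of harvesting steps limits the size of the resulting index.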
The harvested records essentially contain document text, title, headers,
links to other documents and possibly metadata, though in practice very
few automatically collected web documents contain metadata. Thus the web-index
will rarely offer more than fulltext and title searches, or combinations
of these.
European Mathematical Society / Technical University of Berlin
The Electronic Library of Mathematics (ELibM) was founded in 1995 as
part of the European Mathematical Information Service (EMIS), a service
by the European Mathematical Society (EMS). All material carried in mathematical
journals hosted by ELibM is peer-reviewed and the quality supervised by
the Electronic Publishing Committee of the EMS. ELibM covers journals,
monographs, collections and classical works in electronic form. The scope
of the collection encompasses all of mathematics.
Journals are either produced/hosted exclusively by ELibM or mirrored
from the journal's own web site. ELibM itself is mirrored via the mirror
system of EMIS which encompasses more than 30 mirrors around the world.
ELibM currently carries the full texts of 36 electronic journals. The
total number of full texts is currently 5071 (as of 25 April 2000). The
projected growth rate in terms of full texts is about 20% for 2001. About
95% of these texts are indexed in EULER.
To create the common record profile, the EULER partners needed a known resource description method, which could serve as a "switching language" between individual database records. For this purpose the project decided to use the Dublin Core Metadata Element Set (DC), with qualifiers (the Dublin Core home page is available at http://purl.org/DC/). Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has also attracted the attention of formal resource description communities such as museums and libraries.
This decision was based partly on the fact that DC has become a widely used de facto standard for resource description on the Internet (with special interest from the library community), and partly on the outcome of the evaluation of existing formats done in the EU Telematics for Research (4th Framework) project DESIRE. Finally, since DC is quickly becoming a widely used format, it assures semantic interoperability between EULER and other services on a European and global level.
In the mapping process the partners tried to find fields that semantically matched the fifteen basic elements of DC in each of the databases that were to be converted. The fifteen DC elements were however not sufficient for the functionality that the end-user was believed to appreciate. In a few cases some needed (and available) information could not be matched with any of the 15 basic DC elements (and their qualifiers), thus they were put in a EULER-specific hierarchy.
In the selection process efforts were made to exclude fields that only a few partners (involved in EULER today) provide data for. There are however exceptions to the rule. For example, fields judged to become useful (in a foreseeable future), were not discarded. There were two important reasons for these selection rules. First, an architecture should be as generic and simple as possible, avoiding complexity of special solutions for each supplier. Secondly, users would not be aware that they do not search all records when they select a certain field but only those that have data in that field.
Specific schemes have been assigned to most EULER DC fields, so that the EULER participants conform their data both with respect to the fields covered and with respect to the structure/format of the information in those fields.
The EULER DC specifications are detailed in Annex A.1;
the Z39.50 mappings for those EULER DC fields will be given in the EULER
Gateway Engine documentation.
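As an illustration of the mapping step described in this section, the following Python sketch converts a provider record into tagged EULER-style fields. The tags CR, TI, SU and DE are the field abbreviations used in this report; the provider field names (`author`, `title`, `keywords`, `abstract`), the table `FIELD_MAP` and the function `to_euler_fields` are hypothetical and only stand in for one partner's local format. The actual specifications are those of Annex A.1.

```python
# Hypothetical mapping from one provider's local field names to EULER tags.
FIELD_MAP = {
    "author": "CR",    # DC.Creator.PersonalName
    "title": "TI",     # DC.Title
    "keywords": "SU",  # DC.Subject
    "abstract": "DE",  # DC.Description
}


def to_euler_fields(record, field_map=FIELD_MAP):
    """Return tagged lines for the mapped fields; unmapped fields are dropped."""
    lines = []
    for local_name, tag in field_map.items():
        values = record.get(local_name, [])
        if isinstance(values, str):
            values = [values]  # allow single values as well as repeated ones
        for value in values:
            lines.append(f"<{tag}>{value}</{tag}>")
    return lines


sample = {
    "author": "Deak, J.",
    "title": "Extending a family of merotopies in a screen space.",
    "shelfmark": "not mapped",  # local field with no counterpart in the mapping
}
tagged = to_euler_fields(sample)
```

A field that has no entry in the mapping table is simply not exported, which corresponds to the selection rule of excluding fields that do not fit the common format.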
Task 1: Bibliographic Databases
This task has implemented automatic format conversion of a selected set of the data of MATH Database (Zentralblatt MATH, more than 1.500.000 entries covering the world-wide mathematical literature from 1931 to present) from the proprietary MATH format to the common EULER Dublin Core based format. Additional data from the "Jahrbuch über die Fortschritte der Mathematik" may be included in this conversion process if resources permit. "Jahrbuch" covers the most important literature in mathematics from 1868-1943. The original data were enriched with necessary means of identification (e.g. internationally accepted standard document identifiers for single articles) to enable the effective exchange of data between different systems.
Deliverable: D2.1
A Frontend Dublin Core Database for Zentralblatt MATH
Task 2: OPACs
This task has produced routines for the automatic extraction of (subsets of) relevant OPAC entries and conversion routines for the format conversion to the EULER Dublin Core based format, including automatic update procedures. Standardised EULER metadata databases with these data were made accessible to be queried by the central EULER Engine. Connections to existing online document delivery services (e.g. at SUB Göttingen) are included, based on forwarding relevant data directly to the ordering systems.
Deliverable: D2.2
The Creation of OPAC Metadata Databases for the EULER Service
Task 3: Preprint Servers
This task has produced frontend DC metadata databases for preprint series and other grey literature of scientific institutions that are electronically available through the Internet. The metadata databases were produced by means of automatic gathering/harvesting of the original sources, and automatic conversion to Dublin Core. The databases integrate preprint referencing, metadata search and full-text retrieval.
It was sufficient to include only preprints of relatively few institutes in this task. Other (national) initiatives such as the German DFN project MathNet, the MPRESS initiative, and the collections of French preprints have been informed of EULER results and requirements, and asked to contribute to a common development. For the purpose of proof-of-concept, trials with importing data from MPRESS have been carried out. Likewise, EULER has monitored activities in this sector in WP-1, and has refined its specifications to enable concerted international actions.
Deliverable: D2.3
A Front-end Dublin Core Database for Preprints and Research Reports
Task 4: E-journals
The goal of this task was to systematically add metadata descriptions to a carefully selected set of high-quality peer reviewed electronic mathematical journals and other publications. Metadata descriptions for these electronic publications were
* used in the production process of the electronic journal issues themselves, by means of fully automated procedures that generate all the necessary index and table-of-contents pages as well as homepages for individual journal articles out of the metadata descriptions. Costly and time-consuming manual preparation (as it was done before) was eliminated. Metadata have substantially facilitated and sped up the preparation of the final electronic product, and enhanced its quality and usability.
* used to provide the basic means of enabling Online Delivery of the publications irrespective of protocols and formats. This is an important point when it comes to technology changes.
Deliverable: D2.4
Metadata for Electronic Journals in Mathematics
Task 5: Internet Mathematical Resources
The goal of this task was to comprehensively collect publications, information, resources and services in Mathematics published on the Internet, to offer them as a searchable and browseable service and to prepare the integration with the other "more traditional" bibliographic databases and fulltext publications in project EULER by creating DC metadata records for them.
Sub-task a has developed a quality controlled information gateway for Mathematics, carefully selecting, describing and organizing Internet resources in this subject area. For this purpose the data, approach and solutions of the DFG project MathGuide (http://www.MathGuide.de/) were used and adapted for use in EULER. In addition, the participants co-operated by selecting further relevant mathematical resources from the Internet.
Sub-task b has used a harvesting robot to systematically and automatically gather "all" mathematical resources on the Internet into a "Mathematical Web Index". A robot generated "Mathematical Web Index" consisting of "all" mathematical Web pages and resources on the Internet with focus on HTML pages was installed. This builds upon the robot software developed for the project DESIRE and methodologies first tested in the "Engineering Electronic Library, Sweden (EELS)" (http://www.ub2.lu.se/eel/eelhome.html) project. To increase the quality of this database, a Dublin Core Metadata creation and support site for publishers of European mathematical Web pages was offered, connected to the database.
Deliverables:
D2.5.1
The Creation of a Quality Controlled Information Gateway for the EULER
Service
D2.5.2
The creation of a web-index for the EULER service
D2.5.3
A public Dublin Core Metadata creation and support site for Mathematicians
In this chapter we describe how the issue of merging duplicate entries in one or more connected databases was addressed in EULER. This is done in a way that combines identification, sorting of result lists, and elimination of duplicates of two types: local duplicates that arise in a single participating database (e.g. an on-line preprint and a report in the same partner's paper library), and nonlocal duplicates, i.e. items that appear in more than one database.
In a truly distributed environment, it is not possible to recognize
duplicates in batch, i.e., build or maintain lists of duplicate
entries. Such an action would require the complete database to pass
through a central point that would collect the data and recognize the duplicates.
Therefore, a method is required that can be performed at each of the partners'
sites individually.
The key is stored in the IDE field (EULER.Identifier) and consists of
5x4 characters in the following format:
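YYYYAAAATTTTUUUUVVVV
where YYYY is the publication year (or other year), AAAA is taken from the personal author or contributor, and TTTT, UUUU and VVVV are taken from words of the title.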
When taking personal author or personal contributor, do not process
any data following a comma (i.e. look at the surname only).
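The construction of the key can be sketched in Python as follows. Several details are assumptions inferred from the example keys quoted in this section rather than taken from a specification: each group holds the first four characters in lower case, missing characters are padded with `-`, a "short word" is a word of fewer than four characters, and non-letter characters are removed from the surname. The function names `chunk` and `dedup_key` are hypothetical.

```python
import re


def chunk(word, width=4):
    """First `width` characters in lower case, padded with '-' (assumed rule)."""
    return word.lower()[:width].ljust(width, "-")


def dedup_key(year, author, title, postpone_short=True):
    """Build a 5x4 character key of the form YYYYAAAATTTTUUUUVVVV."""
    # Look at the surname only: ignore everything after the first comma.
    surname = author.split(",", 1)[0]
    surname = re.sub(r"[^A-Za-z]", "", surname)  # assumption: keep letters only
    words = re.findall(r"[A-Za-z0-9]+", title)
    if postpone_short:
        # Move words shorter than 4 characters to the end of the title.
        words = [w for w in words if len(w) >= 4] + [w for w in words if len(w) < 4]
    words = (words + ["", "", ""])[:3]  # exactly three title groups
    year_part = chunk(str(year) if year else "")
    return year_part + chunk(surname) + "".join(chunk(w) for w in words)


key = dedup_key(1995, "Deak, J.", "Extending a family of merotopies in a screen space.")
```

Under these assumptions the sketch reproduces keys quoted in this section, for example `1995deakextefamimero` with short words postponed and `1995deakextea---fami` when `postpone_short=False`.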
Since the deduplication key starts with the publication year (or other
year), the resulting sorting order after deduplication is necessarily by
year. It was chosen to use reverse sorting, so most recent items
are shown first.
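Because every record already carries its key in the IDE field, merging and sorting can be done on the keys alone. The following Python sketch groups the records of several result lists by key and returns the groups in reverse key order, so that the most recent years come first. The record layout (dictionaries with `IDE` and `provider` entries) and the function `merge_by_key` are assumptions made for this illustration, not the EULER Engine implementation.

```python
def merge_by_key(result_lists):
    """Group records from several result lists by IDE key, newest year first."""
    groups = {}
    for records in result_lists:
        for record in records:
            groups.setdefault(record["IDE"], []).append(record)
    # Keys start with the year, so reverse string order puts recent items first.
    return [groups[key] for key in sorted(groups, reverse=True)]


# Hypothetical result lists from two providers.
list_a = [{"IDE": "1995deakextefamimero", "provider": "a"}]
list_b = [
    {"IDE": "1997hinrcanoformmult", "provider": "b"},
    {"IDE": "1995deakextefamimero", "provider": "b"},
]
merged_groups = merge_by_key([list_a, list_b])
```

Keys with an empty year begin with `----` and therefore sort after all dated keys in this reverse ordering.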
For example, when short words are included in the key, the following two records receive the same key:

    <CR>Deak, J.</CR>
    <TI>Extending a family of merotopies in a screen space.</TI>
    <CR>Deak, J.</CR>
    <TI>Extending a family of screens in a contiguity space.</TI>
    <IDE>1995deakextea---fami</IDE>

When short words are postponed, as is the case in the EULER Alpha system, the keys become

    <IDE>1995deakextefamimero</IDE>
    <IDE>1995deakextefamiscre</IDE>

A similar example is:

    <CR>Hinrichsen, Diederich</CR>
    <TI>A canonical form for multinomial systems</TI>
    <IDE>1997hinrcanoformmult</IDE>
    <CR>Hinrichsen, D.</CR>
    <TI>A canonical form for static linear output feedback</TI>
    <IDE>1997hinrcanoformstat</IDE>

Statistically, the remaining cases usually concern multi-part works of the same author:

    <TI>Singular optimal stochastic controls I: Existence.</TI>
    <IDE>1995haussingoptistoc</IDE>
    <TI>Singular optimal stochastic controls II: Dynamic programming.</TI>
    <IDE>1995haussingoptistoc</IDE>

However, some authors manage to produce essentially different works whose first 3 nontrivial words coincide (here, it doesn't even matter whether over is treated as a stop word):

    <TI>Factoring multivariate polynomials over finite fields</TI>
    <TI>Factoring multivariate polynomials over algebraic number fields</TI>

Author names cause further difficulties, for example surnames containing non-letter characters

    <CR>O'Hara, Jun</CR>
    <CR>mac_Donald, Fred</CR>

and surnames shorter than 4 letters, such as the following 6 spellings of the same Chinese name found in various databases:

    <CR>He, Xue-Zhong</CR>
    <CR>He, Xuezhong</CR>
    <CR>He, X.-Z.</CR>
    <CR>He, X.Z.</CR>
    <CR>He, X[ue] Z[hong]</CR>
    <CR>He, X.Z. [He, Xue Zhong]</CR>

The following two properties of the key were motivated by these examples:
After adding event locations to replace an empty author field, many still remain, and these are mostly journals. The following are the 5 most frequently occurring journal entry keys of the CWI OPAC when short words are included (350 duplicates total):

    34 --------intejourof--
    27 ----ieeeieeetranon--
    12 --------bullof--the-
    11 --------jourof--the-
    11 --------jourof--math

After short words are shifted to the end of the title, the final list becomes a total of 150 duplicates, the worst 5 being

    10 ----akadizveakadnauk
     8 --------intejourcomp
     6 --------manasciejour
     6 --------commstatin--
     6 --------buleinstpoli
Abbreviations in the title remain a problem, as in the following two entries for the same work:

    <TI>Hazewinkel, M.(ed.) 1996 Handbook of algebra. Volume 1.</TI>
    <IDE>1996hazehandalgevolu</IDE>
    <TI>Hazewinkel, M. Ed. 1996 Handbook of algebra. Vol. 1</TI>
    <IDE>1996hazehandalgeof--</IDE>

It has been suggested, but not investigated, to solve this by building lists of abbreviations.
A downside of the approach taken is however that during development of the deduplication system, all partners have had to simultaneously change to a new postprocessor and re-index their data a few times.
It is worth investigating the possibility of using centrally maintained
lists of stop words for titles (from, ueber, voor)
to replace the 4 letter system, of name prefixes for authors (``van der''), and of abbreviations
in the title (such as the problematic `vol.' vs `volume').
Considering the current performance of the deduplication system, however,
the pros and cons of such an approach should be weighed very carefully.
In the EULER partners' databases, diacritics are found represented in ISO-Latin-1, HTML and LaTeX encodings. Users typically enter queries for non-straightforward words and names using plain ASCII, ISO-Latin accented characters, or the German convention for writing letters with umlauts (oe, ue, ae).
Even in a local textual database system, it is a difficult task to effectively
answer differently formulated queries. In the distributed EULER system,
the user must choose to follow one of three conventions. Each database
is treated locally by the `metadata postprocessor', also known as ``the ISO
tool'' (see Annex A.2). The postprocessor replaces
entries such as title, author and publisher by up to four different forms:
three for indexing, and one for display.
The postprocessor takes the elements

    DC.Creator.PersonalName (CR field)
    DC.Creator.CorporateName (CRC field)
    DC.Contributor.PersonalName (COP field)
    DC.Contributor.CorporateName (COC field)
    DC.Title (TI field)
    DC.Title.Alternative (TIA field)
    DC.Subject (SU field)
    DC.Description (DE field),

and replaces these elements, encoded in ISO-Latin, LaTeX or HTML encoding, by an HTML display form, and adds up to three indexed alternatives:

    CRI, COI, TII, SUI, DEI.

The first alternative is a normalized 7-bit ASCII form; the second alternative, if different, is in ISO-Latin-1 encoding; the third alternative, if different from both others, is a 7-bit ASCII form using the German ae/oe/ue spelling for umlauts and ss for the ß. The following example illustrates this procedure for the creator element:

Input:

    <CR>Berggren, Mårten</CR>
    <CR>Br\"{u}mmer, Anna</CR>

Output:

    <CR>Berggren, Mårten</CR>
    <CRI>Berggren, Marten</CRI>
    <CRI>Berggren, Mårten</CRI>
    <CR>Brümmer, Anna</CR>
    <CRI>Brummer, Anna</CRI>
    <CRI>Brümmer, Anna</CRI>
    <CRI>Bruemmer, Anna</CRI>

The translations for the various representations are stored in a `dictionary file' that is read by the metadata postprocessor.
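The generation of the indexed alternatives can be sketched in Python as follows. This is a minimal illustration and not the metadata postprocessor itself: the two small translation tables stand in for the dictionary file and are assumptions, as are the function names. Accent stripping is done with the standard `unicodedata` module, which is a technique chosen for this sketch; characters that have no decomposition are left unchanged by it.

```python
import unicodedata

# Assumed excerpts of a translation dictionary (LaTeX input, German spelling).
LATEX_TO_CHAR = {r'\"{a}': "ä", r'\"{o}': "ö", r'\"{u}': "ü", r"{\ss}": "ß"}
GERMAN_SPELLING = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def decode_latex(text):
    """Replace the LaTeX sequences listed above by accented characters."""
    for sequence, char in LATEX_TO_CHAR.items():
        text = text.replace(sequence, char)
    return text


def strip_accents(text):
    """Plain form: drop combining marks after decomposition; ß becomes ss."""
    decomposed = unicodedata.normalize("NFKD", text.replace("ß", "ss"))
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def german_form(text):
    """ASCII form with ae/oe/ue for umlauts and ss for ß."""
    return strip_accents("".join(GERMAN_SPELLING.get(c, c) for c in text))


def index_forms(text):
    """Return up to three distinct index forms: plain, accented, German."""
    accented = decode_latex(text)
    forms = []
    for candidate in (strip_accents(accented), accented, german_form(accented)):
        if candidate not in forms:  # keep an alternative only if it differs
            forms.append(candidate)
    return forms


forms = index_forms(r'Br\"{u}mmer, Anna')
```

For the two creators of the example this yields three forms for Brümmer and two for Berggren, since the German spelling rule does not change the å.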
Although the resource adaptation work package of the EULER project
had been finished at the time of publication of these reports, the work
in other work packages is still ongoing. In the following sections we describe
the status of these work packages and how their results might affect
the resource adaptation reports.
Work Package 3: EULER Engine
The project released the alpha version of the EULER service for
intermediate evaluation in July 1999. The EULER service provides the user
with an intermediate version of the central EULER Engine that queries the
databases described here in parallel and performs the necessary processing
for result presentation. Currently, the beta version of the software is
being developed, based on results from an intermediate evaluation by users
and experts. The final demonstrator service can be expected to be available
in June 2000. Last-minute changes in profiles and other conventions might
lead to minor modifications of the resource adaptation procedures and specifications
described here.
Work Package 4: Evaluation and Demonstration
After the release of the EULER Engine beta version, selected groups of users will start system exploitation and evaluation. The work package intends to measure the suitability and scalability of the system and the level of user satisfaction with the service.
System test will evaluate the following parameters:
Work Package 5: Information Dissemination and Exploitation Preparations
The final exploitation plan for EULER services and other project results
will be prepared, based on the results of work package 4. Commercial exploitation
for future operation of EULER services and transfer of EULER results to other
subject domains will be considered. Contracts within the consortium (and
beyond) will ensure the continuation of EULER services after the project
comes to an end.
Deliverables of the Resource Adaptation Work Package:
D2.1 A Frontend Dublin Core Database for Zentralblatt MATH
D2.2 The Creation of OPAC Metadata Databases for the EULER Service
D2.3 A Front-end Dublin Core Database for Preprints and Research Reports
D2.4 Metadata for Electronic Journals in Mathematics
D2.5.1 The Creation of a Quality Controlled Information Gateway for the EULER Service
D2.5.2 The creation of a web-index for the EULER service
D2.5.3 A public Dublin Core Metadata creation and support site for Mathematicians
Dublin Core Metadata Element Set, Version 1.1: Reference Description:
http://purl.oclc.org/dc/documents/rec-dces-19990702.htm
Format of entries:
Field name
Qualified DC name - EULER shorthand
Scheme
Semantics
Title
DC.Title - TI
Scheme: None (i.e. freetext)
Title
Typically, a Title will be a name by which the resource is formally
known.
Alternative title
DC.Title.Alternative - TIA
Scheme: None (i.e. freetext)
Any titles other than the main title; including subtitle, translated
title, vernacular name, etc.
Personal Author
DC.Creator.PersonalName - CR
Scheme: Family name, first name(s) or initials (MARC)
A person primarily responsible for making the content of
the resource.
Corporate Author
DC.Creator.CorporateName - CA
Scheme: None
A corporate entity primarily responsible for making the content
of the resource.
Personal Contributor
DC.Contributor.PersonalName - COP
Scheme: Family name, first name(s) or initials (MARC)
A person responsible for making contributions to the content
of the resource; including editors, translators, etc.
Corporate Contributor
DC.Contributor.CorporateName - COC
Scheme: None
A corporate entity responsible for making contributions to
the content of the resource; including editors, translators, etc.
Uncontrolled keyword
DC.Subject - SU
Scheme: None (i.e. freetext)
The topic of the content of the resource: Any keyword (NOT full
text).
Library of Congress Subject Headings (LCSH) keyword
DC.Subject - SUL
Scheme: LCSH
The topic of the content of the resource: Controlled keyword
from the LCSH.
Mathematics Subject Classification Scheme (MSC) classification
DC.Subject - SUM
Scheme: MSC
The topic of the content of the resource: MSC code, not the
explanatory text.
Dewey Decimal Classification (DDC) classification
DC.Subject - SUD
Scheme: DDC
The topic of the content of the resource: DDC code, not the
explanatory text.
Computing Classification System
DC.Subject - SUC
Scheme: CCS
The topic of the content of the resource: CCS code, not the
explanatory text.
Description
DC.Description - DE
Scheme: None (i.e. freetext)
Abstract, review and other freetext describing the resource.
NOT the full text of a web-resource or such like.
An account of the content of the resource. Description may include
but is not limited to: an abstract, table of contents, reference to a graphical
representation of content or a free-text account of the content.
Publisher
DC.Publisher - PU
Scheme: City [(Country)]: Name
An entity responsible for making the resource available. Typically
the publisher of the resource.
Date
DC.Date - DA
Scheme: YYYY[-MM[-DD]] (ISO 8601)
Date of resource publication or availability.
A date associated with an event in the life cycle of the resource.
Typically, Date will be associated with the creation or availability
of the resource.
Type
DC.Type - TY
Scheme: All of the suggested types in the "Dublin Core Resource Types,
Structuralist DRAFT: July 24, 1997" (2). Some of them are more relevant
for the EULER project (and might be used as a search aid, e.g. selectable
from a list of resource types).
Format
DC.Format - FO
Scheme: IMT (Internet Media Types) (3)
IANA MIME-type of file.
The [...] digital manifestation of the resource. Typically, Format
may include the media-type [...] of the resource. Format may be used to
determine the software, hardware or other equipment needed to display
or operate the resource.
Physical carrier
DC.Format.x-carrier - FOP
The project's draft list of relevant EULER-specific physical carriers:
URN
DC.Identifier - IDN
Scheme: URN (4)
An unambiguous reference to the resource within a given context:
URN of the resource described in the record.
ISSN
DC.Identifier - IDS
Scheme: ISSN
An unambiguous reference to the resource within a given context:
ISSN of the resource described in the record.
ISBN
DC.Identifier - IDB
Scheme: ISBN
An unambiguous reference to the resource within a given context:
ISBN of the resource described in the record.
URL
DC.Identifier - IDL
Scheme: URL (http://..,ftp://... etc)
An unambiguous reference to the resource within a given context:
URL of the resource (described in the record) or where it can be acquired.
De-duplication Identifier
DC.Identifier - IDE
Scheme: EULER specific scheme, generated uniformly by automated procedure
An unambiguous reference to the resource within a given context:
EULER's own resource identifier (in order to find duplicates).
Language
DC.Language - LA
Scheme: ISO 639-1 (2 letter codes)
A language of the intellectual content of the resource.
Terms and Conditions
DC.Rights - TC
Scheme: None (yet)
Information about rights held in and over the resource. Typically,
a Rights element will contain a rights management statement for the resource,
or reference a service providing such information. Rights information often
encompasses Intellectual Property Rights (IPR), Copyright, and various
Property Rights. If the Rights element is absent, no assumptions can be
made about the status of these and other rights with respect to the resource.
Metadata Creation Date
DC.Date.x-metadata-created - DMC
Scheme: numeric: YYYYMMDD
Date of the creation of the original metadata record. YYYY=Year, MM=month,
DD=day. Use "01" for unknown MM or DD. Useful for SDI services.
EULER specific fields
EULER identifier
EULER.Identifier - IDF
Scheme: None (i.e. freetext)
An unambiguous reference to the resource within a given context:
The purpose of this field is to identify the resource in other ways
than those provided by the other fields. This can be serial name, page-,
issue- or volume-numbers for journal articles or similar. (Can be used
differently in different databases, e.g. ISO 4-1984)
Full text
EULER.Fulltext - FT
Scheme: None (i.e. freetext)
The fulltext of web-pages and other resources available as a whole.
Event location
EULER.Event.Location - EL
Scheme: None
Location of event for/at which the resource described in the record
was created.
Event date
EULER.Event.Date - ED
Scheme: YYYY-MM-DD (ISO 8601)
Date of event for/at which the resource described in the record was
created.
Event name
EULER.Event.Name - EN
Scheme: None (i.e. freetext)
Name of event where document was created.
Record source
EULER.Record.Source - RS
Scheme: <Name of information provider>: <internal id>
The source for the record i.e. describes which information provider
has delivered the record.
Record source URL
EULER.Record.Sourceidentifier - OI
Scheme: URL
Identifier of source record for the description delivered in EULER.
URL pointing back to the original record at the information provider's site.
Record creator
EULER.Record.Creator - RC
Scheme: Family name, first name (MARC)
Creator of the record (describing the resource), e.g. a reviewer.
Address for delivery information
EULER.Delivery - DI
Scheme: URL
Meant to give the URL to the library where the resource described in
the record can be acquired. (Pointer to online-order forms etc.)
Additional retrieve/delivery information
EULER.Delivery.Description - DID
Scheme: None (i.e. freetext)
Additional information that a user and a local library need to retrieve/deliver
the resource described in the record.
All fields are repeatable, except for Title (TI), De-duplication Identifier (IDE) and Record Source (RS).
Planned further work: shift elements from EULER specific hierarchy to DC hierarchy whenever possible.
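The repeatability rule and the date schemes listed in this annex lend themselves to a simple automated check. The following Python sketch is an illustration only and not part of the EULER software: the function name, the representation of a record as a list of (shorthand, value) pairs and the wording of the messages are assumptions, and the date patterns test the syntax only, not whether a date actually exists.

```python
import re
from collections import Counter

# Fields that may occur at most once per record, as stated above.
NON_REPEATABLE = ("TI", "IDE", "RS")

# Syntactic patterns for the date schemes listed in this annex.
DATE_PATTERNS = {
    "DA": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),  # YYYY[-MM[-DD]]
    "ED": re.compile(r"\d{4}-\d{2}-\d{2}"),        # YYYY-MM-DD
    "DMC": re.compile(r"\d{8}"),                   # YYYYMMDD
}


def check_record(fields):
    """Return a list of problems for a record given as (shorthand, value) pairs."""
    problems = []

    # Count how often each shorthand occurs in the record.
    counts = Counter(tag for tag, _ in fields)
    for tag in NON_REPEATABLE:
        if counts[tag] > 1:
            problems.append(
                f"{tag}: occurs {counts[tag]} times, but is not repeatable")

    # Check the date fields against their scheme.
    for tag, value in fields:
        pattern = DATE_PATTERNS.get(tag)
        if pattern is not None and not pattern.fullmatch(value):
            problems.append(f"{tag}: '{value}' does not match the scheme")

    return problems
```

For example, a record with two TI fields and the DMC value `2000051` would be reported with two problems, while a record with one TI field, the DA value `2000-05-19` and the DMC value `20000519` passes without any.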
The EULER Metadata Postprocessor ("iso-tool") generates the following five special index fields from the listed original fields for indexing and retrieval purposes. The corresponding original fields (those without "I" at the end) are normalized for HTML display.
CRI: CR, CA, COP, COC
PUI: PU
TII: TI, TIA
SUI: SU
DEI: DE
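The mapping above can be written down as a small lookup table, for instance to find out which index field receives the index forms of a given original field. The Python sketch below is merely an illustration; the names `INDEX_FIELDS` and `index_field_for` are hypothetical and do not occur in the ISO tool.

```python
# Index field -> original fields it is generated from (as listed above).
INDEX_FIELDS = {
    "CRI": ["CR", "CA", "COP", "COC"],
    "PUI": ["PU"],
    "TII": ["TI", "TIA"],
    "SUI": ["SU"],
    "DEI": ["DE"],
}

# Inverted table: original field -> index field.
SOURCE_TO_INDEX = {
    source: index
    for index, sources in INDEX_FIELDS.items()
    for source in sources
}


def index_field_for(tag):
    """Return the index field for an original field, or None if it has none."""
    return SOURCE_TO_INDEX.get(tag)
```

In this sketch, all four creator and contributor fields map to CRI, whereas a field such as DA has no index field and yields `None`.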
ISO-tool was written in C using standard UNIX libraries and was installed without problems at the various partner sites. With some minor modifications it can be ported to other operating systems. It performs the following tasks:
Constructing the deduplication key (section 4.3)
Diacritics conversion (section 4.4)
Producing field usage statistics
iso [-b] [-d dict-file] [-kd key-dumpfile]
[-sd stats-dumpfile]
<files/directories>
The program reports how many of the records in the processed files needed postprocessing (that is, the XREC records), and does not write to a file unless it actually had to make modifications.
The following modifications are made:
The file `iso.dict' is expected to be in the directory from which `iso' is started. The `-d' option allows you to read it from elsewhere, or to work with different dictionary files specifying alternative record formats and diacritics translation rules.
The file `iso.dict' defines the record tag syntax and the translation rules for diacritics. It consists of a number of records; blank lines are ignored, and can be used to separate records.
If all you want `iso' to do is to convert ISO or LaTeX formatted records using the Zebra profile provided by NetLab, you may not need to change `iso.dict' at all. It currently contains only a few LaTeX sequences, however, so you may have to add the sequences you encounter in your bibliographic records.
Records consist of two or three lines. The first line is of the form
TagPair <tab> [tag pair identifier]
or
Display <tab> [display form]
This is optionally followed by a second line
Index <tab> [tab separated list of index forms]
The last (second or third) line always reads
Match <tab> [tab separated list of tag strings]
A `TagPair' record must be present for the following bibliographical record items.
unindexed record (including indexed form)
indexed record (match only)
personal author (including indexed form)
corporate author (including indexed form)
personal contributor (including indexed form)
corporate contributor (including indexed form)
publisher (including indexed form)
title (including indexed form)
alternative title (including indexed form)
subject (including indexed form)
description (including indexed form)
date-iso-8601 (match only)
event-date-iso-8601 (match only)
euler identifier (match only)
event location (match only)
free identifier (including indexed form)
Both Index and Match contain precisely two strings: the opening and the closing tag.
For `Display' records, there can be at most two index forms; none if
the display form is already in the US-ASCII range; one plain ASCII form
when there are ISO-Latin characters in the display form, and one extra
form for the German `oe for ö' convention. The number of matched
input strings is unrestricted.
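To make the record layout more concrete, the following Python sketch reads dictionary records of the form described above. It is an illustration only, not the parser used by `iso': the function name and the returned structure are assumptions, and the two sample records (including the tag pair identifier `title' and the tag strings) are invented for the example and need not match the contents of the actual `iso.dict'.

```python
def parse_dictionary(text):
    """Parse dictionary text into a list of records (one dict per record)."""
    records, block = [], []
    for line in text.splitlines() + [""]:  # the extra "" flushes the last record
        if line.strip():
            block.append(line.split("\t"))  # keyword, then tab separated values
            continue
        if not block:
            continue  # several blank lines in a row
        first, last = block[0], block[-1]
        if first[0] not in ("TagPair", "Display") or last[0] != "Match":
            raise ValueError(f"malformed record starting with {first!r}")
        if len(block) == 3 and block[1][0] == "Index":
            index = block[1][1:]
        elif len(block) == 2:
            index = []  # the Index line is optional
        else:
            raise ValueError(f"malformed record starting with {first!r}")
        records.append({
            "kind": first[0],
            "value": first[1] if len(first) > 1 else "",
            "index": index,
            "match": last[1:],
        })
        block = []
    return records


# Invented sample: one `Display' record and one `TagPair' record.
SAMPLE = "\n".join([
    "Display\t&uuml;",
    "Index\tu\tue",
    'Match\tü\t\\"{u}',
    "",
    "TagPair\ttitle",
    "Index\t<TII>\t</TII>",
    "Match\t<TI>\t</TI>",
])

RECORDS = parse_dictionary(SAMPLE)
```

In this sample, the `Display' record carries the two index forms `u' and `ue' for the display form `&uuml;', and the `TagPair' record lists the opening and closing tag for both the indexed and the matched form.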
If you want the statistics to be written to a file immediately, use the option -sd stats-file.