EULER: European Libraries and Electronic Resources in Mathematical Sciences
Telematics for Libraries Project LB-5609 

 

The creation of a web-index for the EULER service.

Includes an Evaluation of Methods for Automatic Collection of Web Pages.

Anna Brümmer, Johanna Nilsson,
Lars Noodén, Tomas Schönthal

3 March 2000

Final Version

EULER Project Deliverable
Project Name: European Libraries and Electronic Resources in Mathematical Sciences
Project Acronym: EULER
Project Number: LB-5609
Report Title: The Creation of a Web-Index for the EULER Service.
Includes an Evaluation of Methods for Automatic Collection of Web Pages.
Deliverable Number: D2.5.2
Version Number: Final Version
Date: 3 March 2000
URL: http://www.lub.lu.se/EULER/D2_5_2.html
URL to other versions: The report in PDF: Not available yet.
The entire report in PS: Not available yet.
Author(s): Anna Brümmer
LUB NetLab
Postal address: P.O. Box 3, S-221 00 Lund, Sweden
Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden
E-mail: Anna.Brummer@lub.lu.se
Tel: +46 46 2220114
Fax: +46 46 2223682
Johanna Nilsson
LUB NetLab
Postal address: P.O. Box 3, S-221 00 Lund, Sweden
Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden
E-mail: Johanna.Nilsson@lub.lu.se
Tel: +46 46 2229369
Fax: +46 46 2223682
Lars Noodén
LUB NetLab
Postal address: P.O. Box 3, S-221 00 Lund, Sweden
Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden
E-mail: Lars.Nooden@lub.lu.se
Tel: +46 46 2229371
Fax: +46 46 2223682
Tomas Schönthal
LUB NetLab
Postal address: P.O. Box 3, S-221 00 Lund, Sweden
Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden
E-mail: Tomas.Schonthal@lub.lu.se
Tel: +46 46 2229372
Fax: +46 46 2223682
Deliverable Kind: Report
Deliverable Type: Unrestricted
Abstract: This deliverable describes the creation of an automatically collected, subject-limited web-index. Before the production of the EULER web-index, a comparative study of two different methods for the automatic collection of web pages was performed, which is also described in this deliverable. Finally, in line with the ideas behind the project, a front-end Dublin Core database for the EULER service was created from the web-index.

1 Introduction

1.1 Web-Index

A web-index is a searchable collection of documents found on the World Wide Web. Creating such an index involves a harvesting program (often called a spider or a robot) to collect pages, a second program to index the data and, thirdly, a gateway and search program to display the index and make it searchable.

Creating a web-index can be done automatically by feeding a list of URLs into a robot program, which will try to fetch the corresponding documents from the web. The robot is able to extract new URLs from the documents it finds, and these new URLs can be used to fetch more documents. This procedure, a harvesting step, can be repeated again and again. The quality of the web-index can be influenced by two factors, which are covered in more detail in the "Harvesting Strategies" section below.
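To make the procedure concrete, the following is a minimal Python sketch of such a repeated harvesting step: fetch a page, extract its links, and continue for a fixed number of steps. It is an illustration of the general technique only, not the code used in the project; the function names and the depth parameter are ours.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def fetch_links(url):
        """Fetch one page and return the absolute URLs it links to."""
        with urlopen(url, timeout=30) as response:
            if "text/html" not in response.headers.get("Content-Type", ""):
                return []          # only text/html documents are processed
            page = response.read().decode("utf-8", errors="replace")
        parser = LinkExtractor()
        parser.feed(page)
        return [urljoin(url, link) for link in parser.links]

    def harvest(seed_urls, max_depth):
        """Repeat the harvesting step max_depth times, breadth first."""
        seen = set(seed_urls)
        frontier = list(seed_urls)
        for depth in range(max_depth):
            next_frontier = []
            for url in frontier:
                try:
                    for link in fetch_links(url):
                        if link not in seen:
                            seen.add(link)
                            next_frontier.append(link)
                except OSError:
                    pass           # unavailable pages are skipped
            frontier = next_frontier
        return seen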

1.2 Subject Based Information Gateways

Subject Based Information Gateways (SBIGs) are subject entrances, or clearing houses, for quality-assessed resources. Resources are selected manually in accordance with an (officially published) list of quality criteria and subsequently cataloged with intellectually created abstracts and keywords.

1.3 The Goals of This Task

The aim of this task (EULER T2.5.2) is to create a web-index to be included in the EULER service. Two specific requirements have been decisive: the EULER web-index should be subject limited, and it should cover pages of as high a quality and relevance as possible. To achieve this, the URLs for the list (to be fed to the robot) could be taken from quality-assessed mathematical web pages. We assume that pages of high quality for the most part link to other pages of high quality. Using such URLs should therefore produce an index covering pages of higher quality than a normal/average web-index. High-quality URLs of this kind are available, for example, from SBIG services.

For the EULER project we chose initial URLs from MathGuide, a well-established Internet service in which every URL is thoroughly quality-rated according to a number of useful criteria. More information on MathGuide is available in EULER D2.5.1.

A sub-task in T2.5.2 was to evaluate two databases resulting from two different data collection, or harvesting, strategies. These databases were examined by mathematical experts who volunteered from university mathematics departments around Sweden. Two methods were used to evaluate the databases. First, randomly sampled documents were judged as either relevant or irrelevant in order to estimate the proportion of relevant material in the entire databases. Second, the databases were searched for solutions to specific problems.

In the EULER project the Combine robot was used. Combine consists of three main components. The first component schedules the harvesting of URLs, manages a central log, and ensures that Combine obeys the robot exclusion protocol (robots.txt files). The second part, with coordination from the scheduler, retrieves individual web pages. The pages are then sent to one or more databases, the third component of the Combine harvester/index, where they are parsed and indexed. Any number of the latter two components, the harvester and the database, can be controlled by a single scheduler.
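The robot-exclusion check performed by the scheduler can be illustrated with the Python standard library; this is a sketch of the general technique, not Combine's actual implementation.

    from urllib.parse import urlparse, urlunparse
    from urllib.robotparser import RobotFileParser

    def allowed_by_robots(url, user_agent="combine"):
        """Return True if robots.txt on the target host permits fetching url."""
        parts = urlparse(url)
        robots_url = urlunparse((parts.scheme, parts.netloc,
                                 "/robots.txt", "", "", ""))
        parser = RobotFileParser()
        parser.set_url(robots_url)
        try:
            parser.read()          # fetch and parse robots.txt
        except OSError:
            return True            # host unreachable: treat as unrestricted here
        return parser.can_fetch(user_agent, url)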

Combine was developed as a part of the Development of a European Service for Information on Research and Education (DESIRE) project, which was funded by the European Commission within the Telematics for Science Program.


2 Methods for Data Collection

2.1 Initial Selection

The starting point for the harvesting process was a set of descriptive records for 804 manually compiled and classified mathematical web-sites, gathered from the Subject Based Information Gateway (SBIG) MathGuide ( http://www.sub.uni-goettingen.de/ssgfi/math/ ) service at the Lower Saxony State and University Library Göttingen, Germany (SUB).

The rules for selecting sites from MathGuide required the MIME type text/html, at least some English in each document, and high ratings for contents and links, as explained below.

Requiring MIME type text/html ensures that all the documents in the collection can be processed in an automated fashion. Further requiring at least some English in each document allows a common working language both for indexing software and for project participants.

The suppliers at SUB indicate that the contents quality criterion (range 1-3) is the strongest one. The links criterion, which rates the number and quality of links for a given site, is also valuable. It was decided to initially pick sites with the highest contents and links ratings and to pursue harvesting strategies A and B, described below. A third harvesting strategy, to completely harvest the entire MathGuide SBIG, was considered and then discarded.

As a check, the values (for example popular, undergraduate, graduate and professional) and compound values (undergraduate - professional, for example) in the qualitative level rating were quantified on a scale from purely popular (lowest) to purely professional (highest). The level rating appears to be well correlated with the highest contents and links ratings.
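As an illustration, quantifying the level ratings might look as follows. The numeric scale (popular = 0 up to professional = 3) and the averaging of compound values are our assumptions; the report does not record the exact mapping used.

    # Hypothetical quantification of the qualitative level rating.
    LEVEL_SCORES = {"popular": 0, "undergraduate": 1,
                    "graduate": 2, "professional": 3}

    def level_score(rating):
        """Map a level rating, possibly compound ('undergraduate - professional'),
        onto a single number between purely popular and purely professional."""
        parts = [part.strip() for part in rating.split("-")]
        return sum(LEVEL_SCORES[part] for part in parts) / len(parts)

    print(level_score("popular"))                       # 0.0  (lowest)
    print(level_score("undergraduate - professional"))  # 2.0
    print(level_score("professional"))                  # 3.0  (highest)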

The above rules produced a set of 71 sites, which were fed into the Combine robot and resolved to 64 unique links.
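The reduction from 71 pre-selected sites to 64 unique links presumably comes from normalizing and deduplicating the URLs. The following sketch shows typical normalization rules (lower-casing the host, dropping the default port and a trailing slash); these are common choices, not necessarily the exact rules Combine applies.

    from urllib.parse import urlparse, urlunparse

    def normalise(url):
        parts = urlparse(url.strip())
        host = parts.netloc.lower()
        if host.endswith(":80") and parts.scheme == "http":
            host = host[:-3]              # drop the default HTTP port
        path = parts.path or "/"
        if path != "/" and path.endswith("/"):
            path = path[:-1]              # drop a trailing slash
        return urlunparse((parts.scheme.lower(), host, path,
                           parts.params, parts.query, ""))

    def unique_links(urls):
        return {normalise(url) for url in urls}

    # e.g. these three spellings collapse into one unique link:
    print(unique_links(["http://www.example.org/",
                        "HTTP://WWW.EXAMPLE.ORG:80",
                        "http://www.example.org"]))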

Note: MathGuide is not purely mathematical but is also devoted to applied mathematics (statistics, numerical analysis, computer science, etc.) and the natural sciences (e.g. physics). Examples:

Table 1.
Examples from the Mathematical Subject Classification System
MSC  Name                                                 Example site
68   Computer science                                     www.caam.rice.edu
70   Mechanics of particles and systems                   www.math.psu.edu/dynsys
83   Relativity and gravitational theory                  www.physics.adelaide.edu/ASGRG
93   Systems theory, control                              www.elsevier.com/locate/mp
90   Economics, operations research, programming, games   www.actuaries.ca/homee.htm

Using only MSC (Mathematical Subject Classification) values in the interval [0, 59], only purely mathematical sites would be harvested, and the initial selection would have been only 53 sites instead of 71.

2.2 Final Decisions

  1. Interpret "mathematics" in a wide sense. Specifically, the initial sites to be harvested could be chosen from the entire MathGuide.
  2. Pursue strategies A (narrow and deep) and B (wide and shallow) below.

3 Harvesting Strategies

As mentioned, it was decided to compare the results of narrow and deep harvesting versus wide and shallow harvesting. The narrow and deep method started from 64 unique links, pared down from 71 pre-selected URLs, and followed links up to three steps away from the original 64. The wide and shallow method started with 639 unique links (from 714 pre-selected URLs) and followed links only two steps away from the original 639. The selection of the first 64 URLs differs from that of the latter 639 in that all level ratings were accepted for the second group.

3.1 Collecting Data

The outcome of the collection process for each strategy is summarized in tables 2 and 3 below.
Table 2.
Narrow and deep harvesting
Case      Links   Size  Time  Unavailable  Errors   Sites  Hosts
case.a0      64   842k    3m            1              71     63
case.a1    2138    17M    4h          108
case.a2   23376   187M   12h          690    3(*)   37000   4200
case.a3  185000   1.4G    9d         2500      27  390000  22200
Indexing run:
184,975 records, 2,925,460 words, 117,828,218 iterations, taking 4 hours.
(*) E.g. three occurrences of "Unknown action 303": /turing.math.tu.cottbus.de/
Table 3.
Wide and shallow harvesting
Case      Links   Size  Time  Unavailable  Errors   Sites  Hosts
case.b0     639     4M   18m           21       0            556
case.b1   11320   101M    4h          297       2   16300   2500
case.b2  106566   908M   78h         3019      75  217000  14570
Indexing run:
104,357 records, 2,169,118 words, 76,170,199 iterations, taking 2.25 hours.

4 Methods for Evaluating the Databases

The two strategies pursued resulted in two databases with different contents. In order to decide which strategy gives the better database, the resulting databases were compared using two different evaluation methods. The contents of the databases were examined through:

  1. assessment of sample web pages to determine whether or not they belonged to the subject of mathematics. The purpose of this method was to measure the proportions of relevant versus irrelevant material.
  2. search tests, trying to answer subject questions using the databases. The purpose here was to test how useful the databases are for searching.

The examinations were carried out by subject experts in the field of mathematics. The participants were a total of eight volunteers from mathematics departments at the Royal Institute of Technology in Stockholm, Lund University, Lund Institute of Technology, and the Massachusetts Institute of Technology. Each participant's tasks were presented on an individual web page containing instructions and forms for data submission.

4.1 Evaluation Method 1: Assessment of Sample Web Pages

The first phase of the evaluation was the assessment of two sets of sample URLs, one set extracted from each database. The number of URLs was set at 400 per database in order to acquire enough data for statistical analysis. Each participant then examined 50 URLs per database, judging them to be either relevant to the subject or not. Broken links and other viewing problems were put aside in a "Fault" category. If a page was accessible and belonged to the subject, the participant checked "Yes" in the form; if it was accessible but did not belong to the subject, the answer was "No".

Table 4.
Results for evaluation method 1
db  Total  Yes          No           Fault        Confidence interval
A     399  122 (30.6%)  246 (61.6%)  31 (7.77%)   ±4.5%
B     400  143 (35.5%)  218 (54.5%)  39 (9.75%)   ±4.7%
Confidence interval = 1.96 * sqrt( P * (1 - P) / n ),
where P = the proportion of relevant pages and n = the total number answered.
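For reference, the intervals in table 4 can be reproduced directly from this formula:

    from math import sqrt

    def confidence_interval(p, n):
        """95% confidence interval for a sampled proportion p out of n."""
        return 1.96 * sqrt(p * (1 - p) / n)

    print(round(confidence_interval(0.306, 399) * 100, 1))  # 4.5 (database A)
    print(round(confidence_interval(0.355, 400) * 100, 1))  # 4.7 (database B)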

The resulting data sets were submitted by the participants. The total set of results for each of the two databases, shown in table 4 above, was then analyzed by measuring the shares of pages that were within the subject, not within the subject, and not available, respectively.

4.2 Evaluation Method 2: Search Tests

For the second part, each participant formulated four questions (see Appendix A). These questions were then distributed to other participants. In the search tests, each expert tried to find the answer to these questions using the two databases; each question was thus attempted twice, once in each database. After searching for the answer to a question in a database, the participants reported whether the answer had been found and, optionally, whether finding it was easy or difficult and whether the answer was found in many pages of the hit list or in few.

This last question helped determine whether the harvesting method had collected many relevant links or only one good link among many non-useful ones.

The resulting data was put together and the numbers of positive and negative answers compiled. The optional questions were used for spotting trends, not for the primary conclusions.

Results for evaluation method 2
Table 5a.
Did you find the answer in the database?
db  Total  Yes       No
A      28  13 (46%)  15 (54%)
B      28   9 (32%)  19 (68%)
Table 5b.
Was it easy or difficult?
db  Total  Easy      Difficult
A      13  8 (62%)   5 (38%)
B       9  7 (78%)   2 (22%)
Table 5c.
In how many of the search results could the answer be found?
db  Total  Many      Few
A      11  2 (18%)   9 (82%)
B       6  1 (17%)   5 (83%)

Regarding the optional questions, it can be assumed that a question which could not be answered using a database was neither easy nor supported by many relevant records in that database.


5 Conclusions of the Evaluation

The results for evaluation method 1 suggest that the wide and shallow harvesting strategy has a slight advantage over the narrow and deep strategy. On the other hand, evaluation method 2 seems to point to the database built with the narrow and deep strategy as the better one. However, further evaluations are necessary.

5.1 Evaluation Method 1: Assessment of Sampling of Web Pages

The assessment of sampled web pages, evaluation method 1, suggests that both harvesting methods give rather similar results. However, database B, harvested with the wide and shallow method, seems to perform slightly better: as seen in table 4 above, it has a higher yield of relevant URLs (35.5% versus 30.6%) as well as a lower yield of irrelevant URLs (54.5% versus 61.6%).

5.2 Evaluation Method 2: Search Tests

The second evaluation method might be regarded as more interesting, since searching for direct answers is what most people will want to use a web-index for: people need solutions to specific problems and try to find relevant information. (Thus a high percentage of non-relevant pages might not be a problem, since the user is never confronted with that material; those pages are simply never shown in the search results.) Like the first method, this one also suggests that the two harvesting methods are rather similar. The fact that fewer than 50% of the questions could be answered using the databases is not very satisfactory. However, the results suggest that database A is slightly better suited for searching than database B: as seen in table 5a above, database A provided answers to 46% of the questions versus 32% for database B.

We have not examined whether the same questions could be answered using both databases, or whether the overlap was small and each answer was found in only one of database A or B.

5.3 Concluding Remarks on the Evaluation

Relevance rates of 30% and 35% for databases A and B, respectively, may at first seem to indicate a low yield. However, considering that database A holds as many as 185,000 records, a yield of 30-35% corresponds to 55,000 to 65,000 potentially relevant records. Starting from 71 URLs, this is a yield factor of about 780, which is very encouraging given that the harvesting process is entirely automatic. Correspondingly, database B holds 37,000 potentially relevant records, a yield factor of 58.
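The arithmetic behind these yield figures, using the record counts from the indexing runs in tables 2 and 3 and the relevance shares from table 4:

    def yield_figures(records, relevant_share, seed_urls):
        """Estimated relevant records and yield factor per starting URL."""
        relevant = round(records * relevant_share)
        return relevant, round(relevant / seed_urls)

    print(yield_figures(185_000, 0.30, 71))    # (55500, 782): "about 780" for A
    print(yield_figures(104_357, 0.355, 639))  # (37047, 58):  factor 58 for B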

Please note that the starting URLs for database A were more carefully selected than those for B. It should also be noted that harvesting strategy B was much faster than A.

Another remark about relevance: the EULER partners decided to interpret "mathematics" in a broad sense, i.e. as pure mathematics plus applied mathematics. The fact that quite a few of the evaluators are pure mathematicians is therefore problematic: it is not at all evident that a pure mathematician would judge, for example, computer science documents as relevant "mathematics".


6 Indexing the Data

The harvested records essentially contain document text, title, headers, links to other documents and possibly metadata. In practice very few automatically collected web documents contain metadata.

Thus the web-index will rarely offer more than full-text and title searches, or combinations of these. However, in order to be fully compatible with all the other databases within the project, the web-index is created according to the full EULER profile. This is important because it makes our web-index cross-searchable with the other, non-web-index databases (e.g. library catalogues, pre-prints).
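As an illustration, mapping a harvested record onto Dublin Core elements might look as follows. The element choices shown are a simplified sketch under our own assumptions, not the actual EULER profile definition; the field names on the harvested side (title, text, links, metadata) follow the description above.

    def to_dublin_core(harvested):
        """Wrap a harvested web page record in simple Dublin Core elements."""
        record = {
            "dc:title": harvested.get("title", ""),
            "dc:identifier": harvested["url"],
            "dc:description": harvested.get("text", "")[:500],  # excerpt only
            "dc:type": "Text",
            "dc:format": "text/html",
        }
        if "metadata" in harvested:       # rarely present in practice
            record.update(harvested["metadata"])
        return record

    page = {"url": "http://www.example.org/analysis.html",
            "title": "Lecture Notes in Real Analysis",
            "text": "These notes cover measure theory ..."}
    print(to_dublin_core(page))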

The database conforms to the Z39.50 protocol, and the Zebra software from Index Data, http://www.indexdata.dk/, was used to create it.


7 Displaying the Index

The web gateway for searching the web-index is designed with the Zebril software from DTV, http://www.dtv.dk/, and the database is accessed via Index Data's Zebra software.

The gateway allows the user to combine several search fields with Boolean operators and offers a number of further operations on a result set.
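As an illustration of such Boolean field searches against a Z39.50 database, a query can be expressed in prefix query format (PQF). The Bib-1 use attributes shown are standard (4 = title, 1003 = author, 1016 = any); whether the EULER gateway maps its form fields exactly this way is an assumption on our part.

    # Sketch: composing a Boolean Z39.50 query in PQF notation.
    USE_ATTRIBUTES = {"title": 4, "author": 1003, "any": 1016}

    def field_query(field, term):
        """One field search as a PQF sub-query with a Bib-1 use attribute."""
        return f"@attr 1={USE_ATTRIBUTES[field]} \"{term}\""

    def boolean_and(queries):
        """Fold a list of PQF sub-queries into a single @and query."""
        result = queries[0]
        for query in queries[1:]:
            result = f"@and {result} {query}"
        return result

    print(boolean_and([field_query("title", "galois theory"),
                       field_query("author", "artin")]))
    # -> @and @attr 1=4 "galois theory" @attr 1=1003 "artin"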

The gateway and a more detailed description are available at http://mother.lub.lu.se/eval-euler/.

Appendix A - Some of the questions used in evaluation method 2, search tests.

Appendix B - Sample MathGuide Record.


The Lund University Library Development Department, NetLab, would like to extend special thanks to Claes Trygger at the Royal Institute of Technology's Department of Mathematics for his extra efforts and input.