EULER Project Deliverable

| | |
|---|---|
| Project Name: | European Libraries and Electronic Resources in Mathematical Sciences |
| Project Acronym: | EULER |
| Project Number: | LB-5609 |
| Report Title: | The Creation of a Web-Index for the EULER Service. Includes an Evaluation of Methods for Automatic Collection of Web Pages. |
| Deliverable Number: | D2.5.2 |
| Version Number: | Final Version |
| Date: | 3 March 2000 |
| URL: | http://www.lub.lu.se/EULER/D2_5_2.html |
| URLs to other versions: | The report in PDF: not available yet. The entire report in PS: not available yet. |
| Author(s): | Anna Brümmer, LUB NetLab. Postal address: P.O. Box 3, S-221 00 Lund, Sweden. Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden. E-mail: Anna.Brummer@lub.lu.se. Tel: +46 46 2220114. Fax: +46 46 2223682 |
| | Johanna Nilsson, LUB NetLab. Postal address: P.O. Box 3, S-221 00 Lund, Sweden. Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden. E-mail: Johanna.Nilsson@lub.lu.se. Tel: +46 46 2229369. Fax: +46 46 2223682 |
| | Lars Noodén, LUB NetLab. Postal address: P.O. Box 3, S-221 00 Lund, Sweden. Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden. E-mail: Lars.Nooden@lub.lu.se. Tel: +46 46 2229371. Fax: +46 46 2223682 |
| | Tomas Schönthal, LUB NetLab. Postal address: P.O. Box 3, S-221 00 Lund, Sweden. Visiting address: Dag Hammarskjölds väg 2D, Lund, Sweden. E-mail: Tomas.Schonthal@lub.lu.se. Tel: +46 46 2229372. Fax: +46 46 2223682 |
| Deliverable Kind: | Report |
| Deliverable Type: | Unrestricted |
| Abstract: | This deliverable describes the creation of an automatically collected, subject-limited web-index. Before the production of the EULER web-index, a comparative study of two different collection methods was performed; this study is also described in this deliverable. Finally, in accordance with the ideas behind the project, a front-end Dublin Core database for the EULER service was created for the web-index. |
A web-index is a searchable collection of documents found on the World Wide Web. Creating such an index involves using a harvesting program (often called a spider or a robot) to collect pages, using another program to index the data, and, thirdly, using a gateway and search program to display the index and make it searchable.
Creating a web-index can be done automatically by feeding a list of URLs into a robot program, which tries to fetch the corresponding documents from the web. The robot extracts new URLs from the documents it finds, and these new URLs can be used to fetch more documents. This procedure, a harvesting step, can be repeated again and again. The quality of the web-index can be influenced by two factors: the quality of the initial list of URLs and the number of harvesting steps performed. A sketch of the fetch-and-extract loop is given below.

Subject Based Information Gateways (SBIGs) are subject entrances, or clearing houses, for quality-assessed resources. Resources are manually selected in accordance with an (officially published) list of quality criteria and subsequently catalogued with intellectually created abstracts and keywords.
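To make the procedure concrete, the following is a minimal sketch of such a fetch-and-extract loop in Python. It is illustrative only, not the Combine robot used in the project; the seed URL is hypothetical.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags found in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def harvest_step(urls):
    """One harvesting step: fetch each URL, return the new URLs found."""
    found = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                # Only text/html documents are parsed for further links
                # (cf. the MIME type rule used for the EULER selection).
                if response.headers.get_content_type() != "text/html":
                    continue
                parser = LinkExtractor(url)
                parser.feed(response.read().decode("utf-8", errors="replace"))
                found.extend(parser.links)
        except OSError:
            pass  # unavailable pages are simply skipped
    return found

# Each repetition deepens the harvest: one round's output seeds the next.
urls = ["http://www.example.org/"]  # hypothetical starting list
for step in range(2):
    urls = harvest_step(urls)
```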
The aim of this task (EULER T2.5.2) is to create a web-index to be included in the EULER service. Two specific requirements have been decisive: the EULER web-index should be subject-limited, and it should cover pages of as high a quality and relevance as possible. To achieve this, the URLs for the list fed to the robot can be taken from quality-assessed mathematical web pages. We assume that pages of high quality for the most part link to other pages of high quality. Using such URLs should mean that the resulting index covers pages of higher quality than a normal web-index. High-quality URLs of this kind are available, for example, from SBIG services.
For the EULER project we chose initial URLs from MathGuide, a well-established Internet service in which every URL is thoroughly quality-rated according to a number of useful criteria. More information on MathGuide is available in EULER D2.5.1.
A sub-task in T2.5.2 was to evaluate two databases resulting from two different data collection, or harvesting, strategies. These databases were examined by mathematical experts who volunteered from university mathematics departments around Sweden. Two methods were used to evaluate the databases. First, randomly sampled documents were judged as either relevant or irrelevant in order to estimate the proportion of relevant material in the entire databases. In the second evaluation method, the databases were searched for solutions to specific problems.
In the EULER project the Combine robot was used. Combine consists of three main components. The first component schedules the harvesting of URLs, manages a central log, and ensures that Combine obeys the robot exclusion protocol (robots.txt files). The second part, with coordination from the scheduler, retrieves individual web pages. The pages are then sent to one or more databases, the third component of the Combine harvester/index, where they are parsed and indexed. Any number of the latter two components, the harvester and the database, can be controlled by a single scheduler.
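One duty of the scheduler component is to honour the robot exclusion protocol before any page is fetched. A minimal sketch of such a check in Python follows; the user-agent name is hypothetical, and this is not Combine's actual implementation.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, agent="euler-harvester"):
    """Consult the target host's robots.txt before fetching a page."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    rp = RobotFileParser(robots_url)
    try:
        rp.read()
    except OSError:
        return True  # no reachable robots.txt: assume fetching is allowed
    return rp.can_fetch(agent, url)

print(allowed_by_robots("http://www.example.org/some/page.html"))
```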
Combine was developed as a part of the Development of a European Service for Information on Research and Education (DESIRE) project, which was funded by the European Commission within the Telematics for Science Program.
The starting point for the harvesting process was a set of descriptive records for 804 manually compiled and classified mathematical web-sites, gathered from the Subject Based Information Gateway (SBIG) MathGuide ( http://www.sub.uni-goettingen.de/ssgfi/math/ ) service at the Lower Saxony State and University Library Göttingen, Germany (SUB).
The rules for selecting sites from MathGuide were the following:

- the MIME type must be text/html,
- each document must contain at least some English, and
- the contents and links ratings must be the highest.
Requiring MIME type text/html ensures that all the documents in the collection can be processed in an automated fashion. Further requiring at least some English in each document allows a common working language both for indexing software and for project participants.
According to the suppliers at SUB, the contents quality criterion (range 1 - 3) is the strongest one. Links, i.e. the number and quality of links for a given site, is also valuable. It was decided to initially pick sites with the highest contents and links ratings and to pursue harvesting strategies A and B, described below. A third harvesting strategy, to completely harvest the entire MathGuide SBIG, was considered and then discarded.
As a check, the values (for example, popular, undergraduate, graduate and professional) and compound values (for example, undergraduate - professional) of the qualitative level rating were quantified on a scale from purely popular (lowest) to purely professional (highest). The level rating turned out to be well correlated with the highest contents and links ratings.
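One way to quantify such a level rating is sketched below. The numeric scale and the treatment of compound values are assumptions for illustration; the report does not publish the exact mapping used.

```python
# Hypothetical numeric scale from purely popular to purely professional.
LEVEL_SCORE = {"popular": 0, "undergraduate": 1, "graduate": 2, "professional": 3}

def level_value(rating):
    """Quantify a MathGuide level rating; compound values such as
    'undergraduate - professional' are averaged over their endpoints."""
    parts = [p.strip() for p in rating.split("-")]
    return sum(LEVEL_SCORE[p] for p in parts) / len(parts)

print(level_value("professional"))                  # 3.0
print(level_value("undergraduate - professional"))  # 2.0
```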
The above rules produced a set of 71 sites, which were then fed into the Combine robot and resolved to 64 unique links.
Note: MathGuide is not purely mathematical but is also devoted to applied mathematics (statistics, numerical analysis, computer science, etc.) and the natural sciences (e.g. physics). Examples:
Table 1. Examples from the Mathematical Subject Classification System

| MSC | Name | Example |
|---|---|---|
| 68 | Computer science | www.caam.rice.edu |
| 70 | Mechanics of particles and systems | www.math.psu.edu/dynsys |
| 83 | Relativity and gravitational theory | www.physics.adelaide.edu/ASGRG |
| 93 | Systems theory, control | www.elsevier.com/locate/mp |
| 90 | Economics, operations research, programming, games | www.actuaries.ca/homee.htm |
Using only MSC (Mathematical Subject Classification) values in the interval [0, 59], only purely mathematical sites would be harvested, and the initial selection would have been only 53 sites instead of 71.
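Such a restriction amounts to a simple filter over the selected records. The record structure below is hypothetical, purely to illustrate the principle:

```python
# Hypothetical records: each site carries its top-level MSC codes.
records = [
    {"url": "www.caam.rice.edu", "msc": [68]},               # computer science
    {"url": "www.math.psu.edu/dynsys", "msc": [37]},         # dynamical systems
    {"url": "www.physics.adelaide.edu/ASGRG", "msc": [83]},  # relativity
]

# Keep only sites all of whose MSC codes lie in the interval [0, 59].
pure_math = [r for r in records if all(0 <= code <= 59 for code in r["msc"])]
print([r["url"] for r in pure_math])  # ['www.math.psu.edu/dynsys']
```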
As mentioned, it was decided to compare narrow and deep harvesting with wide and shallow harvesting. The narrow and deep method started from 64 unique links, pared down from 71 pre-selected URLs, and followed links up to three steps away from the original 64. The wide and shallow method started with 639 unique links (from 714 pre-selected URLs) and followed links only two steps from the original 639. The selection of the first 64 URLs differs from that of the latter 639 in that, for the second group, all level ratings were accepted. A sketch of the two depth-limited strategies is given below, followed by the harvesting statistics.
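The difference between the strategies is essentially a breadth-first traversal with different seed sets and depth limits. The sketch below reuses the hypothetical harvest_step() function from the earlier fetch-and-extract example:

```python
def harvest(seeds, max_depth):
    """Depth-limited harvest: follow links at most max_depth steps
    from the seed set, visiting each URL once."""
    seen = set(seeds)
    frontier = list(seeds)
    for depth in range(max_depth):
        next_frontier = []
        for link in harvest_step(frontier):  # fetch-and-extract, sketched earlier
            if link not in seen:
                seen.add(link)
                next_frontier.append(link)
        frontier = next_frontier
    return seen

# Strategy A (narrow and deep):  db_a = harvest(seeds_a, max_depth=3)
# Strategy B (wide and shallow): db_b = harvest(seeds_b, max_depth=2)
# seeds_a: the 64 unique links; seeds_b: the 639 unique links.
```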
Table 2. Narrow and deep harvesting

| | Links | Size | Time | Unavailable | Errors | Sites | Hosts |
|---|---|---|---|---|---|---|---|
| case.a0 | 64 | 842k | 3m | 1 | | 71 | 63 |
| case.a1 | 2138 | 17M | 4h | 108 | | | |
| case.a2 | 23376 | 187M | 12h | 690 | 3 (*) | 37000 | 4200 |
| case.a3 | 185000 | 1.4G | 9d | 2500 | 27 | 390000 | 22200 |

Indexing run: 184,975 records, 2,925,460 words, 117,828,218 iterations, taking 4 hours.
(*) E.g. three occurrences of "Unknown action 303: /turing.math.tu.cottbus.de/".
Table 3. Wide and shallow harvesting

| | Links | Size | Time | Unavailable | Errors | Sites | Hosts |
|---|---|---|---|---|---|---|---|
| case.b0 | 639 | 4M | 18m | 21 | 0 | 556 | |
| case.b1 | 11320 | 101M | 4h | 297 | 2 | 16300 | 2500 |
| case.b2 | 106566 | 908M | 78h | 3019 | 75 | 217000 | 14570 |

Indexing run: 104,357 records, 2,169,118 words, 76,170,199 iterations, taking 2.25 hours.
The two strategies pursued resulted in two databases with different contents. In order to decide which strategy gives the best database, the resulting databases were compared using two different evaluation methods. The contents of the databases were examined through:

1. assessment of randomly sampled web pages, and
2. search tests based on questions formulated by the participants.
The examinations were carried out by subject experts in the field of mathematics. The participants were a total of eight volunteers from mathematics departments at the Royal Institute of Technology in Stockholm, Lund University, Lund Institute of Technology, and the Massachusetts Institute of Technology. The tasks were presented on an individual web page for each participant, containing instructions and forms for data submission.
The first phase of the evaluation was the assessment of two sets of sample URLs, one set extracted from each database. The number of URLs was set to 400 per database in order to acquire enough data for statistical analysis. Each participant then examined 50 URLs per database, judging them to be either relevant to the subject or not. Broken links and other viewing problems were put aside in a "Fault" category. If a page was accessible and belonged to the subject, the participant checked "Yes" in the form; if it was accessible but did not belong to the subject, the answer was "No".
Table 4. Results for evaluation method 1

| db | Total | Yes | No | Fault | Confidence Interval |
|---|---|---|---|---|---|
| A | 399 | 122 (30.6%) | 246 (61.6%) | 31 (7.8%) | ±4.5% |
| B | 400 | 143 (35.5%) | 218 (54.5%) | 39 (9.8%) | ±4.7% |
Confidence interval = 1.96 × sqrt( P(1 − P) / n ), where P is the proportion of relevant answers and n the total number of answers.
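The tabulated intervals can be reproduced directly from this formula; a short check, with values taken from table 4:

```python
from math import sqrt

def confidence_interval(p, n):
    """95% confidence interval half-width for a sample proportion."""
    return 1.96 * sqrt(p * (1 - p) / n)

print(round(100 * confidence_interval(0.306, 399), 1))  # 4.5 (database A)
print(round(100 * confidence_interval(0.355, 400), 1))  # 4.7 (database B)
```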
The participants submitted the resulting data sets. The total set of results for each of the two databases, shown above in table 4, was then analysed by measuring the shares of pages that were within the subject, not within the subject, and not available, respectively.
For the second part, each participant formulated four questions (see Appendix A). These questions were then distributed to other participants. In the search tests, each expert tried to find the answer to these questions using the two databases; each question was thus attempted twice, once in each database. After searching for the answer to a question in a database, the participant reported whether the answer had been found and, optionally, whether it was easy or difficult and whether the answer was found in many pages of the hit list or in few. This last question helped determine whether the harvesting method had collected many relevant links or only one good link among many non-useful ones.
The resulting data was put together and the numbers of positive and negative answers compiled. The optional questions were used for catching trends, not for the primary conclusions.
Table 5. Results for evaluation method 2

Did you find the answer in the database?

| db | Total | Yes | No |
|---|---|---|---|
| A | 28 | 13 (46%) | 15 (54%) |
| B | 28 | 9 (32%) | 19 (68%) |

Was it easy or difficult?

| db | Total | Easy | Difficult |
|---|---|---|---|
| A | 13 | 8 (62%) | 5 (38%) |
| B | 9 | 7 (78%) | 2 (22%) |

In how many of the search results could the answer be found?

| db | Total | Many | Few |
|---|---|---|---|
| A | 11 | 2 (18%) | 9 (82%) |
| B | 6 | 1 (17%) | 5 (83%) |
Regarding the optional questions, it can be assumed that the questions which could not be answered using the respective databases were neither easy to answer nor supported by many relevant records.
The assessment of a sample of web pages, evaluation method 1, suggests that the two harvesting methods give rather similar results. However, database B, harvested using the wide and shallow method, has a slightly higher yield of relevant web pages (35.5% versus 30.6%) and a slightly lower yield of irrelevant web pages (54.5% versus 61.6%), as seen in table 4 above.
We have not examined whether the same questions could be answered using both databases, or whether the overlap was small and each answer was found in only one of database A and database B.
Relevance rates of 30% and 35% for databases A and B, respectively, may at first indicate a low yield. However, considering that database A holds as many as 185,000 records, a yield of 30%-35% corresponds to 55,000 to 65,000 potentially relevant records. Starting from 71 URLs, this indicates a yield factor of about 780, which is very encouraging since the harvesting process is entirely automatic. Correspondingly, database B holds about 37,000 potentially relevant records, a yield factor of 58. The arithmetic is spelled out below.
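For completeness, the yield factors follow directly from the figures reported above:

```python
# Database A: 185,000 records, ~30% relevant, 71 starting URLs.
relevant_a = 0.30 * 185_000       # ≈ 55,500 potentially relevant records
print(round(relevant_a / 71))     # ≈ 782: yield factor for strategy A

# Database B: 104,357 records, ~35.5% relevant, 639 starting URLs.
relevant_b = 0.355 * 104_357      # ≈ 37,000 potentially relevant records
print(round(relevant_b / 639))    # ≈ 58: yield factor for strategy B
```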
Please note that the starting URLs for database A were more carefully selected than for B. It should also be noted that harvesting strategy B was much faster than A.
Another remark about relevance: the EULER partners decided to interpret "mathematics" in a broad sense, i.e. as pure mathematics plus applied mathematics. The fact that quite a few of the evaluators are pure mathematicians is therefore problematic: it is not at all evident that a pure mathematician would judge, for example, computer science documents as relevant "mathematics".
Since automatically harvested pages carry little descriptive metadata, the web-index will rarely offer more than fulltext and title searches, or combinations of these. However, in order to be fully compatible with all the other databases within the project, the web-index is created according to the full EULER profile. This is important because it makes our web-index cross-searchable with the other, non-web-index databases (e.g. library catalogues, pre-prints).
The database conforms to the Z39.50 protocol and was created using the Zebra software from Index Data, http://www.indexdata.dk/.
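For illustration, a Z39.50 database of this kind can be queried with a YAZ-based client using the prefix query format (PQF). The host, port, and database name below are hypothetical; Bib-1 attribute 1=4 selects the title field, 1=1016 any field, and @and shows the kind of Boolean combination the gateway offers:

```
open tcp:mother.lub.lu.se:2100/euler
find @and @attr 1=4 fourier @attr 1=1016 transform
show 1
```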
The gateway allows the user to combine several search fields with Boolean operators and offers a number of operations on a result set. The gateway and a more detailed description are available at http://mother.lub.lu.se/eval-euler/.
Appendix A - Some of the questions used in evaluation method 2 (search tests).
Appendix B - Sample MathGuide Record.
The Lund University Library Development Department, NetLab, would like to extend special thanks to Claes Trygger at the Royal Institute of Technology's Department of Mathematics for his extra efforts and input.