Navigation Interface in Cross-Lingual WWW Search Engine, TITAN

Seiji Susaki, Yoshihiko Hayashi & Gen-itiro Kikui
NTT Information and Communication Systems Labs, Japan
Email: suzaki@isl.ntt.co.jp

Abstract

The World Wide Web (WWW) has grown so large that Internet users find it difficult to reach their destinations when they 'surf' the Web. Many systems have been developed to make it easier to obtain information from the Internet. TITAN (Total Information Traverse AgeNt) is one such system. TITAN features (i) cross-lingual information retrieval and (ii) a user interface that supports browsing. With TITAN, Internet users can reach their destinations using their native language and browse Web pages according to hyperlink information.

1 Introduction

The explosive proliferation of information on the Internet has increased the urgency of developing effective tools for accessing the immense information resources that are now available. However, many users are bewildered. They don't know how to navigate through the vast amounts of information on the WWW to find the specific information that they are seeking.

Conventional information retrieval methods cannot simply be applied as they are to access the increasingly diverse types of files and documents that are now available. This paper gives an overview of how to conduct a productive search for specific information on the Web and describes the mechanics of TITAN (Total Information Traverse AgeNt), a powerful Internet search system based on the robot mechanism that gives users access to the diverse resources of the Web.

TITAN features a cross-lingual information retrieval capability that enables users to quickly and efficiently access information in Japanese, English and some other languages using Japanese or English search requests. It also has an interface, based on VRML (Virtual Reality Modelling Language), that supports browsing. This browsing mechanism helps clarify the retrieval results and allows users to assess the value of information.

2 TITAN Overview

The WWW is growing explosively, and today Excite can gather about 60,000,000 Web pages. To examine this vast informational cyberspace efficiently, sophisticated new tools are being developed that automatically traverse the WWW, create a condensed summary of the contents they find, and thus make the information readily available. Although this compressed information only approximates the full-text documents it represents, these tools nevertheless make it possible to manage and search through the vast amounts of information available on the Internet. They are known variously as Internet robots, agents, spiders, and webcrawlers, but 'robot' is becoming the general term (Koster 1994).

In simple terms, a robot is furnished with a list of URLs (Uniform Resource Locators) and gathers information starting from that list. The robot extracts hyperlinks from the Web pages it visits and uses them to travel recursively to those URLs. Information gathering is complete when all the links have been followed. Currently, TITAN extracts over 300,000 URLs from the Web. Though this method may not be practical for collecting all the information on the Internet, it does allow the collection of a good proportion of the information that is available. The robot then analyses the HTML (Hyper Text Markup Language) documents on the pages that it visits. The analysis consists of extracting those parts which HTML tags identify as being especially important. In this context, the important parts are titles, headlines, paragraph breaks, and hyperlink annotations (generally known as 'anchors').
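The gathering and analysis steps described above can be sketched roughly as follows. This is a minimal illustration and not TITAN's actual robot: the class `ImportantParts`, the function `crawl`, and the `fetch` callable (which returns the HTML for a URL, or `None` if the page is unavailable) are hypothetical names, and an explicit queue is used in place of recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImportantParts(HTMLParser):
    """Collect the title, headlines, and anchors (link + text) of one page."""

    HEADLINES = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.headlines = []
        self.anchors = []   # (absolute URL, anchor text) pairs
        self._open = []     # [tag, href, collected text] for open elements

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADLINES:
            self._open.append([tag, None, ""])
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open.append([tag, urljoin(self.base_url, href), ""])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self._open:
            entry[2] += data

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, href, text = self._open.pop()
            text = " ".join(text.split())   # collapse whitespace
            if tag == "title":
                self.title = text
            elif tag == "a":
                self.anchors.append((href, text))
            else:
                self.headlines.append(text)


def crawl(seed_urls, fetch, limit=100):
    """Follow hyperlinks from the seed list until no unvisited link remains
    or `limit` pages have been analysed (the limit is an assumption)."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = ImportantParts(url)
        parser.feed(html)
        pages[url] = parser
        for link, _text in parser.anchors:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing an in-memory dictionary's `get` method as `fetch` is enough to try the sketch without network access; a real robot would also need politeness rules and error handling, which are omitted here.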

After this processing, TITAN sorts the information according to the type of HTML document. There are various kinds of documents on the Web. For example, there are indices such as Yahoo that contain an immense number of links to other Web sites, and there are pages that contain a great deal of information in and of themselves.

Taking this into account, TITAN employs two sorting criteria: information content and links. The former measures the degree of unique informational content contained in a document and the latter determines to what extent a document is linked to other Web sites.


Figure 1: TITAN Architecture

Figure 1 shows the TITAN architecture for information gathering, text retrieval and page browsing. After TITAN gathers Web pages from the Internet, it indexes the information and makes a catalogue database; then it builds up the index database. Texts are retrieved from the index database according to user requests. For the text retrieval engine, TITAN uses WAIS-sf, and for the CGI (Common Gateway Interface) script that mediates between the retrieval engine and users, it uses SF-gate. If users make retrieval requests in Japanese or Web pages are written in a multi-byte code, Juman, a Japanese tagger, must also be used.

Figure 2 shows the first screen and an example of a search result output by TITAN.


Figure 2: First Screen and Output

Users can refer to the retrieval results by calling the VRML browser (WebSpace, etc.) from the WWW browser (Netscape, Mosaic, etc.) or from SF-gate directly. The VRML browser communicates with the link database, which is made from the index database, and obtains the topologies of hyperlinks, which the system then renders in the VRML browser.

3 Cross-lingual Search Support Capability

The Internet supports a multitude of different languages and it is only natural that users want to access information resources in their native tongue. In light of the fact that most information on the Internet today is in English, we developed a cross-lingual information search support capability that starts by converting Japanese inquiries into English.

First, the input (individual words, phrases, and whole sentences are permitted) is processed by the tagger. After grammatical articles and other extraneous elements are removed, the input is converted to English tokens by a dictionary lookup procedure based on the longest string-matching method.

Multiple translation variants commonly exist, but these are eliminated using a corpus derived from the information collected from the Web. Finally, the words in the original phrases and the words in the translation are each combined using the 'OR' operator and used in the retrieval.
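The lookup and query-building steps above might look like the following minimal sketch. The dictionary entries, the corpus counts, and all function names are invented for illustration. Choosing the candidate with the highest corpus count is only one possible reading of how variants are "eliminated", and returning two separate OR queries is an assumed interpretation of "each combined".

```python
# Toy data, invented for this example.
TOY_DICTIONARY = {
    "情報": ["information"],
    "検索": ["retrieval", "search"],
    "情報検索": ["information retrieval"],
    "システム": ["system"],
}
TOY_CORPUS_COUNTS = {"retrieval": 3, "search": 10}


def pick_translation(candidates, corpus_counts):
    """Keep the candidate seen most often in the corpus (assumed criterion)."""
    return max(candidates, key=lambda c: corpus_counts.get(c, 0))


def longest_match_translate(text, dictionary, corpus_counts):
    """Scan left to right, always taking the longest dictionary entry."""
    max_len = max(len(key) for key in dictionary)
    originals, translations = [], []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in dictionary:
                originals.append(chunk)
                translations.append(
                    pick_translation(dictionary[chunk], corpus_counts))
                i += length
                break
        else:
            i += 1  # no entry starts here; skip one character
    return originals, translations


def build_queries(originals, translations):
    """OR together the original words and, separately, the translated words."""
    english_words = [w for t in translations for w in t.split()]
    return " OR ".join(originals), " OR ".join(english_words)
```

With the toy dictionary, "情報検索システム" is matched as "情報検索" followed by "システム" rather than as "情報" and "検索", which is the point of preferring the longest match.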

To support the retrieval of Web resources that are written in other languages and character codes, TITAN can also identify those character codes and languages automatically. Thus, users can specify the describing languages they want on the first screen when they make a retrieval request, and the describing languages are shown in the output. The basic idea is to use a statistical language model both to select the correctly decoded string and to determine the language. Language identification consists of the following steps (see Figure 3).


Figure 3: Identification of the Coding System

  1. Raw code
    A code string whose coding system is not known is given as input.
  2. Decoding
    The code string is decoded into a character string for every possible coding system.
  3. Calculating the probability of each language
    The most likely language and its likelihood score are calculated for each decoded string using statistical language models.
  4. Selecting
    The algorithm chooses the decoded string with the highest likelihood score. In Figure 3, string(1) is the most likely decoding and its language is identified as Japanese.

Statistical language models are effective for this identification. These models treat a text as a list of tokens, where a token is a word in European languages or a character in East-Asian languages. The likelihood of a list of tokens with regard to a language is the product of the unigram probabilities for the class of every token in the list.
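The four steps and the unigram scoring can be illustrated with the following minimal sketch. It is not TITAN's implementation: the candidate codec list, the fixed floor probability for unseen tokens, and the per-token averaging of log probabilities (used here so that word-based and character-based models can be compared) are all assumptions, and token classes are simplified to the tokens themselves.

```python
import math
from collections import Counter

# Assumed candidate coding systems; the real list is not given in the text.
CANDIDATE_CODECS = ["iso2022_jp", "euc_jp", "shift_jis", "latin_1"]
UNSEEN_PROB = 1e-6  # assumption: floor probability for unseen tokens


def tokenize(text, word_based):
    """Words for European languages, single characters for East-Asian ones."""
    if word_based:
        return text.lower().split()
    return [ch for ch in text if not ch.isspace()]


def train_unigram(samples, word_based):
    """Estimate unigram probabilities from training texts of one language."""
    counts = Counter()
    for sample in samples:
        counts.update(tokenize(sample, word_based))
    total = sum(counts.values())
    probs = {tok: n / total for tok, n in counts.items()}
    return {"word_based": word_based, "probs": probs}


def score(text, model):
    """Average log unigram probability per token."""
    toks = tokenize(text, model["word_based"])
    if not toks:
        return float("-inf")
    logs = [math.log(model["probs"].get(t, UNSEEN_PROB)) for t in toks]
    return sum(logs) / len(logs)


def identify(raw, models):
    """Decode `raw` with every candidate codec, score each decoding under
    every language model, and return (codec, language, decoded text)."""
    best = None
    for codec in CANDIDATE_CODECS:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # this coding system cannot produce a valid string
        for language, model in models.items():
            s = score(text, model)
            if best is None or s > best[0]:
                best = (s, codec, language, text)
    return best[1:] if best else None
```

A wrong decoding usually yields characters that the language models have rarely or never seen, so its score falls to the floor value, while the correct decoding scores well under the model of its own language.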

To train the statistical models of the languages, we gathered 100,000 Web pages, which were classified into the target languages by hand. As a result, we obtained 700 valid Web pages in the target languages.

After the statistical model of each language was obtained, we estimated the accuracy of the language identification module using 640 Web pages. The average error rate is 4.8%, and errors occur when the document is not normal text (for example, a computer program, DNS data, etc.). This value shows that our method achieves a level of correctness equivalent to that of previous methods, which presuppose correctly decoded character strings (Kikui et al. 1996).

4 The Graphical User Interface in TITAN

4.1 Browsing Support

TITAN has an interface that supports browsing by letting users determine at a glance what information each line of the result output represents. To this end, icons that instantly identify the nature of a document are affixed to the end of each line, conveying the following information (Figure 4).


Figure 4: The Interface for Browsing

(1) Document's Title (2) Translated Title (3) Score
(4) Country of Origin (5) Information Format (6) Type of Information
(7) Describing Language

Numbers 1 and 2 are WWW page titles. If the title is in English, TITAN translates it into Japanese. Number 3 is the score assigned by the retrieval engine to show the degree of matching between the request and the actual information in WWW pages. Number 4, country of origin, designates the country where the information was created and is indicated by a national flag symbol. Number 5, the information format, specifies whether the information is an HTML document, a graphic file, or some other kind of file and is designated by a unique icon associated with each type of information. Number 6, information type, designates to what extent a document contains information in and of itself (information content) and to what extent it provides links to other WWW sites (link). This information is conveyed with an oval-shaped icon: the information content aspect is indicated by the upper blue part of the icon, and the link aspect by the lower red part. Number 7, describing languages, specifies the languages in which the WWW page is written. The presentation is 'name of language + degree'. Language names follow the ISO CD 639/2 Draft and include jpn (Japanese), eng (English), bg3 (Chinese), and so on. TITAN can distinguish Chinese, Japanese, Korean, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, and Swedish.

4.2 Information References

WWW space is a very wide hyperlink space in which many URLs are connected to each other by links, and users can move through the Internet by following those links. Although the problems of hyperlink space have already been pointed out (Horn 1989), the following two problems are very serious and deserve special consideration:

For instance, the Internet robot outputs results suited to the user request. Although the results are output in order of the score set by the retrieval engine, some pages are similar in content, and this fact should be made known to users. If they are informed in advance, they can access these pages one after another. If users can make use of a VRML browser, TITAN displays the retrieval results in a 3-dimensional space based on the following information.

  1. Score (X axis)
    This score is based on the one given by the retrieval engine, but it is normalised so that results can be compared easily.
  2. Information content (Y axis)
  3. Link (Z axis)
  4. Country of origin/describing language (Texture)
    VRML can map a texture onto objects (texture mapping). TITAN uses this feature to display the country of origin and the describing language.

Moreover, TITAN indicates which WWW pages come from the same server, the partial topologies of hyperlinks on the Internet, and the class information of pages. As noted above, the problems of such a wide hyperlink space are that users lose their way and are bewildered by the many available branches. Browsing WWW space with VRML helps users grasp the relations among search results (Figure 5).


Figure 5: Output Example with VRML
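The mapping of a search result onto the three axes and a texture can be sketched as below. This is an illustration under several assumptions: min-max scaling is used for the score because the normalisation method is not specified, the information content and link values are taken to lie already between 0 and 1, the texture file naming is invented, and the emitted scene uses VRML 1.0 nodes (`Separator`, `Translation`, `Texture2`, `WWWAnchor`, `Sphere`) in a layout chosen only for this example.

```python
def normalise(scores):
    """Min-max scale raw engine scores to the range 0..1 (assumed method)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def to_vrml(results):
    """Place each result at (score, information content, link) and attach
    a texture named after its country and language (hypothetical naming)."""
    xs = normalise([r["score"] for r in results])
    lines = ["#VRML V1.0 ascii", "Separator {"]
    for x, r in zip(xs, results):
        texture = f'{r["country"]}_{r["language"]}.gif'
        lines += [
            "  Separator {",
            f'    Translation {{ translation {x:.2f} {r["content"]:.2f} {r["link"]:.2f} }}',
            f'    Texture2 {{ filename "{texture}" }}',
            f'    WWWAnchor {{ name "{r["url"]}" Sphere {{ radius 0.05 }} }}',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)
```

Each result becomes a small sphere wrapped in a `WWWAnchor`, so that selecting the object in the VRML browser leads to the corresponding page.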

TITAN has a link database that is built from the index database. This database consists of about 2,000,000 pairs of URLs. Using this information, the topology of hyperlinks and the class information of WWW pages are visualised. Once the user specifies a URL among the search results, TITAN displays the link topology with this URL as the starting point; that is, it extracts the URLs that this URL points to and the URLs that point to it. Finally, a portion of the topology is reconstructed as a solid tree in the VRML space. Furthermore, TITAN derives the class information from this link information. To do this, we focus on the structure of URLs, which have the following form.

    scheme://host.domain [:port] / path [#anchor] [?key]

Scheme shows which protocol the URL uses, host and domain identify the server that holds the information, port gives the port number on the server, path shows where the data reside, and anchor and key are used in requests to CGI scripts, etc. Referring to the host and domain makes clear whether two pieces of information are on the same server, and referring to the path information clarifies the class relation between URLs. This browsing mechanism with VRML helps clarify the Internet hyperlink topology and allows users to assess the value of information.
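A minimal sketch of these two uses of the link information follows. The function names and the relation labels ("upper", "lower", and so on) are hypothetical, and comparing directory paths is only one assumed reading of what "class information" means here.

```python
from urllib.parse import urlsplit


def neighbours(url, link_pairs):
    """Return the URLs that `url` points to and the URLs that point to it.
    `link_pairs` is an iterable of (source URL, destination URL) pairs."""
    points_to = sorted({dst for src, dst in link_pairs if src == url})
    pointed_from = sorted({src for src, dst in link_pairs if dst == url})
    return points_to, pointed_from


def same_server(a, b):
    """Compare only the host and domain part of two URLs."""
    return urlsplit(a).hostname == urlsplit(b).hostname


def directory(url):
    """Path segments of the directory that contains the resource."""
    return [p for p in urlsplit(url).path.split("/")[:-1] if p]


def class_relation(a, b):
    """Describe where `b` sits relative to `a` (labels are illustrative)."""
    if not same_server(a, b):
        return "other server"
    dir_a, dir_b = directory(a), directory(b)
    if dir_a == dir_b:
        return "same level"
    if dir_b[:len(dir_a)] == dir_a:
        return "lower"
    if dir_a[:len(dir_b)] == dir_b:
        return "upper"
    return "different branch"
```

A linear scan over the pairs is used for brevity; with a database of millions of pairs, an index keyed by source and by destination URL would be the more practical choice.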

5 Conclusion

There are many Internet robots that help users explore the information available. The problem is that these robots do not facilitate the extraction of the desired information. Approaches to this problem include increasing the size of the database and speeding up retrieval (for example, Excite, Alta Vista), making retrieval more precise (for example, InfoSeek), and making it easier for users to browse (for example, Yuwono's system, Ayers's system).

TITAN has proven to be a very useful tool because of its cross-lingual access capability and clear navigation interface. The cross-lingual access feature makes it much easier for non-native English speakers to access English Web pages.

Bibliography

1
Hayashi, Y., Kikui, G-I. & Susaki, S. (1995) TITAN, NTT Information and Communication Systems Labs.
URL: http://isserv.tas.ntt.jp/chisho/titan-e.html

2
Excite (1995) Excite Inc.
URL: http://www.excite.com/

3
Koster, M. (1994) The Web Robots Database.
URL: http://info.webcrawler.com/mak/projects/robots/active.html

4
Yahoo, Yahoo Inc.
URL: http://www.yahoo.com/

5
WAIS-sf
URL: http://ls6-www.informatik.uni-dortmund.de/freeWAIS-sf/

6
SF-gate
URL: http://ls6-www.informatik.uni-dortmund.de/SFgate/

7
Juman, Version 2.0
URL: ftp://ftp.aist-nara.ac.jp/pub/nlp/tools/juman/

8
Kikui et al. (1996) Cross-lingual Information Retrieval on the WWW, in Proceedings of the MULSAIC96 (Multilinguality in the Software Industry: The AI Contribution), Budapest, Hungary.

9
ISO639
URL: http://www.stonehand.com/unicode/standard/cd639-2.html

10
Horn, R. E. (1989) Mapping hypertext. The analysis, organization, and display of knowledge for the next generation of on-line text and graphics, The Lexington Institute, Lexington.

11
AltaVista, Digital Equipment Corp.
URL: http://www.altavista.digital.com/

12
Infoseek (1995) Infoseek Inc.
URL: http://www.infoseek.com/

13
Yuwono, B., Lam, S. L. Y., Ying, J. H. & Lee, D. L. (1994) A World Wide Web Resource Discovery System, in The 4th International World Wide Web Conference, Boston, MA, USA.
URL: http://www.w3.org/pub/Conferences/WWW4/Papers/66/

14
Ayers, E. Z. & Stasko, J. T. (1994) Using Graphic History in Browsing the World Wide Web, in The 4th International World Wide Web Conference, Boston, MA, USA.
URL: http://www.w3.org/pub/Conferences/WWW4/Papers2/270/


Organised by: AUUG'96 & CSU