Evaluating the Performance of Distributed Architectures for Information Retrieval Using a Variety of Workloads

Cahoon Brendon; Mckinley Kathryn S; Mckinley K S

DSpace Home
→
Ingenierías y Ciencias de la Computación
→
*Ingenierías y Ciencias de la Computación (Proyecto VLIR)
→
Documentos
→
View Item

dc.contributor.author	Cahoon Brendon
dc.contributor.author	Mckinley Kathryn S
dc.contributor.author	Mckinley K S
dc.date.accessioned	2018-01-22T17:23:10Z
dc.date.available	2018-01-22T17:23:10Z
dc.identifier.uri	http://hdl.handle.net/123456789/6830
dc.description.abstract	The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery unified IR system. To explore the design space more fully, we also implement and validate a flexible simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate. 1. INTRODUCTION The increasing numbers of large, unstructured text collections require full-text information retrieval (IR) systems in order for users to access them effectively. Current systems typically only allow users to connect to a single database either locally or perhaps on another machine. A distributed IR system should be able to provide multiple users with concurrent, efficient access to multiple text collections located on disparate sites. Since the documents in unstructured text collections are independent, IR systems are ideal applications to distribute across a network of workstations. However, the high resource demands of IR systems limit their performance, especially as the number of users, as well as the size and number of text collections, increases. Distributed computing offers a solution to these problems. Only recently have people published work on distributed architectures for information retrieval. The Very Large Collection track in the TREC conferences promotes the development of distributed and shared memory architectures for IR [Hawking and Thistlewaite 1997; Hawking et al. 1998]. Several researchers created distributed IR systems and demonstrated the feasibility of distributed architectures for information retrieval [Harman et al. 1991; Macleod et al. 1987]. However, it is not clear from these initial implementations how the systems will perform in practice, since, unlike the case for database systems, very little experimental investigation of distributed IR systems has been performed, and researchers have typically limited their collection sizes to at most 1GB. The focus of this article is to design large-scale distributed IR architectures by analyzing the performance of potential systems under a variety of workloads. In our work, we simulate configurations with up to 128 servers, each containing a 1GB text collection (i.e., 128GB of total text). We begin with a prototype implementation of a distributed information retrieval system using Inquery; an inference network, full-text information retrieval model [Callan et al. 1992]. Our system adopts a variation of the client-server paradigm that consists of clients connected to information retrieval engines through a central administration broker, as we illustrate in Figure 1. In our experiments, we model a simple hardware configuration using a single CPU and disk. In our prototype, we use a multithreaded version of the Inquery server to obtain high machine utilization. We illustrate that using a single-threaded Inquery server does not effectively utilize CPU and disk resources; we achieve the best performance when we use four threads.
dc.format	application/pdf
dc.title	Evaluating the Performance of Distributed Architectures for Information Retrieval Using a Variety of Workloads
dc.type	journal-article
dc.source.volume	18
dc.source.issue	1
dc.source.journal	ACM Transactions on Information Systems