A Framework for the Study of Grid Inter-Operation Mechanisms
The study of the history of computing infrastructures reveals an integration trend. For example,
the explosive growth of the Internet in the 1990s was the result of an integration process started in
the 1960s with the emerging networks of computers. By using the Internet, millions of users were
capable of accessing information anytime and anywhere, much like other daily utilities such as water,
electricity, and telephone. However, an important category of users remained under-served: the users
with large computational and storage requirements, e.g., the scientists, the companies that focus on
data analysis, and the governmental departments that manage the interaction between the state and
the population (such as census, tax, and public health). Thus, in the mid-1990s, the vision of the Grid
as a universal computing utility was formulated. The main benefits promised by the Grid are similar
to those of other integration efforts: extended and optimized service of the integrated network, and
significant reductions of maintenance and operation costs through sharing and better scheduling.
While the universal Grid has yet to be developed, large-scale distributed computing infrastructures
that provide their users with seamless and secure access to computing resources, individually called
Grid parts or simply grids, have been built throughout the world -- in different countries, for different
sciences, and both for production work and for computer-science research. At the same time, the main
technological alternatives to grids, that is, supercomputers and large clusters, have evolved into much
larger, scalable, and reliable systems. Thus, the integration of existing grids into larger infrastructures
and finally into The Grid is key to keeping the grid vision attractive for its potential users.
The integration of grids raises a double challenge: the first is the efficient scaling of a distributed
computing system, and the second is the operation of a system across different
ownership and administrative domains. Thus, many of the traditional approaches for inter-operating
computer systems, such as those based on completely centralized or purely decentralized system
architectures, are eliminated from the start. To mark the distinction between the typical problem of
integrating smaller components into a larger system and the double challenge of grid integration, we
call the latter the problem of grid inter-operation. In this thesis we approach the problem of grid
inter-operation with two main objectives: to design a comprehensive framework for the study of grid
inter-operation mechanisms, and to provide an initial but good solution for this problem.
Our framework provides both the theoretical support and the tools for finding new and improved
solutions for this problem. The tools are assembled into a research toolbox for the study of grid inter-operation
mechanisms. This research toolbox addresses two problems that have hampered the grid
community in the past decade: the lack of knowledge about the workloads and resources of real grids,
and the lack of tools for grid simulation and performance evaluation in real environments. Research
using unrealistic characteristics or characteristics that are specific to other types of environments is
limited in scope and applicability, and may even miss the problems that are specific to grids.
Thus, real data and realistic models of grid workloads and resources are critical for designing efficient
and scalable architectures. Using tools that have not been adapted to the requirements of grids, whether for
simulation or for performance evaluation in real environments, leads to slower progress and to results
that are difficult to compare. Thus, tools adapted to grids and aimed at producing results that can
be shared with other researchers are needed.
The content of this thesis is split into four logical parts: the introduction, a toolbox for grid
inter-operation research, a method for grid inter-operation, and the conclusion.
We begin the thesis with an introduction to the problem of grid inter-operation that focuses on the
challenges of grid inter-operation addressed by this thesis. In Chapter 1 we also present an overview of
the framework for the study of grid inter-operation mechanisms introduced in this thesis. In Chapter 2
we introduce a basic model for grid inter-operation. This model, required to understand the remainder
of the thesis, defines the components of a grid system, the types of applications that can be found in
a grid, the system users, and the grid job execution model.
The toolbox for grid inter-operation research is presented in Chapters 3, 4, 5, and 6, which we
describe in turn. In Chapter 3 we present the Grid Workloads Archive (GWA). We design the GWA
with a focus on building a grid workload data repository, and on establishing a community center
around the archived data. One of the important design achievements is the formulation of a grid
workload format for storing job-level information that can be extended for higher-level information such
as co-allocated jobs or resource reservations. We develop a comprehensive set of tools for collecting,
processing, and using grid workloads. To make the GWA accessible to non-expert users, we devise
a mechanism for automated trace ranking and selection. So far, the GWA contains traces from nine
well-known grid environments, with a total content of more than 2,000 users submitting more than 7
million jobs over a period of over 13 operational years, and with working environments spanning over
130 sites comprising 10,000 resources.
In Chapter 4 we describe the extension of the basic model for grid environments into a comprehensive
model for (multi-)grids. By analyzing real data such as long-term system traces of real grids,
we find that grid resources exhibit a highly dynamic availability both over the course of single days
and over whole years. We also find that grid workloads are very different from the workloads of other
related systems such as parallel production environments and distributed web servers. Based on the
results of this analysis, we design and validate a comprehensive model for grid resource dynamics and
evolution, and for grid workloads that include parallel jobs and/or bags-of-tasks.
In Chapter 5 we introduce the GrenchMark testing framework. The main focus of this framework
is on testing large-scale distributed computing systems with synthetically generated yet realistic
workloads. We test and validate our reference implementation of the GrenchMark framework, and
show that GrenchMark has been successful in testing real multi-cluster grids and pools of resources.
The experimental results show that a grid testing tool focusing on realistic workloads can indeed be
used to assess important characteristics of real systems that are otherwise not available, such as
scalability limits, overheads, and reliability.
To conclude the presentation of our grid research toolbox, in Chapter 6 we introduce the DGSim
grid simulation framework. The main focus of this framework is on facilitating repeated simulations
of multi-cluster and multi-grid environments under realistic workloads. We test and validate our
reference implementation of the DGSim framework, and show that DGSim has been successful as the
simulation tool for several design space exploration studies of grid settings that are larger than the
previous state-of-the-art.
The method for grid inter-operation and a solution for the grid inter-operation problem are presented
in Chapters 7 and 8, which we describe in turn. In Chapter 7 we study the existing alternatives
for grid inter-operation, and introduce a novel architecture for grid inter-operation. We classify real
grid systems according to their architectural and operational components. The practical limitations
of the centralized grid inter-operation approaches are evaluated in a real environment. These two
preliminary steps allow us to assess the grid inter-operation ability of existing grid resource management
systems; we find that this ability is limited. Thus, we introduce a novel architecture for
grid inter-operation with a better potential of fulfilling the requirements of grid inter-operation. The
architecture is a hybrid between hierarchical and purely decentralized architectures. The set of architectures
investigated here provides a comprehensive architectural space for the problem of grid
inter-operation.
In Chapter 8 we introduce a novel approach for grid inter-operation, Delegated MatchMaking.
Our approach, which couples the hybrid architecture introduced in the previous chapter with a novel
inter-operation mechanism, is compared with five alternatives through trace-based simulations, and is
found to deliver the best performance, especially when the system is heavily loaded. While many other
mechanisms can be designed in the future, our experiments prove that the Delegated MatchMaking
approach already is a good solution for the problem of grid inter-operation. Our experiments also
demonstrate that the inter-operation of existing grids can lead to significant performance gains in
comparison with letting them operate independently.
At the end of this thesis, Chapter 9 summarizes our main achievements and presents future
directions for this work. The direct use of the framework for the study of grid inter-operation mechanisms
holds good promise for future research. In particular, "How many clusters are best?" and other related
questions about the system structure can find answers under this framework, leading to important contributions
to automating system provisioning and administration. With extensions, our framework can
be used to investigate important classes of resource management problems, such as mechanisms and
incentives for more system decentralization, scheduling for specific classes of applications or scheduling
under less strict information availability assumptions, and guarantees for Quality-of-Service for
commercial workloads. We have already taken initial steps in several of these directions.