%
% THE \simulan LOGO IS DEFINED HERE.
%
\def\simulan{{\rm s\kern-.06em\raise-.5ex\hbox{i}\kern-.1em\raise-.1ex
\hbox{m}\raise-.3ex\hbox{u}\kern-.10emL\kern-.1667em\lower-.6ex
\hbox{a}\kern-.10emn}}
%% the \PBeam Logo is defined here
\def\PBeam{{\sc\kern.15emP\kern-.9em\raise.125ex\hbox{$\leftarrow$}\sc\kern-.25emB\sc\kern-.1eme\kern-.1ema\kern-.1emm}}
You might also want to look at the companion thesis about modelling
with simuLan:
[Schmida91]
Ralf Schmidt-Dannert:
\simulan: Modellierung und Simulation lokaler Netzwerke.
Diplomarbeit, TU Braunschweig, 1991 (in German).
However, they exhibit problems with unbalanced load, conflicts with interactive users, and an increased failure probability. Process migration and checkpoint/restart are adequate means to solve these problems. Several such mechanisms are described in the literature, but they all have certain weaknesses and impose restrictions on the applications they can handle.
In this thesis, a new concept for an application-transparent migration
and checkpointing mechanism is developed, which overcomes some of the
more substantial of these restrictions. It supports migration and fault
transparency for parallel and distributed applications, i.e. groups of
communicating processes, on clusters of workstations. Neither the
system kernel nor the application programs need to be modified, and
applications are not required to be written for a specific runtime
environment. However, for better performance the applications can be
linked with a modified system library. From this concept, the
architecture of the example implementation
is derived, and first measured results and experiences
with the implementation are reported.
A BibTeX file from my dissertation.
and CoCheck, are presented and
compared. Finally, an outlook is given on new
functionality that can be achieved through integration with interactive
tools.
-- Fehlertoleranz für
verteilte Anwendungen mittels Migration und Checkpointing.
system and its facility for creating globally consistent
checkpoints of groups of communicating processes on networked Unix
systems, and present first experiences with our implementation.
system and focus on its facility for creating globally
consistent checkpoints of groups of communicating processes on
networked Unix systems. We present first experiences with our
implementation.
. There we use a global virtual name space to achieve
location transparency for process migration and checkpointing~/
rollback for distributed applications on clusters of Unix
workstations. First measurements have shown that the
maintenance of this global name space is critical for the
performance of the entire system.
The
system uses a global
virtual name space to provide migration and rollback
transparency in user space for distributed groups of processes
on workstations. Applications always use the same virtual
names for the operating system objects, independent of their
current real location. The system calls are interposed and
their parameters translated between the name spaces. Unlike
other migration mechanisms,
does not require the applications to be written
for a specific programming model or communication library.
The first approach to executing applications in the virtual name space was to link the programs with a modified system library. In this paper, we describe the design and implementation of a separate system call interposition process that accesses the application via the debugging interface. The main advantage of this approach is that it can handle even unmodified (e.g. commercially bought) application programs. We compare measured performance figures with those of previous similar approaches and of the modified system library.
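The name translation described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the class `VirtualNameSpace`, the helper `interposed_signal`, and the `(host, real id)` representation are hypothetical names chosen for illustration, and the real system call is replaced by a callback so that the example stays self-contained.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class VirtualNameSpace:
    """Hypothetical table mapping stable virtual ids to real locations."""
    _real: dict = field(default_factory=dict)     # virtual id -> (host, real id)
    _virtual: dict = field(default_factory=dict)  # (host, real id) -> virtual id
    _next: count = field(default_factory=lambda: count(1))

    def register(self, host, real_id):
        """Assign a fresh virtual id to a real object and return it."""
        vid = next(self._next)
        self._real[vid] = (host, real_id)
        self._virtual[(host, real_id)] = vid
        return vid

    def to_real(self, vid):
        return self._real[vid]

    def to_virtual(self, host, real_id):
        return self._virtual[(host, real_id)]

    def migrate(self, vid, new_host, new_real_id):
        """Rebind a virtual id after migration; the virtual id is unchanged."""
        del self._virtual[self._real[vid]]
        self._real[vid] = (new_host, new_real_id)
        self._virtual[(new_host, new_real_id)] = vid


def interposed_signal(ns, real_signal, vid, signo):
    """Translate the virtual target id, then forward to the real call."""
    host, real_id = ns.to_real(vid)
    return real_signal(host, real_id, signo)
```

In this sketch the application only ever holds the virtual id, so a call issued after `migrate` reaches the new location without the application noticing the move.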
to also provide replication for
distributed applications. The approach to achieve these services
transparently for the application is the construction of a system-wide
virtual name space, which abstracts the running application processes
from the actual location and time of their execution. No changes to
the system kernel or the application code are required. Checkpointing
already is the common base for rollback and for migration, and now we
extend the migration to cloning to also achieve replication. With
application-transparent checkpoint / rollback,
can only detect and treat crash faults. With replication
it is now also able to detect other faults by comparing the output of
the replicas, even though it cannot know the meaning of the output
contents. The abstraction through the virtual name space is an
important prerequisite for enabling error detection through output
comparison on cluster systems. It ensures that no environment specific
information can yield differences in the replica outputs during
fault-free operation.
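As a rough illustration of content-agnostic output comparison, the following sketch compares the raw output bytes of the replicas and flags those that deviate from the majority. Majority voting is an assumption made here for the example; the text above only states that replica outputs are compared. The function name `compare_outputs` and the replica labels are hypothetical.

```python
from collections import Counter


def compare_outputs(outputs):
    """Compare raw replica outputs without interpreting their contents.

    outputs: dict mapping a replica label to the bytes it produced.
    Returns (majority_output, suspects). If no strict majority exists,
    returns (None, all replica labels).
    """
    tally = Counter(outputs.values())
    value, votes = tally.most_common(1)[0]
    if votes <= len(outputs) // 2:
        return None, sorted(outputs)  # no strict majority: cannot decide
    suspects = sorted(r for r, out in outputs.items() if out != value)
    return value, suspects
```

With only two replicas a mismatch can be detected but not attributed to one of them, which is why the sketch returns both labels in that case.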
Some amount of data is kept distributed or replicated on some or all nodes of a distributed system. At every moment, each instance that accesses this data must see the same information. Updates must be delivered in order, reliably, and efficiently.
Our prototype software implements ordered, reliable multicasts on top of the unreliable IP broad- or multicast with three different methods (Master-Slave, Token Exchange on Demand, Totem Single Ring). This paper shows measurement results for the efficiency and scalability of the three methods in different topologies.
The measurements confirm earlier analytical results. Totem behaves well in large networks with many concurrent senders. The overhead of Token on Demand and of the Master-Slave algorithm is almost the same. We also found no evidence for the frequently stated opinion that the Master-Slave approach scales worse because of the central bottleneck.
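The idea of ordered, reliable delivery on top of an unreliable multicast can be sketched as follows. This is one common way to realize a master-based ordering and is not taken from the prototype: it assumes a single master that stamps every update with a global sequence number, and receivers that hold back out-of-order messages and report gaps for retransmission. The class names `Sequencer` and `Receiver` are hypothetical.

```python
class Sequencer:
    """Hypothetical master: stamps each update with a global sequence number."""

    def __init__(self):
        self._next = 0

    def order(self, payload):
        seq = self._next
        self._next += 1
        return (seq, payload)


class Receiver:
    """Delivers updates strictly in sequence order despite loss or reordering."""

    def __init__(self):
        self._expected = 0   # next sequence number to deliver
        self._held = {}      # out-of-order messages waiting for a gap to close
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self._expected:
            return  # duplicate of an already delivered message
        self._held[seq] = payload
        while self._expected in self._held:
            self.delivered.append(self._held.pop(self._expected))
            self._expected += 1

    def missing(self):
        """Sequence numbers to request again (gaps below the highest held)."""
        if not self._held:
            return []
        return [s for s in range(self._expected, max(self._held))
                if s not in self._held]
```

A receiver that sees message 2 before messages 0 and 1 delivers nothing at first, reports `[0, 1]` as missing, and delivers all three in order once the gap is closed.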