Grid'5000 experiment

Jump to: navigation, search

Distributed Generation of Large Scale Permutations (Application)

Conducted by

Jens Gustedt


For testing and benchmarking the generation of large random input data with known probability distributions is crucial. In [GUSTEDT:2008:INRIA-00000900:2], we show how to uniformly distribute data at random in two related settings: coarse grained parallelism and external memory. In contrast to previously known work for parallel setups, our method is able to fulfill the three criteria of uniformity, work-optimality and balance among the processors simultaneously. To guarantee the uniformity we investigate the matrix of communication requests between the processors. We show that its distribution is a generalization of the multivariate hypergeometric distribution and we give algorithms to sample it efficiently in the two settings.

If instead of shuffling existing data we constrain ourselves to produce permutations of integers 0, ..., n for some large number n, solutions that are much more efficient become possible. First, we are able to show that the communication mentioned above can be improved by using adapted compression techniques. In particular the information theoretic lower bound of Theta(n log p) (instead of Theta(n log n)) for the overall communication can be matched up to a constant factor. Second, if instead of communicating data (integers in that case) we generate them in place, we achieve a scheme that allows for a trade-off between the number of random bits (ie the quality of the target distribution) and the total running time of the algorithm. This approach is presented in [GUSTEDT:2008:INRIA-00312131:1].




  • Nodes involved: 100
  • Sites involved: 1
  • Minimum walltime: 1h
  • Batch mode: yes
  • Use kadeploy: no
  • CPU bound: yes
  • Memory bound: yes
  • Storage bound: no
  • Network bound: yes
  • Interlink bound: no

Tools used



Not yet

Shared by: Jens Gustedt
Last update: 2008-11-19 10:52:37
Experiment #522