AlgoPARC Publications

Conference Proceedings

Abstract: Graphics Processing Units (GPUs) have emerged as popular architectures for high-performance computing due to their parallelism and low-latency context switching capabilities. However, optimizing GPU implementations can be challenging due to the complexity of the architecture, such as the diverse characteristics of memory units. While most optimization efforts focus on parallelism and global memory access, for some algorithms memory conflicts in shared memory, known as bank conflicts, can significantly impact performance. This affects the accuracy of theoretical runtime analysis of GPU algorithms. In this paper, we present a number-theoretic solution for eliminating all bank conflicts for the Thrust library's mergesort implementation -- the fastest comparison-based sorting implementation on GPUs. Our experiments demonstrate that the modified mergesort takes virtually the same time to run on the worst-case inputs as it does on random inputs (the worst-case inputs have been shown in the past to cause up to 50% slowdown).
@inproceedings{p25-spaa, author = {Berney, Kyle and Sitchinava, Nodari}, title = {Eliminating bank conflicts in {GPU} mergesort}, booktitle = {Proceedings of the 37th ACM Symposium on Parallelism in Algorithms and Architectures}, series = {SPAA '25}, year = {2025}, month = jul, doi = {10.1145/3694906.3743337}, software = {https://github.com/algoparc/GPU-CFMerge} }
Abstract:

We study the predecessor search problem in the classical PRAM model of computation. In this problem, the input is a set of $n$ $\ell$-bit integers and the goal is to store the input in a data structure of size $S(n)$ such that, given a query value $q$, the predecessor of $q$ can be found efficiently. This is a classical problem with an extensive history.

We prove a lower bound for this problem in the strongest CRCW PRAM model. A simplified version of the lower bound states that in a $K$-processor PRAM model with $O(\log n)$-bit registers, the query requires $\Omega(\log_K\log n)$ worst-case time under the realistic setting where the space is near-linear.
@inproceedings{p25-soda, author = {Afshani, Peyman and Sitchinava, Nodari}, title = {A cell-probe lower bound for the predecessor search problem in {PRAM}}, booktitle = {Proceedings of the 36th ACM-SIAM Symposium on Discrete Algorithms}, series = {SODA '25}, pages = {3998-4008}, year = {2025}, month = jan, doi = {10.1137/1.9781611978322.136} }
Abstract: We present a thorough investigation of the All Nearest Smaller Values (ANSV) problem from a practical perspective. The ANSV problem is defined as follows: given an array $A$ consisting of $n$ values, for each entry $A_i$ compute the largest index $l < i$ and the smallest index $r > i$ such that $A_i > A_l$ and $A_i > A_r$, i.e., the indices of the nearest smaller values to the left and to the right of $A_i$. The ANSV problem was solved by Berkman, Schieber, and Vishkin [J. Algorithms, 1993] in the PRAM model. Their solution in the CREW PRAM model, which we will refer to as the BSV algorithm, achieves optimal $\mathcal{O}\!\left(n\right)$ work and $\mathcal{O}\!\left(\log n\right)$ span. Until now, the BSV algorithm has been perceived as too complicated for practical use, and we are not aware of any publicly available implementations. Instead, the best existing practical solution to the ANSV problem is the implementation by Shun and Zhao presented at DCC’13. They implemented a simpler $\mathcal{O}\!\left(n\log n\right)$-work algorithm with an additional heuristic first proposed by Blelloch and Shun at ALENEX’11. We refer to this implementation as the BSZ algorithm. In this paper, we implement the original BSV algorithm and demonstrate its practical efficiency. Despite its perceived complexity, our results show that its performance is comparable to the BSZ algorithm. We also present the first theoretical analysis of the heuristic implemented in the BSZ algorithm and show that it provides a tunable trade-off between optimal work and optimal span. In particular, we show that it achieves $\mathcal{O}\!\left(n\left(1 + \frac{\log{n}}{k}\right)\right)$ work and $\mathcal{O}\!\left(k(1+\log{\frac{n}{k}})\right)$ span, for any integer parameter $1 \le k \le n$. Thus, for $k = \Theta\!\left(\log n\right)$, the BSZ algorithm can be made to be work-optimal, albeit at the expense of increased span compared to BSV.
Our discussion includes a detailed examination of different input types, particularly highlighting that for random inputs, the low expected distance between values and their nearest smaller values renders simple algorithms efficient. Finally, we analyze the input/output (I/O) complexities of the BSV algorithm.
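As a point of reference for the problem definition above, the following minimal sketch computes all nearest smaller values sequentially with a stack. It is the textbook linear-time sequential method, not the BSV or BSZ parallel algorithm; the function name `ansv` and the use of `None` for "no smaller value exists" are assumptions made for illustration.

```python
def ansv(a):
    """For each index i, return the index of the nearest strictly smaller
    value to the left and to the right of a[i] (None if there is none)."""
    n = len(a)
    left = [None] * n
    right = [None] * n

    # Left-to-right pass: the stack holds indices of strictly increasing values.
    stack = []
    for i in range(n):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        left[i] = stack[-1] if stack else None
        stack.append(i)

    # Right-to-left pass: same idea, mirrored.
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        right[i] = stack[-1] if stack else None
        stack.append(i)

    return left, right
```

Each index is pushed and popped at most once per pass, so the sketch performs linear work, which is the work bound the parallel algorithms aim to match.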
@inproceedings{p24-spaa, author = {Sitchinava, Nodari and Svenning, Rolf}, title = {The all nearest smaller values problem revisited in practice, parallel and external memory}, booktitle = {Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures}, series = {SPAA '24}, pages = {259-268}, year = {2024}, month = jun, doi = {10.1145/3626183.3659979}, software = {https://github.com/algoparc/ANSV} }
Abstract: Program performance on modern hardware is characterized by locality of reference, that is, it is faster to access data that is close in address space to data that has been accessed recently than data in a random location. This is due to many architectural features including caches, prefetching, virtual address translation and the physical properties of a hard disk drive; attempting to model all the components that constitute the performance of a modern machine is impossible, especially for general algorithm design purposes. What if one could prove an algorithm is asymptotically optimal on all systems that reward locality of reference, no matter how it manifests itself within reasonable limits? We show that this is possible, and that excluding some pathological cases, cache-oblivious algorithms that are asymptotically optimal in the ideal-cache model are asymptotically optimal in any reasonable setting that rewards locality of reference. This is surprising as the cache-oblivious framework envisions a particular architectural model involving blocked memory transfer into a multi-level hierarchy of caches of varying sizes, and was not designed to directly model locality-of-reference correlated performance.
@inproceedings{p22-apocs, author = {Afshani, Peyman and Iacono, John and Jayapaul, Varunkumar and Karsin, Ben and Sitchinava, Nodari}, title = {Locality-of-reference optimality of cache-oblivious algorithms}, booktitle = {Proceedings of the Third SIAM Symposium on Algorithmic Principles of Computer Systems}, series = {APOCS '22}, month = jan, year = {2022}, pages = {31-45}, doi = {10.1137/1.9781611977059.3}, arxiv = {https://arxiv.org/abs/1902.07928} }
Abstract: We prove an $\Omega(\log n \log \log n)$ lower bound for the span of implementing the $n$-input, $\log n$-depth FFT circuit (also known as butterfly network) in the nonatomic binary fork-join model. In this model, memory-access synchronizations occur only through fork operations, which spawn two child threads, and join operations, which resume a parent thread when its child threads terminate. Our bound is asymptotically tight for the nonatomic binary fork-join model, which has been of interest of late, due to its conceptual elegance and ability to capture asynchrony. Our bound implies a super-logarithmic lower bound in the nonatomic binary fork-join model for implementing the butterfly merging networks used, e.g., in Batcher’s bitonic and odd-even mergesort networks. This lower bound also implies an asymptotic separation result for the atomic and nonatomic versions of the fork-join model, since, as we point out, FFT circuits can be implemented in the atomic binary fork-join model with span equal to their circuit depth.
@inproceedings{p21-soda, author = {Goodrich, Michael T. and Jacob, Riko and Sitchinava, Nodari}, title = {Atomic power in forks: a super-logarithmic lower bound for implementing butterfly networks in the Nonatomic Binary Fork-Join model}, booktitle = {Proceedings of the 32nd ACM-SIAM Symposium on Discrete Algorithms}, series = {SODA '21}, month = jan, year = {2021}, pages = {2141-2153}, doi = {10.1137/1.9781611976465.128} }
Abstract: When lexicographically sorting strings, it is not always necessary to inspect all symbols. For example, the lexicographical rank of europar amongst the strings eureka, eurasia, and excells only depends on its so-called relevant prefix euro. The distinguishing prefix size $D$ of a set of strings is the number of symbols that actually need to be inspected to establish the lexicographical ordering of all strings. Efficient string sorters should be $D$-aware, i.e., their complexity should depend on $D$ rather than on the total number $N$ of all symbols in all strings. While there are many $D$-aware sorters in the sequential setting, there appear to be no such results in the PRAM model. We propose a framework yielding a $D$-aware modification of any existing PRAM string sorter. The derived algorithms are work-optimal with respect to their original counterpart: If the original algorithm requires $O(w(N))$ work, the derived one requires $O(w(D))$ work. The execution time increases only by a small factor that is logarithmic in the length of the longest relevant prefix. Our framework universally works for deterministic and randomized algorithms in all variations of the PRAM model, such that future improvements in ($D$-unaware) parallel string sorting will directly result in improvements in $D$-aware parallel string sorting.
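The notion of a relevant prefix in the abstract above can be made concrete with a small sequential sketch. It assumes pairwise distinct strings and takes the relevant prefix of a string to be one symbol longer than its longest common prefix with any other string, capped at the string's length; the helper names `lcp` and `distinguishing_prefix_lengths` are hypothetical, and the paper's exact definition (for example, its treatment of terminators) may differ.

```python
def lcp(s, t):
    """Length of the longest common prefix of s and t."""
    k = 0
    while k < len(s) and k < len(t) and s[k] == t[k]:
        k += 1
    return k


def distinguishing_prefix_lengths(strings):
    """Map each string to the length of its relevant prefix.

    Assumes the strings are pairwise distinct. After sorting, the longest
    common prefix of a string with any other string is attained by one of
    its two neighbours in sorted order.
    """
    order = sorted(strings)
    result = {}
    for i, s in enumerate(order):
        left = lcp(order[i - 1], s) if i > 0 else 0
        right = lcp(s, order[i + 1]) if i + 1 < len(order) else 0
        result[s] = min(len(s), max(left, right) + 1)
    return result
```

For the four strings in the abstract this yields length 4 for europar (the prefix euro), and summing the lengths gives one way to compute $D$ under the stated assumption. The sketch sorts first, so it only illustrates the definition and is not a $D$-aware sorter.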
@inproceedings{p20-europar, author = {Ellert, Jonas and Fischer, Johannes and Sitchinava, Nodari}, title = {{LCP}-aware parallel string sorting}, booktitle = {Proceedings of the 26th International European Conference on Parallel and Distributed Computing}, series = {Euro-Par '20}, pages = {329-342}, month = aug, year = {2020}, doi = {10.1007/978-3-030-57675-2_21}, arxiv = {https://arxiv.org/abs/2006.02219}, keywords = {String sorting, Lexicographical sorting, Parallel, PRAM, Distinguishing prefix, Longest common prefix, LCP, Karp-Rabin fingerprints} }
Abstract:

Currently, the fastest comparison-based sorting implementation on GPUs is implemented using a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads $t$ to sort the input of $N$ elements is fine-tuned experimentally for each generation of Nvidia GPUs in such a way that the number of elements $E = N/t$ that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency-hiding through thread oversubscription/occupancy.

In this paper, we show that for every choice of $E < w$ such that $E$ and $w$ are co-prime, there exists an input permutation on which every warp of $w$ threads of the Thrust merge sort is effectively reduced to using at most $\lceil w/E \rceil$ threads, because bank conflicts sequentialize its shared memory accesses. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing $wE$ contiguous shared memory locations.

Our proof is constructive, i.e., we are able to automatically construct such a permutation for every value of $E$. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.
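To illustrate what a bank conflict costs, the sketch below counts how many serialized rounds one warp needs under a simplified model: address `a` lives in bank `a % num_banks`, distinct addresses in the same bank are served one after another, and identical addresses are served together. The model, the default of 32 banks, and all function names are assumptions for illustration. The strided pattern is only a toy access pattern, not the worst-case permutation constructed in the paper.

```python
from collections import defaultdict
from math import gcd


def conflict_degree(addresses, num_banks=32):
    """Serialized rounds needed to serve one warp's simultaneous accesses."""
    per_bank = defaultdict(set)
    for a in addresses:
        per_bank[a % num_banks].add(a)  # identical addresses count once
    return max((len(s) for s in per_bank.values()), default=0)


def strided_addresses(stride, num_threads=32):
    """Thread t accesses address t * stride."""
    return [t * stride for t in range(num_threads)]


def expected_strided_degree(stride, num_banks=32):
    """Closed form for the strided pattern when the warp size equals the
    number of banks: gcd(stride, num_banks) threads share each used bank."""
    return gcd(stride, num_banks)
```

In this toy model a stride that is co-prime with the number of banks is conflict-free, while a stride equal to the number of banks sends every thread to the same bank. The abstract's point is that the data-dependent accesses of merging can still collide even when $E$ and $w$ are co-prime.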
@inproceedings{p20-ipdps, author = {Berney, Kyle and Sitchinava, Nodari}, title = {Engineering worst-case inputs for pairwise merge sort on {GPUs}}, booktitle = {Proceedings of the 34th IEEE International Parallel and Distributed Processing Symposium}, series = {IPDPS '20}, year = {2020}, month = may, pages = {1133-1142}, doi = {10.1109/IPDPS47924.2020.00119} }
Abstract: We initiate a study of algorithms with a focus on the computational complexity of individual elements, and introduce the fragile complexity of comparison-based algorithms as the maximal number of comparisons any individual element takes part in. We give a number of upper and lower bounds on the fragile complexity for fundamental problems, including Minimum, Selection, Sorting and Heap Construction. The results include both deterministic and randomized upper and lower bounds, and demonstrate a separation between the two settings for a number of problems. The depth of a comparator network is a straightforward upper bound on the worst-case fragile complexity of the corresponding fragile algorithm. We prove that fragile complexity is a different and strictly easier property than the depth of comparator networks, in the sense that for some problems a fragile complexity equal to the best network depth can be achieved with less total work and that with randomization, even a lower fragile complexity is possible.
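The definition of fragile complexity above can be illustrated by counting, for each element, how many comparisons it takes part in while the minimum is computed. The sketch compares a left-to-right scan with a balanced knockout tournament. Both are textbook procedures chosen for illustration and are not taken from the paper; the function names are hypothetical, and both functions assume a non-empty input.

```python
def min_linear_scan(values):
    """Find the minimum by a left-to-right scan.

    Returns (index of the minimum, comparisons per index).
    """
    count = [0] * len(values)
    best = 0
    for i in range(1, len(values)):
        count[best] += 1
        count[i] += 1
        if values[i] < values[best]:
            best = i
    return best, count


def min_tournament(values):
    """Find the minimum by a balanced knockout tournament.

    Returns (index of the minimum, comparisons per index).
    """
    count = [0] * len(values)
    alive = list(range(len(values)))
    while len(alive) > 1:
        winners = []
        for k in range(0, len(alive) - 1, 2):
            i, j = alive[k], alive[k + 1]
            count[i] += 1
            count[j] += 1
            winners.append(i if values[i] <= values[j] else j)
        if len(alive) % 2 == 1:
            winners.append(alive[-1])  # unpaired element advances for free
        alive = winners
    return alive[0], count
```

Both procedures use $n - 1$ comparisons in total, but in the scan a single element may take part in all of them, whereas in the tournament no element takes part in more than $\lceil \log_2 n \rceil$.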
@inproceedings{p19-esa, author = {Afshani, Peyman and Fagerberg, Rolf and Hammer, David and Jacob, Riko and Kostitsyna, Irina and Meyer, Ulrich and Penschuck, Manuel and Sitchinava, Nodari}, title = {Fragile complexity of comparison-based algorithms}, booktitle = {Proceedings of the 27th Annual European Symposium on Algorithms}, series = {ESA '19}, pages = {2:1-2:19}, year = {2019}, month = sep, award = {ESA Track A Best Paper Award}, doi = {10.4230/LIPIcs.ESA.2019.2}, arxiv = {https://arxiv.org/abs/1901.02857} }
Abstract:

We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters with asymptotic analysis of the number of memory accesses by an algorithm. Using this formula we analyze and compare several GPU sorting algorithms, identifying key performance bottlenecks in each one of them. Based on this analysis we propose a GPU-efficient multiway mergesort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware.

We realize an implementation of GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of $21\%$ for random integer inputs and $14\%$ for random key-value pairs.
@inproceedings{p18-ics-o, author = {Karsin, Ben and Weichert, Volker and Casanova, Henri and Iacono, John and Sitchinava, Nodari}, title = {Analysis-driven engineering of comparison-based sorting algorithms on {GPUs}}, booktitle = {Proceedings of the 32nd ACM International Conference on Supercomputing}, series = {ICS '18}, pages = {86-95}, year = {2018}, month = jun, doi = {10.1145/3205289.3205298}, software = {https://github.com/algoparc/GPU-MMS} }
Abstract: We present parallel algorithms to efficiently permute a sorted array into the level-order binary search tree (BST), level-order B-tree (B-tree), and van Emde Boas (vEB) layouts in-place. We analytically determine the complexity of our algorithms and empirically measure their performance. Results indicate that on both CPU and GPU architectures B-tree layouts provide the best query performance. However, when considering the total time to permute the data and to perform a series of search queries, our vEB permutation provides the best performance on the CPU. We show that, given an input of $N=500\mathrm{M}$ $64$-bit integers, the benefits of query performance (compared to binary search) outweigh the cost of in-place permutation using our algorithms when performing at least $5\mathrm{M}$ queries ($1\%$ of $N$) and $27\mathrm{M}$ queries ($6\%$ of $N$), on our CPU and GPU platforms, respectively.
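For readers unfamiliar with implicit layouts, the sketch below builds the level-order BST layout of a sorted array and searches it. It is deliberately simple: it is sequential and writes into a second array, whereas the paper's algorithms are parallel and in-place. The function names are hypothetical.

```python
def to_level_order(sorted_vals):
    """Return the level-order (breadth-first) BST layout of a sorted list.

    The children of position i are stored at positions 2*i + 1 and 2*i + 2.
    """
    n = len(sorted_vals)
    out = [None] * n
    it = iter(sorted_vals)

    def fill(i):
        # An in-order walk of the implicit tree visits positions in sorted order.
        if i < n:
            fill(2 * i + 1)
            out[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return out


def search_level_order(layout, key):
    """Return the position of key in the layout, or -1 if it is absent."""
    i = 0
    while i < len(layout):
        if key == layout[i]:
            return i
        i = 2 * i + 1 if key < layout[i] else 2 * i + 2
    return -1
```

For the sorted input 1 through 7 the layout is 4, 2, 6, 1, 3, 5, 7, so a search walks one root-to-leaf path through consecutive levels of the array.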
@inproceedings{p18-ipdps, author = {Berney, Kyle and Casanova, Henri and Higuchi, Alyssa and Karsin, Ben and Sitchinava, Nodari}, title = {Beyond binary search: parallel in-place construction of implicit search tree layouts}, booktitle = {Proceedings of the 32nd International Parallel and Distributed Processing Symposium}, series = {IPDPS '18}, pages = {1070-1079}, year = {2018}, month = may, doi = {10.1109/IPDPS.2018.00116} }
Abstract: Visibility graph reconstruction, which asks us to construct a polygon that has a given visibility graph, is a fundamental problem with unknown complexity (although visibility graph recognition is known to be in PSPACE). We show that two classes of uniform step length polygons can be reconstructed efficiently by finding and removing rectangles formed between consecutive convex boundary vertices called tabs. In particular, we give an $O(n^2m)$-time reconstruction algorithm for orthogonally convex polygons, where $n$ and $m$ are the number of vertices and edges in the visibility graph, respectively. We further show that reconstructing a monotone chain of staircases (a histogram) is fixed-parameter tractable, when parameterized on the number of tabs, and polynomially solvable in time $O(n^2m)$ under reasonable alignment restrictions.
@inproceedings{p17-gd, author = {Sitchinava, Nodari and Strash, Darren}, title = {Reconstructing generalized staircase polygons with uniform step length}, booktitle = {Proceedings of the 25th International Symposium on Graph Drawing}, series = {GD '17}, pages = {88-101}, year = {2017}, month = sep, doi = {10.1007/978-3-319-73915-1_8}, arxiv = {https://arxiv.org/abs/1708.09842} }
Abstract: Motivated by the asymmetric read and write costs of emerging non-volatile memory technologies, we study lower bounds for the problems of sorting, permuting and multiplying a sparse matrix by a dense vector in the asymmetric external memory model (AEM). Given an AEM with internal (symmetric) memory of size $M$, transfers between symmetric and asymmetric memory in blocks of size $B$ and the ratio $\omega$ between write and read costs, we show an $\Omega(\min\{N, \frac{\omega N}{B}\log_{\frac{\omega M}{B}} \frac{N}{B}\})$ lower bound for the cost of permuting $N$ input elements. This lower bound also applies to the problem of sorting $N$ elements. This proves that the existing sorting algorithms in the AEM model are optimal to within a constant factor for reasonable ranges of parameters $N$, $M$, $B$, and $\omega$. We also show a lower bound of $\Omega\left(\min\left\{H,\frac{\omega H}{B} \log_{\frac{\omega M}{B}} \frac{N}{\max\{\delta ,M\}} \right\} \right)$ for the cost of multiplying an $N \times N$ matrix with at most $H=\delta N$ non-empty entries by a vector with $N$ elements.
@inproceedings{p17-spaa, author = {Jacob, Riko and Sitchinava, Nodari}, title = {Lower bounds in the Asymmetric External Memory model}, booktitle = {Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures}, series = {SPAA '17}, pages = {247-254}, year = {2017}, month = jul, doi = {10.1145/3087556.3087583} }
Abstract:

Let $T$ be a terrain, and let $P$ be a set of points (locations) on its surface. An important problem in Geographic Information Science (GIS) is computing the visibility index of a point $p$ in $P$, that is, the number of points in $P$ that are visible from $p$. The total visibility-index problem asks for computing the visibility index of every point in $P$. Most applications of this problem involve 2-dimensional terrains represented by a grid of $n \times n$ square cells, where each cell is associated with an elevation value, and $P$ consists of the center-points of these cells. Current approaches for computing the total visibility-index on such a terrain take at least quadratic time with respect to the number of the terrain cells. While finding a subquadratic solution to this 2D total visibility-index problem is an open problem, surprisingly, no subquadratic solution has been proposed for the one-dimensional (1D) version of the problem; in the 1D problem, the terrain is an $x$-monotone polyline, and $P$ is the set of the polyline vertices.

We present an $O(n \log^2 n)$ algorithm that solves the 1D total visibility-index problem in the RAM model. Our algorithm is based on a geometric dualization technique, which reduces the problem into a set of instances of the red-blue line segment intersection counting problem. We also present a parallel version of this algorithm, which requires $O(\log^2 n)$ time and $O(n \log^2 n)$ work in the CREW PRAM model. We implement a naive $O(n^2)$ approach and three variations of our algorithm: one employing an existing red-blue line segment intersection algorithm and two new approaches that perform the intersection counting by leveraging features specific to our problem. We present experimental results for both serial and parallel implementations on large synthetic and real-world datasets, using two distinct hardware platforms. Results show that all variants of our algorithm outperform the naive approach by several orders of magnitude on large datasets. Furthermore, we show that our new intersection counting implementations achieve more than 8 times speedup over the existing red-blue line segment intersection algorithm. Our parallel implementation is able to process a terrain of $2^{24}$ vertices in under 1 minute using 16 cores, achieving more than 7 times speedup over serial execution.
@inproceedings{p17-alenex, author = {Afshani, Peyman and de Berg, Mark and Casanova, Henri and Karsin, Ben and Lambrechts, Colin and Sitchinava, Nodari and Tsirogiannis, Constantinos}, title = {An efficient algorithm for the {1D} total visibility-index problem}, booktitle = {Proceedings of the 19th Meeting on Algorithm Engineering \& Experiments}, series = {ALENEX '17}, pages = {218-231}, year = {2017}, month = jan, doi = {10.1137/1.9781611974768.18}, software = {https://github.com/algoparc/1DVisibilityIndex} }
Abstract: Many-core Graphics Processing Units (GPUs) are being used for general-purpose computing. However, due to architectural features, for many problems it is challenging to design parallel algorithms that exploit the full compute power of GPUs. Among these features is the memory design. Although the issue of coalesced global memory access has been documented and studied extensively, another important architectural feature is the organization of shared memory into banks. The study of how bank conflicts impact algorithm performance has only recently begun to receive attention. In this work we study the predecessor search algorithm and the effects of bank conflicts on its execution time. Via complexity analysis we show that bank conflicts cause significant loss in parallelism for a naive algorithm. We then propose two improved algorithms: one that eliminates bank conflicts altogether but that uses a work inefficient linear search, and one that is work-optimal but that experiences a limited number of bank conflicts. We develop GPU implementations of these algorithms and present experimental results obtained on real-world hardware. These results validate our theoretical analysis of the naive algorithm and allow us to assess the performance of our algorithms in practice. Although both our improved algorithms outperform the naive algorithm, our main experimental finding is that our conflict-limited algorithm provides a larger performance gain.
@inproceedings{p15-hipc-o, author = {Karsin, Ben and Casanova, Henri and Sitchinava, Nodari}, title = {Efficient batched predecessor search in shared memory on {GPUs}}, booktitle = {Proceedings of the IEEE International Conference on High Performance Computing}, series = {HiPC '15}, pages = {335-344}, year = {2015}, month = dec, doi = {10.1109/HiPC.2015.40} }
Abstract:

In this paper, we look at the complexity of designing algorithms without any bank conflicts in the shared memory of Graphics Processing Units (GPUs). Given an input of size $n$, $w$ processors and $w$ memory banks, we study three fundamental problems: sorting, permuting and $w$-way partitioning (defined as sorting an input containing exactly $n/w$ copies of every integer in $[w]$).

We solve sorting in optimal $O(\frac{n}{w} \log n)$ time. When $n \ge w^2$, we solve the partitioning problem optimally in $O(n/w)$ time. We also present a general solution for the partitioning problem which takes $O(\frac{n}{w} \log^3_{n/w} w)$ time. Finally, we solve the permutation problem using a randomized algorithm in $O(\frac{n}{w} \log\log\log_{n/w} n)$ time. Our results show evidence that, when working with banked memory architectures, there is a separation between these problems, and that the permutation and partitioning problems are not as easy as simple parallel scanning.
@inproceedings{p15-esa, author = {Afshani, Peyman and Sitchinava, Nodari}, title = {Sorting and permuting without bank conflicts}, booktitle = {Proceedings of the 23rd European Symposium on Algorithms}, series = {ESA '15}, pages = {13-24}, year = {2015}, month = sep, doi = {10.1007/978-3-662-48350-3_2}, arxiv = {https://arxiv.org/abs/1507.01391} }
Abstract:

We study the problem of list ranking in the parallel external memory (PEM) model. We observe an interesting dual nature for the hardness of the problem due to limited information exchange among the processors about the structure of the list, on the one hand, and its close relationship to the problem of permuting data, which is known to be hard for the external memory models, on the other hand.

By carefully defining the power of the computational model, we prove a permuting lower bound in the PEM model. Furthermore, we present a stronger $\Omega(\log^2 N)$ lower bound for a special variant of the problem and for a specific range of the model parameters, which takes us a step closer toward proving a non-trivial lower bound for the list ranking problem in the bulk-synchronous parallel (BSP) and MapReduce models. Finally, we also present an algorithm that is tight for a larger range of parameters of the model than in prior work.
@inproceedings{p14-mfcs, author = {Jacob, Riko and Lieber, Tobias and Sitchinava, Nodari}, title = {On the complexity of list ranking in the parallel external memory model}, booktitle = {Proceedings of the 39th International Symposium on Mathematical Foundations of Computer Science}, series = {MFCS '14}, pages = {384-395}, year = {2014}, month = aug, doi = {10.1007/978-3-662-44465-8_33}, arxiv = {https://arxiv.org/abs/1406.3279} }
Abstract:

In this paper we study the offline (batched) range minima query (RMQ) problem in the external memory (EM) and cache-oblivious (CO) models. In the static RMQ problem, given an array $A$, a query $\mathrm{rmq}_A(i,j)$ returns the smallest element in the range $A[i,j]$.

If $B$ is the size of the block and $m$ is the number of blocks that fit in the internal memory in the EM and CO models, we show that $Q$ range minima queries on an array of size $N$ can be answered in $O(\frac{N}{B} + \frac{Q}{B}\log_{m} \frac{Q}{B}) = O(\mathrm{scan}(N) + \mathrm{sort}(Q))$ I/Os in the CO model and slightly better $O(\mathrm{scan}(N) + \frac{Q}{B} \log_m \min\{\frac{Q}{B}, \frac{N}{B}\})$ I/Os in the EM model and linear space in both models. Our cache-oblivious result is new and our external memory result is an improvement of the previously known bound. We also show that the EM bound is tight by proving a matching lower bound. Our lower bound holds even if the queries are presorted in any predefined order.

In the batched dynamic RMQ problem, the queries must be answered in the presence of the updates (insertions/deletions) to the array. We show that in the EM model we can solve this problem in $O(\mathrm{sort}(N) + \mathrm{sort}(Q)\log_m \frac{N}{B})$ I/Os, again improving the best previously known bound.
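To make the static query concrete, the following sketch answers range minima queries in internal memory with a sparse table. This is a standard RAM-model technique using $O(N \log N)$ space, swapped in purely for illustration; it is not the I/O-efficient or cache-oblivious algorithm of the paper. The function names and the 0-based inclusive index convention are assumptions.

```python
def build_sparse_table(a):
    """table[k][i] holds the minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(a)]
    k = 1
    while (1 << k) <= n:
        prev = table[-1]
        half = 1 << (k - 1)
        # A window of length 2**k is the union of two windows of length 2**(k-1).
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << k) + 1)])
        k += 1
    return table


def rmq(table, i, j):
    """Minimum of a[i..j], inclusive, for 0 <= i <= j < len(a)."""
    k = (j - i + 1).bit_length() - 1  # largest k with 2**k <= j - i + 1
    # Two overlapping windows of length 2**k cover the whole range.
    return min(table[k][i], table[k][j - (1 << k) + 1])
```

A batch of queries can then be answered by calling `rmq` once per query. The paper's contribution lies in doing this with few block transfers and linear space, which this sketch does not attempt.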
@inproceedings{p14-swat, author = {Afshani, Peyman and Sitchinava, Nodari}, title = {{I/O}-efficient range minima queries}, booktitle = {Proceedings of the 14th Scandinavian Symposium and Workshops on Algorithm Theory}, series = {SWAT '14}, pages = {1-12}, year = {2014}, month = jul, doi = {10.1007/978-3-319-08404-6_1} }
Abstract: In this paper, we perform an empirical evaluation of the Parallel External Memory (PEM) model in the context of geometric problems. In particular, we implement the parallel distribution sweeping framework of Ajwani, Sitchinava and Zeh to solve the batched 1-dimensional stabbing max problem. While modern processors consist of sophisticated memory systems (multiple levels of caches, set associativity, TLB, prefetching), we empirically show that algorithms designed in simple models that focus on minimizing the I/O transfers between shared memory and a single-level cache can lead to efficient software on current multicore architectures. Our implementation exhibits significantly fewer accesses to slow DRAM and, therefore, outperforms traditional approaches based on plane sweep and two-way divide and conquer.
@inproceedings{p13-esa, author = {Ajwani, Deepak and Sitchinava, Nodari}, title = {Empirical evaluation of the parallel distribution sweeping framework on multicore architectures}, booktitle = {Proceedings of the 21st European Symposium on Algorithms}, series = {ESA '13}, pages = {25-36}, year = {2013}, month = sep, doi = {10.1007/978-3-642-40450-4_3}, arxiv = {https://arxiv.org/abs/1306.4521} }
Abstract: We study a simple parallel algorithm for computing matchings in a graph. A variant for unweighted graphs finds a maximal matching using linear expected work and $O(\log^2 n)$ expected running time in the CREW PRAM model. Similar results also apply to External Memory, MapReduce and distributed memory models. In the maximum weight case the algorithm guarantees a 1/2-approximation. Although the parallel execution time is linear for worst case weights, an experimental evaluation indicates good scalability on distributed memory machines and on GPUs. Furthermore, the solution quality is very good in practice.
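One simple round-based scheme of this kind repeatedly selects every edge that is heavier than all remaining edges sharing an endpoint with it. The sketch below simulates such rounds sequentially. That this matches the paper's exact variant is an assumption, as are the function name, the `(u, v, weight)` edge format, and the tie-breaking by input position.

```python
def local_max_matching(edges):
    """Compute a maximal matching by rounds of locally heaviest edges.

    edges: list of (u, v, weight) tuples with u != v.
    Returns the list of matched edges.
    """
    remaining = list(range(len(edges)))
    matching = []
    while remaining:
        # For every vertex, find its heaviest remaining incident edge.
        best = {}
        for e in remaining:
            u, v, w = edges[e]
            key = (w, -e)  # unique key: ties go to the earlier edge
            for x in (u, v):
                if x not in best or key > best[x][0]:
                    best[x] = (key, e)
        # An edge is locally maximal if both of its endpoints picked it.
        chosen = [e for e in remaining
                  if best[edges[e][0]][1] == e and best[edges[e][1]][1] == e]
        matched = set()
        for e in chosen:
            matching.append(edges[e])
            matched.update(edges[e][:2])
        # Drop every edge that touches a newly matched vertex.
        remaining = [e for e in remaining
                     if edges[e][0] not in matched and edges[e][1] not in matched]
    return matching
```

The heaviest remaining edge is always locally maximal, so every round makes progress. Within a round the selected edges are independent of one another, which is what makes the scheme easy to run in parallel.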
@inproceedings{p13-europar, author = {Birn, Marcel and Osipov, Vitaly and Sanders, Peter and Schulz, Christian and Sitchinava, Nodari}, title = {Efficient parallel and external matching}, booktitle = {Proceedings of the 19th European Conference on Parallel Processing}, series = {Euro-Par '13}, pages = {659-670}, year = {2013}, month = aug, doi = {10.1007/978-3-642-40047-6_66}, arxiv = {https://arxiv.org/abs/1302.4587} }
Abstract: We study the one-dimensional range minimum query (RMQ) problem in the external memory model. We provide the first space-optimal solution to the batched static version of the problem. On an instance with $N$ elements and $Q$ queries, our solution takes $\Theta(\mathrm{sort}(N+Q)) = \Theta(\frac{N+Q}{B}\log_{M/B} \frac{N+Q}{B})$ I/O complexity and $O(N+Q)$ space, where $M$ is the size of the main memory and $B$ is the block size. This is a factor of $O(\log_{M/B} N)$ improvement in space complexity over the previous solutions. We also show that an instance of the batched dynamic RMQ problem with $N$ updates and $Q$ queries can be solved in $O(\frac{N+Q}{B} \log^2_{M/B} \frac{N+Q}{B})$ I/O complexity and $O(N+Q)$ space.
@inproceedings{p13-wads, author = {Arge, Lars and Fischer, Johannes and Sanders, Peter and Sitchinava, Nodari}, title = {On (dynamic) range minimum queries in external memory}, booktitle = {Proceedings of the 13th International Symposium on Algorithms and Data Structures}, series = {WADS '13}, pages = {37-48}, year = {2013}, month = aug, doi = {10.1007/978-3-642-40104-6_4} }
Abstract:

We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge’s sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism.

The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of $N$ insertions, deletions, membership queries, and range queries in the optimal $O(\mathrm{sort}_P(N) + K/PB)$ parallel I/O complexity, where $K$ is the size of the output reported in the process and $\mathrm{sort}_P(N)$ is the parallel I/O complexity of sorting $N$ elements using $P$ processors.
PDF DOI
@inproceedings{p12-spaa, author = {Sitchinava, Nodari and Zeh, Norbert}, title = {A parallel buffer tree}, booktitle = {Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures}, series = {SPAA '12}, pages = {214-223}, year = {2012}, month = jul, doi = {10.1145/2312005.2312046} }
Abstract: We study the MapReduce framework from an algorithmic standpoint, providing a generalization of the previous algorithmic models for MapReduce. We present optimal solutions for the fundamental problems of all-prefix-sums, sorting and multi-searching. Additionally, we design optimal simulations of the well-established PRAM and BSP models in MapReduce, immediately resulting in optimal solutions to the problems of computing fixed-dimensional linear programming and 2-D and 3-D convex hulls.
PDF ARXIV DOI
@inproceedings{p11-isaac, author = {Goodrich, Michael T. and Sitchinava, Nodari and Zhang, Q.}, title = {Sorting, searching and simulation in the {MapReduce} framework}, booktitle = {Proceedings of the 22nd International Symposium on Algorithms and Computation}, series = {ISAAC '11}, pages = {374-383}, year = {2011}, month = dec, doi = {10.1007/978-3-642-25591-5_39}, arxiv = {https://arxiv.org/abs/1101.1902} }
Abstract: The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of $O(\mathrm{sort}_P(N) + K/PB)$ for a number of problems on axis-aligned objects; $P$ denotes the number of cores/processors, $B$ denotes the number of elements that fit in a cache line, $N$ and $K$ denote the sizes of the input and output, respectively, and $\mathrm{sort}_P(N)$ denotes the I/O complexity of sorting $N$ items using $P$ processors in the PEM model. To obtain the above improvement, we present a new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of $O((N + K)/PB)$ , where $K$ is the sum of the counts of all the ranges. The key to achieving efficient load balancing among the processors in this algorithm is a new method to count the output without enumerating it, which might be of independent interest.
PDF DOI
@inproceedings{p11-ipdps, author = {Ajwani, Deepak and Sitchinava, Nodari and Zeh, Norbert}, title = {{I/O}-optimal distribution sweeping on private-cache chip multiprocessors}, booktitle = {Proceedings of the 26th IEEE International Parallel \& Distributed Processing Symposium}, series = {IPDPS '11}, pages = {1114-1123}, year = {2011}, month = may, doi = {10.1109/IPDPS.2011.106} }
Abstract: We study techniques for obtaining efficient algorithms for geometric problems on private-cache chip multiprocessors. We show how to obtain optimal algorithms for interval stabbing counting, 1-D range counting, weighted 2-D dominance counting, and for computing 3-D maxima, 2-D lower envelopes, and 2-D convex hulls. These results are obtained by analyzing adaptations of either the PEM merge sort algorithm or PRAM algorithms. For the second group of problems -- orthogonal line segment intersection reporting, batched range reporting, and related problems -- more effort is required. What distinguishes these problems from the ones in the previous group is the variable output size, which requires I/O-efficient load balancing strategies based on the contribution of the individual input elements to the output size. To obtain nearly optimal algorithms for these problems, we introduce a parallel distribution sweeping technique inspired by its sequential counterpart.
PDF DOI
@inproceedings{p10-esa, author = {Ajwani, Deepak and Sitchinava, Nodari and Zeh, Norbert}, title = {Geometric algorithms for private-cache chip multiprocessors}, booktitle = {Proceedings of the 18th European Symposium on Algorithms}, series = {ESA '10}, pages = {75-86}, year = {2010}, month = sep, doi = {10.1007/978-3-642-15781-3_7} }
Abstract: In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest common ancestors, tree contraction and expression tree evaluation. We also study the problems of computing the connected and biconnected components of a graph, minimum spanning tree of a connected graph and ear decomposition of a biconnected graph. All our solutions on a $P$ -processor PEM model provide an optimal speedup of $\Theta(P)$ in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts.
PDF DOI
@inproceedings{p10-ipdps, author = {Arge, Lars and Goodrich, Michael T. and Sitchinava, Nodari}, title = {Parallel external memory graph algorithms}, booktitle = {Proceedings of the 25th IEEE International Parallel \& Distributed Processing Symposium}, series = {IPDPS '10}, pages = {1-11}, year = {2010}, month = may, doi = {10.1109/IPDPS.2010.5470440} }
Abstract: In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
PDF DOI
@inproceedings{p08-spaa, author = {Arge, Lars and Goodrich, Michael T. and Nelson, Mark and Sitchinava, Nodari}, title = {Fundamental parallel algorithms for private-cache chip multiprocessors}, booktitle = {Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures}, series = {SPAA '08}, pages = {197-206}, year = {2008}, month = jun, doi = {10.1145/1378533.1378573} }
Abstract: We consider the problem of placing a small number of angle guards inside a simple polygon $P$ so as to provide efficient proofs that any given point is inside $P$ . Each angle guard views an infinite wedge of the plane, and a point can prove membership in $P$ if it is inside the wedges for a set of guards whose common intersection contains no points outside the polygon. This model leads to a broad class of new art gallery type problems, which we call “sculpture garden” problems and for which we provide upper and lower bounds. In particular, we show there is a polygon $P$ such that a “natural” angle-guard vertex placement cannot fully distinguish between points on the inside and outside of $P$ (even if we place a guard at every vertex of $P$ ), which implies that Steiner-point guards are sometimes necessary. More generally, we show that, for any polygon $P$ , there is a set of $n+2(h-1)$ angle guards that solve the sculpture garden problem for $P$ , where $h$ is the number of holes in $P$ (so a simple polygon can be defined with $n-2$ guards). In addition, we show that, for any orthogonal polygon $P$ , the sculpture garden problem can be solved using $\frac{n}{2}$ angle guards. We also give an example of a class of simple (non-general-position) polygons that have sculpture garden solutions using $O(\sqrt{n})$ guards, and we show this bound is optimal to within a constant factor. Finally, while optimizing the number of guards solving a sculpture garden problem for a particular $P$ is of unknown complexity, we show how to find in polynomial time a guard placement whose size is within a factor of $2$ of the optimal number for any particular polygon.
PDF DOI
@inproceedings{p07-socg, author = {Eppstein, David and Goodrich, Michael T. and Sitchinava, Nodari}, title = {Guard placement for efficient point-in-polygon proofs}, booktitle = {Proceedings of the 23rd Annual ACM Symposium on Computational Geometry}, series = {SoCG '07}, pages = {27-36}, year = {2007}, month = jun, doi = {10.1145/1247069.1247075} }
Abstract: This paper extends the reconfigurable shared scan-in architecture (RSSA) to provide additional ability to change values on the scan configuration signals (scan enable signals) during the scan operation on a per-shift basis. We show that the extra flexibility of reconfiguring the scan chains every shift cycle reduces the number of different configurations required by RSSA while keeping test coverage the same. In addition, a simpler analysis can be used to construct the scan chains. This is the first paper of its kind that treats the scan enable signal as a test data signal during the scan operation of a test pattern. Results are presented on some ISCAS as well as industrial circuits.
PDF DOI
@inproceedings{p04-vts-o, author = {Sitchinava, Nodari and Samaranayake, Samitha and Kapur, Rohit and Gizdarski, Emil and Neuveux, Frederic and Williams, Thomas W.}, title = {Changing scan enable during shift}, booktitle = {Proceedings of the 22nd IEEE VLSI Test Symposium}, series = {VTS '04}, pages = {73-78}, year = {2004}, doi = {10.1109/VTEST.2004.1299228} }
Abstract: In this paper, an efficient technique for test data volume reduction based on the shared scan-in (Illinois Scan) architecture and the scan chain reconfiguration (Dynamic Scan) architecture is defined. The composite architecture is created with analysis that relies on the compatibility relation of scan chains. Topological analysis and compatibility analysis are used to maximize gains in test data volume and test application time. The goal of the proposed synthesis procedure is to test all detectable faults in broadcast test mode using minimum scan-chain configurations. As a result, more aggressive sharing of scan inputs can be applied for test data volume and test application time reduction. The experimental results demonstrate the efficiency of the proposed architecture for real-industrial circuits.
PDF DOI
@inproceedings{p03-vts-o, author = {Samaranayake, Samitha and Gizdarski, Emil and Sitchinava, Nodari and Neuveux, Frederic and Kapur, Rohit and Williams, Thomas W.}, title = {A reconfigurable shared scan-in architecture}, booktitle = {Proceedings of the 21st IEEE VLSI Test Symposium}, series = {VTS '03}, pages = {9-14}, year = {2003}, doi = {10.1109/VTEST.2003.1197627} }

Journal Articles

Abstract: We present parallel algorithms to efficiently permute a sorted array into the level-order binary search tree (BST), level-order B-tree (B-tree), and van Emde Boas (vEB) layouts in-place. We analytically determine the complexity of our algorithms and empirically measure their performance. When considering the total time to permute the data in-place and to perform a series of search queries, the vEB layout provides the best performance on the CPU. Given an input of $N$ =537 million 64-bit integers, the benefits of query performance (compared to binary search) outweigh the cost of in-place permutation when performing as few as 0.37% of $N$ queries. On the GPU, results depend on the particular architecture, with the B-tree and vEB layouts performing the best. The number of queries necessary to reach the break-even point with binary search ranges from 1.3% to 8.9% of $N$ =1,074 million 32-bit integers.
PDF DOI
@article{p21-tc, author = {Berney, Kyle and Casanova, Henri and Higuchi, Alyssa and Karsin, Ben and Sitchinava, Nodari}, title = {Beyond Binary Search: Parallel In-place Construction of Implicit Search Tree Layouts}, journal = {IEEE Transactions on Computers}, volume = {71}, number = {5}, pages = {1104-1116}, year = {2022}, doi = {10.1109/TC.2021.3075392} }
Abstract:

Let $T$ be a terrain and $P$ be a set of points on its surface. An important problem in Geographic Information Science (GIS) is computing the visibility index of a point $p$ on $P$ , that is, the number of points in $P$ that are visible from $p$ . The total visibility-index problem asks for the visibility index of every point in $P$ .

We present the first subquadratic-time algorithm to solve the 1D total-visibility-index problem. Our algorithm uses a geometric dualization technique to reduce the problem to a set of instances of the red-blue line segment intersection counting problem, allowing us to find the total visibility-index in $O(n \log^2{n})$ time. We implement a naive $O(n^2)$ approach and four variations of our algorithm: one that uses an existing red-blue line segment intersection counting algorithm and three new approaches that leverage features specific to our problem. Two of our implementations allow for parallel execution, requiring $O(\log^2{n})$ time and $O(n \log^2{n})$ work in the CREW PRAM model.

We present experimental results for both serial and parallel implementations on synthetic and real-world datasets, using two hardware platforms. Results show that all variants of our algorithm outperform the naive approach by several orders of magnitude. Furthermore, we show that our special-case red-blue line segment intersection counting implementations outperform the existing general-case solution by up to a factor of 10. Our fastest parallel implementation is able to process a terrain of more than 100 million vertices in under 3 minutes, achieving up to 85% parallel efficiency using 16 cores.
PDF DOI
@article{p18-jea, author = {Afshani, Peyman and de Berg, Mark and Casanova, Henri and Karsin, Ben and Lambrechts, Colin and Sitchinava, Nodari and Tsirogiannis, Constantinos}, title = {An efficient algorithm for the 1D total visibility-index problem and its parallelization}, year = {2018}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, journal = {ACM Journal of Experimental Algorithmics}, volume = {23}, number = {2}, month = jul, articleno = {2.3}, numpages = {23}, pages = {2.3:1-2.3:23}, issn = {1084-6654}, doi = {10.1145/3209685}, keywords = {persistent data structures, parallel algorithms, computational geometry, Terrain visibility} }
Abstract: Visibility graph reconstruction, which asks us to construct a polygon that has a given visibility graph, is a fundamental problem with unknown complexity (although visibility graph recognition is known to be in PSPACE). As far as we are aware, the only class of orthogonal polygons that are known to have efficient reconstruction algorithms is the class of orthogonal convex fans (staircase polygons) with uniform step lengths. We show that two classes of uniform step length polygons can be reconstructed efficiently by finding and removing rectangles formed between consecutive convex boundary vertices called tabs. In particular, we give an $O(n^2m)$ -time reconstruction algorithm for orthogonally convex polygons, where $n$ and $m$ are the number of vertices and edges in the visibility graph, respectively. We further show that reconstructing a monotone chain of staircases (a histogram) is fixed-parameter tractable, when parameterized on the number of tabs, and polynomially solvable in time $O(n^2m)$ under alignment restrictions. As a consequence of our reconstruction techniques, we also get recognition algorithms for visibility graphs of these classes of polygons with the same running times.
PDF DOI
@article{p18-jgaa, author = {Sitchinava, Nodari and Strash, Darren}, title = {Reconstructing generalized staircase polygons with uniform step length}, journal = {Journal of Graph Algorithms and Applications}, volume = {22}, number = {3}, pages = {431-459}, year = {2018}, doi = {10.7155/jgaa.00469} }
Abstract: A foreword by the editors of the special issue dedicated to the selected papers from the 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2014).
PDF DOI
@article{p16-topc, author = {Meyer auf der Heide, Friedhelm and Sanders, Peter and Sitchinava, Nodari}, title = {Introduction to the special issue on SPAA 2014}, journal = {ACM Transactions on Parallel Computing}, volume = {3}, number = {1}, pages = {1:1-1:2}, year = {2016}, doi = {10.1145/2936716} }
Abstract: An overview of computational geometry results in the Parallel External Memory (PEM) model.
PDF DOI
@article{p12-sigspatial, author = {Sitchinava, Nodari}, title = {Computational geometry in the Parallel External Memory model}, journal = { {SIGSPATIAL} Special}, issue_date = {July 2012}, volume = {4}, number = {2}, month = jul, year = {2012}, issn = {1946-7729}, pages = {18-23}, numpages = {6}, url = {http://doi.acm.org/10.1145/2367574.2367578}, doi = {10.1145/2367574.2367578}, acmid = {2367578}, publisher = {ACM}, address = {New York, NY, USA} }
Abstract: Two factors primarily drive the soaring cost of semiconductor test: the number of test patterns applied to each chip and the time it takes to run each pattern. Typical semiconductor testing for each chip involves a set of 1,000 to 5,000 test patterns. These tests are applied through scan chains that operate at about 25 MHz. Depending on the size of the scan chains on the chip, a set of test patterns can take a few seconds to execute per chip. It's easy to see that even a small decrease in either the number of patterns or the time to execute them can quickly add up to big savings across millions of fabricated chips. This potential savings forms the basis for dynamic scan, a new approach to the well-established scan test methodology. The authors' initial studies indicate that dynamic scan could easily reduce the time spent applying test patterns by 40 percent. A more theoretical analysis shows a potential savings of as much as 80 percent.
PDF DOI
@article{p02-computer, author = {Samaranayake, Samitha and Sitchinava, Nodari and Kapur, Rohit and Amin, Minesh B. and Williams, Thomas W.}, journal = {Computer}, title = {Dynamic scan: driving down the cost of test}, year = {2002}, volume = {35}, number = {10}, pages = {63-68}, keywords = {automatic test pattern generation;design for testability;integrated circuit testing;automatic test pattern generation;design-for-test technology;dynamic scan;scan test methodology;semiconductor testing;test patterns;Automatic generation control;Automatic test pattern generation;Automatic testing;Costs;Design for testability;Fabrication;Flip-flops;Logic testing;Semiconductor device testing;Test pattern generators}, doi = {10.1109/MC.2002.1039519}, issn = {0018-9162}, month = oct }

Book Chapters

Abstract:
PDF DOI
@incollection{p17-hdcg, author = {Goodrich, Michael T. and Sitchinava, Nodari}, title = {Parallel algorithms in geometry}, booktitle = {Handbook of Discrete and Computational Geometry}, editor = {C. D. T\'{o}th and J. O'Rourke and J. E. Goodman}, year = {2017}, publisher = {CRC Press}, doi = {10.1201/9781315119601}, edition = {3rd}, pages = {1225-1239}, isbn = {9781498711395} }

Refereed Workshops (without formally published proceedings)

N. Sitchinava, D. Strash: "Reconstructing a unit-length orthogonally convex polygon from its visibility graph". European Workshop on Computational Geometry (EuroCG '16). 2016.
P. Afshani, N. Sitchinava: "I/O-efficient range minima queries". 6th Workshop on Massive Data Algorithmics (MASSIVE '14). 2014.
N. Sitchinava, V. Weichert: "Provably efficient GPU algorithms". 5th Workshop on Massive Data Algorithmics (MASSIVE '13). 2013.
L. Arge, J. Fischer, P. Sanders, N. Sitchinava: "On (dynamic) range minimum queries in external memory". 5th Workshop on Massive Data Algorithmics (MASSIVE '13). 2013.
D. Ajwani, N. Sitchinava, N. Zeh: "I/O-optimal distribution sweeping on private-cache chip multiprocessors". 3rd Workshop on Massive Data Algorithmics (MASSIVE '11). 2011.
D. Ajwani, N. Sitchinava, N. Zeh: "Geometric algorithms for private-cache chip multiprocessors". 2nd Workshop on Massive Data Algorithmics (MASSIVE '10). 2010.
L. Arge, M.T. Goodrich, N. Sitchinava: "Parallel external memory model". Workshop on Theory and Many-Cores (T&MC). 2009.
* N. Sitchinava, S. Samaranayake, R. Kapur, F. Neuveux, E. Gizdarski, T.W. Williams: "Dynamically reconfigurable shared scan-in architecture". IEEE International Test Synthesis Workshop (ITSW '04). 2004.
* N. Sitchinava, S. Samaranayake, R. Kapur, F. Neuveux, E. Gizdarski, T.W. Williams, D. Spielman: "A segment identification algorithms for a dynamic scan architecture". IEEE International Test Synthesis Workshop (ITSW '03). 2003.
* N. Sitchinava, S. Samaranayake, R. Kapur, M. Amin, T.W. Williams: "DFT - ATE solution to lower the cost of test". IEEE Workshop on Test Resource Partitioning. 2001.