並行編程模型

計算機科學中,並行編程模型(Parallel programming model)是並行計算機架構的抽象化,通過它可方便的表達算法和它們在程序中的合成。判別編程模型的價值可以通過它的通用性:在各種不同架構上能表達多大範圍的不同問題,和它的性能:編譯的程序在執行時有多高的效率[1]。並行編程模型的實現形式可以是從「順序編程」語言中調用的函式庫,作為現存語言的擴展,或作為全新的語言。

圍繞特定編程模型的共識是很重要的,這可導致建造不同的並行計算機來支持這個模型,從而促進軟件的可移植性。在這個意義上,編程模型被稱為在硬件和軟件之間的橋接英語Bridging model[2]

並行編程模型的分類

並行編程模型的分類可以寬泛的在兩個領域主題下進行:進程交互和問題分解[3][4][5]

進程交互

進程交互有關於並行進程能夠籍此相互通信的機制。最常見的交互形式是共享內存和消息傳遞。

共享內存

共享內存是在進程間傳遞數據的高效方式。在共享內存模型中,並行進程共享它們可以異步讀與寫的全局地址空間。異步並發訪問可能導致競爭條件,和用來避免它們的機制如:信號量監視器。常規的多核處理器直接支持共享內存,很多並行編程語言和庫在設計上利用了它,比如採用分叉會合模型的:CilkOpenMP線程建造塊英語Threading Building Blocks[6]

消息傳遞

消息傳遞模型中,並行進程通過消息傳遞相互交換數據。這種通信可以是異步的,就是說消息可以在接收者準備好之前發出;或是同步的,就是說消息發出前接收者必須準備好。通信順序進程(CSP)形式化了使用同步通信通道來連接進程的消息傳遞,並引出了重要的語言如:OccamLimboGo。與之相對,演員模型使用異步消息傳遞,並被採用於如下語言的設計中:DScala和SALSA[7]

分布式內存

分布式內存英語Distributed memory指稱一類多處理器計算機系統,其中每個處理器都有自己私有的內存,計算任務只能在本地數據上運算,如果需要遠程數據,計算任務必須與一個或多個遠程處理器通信。在分布式內存系統編程中的關鍵要點是如何把數據分布到各個內存上;依賴於所解決的問題,數據可以靜態分布,也可以在節點間移動;數據可以在需要時移動,也可以事先推入新的節點。

MPI規定了用於分布式內存系統的通信協議,支持點到點通信和集體通信(collective communication)二者。MPI還是消息傳遞API,帶有對其特徵在任何實現中必須如何表現的協議和語義規定[8]。MPI的目標是高性能、可伸縮性和可移植性,目前仍是高性能計算領域中統治性的模型[9]。此外還有支持單邊通信(one-sided communication)的分區全局地址空間模型。

問題分解

並行程序由同時執行的進程構成。問題分解有關於規劃(formulate)成員進程的方式[10][11]

數據並行

數據並行模型關注進行運算所在的數據集,典型的是正規結構的數組。一組任務將在這些數據上運算,但是單獨的處於在不相交的分區中。數據並行通常對應SPMD英語SPMD編程模型[12],相應執行模型對應費林分類法中的SIMD(例如AVX擴展英語Processor supplementary capability)或MIMD(例如Xeon Phi[13],還有GPGPU採用的SIMT英語Single instruction, multiple threads[14] (例如NVIDIA Tesla)。

任務並行

任務並行模型關注進程線程的執行。這些進程通常表現出獨特性,並強調對通信的需求。任務並行是表達消息傳遞通信的自然方式。任務並行通常對應MPMD英語MPMD編程模型,與SPMD英語SPMD的區別在於適合解決的問題而非執行模型[15]

隱式並行

與上述顯式並行英語Explicit parallelism相反,隱式並行英語Implicit parallelism不向編程者透露任何編譯器、運行時系統或硬件負責的事情。例如,在編譯器中自動並行化英語automatic parallelization是把順序代碼轉變程並行代碼的過程;而在計算機架構中,超標量執行是採用指令級並行來進行並行運算的機制,因自身限制而被實際利用為同時多線程/超線程

在隱式並行中,進程交互對編程者均不可見,轉而由編譯器和/或運行時系統負責進行交互。函數式編程語言不存在副作用,允許無依賴的函數並行執行,人們已經進行了很多有關的研究實驗[16]。以SISALSAC為代表的一類數據流程編程語言,是可以高效隱式並行的函數式編程語言,其關鍵特徵是單賦值面向數組

術語

並行計算模型是計算模型的一大範疇,包括:細胞自動機PRAM機LogP機佩特里網進程網英語Kahn process networks交互網英語Interaction nets等。計算模型是用來分析計算進程代價的一種抽象化,利用它分析並行算法英語Analysis of parallel algorithms性能,可以不取決於特定實現和技術所特有的各種變化,並行算法一般而言是針對特定計算模型而編寫的,為PRAM機編寫的偽碼通常會採用某種For循環形式的並發編程構造[17]

編程模型英語Programming model指稱一種編程樣式,即通過看起來像調用的方式引發執行。例子包括POSIXPthreads庫和Apache Hadoop中的MapReduce。在這二者情況下,執行模型英語Execution model都符合這個庫所用語言的語法卻不能按照其語義來理解。不同於計算模型,編程模型特別暗含着對硬件或軟件實現的實際考慮[18]

並行計算中,執行模型經常必須暴露硬件特徵來達成高性能。並行硬件有大量的變種導致了同時需要類似數量的並行執行模型。對每個執行模型都建立一門新語言是不實際的,因此常見的實踐都是通過某個API來引發並行執行模型的行為。並行編程語言可以基於一種或一個組合的編程模型。例如,高性能Fortran基於共享內存交互和數據並行問題分解,而Go提供共享內存交互和消息傳遞交互。

並行編程模型

這裡列出的編程模型是可稱為橋接模型英語Bridging model計算機的抽象模型[2],它提供了在一個機器的物理實現和編程者可獲得的這個機器的抽象概念之間的橋梁;換句話說,它意圖在硬件軟件工程師之間提供共同的理解層面。成功的編程模型可以在現實中有效的實現並被編程者有效的作為目標;特別是應當有可能用典型的高級語言編譯器生成良好的代碼。從編程者的角度來看,這種橋接並行編程模型一般典型的位於PthreadsIPCMPI等之上,而在OpenMPOpenACC等之下。

名稱 交互類別 分解類別 實現及用例
分叉會合模型 共享內存 數據 CilkOpenMP線程建造塊英語Threading Building Blocks[6]
整體同步並行模型 共享內存 數據[19] BSPlib[20]:MulticoreBSP[21]、BSPonMPI[22]Apache Giraph[23]Apache Hama英語Apache Hama[24]
分區全局地址空間模型 分布式內存英語Distributed memory 數據 SHMEM英語SHMEM[25]Coarray Fortran英語Coarray FortranChapel英語Chapel (programming language)X10英語X10 (programming language)
通信順序進程模式 同步消息傳遞 任務 OccamAdaGo
演員模型 異步消息傳遞 任務 Erlang[26]DScala,很多的框架實現
異構計算框架 共享內存 數據 OpenCLCUDA[27]SYCL,Pocl[28]
串流處理/數據流程范型 管道 數據[1] Brook英語BrookGPU[29]Apache FlinkRaftLib英語RaftLibTensorFlowLustre
高級綜合電子設計自動化 通道 任務 C轉HDL英語C to HDLHandel-CSystemC

OpenCL將計算系統視為組成自一組「計算設備」,它們可以是CPU或是附加的「加速器」比如GPU。它定義了一種類C語言英語List of C-family programming languages用來寫程序。在OpenCL設備上執行的函數叫做「內核[30]:17。一個單一的計算設備典型的組成自一些「計算單元」,它們依次又包含很多「處理元素」(PE)。一個單一的內核執行可以在所有或多個PE上並行運行。OpenCL定義了API,允許運行於主機上的程序,啟動在計算設備上的內核,並管理設備內存,它至少在概念上分離於主機內存。用OpenCL語言寫的程序預期會被即時編譯,所以使用OpenCL的應用程序在針對各種設備的實現之間是可移植的[31]

FPGA可以被用來解決任何可計算的問題,這通過用FPGA能實現軟微處理器英語Soft microprocessor的事實就可輕易的證明。它們的好處在於對某些應用它們明顯的要更快速,因為它們有着並行本質和在對用在特定處理上的邏輯門的數目方面的優化[32]。近年來開始興起使用OpenCL編程來利用FPGA提供的性能和能耗效率。OpenCL允許編程者用C語言編碼並把FPGA組合函數作為使用OpenCL構造的OpenCL計算內核的目標[33]

MapReduce是通過並行、分布式算法在集群上處理和生成鍵/值對形式的大數據集的編程模型和有關實現[34]Apache Hadoop中將它與HDFS分立實現[35]MapReduce受到了在函數式編程范型中常用的mapreduce函數的啟發[36],但是它們在MapReduce框架中的用途不同於它們在起初形式中那樣[37]

並行編程模型還有很多,比如:馬里蘭大學學院市分校依據PRAM計算模型,建立了指令級並行顯式多線程英語Explicit multi-threading的多處理器計算機和編程語言XMTC英語XMTC[38],實現了Spawn-Join范型[39]

參見

引用

  1. ^ 1.0 1.1 Skillicorn, David B., "Models for practical parallel computation", International Journal of Parallel Programming, 20.2 133–158 (1991), https://www.ida.liu.se/~chrke55/papers/modelsurvey.pdf頁面存檔備份,存於網際網路檔案館
  2. ^ 2.0 2.1 Leslie G. Valiant. A bridging model for parallel computation (PDF). Communications of the ACM. August, 1990, 33 (8): 103–111 [2019-12-05]. (原始內容 (PDF)存檔於2019-08-11). 
  3. ^ John E. Savage, Models of Computation: Exploring the Power of Computing, 2008, Chapter 7 (Parallel Computation), http://cs.brown.edu/~jes/book/頁面存檔備份,存於網際網路檔案館
  4. ^ Ian Foster, Designing and Building Parallel Programs, 1995, Section 1.3, "A Parallel Programming Model", http://www.mcs.anl.gov/~itf/dbpp/text/node9.html頁面存檔備份,存於網際網路檔案館
  5. ^ Blaise Barney, Introduction to Parallel Computing, "Models", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#Models頁面存檔備份,存於網際網路檔案館
  6. ^ 6.0 6.1 Michael McCool; James Reinders; Arch Robison. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013 [2019-12-12]. (原始內容存檔 (PDF)於2018-11-23). Threading Building Blocks (TBB) supports fork–join with a work-stealing load balancer as its basic model. In contrast with Cilk Plus, TBB is a portable ISO C++ library, not a compiler extension. Because TBB is not integrated into the compiler, its fork–join implementation is not quite as efficient as Cilk Plus, and it cannot directly generate vectorized code. However, TBB also provides implementations of several patterns not available directly in Cilk Plus, such as pipelines. 
  7. ^ SALSA頁面存檔備份,存於網際網路檔案館
  8. ^ Gropp, William; Lusk, Ewing; Skjellum, Anthony. A High-Performance, Portable Implementation of the MPI Message Passing Interface. Parallel Computing. 1996, 22 (6): 789–828. doi:10.1016/0167-8191(96)00024-5. 
  9. ^ Sur, Sayantan; Koop, Matthew J.; Panda, Dhabaleswar K. MPI and communication---High-performance and scalable MPI over Infini Band with reduced memory usage. High-performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-depth Performance Analysis. ACM. 4 August 2017: 105. ISBN 978-0769527000. doi:10.1145/1188455.1188565. 
  10. ^ Ian Foster, Designing and Building Parallel Programs, 1995, Section 2.2, "Partitioning", http://www.mcs.anl.gov/~itf/dbpp/text/node16.html頁面存檔備份,存於網際網路檔案館
  11. ^ Blaise Barney, Introduction to Parallel Computing, "Partitioning", 2015, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning頁面存檔備份,存於網際網路檔案館
  12. ^ Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (原始內容存檔於2019-12-24). SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously. This program can be threads, message passing, data parallel or hybrid. 
  13. ^ Flynn, Michael J. Some Computer Organizations and Their Effectiveness (PDF). IEEE Transactions on Computers. September 1972, C–21 (9): 948–960 [2019-12-05]. doi:10.1109/TC.1972.5009071. (原始內容 (PDF)存檔於2022-04-11).
    2) The single-instruction stream-multiple-data stream(SIMD), which includes most array processes, including Solomon[2](Illiac IV英語ILLIAC).
    3) Multiple-instruction stream-single-data stream type organizations(MISD), which include specialized streaming organizations using multiple-instruction streams on a single sequence of data and the derivatives thereof. The plug-board英語Plugboard machines of a bygone era were a degenerate form of MISD wherein the instruction streams were single instructions, and a derived datum(SD) passed from program step i to program step i+1(MI).
    4) Multiple-Instruction stream-multiple-data stream (MIMD), which include organizations referred to as "multiprocessor."Univac[3], among other corporations, has proposed various MIMD structures.
     
  14. ^ Michael McCool; James Reinders; Arch Robison. 2.4.3 Flynn’s Characterization. Structured Parallel Programming: Patterns for Efficient Computation (PDF). Elsevier. 2013: 52 [2019-12-12]. (原始內容存檔 (PDF)於2018-11-23). There is another related classification used especially by GPU vendors: Single Instruction, Multiple Threads (SIMT). This corresponds to a tiled SIMD architecture consisting of multiple SIMD processors, where each SIMD processor emulates multiple 「threads」 (fibers in our terminology) using masking. SIMT processors may appear to have thousands of threads, but in fact blocks of these share a control processor, and divergent control flow can significantly reduce efficiency within a block. On the other hand, synchronization between fibers is basically free, because when control flow is emulated with masking the fibers are always running synchronously. 
    Parallel Thread Execution ISA Version 6.5. https://docs.nvidia.com/. [2019-12-18]. (原始內容存檔於2019-12-01).
    On architectures prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained英語Granularity (parallel computing) sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
    Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity英語Granularity (parallel computing), either to make better use of execution resources or to allow one thread to wait for data to be produced by another.
     
  15. ^ Blaise Barney. Introduction to Parallel Computing. [2019-12-02]. (原始內容存檔於2019-12-24). MULTIPLE PROGRAM: Tasks may execute different programs simultaneously. The programs can be threads, message passing, data parallel or hybrid. ……MPMD applications are not as common as SPMD applications, but may be better suited for certain types of problems, particularly those that lend themselves better to functional decomposition than domain decomposition (discussed later under Partioning). 
  16. ^ McBurney, D. L., and M. Ronan Sleep. "Transputer-based experiments with the ZAPP architecture." PARLE Parallel Architectures and Languages Europe. Springer Berlin Heidelberg, 1987.
    Hammond, Kevin. Parallel functional programming: An introduction. In International Symposium on Parallel Symbolic Computation, p. 46. 1994. 「The Alfalfa project implemented serial combinators for the Intel iPSC[44]. …… Buckwheat re-implemented this model for the Encore Multimax, a shared-memory multiprocessor. ……[45].」
  17. ^ Blelloch, Guy. Programming Parallel Algorithms (PDF). Communications of the ACM. 1996, 39 (3): 85–97 [2019-12-12]. doi:10.1145/227234.227246. (原始內容存檔 (PDF)於2016-03-04). An important advance in parallel computing was the introduction of the notion of virtual models. …… Virtual models can be taken a step further and used to define performance in more abstract measures than just running time on a particular machine. A pair of such measures are work and depth: Work is defined as the total number of operations executed by a computation, and depth is defined as the longest chain of sequential dependencies in the computation. 
    Siddhartha Chatterjee, Jan Prins. Parallel and Distributed Computing PRAM Algorithms (PDF). Spring 2002 [2019-12-06]. (原始內容存檔 (PDF)於2019-12-06). The barebones PRAM model is low-level and cumbersome, and writing anything other than trivial algorithms in this model is a nightmare. We will therefore switch to an equivalent but higher-level abstraction called the Work-Time (WT) paradigm to be independent of these details. …… In our algorithmic notation, we will use the forall construct to denote such concurrent operations, and we drop explicit mention of the processor id and p, the number of processors. In fact the forall construct is the only construct that distinguishes a WT algorithm from a sequential algorithm. 
    Vishkin, Uzi, Thinking in Parallel: Some Basic Data-Parallel Algorithms and Techniques, 104 pages (PDF), Class notes of courses on parallel algorithms taught since 1992 at the University of Maryland, College Park, Tel Aviv University and the Technion, 2009 [2019-12-07], (原始內容存檔 (PDF)於2017-12-15), For concreteness, we proceed to an example of a PRAM algorithm. However, before doing this we present the pardo 「programming construct」, which is heavily used in these notes to express operations that are performed in parallell:
        for Pi , 1 ≤ i ≤ n pardo
            A(i) := B(i)
    This means that the following n operations are performed concurrently: processor P1 assigns B(1) into A(1), processor P2 assigns B(2) into A(2), and so on. …… An alternative model which is actually an alternative presentation mode, called Work-Depth, …… Strictly speaking, WD actually defines a slightly different model of computation. Consider an instruction of the form
        for i , 1 ≤ i ≤ α pardo
          A(i) := B(C(i))
    where the time unit under consideration, consists of a sequence of α concurrent instructions, for some positive integer α.
     
  18. ^ Skillicorn, David B., and Domenico Talia, Models and languages for parallel computation, ACM Computing Surveys, 30.2 123–169 (1998), https://www.cs.utexas.edu/users/browne/CS392Cf2000/papers/ModelsOfParallelComputation-Skillicorn.pdf頁面存檔備份,存於網際網路檔案館
  19. ^ Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling. BSP Programming Library (PDF). Parallel Computing. December 1998, 24 (14): 1947–1980 [2019-12-05]. (原始內容存檔 (PDF)於2019-11-04). Like many other communications libraries, BSPlib adopts a Single Program Multiple Data(SPMD) programming model. The task of writing an SPMD program will typically involve mapping a problem that manipulates a data structure of size N into p instances of a program that each manipulate an N/p sized block of the original domain. The role of BSPlib is to provide the infrastructure required for the user to take care of the data distribution, and any implied communication necessary to manipulate parts of the data structure that are on a remote process. 
  20. ^ BSPlib頁面存檔備份,存於網際網路檔案館
  21. ^ MulticoreBSP頁面存檔備份,存於網際網路檔案館
  22. ^ BSPonMPI頁面存檔備份,存於網際網路檔案館
  23. ^ Introduction to Apache Giraph. [2019-12-09]. (原始內容存檔於2019-12-19). Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop. The input to a Giraph computation is a graph composed of vertices and directed edges, …… Computation proceeds as a sequence of iterations, called supersteps in BSP. ……There is a barrier between consecutive supersteps. 
  24. ^ Apache Hama. [2019-12-09]. (原始內容存檔於2019-12-18). Apache Hama(TM) is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model, ……It provides not only pure BSP programming model but also vertex and neuron centric programming models, inspired by Google's Pregel and DistBelief. 
  25. ^ OpenSHMEM. [2019-12-09]. (原始內容存檔於2019-12-09). The Programming Models and Languages team is focused on developing the OpenSHMEM programming model for extreme scale systems. ……Currently, the team partners with NVIDIA, University of Tennessee, Knoxville, Florida State University and Paratools. …… UCX provides communication interfaces, and protocols for efficiently implementing parallel programming models such as MPI, OpenSHMEM, and task-based models. 
  26. ^ Armstrong, Joe. Erlang. Communications of the ACM. September 2010, 53 (9): 68–75 [2020-05-07]. doi:10.1145/1810891.1810910. (原始內容存檔於2020-06-09). Erlang is conceptually similar to the occam programming language, though it recasts the ideas of CSP in a functional framework and uses asynchronous message passing instead of the synchronous message passing in CSP. 
  27. ^ CUDA Programming Guide 1.0 (PDF). 6/23/2007 [2019-12-09]. (原始內容存檔 (PDF)於2018-04-17). CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device without the need of mapping them to a graphics API. 
    Apple to Ship Mac OS X Snow Leopard on August 28. August 24, 2009 [2019-12-09]. (原始內容存檔於2019-12-09). OpenCL, a C-based open standard, allows developers to tap the incredible power of the graphics processing unit for tasks that go beyond graphics. 
    September 2009 OpenCL Public Downloads. [2019-12-09]. (原始內容存檔於2019-12-09). Note: OpenCL drivers have been included in all publicly available NVIDIA drivers since October 2009. The CUDA Toolkit now includes the Visual Profiler for OpenCL as well as the OpenCL Programming Guide, Best Practices Guide, and other developer documentation. 
    MPI Solutions for GPUs. [2019-12-09]. (原始內容存檔於2019-12-09). MPI is fully compatible with CUDA, CUDA Fortran, and OpenACC, all of which are designed for parallel computing on a single computer or node. 
    OpenSHMEM Communication on GPUs. [2019-12-09]. (原始內容存檔於2019-12-09). NVSHMEM implements the OpenSHMEM standard for GPU memory, with extensions for improved performance on GPUs. 
  28. ^ Pocl頁面存檔備份,存於網際網路檔案館
  29. ^ Brook Language. [2019-12-09]. (原始內容存檔於2019-11-19). Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity into a familiar, efficient language. The general computational model, referred to as streaming, ……A stream is a new data type addition which represents a collection of data which can be operated on in parallel. ……Kernels are special functions that operate on streams. A kernel is a parallel function applied to every element of the input streams. 
  30. ^ Howes, Lee. The OpenCL Specification Version: 2.1 Document Revision: 23 (PDF). Khronos OpenCL Working Group. November 11, 2015 [November 16, 2015]. (原始內容存檔 (PDF)於2015-11-18). 
  31. ^ Stone, John E.; Gohara, David; Shi, Guochin. OpenCL: a parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering. 2010, 12 (3): 66–73. Bibcode:2010CSE....12c..66S. PMC 2964860 . PMID 21037981. doi:10.1109/MCSE.2010.69. 
  32. ^ Xilinx Inc, Form 8-K, Current Report, Filing Date Apr 26, 2006. secdatabase.com. [May 6, 2018]. (原始內容存檔於2018-05-07). 
  33. ^ Why use OpenCL on FPGAs?. StreamComputing. 2014-09-16 [2019-12-06]. (原始內容存檔於2017-01-01). 
  34. ^ MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (原始內容存檔 (PDF)於2020-06-15). MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. 
  35. ^ MapReduce Tutorial. Apache Hadoop. [3 July 2019]. (原始內容存檔於2019-12-14). 
  36. ^ MapReduce: Simplified Data Processing on Large Clusters (PDF). googleusercontent.com. [2019-12-05]. (原始內容存檔 (PDF)於2020-06-15). Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance. 
  37. ^ Lämmel, R. Google's MapReduce programming model — Revisited (PDF). Science of Computer Programming. 2008, 70: 1–30 [2019-12-07]. doi:10.1016/j.scico.2007.07.001. (原始內容存檔 (PDF)於2019-12-07). The following overview lists more detailed questions and summarizes our findings:
    Is MAP essentially the map combinator? NO.
    Is MAP essentially an application of the map combinator? NO.
    Does MAP essentially serve as the argument of map? YES.
    Is REDUCE essentially the reduce combinator? NO.
    Is REDUCE essentially an application of the reduce combinator? TYPICALLY.
    Does REDUCE essentially serve as the argument of reduce? NO.
    Does REDUCE essentially serve as the argument of map? YES.
    Hence, there is no trivial correspondence between MapReduce's MAP&REDUCE and what is normally called map&reduce in functional programming.
     
  38. ^ Vishkin, Uzi, Using simple abstraction to reinvent computing for parallelism (PDF), Communications of the ACM, 2011, 54: 75–85, doi:10.1145/1866739.1866757, The array exchange serial algorithm serially iterates the standard exchange algorithm n times. Its pseudo-code follows.
    For i =0 to n−1 do
        X:=A( i ) ; A( i ):=B( i ) ; B( i ):=X
    …… A parallel array exchange algorithm uses an auxiliary array X[0..n-1] of size n, the parallel algorithm applies concurrently the iterations of the above serial algorithm, each exchanging A(i) with B(i) for a different value of i. Note the new pardo command in the following pseudo-code.
    For i =0 to n−1 pardo
        X( i ):=A( i ) ; A( i ):=B( i ) ; B( i ):=X( i )
    …… Consider the following example of a small XMTC program for the parallel exchange algorithm of the previous section:
    spawn (0 ,n−1) {
        var x
          x:=A( $ ) ; A( $):=B( $ ) ; B($ ):=x
    }
    The program simply spawns a concurrent thread for each of the depth-3 serial exchange iterations, using a local variable x. Note that the join command is implicit, and implied by the right parenthesis at the end of the above program. …… Commitments to silicon of XMT include a 64-processor, 75MHz computer based on field-programmable gate array(FPGA) technology [29], and 64-processor ASIC 10mm X 10mm chip using IBM』 s 90nm technology, pictured in Figure5. A basic yet stable compiler has also been developed. …… XMT is easy to build. A single graduate student, with no prior design experience, completed the XMT hardware description (in Verilog) in just over 2 year.
     
  39. ^ Vishkin, Uzi; Dascal, Shlomit; Berkovich, Efraim; Nuzman, Joseph, Explicit Multi-Threading (XMT) bridging models for instruction parallelism, Proc. 1998 ACM Symposium on Parallel Algorithms and Architectures (SPAA): 140–151, 1998 [2019-12-12], (原始內容存檔於2017-09-21), This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP); that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm …… . …… We actually suggest two alternative bridging models: (1) Spawn-based multi-threading (Spawn-MT): This is based on (asynchronous, or synchronous) nesting-free Spawn-Join commands. This is the main model for the presentation in this paper. (2) Elastic multi-threading (EMT): This is more involved model concept enables nesting of Spawn-Join commands (similar to NESL英語NESL) and relies on the ability of the hardware to suspend, and later reactive, threads. 

進一步閱讀