
How to Parallel Process in R

Overview

R is not a language you choose for its raw speed, but there are definitely times when speed is necessary. Even when the code is well written and tidy, if it still takes too long, parallel processing or GPUs are the usual next step. At first glance it might seem that R has little need for parallel processing, but it becomes especially useful when handling big data or running large-scale simulations. In fact, R can be seen as a language that makes extensive use of parallel processing.

Code

Here is R code that draws 1000 random odd numbers, finds the smallest prime factor of each with a sieve-style function, and measures how long this takes, first with an ordinary loop and then with parallel processing:

library(foreach)
library(doParallel)

# Returns the smallest prime factor of n (or n itself when n is prime)
# by repeatedly sieving out multiples of the smallest remaining value.
eratosthenes <- function(n) {
  residue <- 2:n
  while (n %in% residue) {
    p <- residue[1]
    residue <- residue[as.logical(residue %% p)]
  }
  return(p)
}

set.seed(150421)
test <- sample(2 * 1:10^5 + 1, 1000)   # 1000 random odd numbers up to 200001

# Sequential version: an ordinary for loop
system.time({
  for (n in test) {
    eratosthenes(n)
  }
})

# Parallel version: foreach + %dopar% on a local cluster
numCores  <- detectCores() - 1         # leave one logical processor free
myCluster <- makeCluster(numCores)
registerDoParallel(myCluster)

record <- numeric(0)
clusterExport(myCluster, "record")     # copy record to the workers

system.time({
  record <- foreach(n = test, .combine = c) %dopar% {
    eratosthenes(n)
  }
})
stopCluster(myCluster)

[Screenshot: system.time results for the sequential loop and the parallel run]
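
Since the parallel results are now collected into record, one easy check, sketched below, is to recompute the answers with an ordinary sapply() and compare; sequential here is just an illustrative name:

# Recompute sequentially and verify the parallel run gave the same answers.
# foreach returns results in input order by default (.inorder = TRUE).
sequential <- sapply(test, eratosthenes)
all(record == sequential)   # expected to be TRUE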

Explanation

Finding the smallest prime factor is not the kind of problem that benefits dramatically from parallelization, yet the elapsed time still drops by more than half. Note that having eight logical processors does not cut the execution time to one eighth. Viewed positively, even without explicit parallel processing the CPU already distributes the work reasonably well; viewed negatively, parallel processing does not change the fact that it is still the same CPU doing the work. Either way, the benefit of parallel processing here is substantial.
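To put a number on the gain, you can keep the two elapsed times and divide them; t_seq and t_par below are just placeholder names, and the sketch assumes a cluster is still registered with registerDoParallel():

# Compute the speedup factor from the two elapsed wall-clock times.
t_seq <- system.time(for (n in test) eratosthenes(n))["elapsed"]
t_par <- system.time(
  record <- foreach(n = test, .combine = c) %dopar% eratosthenes(n)
)["elapsed"]
unname(t_seq / t_par)   # with 7 workers, expect noticeably less than 7x

The following GIFs show the CPU usage while running the ordinary loop and while running the parallel version, respectively: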

[GIF: CPU usage while running the ordinary loop]

The GIF above shows the CPU usage while the ordinary loop runs: the work is confined to a few logical processors, and the rest sit essentially idle.

[GIF: CPU usage while running the parallel version]

The GIF above shows the CPU usage during parallel processing. Unlike the ordinary loop, every processor is pinned at essentially 100% usage: whatever the actual time saved, the machine is clearly giving it everything it has.

The packages used are foreach and doParallel, and the key functions are as follows (a reusable template that puts them all together is sketched after the list):

  1. detectCores(): Finds and returns the number of logical processors on the current machine. The example subtracts one (the -1) because if every processor were handed over to parallel processing, the computer would be almost unusable while it runs. Even if you don't plan to do much in the meantime, it is still nice to be able to move the mouse cursor and check on progress, so unless you are chasing maximum efficiency, leave one processor free. Without this precaution it is hard to tell whether the screen is unresponsive because of heavy computation or because the machine has actually frozen.
  2. makeCluster(): As the name suggests, it creates a cluster: it launches the worker R processes that will carry out the parallel work. You can loosely think of it as reserving the resources for the job.
  3. registerDoParallel(): Registers the created cluster as the parallel backend, so %dopar% knows where to send the work.
  4. clusterExport(): Copies variables from the current R session into each worker of the cluster so the workers can use them. In the example, the empty record vector is exported to myCluster; with foreach this explicit export is often unnecessary, since objects referenced in the loop body are exported automatically, but the call shows how to push data to the workers by hand. The syntax may feel awkward at first but becomes natural with familiarity.
  5. foreach(): A construct common in languages other than R and Python, but here it comes with a somewhat different syntax geared toward parallel processing. n = test plays the same role as n in test does in an ordinary R for loop. The .combine option decides how the individual results are aggregated: .combine = c simply concatenates them in order into a vector, which makes sense since c() is the function that builds vectors. You could instead collect the results yourself, for example by growing a vector inside the loop, but growing objects incrementally is slow in R, so it is better to avoid that style from the start.
  6. %dopar%: Best understood as a piece of syntax for parallel processing: it tells foreach() to run the loop it defines on the registered parallel backend (the sequential counterpart is %do%).
  7. stopCluster(): Stops the cluster. If makeCluster() sets up the workers and claims their resources, stopCluster() shuts them down and releases those resources. If you know a bit about how computers work, it is easy to see why this matters; even if you don't, anyone running jobs heavy enough to need parallel processing will soon appreciate why such a function exists.
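
Putting the pieces together, the whole workflow boils down to a short template you can adapt; my_task() and inputs below are placeholders for your own function and data, not anything provided by these packages:

library(foreach)
library(doParallel)

# Generic skeleton for a parallel loop with foreach + doParallel.
numCores  <- detectCores() - 1     # leave one logical processor free
myCluster <- makeCluster(numCores) # launch the worker R sessions
registerDoParallel(myCluster)      # point %dopar% at those workers

results <- foreach(x = inputs, .combine = c) %dopar% {
  my_task(x)                       # each iteration runs on some worker
}

stopCluster(myCluster)             # shut the workers down when finished

If each iteration returns something other than a single number, swap the combiner: .combine = rbind stacks row results into a matrix or data frame, and leaving .combine unset returns a list.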

Interestingly, it does not matter much if you don't know exactly what each of these functions does. You can copy and paste the boilerplate, adjust the parts you need, and just make sure you understand the options of foreach(). Trying it once beats reading about it a hundred times: if you really need parallel processing, don't agonize over the meaning of every function and simply give it a try.