Speed up rsync with Ruby to transfer a huge number of files (1 TB in my case)


The servers of my own website, Fan Party (built with Ruby on Rails), were taken offline together with the entire Steephost data centre in Kharkov, Ukraine, on April 16th, 2015 by the Ukrainian security service, the SBU. They were looking for a website supporting terrorism in Ukraine. And it seems that in Ukraine there is no other way to find one physical server than to take all 130 physical servers out of a data centre.

I spent the last couple of weeks hoping that our security service would return the data (about 3 TB of user images and a 7 GB MySQL database). Users want their beloved site back online. But so far there has been no luck.

As a result, I decided to recreate the server from scratch in a different data centre, located in another country. It seems that hosting a server in Ukraine has become dangerous. No one likes to lose data.

What I have locally: tons of images (a 1 TB backup made 2 years ago; it is better than nothing). The images are located in 3365 folders. I used rsync to make that backup, and I was going to use it again to transfer the data to a remote server over SSH. I looked for alternative software, but it seems that there is nothing better than rsync.

As you may know, rsync first walks the entire folder structure and builds a list of the folders and files to be transferred. Since I have a huge number of small images, that scan takes about 6 hours. That is a really long time. I also knew that an rsync transfer could be interrupted by a restart of my Mac or by a network failure. And that is exactly what happened: I had to restart my laptop while working, and the transfer was interrupted twice because of Internet connectivity issues.

All of that made me think harder, and I created my own solution. It is a Ruby rake task that is part of the fanparty Rails application. My code improves the rsync file transfer in several ways:

  1. It calls rsync on each fanclub folder separately. That shortens the preparation phase, as rsync needs to examine only a limited number of files. Some folders are huge on their own (Justin Bieber, for example), but most are smaller.
  2. It keeps a list of the folders already copied. That way rsync does not examine all the folders again and again, which speeds up the entire transfer.
  3. It saves my laptop's system resources: CPU and RAM.
  4. It makes it possible to stop the transfer and continue from the last folder.
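
To illustrate the first point, here is a minimal sketch of splitting one big transfer into per-folder rsync runs. The helper names `club_folders` and `rsync_command` and the host `user@example.com` are hypothetical and exist only for this illustration. The sketch prints the commands instead of running them.

```ruby
require 'tmpdir'

# Hypothetical helper: lists the immediate subfolders of the backup root.
def club_folders(local_root)
  Dir.glob(File.join(local_root, '*')).select { |path| File.directory?(path) }.sort
end

# Hypothetical helper: argument list for one per-folder rsync run.
# The source has no trailing slash, so rsync recreates the folder itself
# under the destination instead of copying only its contents.
def rsync_command(folder, remote_path)
  ['rsync', '-avP', folder, remote_path]
end

# Dry demonstration in a temporary directory: print the commands only.
Dir.mktmpdir do |root|
  ['justin-bieber', 'small club'].each { |name| Dir.mkdir(File.join(root, name)) }

  club_folders(root).each do |folder|
    puts rsync_command(folder, 'user@example.com:/fanclubs').join(' ')
    # To really transfer, one could call:
    #   system(*rsync_command(folder, 'user@example.com:/fanclubs'))
  end
end
```

Each run only has to scan one folder, so the file list is built quickly and a failure costs at most one folder's worth of work.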

The process is simple. The code goes through the subfolders of the backup root folder and calls rsync on every folder found. As soon as a folder transfer completes, it saves the folder’s path to a CSV file, ‘synced.csv’. On the next start the rake task uses that CSV to avoid re-examining folders that rsync has already copied. I used the existing Ruby gem rsync-ruby, which provides a wrapper over rsync and gives me a success/error notification after each rsync run.
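
The checkpoint idea can be tried on its own, without rsync. The following is a rough sketch, assuming the same one-path-per-row CSV layout. The helper names `load_synced`, `mark_synced` and `pending_folders` are hypothetical and are not part of the rake task.

```ruby
require 'csv'
require 'tmpdir'

# Returns the folder paths already recorded in the checkpoint file,
# or an empty list when the file does not exist yet (first run).
def load_synced(csv_path)
  return [] unless File.exist?(csv_path)
  CSV.read(csv_path).flatten.compact
end

# Appends one finished folder to the checkpoint file.
def mark_synced(csv_path, folder)
  CSV.open(csv_path, 'ab') { |csv| csv << [folder] }
end

# Folders that still need an rsync run, in a stable order.
def pending_folders(all_folders, csv_path)
  all_folders.sort - load_synced(csv_path)
end

# Small demonstration in a temporary directory; no rsync is involved.
Dir.mktmpdir do |dir|
  csv_path = File.join(dir, 'synced.csv')
  folders = %w[a b c].map { |name| File.join(dir, name) }
  folders.each { |folder| Dir.mkdir(folder) }

  mark_synced(csv_path, folders[0]) # pretend folder "a" has finished
  remaining = pending_folders(folders, csv_path)
  puts remaining.map { |folder| File.basename(folder) }.inspect
  # prints ["b", "c"]
end
```

Because a folder is written to the file only after it has finished, an interrupted run loses at most the folder that was in progress.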

So, here is the code:

desc 'Upload fanparty.ru pictures with rsync'
task :upload_club_folders do
  require 'rsync'
  require 'csv'
  require 'shellwords'

  csv_path = "#{Rails.root}/synced.csv"
  local_root = '/fanparty/fanclubs'
  remote_path = 'user@fanparty.ru:/fanclubs'

  # One entry per fanclub folder directly under the backup root.
  all_folders = Dir["#{local_root}/*"].sort
  # Folders finished in earlier runs; the file does not exist on the first run.
  synced_folders = File.exist?(csv_path) ? CSV.read(csv_path).flatten : []
  folders_to_sync = all_folders - synced_folders

  puts "Total folders: #{all_folders.length}. Folders left: #{folders_to_sync.length}"

  folders_to_sync.each do |folder|
    next unless File.directory?(folder)

    # Escape the path, since folder names may contain spaces or quotes.
    Rsync.run(Shellwords.escape(folder), remote_path, ['-avP']) do |result|
      puts "#{File.basename(folder)}: #{result.error}"
      # Record the folder only after rsync reports success.
      if result.success?
        CSV.open(csv_path, 'ab') do |csv|
          csv << [folder]
        end
      end
    end
  end
end

With the rake task I have already copied about 140 GB in 15 hours. That is 5 times more than I transferred in the whole week before.

I hope this approach will help other people speed up the transfer of large numbers of files with rsync. I am going to use it again to back up the data.

Cheers
