Speed up `shutil.copytree` !

WBOY
Release: 2024-08-28 18:32:06
Original
337 people have browsed it

Speed up `shutil.copytree` !

Discuss of speeding up shutil.copytree

Write here

This is a discussion on , see: https://discuss.python.org/t/speed-up-shutil-copytree/62078. If you have any ideas, send me please !

Background

shutil is a very useful moudle in Python. You can find it in github: https://github.com/python/cpython/blob/master/Lib/shutil.py

shutil.copytree is a function that copies a folder to another folder.

In this function, it calls _copytree function to copy.

What does _copytree do ?

  1. Ignoring specified files/directories.
  2. Creating destination directories.
  3. Copying files or directories while handling symbolic links.
  4. Collecting and eventually raising errors encountered (e.g., permission issues).
  5. Replicating metadata of the source directory to the destination directory.

Problems

_copytree speed is not very fast when the numbers of files are large or the file size is large.

Test here:

import os import shutil os.mkdir('test') os.mkdir('test/source') def bench_mark(func, *args): import time start = time.time() func(*args) end = time.time() print(f'{func.__name__} takes {end - start} seconds') return end - start # write in 3000 files def write_in_5000_files(): for i in range(5000): with open(f'test/source/{i}.txt', 'w') as f: f.write('Hello World' + os.urandom(24).hex()) f.close() bench_mark(write_in_5000_files) def copy(): shutil.copytree('test/source', 'test/destination') bench_mark(copy)
Copy after login

The result is:

write_in_5000_files takes 4.084963083267212 seconds
copy takes 27.12768316268921 seconds

What I done

Multithreading

I use multithread to speed up the copying process. And I rename the funtion _copytree_single_threaded add a new function _copytree_multithreaded. Here is the copytree_multithreaded:

def _copytree_multithreaded(src, dst, symlinks=False, ignore=None, copy_function=shutil.copy2, ignore_dangling_symlinks=False, dirs_exist_ok=False, max_workers=4): """Recursively copy a directory tree using multiple threads.""" sys.audit("shutil.copytree", src, dst) # get the entries to copy entries = list(os.scandir(src)) # make the pool with ThreadPoolExecutor(max_workers=max_workers) as executor: # submit the tasks futures = [ executor.submit(_copytree_single_threaded, entries=[entry], src=src, dst=dst, symlinks=symlinks, ignore=ignore, copy_function=copy_function, ignore_dangling_symlinks=ignore_dangling_symlinks, dirs_exist_ok=dirs_exist_ok) for entry in entries ] # wait for the tasks for future in as_completed(futures): try: future.result() except Exception as e: print(f"Failed to copy: {e}") raise
Copy after login

I add a judgement to choose use multithread or not.

if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100*1024*1024: # multithreaded version return _copytree_multithreaded(src, dst, symlinks=symlinks, ignore=ignore, copy_function=copy_function, ignore_dangling_symlinks=ignore_dangling_symlinks, dirs_exist_ok=dirs_exist_ok) else: # single threaded version return _copytree_single_threaded(entries=entries, src=src, dst=dst, symlinks=symlinks, ignore=ignore, copy_function=copy_function, ignore_dangling_symlinks=ignore_dangling_symlinks, dirs_exist_ok=dirs_exist_ok)
Copy after login

Test

I write 50000 files in the source folder. Bench Mark:

def bench_mark(func, *args): import time start = time.perf_counter() func(*args) end = time.perf_counter() print(f"{func.__name__} costs {end - start}s")
Copy after login

Write in:

import os os.mkdir("Test") os.mkdir("Test/source") # write in 50000 files def write_in_file(): for i in range(50000): with open(f"Test/source/{i}.txt", 'w') as f: f.write(f"{i}") f.close()
Copy after login

Two comparing:

def copy1(): import shutil shutil.copytree('test/source', 'test/destination1') def copy2(): import my_shutil my_shutil.copytree('test/source', 'test/destination2')
Copy after login
  • "my_shutil" is my modified version of shutil.

copy1 costs 173.04780609999943s
copy2 costs 155.81321870000102s

copy2 is faster than copy1 a lot. You can run many times.

Advantages & Disadvantages

Use multithread can speed up the copying process. But it will increase the memory usage. But we do not need to rewrite the multithread in the code.

Async

Thanks to "Barry Scott". I will follow his/her suggestion :

You might get the same improvement for less overhead by using async I/O.

I write these code:

import os import shutil import asyncio from concurrent.futures import ThreadPoolExecutor import time # create directory def create_target_directory(dst): os.makedirs(dst, exist_ok=True) # copy 1 file async def copy_file_async(src, dst): loop = asyncio.get_event_loop() await loop.run_in_executor(None, shutil.copy2, src, dst) # copy directory async def copy_directory_async(src, dst, symlinks=False, ignore=None, dirs_exist_ok=False): entries = os.scandir(src) create_target_directory(dst) tasks = [] for entry in entries: src_path = entry.path dst_path = os.path.join(dst, entry.name) if entry.is_dir(follow_symlinks=not symlinks): tasks.append(copy_directory_async(src_path, dst_path, symlinks, ignore, dirs_exist_ok)) else: tasks.append(copy_file_async(src_path, dst_path)) await asyncio.gather(*tasks) # choose copy method def choose_copy_method(entries, src, dst, **kwargs): if len(entries) >= 100 or sum(os.path.getsize(entry.path) for entry in entries) >= 100 * 1024 * 1024: # async version asyncio.run(copy_directory_async(src, dst, **kwargs)) else: # single thread version shutil.copytree(src, dst, **kwargs) # test function def bench_mark(func, *args): start = time.perf_counter() func(*args) end = time.perf_counter() print(f"{func.__name__} costs {end - start:.2f}s") # write in 50000 files def write_in_50000_files(): for i in range(50000): with open(f"Test/source/{i}.txt", 'w') as f: f.write(f"{i}") def main(): os.makedirs('Test/source', exist_ok=True) write_in_50000_files() # 单线程复制 def copy1(): shutil.copytree('Test/source', 'Test/destination1') def copy2(): shutil.copytree('Test/source', 'Test/destination2') # async def copy3(): entries = list(os.scandir('Test/source')) choose_copy_method(entries, 'Test/source', 'Test/destination3') bench_mark(copy1) bench_mark(copy2) bench_mark(copy3) shutil.rmtree('Test') if __name__ == "__main__": main()
Copy after login

Output:

copy1 costs 187.21s
copy2 costs 244.33s
copy3 costs 111.27s


You can see that the async version is faster than the single thread version. But the single thread version is faster than the multi-thread version. ( Maybe my test environment is not very good, you can try and send your result as a reply to me )

Thank you Barry Scott !

Advantages & Disadvantages

Async is a good choice. But no solution is perfect. If you find some problem, you can send me as a reply.

End

This is my first time to write discussion on python.org. If there is any problem, please let me know. Thank you.

My Github: https://github.com/mengqinyuan
My Dev.to: https://dev.to/mengqinyuan

The above is the detailed content of Speed up `shutil.copytree` !. For more information, please follow other related articles on the PHP Chinese website!

source:dev.to
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!