Several Ways to Download Files with Python

大林鸱, about 2 minutes to read

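This post is about downloading files from the web with a Python script. Why would you need that when a browser is more convenient? A browser usually is more convenient. In my case I was downloading large model files: they were so big that the git download failed, and the browser download was too slow, so I tried a Python script. As it turned out, that was not fast either. This is only a record of what I tried; you may run into other situations where you need to download files with a Python script.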

Using the requests library

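import requests

url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# Note: this loads the whole response body into memory before writing it out
myfile = requests.get(url)
myfile.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open('pytorch_model-00003-of-00007.bin', 'wb') as f:
    f.write(myfile.content)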
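
The snippet above repeats a file name that already appears in the URL's FilePath query parameter. As a minimal sketch, the name could be derived from the URL instead. The helper name filename_from_url is hypothetical, and treating FilePath as the file name is an assumption read off this one URL, not something taken from the ModelScope documentation.

```python
import posixpath
from urllib.parse import parse_qs, unquote, urlparse


def filename_from_url(url, param="FilePath", default="download.bin"):
    """Guess a local file name for a download URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumption: a query parameter such as FilePath carries the file name
    values = parse_qs(parsed.query).get(param)
    if values:
        name = posixpath.basename(values[0])
    else:
        # Otherwise fall back to the last segment of the URL path
        name = posixpath.basename(unquote(parsed.path).rstrip("/"))
    # Use the default when the URL offers nothing usable
    return name or default
```

The result can then be passed to open() in place of the hard-coded name.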

Using the wget library

# wget here is the third-party Python package (pip install wget), not the command-line tool
import wget
url = "https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin"
wget.download(url, 'pytorch_model-00003-of-00007.bin')

Downloading a large file in chunks

import requests

url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# stream=True defers the body so it can be read piece by piece
with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("pytorch_model-00003-of-00007.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=10240000):  # 10.24 MB per chunk
            if chunk:  # skip empty chunks
                f.write(chunk)
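
With a multi-gigabyte file it helps to see how far the download has progressed. The sketch below is one way to wrap the chunk loop. save_chunks is a hypothetical helper, and it assumes the caller supplies the total size, for example from the Content-Length response header, which a server is not required to send.

```python
import sys


def save_chunks(chunks, path, total=None):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if not chunk:  # skip empty chunks
                continue
            f.write(chunk)
            written += len(chunk)
            if total:
                # Overwrite the same terminal line with the current percentage
                sys.stderr.write(f"\r{written / total:6.1%}")
    if total:
        sys.stderr.write("\n")
    return written
```

With the streaming request above, this could be called as save_chunks(r.iter_content(chunk_size=10240000), "pytorch_model-00003-of-00007.bin", int(r.headers.get("Content-Length", 0))). When the header is missing, the total is 0 and no progress is printed.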

Downloading a file behind a redirect

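import requests

url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
# allow_redirects=True is already the default for GET; it is spelled out here for clarity
myfile = requests.get(url, allow_redirects=True)
myfile.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open('hello.pdf', 'wb') as f:
    f.write(myfile.content)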

Downloading multiple files: serial

import requests
from time import time

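def url_response(item):
    # Each item is a (local file name, URL) pair
    path, url = item
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        # Iterating over a Response yields the body in small chunks
        for ch in r:
            f.write(ch)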

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
for x in urls:
    url_response(x)

print(f"Time to download: {time() - start}")

# Time to download: 7.306085824966431

Downloading multiple files: parallel

import requests
from time import time
from multiprocessing.pool import ThreadPool

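def url_response(item):
    # Each item is a (local file name, URL) pair
    path, url = item
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        # Iterating over a Response yields the body in small chunks
        for ch in r:
            f.write(ch)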

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

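start = time()
# Start 9 worker threads. My machine has 8 cores, but downloads are I/O-bound, so the
# useful thread count depends more on the network and the server than on the core count.
with ThreadPool(9) as pool:
    # imap_unordered returns a lazy iterator right away; consume it to wait for every download
    for _ in pool.imap_unordered(url_response, urls):
        pass
print(f"Time to download: {time() - start}")

# My original run printed "Time to download: 0.0064961910247802734", but that version did not
# wait for the pool to finish, so the figure does not include the downloads themselves.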
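
imap_unordered hands back a lazy iterator immediately, so the timing is only meaningful if the script waits for the results. As an alternative sketch, the standard library's concurrent.futures makes that waiting explicit and also surfaces per-file errors. run_parallel is a hypothetical helper, and worker stands for any one-argument callable such as url_response above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(worker, jobs, max_workers=8):
    """Run worker(job) for each job in a thread pool; return {job: error or None}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Jobs are used as dict keys, so they must be hashable (tuples of strings are)
        futures = {pool.submit(worker, job): job for job in jobs}
        # as_completed yields each future when its task finishes, in completion order
        for fut in as_completed(futures):
            job = futures[fut]
            results[job] = fut.exception()  # None when the task succeeded
    return results
```

A call such as errors = run_parallel(url_response, urls, max_workers=9) returns only after every download has finished or failed, so a timer wrapped around it covers the real work.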