
Several Ways to Download Files with Python

大林鸱 · About 2 minutes · Dev tools · File download · Python


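This post covers downloading files from the web with a Python script. Why would you need that? Isn't a browser more convenient? It is. In my case I was downloading large model files: they were so big that the git download failed, and the browser download was too slow, so I tried a Python script. As it turned out, that was not fast either. This is just a record, since you may run into other situations where you need to download files with a Python script.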

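  1. Using the requests library
import requests
url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# Note: this loads the whole response into memory before writing it to disk
myfile = requests.get(url)
myfile.raise_for_status()  # fail on an HTTP error instead of saving an error page
with open('pytorch_model-00003-of-00007.bin', 'wb') as f:
    f.write(myfile.content)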
  2. Using the wget library
import wget
url = "https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin"
wget.download(url, 'pytorch_model-00003-of-00007.bin')
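Both snippets above type the local file name by hand even though it already appears in the URL's `FilePath` query parameter. As a minimal sketch, the name could be derived from the URL instead. The helper `filename_from_url`, its `query_key` parameter, and the `download.bin` fallback are hypothetical names introduced here for illustration, not part of requests or wget.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, unquote, urlsplit


def filename_from_url(url, query_key="FilePath", default="download.bin"):
    """Guess a local file name for a download URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Prefer the query parameter, e.g. ...&FilePath=pytorch_model-00003-of-00007.bin
    values = parse_qs(parts.query).get(query_key)
    if values:
        name = PurePosixPath(values[0]).name
    else:
        # Otherwise fall back to the last segment of the URL path
        name = PurePosixPath(unquote(parts.path)).name
    return name or default


url = ("https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo"
       "?Revision=master&FilePath=pytorch_model-00003-of-00007.bin")
print(filename_from_url(url))  # prints "pytorch_model-00003-of-00007.bin"
```

The result can then be passed as the output path to `open(...)` or `wget.download(url, ...)`.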
  3. Downloading a large file in chunks
import requests
url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# stream=True defers fetching the body so it can be read piece by piece
r = requests.get(url, stream=True)
with open("pytorch_model-00003-of-00007.bin", "wb") as f:
    for chunk in r.iter_content(chunk_size=10240000):  # about 10.24 MB per chunk
        if chunk:  # skip empty chunks
            f.write(chunk)
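The write loop above can be isolated into a small helper that accepts any iterable of byte chunks, which makes it possible to try it out without touching the network. This is only a sketch: `save_chunks` is a hypothetical name, and the in-memory list stands in for `r.iter_content(...)`.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for r.iter_content(chunk_size=...): three chunks, one of them empty
fake_chunks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.bin")
total = save_chunks(fake_chunks, target)
print(total, os.path.getsize(target))  # prints "7 7"
```

With requests, the call would look like `save_chunks(r.iter_content(chunk_size=10240000), "pytorch_model-00003-of-00007.bin")`. If the server sends a `Content-Length` header, comparing it with the returned byte count is one way to spot a truncated download.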
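  4. Downloading a file behind a redirect
import requests
url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
# requests.get follows redirects by default; allow_redirects=True makes that explicit
myfile = requests.get(url, allow_redirects=True)
with open('hello.pdf', 'wb') as f:
    f.write(myfile.content)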
  5. Downloading multiple files: serial
import requests
from time import time

def url_response(url):
    # Each item is a (local path, URL) pair
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        # Iterating over the response yields the body in small chunks
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
for x in urls:
    url_response(x)

print(f"Time to download: {time() - start}")

# Time to download: 7.306085824966431
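  6. Downloading multiple files: parallel
import requests
from time import time
from multiprocessing.pool import ThreadPool

def url_response(url):
    # Each item is a (local path, URL) pair
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        # Iterating over the response yields the body in small chunks
        for ch in r:
            f.write(ch)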

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

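start = time()
# Starts 9 threads. My machine has 8 cores, but downloads are I/O-bound, so the
# best thread count depends more on the network and the server than on the cores.
with ThreadPool(9) as pool:
    # imap_unordered is lazy: consume it so the timer stops after all downloads finish
    for _ in pool.imap_unordered(url_response, urls):
        pass
print(f"Time to download: {time() - start}")

# The original run printed "Time to download: 0.0064961910247802734", but it never
# consumed the iterator, so that figure measured task submission, not the downloads.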
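Why a parallel timing has to wait for the workers can be shown without any network access. In the sketch below, `fake_download` is a hypothetical stand-in for `url_response` that sleeps instead of downloading, and the job names and 0.2 s delay are made up for the demonstration.

```python
from multiprocessing.pool import ThreadPool
from time import perf_counter, sleep


def fake_download(item):
    """Hypothetical stand-in for url_response: sleeps instead of using the network."""
    name, delay = item
    sleep(delay)
    return name


jobs = [(f"Event{i}", 0.2) for i in range(1, 5)]

with ThreadPool(4) as pool:
    start = perf_counter()
    lazy = pool.imap_unordered(fake_download, jobs)  # returns immediately
    submit_time = perf_counter() - start
    done = sorted(lazy)  # consuming the iterator blocks until every job has ended
    total_time = perf_counter() - start

print(f"submit: {submit_time:.3f}s, finished: {total_time:.3f}s, jobs: {done}")
```

The first figure only covers handing the jobs to the pool, while the second one covers the actual work, so the second is the one to compare with the serial run.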