Several Ways to Download Files with Python
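This post covers downloading files from the web with a Python script. Why would you need that when a browser is more convenient? A browser is indeed more convenient. I ran into this while downloading large-model files: the files were so big that the git download failed, and the browser download was too slow, so I tried a Python script. As it turned out, that was not fast either. This is only a record of what I tried; you may run into other situations where you need a Python script to download files.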
- Using the requests library
import requests
url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# Note: .content holds the whole response in memory before it is written to disk
myfile = requests.get(url)
with open('pytorch_model-00003-of-00007.bin', 'wb') as f:
    f.write(myfile.content)
- Using the wget library
import wget
url = "https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin"
wget.download(url, 'pytorch_model-00003-of-00007.bin')
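Both snippets above hard-code the local file name even though it already appears in the URL. A minimal sketch of deriving it instead, using only the standard library; the helper name `filename_from_url` and the fallback `download.bin` are made up for illustration, and reading the name from the `FilePath` query parameter is an assumption based on the ModelScope URL used in this post:

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlparse


def filename_from_url(url, default="download.bin"):
    """Guess a local file name from a download URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Assumption: ModelScope-style URLs carry the name in the FilePath parameter.
    file_path = parse_qs(parsed.query).get("FilePath")
    if file_path:
        return PurePosixPath(file_path[0]).name
    # Otherwise fall back to the last path segment, or to the default name.
    return PurePosixPath(parsed.path).name or default


url = "https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin"
print(filename_from_url(url))
```

The result can be passed as the second argument of `wget.download` or to `open`.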
- Downloading a large file in chunks
import requests
url = 'https://www.modelscope.cn/api/v1/models/ZhipuAI/chatglm3-6b/repo?Revision=master&FilePath=pytorch_model-00003-of-00007.bin'
# stream=True defers the body so it can be read piece by piece
r = requests.get(url, stream=True)
with open("pytorch_model-00003-of-00007.bin", "wb") as f:
    for chunk in r.iter_content(chunk_size=10240000):  # 10.24 MB per chunk
        if chunk:  # skip empty chunks
            f.write(chunk)
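With a multi-gigabyte file it helps to see how far the chunked loop has got. The sketch below separates the writing loop into a helper that accepts any iterable of byte chunks, so it can be tried without a network connection. The name `save_chunks` and its `total` and `report` parameters are illustrative, not part of requests:

```python
import os
import tempfile


def save_chunks(chunks, path, total=None, report=print):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if not chunk:  # skip empty chunks
                continue
            f.write(chunk)
            written += len(chunk)
            if total:  # progress is only reported when the size is known
                report(f"{written / total:.0%} ({written}/{total} bytes)")
    return written


# Stand-in data instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    n = save_chunks([b"abcd", b"", b"efgh"], demo_path, total=8)
    with open(demo_path, "rb") as f:
        data = f.read()

# With requests, one way to feed it (assuming the server sends Content-Length):
#   r = requests.get(url, stream=True)
#   total = int(r.headers.get("Content-Length", 0)) or None
#   save_chunks(r.iter_content(chunk_size=10240000), "model.bin", total)
```

Keeping the loop independent of the HTTP library also makes it easy to swap the progress printer for a logger through the `report` argument.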
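- Downloading a file behind a redirect
import requests
url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest/'
# allow_redirects=True is already the default for GET; it is spelled out here for clarity
myfile = requests.get(url, allow_redirects=True)
with open('hello.pdf', 'wb') as f:
    f.write(myfile.content)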
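When a redirected download saves something unexpected, it can help to see which URLs were visited. A requests response keeps the intermediate responses in `response.history` and the final address in `response.url`. The sketch below lists the hops; `redirect_chain` is a made-up helper, and the objects it is run on are stand-ins with the same attribute names rather than real responses, so the example works offline:

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-ins mimicking requests.Response attributes (hypothetical URLs)
first = SimpleNamespace(status_code=302, url="https://example.com/latest/")
final = SimpleNamespace(
    status_code=200,
    url="https://example.com/files/guide.pdf",
    history=[first],
)

for status, hop_url in redirect_chain(final):
    print(status, hop_url)
```

With a real download, pass the object returned by `requests.get(url)` instead of `final`.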
- Downloading multiple files: serial
import requests
from time import time

def url_response(url):
    # Each item is a (local path, URL) pair
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
# Download the files one after another
for x in urls:
    url_response(x)
print(f"Time to download: {time() - start}")
# Time to download: 7.306085824966431
- Downloading multiple files: parallel
import requests
from time import time
from multiprocessing.pool import ThreadPool

def url_response(url):
    # Each item is a (local path, URL) pair
    path, url = url
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]

start = time()
# Starts 9 threads. Choose the thread count with the machine in mind: I have 8 cores,
# so 8 would also do. Downloads are I/O-bound, so beyond a point extra threads add little.
with ThreadPool(9) as pool:
    # imap_unordered returns at once; iterate so the timer waits for every download.
    for _ in pool.imap_unordered(url_response, urls):
        pass
print(f"Time to download: {time() - start}")
# Time to download: 0.0064961910247802734
# (this figure was measured without waiting for the results, so it only covers task submission)
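A thread pool only gives a meaningful timing if the script waits for every worker before stopping the clock. As an alternative to `ThreadPool`, the sketch below uses the standard library's `concurrent.futures.ThreadPoolExecutor`, whose `with` block waits for all submitted jobs. To keep it runnable offline, `fake_download` is a stand-in that sleeps instead of calling `requests.get`, and the job list uses placeholder URLs:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(job):
    """Stand-in for url_response: sleeps instead of hitting the network."""
    path, url = job
    time.sleep(0.2)  # pretend the transfer takes 0.2 s
    return path


# Placeholder jobs in the same (path, url) shape as above
jobs = [(f"Event{i}", f"https://example.com/{i}") for i in range(1, 9)]

start = time.perf_counter()
done = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fake_download, job) for job in jobs]
    for future in as_completed(futures):
        # result() re-raises any exception raised inside the worker
        done.append(future.result())
elapsed = time.perf_counter() - start

print(f"Finished {len(done)} jobs in {elapsed:.2f}s")
```

Run one after another, the eight stand-in jobs would take about 1.6 seconds; with eight workers the total should be close to the duration of a single job. Calling `future.result()` also surfaces failed downloads, which a discarded iterator would hide.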