Uploading files with non-ASCII filenames using Python requests
The Python requests module lets you upload files. You pass them in the files parameter, where you can also specify the filename. Requests then automatically sends a multipart/form-data request containing the file. For example:
    import requests

    requests.post('http://localhost/upload.php', files={
        'fieldname': ('filename.txt', 'some contents')
    })

This results in a request like this:
    POST /upload.php HTTP/1.1
    Host: localhost
    Content-Length: 166
    User-Agent: python-requests/2.9.1
    Content-Type: multipart/form-data; boundary=a3e4676b2d2f469cb31d081e619f87c1

    --a3e4676b2d2f469cb31d081e619f87c1
    Content-Disposition: form-data; name="fieldname"; filename="filename.txt"

    some contents
    --a3e4676b2d2f469cb31d081e619f87c1--
As you can see, the filename we specified shows up in the filename parameter of the Content-Disposition header. The format of this header changes a bit if we use a Unicode filename:
    # coding: utf8
    import requests

    requests.post('http://localhost/upload.php', files={
        'fieldname': (u'tête-à-tête.txt', 'some contents')
    })
The Content-Disposition header of the request now looks like this:

    Content-Disposition: form-data; name="fieldname"; filename*=utf-8''t%C3%AAte-%C3%A0-t%C3%AAte.txt

What happened to our filename? Because it is non-ASCII, it has been encoded. However, it has been encoded in a way the PHP script won't understand, so our file will not be uploaded correctly. The urllib3 library, which requests uses to build the multipart body, encodes these filenames according to RFC 2231. Even though this is the correct encoding to use in MIME responses, it isn't in form submissions. This is not entirely urllib3's fault, because for a long time it was unclear how to encode non-ASCII filenames. It is clear now that the encoding that urllib3 uses is not the right one. According to RFC 7578:
    NOTE: The encoding method described in [RFC5987], which would add a "filename*" parameter to the Content-Disposition header field, MUST NOT be used.

The urllib3 developers know this, and there is an issue and a pull request to solve it.
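To see where the filename* value comes from, here is a minimal sketch that reproduces it with the Python 3 standard library. It uses email.utils.encode_rfc2231, which yields the same string as in the header shown earlier; whether urllib3 calls exactly this helper in your version is an assumption. The manual decoding at the end is only an illustration of what a receiver would have to do, not how PHP parses uploads.

```python
from email.utils import encode_rfc2231
from urllib.parse import unquote

filename = 'tête-à-tête.txt'

# RFC 2231 extended value: charset'language'percent-encoded-bytes
# (the language part is left empty here).
extended = encode_rfc2231(filename, 'utf-8')
header = ('Content-Disposition: form-data; name="fieldname"; '
          'filename*=' + extended)
print(header)

# A receiver that understands this format has to split off the charset
# and undo the percent-encoding to get the original name back.
charset, language, encoded = extended.split("'", 2)
decoded = unquote(encoded, encoding=charset)
print(decoded)
```

A form parser that only looks for a plain filename parameter never sees a name at all in this header, which is why the upload fails.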
Workaround
We want requests to use the correct filename when uploading. To do this, it would be nice if we could hook in somewhere to change the filename parameter in the Content-Disposition header just before the request is sent. We can use the auth hook, although it is obviously meant for authentication and not for request rewriting:
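    # coding: utf8
    import re

    import requests


    def rewrite_request(prepared_request):
        # UTF-8 bytes of the filename we want in the header.
        filename = u'tête-à-tête.txt'.encode('utf-8')
        # Replace the "filename*=..." parameter. Stop before the CRLF:
        # "." also matches "\r", which would strip the line terminator.
        prepared_request.body = re.sub(
            br'filename\*=[^\r\n]*',
            b'filename=' + filename,
            prepared_request.body)
        return prepared_request


    requests.post('http://localhost/upload.php', files={
        'fieldname': (u'tête-à-tête.txt', 'some contents')
    }, auth=rewrite_request)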
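Requests creates a PreparedRequest and calls the auth handler on it before sending it off. We change the filename in the request body using a regex and return the request. Requests recomputes the Content-Length header after the auth handler has run, so the changed body length is not a problem. This way, the filename is encoded using UTF-8 instead of RFC 2231: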
    Content-Disposition: form-data; name="fieldname"; filename=tête-à-tête.txt
If you want a less dirty hack, you can create your own PreparedRequest and send it, without using the auth hook.
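That variant could look roughly like the sketch below, written for Python 3. The helper names use_plain_filename and upload are made up for this example and are not part of requests. It assumes the prepared body is a bytes object, which is the case for in-memory uploads like the ones above. Outside the auth hook nothing recomputes Content-Length for us, so the sketch sets it by hand.

```python
import re


def use_plain_filename(body, filename):
    """Swap an RFC 2231 "filename*=..." parameter in a multipart body
    (bytes) for a plain "filename=<UTF-8 bytes>" parameter."""
    encoded_name = filename.encode('utf-8')
    # Match up to, but not including, the CRLF that ends the header line.
    return re.sub(br'filename\*=[^\r\n]*',
                  lambda match: b'filename=' + encoded_name,
                  body)


def upload(url, fieldname, filename, contents):
    """Build the request ourselves, fix the body, then send it."""
    # Imported here so the helper above can be used without requests.
    import requests

    session = requests.Session()
    request = requests.Request('POST', url,
                               files={fieldname: (filename, contents)})
    prepared = session.prepare_request(request)

    prepared.body = use_plain_filename(prepared.body, filename)
    # The body length may have changed, so update the header ourselves.
    prepared.headers['Content-Length'] = str(len(prepared.body))

    return session.send(prepared)
```

Calling upload('http://localhost/upload.php', 'fieldname', 'tête-à-tête.txt', 'some contents') would then send the same request as the auth-hook version. If the body contains no filename* parameter, the helper leaves it unchanged.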