The story began in one of my previous projects. We had to build a feature that lets an administrator upload a text file containing a list of bad words. The system then used these words to check user-submitted content in real time. The uploaded file had to follow a specific format.
To discourage users from uploading anything other than text files, we can restrict the file picker on the frontend.
<input type="file" accept="text/plain" />
With this attribute, the file selection window offers only text files by default.
However, to ensure system security, just blocking the user on the interface is not enough. It is necessary to re-verify the uploaded file on the backend to see if the user has uploaded a text file or not. The problem we need to solve is to determine the actual type of file uploaded by the user.
To illustrate the above problem, we will build a demo system with a frontend using React.js and a backend using Java/Spring Boot.
Our interface is quite simple, consisting of an input[type=file] and a button to upload the selected file. When a file is selected, the UI displays the MIME type that the browser determined. After the upload, the system returns the MIME type identified by the backend. All source code is here.
We also need some files to test whether the system detects file types correctly.
Prepare 3 files with the correct extensions, then copy them and rename the copies:
- real.png -> fake.txt
- real.jpg -> fake.zip
- real.svg -> fake.docx
The backend of the project is written in Java using Spring Boot. A controller receives the upload request from the user.
And a Response class to return the result to the user.
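The original listing for this class is not reproduced here. As a minimal sketch, assuming the response carries a single field for the detected MIME type (the field name mimeType and its getter are assumptions, chosen to match the new Response(mimeType) calls used later), it could look like this:

```java
// Hypothetical reconstruction: a simple value object that the controller returns as the response body.
// The field name "mimeType" is an assumption, not taken from the original source.
final class Response {
    private final String mimeType;

    public Response(String mimeType) {
        this.mimeType = mimeType;
    }

    // A getter lets a JSON serializer expose the value as a "mimeType" property.
    public String getMimeType() {
        return mimeType;
    }
}
```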
Using the MIME Type defined by the User-agent
When a file is selected from input[type=file], the browser (user agent) has already determined its type in MIME type format, and it sends that type to the backend in the Content-Type header of the uploaded file's multipart part. So the MultipartFile parameter of the controller already carries information about the file's type.
Now you can use getContentType() to determine the file type based on the MIME type.
@PostMapping(path = "/check-file-type", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public ResponseEntity<Response> checkFileType(@RequestPart MultipartFile file) {
    String mimeType = file.getContentType();
    return ResponseEntity.ok(new Response(mimeType));
}
Let's test the files we prepared above.
For real.png, the user agent identified the correct MIME type from the .png extension. But for fake.zip, the user agent did not recognize the file as a JPEG; it derived the type from the .zip extension instead. Therefore, relying on a client-supplied MIME type is risky when users intentionally change a file's name and extension.
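To make the weakness concrete, here is a minimal sketch of name-based guessing, roughly the behavior observed above. The class name ExtensionGuess and its small lookup table are illustrative assumptions, not a browser's real mapping.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: guesses a MIME type from the file name alone.
final class ExtensionGuess {
    // Illustrative subset of an extension-to-MIME table (an assumption for this sketch).
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "png", "image/png",
            "jpg", "image/jpeg",
            "zip", "application/zip");

    static String guessFromName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            // No usable extension: fall back to the generic binary type.
            return "application/octet-stream";
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, "application/octet-stream");
    }
}
```

Because the file content is never read, ExtensionGuess.guessFromName("fake.zip") yields application/zip even though the bytes are a JPEG image.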
Each file type has different specifications and is stored differently, so if you want to determine the exact type of the file, you need to read the contents of that file.
MIME Type and some ways to determine file type
MIME type (Multipurpose Internet Mail Extensions) is a standard that defines the nature and format of a document, file, or set of bytes. It is defined and standardized in IETF's RFC 6838.
The structure of a MIME type consists of a type and a subtype:
type/subtype
Example: text/plain, application/zip, …
In detail:
- Type is the general category to which the data type belongs, such as video or text.
- Subtype identifies the exact kind of data within that category. For example, with type text, we can have subtypes like plain (plain text), html (HTML source code), or calendar (iCalendar .ics format).
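The type/subtype structure can be illustrated with a small parser. This is a sketch only: the class name MimeTypeParts is hypothetical, and it simply drops any optional parameters that follow a semicolon (for example "; charset=UTF-8").

```java
import java.util.Locale;

// Hypothetical value class that splits a MIME type string into its two parts.
final class MimeTypeParts {
    final String type;
    final String subtype;

    private MimeTypeParts(String type, String subtype) {
        this.type = type;
        this.subtype = subtype;
    }

    static MimeTypeParts parse(String mimeType) {
        // Drop optional parameters such as "; charset=UTF-8".
        int semicolon = mimeType.indexOf(';');
        String essence = (semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType).trim();

        int slash = essence.indexOf('/');
        if (slash <= 0 || slash == essence.length() - 1) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mimeType);
        }
        // Type and subtype names are case-insensitive, so normalize to lower case.
        return new MimeTypeParts(
                essence.substring(0, slash).toLowerCase(Locale.ROOT),
                essence.substring(slash + 1).toLowerCase(Locale.ROOT));
    }
}
```

For instance, MimeTypeParts.parse("text/plain") gives type text and subtype plain.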
In general, a MIME type is a name assigned to a file type. It tells the receiver what kind of content is being transmitted so that applications can handle it accordingly. Given a MIME type, we know the file type; but given a file, how do we identify its MIME type?
To determine a file's MIME type, we need to read the file's contents. Each file type is stored differently; a ZIP file, for example, follows a file specification like the one here. But there are still some common features that can be used for identification.
A file signature is a pattern of bytes stored at or near the beginning of a file (also known as a magic number or magic bytes) that identifies the file's content and format. The table below lists the signatures of some popular formats (see more file signatures here).
Hex signature | ISO 8859-1 | Offset | Extension | Description |
---|---|---|---|---|
89 50 4E 47 0D 0A 1A 0A | ‰PNG␍␊␚␊ | 0 | png | Image encoded in the Portable Network Graphics format |
EF BB BF | ï»¿ | 0 | txt | UTF-8 byte order mark, commonly seen in text files |
25 50 44 46 2D | %PDF- | 0 | pdf | PDF document |
66 74 79 70 69 73 6F 6D | ftypisom | 4 | mp4 | ISO Base Media file (MPEG-4) |
37 7A BC AF 27 1C | 7z¼¯'␜ | 0 | 7z | 7-Zip File Format |
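As a rough sketch of how such signatures can be checked, the class below compares the leading bytes of a file against a few rows of the table. The class name MagicNumbers and the MIME type strings chosen for each row are assumptions of this example; the UTF-8 byte order mark is left out because it marks an encoding rather than a specific format.

```java
// Hypothetical helper: matches the first bytes of a file against known signatures.
final class MagicNumbers {
    private static final class Signature {
        final int offset;
        final byte[] bytes;
        final String mimeType;

        Signature(int offset, byte[] bytes, String mimeType) {
            this.offset = offset;
            this.bytes = bytes;
            this.mimeType = mimeType;
        }
    }

    // Signatures and offsets taken from the table above; the MIME names are this sketch's choice.
    private static final Signature[] SIGNATURES = {
        sig(0, "image/png", 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A),
        sig(0, "application/pdf", 0x25, 0x50, 0x44, 0x46, 0x2D),
        sig(4, "video/mp4", 0x66, 0x74, 0x79, 0x70, 0x69, 0x73, 0x6F, 0x6D),
        sig(0, "application/x-7z-compressed", 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C),
    };

    private static Signature sig(int offset, String mimeType, int... values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new Signature(offset, bytes, mimeType);
    }

    // Returns the MIME type of the first matching signature, or a generic binary type.
    static String detect(byte[] header) {
        for (Signature signature : SIGNATURES) {
            if (matches(header, signature)) {
                return signature.mimeType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean matches(byte[] header, Signature signature) {
        if (header.length < signature.offset + signature.bytes.length) {
            return false; // not enough bytes to contain this signature
        }
        for (int i = 0; i < signature.bytes.length; i++) {
            if (header[signature.offset + i] != signature.bytes[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first dozen or so bytes of the upload and pass them to MagicNumbers.detect. A real detector needs far more signatures than this, which is one reason to rely on a library.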
Besides using the file signature, sometimes it's necessary to read more of the file content to find the exact file type. For example, the SVG format is essentially XML. To identify it, reading the magic number only tells us the file is XML; we also need to read further into the content to confirm that it is SVG.
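One way to look past the XML prolog is to parse just far enough to see the root element. The sketch below uses the JDK's StAX parser; the class name SvgSniffer is hypothetical, and treating "root element named svg in the SVG namespace" as the test is this example's own simplification.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: decides whether an XML stream has an <svg> root element.
final class SvgSniffer {
    private static final String SVG_NAMESPACE = "http://www.w3.org/2000/svg";

    static boolean looksLikeSvg(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Uploaded files are untrusted, so DTDs and external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    // The first START_ELEMENT event is the document's root element.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return "svg".equals(reader.getLocalName())
                                && SVG_NAMESPACE.equals(reader.getNamespaceURI());
                    }
                }
                return false;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML, so it cannot be SVG.
            return false;
        }
    }
}
```

Only the root element is inspected here, so the check stops early and never reads the rest of the document.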
Some other formats, such as Apple iWork, are actually a collection of XML files inside a ZIP file. Here the ZIP file acts as a container for the XML files. Identifying the file type becomes harder because the contents must be decompressed first.
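A first step toward inspecting such a container is to list what is inside it. The following sketch uses the JDK's ZipInputStream; the class name ZipContainerInspector is hypothetical, and deciding which entry names indicate a particular format is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the entry names stored in a ZIP container.
final class ZipContainerInspector {
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null once no further ZIP entry header is found.
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

If the stream is not a ZIP file at all, the list comes back empty, so a caller can treat an empty result as "not a container".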
Using Apache Tika to determine MIME Type
On Java systems, Apache Tika can be used to extract information and determine the exact format of a file's data. Apache Tika works out the data format of a file based on several criteria:
- Magic number: the first bytes of the file.
- File name extension: used as a partial hint.
- Metadata of files downloaded from the Internet.
- Container detection: identifying the container and its contents.
To use Tika in a Maven project, add a dependency to pom.xml:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
</dependency>
Now we can write a helper function that determines the real MIME type of a file when it is uploaded to the system.
Add one more endpoint to the backend that uses Tika to detect the MIME type.
@PostMapping(path = "/check-real-type", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
public ResponseEntity<Response> checkRealType(@RequestPart MultipartFile file) {
    String mimeType = FileUtils.getRealMimeType(file);
    return ResponseEntity.ok(new Response(mimeType));
}
After that, update the UI to upload files through the new endpoint and test again with some files.
For real.png, Tika identified the correct MIME type. For fake.zip, Tika correctly identified the file's original MIME type as image/jpeg despite it being renamed to fake.zip.
The list of formats that Tika supports can be found here.
Backend systems should verify the type of every uploaded file. Checking the file type based on the MIME type reported by the browser may not be sufficient, because a file's extension can be changed to trick the system. Each file type has a different structure, so it's possible to determine the exact type of a file from its content, with the help of Apache Tika on Java systems.