As part of my work here at Admios, I have large files that need to pass through an API. To do this, I split the files into smaller sections. In this post, I’m going to share my process with you.
The first thing I do is calculate the number of parts or pieces that I need to create. I base this on the maximum allowed file size.
So, I have two parameters: the original file and the maximum size of each piece. It's quite simple: divide the original file size by the maximum allowed size, rounding up so that any leftover bytes get a piece of their own. Pretty straightforward, right?
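As a small illustration of that division, here is a sketch in Java. The class name PartCounter and the method countParts are made up for this example; the only point is that the division has to round up.

```java
// Hypothetical helper (names are illustrative, not from the original splitter).
class PartCounter {

    /** Returns how many pieces are needed so that none exceeds maxPartSize bytes. */
    static long countParts(long fileSize, long maxPartSize) {
        if (maxPartSize <= 0) {
            throw new IllegalArgumentException("maxPartSize must be positive");
        }
        // Ceiling division: a trailing remainder still needs its own piece.
        return (fileSize + maxPartSize - 1) / maxPartSize;
    }

    public static void main(String[] args) {
        // The 5,376,008-byte test file split into pieces of at most 1,000,000 bytes.
        System.out.println(countParts(5_376_008L, 1_000_000L)); // prints 6
    }
}
```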
But it's actually a little more difficult. I have to load the file content into a byte array and then track an index into that array. In the first run I copy the first bytes to a new file, then I move the index to the byte after the last one I copied, copy the next chunk, and repeat until I have read the entire file and divided it into smaller sections.
So, after checking some examples, I end up with something like this:
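What follows is a minimal sketch of that approach, not the original listing. The class name InMemorySplitter, the method split, and the ".partN" file naming are assumptions made for illustration, and the code is written for a current JDK (it uses try-with-resources) rather than for Java 1.0.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative sketch: loads the whole file into memory, then cuts it into pieces.
class InMemorySplitter {

    /** Writes source.part0, source.part1, ... and returns how many parts were created. */
    static int split(File source, int maxPartSize) throws IOException {
        if (maxPartSize <= 0) {
            throw new IllegalArgumentException("maxPartSize must be positive");
        }
        byte[] content;
        try (RandomAccessFile file = new RandomAccessFile(source, "r")) {
            if (file.length() > Integer.MAX_VALUE) {
                // A Java array is indexed by int, so this approach cannot go further.
                throw new IOException("file too large to load into a byte array");
            }
            content = new byte[(int) file.length()];
            file.readFully(content); // the entire file now lives on the heap
        }

        int parts = 0;
        int index = 0; // first byte that has not been copied yet
        while (index < content.length) {
            int length = Math.min(maxPartSize, content.length - index);
            try (FileOutputStream out =
                     new FileOutputStream(source.getPath() + ".part" + parts)) {
                out.write(content, index, length);
            }
            index += length; // move to the byte after the last one copied
            parts++;
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        int parts = split(new File(args[0]), Integer.parseInt(args[1]));
        System.out.println("Created " + parts + " part(s)");
    }
}
```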
The problem with this approach is that we end up loading the entire file into the JVM. If I try to split a 2GB file, all of it is loaded into memory. This is OK for small files; in fact, it will work in all Java versions, JDK 1.0 and above. This is my first attempt and it is compiled with Java 1.0. It does not use any dependencies or third-party libraries.
But memory is a big concern, since the only reason to split a file is its size. Looking for another approach, I found many examples, some using BufferedOutputStream and FileOutputStream, and some that only work on Java 5 or Java 7.
To show how Java has evolved from Java 1.2 to 1.7, I will make a second version that is compatible with Java 1.4 (note that Version 1.0 of the splitter above is compatible with Java 1.2).
These examples (Versions 1 and 2) were made using Java 1.4.
For this second version, I don't load the entire file. A small part is read into a buffer and written to a new file, and then the splitter moves on to the next file. The key is that 'maxReadBufferSize' should be small. It will take more time, but the application will use less memory.
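A sketch of how such a buffered version could look is shown here. Only the parameter name maxReadBufferSize comes from the text above; the class name BufferedSplitter, the method signature, and the ".partN" naming are assumptions, and the code uses try-with-resources, so it is not Java 1.4 source as written.

```java
import java.io.BufferedOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative sketch: only maxReadBufferSize bytes are held in memory at a time.
class BufferedSplitter {

    /** Writes source.part0, source.part1, ... and returns how many parts were created. */
    static int split(File source, long maxPartSize, int maxReadBufferSize) throws IOException {
        if (maxPartSize <= 0 || maxReadBufferSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        byte[] buffer = new byte[maxReadBufferSize];
        long remaining = source.length(); // bytes of the source not yet copied
        int parts = 0;

        try (InputStream in = new FileInputStream(source)) {
            while (remaining > 0) {
                long partRemaining = Math.min(maxPartSize, remaining);
                try (OutputStream out = new BufferedOutputStream(
                         new FileOutputStream(source.getPath() + ".part" + parts))) {
                    while (partRemaining > 0) {
                        // Never read past the end of the current part.
                        int toRead = (int) Math.min(buffer.length, partRemaining);
                        int read = in.read(buffer, 0, toRead);
                        if (read < 0) {
                            throw new EOFException("file shrank while it was being split");
                        }
                        out.write(buffer, 0, read);
                        partRemaining -= read;
                        remaining -= read;
                    }
                }
                parts++;
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Example: 10 MB parts, copied through an 8 KB buffer.
        int parts = split(new File(args[0]), 10L * 1024 * 1024, 8 * 1024);
        System.out.println("Created " + parts + " part(s)");
    }
}
```

A smaller buffer means more read and write calls for the same amount of data, which is the time-for-memory trade described above.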
Java 7 comes with a cool package, java.nio.file.*, that offers a simpler way to read and write files. I also used channels to replace BufferedOutputStream. All of this comes with more efficient use of memory.
Instead of using InputStream and OutputStream, we will use channels. Think of a channel as a file pointer: we set the pointer to where we want to start reading or writing. In this case, we start by reading from the original file. Channels are not new in the Java SDK. They have been there since Java 1.4, but they have been used more since Java 7.
Check out the official Java documentation referring to Channels.
Using the same RandomAccessFile we get the file channel and set the starting position to zero in the first run.
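A minimal sketch of that step is shown here, assuming a hypothetical class SourceChannelDemo with a describe method that only reports what the channel sees. RandomAccessFile.getChannel(), FileChannel.position(long), and FileChannel.size() are standard JDK calls.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Illustrative sketch: obtain the source channel and point it at the first byte.
class SourceChannelDemo {

    /** Opens fileName read-only and reports the channel's position and size. */
    static String describe(String fileName) throws IOException {
        try (RandomAccessFile sourceFile = new RandomAccessFile(fileName, "r");
             FileChannel sourceChannel = sourceFile.getChannel()) {
            sourceChannel.position(0); // first run: start at the beginning of the file
            return "position=" + sourceChannel.position()
                    + ", size=" + sourceChannel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(describe(args[0]));
    }
}
```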
I am using the same logic to calculate the size of the parts that is shown in Version 1. The difference is that now we don't load any of the file into the JVM; we just use the underlying host system.
And for the output file we do this:
OK, everything is ready and set to go. Now we just need to copy the bytes we want using transferFrom():
This just says: copy from the source channel and put the bytes at 'position' in the output file, with a size of 'count'.
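Put together, the channel-based copy could look like the sketch below. The names position and count follow the description above; the class name ChannelSplitter, the method split, and the ".partN" naming are assumptions. FileChannel.transferFrom(src, position, count) is the real JDK method, and because it may transfer fewer bytes than requested, the sketch calls it in a loop.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Illustrative sketch: the file content never passes through a Java byte array.
class ChannelSplitter {

    /** Writes fileName.part0, fileName.part1, ... and returns how many parts were created. */
    static int split(String fileName, long maxPartSize) throws IOException {
        if (maxPartSize <= 0) {
            throw new IllegalArgumentException("maxPartSize must be positive");
        }
        int parts = 0;
        try (RandomAccessFile sourceFile = new RandomAccessFile(fileName, "r");
             FileChannel sourceChannel = sourceFile.getChannel()) {
            long remaining = sourceChannel.size();
            while (remaining > 0) {
                long count = Math.min(maxPartSize, remaining); // size of this part
                try (RandomAccessFile partFile =
                         new RandomAccessFile(fileName + ".part" + parts, "rw");
                     FileChannel partChannel = partFile.getChannel()) {
                    partFile.setLength(0); // discard leftovers from an earlier run
                    long position = 0;     // write offset inside the part file
                    while (position < count) {
                        // Reads from the source channel's current position, which
                        // advances by the number of bytes actually transferred.
                        long moved = partChannel.transferFrom(
                                sourceChannel, position, count - position);
                        if (moved <= 0) {
                            throw new IOException("unexpected end of " + fileName);
                        }
                        position += moved;
                    }
                }
                remaining -= count;
                parts++;
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        int parts = split(args[0], Long.parseLong(args[1]));
        System.out.println("Created " + parts + " part(s)");
    }
}
```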
And that's it. Files are created.
Java 7 provides some useful methods in java.nio.file.Files and java.nio.file.Paths, like Files.size() and Paths.get(fileName), that make the entire code more readable.
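To illustrate the readability point, here is a short comparison of reading a file's size both ways. The class SizeLookup and its two method names are hypothetical; Files.size(Path), Paths.get(String), and RandomAccessFile.length() are the real JDK calls.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch: two ways to ask for the same number.
class SizeLookup {

    /** Java 7 style: a single expression, nothing to open or close. */
    static long sizeWithNio(String fileName) throws IOException {
        return Files.size(Paths.get(fileName));
    }

    /** Older style: open the file just to read its length, then close it. */
    static long sizeWithRandomAccessFile(String fileName) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(fileName, "r")) {
            return file.length();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sizeWithNio(args[0]));
        System.out.println(sizeWithRandomAccessFile(args[0]));
    }
}
```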
For the test, we used two files: a small 5,376,008-byte (5.4 MB) file and a 1,077,622,793-byte (1.08 GB) file. We ran each version twice with each file. The following charts use a logarithmic scale (time is displayed in seconds and memory in megabytes).
As we can see, the memory consumption is greatly reduced. As for time, it is almost the same for every version. For small files, the time increases, but this should not concern us because it is very unlikely that we would need to divide a small 10 or 20 MB file.
In case we need to do maintenance or implement this into an old system, Java 1.4 is a good option since channels have been available since then (only java.nio.file.Files and java.nio.file.Paths are from Java 7 and you can probably implement a similar solution for an old legacy system).
I hope these examples help you split your own files, whether you adapt them or just copy and use them.