Preparing a 25-million-word translation project in memoQ
Pleading with progress bars, and TM+ to the rescue!
The task
In true Christmas spirit, I’ve spent the last few days preparing a really enormous translation project in memoQ. That, and sometimes sleeping too. The tasks were these: handle the pre-processing of the files (if needed), figure out the import settings, and get everything imported and analyzed to come up with a word count for quoting.
The product to localize had three main “versions”, and the team needed to provide a quote for the total amount. I’m giving the three products code names here to make sure they aren’t identifiable: “Original” (which is the oldest), “Advanced”, and “Current”. It turned out that Current is not only the latest but also the largest in terms of word count. The overall raw word count was 25 million for two target languages, i.e. 12.5 million source words.
The idea was to simulate translating the first project with one subproduct in it (Current), create a TM out of that (as if it had all been translated before starting the others), and run stats with that simulated TM on the second project (Advanced). After that, simulate translating Advanced as well, and again create a “fake” TM of Advanced. Finally, run stats on the third project (Original) with the simulated TM(s) containing all the “fake” translations from both Current and Advanced. We assumed there would be a lot of repetition among the three products, so we were preparing to translate them sequentially anyway.
The long way to TM+
I didn’t have much trouble analyzing Current, and then Advanced using the TM I built from Current. This first TM contained 1M words. I didn’t take measurements, but it all went quite fine until the end, when I was trying to get Original analyzed. I can’t tell why, but performance plummeted when I used the fake translations from both previous projects to analyze the third one. I first tried with two separate TMs (Advanced and Current), and it was way too slow: it would have taken days, maybe many, to finish. I cancelled it and thought I’d try combining the two TMs to see if having a single larger TM helped at all. It didn’t help in the slightest, and even combining them (importing the Advanced TM into Current via TMX) took 10 hours. I suspect it might have something to do with having lots and lots of short segments that are often very similar.
Adding the entries from Advanced to the Current TM (growing it from 1M entries to 1.3M entries) made pre-translation performance 4 times slower on Original. (Yes, this is not a mistake: when the TM only contained the entries from Current, it was several times faster, and when the relatively small number of new entries from Advanced was added, it became extremely slow, despite the TM only growing by maybe 30%.) And it was already quite slow before that, so it slipped into “properly unmanageable” territory. My frustration was beginning to mount. I kept taking screenshots of progress bars and measuring their pixel length to try and figure out how many days it would take for statistics to finish.
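In case you wonder what that desperate exercise looks like, it is nothing more than a linear extrapolation from two screenshots. Here is a minimal sketch in Python; the pixel numbers are invented for illustration, not the actual measurements from this project.

```python
# Hypothetical example: extrapolating a statistics ETA from two progress bar
# screenshots. Just desperate arithmetic, not anything memoQ does internally.

def estimate_remaining_hours(pixels_before: float, pixels_after: float,
                             bar_width_px: float, hours_between: float) -> float:
    """Return the estimated hours left, assuming progress continues at the same rate."""
    done_before = pixels_before / bar_width_px   # completed fraction at the first screenshot
    done_after = pixels_after / bar_width_px     # completed fraction at the second screenshot
    rate_per_hour = (done_after - done_before) / hours_between
    if rate_per_hour <= 0:
        raise ValueError("The bar has not moved; no estimate is possible.")
    return (1.0 - done_after) / rate_per_hour

# Invented numbers: the bar grew from 14 px to 20 px out of 600 px in 3 hours,
# which extrapolates to roughly 290 hours (about 12 days) still to go.
print(estimate_remaining_hours(14, 20, 600, 3))
```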
I cancelled the statistics operation again after several hours, and tried putting everything on a RAM disk, thinking that there were many tiny disk reads and writes happening and that storage performance could be a bottleneck. The RAM disk “upgrade” didn’t help enough with performance either. It may have made statistics a few percent faster, but I needed something much more dramatic. Available RAM was not a problem. As a test, I throttled my CPU (to make it significantly slower), then tried again with a sample file to confirm that the main bottleneck was most probably single-threaded CPU performance. Now, there’s not much I can do about single-threaded performance. You can’t even really throw money at the problem: I could go out and buy a crazy high-end PC, but that would have given me maybe a 100% speed boost at best, which probably still wouldn’t have been enough to prepare the project in time. And if you rent a VM in Azure or similar, you don’t get much better single-threaded performance than an average work laptop’s.
I didn’t have any better ideas, so I decided to cave in and try TM+. I had been quite cautious with TM+ so far, because I looked at it as new technology that maybe hadn’t seen enough action in the wild for all the rough edges to be 100% polished off. I also hadn’t encountered performance dead-ends that would have absolutely forced me to try TM+. But this time I had no choice. I used the latest client, with everything (TM and project) local. The combined TM got upgraded to TM+ rather quickly, in maybe 15 minutes. In this specific case, statistics got 9 times faster with TM+ than with the legacy TM format, with everything else unchanged. (At least according to my little benchmark, which I repeated many times by running statistics on one of the largest files in the project.)
Nine times faster. This is crazy. One thing to double-check: everything was still on the RAM disk, which is probably not realistic for normal operation. (Note that with the legacy TMs, storing everything on the RAM disk didn’t change performance significantly.) When I have a little time to breathe, I’ll try my benchmark again with the (TM+) TM and the local project stored on the SSD instead of the RAM disk. Another thing I’ve learned: judging from the CPU utilization, running statistics with TM+ is not entirely single-threaded like with the legacy TM format: I would guess from the numbers that it probably used two threads.
Copy source to target on many millions of words?
I mentioned above that we decided to “simulate” translating the products one after the other, to see what kind of TM leverage we would get from translating one of the products when we go to the next, and, again, what leverage we would get from the first two when we get to the third one. You’d do this by confirming the segments into a temporary TM, and then using that temporary TM to run statistics on the next project. But memoQ doesn’t let you create TM entries with empty translations, and I assumed that “copy source to target” on almost 20 million words would take forever, so I didn’t even try it. Instead, while I was waiting for the statistics of the first project to finish, I developed a small tool that would do the same at the mqxliff level. Wherever the <target> element was empty, my tool replaced the empty target segment with just a lonely letter A, and wherever the status was “Not started”, it changed it to “Edited”. I already had the mqxliff files exported, because the files (all 25 million words) had been imported into one jumbo project, which I had to split into 3 projects by exporting the mqxliff files. (Thankfully, the splitting was trivial based on subfolders and file names.) The mqxliff files then had to be imported back into the projects to update the documents with the changes. I have no idea whether this was really faster or slower than using the built-in “copy source to target” command; I never tested that.
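For the curious, the core of such a tool fits in a few lines. Below is a rough Python sketch of the idea; the memoQ namespace URI, the status attribute name and its values are my assumptions about the mqxliff format, not details taken from the actual tool, so treat it as an illustration only.

```python
# Sketch of the "fake translation" helper described above. The namespace and
# status attribute details are assumptions, not verified against the mqxliff spec.
import xml.etree.ElementTree as ET
from pathlib import Path

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
MQ_NS = "MQXliff"  # assumed namespace of the memoQ XLIFF extensions

def fake_translate(path: Path) -> None:
    ET.register_namespace("", XLIFF_NS)
    ET.register_namespace("mq", MQ_NS)
    tree = ET.parse(path)
    for unit in tree.iter(f"{{{XLIFF_NS}}}trans-unit"):
        target = unit.find(f"{{{XLIFF_NS}}}target")
        # only touch targets that are truly empty (no text, no inline tags)
        if target is not None and not (target.text or "").strip() and len(target) == 0:
            target.text = "A"  # a lonely letter A stands in for the translation
        # assumed attribute name and values for the segment status
        if unit.get(f"{{{MQ_NS}}}status") == "NotStarted":
            unit.set(f"{{{MQ_NS}}}status", "Edited")
    tree.write(path, encoding="utf-8", xml_declaration=True)

for mqxliff in Path("exported_mqxliff").rglob("*.mqxliff"):  # placeholder folder name
    fake_translate(mqxliff)
```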
Having millions of words is one problem, a million files is worse
Another challenge with the project was that one of the file types was HTML, and there were about 150k of them, most of them containing just a couple of words. Why is this a problem? Because having a million small files is generally a source of performance issues. And I’m not just jokingly rounding up to a million files here: for each source file you put into a project, memoQ produces several additional files for the preview, the skeleton, the version data, and whatnot, so 150,000 source documents multiplied by a handful of derived files each quickly gets you into that range. It would literally be around a million files in the server’s file system, or on a PM’s machine if they checked out the complete project. When you check out a memoQ project, all those files need to be zipped up by the server for sending, downloaded by the client, unzipped into the local project folder, and so on. All these operations always perform way worse for thousands of tiny files than for a few hundred much larger ones.
From very recent experience, we knew that even with just 5000 documents in a project, operations like project synchronization can become painfully slow. 150k documents was not an option at all, so I developed a small software tool to concatenate several HTML files into one. The logic was simply this: the tool kept appending HTML files to the current combined file, and started a new combined file once the total size reached a minimum amount. I decided to go with 50 kB for this “minimum” file size of the joined HTMLs. I thought that joining the files would make it more difficult for project members to locate content in the merged files, so I constructed the joined file names like this: “name_of_first_file_joined---name_of_last_file_joined.html”. Of course, before delivery, we’ll need to split the files back up again. To make that possible, the tool added an HTML comment containing the name of each “part” file before writing it into the combined file. Such joined HTML files are not valid HTML anymore, but memoQ had no issue processing them.
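A sketch of the joining logic in Python is below. The folder names and the exact marker comment format are placeholders of my own; the 50 kB threshold and the first---last naming scheme come from the description above.

```python
# Sketch of the HTML-joining logic described above. Folder names and the marker
# comment format are placeholders; sources are assumed to be UTF-8.
from pathlib import Path

MIN_JOINED_SIZE = 50 * 1024  # keep appending files until the joined size reaches 50 kB

def join_html_files(source_dir: Path, output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    batch: list[Path] = []
    batch_size = 0

    def flush() -> None:
        nonlocal batch, batch_size
        if not batch:
            return
        # name the joined file after its first and last part so content stays findable
        joined_name = f"{batch[0].stem}---{batch[-1].stem}.html"
        with open(output_dir / joined_name, "w", encoding="utf-8") as out:
            for part in batch:
                # marker comment so the combined file can be split up again before delivery
                out.write(f"<!-- part file: {part.name} -->\n")
                out.write(part.read_text(encoding="utf-8"))
                out.write("\n")
        batch, batch_size = [], 0

    for html in sorted(source_dir.rglob("*.html")):
        batch.append(html)
        batch_size += html.stat().st_size
        if batch_size >= MIN_JOINED_SIZE:
            flush()
    flush()  # whatever is left over becomes the last joined file

join_html_files(Path("html_source"), Path("html_joined"))
```

Splitting the translated files back up before delivery is then just a matter of scanning each combined file for the marker comments and writing every part back out under its original name.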
Repetition injury
I used a local project to import and run statistics, because I was somewhat scared of causing performance problems for production users of the memoQ TMS (memoQ server) by processing this many files and words on the server. When I started, I also wasn’t aware of the final project size, and for all I knew, it could have been way more words. One thing that I didn’t enjoy about preparing the files locally was repetition detection. When I imported the files, memoQ spent several hours “detecting repetitions”. Then, when I updated the project from the mqxliff files to fill the documents with “fake translations”, it again spent five hours detecting repetitions. And, of course, when I ran statistics, it needed to detect repetitions yet again. I have no problem accepting that if you have 25 million words, the time it takes to detect repetitions explodes. However, when working with word counts like this, it would be awesome to be able to tell memoQ not to bother detecting repetitions during file imports and mqxliff imports or updates. I really only needed repetition detection once: when I ran statistics.
When you import files into an online project, repetition detection is not performed as part of the file import, so that is a benefit of server-side project preparation. I’m not sure, but probably when you check the project out, the client performs repetition detection for the files you are checking out, which, for translators, is just a tiny fraction of the whole project. For auto-propagation (and maybe segment filtering and possibly other features I can’t think of right now) to work across files during translation, memoQ needs to know which rows are repetitions of which other rows.
Investing in hardware
How should you spec the PCs of people who need to deal with such crazy project sizes? I do not have a monster machine. I have a pretty average work laptop, but I paid attention to certain details when buying it. One thing I always tell everybody is to be generous with RAM: if you want a laptop, then identify a machine that has upgradable RAM, or has enough RAM pre-installed. Nowadays, I would say average Windows users already need 16 GB, even if some IT departments think they can get away with just 8 GB. (Even Apple has recently decided that 8 GB is no longer enough, not even for its most basic machines.) If you are doing anything serious with professional software, go with 32 GB. If crazy things tend to happen to you, go even larger. My laptop has 40 GB installed. When RAM is low, things get extremely painful: your machine starts to “swap”, meaning it can’t fit new data into RAM anymore, so it temporarily writes it out to disk. And the disk, even if it is a fast SSD, is many times slower than RAM. My laptop is a Lenovo IdeaPad 3 with an AMD Ryzen 7 5700U processor, which is nothing crazy, but it has 8 cores, performs decently on each, and doesn’t get too hot or too loud. The IdeaPad 3 is cheap and has pleasant surprises: it actually outperforms the promises of the spec sheet in two key areas. It has HDMI 2.0 instead of 1.4b, so it has no problem driving a higher-resolution external monitor (but you only get this bonus if the CPU you chose is an AMD!), and I could install 40 GB of RAM, despite the specs saying 32 GB max or even less. However, I did have to swap out the factory-installed WiFi module, because it was unreliable. The IdeaPad 3 (or another laptop with similar specs) is a nice machine for PMing and engineering, even development.
If I didn’t care about money, power consumption, weight, size, noise, heat, and so on, I could probably buy something crazy with up to twice the single-threaded CPU performance. (And single-threaded performance is still the main bottleneck for many things in memoQ and other software.) If I spent all my time preparing projects like this one, I could consider it. But I don’t think many of us juggle dozens of millions of words on a daily basis. Also keep in mind that making the right choices (for example, using TM+ here for analysis) had a much bigger impact on performance than any hardware investment could have had. As for storage, by all means get an SSD (if that’s even still a question nowadays), but, again, I don’t think a top-of-the-line SSD gives much of an edge over a decent one when it comes to memoQ performance. When I was struggling to make statistics faster for this project, even using a RAM disk didn’t make much of a difference, and a RAM disk uses part of your RAM as if it were a disk, making all disk reads and writes way faster than on a normal SSD.
I could also go with a more recent laptop in the same “thin and light but still relatively performant” category, but I concluded that the generational improvements of the last few years are not yet compelling enough to upgrade. If my laptop broke today, I might buy another IdeaPad 3 if I could still find one, because a newer laptop that is 30% more performant is maybe 200% more expensive. (I haven’t checked the exact numbers recently.)