How to get non-ASCII characters displayed correctly in logs?

Christoph · 19 October 2018 14:45

Is anybody else having problems with UTF-8 / non-ASCII characters not displaying correctly in duplicacy logs?

I’m suspecting that this is not duplicacy’s fault but has to do with how I process its output in powershell, but I’m not sure.

Here is the script that I use to run the backups and produce the log file:

$backupID = "ALPHA_C" 
$repositorypath = "C:\" 
$backupoptions = "-log backup -vss -vss-timeout 120 -stats -threads 6"
# Construct logfile name for the day
$logfiledate = get-date -format yyyy-MM-dd 
$tmplogfilename = "tmp_$backupID.log" 
$prunelogfilename = "tmp_prune_$backupID-$logfiledate.log"
$logfilename = "backup_$backupID-$logfiledate.log" 
$logfilepath = "c:\duplicacy\logs\" + $logfilename
# Go to repository 
Set-Location -Path $repositorypath >> "C:\duplicacy\logs\$logfilename"
$(get-date -Format "yyyy-MM-dd HH:mm:ss") + " *** Starting new backup of " + $(convert-path $(get-location).Path) + " ***" >> "C:\duplicacy\logs\$logfilename"
# Start backup
Start-Process -FilePath "c:\duplicacy\duplicacy.exe" -ArgumentList "$backupoptions" -RedirectStandardOutput "C:\duplicacy\logs\$tmplogfilename" -wait -NoNewWindow
Get-Content ( JOIN-PATH "C:\duplicacy\logs\" $tmplogfilename ) | Out-File -filePath $logfilepath -Append
$(get-date -Format "yyyy-MM-dd HH:mm:ss") + " *** Backup of " + $(convert-path $(get-location).Path) + " stopped ***" >> "C:\duplicacy\logs\$logfilename"

towerbr · 19 October 2018 16:37

What kind of problem? Could you give an example?

I have no problems with my UTF-8 / Portuguese logs.

Christoph · 19 October 2018 20:57

I don’t have my actual logs at hand but what I mean is simply that certain characters like öéüå etc come up as something like Ã¶ or so.

It’s clearly related to character encoding…

towerbr · 19 October 2018 21:28

What software are you using to open the logs? Logexpert or something similar?

Some programs have an option to set the encoding.

I already had some problems with Notepad++ similar to what you described.

Christoph · 20 October 2018 10:20

That’s what I’m using and given that you can do pretty much anything with it, I tend to blame it on me or another software if notepad++ is not displaying stuff right.

towerbr · 20 October 2018 12:33

I also use Notepad++ for almost everything, but specifically for logs I use LogExpert. Besides not having problems with encoding, it has a useful tail function, which doesn’t work well in Notepad++. You can also highlight error lines in color, for example.

My scripts save Duplicacy logs with .log extension, and I’ve associated this type of file with LogExpert. It works perfectly.

Give it a try:

Christoph · 20 October 2018 12:43

Thanks for the tip. Just tried it and, unfortunately, I’m seeing the exact same garbage as in Notepad++. What does this mean?

towerbr · 20 October 2018 12:51

It could be a software or Windows setting.

In LogExpert, try changing the view setting encoding:

In Windows, see which code page is active. Open a command prompt (I don’t really know where that is in the Win10 GUI, I always do this by command line) and type chcp.

Another point: did this start happening now or was it always like this?

Since it is occurring in more than one program, it is probably something related to your environment / Windows.

Christoph · 20 October 2018 13:15

Tried that (just like I did in Notepad++), even reloaded the file. To no avail.

That gives me:

Active code page: 850

I think it’s always been like that but I had other stuff to worry about before I started to care.

I am only seeing this in my duplicacy logs. And at this point I can confirm that it’s not specific to my script. I just did duplicacy -log list -r 25 -files > list.txt and the resulting text file has the same problems. Both in Notepad and LogExpert.

Edit: It is starting to dawn on me that this might be a much more serious problem than just log-file aesthetics. I just tried to restore one of those files with an umlaut in its name and it can’t seem to be done. I tried both

duplicacy -log restore -stats -ignore-owner -r 25 c/home/christoph/H B├╝rgschaft.doc

and

duplicacy -log restore -stats -ignore-owner -r 25 c/home/christoph/H Bürgschaft.doc

and both give

2018-10-20 15:21:41.827 INFO RESTORE_START Restoring D:\restore to revision 25
2018-10-20 15:21:41.921 INFO RESTORE_END Restored D:\restore to revision 25
2018-10-20 15:21:41.921 INFO RESTORE_STATS Files: 0 total, 0 bytes
2018-10-20 15:21:41.921 INFO RESTORE_STATS Downloaded 0 file, 0 bytes, 0 chunks
2018-10-20 15:21:41.922 INFO RESTORE_STATS Total running time: 00:00:06

towerbr · 20 October 2018 13:25

Same as mine, which in your case (Swedish, right?) might not be the correct one.

Type at the command prompt: chcp 1252

This will change the code page to 1252. See if it corrects the txt view.

Reference:

https://docs.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers

Christoph · 20 October 2018 13:40

Not sure what exactly that will do to my system so I’d prefer not to do it. My system is ok (and I’m seeing the same on my other computer as well). I am not having any issues with code pages or character encodings with other programs, so I’m starting to think that this might be a duplicacy bug (see in particular my edit of my previous post).

If someone has backed up a file with an umlaut in its name, could you run duplicacy -log list -r <revision number> -files > list.txt and look at how the file name is represented in list.txt? If it looks just fine, we need to figure out what the difference is between our systems.

towerbr · 20 October 2018 23:28

I’ve done some tests here and in fact, changing the code page and entering characters like ü using alt key makes no difference.

I remember now that long ago I had problems in Delphi with umlaut characters because they occupy 2 bytes instead of 1, or something like this, but to be honest I don’t remember the details. This may explain the strange restore behavior you described above.

Anyway, we are dealing here with Go, about which I know almost nothing. Maybe @gchen can help.

Edit: Similar problem, another language (Ruby):

gchen · 21 October 2018 02:07

Those file names should be encoded in utf-8 by default. I think you should try two things. First, create a txt file with a unicode character in utf-8 in it and try to open it the same way you open list.txt. Second, check the encoding of the unicode code to see if it is properly encoded in utf-8.

Christoph · 21 October 2018 22:10

Could you send me one? The only way I know to make sure I’m in fact creating a UTF-8 file is by using Notepad++, and since I’m also using Notepad++ to open the logfiles, I’m certain that it will open its own file just fine.

Perhaps this is the more important exercise. I’m not sure if it is. I’m getting different results. When I check in Notepad++, it tells me the encoding is UCS-2 LE BOM:

When I check in notepad, it says “Unicode”

According to this StackOverflow answer, Notepad’s notion of “Unicode” is somewhat inaccurate, so I assume that Notepad++ is right/more accurate. (As you can see from the other answers to that StackOverflow question, it seems to be rather a challenging task to determine the encoding of a file.)

So if the encoding is UCS-2 LE BOM, what does that mean? According to this answer, UCS-2 is essentially UTF-16 (the two are identical for characters in the Basic Multilingual Plane), which means the logfile is not in UTF-8.
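As a rough aid for checking what a file really contains: the encoding label an editor shows usually comes from the byte-order mark (BOM) at the start of the file. The Python sketch below is only an illustration (the thread itself uses PowerShell, and `sniff_bom` / `sniff_file` are hypothetical helper names, not part of duplicacy or any tool mentioned here). It compares the first bytes against the BOM constants in Python's standard `codecs` module. A file without a BOM cannot be identified this way, which is one reason encoding detection is hard.

```python
import codecs

# Known byte-order marks. The UTF-32 LE BOM (FF FE 00 00) starts with the
# UTF-16 LE BOM (FF FE), so the longer marks are checked first. A UTF-16 LE
# file whose first character is NUL would be mislabelled; that edge case is
# ignored in this sketch.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8 with BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # Notepad++ labels this "UCS-2 LE BOM"
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(data: bytes) -> str:
    """Return a label for the BOM at the start of `data`, or 'no BOM'."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return "no BOM"


def sniff_file(path: str) -> str:
    """Read the first four bytes of a file and classify its BOM."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(4))


# Bytes as a UTF-16 LE writer would produce them for "ü": BOM + FC 00.
print(sniff_bom(codecs.BOM_UTF16_LE + "ü".encode("utf-16-le")))  # UTF-16 LE
# Plain UTF-8 without a BOM gives no hint at all.
print(sniff_bom("ü".encode("utf-8")))  # no BOM
```

Calling `sniff_file("list.txt")` on the redirected output would show whether the file starts with FF FE, without relying on any editor's guess.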

So what does this mean? Who is responsible for the file-encoding of that logfile? duplicacy or powershell? If it is powershell then we might know how to fix the log files, but we still have the problem that duplicacy seems to be unable to restore files with umlauts in the file name:

Christoph:

duplicacy -log restore -stats -ignore-owner -r 25 c/home/christoph/H B├╝rgschaft.doc

and

duplicacy -log restore -stats -ignore-owner -r 25 c/home/christoph/H Bürgschaft.doc

and both give

2018-10-20 15:21:41.827 INFO RESTORE_START Restoring D:\restore to revision 25
2018-10-20 15:21:41.921 INFO RESTORE_END Restored D:\restore to revision 25
2018-10-20 15:21:41.921 INFO RESTORE_STATS Files: 0 total, 0 bytes
2018-10-20 15:21:41.921 INFO RESTORE_STATS Downloaded 0 file, 0 bytes, 0 chunks
2018-10-20 15:21:41.922 INFO RESTORE_STATS Total running time: 00:00:06

If duplicacy is responsible for the encoding of the logfile, then this would be a bug, right?

In any case, what I still don’t understand, is: if notepad correctly (?) recognizes the file encoding as UCS-2 LE-BOM, why does it not display the characters correctly?

TheBestPessimist · 22 October 2018 04:36

I cannot upload .txt files (maybe we want to change that) so here’s a Google Drive link to a small file: utf8.txt - Google Drive.

It was created with sublime text and saved as UTF-8 without BOM.

Christoph · 22 October 2018 08:46

Thanks! I can display the file without any problems (quod erat expectandum) and Notepad++ shows its encoding as UTF-8.

So, since you are also producing logfiles via a powershell script, what is their encoding?

Edit:

Okay, I think I located the source of the problem: The Powershell Out-File cmdlet encodes to UCS-2 by default (and, as you can see in my script above, I’m using it to append logfiles created on the same day). When using -RedirectStandardOutput, no re-encoding occurs, so those tmp files are indeed in UTF-8. Only when they are written into the final log-file do things get messed up.

Some more details

If you are wondering why PowerShell behaves in this way, well, you’re not alone. (In PowerShell 6.0, they have changed the default encoding to UTF-8.)

So, the solution is to add -Encoding UTF8 to the Out-File command. I have not tested it yet (will wait for the scheduled backup to run) but I’m confident that this fixes the logfile encoding problem.
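The re-encoding described above can be reproduced outside PowerShell. The Python sketch below is an illustration only, under one assumption that the thread does not confirm: that the BOM-less UTF-8 temp file is decoded with a legacy code page (Windows-1252 is used here as an example) before being written out as UTF-16 LE. Under that assumption the wrong characters are baked into the final log, which would also explain why an editor that correctly detects UCS-2 LE BOM still shows garbage.

```python
import codecs

# Bytes as duplicacy writes them to the temp log: UTF-8, no BOM.
tmp_log_bytes = "Bürgschaft.doc".encode("utf-8")

# ASSUMPTION: the reader decodes the BOM-less file with a legacy code page
# (Windows-1252 here) instead of UTF-8. "ü" is C3 BC in UTF-8, and those two
# bytes are "Ã" and "¼" in Windows-1252.
misread = tmp_log_bytes.decode("cp1252")

# The already-wrong string is then stored as UTF-16 LE with a BOM.
final_log_bytes = codecs.BOM_UTF16_LE + misread.encode("utf-16-le")

# An editor that honours the BOM decodes the file correctly as UTF-16,
# yet still shows the wrong characters, because they are what was stored.
shown = final_log_bytes[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
print(shown == "BÃ¼rgschaft.doc")  # True

# The damage is reversible as long as every byte survived the misread.
repaired = shown.encode("cp1252").decode("utf-8")
print(repaired == "Bürgschaft.doc")  # True
```

The comparisons print `True`/`False` rather than the strings themselves so that the sketch does not depend on the console's own code page.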

What’s still not solved, however, is why I can’t restore that file with an umlaut, as shown previously?

Christoph · 23 October 2018 14:25

Unfortunately, this odyssey wasn’t over yet. It turns out that even when you finally get Out-File to write UTF8, it insists on writing UTF8 with BOM. And I don’t even know if this was actually the main reason why it still didn’t work for me, because I also noticed that the >> redirect also re-encodes things and produces files encoded in UCS-2 LE BOM. And since my logfiles are initially created by such a redirect, it didn’t really help to append UTF8-encoded stuff to them. It looked like this:

Anyway, the easiest way to resolve this seems to be to use Add-Content instead of Out-File (because Add-Content doesn’t re-encode) as well as instead of >>.

I’ll let you know when I have confirmed that it works.
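To illustrate why appending UTF-8 text to a log that >> had already started as UCS-2 LE BOM could not work, here is a minimal Python sketch. It assumes the appended bytes land in the file unchanged; the log lines are made-up examples, not real duplicacy output.

```python
import codecs

# A log file as the `>>` redirect starts it: UTF-16 LE with a BOM.
header = "*** Starting new backup ***\r\n"
log = codecs.BOM_UTF16_LE + header.encode("utf-16-le")

# UTF-8 bytes appended afterwards, without converting them to UTF-16.
log += "Bürgschaft.doc\r\n".encode("utf-8")

# An editor trusts the BOM and decodes the whole file as UTF-16 LE.
# errors="replace" stands in for whatever the editor does with leftover bytes.
body = log[len(codecs.BOM_UTF16_LE):]
text = body.decode("utf-16-le", errors="replace")

print(text.startswith(header))   # True: the UTF-16 part is readable
print("Bürgschaft" in text)      # False: the UTF-8 part is read as byte pairs

# Writing every piece in one encoding avoids the mix.
consistent = (header + "Bürgschaft.doc\r\n").encode("utf-8")
print(consistent.decode("utf-8").endswith("Bürgschaft.doc\r\n"))  # True
```

Whichever encoding is chosen, the point is that every writer touching the same log file has to use the same one.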

ber-ro · 5 June 2023 18:50

I think I suffered the same problem, only that I did not run a PowerShell script but used PowerShell as shell. In the redirected log ‘ü’ (‘C3 BC’ in UTF-8) was replaced by ‘├╝’ (see Unicode Codepoints), probably because PowerShell expected CP850. Later on the wrong characters were transformed to Unicode. I found a solution at Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10) - Stack Overflow and extended $PROFILE:

$OutputEncoding =
[console]::InputEncoding =
[console]::OutputEncoding = New-Object System.Text.UTF8Encoding
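The substitution described above can be checked byte by byte. This Python sketch (an illustration only, not something from the linked answer) decodes the UTF-8 bytes of ‘ü’ with code page 850 and shows that the damage can be reversed as long as the bytes themselves are intact.

```python
# "ü" is two bytes in UTF-8.
utf8_bytes = "ü".encode("utf-8")
print(utf8_bytes.hex(" ").upper())  # C3 BC

# Interpreting those two bytes as CP850 yields two box-drawing characters.
as_cp850 = utf8_bytes.decode("cp850")
print(as_cp850 == "├╝")  # True

# Decoding the same bytes as UTF-8 gives the intended character back.
print(utf8_bytes.decode("utf-8") == "ü")  # True

# Text that was already mangled can be repaired by reversing the wrong step.
repaired = "B├╝rgschaft.doc".encode("cp850").decode("utf-8")
print(repaired == "Bürgschaft.doc")  # True
```

The results are printed as booleans so the sketch runs the same way regardless of which code page the console itself uses.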