Looking for advice with a character set issue

Post by **garretn** » May 18th, 2012, 3:12 pm

Hey all,

It's my first post to these forums, but I've been lurking for a while. I have a rather unusual issue that could almost certainly be considered an edge case, so I'm really just looking to see if anyone can offer advice. I run the latest beta of sabnzbd on a home server running the current Ubuntu Server LTS, but I also use Greyhole, which is Samba-based. My issue is basically the same as the one presented in this forum thread:

Fixed foreign accents on Synology NAS running SABnzbd

My problem is that my download directory is on the normal file system, but my "Move completed downloads to" directory is a mounted Samba share. The files fail to move to the completed download directory because of the accented characters, so the script in the above thread is never run. If I run the script manually after the download fails to move, and then use retry in the interface, it moves fine and completes.

Right now the only solution I can think of is moving the completed directory off the Samba share and then using a post-processing wrapper script that also moves it to the real destination, or a cron job that moves the files. Moving the files around several times obviously isn't ideal, so I'm curious whether anyone has any suggestions or has run into similar problems. Is there a way to run a script on a download _before_ sabnzbd moves the files to their final destination?
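For what it's worth, the wrapper-script workaround could look roughly like the sketch below. It assumes sabnzbd hands the job's final directory to the post-processing script as the first argument; `REAL_DESTINATION` and `move_job` are made-up names for illustration, and any name fixing would have to happen before the move.

```python
import os
import shutil
import sys

# Hypothetical path: wherever the Samba share is mounted on your system.
REAL_DESTINATION = "/mnt/share/complete"


def move_job(job_dir, destination_root):
    """Move a finished job directory under destination_root.

    Refuses to overwrite an existing directory with the same name.
    Returns the new location of the job directory.
    """
    job_name = os.path.basename(os.path.normpath(job_dir))
    target = os.path.join(destination_root, job_name)
    if os.path.exists(target):
        raise FileExistsError(target)
    return shutil.move(job_dir, target)


# Assumption: the first argument is the job's final directory on local disk.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(move_job(sys.argv[1], REAL_DESTINATION))
```

The obvious downside is the one already mentioned: every download gets moved twice, once by sabnzbd and once by the script.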

Post by **shypike** » May 19th, 2012, 1:18 am

Is the way the local file system stores names different from what Samba does?
Is there no way to get them equal?
All files created by unrar have UTF-8 names, so Samba should store them the same.

Post by **garretn** » May 20th, 2012, 9:38 am

Well, it's like in the thread I linked. If the rar archive contains a file whose name is not UTF-8 encoded but has extended characters (usually accents), then the unrar'd file name isn't in UTF-8 either. In the other thread the issue was serving the file, but the problem occurs both ways: mine occurs when sabnzbd attempts to move the unrar'd file into a Samba share. If I "fix" the encoding after it fails to move, and use retry in sabnzbd on the fixed files, it works.

Here's an example NZB:

[Commie] Kore wa Zombie Desu kai¼ of the Dead - 01 [CBFA1695].mkv (300mb)

http://fanzub.com/nzb/184993

If you want to reproduce, set your data dir somewhere on the local filesystem, then the completed files dir to a samba share, and have it do full processing (extract/repair/unpack/etc).

Post by **shypike** » May 20th, 2012, 11:14 am

This specific post does not contain RAR files.
Instead it has the MKV segments as directly downloadable files.
Unfortunately, the poster has messed up.
The characters in the names are either in an unspecified 8-bit codepage
or it's supposed to be UTF-8, but incorrectly encoded.
A shortcoming of the popular yEnc protocol is that it doesn't specify the encoding of file names.
This problem is also why most posters choose to use a RAR wrapper.

On Windows and OSX the same problem occurs, so it's not exclusive to your Synology system.
Also, it has nothing to do with SABnzbd or your system. The poster screwed up.

Post by **garretn** » May 20th, 2012, 12:39 pm

Right, I said right away that it wasn't a sab problem. I was just looking for advice. Good call on the not rar thing, I didn't even notice, thanks!

Post by **garretn** » May 24th, 2012, 5:07 pm

I don't actually use synology like in that other thread, it's just a similar problem. So, knowing there isn't any actual bug (Note that I did state right away that I didn't think it was a bug, and was just asking advice for dealing with it), I don't mind hacking and/or adding a feature myself for dealing with invalid/broken character sets, so we're still at advice if you're willing to put up with me.

Now, the issue again: when a file OR directory whose name uses a non-UTF-8 charset (in my case, ISO-8859-15), and that was either downloaded or created (such as "workdir") on the local filesystem, is moved to a Samba share, the move fails because Samba rejects the name (Samba only supports UTF-8; this is actually a known issue with Samba).
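To make the conversion concrete, here is a minimal sketch of the kind of check I mean. It is not the script from the other thread; `to_utf8_name` is a hypothetical helper, and the ISO-8859-15 fallback is just an assumption that happens to match my files. Since nothing records the real encoding, a name that already decodes as valid UTF-8 is left alone.

```python
def to_utf8_name(raw_name: bytes, fallback: str = "iso-8859-15") -> bytes:
    """Return raw_name as UTF-8 bytes.

    If the bytes already decode as UTF-8 they are returned unchanged.
    Otherwise they are assumed (a guess, not a fact) to be in `fallback`
    and are re-encoded as UTF-8.
    """
    try:
        raw_name.decode("utf-8")
        return raw_name
    except UnicodeDecodeError:
        return raw_name.decode(fallback).encode("utf-8")


# 0xE9 is "é" in ISO-8859-15 but is not valid UTF-8 on its own.
print(to_utf8_name(b"caf\xe9.mkv"))  # b'caf\xc3\xa9.mkv'
```

If the guess about the source charset is wrong, the result is still valid UTF-8 that Samba should accept, just with the wrong characters in it.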

Now, since this is definitely not a sabnzbd bug, we're still talking advice. I don't mind getting my hands dirty, so I poked around the sabnzbd source earlier to see what the best way to deal with this was (excluding not downloading the files, just roll with me here). From a few simple tests I did, there seem to be two places that would need to check/fix the encoding. The first is the directory that sabnzbd creates where it downloads the files (the one that includes __ADMIN__); it uses the name of the download, and the bad encoding appears to carry over into it. The other is the downloaded files themselves.

The second one is easy enough to deal with: just convert the encoding at the end of postproc.py's process_job, after unpacking/repairing but before the files and directory actually get moved. For the first one I'm not as sure where to deal with it; it "seems" like the best place would be wherever sabnzbd sets that variable in the first place, so it would use a corrected version from the get-go. Thoughts? I had it fixing the files and directory at the end of post-processing, but that wasn't working for me because sabnzbd would "lose" the download directory once the name changed (yes, if I had thought about it a little harder, I would've realized that would happen before experimenting with it).

One final note, I realize I talk about fixing things a lot up there and it makes it sound like a bug -- like many Americans, my English and grammar are terrible, so I apologize.

Post by **shypike** » May 25th, 2012, 4:44 pm

The basic problem is that SABnzbd doesn't know what the character set used is.
No amount of clever coding will fix that.
Maybe I'm missing the point, but I don't get it from what you've written so far.

Post by **garretn** » May 25th, 2012, 4:58 pm

Oh, I didn't realize you weren't following. The script in the thread I originally linked to does basic detection (in pure Python) that finds the bad encoding. But for now, let's forget about that entirely.

I was really asking where the appropriate place to "adjust" the directory name is. The source uses a lot of variables and I've not yet found which one I really want; it seems like it would be "workdir", but I suspect it isn't. The working directory for downloads is based on the download name, and quite a few later variables (such as workdir_complete) seem to be based on it too. I'm just looking for where it is initially assigned, assuming the later ones are all derived from that mysterious variable in some way.

Sorry, I suspect if I knew the correct variable to refer to, I wouldn't be asking.

Post by **shypike** » May 26th, 2012, 4:48 am

Looked up ISO-8859-15, it's a minor variant of "Latin-1".
It's going to be quite a problem to implement anything in the current SABnzbd
as it is very much bound to the Latin-1 character set.
It's not just renaming the folder, you also need to rename files.
But you cannot just rename files, because PAR2 will change them back.
Any renaming would have to be done at the very end of post-processing.

With the current SABnzbd it's not really feasible, but the next major release
will be fully Unicode-compliant and then it might be possible to have
a user-controlled setting for the mapping of yEnc-encoded filenames.
(Although par2 could still mess things up, because most versions have the same character set issue).

If you insist on patching now, you can hack away in the block after these lines in postproc.py:


            if all_ok:
                ## Move any (left-over) files to destination
                nzo.status = Status.MOVING

That's where files are moved from "incomplete" to "complete".
It will only work for directly downloaded files, but that should be OK
because files coming from RARs shouldn't have to be renamed.
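A rough sketch of what such a patch might do just before the move is shown below. This is not SABnzbd code: `to_utf8_name` and `rename_tree` are hypothetical helpers, the ISO-8859-15 fallback is only a guess at the poster's charset, and it assumes a file system that accepts arbitrary bytes in names (as most Linux file systems do). The job folder itself is deliberately left alone, since SABnzbd still refers to it by its original name.

```python
import os


def to_utf8_name(raw_name, fallback="iso-8859-15"):
    """Return raw_name (bytes) as UTF-8, guessing `fallback` if it is not."""
    try:
        raw_name.decode("utf-8")
        return raw_name
    except UnicodeDecodeError:
        return raw_name.decode(fallback).encode("utf-8")


def rename_tree(root):
    """Rename every entry below root whose name is not valid UTF-8.

    Walks bottom-up so that a directory is renamed only after everything
    inside it has been handled. root itself is never renamed.
    Returns a list of (old_path, new_path) pairs, as bytes.
    """
    renamed = []
    # Passing bytes makes os.walk yield raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in filenames + dirnames:
            fixed = to_utf8_name(name)
            if fixed == name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, fixed)
            if os.path.exists(new_path):
                continue  # do not overwrite an existing entry
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Calling `rename_tree(workdir)` at that point, where `workdir` stands for whatever variable holds the "incomplete" job folder, would fix the names after PAR2 has finished with them.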
