I noticed the max-retention feature and had some thoughts on it. I don't know whether this will help, but I thought I'd offer some suggestions on possible heuristic methods. I saw there was a possible max retention time that would have to be found. I think it would be good to track a minimum retention time as well, being the lowest retention time confirmed for that server. This has the added benefit of distinguishing between knowing for sure that something can be found on a server and it merely being possible that something can be found there. Only in the second case would you need to run the heuristic for finding the server's retention, so it might have performance benefits.
Another reason for the minimum retention time is that many Usenet providers are continuously growing their retention, most at a rate of one day per day. So even if an accurate retention value has been found, it might be incorrect the very next day, which is why it should only be treated as a minimum. This also means the max retention would have to be incremented daily, since it might otherwise drop below the server's actual retention. That might require some kind of maximum deviation from its set value, for example no more than one year above it. This would only become an issue if the server is not used very often; should an NZB be loaded that is actually older than the known max retention of every server, you could simply try downloading it with the server that has the highest max retention (and the second-highest after that, if it fails) and then update their minimum (or maximum, on failure) retention values accordingly.
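Just to make the bookkeeping concrete, here is a rough sketch in Python of the min/max idea described above. The class and method names are made up purely for illustration and are not part of SABnzbd:

    from datetime import date

    class ServerRetention:
        """Hypothetical per-server retention bookkeeping; not actual SABnzbd code."""

        MAX_DRIFT_DAYS = 365   # never assume more than ~1 year of growth past the measured max

        def __init__(self, grows_daily=True):
            self.min_days = 0          # articles this old (or newer) were definitely found
            self.max_days = None       # articles older than this (when measured) were missing
            self.max_set_on = date.today()
            self.grows_daily = grows_daily

        def _current_max(self):
            """The assumed max today, grown one day per day since it was measured (capped)."""
            if self.max_days is None or not self.grows_daily:
                return self.max_days
            drift = min((date.today() - self.max_set_on).days, self.MAX_DRIFT_DAYS)
            return self.max_days + drift

        def note_success(self, age_days):
            """An article of this age was downloaded: raise the known minimum."""
            self.min_days = max(self.min_days, age_days)

        def note_failure(self, age_days):
            """An article of this age was missing: lower the assumed maximum."""
            current = self._current_max()
            if current is None or age_days < current:
                self.max_days = age_days
                self.max_set_on = date.today()

        def surely_has(self, age_days):
            """True when we know for sure this server still carries articles this old."""
            return age_days <= self.min_days

        def might_have(self, age_days):
            """True when the article could still be within retention, so the heuristic is needed."""
            current = self._current_max()
            return current is None or age_days <= current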
As for the heuristic method itself, you could check the first article (or the first three, just to make sure) to see whether it failed with a "no article found" error (NNTP response code 423 or 430), since this would indicate that the articles this NZB is composed of are simply not available on this server (as opposed to being available but corrupted, incomplete, or having failed during transfer), which would most likely mean they are outside the server's retention. If it is indeed out of retention, then the entire NZB would be out of retention and would need to be downloaded by the backup server. If the backup server does download it successfully, you can set this retention as a maximum for the primary server and a minimum for the backup server, checking against their previous values of course. Checking the first three articles instead of just the first one makes sure it is not just a random problem with the first article. The number of articles is arbitrary, of course, and can be changed to any appropriate amount.
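Purely to illustrate that first check, something along these lines could decide whether a whole NZB looks out of retention; the fetch_article() helper is hypothetical and only stands in for "request this article and give me the NNTP status code":

    NO_SUCH_ARTICLE = (423, 430)   # NNTP "no such article" responses
    PROBE_COUNT = 3                # arbitrary, as noted above

    def looks_out_of_retention(server, articles):
        """True when the first few articles are all reported missing by the server,
        suggesting the whole NZB is past this server's retention."""
        for article in articles[:PROBE_COUNT]:
            code = server.fetch_article(article)   # hypothetical helper
            if code not in NO_SUCH_ARTICLE:
                return False   # at least one article exists (or failed for another reason)
        return True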
If the first article (or one of the first three) is available, it should continue downloading with the primary server, sending articles to the backup server as needed. For this use case, I would like to suggest something. If the number of articles that have been sent to the backup server (and successfully downloaded there) exceeds the amount of parity provided in the NZB (only if parity has been provided), or 10% of the total download size (a commonly used amount of parity), then all remaining articles should be sent to the backup server. The reason for using the parity as a guideline is that this would indicate the NZB could not have been retrieved from the primary server alone, as the amount of missing data exceeds the amount of parity data, and therefore should not have been downloaded from that server at all. Alternatively, you could use a guideline of about 1% missing articles, because most (premium) Usenet providers advertise 99+% completion; if less than 99% can be found, that would suggest the post is not complete. This is not very accurate though, as that completion figure refers to all data in the newsgroups, so a single post could still have a completion significantly below 99%. The count should only include completely missing articles, not the missing data from corrupt or partially missing articles, since only the former indicate low retention. As a side note, the check of the first few articles described in the previous paragraph could be skipped, and only the total number of failed articles checked as described here; this would just make it take a bit longer for the entire NZB to be transferred to the backup server. The first method is really just an optimization for checking the retention, with this second method as a fall-back for cases where some (but not enough) articles still remain on the server.
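To make that rule concrete, the switch-over decision could look roughly like this; again just a sketch with made-up names, not SABnzbd code:

    def should_move_rest_to_backup(missing_on_primary, total_articles, parity_articles=None):
        """missing_on_primary counts only completely missing articles, not corrupt ones.
        Switch entirely to the backup once more data is missing than the parity can repair,
        falling back to the commonly used 10% figure when the NZB carries no parity."""
        if parity_articles is not None:
            threshold = parity_articles
        else:
            threshold = int(total_articles * 0.10)
        return missing_on_primary > threshold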
Allowing users to define the retention for each server does seem more accurate. However, it might be good to also give them the option of letting it grow one day at a time, as many servers are currently growing their retention this way.
I don't know whether any of this is actually technically possible in SABnzbd, or whether some (or all) of it has already been implemented; if so, my apologies. I also don't know whether it is appreciated to start a topic on an existing feature, but I noticed that the last change to this feature request was almost two years ago, so I hope this is of use to someone working on SABnzbd.
[Feature #43] Max-retention heuristics suggestion
Re: [Feature #43] Max-retention heuristics suggestion
Thanks for your novel-sized spec!
There's a reason this is not implemented.
It's bloody complicated to do, and even worse to test in real-life scenarios.
One of the reasons it's complicated is that a lot is downloaded in parallel.
So when you say "the first three articles", you should realize that a whole bunch more
might already have been requested from the server.
Also, it caters to a limited audience, namely those who have no quality paid Usenet subscription.
At this moment it doesn't have priority, although we are open to people wanting to help implement it.
Re: [Feature #43] Max-retention heuristics suggestion
I can understand that it could be complicated. Regarding the reason you give, though: even if things are downloaded simultaneously, at some point you get a completed or failed result for each request. So even if they are not literally the first three articles of the post, you can still check whether three articles fail before any article completes correctly, I think. Similarly, you should be able to count the failed articles for the entire NZB, which is what the second method needs. But obviously, there are more complexities involved.
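Roughly what I mean, order-independent so it doesn't matter which parallel request happens to finish first; this is illustrative only, not a proposal for how SABnzbd's downloader should actually be structured:

    class EarlyRetentionCheck:
        """Decide 'out of retention' as soon as a few misses are seen before any hit."""
        MISS_LIMIT = 3   # arbitrary, as before

        def __init__(self):
            self.misses = 0
            self.out_of_retention = None   # None = undecided

        def report(self, article_found):
            """Call from every finished article request, in whatever order they complete."""
            if self.out_of_retention is not None:
                return self.out_of_retention
            if article_found:
                self.out_of_retention = False          # something is there, keep the primary
            else:
                self.misses += 1
                if self.misses >= self.MISS_LIMIT:
                    self.out_of_retention = True       # several misses before any hit
            return self.out_of_retention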
The audience also includes people who have a high-quality Usenet provider plus an account (temporary or block) with higher retention but other drawbacks, such as price or performance, that the primary provider doesn't have. Though this audience might still be limited, I do think this feature would benefit many people, if only by giving them the opportunity to get the best out of different Usenet providers. (I have ~350 days of retention on my primary server and ~700 days on a secondary, more expensive, account that I share with a friend of mine, which is why I can't use that as my primary server account.)
I was hoping that a description of a possible heuristic method might make this easier to implement, though I can understand that it doesn't have a high priority. Perhaps the manual retention setting for each server could be implemented first, together with the code for respecting that retention value. That way the feature gets broken up into separately (and more easily) implementable pieces.
I hope you didn't mind this mini-sequel to my novel.
Re: [Feature #43] Max-retention heuristics suggestion
Manual retention setting might be doable.
Past experience has taught us that providers do not have hard retention limits.
Sometimes older jobs remain while younger ones are already gone.
So neither a manual nor a heuristic setting will give an optimal result.
Then there's the additional issue of DMCA notices going to the major providers
but not to the little ones, skewing the results even more.
BTW: the current method should work OK with your setup,
with the cheap one as the primary and the expensive one as the backup.
This has the advantage that the expensive server will only
be used when really needed. You cannot be sure of that with alternative methods.
Re: [Feature #43] Max-retention heuristics suggestion
I hadn't considered those arguments, especially the DMCA notices. I think the manual retention setting, though not perfect, would be the only real way to set a server's retention, mostly because Usenet providers do give a very specific retention limit for binary posts. The "soft" limit would always be beyond that and doesn't guarantee completion, but I suppose an override for a server's retention setting would let you try it anyway.
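For instance, respecting a manual per-server retention value with such an override could be as simple as something like this; server.retention_days and the override flag are made-up names, not existing SABnzbd settings:

    from datetime import date

    def worth_trying(server, post_date, force_beyond_retention=False):
        """Skip a server whose configured retention is smaller than the post's age,
        unless the user explicitly asks to try it anyway."""
        if server.retention_days is None or force_beyond_retention:
            return True
        age_days = (date.today() - post_date).days
        return age_days <= server.retention_days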
As for the heuristic method, I don't think it would be very good at finding a server's retention. Perhaps it could simply be used to transfer an NZB to the backup server in its entirety. This would give people the option of bypassing the timeout period for every single article on the primary server, improving download performance. Since that is the only real drawback of the current method, this might be worth implementing.
And you're right, the current method does work OK for me; I just don't like the performance drawback. That's why I switch my servers manually, although I don't have to do this very often, so it's not a real problem for me. But I figured someone else might have the same performance drawback without wanting to switch servers manually. It also makes the use of a backup server less appealing, since that server will always perform worse than it would as a primary server. But as you indicated, this is a trade-off where performance must be weighed against the amount downloaded from the backup server; I just think this option would leave that choice to the users themselves.
Re: [Feature #43] Max-retention heuristics suggestion
Why worry about this when you could just check whether all of a post is there before downloading? Sander's script could be run as a 0.6 pre-processing script, set to bail if the post isn't complete.
It would be great to know your server's real max retention, but in my experience, once you get near the edge of what a host claims as its max retention, it's not as if they have every single post up to some datestamp. Instead, completion feathers out until it hits 0%. So while a host might CLAIM 700 or 800 days of retention, completion might already become poor from 775 days onward. This makes knowing their actual max retention less useful.
Re: [Feature #43] Max-retention heuristics suggestion
That script does sound promising, but I think it would be more useful if it weren't implemented as a pre-downloading script. From what I understand of the explanation of how the script works, it sends a STAT command to see whether an article is available without actually downloading it. This appears to take practically no time at all for a single article (I believe he mentioned a single thread being able to perform 20 STAT commands per second, i.e. 0.05 seconds per command), which pretty much solves the performance drawback that SABnzbd currently has.
I think that if SABnzbd simply performed a STAT command on every article before actually downloading it, it could verify that the article exists without having to wait for a connection timeout. That way the backup server is only used when the article really isn't on the primary server, and the backup server performs just as it would as a primary server (since checking an article would only take ~0.05 seconds, which I believe is negligible).
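As a small standalone illustration (not how SABnzbd's downloader is actually structured), Python's own nntplib can already do this check; the server address and message-id below are placeholders:

    import nntplib

    def article_available(host, port, message_id, user=None, password=None):
        """True if the server answers STAT with 223 (article exists),
        False if it answers 430 (no such article)."""
        conn = nntplib.NNTP(host, port, user, password)
        try:
            conn.stat('<%s>' % message_id)   # STAT transfers no article body
            return True
        except nntplib.NNTPTemporaryError as err:
            if str(err).startswith('430'):   # no article with that message-id
                return False
            raise
        finally:
            conn.quit()

    # e.g. article_available('news.example.com', 119, 'part1of10.abc123@example.com')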
The pre-downloading script could still be useful when you only want to download something that has all articles available, but I think it would need to be tied into the backup switch-over so it can also treat a download as complete when some articles are unavailable on the primary server but available on the backup server; that's just an extra thought, though. However, I don't think using the pre-download script is a good way to manage the switch-over to the backup server, since it would not let you download an incomplete post that could still be repaired automatically.
So while I think the pre-download script by itself is a good feature, I think using STAT commands to verify article availability would be more elegant, and it has the added benefit of matching the way SABnzbd currently works more closely. I assume this means less work would be needed to implement it.
Re: [Feature #43] Max-retention heuristics suggestion
I don't see how "stat" would help.
Decent servers give a standardized answer when they don't have an article.
Only the less civilized ones time out.
I would not be surprised if those timed out on a "stat" too, but I could be wrong.
The next 0.6.0 Alpha release will have a retention time setting per server.
I'll look at these suggestions in the future, but it's not on the list for 0.6.0.

