[Feature #43] Max-retention heuristics suggestion
Posted: November 12th, 2010, 6:05 am
I noticed the max-retention feature and had some thoughts on it. I don't know whether this will help, but I thought I'd offer some suggestions for possible heuristic methods. I saw that a maximum retention time would have to be found; I think it would be good to track a minimum retention time as well: a lower bound on the server's retention, based on the oldest article actually found there. This has the added benefit of distinguishing between knowing for sure that something can be found on a server and it merely being possible. Only in the second case would you need to run the heuristic for finding the server's retention, so it might have performance benefits too.

Another reason for the minimum retention time is that many usenet providers are continuously growing their retention, most at a rate of one day per day. So even if an accurate retention value has been found, it might be wrong the very next day, which is why it should only be treated as a minimum. This also means the max retention would have to be incremented daily, since it might otherwise drop below the server's actual retention. That in turn might require some maximum deviation from its confirmed value, for example no more than one year higher. This would only become an issue if the server is rarely used, but should an nzb be loaded that is actually older than the known max retention of every server, you could simply try downloading it with the server with the highest max retention (and the second-highest after that, if it fails) and then update their new minimum (or maximum, on failure) retention values.
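To sketch what I mean in Python (all names here are hypothetical, none of this is actual sabnzbd code): keep two bounds per server, let both grow one day per day since they were last confirmed, and cap how far the upper bound may drift.

```python
from datetime import date

class RetentionEstimate:
    """Hypothetical sketch: lower/upper bounds (in days) on a server's retention."""

    def __init__(self, max_drift_days=365):
        self.min_days = 0          # retention the server is known to have
        self.max_days = None       # retention it is known not to exceed (None = unknown)
        self._min_set = date.today()
        self._max_set = date.today()
        self.max_drift_days = max_drift_days

    def current_min(self):
        # Providers typically add ~1 day of retention per day, so the lower
        # bound may safely grow with the time elapsed since it was confirmed.
        return self.min_days + (date.today() - self._min_set).days

    def current_max(self):
        if self.max_days is None:
            return None
        # Grow the upper bound at the same rate, but cap the drift so a
        # rarely used server cannot wander more than max_drift_days away.
        drift = min((date.today() - self._max_set).days, self.max_drift_days)
        return self.max_days + drift

    def saw_article(self, age_days):
        """An article this old downloaded fine: raise the lower bound."""
        if age_days > self.current_min():
            self.min_days, self._min_set = age_days, date.today()

    def missed_article(self, age_days):
        """An article this old was out of retention: lower the upper bound."""
        cur = self.current_max()
        if cur is None or age_days < cur:
            self.max_days, self._max_set = age_days, date.today()
```

With bounds like these, an nzb whose age is below `current_min()` is known to be retrievable, and one above `current_max()` is known not to be; only ages in between need the heuristic run at all.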
As for the heuristic method itself, you could check whether the first article (or the first three, just to be sure) failed with a "no article found" error (NNTP response code 423 or 430). That would indicate that the articles this nzb is composed of are simply not available on this server (as opposed to being available but corrupted, incomplete, or failed during transfer), which would most likely mean they are outside the server's retention. If they are indeed out of retention, then the entire nzb is out of retention and would have to be downloaded by the backup server. If the backup server does download them successfully, you can set this age as a maximum retention for the primary server and a minimum for the backup server, checking their previous values of course. Checking the first three articles instead of just the first one makes sure it is not just a random problem with one article. The number of articles is arbitrary, of course, and can be changed to whatever is appropriate.
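That probe could look something like this (again a hypothetical sketch; `fetch_article` stands in for whatever actually issues the NNTP request and returns the response code):

```python
# NNTP "no article found" responses: 423 = no article with that number,
# 430 = no such article. Anything else means the article exists in some form.
NO_SUCH_ARTICLE = {423, 430}

def probe_out_of_retention(fetch_article, message_ids, probes=3):
    """Return True if the first few articles of an nzb all fail with a
    'no article found' response, suggesting the whole nzb is past the
    server's retention (rather than merely corrupt or incomplete)."""
    for message_id in message_ids[:probes]:
        code = fetch_article(message_id)   # assumed helper returning the status code
        if code not in NO_SUCH_ARTICLE:
            return False   # at least one probe exists, or failed differently
    return True
```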
If the first article (or one of the first three) is available, downloading should continue with the primary server, sending articles to the backup server as needed. For this case I would like to suggest something: if the amount of data in articles that had to be sent to the backup server (and were successfully downloaded there) exceeds the amount of parity provided with the nzb (only if parity has been provided), or 10% of the total download size (a commonly used amount of parity), then all remaining articles should be sent to the backup server. The reason for using the parity as a guideline is that exceeding it means the nzb could not have been retrieved from the primary server alone, as the missing data exceeds what the parity can repair, and therefore it should not be downloaded from that server at all. Alternatively, you could use a guideline of about 1% missing articles, since most (premium) usenet providers advertise 99+% completion; finding less than 99% would suggest an incomplete post. That is not very accurate, though, because the advertised completion refers to all data in the newsgroups, so one single post could still legitimately complete at significantly less than 99%. The count should only include completely missing articles, not the missing data from corrupt or partially transferred ones, since only fully missing articles indicate low retention. As a side note, the check of the first few articles described in the previous paragraph could be skipped entirely and only the total number of failed articles checked as described here; it would just take a bit longer for the entire nzb to be handed over to the backup server.
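The decision rule itself is just a threshold comparison. A minimal sketch (hypothetical names, byte counts assumed to come from the nzb metadata):

```python
def should_switch_to_backup(failed_bytes, total_bytes, parity_bytes=None):
    """Decide whether to hand all remaining articles to the backup server.

    If the data completely missing from the primary server already exceeds
    what the nzb's parity could repair (or 10% of the total size, a common
    amount of parity, when no parity is included), the download could never
    have completed from the primary alone.
    """
    threshold = parity_bytes if parity_bytes is not None else 0.10 * total_bytes
    return failed_bytes > threshold
```

Only bytes from articles that are entirely absent should be counted in `failed_bytes`, per the point above about corrupt or partial articles not indicating low retention.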
That first method of checking the first few articles is actually just an optimization for checking the retention, with this second method being a fall-back for cases where there are still some (but not enough) articles remaining on the server.
Allowing users to define the retention rate for each server does seem more accurate. However, it might be good to give them the option of letting it grow one day at a time as well, as many servers are currently growing their retention rate this way.
I don't know whether any of this is actually technically possible in SABnzbd, or whether some (or all) of it has already been implemented; if so, my apologies. I also don't know if it is appreciated to start a topic on an existing feature, but I noticed that the last change to this feature request was almost two years ago, so I hope this is of some use to someone working on SABnzbd.