Duplicate checking
I've seen this referred to in other feature requests, but not a specific request for it. This is a feature of Newsbin and I would really like to see it in SABnzbd. It works by taking a hash of the first few KB of a file and storing it in a database; it then hashes new files and compares them against the database to avoid dupes. It may be a bit resource intensive, but it's incredibly useful and a lot more precise than SABnzbd's current NZB filename checking.
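The scheme described above can be sketched roughly like this (a minimal illustration only, not Newsbin's or SABnzbd's actual code; the 64 KB prefix size and the SQLite schema are assumptions):

```python
import hashlib
import sqlite3

PREFIX_BYTES = 64 * 1024  # hash only the first 64 KB (assumed size)

def prefix_hash(path):
    """Hash the first few KB of a file, as the post describes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(PREFIX_BYTES))
    return h.hexdigest()

def is_duplicate(db, path):
    """Return True if this file's prefix hash was seen before;
    otherwise record it and return False."""
    digest = prefix_hash(path)
    if db.execute("SELECT 1 FROM seen WHERE hash = ?", (digest,)).fetchone():
        return True
    db.execute("INSERT INTO seen (hash) VALUES (?)", (digest,))
    db.commit()
    return False

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seen (hash TEXT PRIMARY KEY)")
```

Hashing only a fixed prefix keeps the cost per file small, at the price of (rare) false positives when two different files share their first bytes.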
Re: Duplicate checking
Are you talking about checking the downloaded files or the NZB files?
-
rollingeyeball
- Release Testers

- Posts: 181
- Joined: January 30th, 2009, 12:26 pm
Re: Duplicate checking
This sounds like a top idea. It means the same NZB from different sites, where the NZB _name_ might differ, won't get downloaded again if you were using RSS or something.
Are you talking about checking the downloaded files or the NZB files?
Would that make a difference in use? I thought he meant the NZB.
Re: Duplicate checking
It makes a lot of difference.
Suppose we check each file in the NZB as it starts downloading to see if it hasn't been downloaded already.
I can assure you that we will not implement such a method.
Fingerprinting the NZB itself may have its uses, but I'm not really convinced.
Re: Duplicate checking
Can you explain a bit more?
What differences, specifically?
Why would you not implement that?
Re: Duplicate checking
Checking NZB files: look at the net content of the NZB (i.e. the articles, without identifier fields added by the search site).
Checking files: suppose an NZB downloads 50 files. We would need to check each of the 50 files against all previous downloads to see if we downloaded it in a previous job. That we will definitely not do, because it's a huge overhead.
The former is feasible and not a bad idea.
The latter, never.
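Fingerprinting the "net content" of an NZB could look something like this sketch (my illustration only, not a proposed SABnzbd implementation): extract just the article message-IDs, sort them, and hash the result, so that indexer-added metadata such as subjects and dates doesn't affect the fingerprint. The element names follow the standard NZB XML format; everything else is an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace used by the NZB XML format
NZB_NS = "{http://www.newzbin.com/DTD/2003/nzb}"

def nzb_fingerprint(nzb_xml):
    """Fingerprint an NZB by its article message-IDs only,
    ignoring subjects and other indexer-specific fields."""
    root = ET.fromstring(nzb_xml)
    ids = sorted(
        seg.text.strip()
        for seg in root.iter(NZB_NS + "segment")
        if seg.text
    )
    return hashlib.sha256("\n".join(ids).encode()).hexdigest()
```

Two NZBs grabbed from different sites, with different filenames and different indexer metadata but the same underlying articles, would then produce the same fingerprint.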
Re: Duplicate checking
shypike wrote: The former is feasible and not a bad idea.
I agree; I think it should be the NZB. Checking files won't really offer much more, and I suspect it'd be harder for you to implement. But for argument's sake:
How much overhead is a database of hashes, really?
Re: Duplicate checking
rollingeyeball wrote: How much overhead is a database of hashes, really?
It's not just that; the amount of additional code and added complexity is large.
It would not be worth the investment.
Re: Duplicate checking
Fair enough then.
Thanks for the answers
Re: Duplicate checking
shypike wrote: It's not just that; the amount of additional code and added complexity is large. It would not be worth the investment.
Yeah, it is fairly intensive. Newsbin does it quite well, but it's a large program and is more or less the antithesis of SABnzbd because of that. When it comes across an NZB or set of files it has already done, the CPU load becomes apparent: it goes to 100% and the program locks up as it churns through them.
It would have been nice as a switch in SABnzbd, to be turned off if you wanted to run on a lower-powered machine. I didn't think it would add much to the idle footprint, but obviously it would, so never mind.
