
Duplicate checking

Posted: September 18th, 2009, 12:17 pm
by c0ld
I've seen this referred to in other feature requests, but not a specific request for it. This is a feature of Newsbin and I would really like to see it in sabnzbd. It works by taking a hash of the first few kb of a file and storing it in a database; it then checks the hash of new files against the db to avoid dupes. It may be a bit resource intensive, but it's incredibly useful and a lot more precise than SAB's current nzb filename checking.
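A minimal sketch of the kind of check described above: hash the first few KB of each completed file and compare it against a small database of hashes already seen. This is only an illustration of the idea, not Newsbin's actual implementation; the table layout, hash choice and 64 KB head size are assumptions.

```python
import hashlib
import sqlite3
from pathlib import Path

HEAD_BYTES = 64 * 1024  # hash only the first 64 KB of each file (assumed size)

def head_hash(path: Path) -> str:
    """Return the SHA-1 of the first few KB of a file."""
    with path.open("rb") as f:
        return hashlib.sha1(f.read(HEAD_BYTES)).hexdigest()

def make_db(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hash database."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS seen (hash TEXT PRIMARY KEY)")
    return db

def is_duplicate(db: sqlite3.Connection, path: Path) -> bool:
    """Check a file's head-hash against the db; record it if unseen."""
    h = head_hash(path)
    if db.execute("SELECT 1 FROM seen WHERE hash = ?", (h,)).fetchone():
        return True
    db.execute("INSERT INTO seen (hash) VALUES (?)", (h,))
    db.commit()
    return False
```

Because only the first few KB are read and the lookup is an indexed key query, the per-file cost stays small even as the database grows; the real cost, as discussed below, is doing this for every file of every job.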

Re: Duplicate checking

Posted: September 19th, 2009, 6:36 am
by shypike
Are you talking about checking the downloaded files or the NZB files?

Re: Duplicate checking

Posted: September 19th, 2009, 3:59 pm
by rollingeyeball
This sounds like a top idea. It means the same NZB from different sites, where the NZB _name_ might be different, won't get downloaded again if you're using RSS or something.
shypike wrote: Are you talking about checking the downloaded files or the NZB files?
Would that make a difference in use? I thought he meant the nzb.

Re: Duplicate checking

Posted: September 19th, 2009, 5:38 pm
by shypike
It makes a lot of difference.
Suppose we check each file in the NZB as it starts downloading to see whether it has already been downloaded.
I can assure you that we will not implement such a method.

Fingerprinting the NZB itself may have its uses, but I'm not really convinced.

Re: Duplicate checking

Posted: September 19th, 2009, 6:54 pm
by rollingeyeball
Can you explain a bit more?

What differences, specifically?

Why would you not implement that?

Re: Duplicate checking

Posted: September 20th, 2009, 8:22 am
by shypike
Checking NZB files:
Look at the net content of the NZB (i.e. the articles, without the identifier fields added by the search site).

Checking files: suppose an NZB downloads 50 files.
We would need to check each of the 50 files against previous downloads to see if
we downloaded it in a previous job.
That we will definitely not do, because that's a huge overhead.

The former is feasible and not a bad idea.
The latter, never.
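The feasible option above, fingerprinting the net content of the NZB, could be sketched as follows. This assumes the fingerprint is taken over the sorted set of article message-IDs (the part of an NZB that is the same regardless of which indexer produced it), which is one reasonable reading of "articles without identifier fields"; it is not SABnzbd's actual implementation.

```python
import hashlib
import xml.etree.ElementTree as ET

def nzb_fingerprint(nzb_xml: str) -> str:
    """Fingerprint an NZB by its net content: the sorted set of article
    message-IDs, ignoring filenames, subjects and indexer metadata."""
    root = ET.fromstring(nzb_xml)
    # NZB files use the namespace http://www.newzbin.com/DTD/2003/nzb;
    # strip any namespace so <segment> elements match either way.
    ids = sorted(
        (el.text or "").strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "segment"
    )
    return hashlib.sha1("\n".join(ids).encode()).hexdigest()
```

Two NZBs for the same post, renamed by different sites, would still produce the same fingerprint, since the underlying article message-IDs do not change.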

Re: Duplicate checking

Posted: September 20th, 2009, 1:36 pm
by rollingeyeball
The former is feasible and not a bad idea.
:D That's a much more pleasant thing to hear

I agree, I think it should be the NZB; checking files won't really offer much more and I suspect it'd be harder for you to implement. But for argument's sake...
How much overhead is a database of hashes, really?

Re: Duplicate checking

Posted: September 20th, 2009, 4:15 pm
by shypike
rollingeyeball wrote: How much overhead is a database of hashes, really?
It's not just that, the amount of additional code and added complexity is large.
It would not be worth the investment.

Re: Duplicate checking

Posted: September 20th, 2009, 6:22 pm
by rollingeyeball
Fair enough then.
Thanks for the answers :)

Re: Duplicate checking

Posted: September 23rd, 2009, 5:03 pm
by c0ld
shypike wrote: It's not just that, the amount of additional code and added complexity is large.
It would not be worth the investment.
Yeah, it is fairly intensive. Newsbin does it quite well, but it's a large program and is more or less the antithesis of SAB because of that. When it comes across an nzb/set of files it has already done, the CPU load becomes apparent: it will go to 100% and lock up as it churns through them.

It would have been nice as a switch in SAB, one that could be turned off if you wanted to run on a lower-powered machine. I didn't think it would add too much to the idle footprint; obviously it would though, so nvm.