Duplicate checking
I've seen this referred to in other feature requests, but not a specific request for it. This is a feature of Newsbin and I would really like to see it in SABnzbd. It works by taking a hash of the first few KB of a file and storing it in a database; it then hashes new files and compares them against the database to avoid dupes. It may be a bit resource intensive, but it's incredibly useful and a lot more precise than SABnzbd's current NZB filename checking.
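The scheme described above can be sketched roughly like this (a minimal illustration only, not Newsbin's or SABnzbd's actual code; the 64 KB prefix size and the SQLite schema are assumptions):

```python
import hashlib
import sqlite3

PREFIX_BYTES = 64 * 1024  # hash only the first 64 KB (assumed size)

def prefix_hash(path):
    """Hash the first few KB of a file, as the post describes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(PREFIX_BYTES))
    return h.hexdigest()

def is_duplicate(db, path):
    """Return True if this file's prefix hash was seen before;
    otherwise record it and return False."""
    digest = prefix_hash(path)
    if db.execute("SELECT 1 FROM seen WHERE hash = ?", (digest,)).fetchone():
        return True
    db.execute("INSERT INTO seen (hash) VALUES (?)", (digest,))
    db.commit()
    return False

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE seen (hash TEXT PRIMARY KEY)")
```

Hashing only a fixed prefix keeps the cost per file small, at the price of (rare) false positives when two different files share their first bytes.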
Re: Duplicate checking
Are you talking about checking the downloaded files or the NZB files?
-
rollingeyeball
- Release Testers

- Posts: 181
- Joined: January 30th, 2009, 12:26 pm
Re: Duplicate checking
This sounds like a top idea. It means the same NZB from different sites, where the NZB _name_ might differ, won't get downloaded again if you were using RSS or something.
Are you talking about checking the downloaded files or the NZB files?
Would that make a difference in use? I thought he meant the NZB.
Re: Duplicate checking
It makes a lot of difference.
Suppose we check each file in the NZB as it starts downloading to see if it hasn't been downloaded already.
I can assure you that we will not implement such a method.
Fingerprinting the NZB itself may have its uses, but I'm not really convinced.
Re: Duplicate checking
Can you explain a bit more?
What differences, specifically?
Why would you not implement that?
Re: Duplicate checking
Checking NZB files: look at the net content of the NZB (i.e. the articles, without identifier fields added by the search site).
Checking files: suppose an NZB downloads 50 files. We would need to check each of the 50 files against all previous downloads to see if we downloaded it in a previous job. That we will definitely not do, because it's a huge overhead.
The former is feasible and not a bad idea.
The latter, never.
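Fingerprinting the "net content" of an NZB could look something like this sketch (my illustration only, not a proposed SABnzbd implementation): extract just the article message-IDs, sort them, and hash the result, so that indexer-added metadata such as subjects and dates doesn't affect the fingerprint. The element names follow the standard NZB XML format; everything else is an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace used by the NZB XML format
NZB_NS = "{http://www.newzbin.com/DTD/2003/nzb}"

def nzb_fingerprint(nzb_xml):
    """Fingerprint an NZB by its article message-IDs only,
    ignoring subjects and other indexer-specific fields."""
    root = ET.fromstring(nzb_xml)
    ids = sorted(
        seg.text.strip()
        for seg in root.iter(NZB_NS + "segment")
        if seg.text
    )
    return hashlib.sha256("\n".join(ids).encode()).hexdigest()
```

Two NZBs grabbed from different sites, with different filenames and different indexer metadata but the same underlying articles, would then produce the same fingerprint.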
Re: Duplicate checking
shypike wrote: The former is feasible and not a bad idea.
I agree; I think it should be the NZB. Checking files won't really offer much more, and I suspect it'd be harder for you to implement. But for argument's sake:
How much overhead is a database of hashes, really?
Re: Duplicate checking
rollingeyeball wrote: How much overhead is a database of hashes, really?
It's not just that; the amount of additional code and added complexity is large.
It would not be worth the investment.
Re: Duplicate checking
Fair enough then.
Thanks for the answers
Re: Duplicate checking
shypike wrote: It's not just that; the amount of additional code and added complexity is large. It would not be worth the investment.
Yeah, it is fairly intensive. Newsbin does it quite well, but it's a large program and is more or less the antithesis of SABnzbd because of that. When it comes across an NZB or set of files it has already done, the CPU load becomes apparent: it goes to 100% and the program locks up as it churns through them.
It would have been nice as a switch in SABnzbd, to be turned off if you wanted to run on a lower-powered machine. I didn't think it would add much to the idle footprint, but obviously it would, so never mind.
