Duplicate checking

Want something added? Ask for it here.
c0ld
Newbie
Posts: 48
Joined: March 30th, 2009, 6:28 am

Duplicate checking

Post by c0ld »

I've seen this referred to in other feature requests, but not as a specific request. This is a feature of Newsbin and I would really like to see it in sabnzbd. It works by taking a hash of the first few kb of a file and storing it in a database; it then hashes each new file and compares it against the db to avoid dupes. It may be a bit resource intensive, but it's incredibly useful and a lot more precise than sab's current nzb filename checking.
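
For illustration, here is a minimal Python sketch of the scheme described above. It is not Newsbin's actual code: the 16 KiB head size, the SHA-1 hash, the SQLite table and the function names (`head_hash`, `open_db`, `is_duplicate`) are all assumptions made up for the example.

```python
import hashlib
import sqlite3

# Assumption: "the first few kb" is taken to mean the first 16 KiB.
HEAD_BYTES = 16 * 1024


def head_hash(path, head_bytes=HEAD_BYTES):
    """Return the SHA-1 hex digest of the first head_bytes of a file."""
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read(head_bytes)).hexdigest()


def open_db(db_path=":memory:"):
    """Open (or create) the hypothetical database of seen hashes."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (digest TEXT PRIMARY KEY, path TEXT)"
    )
    return conn


def is_duplicate(conn, path):
    """Return True if this file's head hash is already stored; otherwise store it."""
    digest = head_hash(path)
    row = conn.execute(
        "SELECT 1 FROM seen WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return True
    conn.execute("INSERT INTO seen (digest, path) VALUES (?, ?)", (digest, path))
    conn.commit()
    return False
```

Because only the head of each file is hashed, two files that share their first 16 KiB but differ later would also be reported as dupes in this sketch.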
shypike
Administrator
Posts: 19773
Joined: January 18th, 2008, 12:49 pm

Re: Duplicate checking

Post by shypike »

Are you talking about checking the downloaded files or the NZB files?
rollingeyeball
Release Testers
Posts: 181
Joined: January 30th, 2009, 12:26 pm

Re: Duplicate checking

Post by rollingeyeball »

This sounds like a top idea. It means the same nzb from different sites, where the nzb _name_ might be different, won't get downloaded again, if you were using rss or something.
shypike wrote: Are you talking about checking the downloaded files or the NZB files?
Would that make a difference in use? I thought he meant the nzb.
shypike
Administrator
Posts: 19773
Joined: January 18th, 2008, 12:49 pm

Re: Duplicate checking

Post by shypike »

It makes a lot of difference.
Suppose we check each file in the NZB as it starts downloading to see whether it has been downloaded already.
I can assure you that we will not implement such a method.

Fingerprinting the NZB itself may have its uses, but I'm not really convinced.
rollingeyeball
Release Testers
Posts: 181
Joined: January 30th, 2009, 12:26 pm

Re: Duplicate checking

Post by rollingeyeball »

Can you explain a bit more?

What differences, specifically?

Why would you not implement that?
shypike
Administrator
Posts: 19773
Joined: January 18th, 2008, 12:49 pm

Re: Duplicate checking

Post by shypike »

Checking NZB files.
Look at the nett content of the NZB (i.e. articles without identifier fields from the search site).

Checking files. Suppose a NZB downloads 50 files.
We would need to check each of the 50 files against previous downloads to see if we downloaded it in a previous job.
That we will definitely not do, because that's a huge overhead.

The former is feasible and not a bad idea.
The latter, never.
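
As a rough illustration of the first option, here is a minimal Python sketch, not SABnzbd's implementation. It assumes the "nett content" can be taken as the set of article message-IDs in the NZB's `<segment>` elements, and that everything else (head metadata, subjects, posters, dates) may vary between search sites. The function name `nzb_fingerprint` and the choice of SHA-1 are made up for the example.

```python
import hashlib
import xml.etree.ElementTree as ET


def nzb_fingerprint(nzb_text):
    """Hash only the article message-IDs of an NZB, ignoring site-specific fields."""
    root = ET.fromstring(nzb_text)
    message_ids = sorted(
        (el.text or "").strip()
        for el in root.iter()
        # Strip any XML namespace so "{...}segment" and "segment" both match.
        if el.tag.rsplit("}", 1)[-1] == "segment"
    )
    # Sorting makes the fingerprint independent of file and segment order.
    return hashlib.sha1("\n".join(message_ids).encode("utf-8")).hexdigest()
```

Two NZBs listing the same articles would get the same fingerprint under this assumption, whatever their file names, so a single stored hash per job is enough for the comparison.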
rollingeyeball
Release Testers
Posts: 181
Joined: January 30th, 2009, 12:26 pm

Re: Duplicate checking

Post by rollingeyeball »

shypike wrote: The former is feasible and not a bad idea.
:D That's a much more pleasant thing to hear

I agree, I think it should be the NZB; checking files won't really offer much more and I suspect it'd be harder for you to implement. But for argument's sake...
How much overhead is a database of hashes, really?
shypike
Administrator
Posts: 19773
Joined: January 18th, 2008, 12:49 pm

Re: Duplicate checking

Post by shypike »

rollingeyeball wrote: How much overhead is a database of hashes, really?
It's not just that, the amount of additional code and added complexity is large.
It would not be worth the investment.
rollingeyeball
Release Testers
Posts: 181
Joined: January 30th, 2009, 12:26 pm

Re: Duplicate checking

Post by rollingeyeball »

Fair enough then.
Thanks for the answers :)
c0ld
Newbie
Posts: 48
Joined: March 30th, 2009, 6:28 am

Re: Duplicate checking

Post by c0ld »

shypike wrote: It's not just that, the amount of additional code and added complexity is large.
It would not be worth the investment.
Yeah, it is fairly intensive. Newsbin does it quite well, but it's a large program and is more or less the antithesis of sab because of that. When it comes across an nzb/set of files it has already done, the cpu load becomes apparent: it goes to 100% and locks up while it churns through them.

It would have been nice as a switch on sab, to be turned off if you wanted to run on a lower powered machine. I didn't think it would add too much to the idle footprint, but obviously it would, so nvm.