Kuro5hin.org: technology and culture, from the trenches
How Much Data Should We Store?

By pwayner in Technology
Fri May 24, 2002 at 09:04:07 AM EST
Tags: Round Table (all tags)

For the last few years I've been working on making databases more secure by trying to convince people that keeping less information is better. Databases come with many security options, but they are still open to abuse by insiders and hackers. Sometimes less data can provide the same value to the company or agency while significantly reducing the dangers for customers, clients, employees and others. I wanted to ask Kuro5hin readers about their experiences with limiting and regulating the amount of data in their systems.




The biggest problem is often cultural. Information is a glue for any organization. Many people instinctively take down as much data as they can because it may just be useful in the future. This packrat instinct may pay off many times, but it can have dramatic failures. The same bits of data that help a business are also essential tools for stalkers, rapists, identity thieves, fraudsters, and eavesdroppers.

Balancing the needs against the responsibilities can often be done with the classical encryption algorithms. If the sensitive, personal information is encrypted in the right way, then no hacker or insider can get at it. (Here's a plug: I've written a book called Translucent Databases. It comes with dozens of examples in Java and SQL.)

Encrypting the right amount of information can provide good security for everyone without hobbling the database. The sensitive or personal information is scrambled, but the practical data is left open for the database to use.

The security comes at a price because the information is also scrambled for the database administrators and everyone else who used to enjoy privileged access. The businesses and agencies can't rely upon the records to help straighten out problems or resolve disputes. The data may still be in the system, but it is effectively gone from the culture.

In many cases, the missing information is not a big loss. A few months after most transactions, there's no need to look back over old records. The credit card numbers or other sensitive information are usually only good for malicious browsers, many of whom come from the inside.

While many businesses have fewer responsibilities and worries after creating a translucent database, the culture still resists the loss of control. The databases go from being omniscient oracles to neutral switches that don't understand the data flowing through them.

How can an organization deal correctly with the power of information? The open source community often diffuses the concentration of power by forcing everyone to share equally. That's great for source code, but it doesn't work for dangerous information like credit card numbers or the schedule of databases.

I want to know if the readers have confronted this problem or dilemma at work. How was it solved? Was there a good compromise available? Did politics enter the equation? Can companies give up their packrat nature?

How Much Data Should We Store? | 34 comments (25 topical, 9 editorial, 0 hidden)
Simple answer to your question in the headline (3.40 / 5) (#2)
by cem on Thu May 23, 2002 at 12:03:54 PM EST

Q: How Much Data Should We Store?
A: As little as possible, but as much as necessary.


Young Tarzan: I'll be the best ape ever!
In an ideal world yes (5.00 / 1) (#3)
by gazbo on Thu May 23, 2002 at 12:14:58 PM EST

But when I design a database for a client, I make sure it will store all sorts of data that I could see becoming useful in the future.

In the real world, clients turn around and say "You know, it'd be really neat if I could give you a purchase order and you could cross reference X with Y." I don't ever want to reply "Well you just tear that PO up, because storing Y was never in the original spec so I didn't do it, despite how useful it now appears to be."



[ Parent ]

Usually ... (none / 0) (#5)
by cem on Thu May 23, 2002 at 12:20:37 PM EST

... I ask myself, will I ever use (read or manipulate) a specific data record or field really? Do I really need it?

If I'm not sure, it will not be part of my database design.

Well, I always think, less is more ... and faster.


Young Tarzan: I'll be the best ape ever!
[ Parent ]

Get everything you can (4.00 / 1) (#6)
by maroberts on Thu May 23, 2002 at 12:23:16 PM EST

Companies want all the data they can get on a customer because they are never sure when it may come in useful. In addition to that, they can sell the data, so data that is useless to them may have value to a third party.
~~~
The greatest trick the Devil pulled was to convince the world he didn't exist -- Verbal Kint, The Usual Suspects
My experience is ... (5.00 / 1) (#7)
by cem on Thu May 23, 2002 at 12:30:42 PM EST

... companies never really need more than 90% of their data. But ... you don't know which 10% is useful.

Data storage can be very expensive for larger companies. And selling data can in many cases be against the law.


Young Tarzan: I'll be the best ape ever!
[ Parent ]

How much data? (3.80 / 5) (#8)
by pb on Thu May 23, 2002 at 01:00:23 PM EST

Why, all of it, of course.

Encrypting sensitive information is a very good idea.  But you're still storing it, just not in plaintext.  :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

Not all so-called encryption is reversible. (4.66 / 3) (#12)
by pwayner on Thu May 23, 2002 at 02:23:26 PM EST

The book devotes plenty of discussion to true one-way functions like MD5 and SHA. They may not be true encryption, but they're always discussed in the same areas.

These functions can't be reversed, but they can still be used in interesting ways. Equality is easy to test. One of the most common solutions is the UNIX password database. It stores h(password) instead of the password itself. The structure of h(x) makes it difficult to guess x from h(x) alone, but if someone presents a value of x', you can compute h(x') and test equality.

These functions can produce some complicated solutions with inscrutable databases.
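A minimal sketch of the store-h(password)-and-test-equality idea, in Python. It is not the real UNIX implementation: salted PBKDF2 with SHA-256 is swapped in as the one-way function, and the names `make_record` and `check` and the iteration count are assumptions made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only for this sketch

def make_record(password: str) -> tuple:
    """Return (salt, digest); the password itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def check(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Compute h(x') for a presented x' and compare it with the stored h(x)."""
    attempt = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt, digest)

salt, stored = make_record("correct horse")
print(check("correct horse", salt, stored))  # True
print(check("wrong guess", salt, stored))    # False
```

The stored pair lets the system answer "is this the same password?" without being able to answer "what is the password?".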

[ Parent ]
yup (4.50 / 6) (#17)
by pb on Thu May 23, 2002 at 05:08:42 PM EST

Actually it encrypts a blank string with something like DES, using the (first 8 characters of the) password as the key, resulting in a hash of the password. But if it isn't reversible, it isn't really encryption (even if it does use similar algorithms); hashes are generally used for authentication, as your password example shows.

But yes, I am a pack-rat, and I'd like to keep all my data around if possible, as well as copies of whatever I can get my grubby pack-rat paws on.  This has little to do with whether it's encrypted, or stored in a database, and more to do with "ooo, this piece of data might be useful someday, maybe"...  I mean, I'm still annoyed that I ever got rid of my C64, and I eventually did get another NES, and I still have everything I'd need to recreate my old 386 running DOS...

Basically, I feel that it's a shame that so much data is lost every day; for example, there are no decent archives of the early days of the web, or anything reliable pre-1996.  That includes no real references to the WWWW (the World-Wide Web Worm, a very popular search engine), or whatever the neat-o new site was that used Netscape 1.1's "BODY BACKGROUND" tag, or buttons...  And obviously there's a lot missing from everything pre-WWW on the Internet, like gopherspace, and USENET...

Actually, it's amazing how much has survived in the first place.  But now that we can store so much more data, and at the rate that data storage capacities are increasing, I think it's not unreasonable for today's tape backups to become part of next year's historic archives.  Obviously you wouldn't want just anyone to have access to all of it, so you might want to encrypt parts of it, or even have some (private) data omitted; but even without any private data, there is a wealth of publicly available data that should be archived, but never will be...

Obviously, this is a different topic altogether from what people should keep in their own databases, but it does stem directly from "How Much Data Should We Store"; my answer is, as much as humanly possible.  :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

Encryption has limits (2.50 / 2) (#26)
by Betcour on Fri May 24, 2002 at 08:16:02 AM EST

Encryption means there's a key somewhere (I'll forget about hash systems, which are only usable for some things). Usually the key is on the computer between the client and the database (like on a web server if the database is accessed that way). This makes the system vulnerable (a bit more difficult to hack because the attacker must spend some time figuring out where the key is and how to use it, but it's only a minor difficulty once he is inside).

[ Parent ]
Determine the business cycle, keep 'working' data (4.00 / 5) (#11)
by SaintPort on Thu May 23, 2002 at 01:40:44 PM EST

as long as relevant.  We have databases that only keep the last 30 days, 90 days, quarter, year...

The hard part with this mentality is that someone usually ends up asking for some data just a little too late.

I have created archive databases (with more limited user access) for data past its normal working range.
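A rough sketch of that rolling-window-plus-archive pattern, using Python's built-in sqlite3 module. The table names (`orders`, `orders_archive`), the columns, and the 90-day window are all hypothetical; a real setup would put the archive in a separate database with tighter access.

```python
import sqlite3
from datetime import date, timedelta

def archive_old_rows(conn, today, keep_days=90):
    """Move orders older than keep_days from the working table to the archive."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    with conn:  # one transaction: copy first, then delete
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE placed_on < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM orders WHERE placed_on < ?", (cutoff,)
        ).rowcount
    return moved

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, total REAL)")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, placed_on TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (placed_on, total) VALUES (?, ?)",
    [("2002-01-10", 19.95), ("2002-04-30", 5.00), ("2002-05-20", 42.00)],
)

moved = archive_old_rows(conn, today=date(2002, 5, 24))
print(moved)  # 1 (only the January order is older than 90 days)
```

Dates are stored as ISO strings so that plain string comparison orders them correctly.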

Good Topic.

--
Search the Scriptures
Start with some cheap grace...Got Life?

Oh, it's so hard to throw away data (3.60 / 5) (#16)
by westfirst on Thu May 23, 2002 at 04:18:06 PM EST

I've got hundreds of backup disks and it's just hard to throw something away. If I get subpoenaed like MS, Enron, or AA, I'm going to have to give them so much. Not that there's much for them to see or read. It's all half-baked source code.

I think the best solution is for the lawyers to enforce rules to delete information. They're the ones with the real power and the instinct to throw away/destroy information. System administrators are packrats by nature.

Half a Petabyte of storage... (2.00 / 2) (#18)
by mikecap on Thu May 23, 2002 at 05:28:51 PM EST

SLAC recently achieved this milestone with their experimental data DB. It's an object-oriented DBMS that they've customized with an additional 500,000 lines of code. It runs on a bunch of Sun boxes, and they grow it every day by another 500 GB (they have an upper daily transfer limit of 1000 GB).

Mike

US Patriot Act (4.88 / 9) (#19)
by Spork on Thu May 23, 2002 at 05:38:40 PM EST

This is a US-centric comment, but it's an important thing to think about. My girlfriend is a librarian who just returned from a conference on the dreaded US Patriot Act. Under the provisions of that act, the FBI has pretty much unrestricted power of access to any US-based database. This means that when you get the letter from them, all your data belongs to them, and deleting anything after they ask for access is a federal crime. Librarians, doctors and many others justly freak out about this, because the privacy of their clients is severely compromised. Basically, there is no data stored in the US that the FBI cannot access at will.

The only conclusion that anybody seems to agree on is that if client privacy is important to you, for god's sake, don't archive anything. Libraries that are on the ball are making an extra effort to delete all information about who borrowed and subsequently returned books. If the book is returned, those records become irrelevant anyways, but many libraries kept them all the same. Video stores, credit card companies and merchants often do the same thing. So please, US-Americans, delete stuff (emails, everything) before the FBI comes knocking with their insane "Patriot Act".

Ask your lawyers! (4.00 / 3) (#20)
by jayfoo2 on Thu May 23, 2002 at 06:00:40 PM EST

As sad and painful as it is to think about, what your company's lawyers say is actually important here.

One of the first questions to ask whenever you model a database is what information are you required to store, and for how long.

Records of financial transactions for example often have to be stored for 5 or 7 years in America. And just to make things more complicated, there are several types of data that must be destroyed after a given time period.

This is why most medium to large companies typically have a document retention/destruction policy.

In any case, one of your first stops should always be the legal department. Let them make the call and sign off on what you are keeping, and for how long. That way all you need to say at the congressional hearing is that you were doing what he or she said.
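One way to picture the "must keep for N years" and "must destroy after M years" rules is as a small policy table that code can consult. The sketch below is Python; the record kinds and the year counts in `POLICY` are made-up placeholders, not legal advice, and the real numbers are exactly what the legal department should supply.

```python
from datetime import date

# Hypothetical policy: kind -> (keep at least N years, destroy after M years or None)
POLICY = {
    "financial_transaction": (7, None),
    "job_application": (0, 2),
}

def add_years(d, years):
    """Return d shifted forward by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def retention_action(kind, created, today):
    keep_years, destroy_years = POLICY[kind]
    if destroy_years is not None and today >= add_years(created, destroy_years):
        return "destroy"
    if today < add_years(created, keep_years):
        return "must keep"
    return "may delete"

today = date(2002, 5, 24)
print(retention_action("financial_transaction", date(1998, 3, 1), today))  # must keep
print(retention_action("job_application", date(2000, 1, 15), today))       # destroy
print(retention_action("financial_transaction", date(1990, 1, 1), today))  # may delete
```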

Destruction or encryption? (4.83 / 6) (#21)
by bsee on Thu May 23, 2002 at 06:20:00 PM EST

I think we need to highlight the difference between encryption for purposes of security/access versus data destruction to protect from future litigation, etc.

The author appears to be advocating varying levels of encryption to address the risk of unauthorized access by crackers or disgruntled employees.

This is a totally separate issue from what data is destroyed under a document "retention" policy.  

I'm sure it's been said in these forums before, but if you get involved in litigation and are subject to discovery, and your adversary wants it bad enough (and can justify the cost to the court), you will have to decrypt the data for them (or produce your keys), or face sanction from the court.  

Just because the data is securely encrypted doesn't provide protection in litigation.  That's why so many corporate document retention policies out there mandate the outright destruction of data (and backups).

Yes, you're right, but... (4.50 / 6) (#22)
by pwayner on Thu May 23, 2002 at 08:36:27 PM EST

I made a mistake by lumping one-way functions like SHA and MD5 together with general encryption functions. These one-way functions can destroy information while still preserving the ability to test equality. So instead of storing a name, we can store SHA(name). There's no easy way to take SHA(name) and figure out name without doing a brute force search. On the other hand, if someone gives you name2, you can compute SHA(name2) and test to see if they're equal. This is a powerful technique if used in the right way in the right circumstances. You can destroy the data AND keep some use for it.
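Here is a small sketch of the SHA(name) idea against an actual table, using Python's hashlib and sqlite3. It is an illustration under assumptions, not an example taken from the book: SHA-256 stands in for "SHA", and the `rentals` table, the `h` helper, and the lowercase normalization are all hypothetical.

```python
import hashlib
import sqlite3

def h(name: str) -> str:
    """One-way digest of a normalized name (SHA-256 is an assumption here)."""
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rentals (name_hash TEXT, title TEXT)")
# Only the digest of the name is written; the title stays in the clear.
conn.execute("INSERT INTO rentals VALUES (?, ?)", (h("Alice Smith"), "Sneakers"))

def titles_for(name):
    """Someone who already knows the name can still look up their own rows."""
    rows = conn.execute("SELECT title FROM rentals WHERE name_hash = ?", (h(name),))
    return [title for (title,) in rows]

print(titles_for("alice smith"))  # ['Sneakers']
print(titles_for("Bob Jones"))    # []
```

Names are easy to guess, so a bare hash like this invites exactly the brute force search mentioned above; mixing a secret or a per-record salt into the hash input raises the cost of that search.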

[ Parent ]
Data Protection Act (4.50 / 2) (#23)
by salsaman on Fri May 24, 2002 at 05:59:18 AM EST

You might want to take a look at the UK's Data Protection Act.

It has some good guidelines for dealing with sensitive personal information.

[IMO it's one of the very rare times when the UK govt. actually got some tech legislation right].

Other data retention laws (none / 0) (#28)
by wiml on Fri May 24, 2002 at 05:35:18 PM EST

Lots of countries have data protection laws, many in the wake of a certain war about fifty years back. (No matter what your policies are, they won't keep your data from being misused when a bunch of people storm into your country with tanks, guns, and so forth, and start looking through your databases for people matching particular profiles.)

Info on: the EU, Denmark, France, Belgium, and others. I'd include more but it's hard to find good translations.



[ Parent ]

No data (1.00 / 3) (#27)
by medham on Fri May 24, 2002 at 03:55:08 PM EST

Should be stored. It should be caressed by the eternal wifely.

The real 'medham' has userid 6831.

Have you ever worked on a really big db? (4.66 / 3) (#29)
by bobm on Fri May 24, 2002 at 06:27:02 PM EST

statements like

In many cases, the missing information is not a big loss. A few months after most transactions, there's no need to look back over old records. The credit card numbers or other sensitive information are usually only good for malicious browsers, many of whom come from the inside.

make me feel that you haven't really ever worked with any massive databases (or OLAP systems). If the information is missing, you can't determine if it's a big loss or not; it's missing.

Other nits: I think you'll find that over time the data you don't store will be the data that you need.

This really shows itself when dealing with record changes or history and is a subject worthy of its own discussion.

The reality is that you can never have enough information, it's storing and retrieving it that is the pain.

Also (in my experience) most _real_ companies don't sell the info, they keep it very secret (I'm talking Fortune 500 companies and I'm talking personal info). Bogus .com's are another issue.

Store Everything (none / 0) (#30)
by Scrymarch on Sun May 26, 2002 at 10:39:44 AM EST

You make a good point about security concerns being compromised by masses of historical untracked details, but it goes completely against my instincts.  There are excellent reasons to be a packrat.

Storing information isn't like keeping a desk clean.  The interfaces have to be crisp, so that the needed information can be accessed quickly, but as much actual information should be kept as possible.  In my experience organisations are not careful enough with their information, or have a haphazard approach to it.  I've lost most of the email I've ever sent, for instance, despite it being a wonderful record of the moment equivalent to snailmail.  It is valuable to a business to know every transaction a customer has made.  Small customers can become big customers.  A memory has always been vital to businesses, but due to both increases in scale and a more mobile workforce that frequently changes jobs, there's much more reason to have it explicitly stored in database records than in the oral tradition of a company-man sales force.  The archives of organisations are one of their few core assets.  Buildings and people change, customers become more and less important, but the records remain (or don't, which makes it harder to continue rather than restart from scratch).

Maybe packrat is a good word, because there's often little care given to a meaningful organisation of old data.  Lack of standards and a ham-fisted attitude to storage compound the problem.  Instead data is hoarded with little more than a sense that "something here is important".  Maybe this is what an executive CIO would be good for - at the most abstract, making sure that information flows well within an organisation; that as much as physically or fiscally possible is available from any privileged point through crisp interfaces; and that old data is a deep well of experience rather than a stagnant mud-pool for water packrats.

I'll now give the obligatory mention of the lost data from NASA missions.  David Gelernter and Bruce Sterling have also written eloquently on this issue.

Yes, but... (none / 0) (#31)
by pwayner on Sun May 26, 2002 at 09:38:58 PM EST

Keeping information is nice, but it does cause trouble in some cases. Most information in databases is innocuous, but some is quite valuable to fraudsters, stalkers, rapists, identity thieves, and others. I guess I'm really concerned about wondering aloud about when the packrat instinct inadvertently helps the wrong people. How can you anticipate that?

[ Parent ]
Anticipation (none / 0) (#33)
by Scrymarch on Mon May 27, 2002 at 05:28:59 PM EST

I guess I'm really [...] wondering aloud about when the packrat instinct inadvertently helps the wrong people. How can you anticipate that?

The same way you can anticipate anything about how a software system develops: imperfectly.

Again, protecting sensitive information is important - I've heard of clients requesting the sort of cryptographic security you're suggesting.  But you seem to be taking this and asking the question of users: "Do you really need this data?"  This is counterproductive.  The data is the business.  They need it.  Each business is building a domain-specific Library of Congress accessed with SQL.  I don't have the stats, but I'm guessing many books there aren't needed every year, and some might go years at a time without being read.  They can still be immensely useful in the future.  This is even more important with historical data that is useful in summary.

I would suggest two other questions: "How long do you need this?" and "How sensitive is this?".  You'll throw some data away - there's no need to keep session IDs for the ages.  In addition, keeping the data forever is nigh on worthless if you have no sensible way to access it.  This is the converse of the packrat instinct - a big midden pile to be searched through for the occasional gem.
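Those two questions can be written down per field and turned into a storage decision. The following Python sketch is only one possible reading: the `Field` class, the example fields, and the one-day cutoff for "don't bother persisting" are assumptions picked for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Field:
    name: str
    keep_days: Optional[int]  # "How long do you need this?" (None = undecided)
    sensitive: bool           # "How sensitive is this?"

def treatment(field: Field) -> str:
    """Map the two answers to a storage decision (thresholds are hypothetical)."""
    if field.keep_days is not None and field.keep_days <= 1:
        return "do not persist"
    if field.sensitive:
        return "store hashed or encrypted"
    return "store in the clear"

fields = [
    Field("session_id", keep_days=1, sensitive=False),
    Field("credit_card_number", keep_days=90, sensitive=True),
    Field("order_total", keep_days=None, sensitive=False),
]

plan = {f.name: treatment(f) for f in fields}
for name, decision in plan.items():
    print(name, "->", decision)
```

Writing the answers down in one place also makes it easier to revisit them later, which fits the maintain-the-design argument.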

Now looking at the number of books you've written, these somewhat motherhood statement comments won't be new to you.  Basically I'm a fan of the evolutionary design attitude described by say, How Buildings Learn.  I think that continual design-savvy maintenance of systems lets users themselves produce things that will be historically useful.  A good design will make it possible to add "translucent" protection well into the lifetime of the product.  (Oh no, we'll be sued if we don't make the registration code translucent! ...)

To summarise: the packrat instinct is bad; much rubbish is hoarded and valuable information is lost.  The solution is not to throw away more information but maintain the design so that the information gains value rather than rotting in rubbish heaps.


[ Parent ]

Yes. (none / 0) (#34)
by pwayner on Tue May 28, 2002 at 11:02:04 PM EST

I would suggest two other questions: "How long do you need this?" and "How sensitive is this?".

These are good questions. Maybe I should be asking a meta question of "what are the good questions?" Then we may be able to get a better idea of what is the right data to keep around.

[ Parent ]
I worked with an FBI subcontractor... LSFIA (none / 0) (#32)
by MickLinux on Mon May 27, 2002 at 03:27:24 AM EST

I worked with an FBI subcontractor with the Low-speed Fingerprint Imaging and Acquisition project (92). The goal was to get all criminals' fingerprints in a vast computer database (as opposed to cards), and then follow it up by getting all the taxi drivers, all the day care workers, and eventually everybody in their database.

At the time, I was concerned about recent abuses by the FBI and our paramilitary police agencies (ATF, Dept. of Parks, DEA, etc. etc. etc.), and especially about ongoing abuses that I *thought* were going on. So I asked FBI agents there how much the fingerprint records were abused. They informed me that a few times every year they catch employees looking up records illegally for personal purposes, and fire them. To me, that was like saying "we squash the small offenders." That, as opposed to saying "we have/had a problem and are trying to stop it."

It let me know that there are indeed major problems with storing too much data. Look at the recent reports about the IRS illegally forcing VISA and MC to hand over their records of offshore account holders. If VISA and MC were holding information that was unnecessary, that information is right now damaging its owner. My feeling about this is: keep no records that you don't have to keep.

---

Followup. The depth of the problem was significant enough that, before I was assigned to the job, I talked with the supervisor about the problem. He said that he had similar concerns, but it would be best to go and see, first. Before I actually went, he quit. I went, I saw, I talked with the FBI agents, and *I quit*. In each case, quitting also meant being fired by the subcontractor [even though there was plenty of other work to do.] But the problems really seemed that severe. After this, funding got cut for the entire project. Nonetheless, it seems that the same project has raised its ugly head in other forms since then.

I make a call to grace, for the alternative is more broken than you can imagine.

All trademarks and copyrights on this page are owned by their respective companies. The Rest 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.