The [[Secure Computer Systems/TCB - Trusted Computing Base|TCB]] or the operating system often supports access to unstructured data that is stored in files. However, a lot of structured data is stored in databases.

> [!Question]- What is different about data stored in and accessed from a DB?
> 1. Data is structured → Relational databases have relations/tables defined by a schema.
> 2. Queries or transactions are used to access data → Read or update transactions.

Traditionally, integrity is addressed via concurrency control and commit protocols.

## Securing databases

1. Authentication
    1. Users and roles → [[Role-based access control]]
2. Authorization
    1. [[Role-based access control]]
    2. Inference attacks → Multiple queries, each allowed by access control, may be combined to infer sensitive data

### Access control

We need to define principals (subjects), objects, and privileges, using an RBAC-like model.

Policy statement examples:

1. `GRANT “CREATE TABLE” to Manager [with admin option]`
2. `REVOKE “CREATE TABLE” from Manager`
3. `GRANT SELECT on TABLE to Manager [with Grant option]`
4. `REVOKE SELECT on TABLE from Manager`

If access is granted in a cascading way (e.g. Alice grants access to Bob, who grants access to Charlie), then revocation can also cascade: revoking Bob's access also removes the access that Bob passed on to Charlie.

Different access rights options:

1. Admin option
    1. Ownership access rights.
    2. The holder can revoke access even from users to whom it did not grant access.
    3. Cascading revocation does not apply.
2. Grant option
    1. Allows the holder to propagate the privilege and to revoke it from those it granted it to.
    2. If the holder's permission is revoked, the revocation cascades to everyone it granted to.

### Stored procedures

Like the transformation procedures of the [[Clark-Wilson policy]].

Procedure access can be

1. Definer
    1. If a user/role has definer access for a stored procedure, it can be executed without requiring separate privileges for the objects accessed by it.
2. Invoker
    1. Invoker access to a stored procedure requires privileges for the objects needed by the procedure for its execution.

### Views or *Virtual databases*

Views can be derived from a database. They contain a subset of the data that exists in the tables from which they are derived.

1. If a user/role has access to the relations from which a view is created, then that user/role has access to the view.
2. A user/role can grant access to views when it has grant privileges on the base relations.

### Inference attacks

Access control may not protect sensitive data in a database.

Databases have constraints. For example, some attribute $C = A + B$, where $A$ is the base salary, $B$ is the bonus, and $C$ is the total compensation. $A$ and $C$ may be public while $B$ is not. Here the constraint reveals the sensitive data, since $B = C - A$.

#### Functional dependency attack

A functional dependency exists when the value in a column $A$ determines the value in a column $B$. For example, suppose rank determines salary, and we do not wish to expose name and salary together. An attacker can query for name and rank, then for rank and salary, and combine the two results to find out each person's salary.

#### Statistical queries and aggregate results

- Provide aggregate information such as average GPA or salary.
- If a query does not disclose an individual user's sensitive data or violate company policy, access should be allowed because it does not lead to a breach of privacy.

> [!question]- Can sensitive data be released by making an inference of aggregate data?
> Multiple statistical queries can be combined in an inference attack that discovers a sensitive data value protected by access control.
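
To make the inference from aggregate data concrete, here is a minimal sketch in Python. The employee records, the `sum_salary` helper, and the choice of predicates are all hypothetical and only illustrate the idea: each query returns an aggregate over several people, yet the difference between two of them isolates one individual.

```python
# Hypothetical employee records: (name, department, gender, salary).
EMPLOYEES = [
    ("Alice", "CS", "F", 150_000),
    ("Bob", "CS", "M", 120_000),
    ("Charlie", "CS", "M", 110_000),
    ("Dana", "EE", "F", 100_000),
    ("Eve", "EE", "F", 105_000),
]


def sum_salary(predicate):
    """Statistical query: total salary over the rows matching the predicate.

    Only the aggregate is returned, never an individual row.
    """
    return sum(
        salary
        for (_name, dept, gender, salary) in EMPLOYEES
        if predicate(dept, gender)
    )


# Each query aggregates over more than one person, so each looks harmless.
cs_total = sum_salary(lambda dept, gender: dept == "CS")
cs_men_total = sum_salary(lambda dept, gender: dept == "CS" and gender == "M")

# Alice is the only woman in CS, so the difference is exactly her salary.
inferred_salary = cs_total - cs_men_total
```

Neither query names Alice, and access control would see two ordinary aggregate queries; the leak only appears when the results are combined.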
### Small/large query attack

- $N =$ total number of users in DB
- $n =$ threshold set by system
- $C =$ characteristic that identifies a group of users; $|C| =$ the number of users who satisfy this characteristic

Small/large query attack → A query $q$ computed over the users who satisfy characteristic $C$ is a small/large query when $|C|$ is either less than $n$ or close to $N$ (greater than $N - n$). A large query is a problem because subtracting its result from the result over all $N$ users gives the result for the small remaining group.

> [!Example] Attack mitigation
> Allow a query only when $n \le |C| \le N - n$

However, even with this restriction, we cannot completely mitigate the attack, as inferences can be made in other ways.

#### Tracker attack

Suppose in the diagram below that $C$ contains only 1 observation, so a query on $C$ itself is not allowed. We can query $C_1$ and $T = C_1 - C_2$ (the shaded region), both of which are legal.

![[attachments/Screenshot 2023-07-19 at 5.29.43 PM.png]]

We can now compute statistics about $C$ by subtraction:

- $C = C_1 - T$, so for a count or sum query $q$, $q(C) = q(C_1) - q(T)$

>[!bug] Given an unlimited number of statistical queries that return correct answers, all statistical databases can be compromised.

### Public database

The idea: if we just remove personally identifiable information, then we can publish all of the remaining data.

![[attachments/Screenshot 2023-07-19 at 5.33.34 PM.png]]

>[!faq]- What does it mean to preserve someone’s privacy?
>Whether or not someone’s data is included makes no difference to their privacy.

#### De-identification

We must remove Alice's identity information. This includes things like

- Name, SS#, address, DOB, biometric information, photographs.

However, attributes such as age and city may still be included.

Challenges to de-identification

1. Linking attacks must be made harder → A linking attack combines information from the public database with other available information to derive sensitive information.
2. Utility should not be compromised.

**Attacks on de-identified DB**

Some fields such as age, gender, and zip code can be used as a quasi-identifier (QID). A QID can be combined with publicly available information to figure out someone’s identity.
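
The linking attack on a QID described above can be sketched as a join between the published table and outside information. This is only an illustration: the `published` rows, the `voters` side list, and the `link` helper are hypothetical and not taken from any real dataset.

```python
# Hypothetical de-identified public release: (age, gender, zip_code, diagnosis).
published = [
    (34, "F", "30332", "asthma"),
    (34, "M", "30332", "diabetes"),
    (51, "F", "30309", "flu"),
]

# Hypothetical public side information: (name, age, gender, zip_code).
voters = [
    ("Alice", 34, "F", "30332"),
    ("Bob", 34, "M", "30332"),
    ("Carol", 51, "F", "30309"),
    ("Dan", 51, "M", "30309"),
]


def link(published_rows, side_rows):
    """Re-identify published rows whose QID matches exactly one known person."""
    names_by_qid = {}
    for name, age, gender, zip_code in side_rows:
        names_by_qid.setdefault((age, gender, zip_code), []).append(name)

    matches = {}
    for age, gender, zip_code, diagnosis in published_rows:
        names = names_by_qid.get((age, gender, zip_code), [])
        if len(names) == 1:  # a unique QID match pins down the individual
            matches[names[0]] = diagnosis
    return matches


reidentified = link(published, voters)
```

In this toy data every published row has a unique (age, gender, zip code) combination, so all three diagnoses are re-identified even though no names were published.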
We can make such attacks harder by replacing a specific value with a range of values (generalization).

### Anonymization

#### $k$-anonymity

For every QID value that appears in the table, at least $k$ different rows share that QID. Equivalently, $k$ is the lowest number of rows with the same QID.

Generalization can be used to increase $k$. Utility goes down as we increase privacy by increasing $k$.

Linking attacks are still possible if the sensitive data lacks sufficient diversity.

#### $l$-diversity

Each group of rows with the same QID must have at least $l$ distinct values in the sensitive data column. $l$ can be increased by further generalization.

A private database is transformed into a $(k,l)$ public DB, where increasing $k$ or $l$ increases privacy.
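
As a rough illustration of the two definitions, the sketch below measures $k$ and $l$ for a small generalized table. The rows, the column layout (QID columns first, sensitive column last), and the `k_and_l` helper are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical generalized table: (age_range, zip_prefix, diagnosis).
# The first two columns form the QID; the last column is the sensitive value.
rows = [
    ("30-39", "303**", "asthma"),
    ("30-39", "303**", "diabetes"),
    ("30-39", "303**", "asthma"),
    ("50-59", "303**", "flu"),
    ("50-59", "303**", "flu"),
]


def k_and_l(table):
    """Return (k, l): the smallest QID group size and the smallest number
    of distinct sensitive values found within any QID group."""
    groups = defaultdict(list)
    for *qid, sensitive in table:
        groups[tuple(qid)].append(sensitive)

    k_value = min(len(values) for values in groups.values())
    l_value = min(len(set(values)) for values in groups.values())
    return k_value, l_value


k_value, l_value = k_and_l(rows)
```

Here the table is 2-anonymous but only 1-diverse: everyone in the `50-59` group has the same diagnosis, so linking a person to that group reveals their sensitive value even though they cannot be singled out within it.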