Database Outline
Disk Manager
Responsibilities:
- represents the database in files on disk
- manages its memory and moves data back and forth between memory and disk
-
How the DBMS represents the database in files on disk
-
File storage
-
n-ary storage model (row store, suited for OLTP)
-
advantage:
fast inserts, updates, and deletes
good for queries that need the entire tuple
-
disadvantage:
Not good for scanning large portions of the table and/or a subset of the attributes
-
-
decomposition storage model (column store, suited for OLAP)
Tuple identification
-
Fixed-length offset
Each value is the same length for an attribute
-
Embedded tuple ids
Each value is stored with its tuple id in a column
-
advantage:
reduces the amount of wasted I/O because the DBMS only reads the data that it needs
better query processing and data compression
-
disadvantage:
slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching
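A minimal sketch (hypothetical data, names illustrative) of the two layouts, showing why the column store reads less for a scan over a single attribute:

```python
# Hypothetical sketch: the same tuples laid out in the n-ary (row) model
# versus the decomposition (column) model.
tuples = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]

# NSM: all attributes of each tuple stored contiguously -- good for point
# lookups that need the entire tuple.
nsm = [attr for tup in tuples for attr in tup]

# DSM: each attribute stored contiguously across tuples -- good for scans
# that touch only a subset of the attributes.
dsm = {col: [tup[col] for tup in tuples] for col in range(3)}

# Scanning just the third attribute touches one contiguous array in DSM,
# but every third slot in NSM.
ages_dsm = dsm[2]
ages_nsm = nsm[2::3]
```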
-
-
-
Page layout
-
Slotted pages
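A hypothetical sketch of a slotted page (class and field names are illustrative): a slot array at the front of the page maps slot ids to byte offsets of variable-length tuples packed backwards from the end.

```python
PAGE_SIZE = 4096

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []            # slot array: (offset, length) per tuple
        self.free_end = PAGE_SIZE  # tuple data grows backwards from the end

    def insert(self, tup: bytes) -> int:
        self.free_end -= len(tup)
        self.data[self.free_end:self.free_end + len(tup)] = tup
        self.slots.append((self.free_end, len(tup)))
        return len(self.slots) - 1  # slot id identifies the tuple

    def get(self, slot_id: int) -> bytes:
        off, length = self.slots[slot_id]
        return bytes(self.data[off:off + length])

page = SlottedPage()
sid = page.insert(b"hello")
```

The indirection through the slot array lets the page compact or move tuple bytes without changing the slot ids handed out to callers.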
-
Log-structured file organization
Log-structured compaction (leveled compaction or universal compaction)
-
-
Tuple layout
-
Data representation
Type -> Representation
INTEGER/BIGINT/SMALLINT/TINYINT -> native C/C++ types
FLOAT/REAL vs. NUMERIC/DECIMAL -> IEEE-754 standard vs. fixed-point decimals
VARCHAR/VARBINARY/TEXT/BLOB -> header with length, followed by data bytes
TIME/DATE/TIMESTAMP -> 32/64-bit integer of (micro)seconds since Unix epoch
-
External value storage -> BLOB
The DBMS cannot manipulate the contents of an external file, and external files lack durability & transaction protections.
-
-
-
How the DBMS manages its memory and moves data back and forth between memory and disk
-
Spatial Control: (where)
Goal: keep pages that are used together often as physically close together as possible on disk
-
Temporal Control: (when)
Goal: minimize the number of stalls from having to read data from disk.
-
Buffer Pool Manager
Memory region organized as an array of fixed-size pages.
An array entry is called a frame.
The page table keeps track of pages that are currently in memory and also maintains additional meta-data per page.
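A hypothetical sketch of a tiny buffer pool with an LRU page table (names and the `read_from_disk` callback are illustrative): the page table maps page ids to in-memory frames and is consulted before any disk read.

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.page_table = OrderedDict()  # page_id -> frame contents, LRU order

    def fetch(self, page_id, read_from_disk):
        if page_id in self.page_table:           # hit: move to MRU position
            self.page_table.move_to_end(page_id)
            return self.page_table[page_id]
        if len(self.page_table) >= self.num_frames:
            self.page_table.popitem(last=False)  # evict least-recently-used
        self.page_table[page_id] = read_from_disk(page_id)
        return self.page_table[page_id]

pool = BufferPool(num_frames=2)
pool.fetch(1, lambda p: f"page-{p}")
pool.fetch(2, lambda p: f"page-{p}")
pool.fetch(1, lambda p: f"page-{p}")  # hit: page 1 becomes most recent
pool.fetch(3, lambda p: f"page-{p}")  # evicts page 2, not page 1
```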
-
Allocation policies
-
Global policies
Make decisions for all active txns.
-
Local policies
Allocate frames to a specific txn without considering the behavior of concurrent txns. Still need to support sharing pages.
-
-
Optimization
-
multiple buffer pools
Helps reduce latch contention and improve locality. Use embedded object ids or hashing for mapping.
-
pre-fetching
-
scan sharing
-
buffer pool bypass
The sequential scan operator will not store fetched pages in the buffer pool to avoid overhead.
-
-
Buffer replacement polices
- Least-Recently-Used
- Clock
LRU and CLOCK replacement policies are susceptible to sequential flooding.
sequential flooding
A query performs a sequential scan that reads every page and this pollutes the buffer pool with pages that are read once and then never again.
- LRU-K
- Localization
- Priority hints
Access Methods
Responsibilities:
- Data Organization (in memory / pages)
- Concurrency
-
Data structures designed for Disk-oriented database
- hash table
- hash function (how to map)
- CRC-64
- Google CityHash
- …
- hashing scheme (how to handle key collisions)
- static hashing scheme
- Linear probe hashing
- Robin Hood hashing
- Cuckoo hashing
- dynamic hashing scheme
- chained hashing
- extendible hashing
- linear hashing
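A hypothetical sketch of linear probe hashing (names illustrative): on a collision, scan the following slots until the key or an empty slot is found.

```python
class LinearProbeTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        # Linear probing: step to the next slot until we find the key
        # or an empty slot.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None

t = LinearProbeTable()
t.put("a", 1)
t.put("b", 2)
```

A real implementation would also track table occupancy and rebuild into a larger table before probing degenerates (static hashing schemes cannot grow incrementally, which is what motivates the dynamic schemes above).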
- B+Tree
-
leaf node values
- Record Ids -> A pointer to the location of the tuple that the index entry corresponds to.
- Tuple Data -> The actual contents of the tuple is stored in the leaf node.
-
duplicate keys
- Append record id
- Overflow leaf nodes
-
node size
The slower the storage device, the larger the optimal node size for a B+Tree.
-
merge threshold
Delaying a merge operation may reduce the amount of reorganization.
-
variable length keys
-
pointers
-
variable length nodes (requires careful memory management)
-
padding
-
key map / indirection
Embed an array of pointers that map to the key + value list within the node.
-
-
intra-node search
-
linear
-
binary
-
interpolation
Approximate the location of the desired key based on the known distribution of keys.
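A sketch of interpolation search over a sorted node, assuming a roughly uniform key distribution (the function name is illustrative): the probe position is estimated from the min/max keys instead of always bisecting.

```python
def interpolation_search(keys, target):
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Estimate the offset from the known key distribution.
            pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```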
-
-
optimization
-
prefix compression
Instead of storing the entire key each time, extract common prefix and store only unique suffix for each key.
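A minimal sketch of the idea (using the stdlib `os.path.commonprefix` helper for brevity): extract the shared prefix of the sorted keys in a node and keep only the unique suffixes.

```python
import os.path

def prefix_compress(keys):
    # Sorted keys in the same node often share a prefix; store it once.
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

prefix, suffixes = prefix_compress(["robbed", "robbing", "robot"])
```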
-
deduplication
The leaf node can store the key once and then maintain a list of record ids with that key.
-
suffix truncation
Store a minimum prefix that is needed to correctly route probes into the index.
-
bulk insert
To build a new B+Tree for an existing table, first sort the keys and then build the index from the bottom up.
-
pointer swizzling
If a page is pinned in the buffer pool, then we can store raw pointers instead of page ids.
-
-
Trie tree
-
Trie key span
The span of a trie level is the number of bits that each partial key / digit represents.
-
-
Radix tree
-
Trie variant
- Judy Arrays (HP)
- Linear Node
- Bitmap Node
- Uncompressed Node
- ART Index (HyPer)
- Masstree (Silo)
-
Data structure designed for in-memory databases
-
T Trees
Based on AVL Trees. Instead of storing keys in nodes, store pointers to their original values.
Advantages:
- Uses less memory because it does not store keys inside of each node.
- The DBMS evaluates all predicates on a table at the same time when accessing a tuple (i.e., not just the predicates on indexed attributes).
Disadvantages:
- Difficult to rebalance.
- Difficult to implement safe concurrent access.
- Must chase pointers when scanning range or performing binary search inside of a node. (greatly hurts cache locality)
-
BW Tree
Latch-free B+Tree index built for the Microsoft Hekaton project.
- garbage collection
- Optimization (CMU open BW Tree)
- Pre-Allocated Delta Records
- Mapping Table Expansion
-
-
Index types
-
Partial index (PostgreSQL documentation 12, §11.8 “Partial Indexes”)
-
Covering index (PostgreSQL documentation 11, §11.9 “Index-Only Scans and Covering Indexes”)
-
index include column
-
functional / expression index (PostgreSQL documentation 12, §11.7 “Indexes on Expressions”)
-
Inverted index -> good at keyword searches
An inverted index stores a mapping of words to records that contain those words in the target attribute.
-
-
Index concurrency control
A protocol’s correctness criteria can vary:
- Logical Correctness: can a thread see the data that it is supposed to see?
- Physical Correctness: is the internal representation of the object sound?
-
Latch implementations
- Blocking OS mutex
- Test-and-Set Spin Latch (TAS)
- Reader-Writer Locks
- Adaptive Spinlock
- Queue-based Spinlock (MCS)
-
Hash table latching
- Page latches
- Slot latches
-
B+ tree latching
-
latch crabbing / coupling
Basic Idea:
- Get latch for parent
- Get latch for child
- Release latch for parent if “safe”. (A safe node is one that will not split or merge when updated.)
-
better latching algorithm
Instead of assuming that there will be a split/merge, optimistically traverse the tree using read latches. If you guess wrong, repeat traversal with the pessimistic algorithm.
-
B link-Tree Optimization
When a leaf node overflows, delay updating its parent node.
-
versioned latch coupling
Optimistic crabbing scheme where writers are not blocked on readers.
-
Operator Execution
-
Operator Algorithms
-
Sort
-
External merge sort
Pass #0: use B buffer pages to produce ⌈N / B⌉ sorted runs of size B
Passes #1,2,3…: merge B−1 runs at a time (i.e., a K-way merge)
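A worked version of this cost model: with B buffer pages and N data pages, the number of passes is 1 + ⌈log_{B−1}(⌈N/B⌉)⌉, and each pass reads and writes every page, so total cost is 2N·(# of passes).

```python
import math

def num_passes(N, B):
    runs = math.ceil(N / B)               # pass #0 produces ceil(N/B) runs
    if runs <= 1:
        return 1                          # everything fit; one pass suffices
    return 1 + math.ceil(math.log(runs, B - 1))  # each pass merges B-1 runs

def total_io(N, B):
    return 2 * N * num_passes(N, B)       # one read + one write per page/pass
```

For example, sorting N = 108 pages with B = 5 buffers gives ⌈108/5⌉ = 22 runs, 1 + ⌈log₄ 22⌉ = 4 passes, and 864 page I/Os.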
-
B+ tree sorting (clustered / unclustered)
-
Optimization
- Chunk I/O into large blocks to amortize costs.
- Double-buffering to overlap CPU and I/O.
-
-
Aggregations
We don’t always need the data to be ordered, so hashing is a better alternative (only need to remove duplicates; no sorting required and cheaper computation).
-
Sorting
-
Hashing
-
External hashing aggregation
Phase #1: partition
Divide tuples into buckets based on hash key and write them out to disk when they get full.
Phase #2: rehash
Build an in-memory hash table for each partition and compute the aggregation.
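A hypothetical sketch of the two phases for a COUNT aggregate (names illustrative; partitions live in memory here, where a real DBMS would spill them to disk):

```python
from collections import defaultdict

def external_hash_count(keys, num_partitions=4):
    # Phase #1: partition tuples by a first hash function h1.
    partitions = defaultdict(list)
    for k in keys:
        partitions[hash(k) % num_partitions].append(k)

    # Phase #2: rehash -- build an in-memory table per partition (a second
    # hash function h2; here Python's dict hashing) and aggregate.
    counts = {}
    for part in partitions.values():
        local = defaultdict(int)
        for k in part:
            local[k] += 1
        counts.update(local)   # partitions are disjoint, so no key clashes
    return counts
```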
-
-
-
Join
Join vs Cross-Product
Join is the most common operation and thus must be carefully optimized.
Cross-Product followed by a selection is inefficient because the cross-product is large.
-
Cost Analysis Criteria (determine whether one join algorithm is better than another)
Metric: # of IOs to compute join
-
Join Algorithm
- Nested loop join
- Sort-Merge join
- Hash join
Assume table R has M pages and m tuples, and table S has N pages and n tuples.
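The usual textbook I/O cost formulas for these algorithms can be written out directly (assuming table R has M pages / m tuples, table S has N pages / n tuples, B buffer pages, and enough memory for the hash join's partitions):

```python
import math

def simple_nested_loop_io(M, m, N):
    return M + m * N                       # scan S once per tuple of R

def block_nested_loop_io(M, N, B):
    return M + math.ceil(M / (B - 2)) * N  # scan S once per block of R

def sort_merge_io(M, N, sort_cost_R, sort_cost_S):
    return sort_cost_R + sort_cost_S + M + N  # sort both, then merge

def hash_join_io(M, N):
    return 3 * (M + N)                     # partition (read+write) + probe
```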
-
-
-
Query Processing Models
-
Iterator model (Volcano / Pipeline)
The operator implements a loop that calls Next on its children to retrieve their tuples and then processes them.
-
Materialization model (better for OLTP)
Each operator processes its input all at once and then emits its output all at once.
-
Vectorized / Batch model (better for OLAP)
Like the iterator model, but each operator emits a batch of tuples instead of a single tuple.
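The iterator (Volcano) model can be sketched as a chain of operators, each pulling one tuple at a time from its child via Next (class names are illustrative):

```python
class Scan:
    def __init__(self, tuples):
        self.it = iter(tuples)

    def next(self):
        return next(self.it, None)   # None signals end of input

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        # The loop that calls Next on the child until a tuple qualifies.
        while (t := self.child.next()) is not None:
            if self.predicate(t):
                return t
        return None

plan = Filter(Scan([1, 2, 3, 4, 5]), lambda t: t % 2 == 1)
out = []
while (t := plan.next()) is not None:
    out.append(t)
```

The vectorized model would change only the unit of exchange: each `next` call returns a batch of tuples instead of one.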
Access method
-
Sequential Scan
Optimizations
-
Prefetching
-
Buffer Pool Bypass
-
Parallelization
-
Heap Clustering
-
Zone Maps
Pre-computed aggregates for the attribute values in a page.
-
Late Materialization
Delay stitching together tuples until the upper parts of the query plan.
-
-
Index Scan
Picks an index to find the tuples that the query needs.
-
Multi-Index / “Bitmap” Scan
If there are multiple indexes that the DBMS can use for a query
- Compute sets of record ids using each matching index
- Combine these sets based on the query’s predicates (union / intersect)
- Retrieve the records and apply any remaining predicates.
Halloween problem can occur on clustered tables or index scans.
Anomaly where an update operation changes the physical location of a tuple, which causes a scan operator to visit the tuple multiple times.
The DBMS represents a WHERE clause as an expression tree (flexible but slow) for expression evaluation.
-
-
Runtime Architectures
-
Process per DBMS Worker
Relies on OS scheduler and shared memory.
A process crash doesn’t take down entire system.
-
Process pool
Relies on OS scheduler and shared memory.
Bad for CPU cache locality.
-
Thread per DBMS Worker
DBMS manages its own scheduling, may or may not use a dispatcher thread.
Thread crash (may) kill the entire system.
-
-
Parallel Query Execution
-
Execution Parallelism
-
Inter-Query (Different queries are executed concurrently)
If multiple queries are updating the database at the same time, concurrency control is involved
-
Intra-Query (Execute the operations of a single query in parallel)
-
Intra-Operator (Horizontal)
Decompose operators into independent fragments that perform the same function on different subsets of data.
-
Inter-Operator (Vertical) (Pipelined parallelism)
Operations are overlapped in order to pipeline data from one stage to the next without materialization.
-
Bushy
Extension of inter-operator parallelism. Exchange nodes inserted over essentially independent query plan fragments allow those fragments to execute independently of one another.
-
-
-
I/O Parallelism
- Multiple Disks per Database
- One Database per Disk
- One Relation per Disk
- Split Relation across Multiple Disks
Partition
Split single logical table into disjoint physical segments that are stored/managed separately.
- vertical partitioning (store a table’s attributes in a separate location)
- horizontal partitioning (divide the tuples of a table up into disjoint segments based on some partitioning key)
- hash partitioning
- range partitioning
- predicate partitioning
-
Query planning
-
Heuristics / Rules (Rewrite the query to remove stupid / inefficient things)
Two relational algebra expressions are equivalent if they generate the same set of tuples. The DBMS can identify better query plans without a cost model. This is often called query rewriting.
An optimizer implemented using if/then/else clauses or a pattern-matching rule engine transforms a query’s expressions (e.g., WHERE clause predicates) into the optimal/minimal set of expressions.
-
Logical query optimization
-
split conjunctive predicates
Decompose predicates into their simplest forms to make it easier for the optimizer to move them around.
-
predicate pushdown
Move the predicate to the lowest point in the plan after Cartesian products.
-
replace cartesian products with joins
Replace all Cartesian Products with inner joins using the join predicates
-
projection pushdown
Eliminate redundant attributes before pipeline breakers to reduce materialization cost.
For nested sub-queries
- rewrite to de-correlate / flatten them
- decompose the nested query and store its result in a temporary table
-
-
-
Cost-based Search (Use a model to estimate the cost of executing a plan)
-
cost model component
- Physical costs (predict CPU cycles, I/O, …; heavily depends on hardware)
- Logical costs (estimate operator result sizes, independent of the operator algorithm)
- Algorithm costs (complexity of the operator algorithm implementation)
-
selectivity estimations
statistic:
For each relation R, the DBMS maintains the following information:
- N_R: the number of tuples in R.
- V(A,R): the number of distinct values for attribute A.
The selection cardinality SC(A,R) = N_R / V(A,R) is the average number of records with a given value for attribute A.
The selectivity (sel) of a predicate P is the fraction of tuples that qualify.
Formula depends on type of predicate (Equality / Range / Negation / Conjunction / Disjunction) and relies on the following assumptions
-
Assumption #1: Uniform data
The distribution of values (except for the heavy hitters) is the same.
-
Assumption #2: Independent predicates
The predicates on attributes are independent.
-
Assumption #3: Inclusion principle
The domain of join keys overlap such that each key in the inner relation will also exist in the outer table.
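The standard formulas under these assumptions can be written out directly (uniform data gives the equality case; independence gives the conjunction case):

```python
def sel_equality(V):          # A = constant  ->  1 / V(A,R)
    return 1.0 / V

def sel_negation(sel_p):      # NOT P         ->  1 - sel(P)
    return 1.0 - sel_p

def sel_conjunction(s1, s2):  # P1 AND P2     ->  sel(P1) * sel(P2)
    return s1 * s2

def sel_disjunction(s1, s2):  # P1 OR P2      ->  s1 + s2 - s1 * s2
    return s1 + s2 - s1 * s2
```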
-
non-uniform approximation
- equi-width histogram (all buckets have the same width)
- equi-depth histogram (vary the width of buckets so that the total number of occurrences for each bucket is roughly the same)
- sketches (probabilistic data structures that generate approximate statistics about a data set)
- sampling
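A minimal sketch contrasting the two histogram shapes (function names illustrative): equi-width fixes the bucket boundaries and lets the counts vary, while equi-depth picks boundaries so each bucket holds roughly the same number of values.

```python
def equi_width(values, num_buckets, lo, hi):
    # Same width per bucket; counts vary with the data distribution.
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        counts[min(int((v - lo) / width), num_buckets - 1)] += 1
    return counts

def equi_depth(sorted_values, num_buckets):
    # Boundaries chosen so each bucket holds ~n/num_buckets values.
    n = len(sorted_values)
    return [sorted_values[(i * n) // num_buckets]
            for i in range(1, num_buckets)]
```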
-
query optimization
-
Single relation query planning
- Pick the best access method.
- Predicate evaluation ordering.
-
Multiple relation query planning
- Enumerate relation orderings (need to restrict search space)
- Enumerate the plans for each operator
- Enumerate access method choices for each table
Use dynamic programming to reduce the number of cost estimations.
-
Nested sub-queries
-
-
Concurrency Control
Transaction
Correctness criteria (ACID):
Atomicity, Consistency, Isolation, Durability
-
Atomicity
- Logging
- Shadow Paging
-
Consistency
-
database consistency (accurately models the real world and follows integrity constraints)
-
transaction consistency
If the database is consistent before the transaction starts (running alone), it will also be consistent after. It is the application’s responsibility.
-
-
Isolation
A concurrency control protocol is how the DBMS decides the proper interleaving of operations from multiple transactions.
- pessimistic
- optimistic
Interleaved execution anomalies
- Read-Write Conflict -> unrepeatable read
- Write-Read Conflict -> dirty read
- Write-Write Conflict -> overwrite uncommitted data
Schedule correctness judgement
A schedule is correct if it is equivalent to some serial execution.
-
Serial schedule
A schedule that does not interleave the actions of different transactions.
-
Equivalent schedule
For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule.
-
Serializable schedule
A schedule that is equivalent to some serial execution of the transactions.
Different levels of serializability
-
Conflict Serializability (most DBMSs try to support it)
Verify using either the “swapping” method or dependency graphs.
-
View Serializability (no DBMS can do this)
No efficient way to verify.
Serial ⫋ Conflict Serializable ⫋ View Serializable ⫋ All Schedules
-
Durability
- logging
- shadow paging
Pessimistic
-
Basic Lock Type:
- S-LOCK
- X-LOCK
-
Two-phase locking
Phase #1. Growing
Phase #2. Shrinking
Problem: it is subject to cascading aborts.
-
Strong strict 2PL
Release all locks at end of txn.
Allows only conflict serializable schedules.
-
2PL deadlocks
-
deadlock detection (waits-for graph)
- deadlock handling:
- victim selection
- rollback length: completely or minimally
-
deadlock prevention
Assign priorities based on timestamps: Older Timestamp = Higher Priority
- wait-die (“old waits for young”)
If the requesting txn has higher priority than the holding txn, the requesting txn waits for the holding txn; otherwise the requesting txn aborts.
- wound-wait (“young waits for old”)
If the requesting txn has higher priority than the holding txn, the holding txn aborts and releases its lock; otherwise the requesting txn waits.
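The two policies reduce to a small decision function over timestamps (a sketch; lower timestamp = older txn = higher priority):

```python
def wait_die(requesting_ts, holding_ts):
    # Old waits for young; young dies (aborts) instead of waiting.
    return "wait" if requesting_ts < holding_ts else "abort_requesting"

def wound_wait(requesting_ts, holding_ts):
    # Old wounds (aborts) young; young waits for old.
    return "abort_holding" if requesting_ts < holding_ts else "wait"
```

In both policies only the younger txn ever aborts, so there is no cycle in who waits for whom and deadlock is impossible.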
-
-
Lock granularities
Trade-off between parallelism versus overhead.
Fewer locks, larger granularity vs. more locks, smaller granularity
-
intention locks
-
Intention-Shared (IS)
-
Intention-Exclusive (IX)
-
Shared+Intention-Exclusive (SIX)
-
-
Optimistic
Timestamp Ordering (T/O)
-
timestamp allocation
- system clock
- logical counter
- hybrid
-
Basic Timestamp Ordering (T/O) Protocol
Thomas Write Rule (reduces aborts):
If TS(Ti) < W-TS(X), ignore the write to allow the txn to continue executing without aborting.
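A sketch of the basic T/O write check with the Thomas Write Rule applied (object representation is illustrative): a write that arrives after a later write is silently skipped instead of aborting the txn.

```python
def write(obj, ts, value):
    if ts < obj["r_ts"]:
        return "abort"   # a later txn already read the value we would clobber
    if ts < obj["w_ts"]:
        return "skip"    # Thomas Write Rule: ignore the stale write
    obj["value"], obj["w_ts"] = value, ts
    return "ok"

x = {"value": 0, "r_ts": 0, "w_ts": 5}
```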
-
Optimistic Concurrency Control
- Read Phase
- Validation Phase
- Backward validation
- Forward validation
- Write Phase
-
Phantom problem
- Re-Execute Scans
- Predicate Locking
- Index Locking


MVCC
Snapshot isolation
Write skew anomaly
-
Concurrency control protocol (the default is Timestamp ordering)
-
Version storage
-
Append only storage
chain ordering (oldest to newest / newest to oldest)
-
Time-travel storage
-
Delta storage
-
-
Garbage collection
-
Look for expired versions
-
Version tracking
-
Tuple level
background vacuuming vs. cooperative cleaning
-
Transaction level
-
Epochs
-
-
-
Decide when it is safe to reclaim memory
- Frequency
- Periodically
- Continuously
- Granularity
- Single version
- Group version
- Tables
- Comparison unit (determine whether version(s) are reclaimable)
- Timestamp
- Interval
-
-
Index management
- Logical pointers
- Physical pointers
Modern MVCC Implementation
- Hekaton (SQL Server)
- TUM HyPer
- SAP HANA
- CMU Cicada
Recovery
Ensure database consistency, transaction atomicity, and durability despite failures
-
Actions during normal txn processing to ensure that the DBMS can recover from a failure.
-
Failure classification
-
Transaction failures
-
logical errors
Transaction cannot complete due to some internal error condition (e.g., integrity constraint violation).
-
internal state errors
DBMS must terminate an active transaction due to an error condition (e.g., deadlock).
-
-
System failures
-
software failure
Problem with the OS or DBMS implementation (e.g., uncaught divide-by-zero exception).
-
hardware failure
The computer hosting the DBMS crashes (e.g., power plug gets pulled).
-
-
Storage media failures
-
non-repairable hardware failure
A head crash or similar disk failure destroys all or part of non-volatile storage.
Destruction is assumed to be detectable (e.g., disk controllers use checksums to detect failures).
-
-
-
Undo & Redo
Undo: The process of removing the effects of an incomplete or aborted txn.
Redo: The process of re-instating the effects of a committed txn for durability.
-
Steal Policy
Whether the DBMS allows an uncommitted txn to overwrite the most recent committed value of an object in non-volatile storage
-
Force Policy
Whether the DBMS requires that all updates made by a txn are reflected on non-volatile storage before the txn can commit.
no-steal + force -> shadow paging
-
disadvantages:
copying the entire page table is expensive
commit overhead is high
steal + no-force -> write-ahead log
-
-
Logging schema
- physical logging
- logical logging
- physiological logging
-
Checkpoints
-
-
Actions after a failure to recover the database to a state that ensures atomicity, consistency, and durability.
Main ideas of ARIES
- WAL with steal / no-force
- fuzzy checkpoints (snapshot of dirty page ids)
- Redo everything since the earliest dirty page
- Undo txns that never commit
- Write CLRs when undoing, to survive failures during restarts
-
Write-Ahead Logging
-
Log Sequence Number (LSN)
-
-
Fuzzy Checkpointing
-
Non-fuzzy checkpoints
The DBMS halts everything when it takes a checkpoint to ensure a consistent snapshot.
-
Slightly better checkpoints
- Active transaction table (ATT)
- Dirty Page Table (DPT)
-
Fuzzy checkpoints
Allows active txns to continue to run while the system flushes dirty pages to disk.
Checkpoint boundaries
- begin (indicates start of checkpoint)
- end (contains ATT+DPT)
-
-
Recovery phase
Phase #1: Analysis
Phase #2: Redo
Phase #3: Undo
Others: Compression
-
granularity
- block level
- tuple level
- attribute level
- column level
-
columnar compression
-
null suppression
Consecutive zeros or blanks in the data are replaced with a description of how many there were and where they existed.
-
run-length encoding
Compress runs of the same value in a single column into triplets:
- the value of the attribute
- the start position in the column segment
- the # of elements in the run
Requires the columns to be sorted intelligently to maximize compression opportunities.
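A minimal sketch of encoding a column into exactly these (value, start position, run length) triplets, using `itertools.groupby` to find the runs:

```python
from itertools import groupby

def rle_encode(column):
    out, pos = [], 0
    for value, run in groupby(column):
        length = len(list(run))
        out.append((value, pos, length))  # (value, start position, # elements)
        pos += length
    return out

def rle_decode(triplets):
    return [v for v, _, length in triplets for _ in range(length)]
```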
-
bitmap encoding
Store a separate bitmap for each unique value for an attribute where an offset in the vector corresponds to a tuple.
-
delta encoding
Recording the difference between values that follow each other in the same column.
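A minimal sketch of delta encoding (function names illustrative): store the first value plus the differences, which are small and compress well for data like timestamps sampled at a fixed interval.

```python
def delta_encode(values):
    # First value verbatim, then each value as a delta from its predecessor.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)  # running sum reconstructs the originals
    return out
```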
-
incremental encoding
Type of delta encoding that avoids duplicating common prefixes/suffixes between consecutive tuples. This works best with sorted data.
-
mostly encoding
When values for an attribute are “mostly” less than the largest size, store them as a smaller data type.
-
dictionary encoding
Replace frequent patterns with smaller codes.
- dictionary construction
- all at once
- incremental
- dictionary scope
- block level
- table level
- multi-table
- dictionary data structures
- array
- hash table
- b+ tree
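A hypothetical sketch of an all-at-once, array-style dictionary (names illustrative). Building the dictionary from sorted unique values makes it order-preserving, so range predicates can be evaluated directly on the codes.

```python
def build_dictionary(values):
    # Sorted unique values -> codes preserve the original sort order.
    return {v: code for code, v in enumerate(sorted(set(values)))}

values = ["paris", "berlin", "paris", "oslo", "berlin"]
dictionary = build_dictionary(values)
encoded = [dictionary[v] for v in values]
```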
-