ADSI Notes - Diogo Lopes

Lecture 4 - Query Processing

Last updated Mar 17, 2023

Slides

# Query Processing

# Basic Steps in Query Processing

  1. Parsing and translation – Translate the query into its internal form, which is then translated into relational algebra. The parser checks the syntax and verifies that the referenced relations exist.
  2. Optimization – Construct an execution plan that minimizes the cost of query evaluation.
  3. Evaluation – The evaluation engine takes an execution plan, executes that plan, and returns the answers to the query.

The system’s goal is to create the best execution plan possible.
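As a small illustration (using a hypothetical `instructor` relation, not one from the lecture), the query `SELECT salary FROM instructor WHERE salary < 75000` can be translated into either of two equivalent relational-algebra expressions,

$$\sigma_{salary < 75000}(\Pi_{salary}(instructor)) \qquad \text{or} \qquad \Pi_{salary}(\sigma_{salary < 75000}(instructor)),$$

and the optimizer picks whichever leads to the cheaper execution plan.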

# Selection Operation

# Algorithm A1 (linear search, across all records)

Scan each file block and test all records to see whether they satisfy the selection condition. Assumes the data is sequential on disk.
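A minimal Python sketch of A1, assuming the file is exposed as an iterable of blocks and each block is a list of record dicts (the names and layout are invented for illustration):

```python
def linear_search(blocks, predicate):
    """A1: scan every file block and test every record against the condition."""
    for block in blocks:          # one block transfer per block
        for record in block:      # records are tested in memory
            if predicate(record):
                yield record

# Toy usage with a "file" of two blocks of record dicts:
file_blocks = [
    [{"id": 1, "dept": "CS"}, {"id": 2, "dept": "Math"}],
    [{"id": 3, "dept": "CS"}],
]
cs_records = list(linear_search(file_blocks, lambda r: r["dept"] == "CS"))
```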

# A2 (clustered index, equality on key (attr with unique values)).

Usually used when an index on the primary key is set up. Retrieve the single record that satisfies the corresponding equality condition.

# A3 (clustered index, equality on non-key)

Retrieve multiple records.

# A4 (non-clustered index, equality on key/non-key)

Retrieve a single record if the search key is a candidate key; retrieve multiple records otherwise.

It does not matter whether an index is clustered or not if you only have to retrieve ONE record. When multiple records match, however, a clustered index needs only one seek (the matching records are contiguous), while a non-clustered index may need one seek per record.
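A toy sketch of equality selection through an index (A2–A4), assuming a plain dict stands in for the index and maps a search-key value to the record pointers (RIDs) of the matching records; all names are invented:

```python
def index_equality_select(index, key_value, fetch_record):
    """A2-A4: look up the search-key value in the index, then fetch the records."""
    rids = index.get(key_value, [])       # record pointers for matching records
    # Clustered index: matching records are stored contiguously, so fetching
    # them costs roughly one seek. Non-clustered index: each RID may require
    # its own seek.
    return [fetch_record(rid) for rid in rids]

# Toy usage: the "index" maps dept -> RIDs, the "heap file" maps RID -> record.
heap_file = {0: {"id": 1, "dept": "CS"}, 1: {"id": 2, "dept": "Math"}}
dept_index = {"CS": [0], "Math": [1]}
print(index_equality_select(dept_index, "CS", heap_file.__getitem__))
```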

# Selection Involving Comparisons

Selections involving comparisons, such as > or <, can be implemented by using linear scans or indices:

# A5 (clustered index, comparison)

Retrieve multiple records.

# A6 (non-clustered index, comparison)

Retrieve multiple records.
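A sketch of a comparison selection (A5/A6), assuming the index leaf level is available as a sorted list of (key, RID) pairs; `bisect` finds the starting point and the rest is a forward scan:

```python
import bisect

def index_comparison_select(sorted_entries, low, fetch_record):
    """A5/A6: selection of the form attr >= low using a sorted (key, RID) index."""
    keys = [key for key, _ in sorted_entries]
    start = bisect.bisect_left(keys, low)        # first entry with key >= low
    for _, rid in sorted_entries[start:]:        # scan the remaining entries in order
        yield fetch_record(rid)                  # clustered: sequential blocks;
                                                 # non-clustered: one fetch per RID
```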

# Implementation of Complex Selections

Composite index: index on multiple columns at the same time

# A7 (conjunctive selection using one index).

Select one of the conditions, and an algorithm among A1 through A6, such that the combination results in the least cost. Test the remaining conditions on each tuple after fetching it into memory.

# A8 (conjunctive selection using composite index).

Use an appropriate composite (multi-column) index if one is available.

# A9 (conjunctive selection by intersection of identifiers).

Requires indices with record pointers. Use the corresponding index for each condition and take the intersection of all the obtained sets of record pointers, then fetch the records from the file. If some conditions do not have appropriate indices, test them in memory after fetching.

# A10 (disjunctive selection by union of identifiers).

Applicable only if all conditions have available indices (otherwise use a linear scan). Use the corresponding index for each condition and take the union of all the obtained sets of record pointers, then fetch the records from the file.
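A sketch of A9 and A10, assuming each available index can return the set of RIDs satisfying its condition; the function names and arguments are invented for illustration:

```python
def conjunctive_select(rid_sets, fetch_record, residual_test=None):
    """A9: intersect the RID sets obtained from each indexed condition, then fetch."""
    rids = set.intersection(*map(set, rid_sets))
    records = (fetch_record(rid) for rid in rids)
    # Conditions without an appropriate index are tested in memory after fetching.
    return [r for r in records if residual_test is None or residual_test(r)]

def disjunctive_select(rid_sets, fetch_record):
    """A10: union the RID sets obtained from each indexed condition, then fetch."""
    rids = set.union(*map(set, rid_sets))
    return [fetch_record(rid) for rid in rids]
```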

# Sorting

Usually we cannot bring all the data into memory to sort it.

# External Sort-Merge

# Steps

  1. RUNS: Load as many records as fit in memory (3 in the lecture’s example), sort them, write the sorted run to disk, repeat
    • Now we have multiple sorted temporary files on disk
  2. MERGE: Pick the smallest record from one file, compare it with the smallest records from the other files, and write the smallest to the output file on disk
  3. Pick the next smallest record from that file, compare it to the smallest records in the other files, and output the smallest
  4. Repeat… until the fully sorted output is on disk

This allows us to sort files larger than our memory (see the sketch below). Each pass involves I/O operations, and the number of runs shrinks by a constant factor with each merge pass (a factor of 2 for the two-way merge in the example), so the number of passes is logarithmic.
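A compact in-memory sketch of the two phases, assuming the relation is given as a flat list of records and `memory_capacity` records fit in memory; a real system would write each run to a temporary file on disk instead of keeping it in a Python list:

```python
import heapq

def external_sort_merge(records, memory_capacity):
    """External sort-merge: build sorted runs, then merge them (M-1)-at-a-time."""
    # Phase 1 (RUNS): sort each chunk that fits in memory and "write" it out as a run.
    runs = [
        sorted(records[i:i + memory_capacity])
        for i in range(0, len(records), memory_capacity)
    ]
    # Phase 2 (MERGE): repeatedly merge up to (memory_capacity - 1) runs,
    # keeping one block of memory free for the merged output.
    fan_in = max(memory_capacity - 1, 2)
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)
        ]
    return runs[0] if runs else []

# With 3 records of memory (as in the lecture's example):
print(external_sort_merge([9, 4, 7, 1, 8, 2, 5], 3))   # [1, 2, 4, 5, 7, 8, 9]
```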

# Cost analysis

Let $b_r$ be the number of blocks containing records of relation $r$, and $M$ the number of memory blocks available for sorting.

  * Run generation: $2 b_r$ block transfers (every block is read once and written once).
  * Number of initial runs: $\lceil b_r / M \rceil$.
  * Each merge pass decreases the number of runs by a factor of $M - 1$, so the total number of merge passes is $\lceil \log_{M-1}(b_r / M) \rceil$.
  * Each merge pass reads and writes every block: $2 b_r$ block transfers per pass.
  * For the final pass we do not count the write cost: subtract $b_r$.

Putting it together, the total is $b_r \left( 2 \lceil \log_{M-1}(b_r / M) \rceil + 1 \right)$ block transfers.

# Cost of seeks:

In the lecture’s example, 1 seek reads 1 block, which holds 3 records.

  * During run generation: one seek to read each run and one seek to write each run: $2 \lceil b_r / M \rceil$ seeks.
  * During the merge phase: $2 b_r$ seeks per merge pass (except the final pass, which does not require a write).
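A small helper that plugs numbers into the formulas above; the example values are made up, not from the lecture, and it assumes b_r > M:

```python
import math

def sort_merge_cost(b_r, M):
    """Block transfers and seeks for external sort-merge (assumes b_r > M)."""
    passes = math.ceil(math.log(b_r / M, M - 1))              # merge passes
    transfers = b_r * (2 * passes + 1)                        # = 2*b_r + 2*b_r*passes - b_r
    seeks = 2 * math.ceil(b_r / M) + 2 * b_r * passes - b_r   # run generation + merges, no final write
    return transfers, seeks

# e.g. a relation of 10,000 blocks with 25 blocks of memory (made-up numbers):
print(sort_merge_cost(10_000, 25))   # (50000, 30800)
```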

# Join Operations

# Nested-Loop Join

# Block Nested-Loop Join

# Indexed Nested-Loop Join

# Merge-Join

  1. Check whether the current record from r matches the current record from s on the join attribute
  2. If yes, output the joined record
  3. If not, advance the relation with the smaller join-attribute value (this can mean many seeks, but we may be able to read multiple blocks into memory at once)
  4. Repeat… (a sketch follows below)
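A sketch of the merge-join idea for inputs already sorted on the join key, simplified to unique keys on both sides (the full algorithm also has to handle groups of equal keys); relations are assumed to be lists of (key, payload) tuples:

```python
def merge_join(r, s):
    """Merge-join two relations sorted on the join key (keys assumed unique)."""
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][0] == s[j][0]:        # match: output the joined record
            yield r[i] + s[j]
            i += 1
            j += 1
        elif r[i][0] < s[j][0]:       # advance the side with the smaller key
            i += 1
        else:
            j += 1

# r and s are both sorted on the first field:
print(list(merge_join([(1, "a"), (3, "b")], [(1, "x"), (2, "y"), (3, "z")])))
# [(1, 'a', 1, 'x'), (3, 'b', 3, 'z')]
```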

# Hash-Join

Cost analysis:

Because we READ-WRITE-READ (partition both relations out to disk, then read the partitions back in), a Hash-Join can be 3x SLOWER than a Merge-Join (assuming the tables are already sorted). This is the type of decision that the system has to make!
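A sketch of the in-memory core of a hash join (build a hash table on one input, probe with the other); a real hash join would first partition both relations to disk when they do not fit in memory, which is the extra read-write-read mentioned above:

```python
from collections import defaultdict

def hash_join(build_side, probe_side, key):
    """Build a hash table on the (smaller) build input, then probe it with the other."""
    table = defaultdict(list)
    for record in build_side:                     # build phase
        table[key(record)].append(record)
    for record in probe_side:                     # probe phase
        for match in table.get(key(record), ()):
            yield match + record

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (1, "y"), (3, "z")]
print(list(hash_join(r, s, key=lambda rec: rec[0])))   # [(1, 'a', 1, 'x'), (1, 'a', 1, 'y')]
```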

# Blocking Operations

Pipelining: as we generate results, we send them to the next stage

Materialization: we compute the results and store them on disk, then read them back and compute the next operation (saving intermediate results in between)
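A toy contrast between the two strategies, with Python generators standing in for pipelined operators and lists standing in for materialized intermediate results (purely illustrative):

```python
def select_op(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row          # pipelined: each result flows straight to the next stage

def project_op(rows, columns):
    for row in rows:
        yield {c: row[c] for c in columns}

rows = [{"id": 1, "dept": "CS"}, {"id": 2, "dept": "Math"}]

# Pipelining: nothing is stored; rows stream through both operators.
pipelined = list(project_op(select_op(rows, lambda r: r["dept"] == "CS"), ["id"]))

# Materialization: fully compute (and, in a real system, write to disk)
# the intermediate result before the next operator starts.
materialized = list(select_op(rows, lambda r: r["dept"] == "CS"))
result = list(project_op(materialized, ["id"]))
```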

Lecture 3 Indexing | Lecture 5 Query Optimization