Operator

class bigquery_operator.operator.Operator(client: Client, dataset_id: str)[source]

Bases: object

Wrapper for usual operations on a fixed BigQuery dataset.

Parameters:
  • client (google.cloud.bigquery.client.Client) – Client to manage connections to the BigQuery API.

  • dataset_id (str) – The dataset id in the format ‘project_id.dataset_name’.

build_table_id(table_name: str) str[source]

Return a table id.

Parameters:

table_name (str) – A table name.

Returns:

A table id in the format ‘dataset_id.table_name’ where

dataset_id is the argument passed to the __init__ method.

Return type:

str

clean_dataset() None[source]

Delete all the tables from the dataset.

property client: Client

The client.

Type:

google.cloud.bigquery.client.Client

property client_project_id: str

The id of the project which the client acts on behalf of.

Type:

str

copy_table(source_table_name: str, destination_table_name: str, time_to_live: int | None = None, source_dataset_id: str | None = None, write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') None[source]

Copy a table. source_dataset_id must be given in the format ‘project_id.dataset_name’. If not passed, falls back to self.dataset_id.

copy_tables(source_table_names: List[str], destination_table_names: List[str], time_to_live: int | None = None, source_dataset_id: str | None = None, write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') None[source]

Copy tables. source_dataset_id must be given in the format ‘project_id.dataset_name’. If not passed, falls back to self.dataset_id.

create_dataset(location: str) None[source]

Create the dataset.

create_dataset_if_not_exist(location: str) None[source]

Create the dataset if it does not exist. Otherwise, check that the actual location of the existing dataset equals the location specified in the argument.

create_empty_table(table_name: str, pre_delete_if_exists: bool | None = False, time_to_live: int | None = None, schema: List[SchemaField] | None = None, time_partitioning: TimePartitioning | None = None, range_partitioning: RangePartitioning | None = None, require_partition_filter: bool | None = None, clustering_fields: List[str] | None = None) None[source]

Create a empty table. Only specify at most one of time_partitioning or range_partitioning.

create_view(query: str, destination_table_name: str, pre_delete_if_exists: bool | None = False, time_to_live: int | None = None) None[source]

Create a view.

dataset_exists() bool[source]

Return True if the dataset exists.

property dataset_id: str

The dataset id in the format ‘project_id.dataset_name’.

Type:

str

property dataset_name: str

The dataset name.

Type:

str

property dataset_project_id: str

The project id of the dataset.

Type:

str

delete_dataset() None[source]

Delete the dataset.

delete_table(table_name: str) None[source]

Delete a table.

delete_table_if_exists(table_name: str) None[source]

Delete a table if it exists.

delete_table_if_mismatches(reference: str, table_name: str) None[source]

Delete a table if the format attributes of the table and the reference table are not the same. The format attributes are given by the method get_format_attributes.

extract_table(source_table_name: str, destination_uri: str, compression: str | None = None, field_delimiter: str | None = '|', print_header: bool | None = True) None[source]

Extract a table.

extract_tables(source_table_names: List[str], destination_uris: List[str], compression: Compression | None = None, field_delimiter: str | None = '|', print_header: bool | None = True) None[source]

Extract tables from BigQuery to Storage. Each source table is extracted as one or more CSV files.

get_columns(table_name: str) List[str][source]

Return the column names of a table.

get_dataset() Dataset[source]

Get the dataset. An api call is made.

get_format_attributes(table_name)[source]

Return the following table attributes: schema, time_partitioning, range_partitioning, require_partition_filter, clustering_fields.

get_query_rows(query: str) List[Row][source]

Return the rows of a query.

get_table(table_name: str) Table[source]

Get a table. An api call is made.

get_table_rows(table_name: str) List[Row][source]

Return the rows of a table.

instantiate_dataset() Dataset[source]

Instantiate the dataset. No api call is made.

instantiate_table(table_name: str) Table[source]

Instantiate a table. No api call is made.

list_tables() List[str][source]

List the names of the tables in the dataset.

load_table(source_uri: str, destination_table_name: str, time_to_live: int | None = None, schema: List[SchemaField] | None = None, field_delimiter: str | None = '|', write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') None[source]

Load one or more Storage CSV files into one BigQuery table.

load_tables(source_uris: List[str], destination_table_names: List[str], time_to_live: int | None = None, schemas: List[List[SchemaField]] | None = None, field_delimiter: str | None = '|', write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') None[source]

Load Storage CSV files into BigQuery tables.

run_queries(queries: List[str], destination_table_names: List[str], sample_size: int | None = None, time_to_live: int | None = None, write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') dict[source]

Run queries. Return monitoring as a dict in the format {‘duration’: d, ‘GB’: gb} where d is the execution duration in seconds and gb the number of gigabytes processed by the queries.

run_query(query: str, destination_table_name: str, sample_size: int | None = None, time_to_live: int | None = None, write_disposition: WriteDisposition | None = 'WRITE_TRUNCATE') dict[source]

Run a query. Return monitoring as a dict in the format {‘duration’: d, ‘GB’: gb} where d is the execution duration in seconds and gb the number of gigabytes processed by the query.

static sample_query(query: str, size: int) str[source]

Sample randomly a query.

The output query gives a subset of the lines given by the input query. This subset has approximately size lines. Nonetheless, the number of gigabytes processed is the same for the output query and the input query.

set_time_to_live(table_name: str, nb_days: int, retry_delays: Tuple[int, ...] = (10, 30)) None[source]

Set the time to live of a table in days. More precisely the expires attribute of the table is set to UTC midnight between (today + nb_days) and (today + nb_days + 1), if it has not already this value.

We have noticed that some unexpected google.api_core.exceptions.PreconditionFailed can happen. That is why we try to update the table several times. If this exception is raised, we try again after a delay. The retry delays are specified in seconds in the argument retry_delays.

table_exists(table_name: str) bool[source]

Return True if the table exists.

table_is_empty(table_name: str) bool[source]

Return True if the table is empty.