google.cloud.operators.bigquery

This module contains async Google BigQuery operators.

Module Contents

Classes

BigQueryInsertJobOperatorAsync

Starts a BigQuery job asynchronously and returns the job id.

BigQueryCheckOperatorAsync

Performs checks against BigQuery asynchronously.

BigQueryGetDataOperatorAsync

Fetches data from a BigQuery table (alternatively fetches data for selected columns) and returns it in a Python list.

BigQueryIntervalCheckOperatorAsync

Checks asynchronously that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.

BigQueryValueCheckOperatorAsync

Performs a simple value check asynchronously using SQL code.

Attributes

BIGQUERY_JOB_DETAILS_LINK_FMT

class google.cloud.operators.bigquery.BigQueryInsertJobOperatorAsync(task_id, owner=conf.get('operators', 'DEFAULT_OWNER'), email=None, email_on_retry=conf.getboolean('email', 'default_email_on_retry', fallback=True), email_on_failure=conf.getboolean('email', 'default_email_on_failure', fallback=True), retries=conf.getint('core', 'default_task_retries', fallback=0), retry_delay=timedelta(seconds=300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, priority_weight=1, weight_rule=conf.get('core', 'default_task_weight_rule', fallback=WeightRule.DOWNSTREAM), queue=conf.get('operators', 'default_queue'), pool=None, pool_slots=1, sla=None, execution_timeout=None, on_execute_callback=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, pre_execute=None, post_execute=None, trigger_rule=TriggerRule.ALL_SUCCESS, resources=None, run_as_user=None, task_concurrency=None, max_active_tis_per_dag=None, executor_config=None, do_xcom_push=True, inlets=None, outlets=None, task_group=None, doc=None, doc_md=None, doc_json=None, doc_yaml=None, doc_rst=None, **kwargs)

Bases: airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator, airflow.models.baseoperator.BaseOperator

Starts a BigQuery job asynchronously and returns the job id. This operator works in the following way:

  • it calculates a unique hash of the job using the job's configuration, or a uuid if force_rerun is True

  • creates a job_id of the form

    [provided_job_id | airflow_{dag_id}_{task_id}_{exec_date}]_{uniqueness_suffix}

  • submits a BigQuery job using the job_id

  • if a job with the given id already exists, it tries to reattach to the job if the job is not done and its state is in reattach_states. If the job is done, the operator will raise an AirflowException.

Using force_rerun will submit a new job every time without attaching to already existing ones.
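As a hypothetical illustration (values invented for this page, not taken from the provider source): a task insert_query_job in a DAG example_dag with no provided job_id might produce a job_id such as airflow_example_dag_insert_query_job_2022_01_01T00_00_00_00_00_f6b5c2, where the suffix is the hash of the job configuration; characters not allowed in BigQuery job IDs (such as the colons in the execution date) are assumed to be replaced with underscores.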

For the job definition, see https://cloud.google.com/bigquery/docs/reference/v2/jobs

Parameters
  • configuration – The configuration parameter maps directly to BigQuery’s configuration field in the job object. For more details see https://cloud.google.com/bigquery/docs/reference/v2/jobs

  • job_id – The ID of the job. It will be suffixed with hash of job configuration unless force_rerun is True. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.

  • force_rerun – If True, the operator will use a hash of a uuid as the job ID suffix.

  • reattach_states – Set of BigQuery job’s states in case of which we should reattach to the job. Should be other than final states.

  • project_id – Google Cloud Project where the job is running

  • location – The location where the job is running.

  • gcp_conn_id – The connection ID used to connect to Google Cloud.

  • delegate_to – The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • cancel_on_kill – Flag indicating whether to cancel the hook's job when on_kill is called.

execute(self, context)

This is the main method to derive when creating an operator. Context is the same dictionary used for rendering Jinja templates.

Refer to get_template_context for more context.

execute_complete(self, context, event)

Callback invoked when the trigger fires; returns immediately. Relies on the trigger to throw an exception; otherwise it assumes execution was successful.
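A minimal usage sketch (a hedged example, not taken from the provider source: the import path, DAG name, and query are assumptions; the configuration dict maps directly to BigQuery's job configuration field as described above):

from datetime import datetime

from airflow import DAG
from astronomer.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperatorAsync

with DAG(
    dag_id='example_bigquery_insert_job',  # placeholder DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Submits the job and defers until it completes; returns the job id.
    insert_query_job = BigQueryInsertJobOperatorAsync(
        task_id='insert_query_job',
        configuration={
            'query': {
                'query': 'SELECT 1',  # placeholder query
                'useLegacySql': False,
            }
        },
        location='US',
    )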

class google.cloud.operators.bigquery.BigQueryCheckOperatorAsync

Bases: airflow.providers.google.cloud.operators.bigquery.BigQueryCheckOperator

Performs checks against BigQuery asynchronously. The SQL query is expected to return a single row; the check fails if any value in that row evaluates to False.

execute(self, context)
execute_complete(self, context, event)

Callback invoked when the trigger fires; returns immediately. Relies on the trigger to throw an exception; otherwise it assumes execution was successful.
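A minimal sketch in the style of the get_data example shown later on this page (the SQL and connection ID are placeholders; the check semantics are inherited from BigQueryCheckOperator):

check_count = BigQueryCheckOperatorAsync(
    task_id='check_count',
    sql='SELECT COUNT(*) FROM test_dataset.Transaction_partitions',  # placeholder table
    use_legacy_sql=False,
    gcp_conn_id='airflow-conn-id',
)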

class google.cloud.operators.bigquery.BigQueryGetDataOperatorAsync

Bases: airflow.providers.google.cloud.operators.bigquery.BigQueryGetDataOperator

Fetches data from a BigQuery table (alternatively fetches data for selected columns) and returns it in a Python list. The number of elements in the returned list equals the number of rows fetched. Each element in the list is itself a list, whose elements represent the column values for that row.

Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]

Note

If the fields passed to selected_fields are in a different order than the columns in the BQ table, the data will still be returned in the order of the BQ table. For example, if the BQ table has three columns [A, B, C] and you pass 'B,A' in selected_fields, the data will still be of the form 'A,B'.

Example:

get_data = BigQueryGetDataOperatorAsync(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results=100,
    selected_fields='DATE',
    gcp_conn_id='airflow-conn-id',
)
Parameters
  • dataset_id – The dataset ID of the requested table. (templated)

  • table_id – The table ID of the requested table. (templated)

  • max_results – The maximum number of records (rows) to be fetched from the table. (templated)

  • selected_fields – List of fields to return (comma-separated). If unspecified, all fields are returned.

  • gcp_conn_id – (Optional) The connection ID used to connect to Google Cloud.

  • bigquery_conn_id – (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • delegate_to – The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.

  • location – The location used for the operation.

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

generate_query(self)

Generate a SELECT query over selected_fields if given, otherwise over *, for the given dataset and table ID.
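For example, assuming the generated SQL follows the description above, the get_data operator shown earlier (selected_fields='DATE', max_results=100) would presumably produce a query roughly of the form (exact casing may differ):

SELECT DATE FROM test_dataset.Transaction_partitions LIMIT 100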

execute(self, context)
execute_complete(self, context, event)

Callback invoked when the trigger fires; returns immediately. Relies on the trigger to throw an exception; otherwise it assumes execution was successful.

class google.cloud.operators.bigquery.BigQueryIntervalCheckOperatorAsync

Bases: airflow.providers.google.cloud.operators.bigquery.BigQueryIntervalCheckOperator

Checks asynchronously that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.

The operator constructs a query of the following form:

SELECT {metrics_threshold_dict_key} FROM {table} WHERE {date_filter_column}=<date>

Parameters
  • table – the table name

  • days_back – the number of days between ds and the ds we want to check against. Defaults to 7 days.

  • metrics_thresholds – a dictionary of ratios indexed by metrics; for example, 'COUNT(*)': 1.5 would require a 50 percent or less difference between the current day and the prior days_back.

  • use_legacy_sql – Whether to use legacy SQL (true) or standard SQL (false).

  • gcp_conn_id – (Optional) The connection ID used to connect to Google Cloud.

  • bigquery_conn_id – (Deprecated) The connection ID used to connect to Google Cloud. This parameter has been deprecated. You should pass the gcp_conn_id parameter instead.

  • location – The geographic location of the job. See details at: https://cloud.google.com/bigquery/docs/locations#specifying_your_location

  • impersonation_chain – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).

  • labels – a dictionary containing labels for the table, passed to BigQuery

execute(self, context)
execute_complete(self, context, event)

Callback invoked when the trigger fires; returns immediately. Relies on the trigger to throw an exception; otherwise it assumes execution was successful.
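A minimal sketch using the parameters documented above (the table name, threshold, and connection ID are placeholders):

check_interval = BigQueryIntervalCheckOperatorAsync(
    task_id='check_interval',
    table='test_dataset.Transaction_partitions',
    days_back=7,
    metrics_thresholds={'COUNT(*)': 1.5},  # allow at most a 50 percent difference
    use_legacy_sql=False,
    gcp_conn_id='airflow-conn-id',
)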

class google.cloud.operators.bigquery.BigQueryValueCheckOperatorAsync

Bases: airflow.providers.google.cloud.operators.bigquery.BigQueryValueCheckOperator

Performs a simple value check asynchronously using SQL code, comparing the query result against pass_value within the given tolerance.

execute(self, context)
execute_complete(self, context, event)

Callback invoked when the trigger fires; returns immediately. Relies on the trigger to throw an exception; otherwise it assumes execution was successful.
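A minimal sketch (pass_value and tolerance come from the parent BigQueryValueCheckOperator; the SQL and values are placeholders):

check_value = BigQueryValueCheckOperatorAsync(
    task_id='check_value',
    sql='SELECT COUNT(*) FROM test_dataset.Transaction_partitions',
    pass_value=100,   # expected value
    tolerance=0.1,    # pass if the result is within ±10% of pass_value
    use_legacy_sql=False,
)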