astronomer.providers.google.cloud.operators.bigquery
¶
This module contains Google BigQueryAsync providers.
Module Contents¶
Classes¶
Starts a BigQuery job asynchronously, and returns job id. |
|
BigQueryCheckOperatorAsync is asynchronous operator, submit the job and check |
|
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) |
|
Checks asynchronously that the values of metrics given as SQL expressions are within |
|
Performs a simple value check using sql code. |
Attributes¶
- astronomer.providers.google.cloud.operators.bigquery.BIGQUERY_JOB_DETAILS_LINK_FMT = https://console.cloud.google.com/bigquery?j={job_id}¶
- class astronomer.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperatorAsync(configuration, project_id=None, location=None, job_id=None, force_rerun=True, reattach_states=None, gcp_conn_id='google_cloud_default', delegate_to=None, impersonation_chain=None, cancel_on_kill=True, result_retry=DEFAULT_RETRY, result_timeout=None, deferrable=False, **kwargs)[source]¶
Bases:
airflow.providers.google.cloud.operators.bigquery.BigQueryInsertJobOperator
,airflow.models.baseoperator.BaseOperator
Starts a BigQuery job asynchronously, and returns job id. This operator works in the following way:
it calculates a unique hash of the job using job’s configuration or uuid if
force_rerun
is True- creates
job_id
in form of [provided_job_id | airflow_{dag_id}_{task_id}_{exec_date}]_{uniqueness_suffix}
- creates
submits a BigQuery job using the
job_id
- if job with given id already exists then it tries to reattach to the job if its not done and its
state is in
reattach_states
. If the job is done the operator will raiseAirflowException
.
Using
force_rerun
will submit a new job every time without attaching to already existing ones.For job definition see here:
- Parameters:
configuration (dict[str, Any]) – The configuration parameter maps directly to BigQuery’s configuration field in the job object. For more details see https://cloud.google.com/bigquery/docs/reference/v2/jobs
job_id (str | None) – The ID of the job. It will be suffixed with hash of job configuration unless
force_rerun
is True. The ID must contain only letters (a-z, A-Z), numbers (0-9), underscores (_), or dashes (-). The maximum length is 1,024 characters. If not provided then uuid will be generated.force_rerun (bool) – If True then operator will use hash of uuid as job id suffix
reattach_states (set[str] | None) – Set of BigQuery job’s states in case of which we should reattach to the job. Should be other than final states.
project_id (str | None) – Google Cloud Project where the job is running
location (str | None) – location the job is running
gcp_conn_id (str) – The connection ID used to connect to Google Cloud.
delegate_to (str | None) – The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
cancel_on_kill (bool) – Flag which indicates whether cancel the hook’s job or not, when on_kill is called
- class astronomer.providers.google.cloud.operators.bigquery.BigQueryCheckOperatorAsync(*, sql, gcp_conn_id='google_cloud_default', use_legacy_sql=True, location=None, impersonation_chain=None, labels=None, deferrable=False, **kwargs)[source]¶
Bases:
airflow.providers.google.cloud.operators.bigquery.BigQueryCheckOperator
BigQueryCheckOperatorAsync is asynchronous operator, submit the job and check for the status in async mode by using the job id
- class astronomer.providers.google.cloud.operators.bigquery.BigQueryGetDataOperatorAsync(*, dataset_id, table_id, project_id=None, max_results=100, selected_fields=None, gcp_conn_id='google_cloud_default', delegate_to=None, location=None, impersonation_chain=None, deferrable=False, **kwargs)[source]¶
Bases:
airflow.providers.google.cloud.operators.bigquery.BigQueryGetDataOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list where element would represent the columns values for that row.
Example Result:
[['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note
If you pass fields to
selected_fields
which are in different order than the order of columns already in BQ table, the data will still be in the order of BQ table. For example if the BQ table has 3 columns as[A,B,C]
and you pass ‘B,A’ in theselected_fields
the data would still be of the form'A,B'
.Example:
get_data = BigQueryGetDataOperator( task_id='get_data_from_bq', dataset_id='test_dataset', table_id='Transaction_partitions', max_results=100, selected_fields='DATE', gcp_conn_id='airflow-conn-id' )
- Parameters:
dataset_id (str) – The dataset ID of the requested table. (templated)
table_id (str) – The table ID of the requested table. (templated)
max_results (int) – The maximum number of records (rows) to be fetched from the table. (templated)
selected_fields (str | None) – List of fields to return (comma-separated). If unspecified, all fields are returned.
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
delegate_to (str | None) – The account to impersonate using domain-wide delegation of authority, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
location (str | None) – The location used for the operation.
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
- generate_query()[source]¶
Generate a select query if selected fields are given or with * for the given dataset and table id
- class astronomer.providers.google.cloud.operators.bigquery.BigQueryIntervalCheckOperatorAsync(*, table, metrics_thresholds, date_filter_column='ds', days_back=-7, gcp_conn_id='google_cloud_default', use_legacy_sql=True, location=None, impersonation_chain=None, labels=None, deferrable=False, **kwargs)[source]¶
Bases:
airflow.providers.google.cloud.operators.bigquery.BigQueryIntervalCheckOperator
Checks asynchronously that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.
- This method constructs a query like so ::
SELECT {metrics_threshold_dict_key} FROM {table} WHERE {date_filter_column}=<date>
- Parameters:
table (str) – the table name
days_back (SupportsAbs[int]) – number of days between ds and the ds we want to check against. Defaults to 7 days
metrics_thresholds (dict) – a dictionary of ratios indexed by metrics, for example ‘COUNT(*)’: 1.5 would require a 50 percent or less difference between the current day, and the prior days_back.
use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false).
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
location (str | None) – The geographic location of the job. See details at: https://cloud.google.com/bigquery/docs/locations#specifying_your_location
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
labels (dict | None) – a dictionary containing labels for the table, passed to BigQuery
- class astronomer.providers.google.cloud.operators.bigquery.BigQueryValueCheckOperatorAsync(*, sql, pass_value, tolerance=None, gcp_conn_id='google_cloud_default', use_legacy_sql=True, location=None, impersonation_chain=None, labels=None, deferrable=False, **kwargs)[source]¶
Bases:
airflow.providers.google.cloud.operators.bigquery.BigQueryValueCheckOperator
Performs a simple value check using sql code.
See also
For more information on how to use this operator, take a look at the guide: Compare query result to pass value
- Parameters:
sql (str) – the sql to be executed
use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false).
gcp_conn_id (str) – (Optional) The connection ID used to connect to Google Cloud.
location (str | None) – The geographic location of the job. See details at: https://cloud.google.com/bigquery/docs/locations#specifying_your_location
impersonation_chain (str | Sequence[str] | None) – Optional service account to impersonate using short-term credentials, or chained list of accounts required to get the access_token of the last account in the list, which will be impersonated in the request. If set as a string, the account must grant the originating account the Service Account Token Creator IAM role. If set as a sequence, the identities from the list must grant Service Account Token Creator IAM role to the directly preceding identity, with first account from the list granting this role to the originating account (templated).
labels (dict | None) – a dictionary containing labels for the table, passed to BigQuery
deferrable (bool) – Run operator in the deferrable mode