Async Queries via Celery

Celery

On large analytic databases, it is common to run queries that take minutes or hours. To support long-running queries that outlast a typical web request timeout (30-60 seconds), configure an asynchronous backend for Liteset, which consists of:

one or more Liteset workers, each implemented as a Celery worker and started with the celery worker command (run celery worker --help to view the available options)
a Celery broker (message queue), for which we recommend Redis or RabbitMQ
a results backend that defines where the workers persist query results

Configuring Celery requires defining a CELERY_CONFIG in your superset_config.py. Both the worker and ASGI server processes should have the same configuration.

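class CeleryConfig:
    # Message broker that queues tasks for the workers
    broker_url = "redis://localhost:6379/0"
    # Modules the worker imports at startup so their tasks get registered
    imports = (
        "superset.sql_lab",
        "superset.tasks.scheduler",
    )
    # Where Celery stores task state and return values
    result_backend = "redis://localhost:6379/0"
    # Number of tasks each worker process reserves ahead of time
    worker_prefetch_multiplier = 10
    # Acknowledge a task only after it finishes, so it is redelivered if the worker dies
    task_acks_late = True
    task_annotations = {
        # Cap SQL Lab query tasks at 100 per second per worker
        "sql_lab.get_sql_results": {
            "rate_limit": "100/s",
        },
    }

CELERY_CONFIG = CeleryConfig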
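
Because the worker and ASGI server processes must see identical settings, one option is to derive the connection URLs from the environment rather than hard-coding them. The sketch below assumes two hypothetical environment variables, CELERY_BROKER_URL and CELERY_RESULT_BACKEND_URL (neither is defined by Liteset); each falls back to the local Redis URL used above.

```python
import os

# Hypothetical variable names chosen for this example; Liteset does not define them.
DEFAULT_REDIS_URL = "redis://localhost:6379/0"


class CeleryConfig:
    # Read once at import time, so every process started with the same
    # environment ends up with the same broker and result backend.
    broker_url = os.environ.get("CELERY_BROKER_URL", DEFAULT_REDIS_URL)
    result_backend = os.environ.get("CELERY_RESULT_BACKEND_URL", DEFAULT_REDIS_URL)
    imports = ("superset.sql_lab", "superset.tasks.scheduler")


CELERY_CONFIG = CeleryConfig
```

Exporting the same two variables for the worker and the server is then enough to keep both sides pointed at the same broker.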

To start a Celery worker that uses this configuration, run the following command:

celery --app=superset.tasks.celery_app:app worker --pool=prefork -O fair -c 4

To start Celery beat, which schedules periodic background jobs, run the following command:

celery --app=superset.tasks.celery_app:app beat
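
Celery beat reads its schedule from the beat_schedule setting. As a rough sketch, an entry could be added to the configuration class as follows. The entry label and task path are placeholders, not tasks that Liteset is known to ship, and the 10 minute interval is arbitrary.

```python
from datetime import timedelta


class CeleryConfig:
    broker_url = "redis://localhost:6379/0"
    result_backend = "redis://localhost:6379/0"
    imports = ("superset.sql_lab", "superset.tasks.scheduler")
    # Each entry maps a label to a task name and a schedule. Celery accepts
    # a number of seconds, a timedelta, or a crontab object as the schedule.
    beat_schedule = {
        "example-periodic-job": {  # placeholder label
            # Hypothetical task name; the module that defines it would
            # also need to be listed in `imports`.
            "task": "my_package.tasks.example_job",
            "schedule": timedelta(minutes=10),
        },
    }


CELERY_CONFIG = CeleryConfig
```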

To set up a results backend, point RESULTS_BACKEND in your superset_config.py at any object that implements Liteset's SyncCacheProtocol (defined in superset.cache.manager). Liteset ships a production-grade Redis adapter, SyncRedisCacheAdapter, that the SQL Lab worker can use directly, with no Flask-Caching wiring needed. You can also implement your own adapter (S3, memcached, etc.) by exposing the same get/set/delete/clear interface. Your superset_config.py may look something like one of the following:

# On Redis (recommended)
from redis import Redis
from superset.cache.manager import SyncRedisCacheAdapter

RESULTS_BACKEND = SyncRedisCacheAdapter(
    Redis.from_url("redis://localhost:6379/0"),
    default_ttl=86400,
    key_prefix="superset_results_",
)

# On S3: bring your own adapter that satisfies SyncCacheProtocol (use this instead of the Redis block above)
from s3cache.s3cache import S3Cache  # third-party
S3_CACHE_BUCKET = 'foobar-superset'
S3_CACHE_KEY_PREFIX = 'sql_lab_result'
RESULTS_BACKEND = S3Cache(S3_CACHE_BUCKET, S3_CACHE_KEY_PREFIX)
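
For a custom adapter, the get/set/delete/clear interface mentioned above might look roughly like the following. This is a sketch under assumptions: the exact method signatures of SyncCacheProtocol are not shown here, so the ResultsCache protocol below is a local stand-in with guessed signatures, and InMemoryResultsBackend is a hypothetical class. Because it is process-local, workers and servers would not share it, which makes it suitable only for experiments and tests.

```python
import time
from typing import Any, Dict, Optional, Protocol, Tuple


class ResultsCache(Protocol):
    """Local stand-in for SyncCacheProtocol; the signatures are assumptions."""

    def get(self, key: str) -> Optional[Any]: ...
    def set(self, key: str, value: Any, timeout: Optional[int] = None) -> bool: ...
    def delete(self, key: str) -> bool: ...
    def clear(self) -> bool: ...


class InMemoryResultsBackend:
    """Hypothetical adapter that keeps results in a dict with per-key expiry."""

    def __init__(self, default_ttl: int = 86400, key_prefix: str = "") -> None:
        self._default_ttl = default_ttl
        self._key_prefix = key_prefix
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        full_key = self._key_prefix + key
        entry = self._store.get(full_key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._store[full_key]
            return None
        return value

    def set(self, key: str, value: Any, timeout: Optional[int] = None) -> bool:
        ttl = self._default_ttl if timeout is None else timeout
        self._store[self._key_prefix + key] = (time.monotonic() + ttl, value)
        return True

    def delete(self, key: str) -> bool:
        # True only when the key was actually present.
        return self._store.pop(self._key_prefix + key, None) is not None

    def clear(self) -> bool:
        self._store.clear()
        return True


backend: ResultsCache = InMemoryResultsBackend(default_ttl=60, key_prefix="superset_results_")
backend.set("query-1", b"payload")
```

A real adapter for S3 or memcached would keep the same four methods and replace the dict operations with calls to the remote store.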

For performance gains, MessagePack and PyArrow are now used for results serialization. This can be disabled by setting RESULTS_BACKEND_USE_MSGPACK = False in your superset_config.py, should any issues arise. Please clear your existing results cache store when upgrading an existing environment.

Important Notes

It is important that all the worker nodes and ASGI servers in the Liteset cluster share a common metadata database. This means that SQLite will not work in this context since it has limited support for concurrency and typically lives on the local file system.
Only one instance of celery beat should run in your entire setup. Otherwise, background jobs can be scheduled multiple times, causing problems such as duplicate delivery of reports and higher than expected load or traffic.
SQL Lab will only run your queries asynchronously if you enable Asynchronous Query Execution in your database settings (Sources > Databases > Edit record).
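
The SQLite restriction in the first note can be checked mechanically. The helper below is a hypothetical sanity check, not part of Liteset: it parses a SQLAlchemy-style database URL and reports whether it points at SQLite, so a deployment script could refuse to start workers in that case.

```python
from urllib.parse import urlsplit


def is_sqlite_url(database_url: str) -> bool:
    """Return True when a SQLAlchemy-style URL points at SQLite."""
    # The scheme may carry a driver suffix, e.g. "sqlite+pysqlite".
    scheme = urlsplit(database_url).scheme.lower()
    return scheme.split("+", 1)[0] == "sqlite"
```

For example, is_sqlite_url("sqlite:////tmp/superset.db") is true, while a URL starting with postgresql+psycopg2:// is not flagged.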

Celery Flower

Flower is a web-based tool for monitoring the Celery cluster, which you can install from pip:

pip install flower

You can run Flower using:

celery --app=superset.tasks.celery_app:app flower
