Asset Storage Configuration
Document Engine supports multiple storage backends for PDFs and other assets, as detailed below.
Built-In Asset Storage
By default, Document Engine stores assets as binary large objects (BLOBs) in the database. For production environments, especially if you work with larger individual PDFs, we recommend using object storage. We currently support Amazon S3-compatible object storage and Azure Blob Storage.
Set `ASSET_STORAGE_BACKEND` to `built-in` to use the built-in asset storage. When deploying with Helm, use the `pspdfkit.storage.assetStorageBackend` value.
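For example, here’s a minimal Docker Compose sketch selecting the built-in backend; the variable name comes from the configuration options above:

```yaml
environment:
  # Store assets as BLOBs in the PostgreSQL database (the default).
  ASSET_STORAGE_BACKEND: built-in
```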
S3-Compatible Object Storage
Document Engine can also store your assets in any Amazon S3-compatible object storage service.
Configuration
Set `ASSET_STORAGE_BACKEND` to `s3`. Other configuration options depend on whether you’re using AWS S3 or another S3-compatible storage provider.
Here are the available parameters as Helm values:
```yaml
assetStorage:
  # `ASSET_STORAGE_BACKEND`: `built-in`, `s3` or `azure`
  assetStorageBackend: s3
  # S3 backend storage settings, in case `pspdfkit.storage.assetStorageBackend` is set to `s3`
  s3:
    # `ASSET_STORAGE_S3_ACCESS_KEY_ID`
    accessKeyId: "<...>"
    # `ASSET_STORAGE_S3_SECRET_ACCESS_KEY`
    secretAccessKey: "<...>"
    # `ASSET_STORAGE_S3_BUCKET`
    bucket: "<...>"
    # `ASSET_STORAGE_S3_REGION`
    region: "<...>"
    # `ASSET_STORAGE_S3_HOST`
    #host: "os.local"
    # `ASSET_STORAGE_S3_PORT`
    port: 443
    # `ASSET_STORAGE_S3_SCHEME`
    #scheme: "https://"
    # External secret name
    #externalSecretName: ""
```
AWS S3
When using S3, you must set the `ASSET_STORAGE_S3_BUCKET` and `ASSET_STORAGE_S3_REGION` configuration options to specify the bucket name and region.
If you’re running on AWS, Document Engine will try to resolve access credentials with the following precedence:

- `ASSET_STORAGE_S3_ACCESS_KEY_ID` and `ASSET_STORAGE_S3_SECRET_ACCESS_KEY` configuration options

We don’t recommend using credentials directly. Instead, consider using role-based permission management, depending upon the underlying platform.
If you’re not running on AWS, you must always set `ASSET_STORAGE_S3_ACCESS_KEY_ID` and `ASSET_STORAGE_S3_SECRET_ACCESS_KEY`.
Other S3-Compatible Storage Providers
When using an object storage provider other than Amazon S3, you must always set `ASSET_STORAGE_S3_ACCESS_KEY_ID` and `ASSET_STORAGE_S3_SECRET_ACCESS_KEY`. In addition, you can configure the following options:

- `ASSET_STORAGE_S3_HOST` — Host name of the storage service.
- `ASSET_STORAGE_S3_PORT` — Port used to access the storage service. The default port is `443`.
- `ASSET_STORAGE_S3_SCHEME` — URL scheme used when accessing the service, either `http://` or `https://`. The default is `https://`.
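As an illustrative sketch, a Docker Compose configuration for a self-hosted S3-compatible service might look like the following; the host name `storage.internal`, the port, and all credentials are placeholders:

```yaml
environment:
  ASSET_STORAGE_BACKEND: s3
  ASSET_STORAGE_S3_BUCKET: <bucket name>
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <access key id>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <secret access key>
  # Point Document Engine at the custom endpoint.
  ASSET_STORAGE_S3_HOST: storage.internal
  ASSET_STORAGE_S3_PORT: 9000
  ASSET_STORAGE_S3_SCHEME: http://
```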
For more details about using Google Cloud Storage as the storage backend, take a look at the Google Cloud Storage interoperability guide.
Bucket and Key Policy
If you’re using AWS S3, the IAM identity used by Document Engine needs the following permissions:

- `s3:ListBucket` on the configured bucket
- `s3:PutObject` on all objects in the bucket (`<bucket-arn>/*`)
- `s3:GetObjectAcl` on all objects in the bucket (`<bucket-arn>/*`)
- `s3:GetObject` on all objects in the bucket (`<bucket-arn>/*`)
- `s3:DeleteObject` on all objects in the bucket (`<bucket-arn>/*`)
If you’re using server-side encryption with AWS Key Management Service (SSE-KMS), the following actions must be allowed on the encryption key:

- `kms:Decrypt`
- `kms:Encrypt`
- `kms:GenerateDataKey`
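To make this concrete, here’s a sketch of an IAM policy covering both lists. The bucket name, region, account ID, and key ID are placeholders, and the KMS statement is only needed if you use SSE-KMS:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-document-engine-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-document-engine-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    }
  ]
}
```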
Timeouts
Note that all operations on the S3 bucket have a timeout of 30 seconds.
Azure Blob Storage
Document Engine can store your assets in Azure Blob Storage.
Configuration
To configure Azure Blob Storage as the default asset store, set `ASSET_STORAGE_BACKEND` to `azure` in your Document Engine configuration.
You also need to provide the following configuration options:
- `AZURE_STORAGE_ACCOUNT_NAME`
- `AZURE_STORAGE_ACCOUNT_KEY`
- `AZURE_STORAGE_DEFAULT_CONTAINER`
Alternatively, instead of providing `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY`, you can supply a connection string by setting `AZURE_STORAGE_ACCOUNT_CONNECTION_STRING`.
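For instance, here’s a minimal Docker Compose sketch using the connection-string form; the values are placeholders, and the default container is still required:

```yaml
environment:
  ASSET_STORAGE_BACKEND: azure
  # Takes priority over `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY`.
  AZURE_STORAGE_ACCOUNT_CONNECTION_STRING: <connection string>
  AZURE_STORAGE_DEFAULT_CONTAINER: <container>
```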
Here are the corresponding Helm values:
```yaml
assetStorage:
  # `ASSET_STORAGE_BACKEND`: `built-in`, `s3` or `azure`
  assetStorageBackend: azure
  # Azure backend storage settings, in case `pspdfkit.storage.assetStorageBackend` is set to `azure`
  azure:
    # `AZURE_STORAGE_ACCOUNT_NAME`
    accountName: "<...>"
    # `AZURE_STORAGE_ACCOUNT_KEY`
    accountKey: "<...>"
    # `AZURE_STORAGE_DEFAULT_CONTAINER`
    container: "<...>"
    # `AZURE_STORAGE_ACCOUNT_CONNECTION_STRING`, takes priority over `accountName` and `accountKey`
    #connectionString: ""
    # `AZURE_STORAGE_API_URL` for custom endpoints
    #apiUrl: ""
    # External secret name
    #externalSecretName: ""
```
We recommend using Azurite with Document Engine in development and test environments when `ASSET_STORAGE_BACKEND` is set to `azure`.

When using Azurite, you can also configure the URL for the Azure Blob Storage service by setting `AZURE_STORAGE_API_URL` to the address of the Azurite deployment.
Which Storage Backend Should I Use?
The choice of storage backend depends on the PDF dataset that will power your application, and it impacts the general performance of Document Engine.
If you have a relatively stable number of PDF files (i.e. an amount that only changes a few times a month), each smaller than 5 MB, you can safely use the built-in storage. Its main advantages are:

- You don’t have to worry about another piece of infrastructure.
- Backing up the Document Engine PostgreSQL instance will also back up your assets.
For larger and more frequently changing files, we recommend using the S3-compatible asset storage backend, which provides more efficient support for concurrent uploads and downloads.
Using the S3-compatible backend means you need a separate backup routine, but consider the following:

- Since Document Engine stores files by their SHA checksums, a daily incremental backup will usually suffice.
- Unless you use a backup solution that orchestrates a point-in-time backup across different storage types (e.g. AWS Backup), schedule the asset storage backup right after the PostgreSQL database backup to avoid data drift between the two.
Serving Files from Existing Storage in Your Infrastructure
If you already have a storage solution for PDF files in your infrastructure, Document Engine can integrate with it as long as the PDF files can be accessed via an HTTP endpoint. When integrating Document Engine and the file storage, you’ll need to add documents from a URL.
All PDF URLs should be considered permalinks, as PSPDFKit will always fetch the file when needed (keeping only a local cached copy that can expire at any time).
Never accept arbitrary user input as a URL for a PDF. Malicious users might leverage this to make Document Engine perform a request on their behalf. This kind of attack, known as Server-Side Request Forgery (SSRF), can be used to interact with services that assume the local network is secure, e.g. cloud automation infrastructure.
To achieve the best possible performance, ensure Document Engine instances and the file store sit in the same network (physical or virtual). This minimizes latency and maximizes download speed.
As of version 2019.4, it’s possible to perform a document editing operation on a document with a remote URL, but the resulting PDF file will need to be stored with any of the supported storage strategies. If you need to copy the transformed file back to the file store, you’ll need to do that manually by fetching the transformed file first.
If your file store requires authentication, we recommend introducing an internal proxy. When adding a document with a URL, the URL would point to the proxy endpoint, where your custom logic would be able to support the required authentication options and redirect to the file store URL of the PDF file. For more information and some sample code, visit the relevant guide article.
Migration between Asset Storage Options
It’s possible to migrate from one storage backend to another by executing the migration command as described below. To prevent data loss, a migration doesn’t delete files from the original storage backend.

Asset storage backend migrations are incremental. You can interrupt the migration process at any time and resume it later. This is useful when you have many documents and want to perform the migration only during periods of low load on your system. You can perform the migration while Document Engine is running.
Before you start the migration process, make sure to set the `ENABLE_ASSET_STORAGE_FALLBACK` configuration option to `true` and to specify the storage fallbacks you want enabled. This allows Document Engine to serve assets that haven’t yet been migrated from the old storage backend.

Remember to set it back to `false` when you’ve finished migrating all the documents, as it introduces a slight decrease in performance when fetching assets.
At any point, you can inspect how many documents are stored in each asset storage backend from the Storage tab in the Document Engine dashboard.
All configuration options mentioned in this section are also configurable in the Helm chart values.
Migrating to S3 from Built-In Storage
To migrate from the built-in asset storage to S3, follow these steps:
1. Set the `ENABLE_ASSET_STORAGE_FALLBACK` configuration option to `true`.
2. Enable the built-in database storage as a fallback by setting `ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES` to `true`.
3. Set the `ASSET_STORAGE_BACKEND` configuration option to `s3` and configure the rest of the S3 options.
4. Run the migration script by executing the `pspdfkit assets:migrate:from-built-in-to-s3` command in the Document Engine container.
   - If you use `docker-compose`, run the following command in the directory where you have your `docker-compose.yml` file: `docker-compose run pspdfkit pspdfkit assets:migrate:from-built-in-to-s3`.
   - If you don’t use `docker-compose`, first find the name of the Document Engine container using `docker ps -a`. This will list all running containers and their names. Then, run the following command, replacing `<container name>` with the actual Document Engine container name: `docker exec <container name> pspdfkit assets:migrate:from-built-in-to-s3`.
5. When all your documents have been migrated, set the `ENABLE_ASSET_STORAGE_FALLBACK` option back to `false`.
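As a reference point, here’s a sketch of the Docker Compose environment during this migration; the bucket, region, and credentials are placeholders, and all variable names come from the configuration options above:

```yaml
environment:
  # Target backend for newly stored assets.
  ASSET_STORAGE_BACKEND: s3
  ASSET_STORAGE_S3_BUCKET: <bucket name>
  ASSET_STORAGE_S3_REGION: <region>
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <access key id>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <secret access key>
  # Keep serving assets that still live in the database.
  ENABLE_ASSET_STORAGE_FALLBACK: "true"
  ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES: "true"
```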
Migrating to Built-In Storage from S3
To migrate from the S3 asset storage to the built-in storage, follow these steps:
1. Set the `ENABLE_ASSET_STORAGE_FALLBACK` configuration option to `true`.
2. Enable the S3 asset storage as a fallback by setting `ENABLE_ASSET_STORAGE_FALLBACK_S3` to `true`.
3. Set the `ASSET_STORAGE_BACKEND` configuration option to `built-in`. Do not remove any of the S3 configuration options.
4. Run the migration script by executing the `pspdfkit assets:migrate:from-s3-to-built-in` command in the Document Engine container.
   - If you use `docker-compose`, run the following command in the directory where you have your `docker-compose.yml` file: `docker-compose run pspdfkit pspdfkit assets:migrate:from-s3-to-built-in`.
   - If you don’t use `docker-compose`, first find the name of the Document Engine container using `docker ps -a`. This will list all running containers and their names. Then, run the following command, replacing `<container name>` with the actual Document Engine container name: `docker exec <container name> pspdfkit assets:migrate:from-s3-to-built-in`.
5. When all your documents have been migrated, set the `ENABLE_ASSET_STORAGE_FALLBACK` option back to `false` and remove all the S3 configuration options.
Migrating to and from Azure Blob Storage
We currently don’t support batch migrations of assets to or from Azure Blob Storage. That said, you can still migrate an individual document’s assets to or from Azure. Learn more about this here.
Per-Document Storage
In addition to configuring a default storage backend for all documents by setting the `ASSET_STORAGE_BACKEND` variable, you can upload documents to specific storage backends, so long as those backends are enabled as fallbacks in your Document Engine configuration.
Enabling Fallbacks for Asset Storage
To use multiple asset stores in your Document Engine instance, configure the main asset store by setting `ASSET_STORAGE_BACKEND` to `built-in`, `azure`, or `s3`.

Once configured, the storage backend set in `ASSET_STORAGE_BACKEND` will be used as the default storage for all documents. To store a specific document in a different asset storage backend than the configured default, the other asset storage needs to be enabled as a fallback.
For example, if `ASSET_STORAGE_BACKEND` is set to `azure`, then by default, all documents and their assets will be stored in Azure Blob Storage using the configured Azure credentials. However, you can configure a specific document to be stored in AWS S3 when uploading it. To do this, S3 needs to be enabled as a fallback asset store.
To enable fallback asset storage, set `ENABLE_ASSET_STORAGE_FALLBACK` to `true`. After that, enable the specific fallbacks you want by setting any of the following to `true`:

- `ENABLE_ASSET_STORAGE_FALLBACK_POSTGRES`
- `ENABLE_ASSET_STORAGE_FALLBACK_S3`
- `ENABLE_ASSET_STORAGE_FALLBACK_AZURE`
In addition to enabling the specific fallback, you also need to set any relevant configuration options for all the storage backends you’ve enabled. For example, if you enable S3 as an asset fallback, you need to provide the relevant configuration for S3, including the default S3 bucket.
Enabling and using fallback storage backends introduces a slight decrease in performance when fetching assets.
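Putting it together, here’s a Docker Compose sketch with Azure as the default backend and S3 enabled as a fallback; all values are placeholders:

```yaml
environment:
  # Default backend for all new documents.
  ASSET_STORAGE_BACKEND: azure
  AZURE_STORAGE_ACCOUNT_NAME: <account name>
  AZURE_STORAGE_ACCOUNT_KEY: <account key>
  AZURE_STORAGE_DEFAULT_CONTAINER: <container>
  # Allow storing specific documents in S3 instead.
  ENABLE_ASSET_STORAGE_FALLBACK: "true"
  ENABLE_ASSET_STORAGE_FALLBACK_S3: "true"
  ASSET_STORAGE_S3_BUCKET: <default bucket>
  ASSET_STORAGE_S3_REGION: <region>
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <access key id>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <secret access key>
```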
Uploading Documents to Different Storage
You can specify the `storage` option when uploading a document to Document Engine. This way, documents can be stored in different storage backends, as long as the storage backend is enabled as either the default storage or as a fallback. Learn more about the various options you can set when uploading a document from our API Reference.
Here’s an example of a request uploading a document and specifying the S3 bucket to use for that document:
```bash
# With Document Engine running on `http://localhost:5000`.
curl -X POST http://localhost:5000/api/documents \
  -H "Authorization: Token token=secret" \
  -H "Content-Type: multipart/form-data" \
  -F 'file=@blank.pdf' \
  -F 'storage={
    "backend": "s3",
    "bucketName": "a-different-bucket-from-default-s3-bucket",
    "bucketRegion": "us-west-2"
  }'
```
Migrating a Document’s Assets to Different Storage
You can migrate all the assets associated with a document (PDFs, images, file attachments, etc.) and all its layers to another storage backend by making a request to `/api/documents/{documentId}/migrate_assets`.

Here’s an example `curl` request to migrate a document’s assets to the `built-in` storage:
```bash
# With Document Engine running on `http://localhost:5000`.
curl -X POST http://localhost:5000/api/documents/{documentID}/migrate_assets \
  -H "Authorization: Token token=secret" \
  -H "Content-Type: application/json" \
  -d '{
    "storage": {
      "backend": "built-in"
    }
  }'
```
Learn more about migrating assets from our API Reference.
Multiple S3 Buckets
Documents can be uploaded (or migrated) to many different S3 buckets, so long as the instance role associated with your Document Engine nodes (or the AWS credentials configured for Document Engine) has the required permissions to access all the buckets you intend to upload or migrate documents to.
This feature is currently only available for S3. With Azure Blob Storage, all documents need to be stored in the default configured storage account, `AZURE_STORAGE_ACCOUNT_NAME`.
MinIO
With Helm
The Document Engine Helm chart has an optional dependency on MinIO, an S3-compatible object storage implementation. To enable it, use the following values:
```yaml
assetStorage:
  assetStorageBackend: s3
  s3:
    accessKeyId: "pspdfkitObjectStorageRootKey"
    secretAccessKey: "pspdfkitObjectStorageRootPassword"
    bucket: "document-engine-assets"
    region: "us-east-1"
    host: "minio"
    port: 9000
    scheme: "http://"

minio:
  enabled: true
  fullnameOverride: minio
  nameOverride: minio
  auth:
    rootUser: pspdfkitObjectStorageRootKey
    rootPassword: pspdfkitObjectStorageRootPassword
  defaultBuckets: "document-engine-assets"
```
With Docker Compose
To run the MinIO Docker container, use the following command:
```bash
docker pull minio/minio
docker run -p 9000:9000 minio/minio server /export
```
After running these commands, you’ll see the `AccessKey` and `SecretKey` printed in the terminal, which you can use to access the MinIO web interface at `http://localhost:9000/minio`.
You can now configure `docker-compose.yml` like this:
```yaml
environment:
  ASSET_STORAGE_BACKEND: S3
  ASSET_STORAGE_S3_BUCKET: <minio bucket name>
  ASSET_STORAGE_S3_ACCESS_KEY_ID: <minio access key>
  ASSET_STORAGE_S3_SECRET_ACCESS_KEY: <minio secret access key>
  ASSET_STORAGE_S3_SCHEME: http://
  ASSET_STORAGE_S3_HOST: pssync_minio
  ASSET_STORAGE_S3_PORT: 9000
```
MinIO supports emulating different regions and defaults to `us-east-1`. If you’ve changed your MinIO configuration to use a different region, make sure to set `ASSET_STORAGE_S3_REGION` accordingly.
Azurite
Azurite is an open source emulator from Microsoft for testing Azure Blob Storage actions in development and test environments. If you use Azure Blob Storage in production, we recommend using Azurite in development to get closer to dev/prod parity.
To run Azurite in Docker, run the following commands:
```bash
docker pull mcr.microsoft.com/azure-storage/azurite
docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
```
You can then configure Document Engine to use the default storage account on the Azurite instance. Learn more about the default storage account from Microsoft here.
In your Docker Compose file, for example, you can have this:
```yaml
environment:
  AZURE_STORAGE_ACCOUNT_NAME: devstoreaccount1
  AZURE_STORAGE_ACCOUNT_KEY: Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==
  AZURE_STORAGE_DEFAULT_CONTAINER: pspdfkit-dev
  AZURE_STORAGE_API_URL: http://localhost:10000/devstoreaccount1
```