AzBlobChecker Deployment and Usage Guide

You have been provided access to AzBlobChecker to scan your Azure Storage account. The scanning process examines each of your objects, looking for signs that a file might be impacted by the issue and that the object might need to be reviewed.

This document is targeted at the technical operations team that will be performing the scan. Some prior knowledge of Azure, PowerShell, and the Azure CLI is expected.
Contents

Getting started
Deploying the application
    Sample Script to deploy to Azure Container Instances (recommended)
    Sample Script to deploy to an AKS Cluster
    Environment Variable Definitions for Sourcerer and Checker
Monitoring & Metrics
Auditing the Application
    To load the results of Blob Inventory into Azure Data Explorer
    To load the checker Log Table into Azure Data Explorer
    Sample ADX Queries
Other sourcing Algorithms
Troubleshooting ACI containers that do not seem to be doing anything
How many sourcerers can I run at the same time?
Restarting the sourcing process
How many checkers can I run at the same time?
Authentication with SharedKey
Getting started

Prerequisites

We have designed AzBlobChecker to work with the largest possible number of storage accounts. Please use the following matrix to identify any known storage account features that AzBlobChecker will not work with. If you are using these features, please contact your Microsoft team.
Account feature | Supported (Y/N)
Blob (Hot tier) | Y
Blob (Cool tier) | Y
Blob (Archive tier) | N – Archive objects will be queued for checking, but the checker will not check them.
Blob encrypted with Microsoft Managed Key | Y
Blob encrypted with Customer Managed Key | Y
Blob encrypted with Customer Provided Key | N
Storage account has SAS enabled | Y
Storage account has Shared Key disabled | N
Storage account requires Managed Identity | N
Storage account requires Service Principal Identity | N
Storage account in Public Cloud | Y
Classic (v1) storage accounts | Y
Storage accounts GPv2 | Y
| Y – requires modification of deployment script; not all features supported
Versioning | Y – only the current version will be checked
Snapshots | Y – only the current snapshot will be checked
Soft delete | Y – only non-deleted files will be checked
Components of AzBlobChecker

There are eight major components of running AzBlobChecker:

Figure 1: AzBlobChecker architecture and components
1. The application takes in a Target Storage account; this is the account you want to scan.
2. The results of the application run are written to the Ops
Storage account. This account contains
storage tables and queues used to keep track of and log the
progress of the application.
a. Note: The deployment scripts will automate the process of
creating this account, and when
the application is run, it will automatically populate any required
assets (queues, tables,
blobs) into this account.
3. An Application Insights instance for monitoring progress while
the application is running. This gives
you near real-time insights into the progress of the
application.
a. Note: The deployment scripts will automate the process of
creating this resource.
4. An Azure Workbook that summarizes the key metrics you need to
watch while AzBlobChecker is
running.
5. The sourcerer is a .NET Core application deployed as a Docker container that iterates over each of the objects in your target storage account and places a message in a queue to have the object scanned. You can add/remove instances of the sourcerer at any time during the sourcing process to speed up/slow down the sourcing process.
6. The checker is a .NET Core application deployed as a Docker container that scans each object identified by the sourcerer. You can add/remove instances of the checker at any time during the checking process to speed up/slow down the checking process.
7. The watchdog is a .NET Core application deployed as a Docker container; it monitors the queue depth and reports the status to Application Insights. You will only need one instance of this running.
8. The Azure Container Registry where Microsoft publishes the Docker containers for you to download. Credentials to download assets from this registry will be provided by your Microsoft contact.
When you deploy AzBlobChecker the deployment script will deploy the
Ops resource group and all the
resources inside of it. This will contain a Storage account,
Application Insights instance, Azure
Workbook, an instance of the watchdog and instances of the
sourcerer and checker for each Target
Storage account you want to scan. The resource group will be named
as follows “rg-bc-<target storage
account name>”. We recommend this so that the load on the Ops
Storage account and the Application
Insights instance is more manageable, and you can add/remove
instances of the sourcerer/checker
based on the unique requirements of each Target Storage
account.
Note: the above diagram and the process outlined in this doc
assumes you are
deploying the Docker containers to Azure Container Instances,
however you can
deploy these Docker containers to whatever platform makes the most
sense for your
environment, for example you might choose to deploy to the Azure
Kubernetes
Service.
The account to be scanned is called the “target account” for
purposes within this document.
AzBlobChecker (the tool) should ONLY be given READ access to the
target account as it does not need to
modify any data. The tool keeps track of everything it does in an
operations storage account and
Application Insights.
AzBlobChecker is designed to work against a single target account. If you need to scan multiple accounts, you can run multiple deployments of AzBlobChecker in parallel, up to your subscription limits.
Scanning a storage account requires 2 steps:
1. Listing all the files in the storage account
2. Checking each file
Each of these steps is completed by a different Docker container,
allowing you to add/remove running
instances of each container, reducing the time it takes to scan the
storage account and optimizing the
number of instances needed while keeping below the limits of the
source storage account.
Listing all the files in the storage account

The first step is building a list of all the files that need to be scanned. We call this process “sourcing”, and it is done by the “sourcerer”. The output of the sourcerer is an Azure Storage Queue, named “<your account name>-online” for Hot/Cool objects in the target storage account and “<your account name>-offline” for any Archive objects in the target storage account.
Note: The queues will only be created if objects of that type are
found. Checking offline objects is not
supported by the tool at this time.
What does the sourcerer use to track the scanning process?
The sourcerer will also create an Azure Storage Table and Queue
named “<your account name>”. This
storage table will provide a durable log of the entire scanning
process. Both the queue and the table will
contain one message/record per object found in the target storage
account.
How does the sourcerer work?
First each container in the account is listed using the List
Containers API and a message is inserted into
the <your account name>-checkpoint queue.
For each container message in the <your account
name>-checkpoint queue, the ListBlob API is called to
list all the contents of the container. This call is made using the
optional delimiter attribute set to “/”.
This allows searching for any 'folders' in the blob names.
If a folder is found a new message is added to the <your account
name>-checkpoint queue telling the
system to scan that folder.
Scanning in this manner breaks the task of listing all the contents
of the account into smaller tasks by
container & folder. These smaller tasks can be distributed to
many instances of the sourcerer running in
parallel.
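The traversal described above amounts to a breadth-first walk driven by a prefix queue. The following is a minimal in-memory simulation (not the tool's actual code) of how delimiter-based listing breaks an account's contents into folder-sized tasks that could be spread across sourcerer instances:

```python
from collections import deque

def source_account(blobs_by_name):
    """Simulate delimiter-based listing over one container.
    blobs_by_name maps blob name -> size. A checkpoint queue holds
    'folder' prefixes still to be listed, as the sourcerer's
    <account>-checkpoint queue does."""
    checkpoint = deque([""])          # "" represents the container root
    found = []
    while checkpoint:
        prefix = checkpoint.popleft()
        # Emulate ListBlobs(prefix, delimiter="/"): direct children only.
        children = set()
        for name in blobs_by_name:
            if not name.startswith(prefix):
                continue
            rest = name[len(prefix):]
            if "/" in rest:                                 # a 'folder'
                children.add(prefix + rest.split("/", 1)[0] + "/")
            else:
                found.append(name)                          # a blob to check
        checkpoint.extend(children)    # new, smaller listing tasks
    return found
```

In the real tool each queue message is an independent task, so many sourcerer instances can pull prefixes from the checkpoint queue in parallel instead of looping in one process.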
How does the sourcerer ensure each object is only sourced one
time?
To prevent a target account from being sourced multiple times by
accident, a blob lease is taken on a
blob in the operations storage account. The blob is named with your
account name and is placed in a
container called “leaderelector”.
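The single-run guarantee can be pictured with a small in-memory stand-in for the lease (the real tool takes an Azure blob lease on a blob named after your account in the “leaderelector” container; this sketch only illustrates the mutual exclusion):

```python
import threading

class LeaseRegistry:
    """In-memory stand-in for blob leases in the 'leaderelector'
    container: only the first caller to acquire a given blob name may
    source that target account."""
    def __init__(self):
        self._lock = threading.Lock()
        self._held = set()

    def try_acquire(self, blob_name):
        with self._lock:
            if blob_name in self._held:
                return False       # another sourcing run already owns it
            self._held.add(blob_name)
            return True

    def release(self, blob_name):
        with self._lock:
            self._held.discard(blob_name)
```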
How do I know when the sourcerer is done listing all of the files
in my target storage account?
The sourcerer is done when the objects per second sourced drops to
zero and the sourcing queue depth
drops to zero. See the Azure Workbook deployed into your ops
account to monitor these metrics.
Checking each file

Now that we have a list of all the objects in the target account, we can talk about checking these objects.
• The checker pulls a message from the <your account
name>-online queue and starts the
checking process on that object.
• If the object is found to have the characteristics of those that need further review, it is flagged by placing a message in the <your account name>-online-review queue and a record in the <your account name>-review table.
• In the <your account name> table record, each object's
status is updated so that you know that
the object was scanned.
• The checkers are done when the objects per second checked drops
to zero and the checking
queue depth drops to zero. See the Azure Workbook deployed into
your ops account to monitor
these metrics.
The review queue and table allow you to review the issue and take the appropriate action.
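The checker loop described above can be sketched as follows. This is an illustrative simulation with in-memory queues and tables, not the tool's actual code, and the “.bak” rule stands in for the real (undisclosed) checking logic:

```python
from collections import deque

def run_checker(online_queue, needs_review, status_table, review_queue):
    """Drain the online queue: flag objects that needs_review(obj) says
    require a second look (-> '<account>-online-review' queue) and mark
    every object as checked in the status table ('<account>' table)."""
    while online_queue:
        obj = online_queue.popleft()
        if needs_review(obj):
            review_queue.append(obj)
        status_table[obj] = "checked"

# Example run: flag any object whose name ends with ".bak" (made-up rule).
queue = deque(["a.txt", "db.bak", "img.png"])
table, review = {}, []
run_checker(queue, lambda o: o.endswith(".bak"), table, review)
```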
Deploying the application

The application is shipped using Docker containers. You can run these containers on whatever platform makes the most sense for your organization. We have provided sample deployment scripts to ease your deployment. The scripts contain the details of each deployment step; however, this can change based on how you deploy. This document includes important configuration and monitoring details that are the same regardless of how you deploy.
Sample Script to deploy to Azure Container Instances (recommended)
To run the script, you must:
• Have access to the subscription where the target storage account
is located and permission to
generate a read/list SAS for the account. The deployment script can
generate this SAS for you if
you have permissions to the account. If you do not have permissions
to the account to generate
the SAS, work with an administrator of the storage account to
create a read/list SAS with the
required permissions (see Settings for account SAS for more detail
on the required permissions).
• Have access to create the Ops resource group with a storage account, Application Insights instance, workbook, and Azure Container Instances within your Azure subscription.
• PowerShell version 7.1 or better. Run “$PSVersionTable” to
check.
o Instructions on how to install PowerShell here
• Azure CLI version 2.29 or better. Run “az --version” to check.
• Note: The PowerShell version of the Azure Cloud Shell meets the
technical requirements for both
PowerShell and the Azure CLI. This allows you to run the deployment
scripts from your browser, if
you have the appropriate credentials.
Step 1: Download the script
Execute this command to download the script:

Invoke-WebRequest "https://aka.ms/AAebeuk" -OutFile aci-deploy.ps1
Step 2: Run the script in Interactive Mode
To run the script in Interactive Mode where it will prompt you for
all the configuration information,
execute this command:
.\aci-deploy.ps1
• Note: When you run the script for the first time you will be prompted to provide:
o Information about the Azure Container Registry where you will get AzBlobChecker from. These details will be provided by your Microsoft contact.
o # of required sourcerers – We suggest starting with 2
sourcerers
o Details about the target storage account
o Details about where you want the ops resources
Step 3: Monitor the application
After your initial deployment the script will provide you a link to
your Azure Workbook dashboard where
you can monitor the application. Pull up this dashboard and let the
application run for at least 15
minutes. It takes this long for the telemetry to start flowing into
the dashboard.
Step 4: Change the number of running sourcerers or checkers
After the script is run for the first time, it will provide you a
parameterized way to re-run the script
without needing to walk through the Interactive Mode again. This
makes it easy to re-run the script
when you need to add/remove running instances of the sourcerers or
checkers.
You can add instances if you have excess capacity on your target
storage account/ops queues or remove
instances if you need to free up resources for other users of your
target storage account. You can also
set the number of instances to zero to completely stop them if you
need to pause or if that step in the
process has been completed.
The script will also generate a log file, providing you a record of
each run of the script you make.
A full deployment to ACI will consist of a resource group with the following:
• An ops storage account
• An Application Insights instance
• An Azure Workbook
• One instance of the watchdog
• Instances of the sourcerer and checker
Figure 2: Resources created in default deployment
Sample Script to deploy to an AKS Cluster

Download
Environment Variable Definitions for Sourcerer and Checker

Note: The below settings assume a default deployment configuration. You can select other options using the additional environment variables described in the appendix.
Category | Setting | Notes
Monitoring | ApplicationInsights__ConnectionString | The connection string for the AI instance the container should log to. Note: This is the recommended approach, working in all regions.
Authentication with Shared Access Signature | AppSettings__TargetAuthenticationMode | SharedAccessSignature
| AppSettings__TargetStorageAccountEndpoint | Typically "blob.core.windows.net" for target storage accounts in public Azure regions
| AppSettings__TargetStorageAccountName | The name of your target storage account, without the endpoint (i.e., myaccount)
| AppSettings__TargetStorageAccountSas | An account SAS (more info). See Settings for account SAS for details on how to create the SAS token with the required permissions.
Operations account info | AppSettings__OpsConnectionString | The connection string for the Ops storage account. The Ops storage account is used for any blobs, tables and/or queues the application needs to create.
Sourcerer configuration | AppSettings__SourcingStrategy | Queue
| AppSettings__LastModifiedOnValidationThreshold | Files older than this date will not be checked. This value should be "2019-10-31"
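Put together, a default sourcerer/checker container might be started with an environment along these lines. Every value in angle brackets is a placeholder to be replaced with your own account details; this is a sketch of the settings table above, not a ready-to-deploy configuration:

```python
# Environment for a default sourcerer/checker container, mirroring the
# settings table above. Placeholder values (<...>) must be substituted
# with your own SAS token, connection strings and account name.
container_env = {
    "ApplicationInsights__ConnectionString": "<ai-connection-string>",
    "AppSettings__TargetAuthenticationMode": "SharedAccessSignature",
    "AppSettings__TargetStorageAccountEndpoint": "blob.core.windows.net",
    "AppSettings__TargetStorageAccountName": "myaccount",
    "AppSettings__TargetStorageAccountSas": "<account-sas-token>",
    "AppSettings__OpsConnectionString": "<ops-storage-connection-string>",
    "AppSettings__SourcingStrategy": "Queue",
    "AppSettings__LastModifiedOnValidationThreshold": "2019-10-31",
}
```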
Settings for account SAS

Use the following options to create the account SAS, and select the SAS token value for the AppSettings__TargetStorageAccountSas parameter:

Figure 3: Target account SAS settings
Monitoring & Metrics

To keep track of progress, it is essential to understand how the application stack is performing. To track application status at all times, AzBlobChecker utilizes Metrics reported through Application Insights. Each minute, several metrics are captured and sent as individual time series, enriched with many object-level dimensions, allowing a very flexible way to slice the output.
Four main metrics are currently being captured:
• Objects listed is emitted for each object listed from the target storage account, through any of the provided listing methods, and offers the possibility to query for the total number of objects and the related total size present in the account.
• Objects to check is emitted for each object that needs to be checked and complies with the pre-validation and filters provided to the tool. This metric offers the possibility to query for the total number of objects to be checked and the related size of those objects present in the account.
• Objects checked is emitted for each object that was run through the checking logic, regardless of the result of that check. Combined with the "Objects to check" metric, this offers an easy way to evaluate progress and checking performance.
• Objects to review is emitted for each object that was run through the checking logic and that requires additional validation. This is the most important metric to leverage, as it can be used to evaluate how many objects need manual intervention.
Note: Application Insights metrics are lossy (i.e., they might not record every transaction). They are used to monitor progress of the sourcerers and checkers in a manner that is available to all Azure customers. Because metrics in Application Insights are lossy, it is possible that aggregates such as “Objects checked” will not match the number of objects in a scanned account. This is expected behavior. Please see Auditing the Application to query the detailed logs.

Note: The Application Insights metrics that we are emitting from the application are not real time; typically metrics arrive within 5 minutes.
Each metric is enriched with these queryable properties:
• Timestamp, with a granularity of 1 minute
• valueCount, which summarizes the total number of objects represented in a given time window
• valueSum, which summarizes the total object size represented in a given time window
• valueMin, which summarizes the minimum object size represented in a given time window
• valueMax, which summarizes the maximum object size represented in a given time window
And these (custom) dimensions:
• Storage Account indicating to which target storage account the
metric belongs
• Storage Class indicating whether the metric is about objects in
an Online or Offline tier
• Access Tier indicating whether the metric is about objects in the
Hot, Cool or Archive tier
• Blob Type indicating whether the metric is about an Append Blob,
Page Blob or Block Blob
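One minute of one metric can therefore be pictured as a record like the following. This is an illustrative model of the properties and dimensions listed above, not the tool's actual telemetry schema:

```python
from dataclasses import dataclass

@dataclass
class MetricPoint:
    """One 1-minute aggregate of one AzBlobChecker metric."""
    name: str            # e.g. "Objects checked"
    timestamp: str       # 1-minute granularity
    value_count: int     # number of objects in the window
    value_sum: int       # total object size (bytes) in the window
    value_min: int       # smallest object size seen
    value_max: int       # largest object size seen
    storage_account: str # which target account the metric belongs to
    storage_class: str   # "Online" or "Offline"
    access_tier: str     # "Hot", "Cool" or "Archive"
    blob_type: str       # "BlockBlob", "PageBlob" or "AppendBlob"

point = MetricPoint("Objects checked", "2021-10-01T12:00", 1200,
                    5_000_000, 10, 900_000, "myaccount",
                    "Online", "Hot", "BlockBlob")
```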
Key Metrics to keep an eye on

While the application is running, you want to keep an eye on the following key metrics. If you deployed using the provided ACI deployment script, then all of these queries have been deployed into an Azure Workbook for you.
Current Status and Estimated Completion

This query provides you an estimate of when the job will complete. When the workbook is deployed, the deployment script queries Azure Monitor to find out the current number of objects and bytes in your target storage account. When the query is run, it pulls the number of files and bytes currently processed and does a simple linear estimate based on when the process started. Since it is a linear estimate, it doesn't account for when you scale up/down the sourcerers/checkers or if you have to pause the run for whatever reason.
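The linear estimate is simple enough to sketch. Assuming a constant processing rate (which, as noted above, ignores scaling changes and pauses):

```python
from datetime import datetime, timedelta

def linear_eta(start, now, processed, total):
    """Linear completion estimate, as the workbook query does: assume
    the observed rate (processed / elapsed) holds for the remainder.
    Returns (total_duration, projected_finish_time)."""
    elapsed = now - start
    rate = processed / elapsed.total_seconds()   # objects (or bytes) per second
    total_duration = timedelta(seconds=total / rate)
    return total_duration, start + total_duration

# Example: 250,000 of 1,000,000 objects done after 2 hours.
start = datetime(2021, 10, 1, 8, 0)
total_dur, finish = linear_eta(start, start + timedelta(hours=2),
                               250_000, 1_000_000)
```

The same calculation applies to bytes, which (per the note below) is the better estimator for checker completion.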
The query:
Note: you need to fill in the first three parameters if you are
manually running this query. If you
used the deployment script these values will be populated for you
when the Azure Workbook is
created.
The results
• Name – the name of the metric that you are looking at; see above for a description of what each means.
• Storage Account – the target storage account name
• Objects – total number of objects that have been processed by that metric
• Duration – time delta between when the first and last telemetry records for that metric arrived
• Objects Per Second – number of objects processed divided by duration for that metric
• Object Size Per Second – number of bytes processed divided by duration for that metric
• Object Size – total number of bytes that have been processed by that metric
• Total ETA Objects – Assuming you continue to process at the same
speed how long will it take in
total to complete the job, estimated by number of objects.
• ETA Objects - Assuming you continue to process at the same speed
WHEN will the job complete,
estimated by number of objects.
• Total ETA Bytes – Assuming you continue to process at the same
speed how long will it take in
total to complete the job, estimated by number of bytes.
• ETA Bytes - Assuming you continue to process at the same speed
WHEN will the job complete,
estimated by number of bytes.
• Pct Complete Objects – Total number of objects processed divided
by total number of objects in
the storage account.
• Pct Complete Object Size – Total number of bytes processed
divided by the total number of
bytes in the storage account.
• Start – Date/Time when the first telemetry item for that metric
arrived.
• Stop - Date/Time when the last telemetry item for that metric
arrived. Good for measuring the
lag between now and the data you are looking at from Application
Insights. Also good for
knowing when a task (i.e. sourcing or checking) is complete, i.e.
if you are no longer getting
telemetry for 15 minutes or so it is probably done.
Note: We calculate the above on both objects and bytes. Object
measures are good for estimating
sourcerer performance/completion (i.e. “objects listed” and
“objects to check”). Bytes or Size
measures are good for estimating checker performance/completion
(i.e. “objects checked”).
Objects processed per second
This query provides you a time chart of the number of objects
processed. This allows you to see the
impact of adding/removing sourcerers/checkers over time.
The query:
Note: you need to fill in the first parameters if you are manually
running this query. If you used
the deployment script these values will be populated for you when
the Azure Workbook is
created.
The results
Here you can see the blue lines (“objects to check” & “objects listed”) created by the sourcerer telemetry, and the orange line (“objects checked”) over time. Hovering over a point on the chart gives you the average number of objects processed per second, using a 5-minute window.
Target Account Egress
This query provides you a view of the amount of data egressing out
of the target account. This is
inclusive of any egress created by AzBlobChecker and any other
users of this storage account. If you pull
data too fast the egress will exceed the account limit and all
users can get throttled. Typically you want
to keep egress below 375 GiB per min. If you have excess egress
capacity and want to speed up the
checking process you can add additional checkers. If you are using
too much egress you can remove
checkers to reduce the load on the target account.
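As a rough sizing aid for the guidance above, assuming egress scales roughly linearly with the number of checkers (which will not hold exactly in practice):

```python
def egress_checker_headroom(current_gib_per_min, checkers,
                            limit_gib_per_min=375):
    """Estimate how many checkers fit under the account's egress limit,
    assuming egress per checker stays constant. The 375 GiB/min default
    is the guideline from the text; check your own account's limits."""
    per_checker = current_gib_per_min / checkers
    return int(limit_gib_per_min // per_checker)

# Example: 10 checkers driving 150 GiB/min leave room for ~25 in total.
```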
The results
Hover over the chart to see the amount of egress per min.
Ops Account Queue Transactions
This query provides you a view of the number of queue transactions hitting the ops account. Both the sourcerer and checker contribute to this number; however, typically sourcing too fast is the key driver of this metric. To avoid throttling, you want to keep this metric below 120,000 transactions per minute.
The results
Hover over the chart to see the number of transactions per
min.
Sourcing Queue Depth
This query provides you a view of the total number of messages that
are in the sourcing queue (i.e.
<target storage account name>-checkpoint queue). This is the
number of folders that the sourcerer has
found but has not looked in yet. This number will grow and shrink
while the sourcerer is running based
on how the data is structured in the target storage account. When
this number is 0 for 15 minutes and
you are no longer seeing any objects processed, then it is safe to
shut down your sourcerers.
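The shutdown rule above can be expressed as a small check; the 15-minute quiet window comes from the text, and the same rule applies to the checking queue below:

```python
def safe_to_shut_down(samples, quiet_minutes=15):
    """samples: per-minute (queue_depth, objects_processed) observations,
    newest last. Shut down only when both the queue depth and the
    processing rate have been zero for the whole quiet window."""
    if len(samples) < quiet_minutes:
        return False
    recent = samples[-quiet_minutes:]
    return all(depth == 0 and processed == 0 for depth, processed in recent)
```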
The query
The results
Hover over the chart to see the number of messages in the queue at that point in time.
Checking Queue Depth
This query provides you a view of the total number of messages that
are in the checking queues (i.e.
<target storage account name>-online and <target storage
account name>-offline queue). This is the
number of files that the checker needs to process. This number will
grow and shrink while the sourcerer
is running based on how the data is structured in the target
storage account. When the online queue is 0
for 15 minutes and you are no longer seeing any objects sourced,
then it is safe to shut down your
checkers.
The results
Hover over the chart to see the number of messages in the queue at
that point in time.
Auditing the Application

The Application Insights metrics above are good to get a quick, high-level overview of the progress of the application. However, the application keeps detailed logs in an Azure Storage Table named with the target storage account name (in the Ops storage account), detailing what it does to ensure that every file gets checked.
You can query this table directly if you want to look at the status
on a handful of files. However, if you
want to do more complex aggregations/summaries, the recommended
practice is copying this table into
Azure Data Explorer.
Note: We recommend waiting until after the checking process is
completed to load the data into ADX, as
the loading process will impact the performance of the checking
process.
You can load the results of Blob Inventory in Azure Data Explorer.
This allows you to quickly identify any
files that might have been missed during sourcing.
To load the results of Blob Inventory into Azure Data Explorer
Note: Blob Inventory is not enabled by default on storage accounts,
more info here.
This can be done using the “one-click”/lightspeed ingestion tool in
Azure Data Explorer. Review the
general process of using lightspeed here. Specific directions for
our use case are below.
1. Create a cluster and database, instructions here.
2. Open the “Query window” for your database.
3. Right-click on the database and select “Ingest new data” (this
will bring up the one-click wizard).
4. The destination cluster name and database name should be
populated. Select “create new
table” and give it a name (i.e., “BI”). Then, press “next”.
5. Source Type should be “from blob container”.
6. Ingestion Type should be “historical data”.
7. Select source – this should point to your target storage
account. Use either a SAS URL or the GUI
to select the proper account/container.
8. Under File Filters, set the folder path (i.e.,
“2021/09/30/12-55-33/inventoryrule”) and the File
extension (i.e., “.csv”).
9. Select any one of your CSVs as the schema defining file and
press “next”.
10. On the schema step, check the data types to ensure that they
are correct. “Content length” is a
long, “LastAccessTime” is a datetime, “AccessTierChangeTime” is a
datetime and ensure that
“ignore first record” is selected.
11. Press “Next”. This will generate a lightspeed command. Follow the onscreen instructions to download the lightspeed tool and run the generated command to load the data into ADX.
To load the checker Log Table into Azure Data Explorer

1. Create a cluster and database if you have not done so already. The instructions are here.
2. In your cluster, create a table to store the data. Execute the
query below to create the table.
3. Grant Access
a. NOTE: we are using a service principal, but you can use the
configuration best suited for
your environment.
b. Create a Service Principal, instructions here.
c. Give your Service Principal “Database Ingestor” permissions in
your DB, instructions
here.
4. Create an instance of Azure Data Factory, instructions here
(just this section not the whole doc).
5. Once in Azure Data Factory Studio, on the “Home” tab, select
“Ingest”.
6. Select “Built-in copy” and “run once now”. Press “next”.
7. The “Source” is Azure table storage. Create a new connection to
your ops account (if needed)
and select the table where the name matches the target storage
account name.
8. Press “Next”, and “Next” again when prompted.
9. Now, you will be asked for the target. Select “ADX” and create a
new connection (if needed)
using the service principal we created earlier.
10. Next, select the target table that we created earlier (i.e.,
“checker”). Select “Skip column
Mapping” and press “Next”. Select “Next” again when prompted.
11. You should now be on the settings screen. Here you can leave
all the defaults and press “Next”.
12. Then, select “Next” on the summary screen.
13. After some validation and deploying, press “monitor” to watch
the progress.
Note: In sample runs, it took approximately 2 hours and 15 minutes
to ingest 24 million rows from table
storage into ADX.
Sample ADX Queries

With everything loaded you can now quickly query ADX for any information you need.
LightIngest.exe "https://ingest-

The checker table columns referenced in the earlier ingestion steps:
Timestamp:datetime, AccessTier:string, Account:string, Checksum:string, Container:string, CreatedOn:datetime, Endpoint:string, Findings:string, IsValid:bool, IsValidated:string, LastModifiedOn:datetime, Name:string, NeedsValidation:bool, Path:string, RunId:string, Size:int64,
Files that need to be reviewed
Note: This list is also in a separate table (“<your account
name>-review”). You do not need to load
everything into ADX if this is the only data you are seeking.
Review the total objects in each container from the blob inventory
table and the checker log table.
Compare the object counts.
The sample queries summarize count() by Ext, Type, NeedsValidation, IsValidated, IsValid and AccessTier, and join the checker table against the blobInventory table to compare per-container object counts (CheckerCount).
Other sourcing Algorithms

• Sequential - Good for very small accounts (under 1 TiB or 100,000 files). While this algorithm is simple, it can only be executed on one thread, making it inefficient for use on accounts with many objects.
o First, each container in the account is listed using the List Containers API.
o Next, the ListBlob API is used to list all the contents of the container.
o Set the container environment variable AppSettings__SourcingStrategy to Sequential.
• BlobInventory - Good for the largest accounts where Blob Inventory is available; scales based on the number of CSVs generated by Blob Inventory.
o Set the container environment variables: AppSettings__SourcingStrategy to BlobInventory, and the inventory manifest to use.
• NOTE: for this sourcing strategy we recommend the following queue configuration: = 1
o First, you need to enable Blob Inventory on your account. Choose the following options:
   ▪ Rule Name - whatever you want
   ▪ Container - whatever you want. NOTE: this is the container that the inventory will be placed in. I named my container “inventory”.
   ▪ Object to inventory – Blob
   ▪ Blob types - select all 3: Block blobs, Page blobs, and Append blobs
   ▪ Blob subtypes - don't select any. Scanning blob versions and snapshots is not supported in this version of the tool.
   ▪ Blob inventory fields - select everything EXCEPT metadata
   ▪ Inventory frequency - select “Daily”
   ▪ Prefix match - leave blank
o You will now need to wait around 24 hours, depending on the size of your account, to give Blob Inventory a chance to run. Once it is complete, look in the container you told it to put your inventories in. You should see a folder for the year, month, day, time, and rule name of the inventory run, containing the Blob Inventory results.
   ▪ The first thing to look for is a “<rule name>-manifest.checksum” file. This will be the last thing that blob inventory writes. If you don't see it, then the blob inventory is still processing.
   ▪ Next, look for a “<rulename>-manifest.json” file. This contains a summary of the inventory process, including the total number of objects/bytes found and a list of all the CSVs generated. The number of CSVs depends on the amount of data in your account.
o The sourcerer will start by downloading the inventory file. For
each inventory file, it will
put a message in the “<your account name>-checkpoint”
queue.
• Each sourcerer instance will process one CSV at a time and will
keep track of its progress reading
each CSV in the “<your account name>csvrowcheckpoint”
table.
• NOTE: You can turn off blob inventory after it has generated the
necessary files the first time.
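To make the strategy selection concrete, the sourcing strategy environment variable can be passed at container creation time. Below is a minimal ACI sketch; the resource group, container name, and image are placeholders, not values shipped with the tool:

```shell
# Placeholder names - substitute your own resource group, container name, and image.
az container create \
  --resource-group my-rg \
  --name azblobchecker-sourcerer \
  --image myregistry.azurecr.io/azblobchecker:latest \
  --environment-variables AppSettings__SourcingStrategy=BlobInventory
```

The same `--environment-variables` flag accepts any of the AppSettings__ variables described in this guide.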
Troubleshooting ACI containers that do not seem to be doing anything
If you have deployed via ACI and the containers have started, but after a few minutes you do not see any logs appearing in Application Insights:
o Check the restart count; a number greater than zero for a container that has just started is typically a sign that the container is misconfigured.
Figure 4: Checking ACI restart count
o You can view the logs from within a running ACI instance by navigating to the “Logs” tab; here you should see the configuration that the running container emitted.
Figure 5: Viewing ACI logs
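The same two checks can be scripted with the Azure CLI. The resource group and container group names below are placeholders:

```shell
# Placeholder names - substitute your own resource group and container group.
RG=my-rg
ACI=azblobchecker-1

# Restart count for each container in the group. A non-zero count right
# after deployment usually indicates a misconfigured container.
az container show -g "$RG" -n "$ACI" \
  --query "containers[].instanceView.restartCount" -o tsv

# Dump the container logs to see the configuration it emitted at startup.
az container logs -g "$RG" -n "$ACI"
```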
How many sourcerers can I run at the same time?
• Sourcing and checking can run on the target account at the same
time.
• By default, most Azure subscriptions are limited to 100 ACI
instances per region, however this
varies by subscription type. If you are scanning multiple accounts
in the same region in parallel,
you might need to prioritize to ensure you stay under your
subscription limit.
• For accounts with larger files (e.g., over 4 MiB), sourcing should be many times faster than checking, so we recommend running fewer sourcerers than checkers.
• We recommend starting with a few sourcerer instances (e.g., 2), monitoring their impact on your account, and then adding/removing instances.
• To determine if you have the right number of sourcerers:
o First, look at the queue transaction counts on the ops storage account. You want to keep this number under 120,000 transactions per minute per queue. NOTE: You can create an Azure Monitor chart with a 1-minute time grain to simplify monitoring this.
Figure 6: Queue transaction counts in Azure Monitor metrics
o Next, look at the number of messages in the <your account
name>-online queue. If this
number is going up, you are sourcing faster than you are checking.
NOTE: This number
will stop going up once all the items in your target storage
account have been listed. At
that time, you can turn off/delete all your running
sourcerers.
Figure 7: Browsing queue messages in Storage browser
(preview)
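If you prefer the command line to an Azure Monitor chart, per-minute transaction totals can also be pulled with `az monitor metrics list`. The resource ID below is a placeholder for your ops storage account's queue service:

```shell
# Placeholder resource ID - substitute your subscription, resource group,
# and ops storage account name.
RES="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<ops-account>/queueServices/default"

# Per-minute queue transaction totals; keep these under 120,000 per minute
# per queue.
az monitor metrics list \
  --resource "$RES" \
  --metric Transactions \
  --interval PT1M \
  --aggregation Total \
  -o table
```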
Restarting the sourcing process
To restart the sourcing process from the beginning you will need to:
• Stop all running instances of the sourcerer
o See the deployment script for a sample based on your deployment
model
• Break the lease on the blob
o In the Azure Portal, find your storage account; in the left menu select Data storage, then Containers, and select the leaderelector container
Figure 8: Browsing blob containers in the Azure portal
o You should see a blob with your account name and a “Lease state”
of “leased”
Figure 9: Viewing leased blobs
o Select the 3 dots and then “Break Lease”
Figure 10: Breaking a blob lease
• Clear out the contents of:
o The <your account name>-checkpoint queue
o The <your account name>-online and <your account
name>-offline queues
o The <your account name>csvrowcheckpoint table (if you are
using the blob inventory
sourcerer)
• The process to clear each of the queues above is the same:
o Select “Data Storage”, then “Queues”, then the name of the queue you want to clear
Figure 11: Selecting a queue to clear in the Azure portal
o Press the “Clear queue” button
Figure 12: Location of the Clear queue button
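The portal steps above can also be scripted. A sketch with the Azure CLI follows; the account names are placeholders, and deleting the csvrowcheckpoint table (rather than clearing its rows) assumes the sourcerer recreates it on startup:

```shell
# Placeholder names - substitute your ops storage account and target account name.
OPS_ACCOUNT=myopsaccount
TARGET=myaccount

# Break the lease on the leader-election blob.
az storage blob lease break \
  --container-name leaderelector \
  --blob-name "$TARGET" \
  --account-name "$OPS_ACCOUNT"

# Empty the checkpoint and work queues.
for Q in "$TARGET-checkpoint" "$TARGET-online" "$TARGET-offline"; do
  az storage message clear --queue-name "$Q" --account-name "$OPS_ACCOUNT"
done

# If you are using the blob inventory sourcerer, drop the CSV row checkpoint
# table (assumption: the sourcerer recreates it when it starts).
az storage table delete --name "${TARGET}csvrowcheckpoint" --account-name "$OPS_ACCOUNT"
```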
How many checkers can I run at the same time?
• Sourcing and checking can run on the target account at the same
time.
• By default, most Azure subscriptions are limited to 100 ACI
instances per region, however this
varies by subscription type. If you are scanning multiple accounts
in the same region in parallel,
you might need to prioritize to ensure you stay under your
subscription limit.
• For accounts with larger files (e.g., greater than 4 MiB), sourcing should be many times faster than checking, so the recommended best practice is to run more checkers than sourcerers.
• We recommend starting with a few checkers (e.g., 5), monitoring their impact on your account, and then adding/removing instances.
• The checking process is typically limited by the egress limits
for the target storage account. It is
very important to keep an eye on the amount of data being pulled
from the target account to
ensure enough room for any other users of the data in the
account.
• By default, v2 storage accounts have a 50 Gbps egress limit. You can pull the sum of blob egress summarized by minute from Azure Monitor. (You are looking to stay under roughly 375 GB per minute.)
Figure 13: Viewing account egress in Azure Monitor metrics
If your account has lots of very small files, you will likely not
hit the egress limit on your target
storage account. You should also monitor the queue limits on your
ops storage account (see the
sourcerer section above for details).
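As a quick sanity check on that per-minute budget, the default 50 Gbps limit converts to bytes per minute as follows:

```shell
# 50 Gbps egress limit expressed as a per-minute byte budget:
# divide by 8 to convert bits to bytes, multiply by 60 for a minute.
GBPS=50
BYTES_PER_MIN=$(( GBPS * 1000000000 / 8 * 60 ))
echo "$BYTES_PER_MIN"   # 375000000000 bytes, i.e. about 375 GB per minute
```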
Authentication with SharedKey
• To authenticate with a SharedKey, set the following environment
variables on the containers
o AppSettings__TargetAuthenticationMode - SharedKey
target storage accounts in public Azure regions
o AppSettings__TargetStorageAccountName - the name of your storage account, without the endpoint (e.g., myaccount)
• AppSettings__MaximumQueueRetrievalBackOffDurationInSeconds - time to sleep before polling the queue again to look for new messages. Default Value=60
• AppSettings__MaximumNumberOfConcurrentMessageHandlers - the number of parallel message handlers to run. The more message handlers used, the more resources your container needs and, typically, the more messages the container can process. Default Value=32
• AppSettings__MaximumNumberOfRetriesOnFailure – This variable
represents how many times