
Model for Enterprise migration, archiving and content reuse

As part of an archiving project I created a block model that describes a low-effort, high-accuracy methodology that uses Azure for high-velocity migrations from varied sources into standardised destinations, for the purposes of archiving and unlocking latent knowledge.

The block model was used as a blueprint for our enterprise archive service, which is destined to hold petabytes of content. Reproduced below is a copy of the model. If you are interested in the background to the project and some of the choices and decisions I made along the way, then scroll on to the narrative section below the block model.

In its simplest form the model is:

Figure: Block model overview. Sources are on the left, with content flowing through connectors into an orchestration and migration engine with outputs in Azure. Below the left-to-right flow are the components that describe the user interface.

The block model contains product names and suggestions, and the flow is from top to bottom:

Figure: Block model (the subject of this blog post).

A3 PDF version of the block model:

Model for enterprise migration, archiving and content reuse by Simon Denton is licensed under CC BY-NC-SA 4.0

Narrative

Background

Back in 2019 I was set the challenge of developing an approach to archiving aged content from a document management system (DMS). Initially I advanced down a well-trodden path and started to consider purchasing the archive module provided with the DMS. Whilst this would be a quick(ish) win, it would mean that the content would still be effectively locked in the DMS. Earlier that year I had become involved in Project Cortex and through that I started to learn the value of being able to connect to different content sources to unlock latent knowledge. I investigated our connection options for the DMS and I also looked at our options for other content repositories, like file shares. I started to settle on the notion of archiving all aged content, regardless of its source, to a single location. One path could have been to use the DMS archive module for everything but, whilst it was good for archiving, it did not lend itself to unlocking latent knowledge. I wanted to put the content somewhere that offered flexibility both in terms of the sources it supported and the way we could approach the problem of unlocking latent knowledge.

The challenge with latent knowledge is knowing what to ask and how to ask it. Traditionally we have relied on search to answer questions. That works to a greater or lesser extent. It does rely on asking relatively simple questions and search being set up with the type of questions that will be asked being known at the outset. Opportunities for tuning and refinement are limited – typically a new search instance would be needed to answer a new type of question. I started to think laterally and obtained consent to allow the initial project scope to creep whilst I pursued a different path.

At Microsoft Ignite 2019, I had come across Project IDA. Project IDA was a demonstration project that showed how Azure Cognitive Search could be used to apply AI models to unlock knowledge. Project IDA required that the content to be mined was stored in Azure. The flexibility of Azure Cognitive Search meant that as new AI methods and questions emerged the content could be reprocessed. Out of the box Azure Cognitive Search offered a broader range of ways to ask questions, for example natural language, that reduced the need to know what was going to be asked and how it was going to be asked. My insider knowledge from Project Cortex hinted at a future whereby it would be possible to connect Microsoft Search and Project Cortex to content in Azure. That would yield the best of both worlds. Microsoft Search and Project Cortex could be used to connect people to content and content to people based on signals from the Microsoft Graph. Azure Cognitive Search could be used to ask the questions of our content that we did not yet know or have a means to ask.

The scope of the project changed from archiving aged content from a document management system to archiving aged content from any system and being able to unlock latent knowledge. This presented a new challenge of how to get the content from various sources into an archiving pipeline whilst retaining the principles of archiving.

Defining what we mean by ‘archive’

As I was proposing to move away from using the source services as the archiving host, I needed a set of high-level principles that defined what we meant by ‘archive’. These principles would be used to guide the design of my archiving solution. I settled on twelve principles:

  1. Archiving is the process of removal and transfer of content for long-term retention
  2. Archived content will be stored and retrievable
  3. Archived content will never be transformed, updated, overwritten or erased for a specified retention period
  4. With the exception of audit data, the archiving of metadata associated with content intended for archiving will be defined on a case-by-case basis
  5. Where captured by the source, audit data e.g. modified by, user etc. will be archived with the content
  6. Metadata including audit data may be transformed for the purposes of storage but will never be updated, overwritten or erased for a specified retention period
  7. Archived content shall be limited to records
  8. Content will remain discoverable, accessible and retrievable for the specified retention period
  9. Archiving processes will be defensible and practical
  10. Archiving will be triggered by a change in state of the content e.g. project closure
  11. Archiving will not be piecemeal e.g. a project will be archived in its entirety
  12. Archiving should not be confused with knowledge management, backup or preservation

Of the 12, item 7 ‘Archived content shall be limited to records’ was the hardest to design for. Defining what is and what is not a record became an overly complex task. At times I felt like Alice must have when she fell through the rabbit hole into Wonderland. Item 7 was modified to remove the record limitation in order to include all content. The consensus was to hoard everything until the specified retention period elapsed.

Item 12 served several purposes including setting the boundaries. For example the pipeline would archive content but it would not preserve the content using immutable formats and media. Nor would it be used to replace our existing backup provisions or manage the discovered knowledge.

Core principles

Once the definition of archiving was accepted, I set about establishing some core principles. The aim of the principles was to define the next level of detail which would inform the design. We settled upon sixteen principles:

  1. A ‘chain of custody’ will be established for every item processed by the archive.
  2. Archived content can only be disposed of after the specified retention period has expired and the owner has consented to its disposal.
  3. All content will have an owner who is a current member of staff.
  4. Archiving, storage, and retrieval will incur costs which will be billed to the content owner’s business unit.
  5. Archiving of Project related content will commence 3 months after a Project has been identified as ‘Closed and Dead’ or ‘Inactive’ in the ERP systems.
  6. Where it is not possible to associate a trigger for archiving, archiving will commence 3 months after the content was last accessed.
  7. Upon archiving, content will be held in a ‘Hot’ storage tier for 1 month, a ‘Cool’ storage tier for 2 months and then tiered down to the ‘Cold’ tier. Tiering will be used to reduce costs.
  8. Archiving will be applied at the highest possible node e.g. Project root folder, Department folder and will be uniformly applied to all content below it.
  9. It will be possible to search for content using basic attributes.
  10. Permissions used in the source will not be applied to content in the archive. Archived content will only be accessible by the content owner who will be able to control access.
  11. Retrieval will only return the latest known version from the archive. Prior versions will be provided on request.
  12. Retrieval will not include the application of the original metadata or permissions.
  13. Metadata will be supplied independently of the archived content.
  14. Costs and retrieval times associated with each tier are aligned with those of the storage provider. Expedited retrieval will incur additional costs.
  15. Bulk retrieval will be actively discouraged.
  16. Item retrieval will be simple and easy.
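The tiering schedule in principle 7 can be expressed as a simple age check. The following is a minimal sketch, not the archive's actual implementation: the function name `storage_tier` is hypothetical, and treating a month as 30 days is an assumption made purely for illustration.

```python
from datetime import date

# Thresholds from principle 7: 1 month in Hot, then 2 months in Cool, then Cold.
# Assumption: a month is approximated as 30 days.
HOT_DAYS = 30
COOL_DAYS = 60


def storage_tier(archived_on: date, today: date) -> str:
    """Return the tier an item should sit in, given the date it was archived."""
    age_days = (today - archived_on).days
    if age_days < 0:
        raise ValueError("archive date is in the future")
    if age_days < HOT_DAYS:
        return "Hot"
    if age_days < HOT_DAYS + COOL_DAYS:
        return "Cool"
    return "Cold"
```

In practice a storage provider's own lifecycle rules would normally do this work; the sketch only shows the decision the rules need to encode.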

Item 10 ‘Archived content will only be accessible by the content owner’ would simplify any migration and served to reset the information boundaries and security. In doing so it would have the potential to reduce the volume of content available for knowledge mining. Through the design we included a concession whereby we could add specific groups, e.g. knowledge workers, who could have access with the consent of the content owner.

The principles were further distilled into seven requirements:

  1. A light touch methodology in which content can be migrated to a range of destinations with minimal impact upon the business.
  2. No losses of records.
  3. Ability for the business to search and discover their archived content.
  4. Ability to filter content as part of the migration so as to avoid ‘garbage out – garbage in’.
  5. Ability to rehydrate content in a timely manner to meet legal or compliance obligations.
  6. Business consent to move sensitive content and approval for storage in the archive.
  7. Preservation of all versions with a full audit trail and permissions unless directed otherwise by the business on a case-by-case basis.

Migration is a mapping problem

I had determined that Azure would be used to host the archive and I created a block diagram that described the processes we would employ. The block diagram would become the blueprint for the design. I reviewed the vendor landscape to see if I could locate a product that could be used. There were a number of potential candidates. Some offered most of the pipeline, some just a connector and others a managed service. It was evident that we would need to use more than one vendor to establish the archive. Even then the solutions had catches like slow migration speeds, volume-based pricing, proprietary content silos or spiralling add-on costs. In order to give us the flexibility and control we needed, we opted to build the archive ourselves. This approach suited our situation. It also helped with the first migration, which would be from a DMS for which we had intimate knowledge of the schema. Through the process I learned that migration is first and foremost a mapping problem. After that it becomes a velocity problem. We focused our search on tools that could connect to the different sources. We wanted something that we could reuse. We settled on Xillio.

Xillio would be used to connect the sources to the archive and the archive to SharePoint. Initially the archive would be the final destination for the content, but we decided that we would also retire some of the sources when the migration was complete. Several of the sources would contain active projects that we had to move to SharePoint so they could continue. Handling live project content presented another challenge in that we would have to develop the ability to shadow and delta copy.
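As a rough illustration of what a delta copy has to decide, the sketch below compares two snapshots of a live project: the source as it is now and the shadow copy made earlier. This is not Xillio's API or our implementation. The function `plan_delta` and the idea of representing each snapshot as a mapping from item path to a content fingerprint (for example a hash) are assumptions for the purposes of the example.

```python
def plan_delta(source: dict[str, str], shadow: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots that map item path -> content fingerprint.

    Returns the paths that are new in the source, changed since the shadow
    copy was taken, or no longer present in the source.
    """
    to_copy = sorted(p for p in source if p not in shadow)
    to_update = sorted(p for p in source if p in shadow and source[p] != shadow[p])
    to_remove = sorted(p for p in shadow if p not in source)
    return {"copy": to_copy, "update": to_update, "remove": to_remove}
```

Only the items in the three returned lists need to be touched on the next pass, which is what keeps a delta copy cheap compared with repeating the full migration.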

Speed kills, no matter who is driving

The ability to extract and deposit large volumes of content at velocity was problematic. Our initial migration was for around 300TB of content. Plan A was to use Xillio to transfer the content over the wire from our data centre to our Azure tenant. A mere 19 miles as the crow flies. A combination of infrastructure and data egress and ingress limitations constrained the available velocity. We opted for Plan B, which was to use Azure Data Box to move the content to a temporary staging area in Azure in its source form and then use Xillio to perform the ingestion into the archive. Future large migrations will use Data Boxes rather than the wire. We moved more in a week than we did in the preceding month.

Whilst the Data Boxes solved the migration to the archive, we are still experiencing speed-limiting issues with the migrations to SharePoint. About 80TB of the 300TB is destined for SharePoint and it is slow going. This is partly because SharePoint is less forgiving when it comes to items like file paths, characters, large (over 2GB) files and zero-size files, so the content needs an additional level of pre-migration sanitisation. It is also partly because getting content into SharePoint is subject to some severe rate limiting, which can be worked around using various tricks and multiple accounts, but only to a point. This is an area that we will revisit in future migrations.
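A pre-migration check for the SharePoint problems described above might look something like the sketch below. It is illustrative only. The 2GB and zero-size checks come from the text; the set of invalid characters, the 400-character path limit, and the names `Item` and `preflight` are assumptions, so check the current SharePoint limits before relying on them.

```python
from dataclasses import dataclass

# From the text: files over 2GB and zero-size files cause problems.
# Assumption: 2GB is interpreted here as 2 * 1024**3 bytes.
MAX_BYTES = 2 * 1024**3
# Assumptions for illustration only; verify against current SharePoint limits.
INVALID_CHARS = set('"*:<>?\\|')
MAX_PATH_CHARS = 400


@dataclass
class Item:
    path: str  # path relative to the migration root, using "/" separators
    size: int  # size in bytes


def preflight(item: Item) -> list[str]:
    """Return the reasons an item needs attention before migration."""
    problems = []
    if item.size == 0:
        problems.append("zero-size file")
    if item.size > MAX_BYTES:
        problems.append("larger than 2GB")
    if len(item.path) > MAX_PATH_CHARS:
        problems.append("path too long")
    for segment in item.path.split("/"):
        bad = sorted(INVALID_CHARS.intersection(segment))
        if bad:
            problems.append(f"invalid characters in '{segment}': {' '.join(bad)}")
    return problems
```

Running a check like this across the whole source before the migration starts gives a list of items to rename, split or exclude, rather than discovering them one failed upload at a time.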

Throughput is also a constraint of Azure Cognitive Search. As part of the archive pipeline we create an initial search index. We create the index because we’ve found that staff follow one of two search patterns. They tend to search by navigation for recently migrated content. They know where it was in the source, or they know a key identifier from the topmost node, e.g. the project number, and they can navigate the archive to locate it. After a while navigational recall fades and they then need to search by keywords. Additionally, the index helps with the first pass of extracting latent knowledge. The throughput of Azure Cognitive Search is governed by a number of factors, including the skills used to crack the content. We have found that the key governing factor is that one node (one replica in one partition) seems to process around 250GB of content per day. Working with Microsoft we have redesigned our approach to search and have accelerated the creation of the indexes.
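To put the 250GB per day figure in context, the arithmetic below estimates how long indexing would take for a given volume. It is a back-of-the-envelope sketch under two assumptions that the observations above do not confirm: that throughput scales linearly with the number of nodes, and that 1TB is 1000GB. The function name `indexing_days` is hypothetical.

```python
import math

GB_PER_NODE_PER_DAY = 250  # observed figure quoted in the text
GB_PER_TB = 1000           # assumption: decimal units


def indexing_days(content_tb: float, nodes: int) -> int:
    """Whole days needed to index content_tb of content.

    Assumption: throughput scales linearly with the number of nodes.
    """
    if nodes < 1:
        raise ValueError("at least one node is required")
    total_gb = content_tb * GB_PER_TB
    return math.ceil(total_gb / (GB_PER_NODE_PER_DAY * nodes))
```

On these assumptions, if the full 300TB were indexed, `indexing_days(300, 1)` gives 1200 days on a single node and `indexing_days(300, 12)` gives 100 days, which shows why the per-node rate quickly becomes the governing factor.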

Progress

We are on track to complete the migration of the initial 300TB. After that we will be focussing on archiving from SharePoint and tackling the many squirrel stores of USBs and NAS drives that lurk in the cupboards! We will be integrating Microsoft Search and Project Cortex as part of our unlocking of latent knowledge.
