Recent Professional Experience

Since 2020

Many engagements and projects have been performed repeatedly; these are indicated accordingly (e.g., 6x).

NOTABLE ENGAGEMENTS

Insurance Client

Client had a 6-person legal team that manages many law firms, each engaged in legal action on behalf of the client. Each law firm has a unique contract and submits legal bills electronically, which must conform to the contract specifications. Review of the legal bills often finds discrepancies in the form of overbilling. The legal team manually spot-checks individual bills for contract compliance but only has the bandwidth to review 5% of incoming bills. Client wanted to leverage Generative AI (GenAI) to automatically review as many bills as possible in real-time.

  • Configured an instance of Anthropic Claude 3.5 Sonnet in AWS Bedrock.

  • Worked with legal team to export all billing contracts to PDF and store them in S3.

  • Created several Lambda functions in Python to do the following:

    • Load a PDF contract from S3 and convert it to text with AWS Textract. Store the resulting text file in S3.

    • Load a billing contract text file and make an inference request to the AWS Bedrock model with the contract text and a predefined prompt to derive the billing rules for the contract. Store the billing rules as text in DynamoDB.

    • Load a LEDES-formatted legal bill from S3 and convert it into JSON. Store the JSON file in DynamoDB.

    • Load a JSON-formatted legal bill from DynamoDB, load the legal firm’s billing rules from DynamoDB, and provide the legal bill and billing rules to the AWS Bedrock model with a predefined prompt. The prompt directs the response to be returned in JSON format with a field indicating noncompliance and a list of noncompliant line items. Store the inference response for noncompliant bills in DynamoDB.

    • Send an email notification to the legal team group address for noncompliant bills.

  • Performed model evaluation to ensure the model correctly extracted billing rules, identified legal bills with noncompliant line items, and did not generate false positives.

  • Created a web application using Python and Flask to provide a web UI for the legal team to review flagged legal bills.

  • Created a Terraform configuration to create the S3 buckets and event triggers, the Lambda functions and associated CICD pipelines, the AWS Step Functions state machine to orchestrate the workflows, and the infrastructure for hosting the web application.

  • With this design, multiple legal bills can be verified for compliance in parallel since each legal bill will trigger an isolated execution of the verification workflow.

  • Provided extensive documentation, infrastructure diagrams, and process flow diagrams.

The resulting workflow now automatically reviews 100% of the incoming legal bills in real-time, flagging incidents of overbilling for further legal review. New contracts saved to S3 are automatically converted into billing rules. This has yielded immediate cost savings and has freed up the legal team to concentrate on other work. Further, the entire infrastructure is serverless and the cost of infrastructure and inference is immaterial, particularly compared to the cost savings.
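
For illustration, the bill-compliance Lambda described above can be sketched roughly as follows. The table names, keys, model ID, and prompt are placeholders, not the production values.

    # Minimal sketch of the bill-compliance Lambda (names, keys, and prompt are placeholders).
    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime")
    dynamodb = boto3.resource("dynamodb")

    MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # assumed Bedrock model ID

    PROMPT = (
        "You are auditing a legal bill against a law firm's billing rules. "
        'Respond with JSON only: {{"noncompliant": bool, "line_items": [...]}}.\n\n'
        "Billing rules:\n{rules}\n\nLegal bill:\n{bill}"
    )

    def handler(event, context):
        bills = dynamodb.Table("legal-bills")             # hypothetical table names
        rules = dynamodb.Table("billing-rules")
        results = dynamodb.Table("noncompliant-bills")

        bill = bills.get_item(Key={"bill_id": event["bill_id"]})["Item"]
        firm_rules = rules.get_item(Key={"firm_id": bill["firm_id"]})["Item"]

        body = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 2048,
            "messages": [{
                "role": "user",
                "content": PROMPT.format(rules=firm_rules["rules_text"],
                                         bill=json.dumps(bill["line_items"], default=str)),
            }],
        }
        response = bedrock.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
        # Assumes the prompt successfully constrains the model to return pure JSON.
        verdict = json.loads(json.loads(response["body"].read())["content"][0]["text"])

        if verdict["noncompliant"]:
            results.put_item(Item={"bill_id": event["bill_id"],
                                   "verdict": json.dumps(verdict)})
        return verdict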

Pharmaceutical Client

Client wanted a full inventory of AWS resources across all accounts and regions, refreshed daily. The inventory would be used to determine compliance with certain requirements, with noncompliant resources remediated automatically.

  • Designed a Python application to scan for resources in all AWS accounts and regions, check resources against a set of compliance rules, automatically perform remediations on resources failing compliance checks, and report on what actions were performed.

  • Solution was deployed as a State Machine in AWS Step Functions with the same Python application executing in the Scan state, the Evaluation state, and the Remediation state. Most work was performed in Parallel tasks so scanning for VPCs, for example, could be run concurrently in all accounts and all regions. Inventoried resources are persisted in ElastiCache (Redis).

  • Remediations included such actions as:

    • Enabling missing VPC flow logs and Transit Gateway flow logs

    • Enabling missing Route 53 query logging for hosted zones

    • Applying missing tags to EC2 instances

    • Enabling AWS Backup for unprotected resources

    • Ensuring versioning and encryption are enabled for all S3 buckets

  • Application architecture was extensible through custom plugins. Adding a new remediation was as simple as writing a new Python class. The application would automatically discover the new plugin and incorporate its functionality.

  • Inventory reports are generated in Markdown, text, CSV, and HTML formats.

  • Wrote extensive architectural and operational documentation (100+ pages) detailing the design and use of the Python application.

  • Spent several weeks mentoring and supervising engineers upgrading and deploying new releases of the application.

The application now runs every night in under 1 hour and generates a comprehensive inventory report which is sent to management. Noncompliant resources are automatically remediated. The inventory is used by other teams – Security, mostly – to easily perform organization-wide audits. The application was so successful that a dedicated internal team was created to manage and extend its functionality.
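
The plugin mechanism mentioned above can be sketched roughly like this; the package layout, base class, and the example remediation are illustrative, not the client's actual code.

    # Rough sketch of the remediation plugin mechanism (names are illustrative).
    import importlib
    import pkgutil


    class Remediation:
        """Base class every remediation plugin derives from."""
        resource_type = None               # e.g. "vpc", "s3_bucket"

        def applies_to(self, resource) -> bool:
            raise NotImplementedError

        def remediate(self, resource) -> str:
            raise NotImplementedError


    def discover_plugins(package="remediations"):
        """Import every module in the plugins package, then collect the subclasses."""
        pkg = importlib.import_module(package)
        for mod in pkgutil.iter_modules(pkg.__path__):
            importlib.import_module(f"{package}.{mod.name}")
        return [cls() for cls in Remediation.__subclasses__()]


    # Adding a new remediation is just another subclass dropped into the package:
    class EnableVpcFlowLogs(Remediation):
        resource_type = "vpc"

        def applies_to(self, resource):
            return not resource.get("flow_logs_enabled", False)

        def remediate(self, resource):
            # the boto3 call to create the missing flow log would go here
            return f"Enabled flow logs for {resource['vpc_id']}"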

Life Sciences Client

Client had implemented a complex 80-step workflow of Lambda functions, containers, and EMR jobs using a combination of S3 events, SQS queues, and SNS topics. Recommended that the workflow be redesigned using Step Functions to improve maintainability, extensibility, and reliability.

  • Worked with Data Science team and engineers to identify logical groups of workflow steps. Implemented each group in Step Functions as a State containing multiple Tasks.

  • Identified steps which could be executed in a Parallel task, improving overall execution time.

  • Implemented robust error handling, including retries with exponential backoff (see the representative Task definition below), and detailed logging.

  • Eliminated all SQS queues and SNS topics.

  • Created a Terraform configuration and a dozen modules to deploy the State Machine and all States and Tasks. Configuration used the remote state of an existing Terraform configuration to import ARNs of Lambdas and other resources.

  • Provided LucidChart diagrams to document all States and Tasks.

  • Trained the Data Science team and engineers on Step Functions to help them better modify the workflow in the future.

The original, somewhat unwieldy, workflow became vastly more maintainable and extensible. Failed steps could be retried, where previously the entire workflow needed to be rerun. Team members were automatically notified when the workflow stalled, allowing them to quickly resolve the issue and continue the work.
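
A representative Task state with the retry and failure-notification behavior described above might look like the following. The state names, Lambda ARN, and thresholds are placeholders; in this engagement the states were defined through Terraform, but the Amazon States Language structure is the same.

    # Representative Task state with exponential-backoff retries and a failure catch
    # (state names, the Lambda ARN, and the thresholds are placeholders). The full
    # definition is serialized with json.dumps() when the state machine is created.
    task_state = {
        "NormalizeSamples": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:normalize-samples",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed", "Lambda.ServiceException"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 4,
                    "BackoffRate": 2.0,            # 5s, 10s, 20s, 40s between attempts
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "Next": "NotifyTeamOfStall",   # alert the team so work can resume
                }
            ],
            "Next": "AggregateResults",
        }
    }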

Restaurant PoS Client

AWS requested a Migration Acceleration Program (MAP) assessment for an existing customer. A MAP assessment determines a customer’s cloud readiness and outlines a migration plan for leveraging AWS resources over multiple years based on the customer’s growth expectations.

  • Worked with customer to understand current architecture, points of failure, and operational complexities. Initial state consisted of an EKS cluster in AWS, an Aurora Postgres cluster, SQS queues, Lambda functions, and an on-prem EKS cluster.

  • Designed a multi-phase approach to improve governance and compliance, availability, scalability, and cost-effectiveness:

    • Phase 1: Implemented multi-account AWS best practices in AWS Organizations and Control Tower, including creating a Landing Zone and standard OUs and moving existing accounts into the new structure.

    • Phase 2: Eliminated a bottleneck in which a FIFO queue would inefficiently enqueue multiple identical messages by leveraging WebSockets and API Gateway.

    • Phase 3: Extended the use of API Gateway to eliminate direct Internet access to an SQS queue, improving security.

    • Phase 4: Replaced regional SQS queues altogether with DynamoDB global tables and streams, improving scalability and preparing for higher availability across a new region.

    • Phase 5: Migrated away from the expensive Aurora Postgres cluster to DynamoDB global tables (the schema supported this) leading to improved performance, lower costs, and better scalability and availability options.

    • Phase 6: Enabled a 2nd region for all DynamoDB global tables and duplicated other infrastructure into the new region, providing a truly highly available infrastructure spanning 2 regions with the option to add others. Added CloudFront distributions in front of API Gateway to reduce network latency, along with Route 53 latency-based routing and health checks.

The MAP assessment was completed successfully, and the customer had an actionable plan to cost-effectively scale their cloud infrastructure for at least 10 years while also improving availability. The expected monthly cost after Phase 6 was less than the customer’s current AWS bill due to using serverless and capacity-based resources.

Pharmaceutical Client

Client needed to rapidly backup and isolate 2 petabytes of data spread across 600+ S3 buckets in over 100 AWS accounts and 14 regions in support of urgent business continuity requirements.

  • Decided to automate all of the work given the repetitive nature of applying the same changes across many S3 buckets.

  • Wrote Python scripts to do the following:

    • Inventory all S3 buckets in all AWS accounts in all active regions.

    • Apply versioning and encryption to all source S3 buckets.

    • Create S3 Batch Operations jobs to encrypt all existing unencrypted objects in all source buckets.

    • Create a corresponding destination bucket in S3 for every source bucket. The destination buckets were in an isolated AWS account, outside of the primary AWS Organization, in an unused region.

    • Apply versioning and encryption to all destination buckets.

    • Create a lifecycle policy on all destination buckets to move objects to S3 Glacier Deep Archive after 180 days.

    • Enable Object Lock in Compliance mode with a retention period of 90 days for all destination buckets. A separate script extended the retention period after 90 days.

    • Enable S3 replication from all source buckets to all destination buckets.

    • Create S3 Batch Replication jobs for every source bucket to replicate all existing objects to their corresponding destination bucket.

  • The scripts were designed to be idempotent so they could be rerun after a failure at any step. For example, when creating a destination bucket for a source bucket, the bucket was only created if it didn't already exist (see the sketch following this engagement).

All data was successfully replicated over 50 hours, incurring a total client-approved cost of over $100,000, inclusive of data transfer costs, Object Lock costs, and other AWS charges.
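
Two of the scripted steps above, idempotent destination-bucket creation and replication setup, might look roughly like this. The bucket naming convention, regions, and replication role are placeholders, and cross-account credential handling is omitted for brevity.

    # Sketch of the idempotent destination-bucket and replication steps
    # (naming convention, regions, and role ARN are placeholders; cross-account
    # credentials and per-region source clients are omitted for brevity).
    import boto3
    from botocore.exceptions import ClientError

    SOURCE_REGION = "us-east-1"                            # example
    DEST_REGION = "us-west-1"                              # isolated, unused region (example)
    s3_src = boto3.client("s3", region_name=SOURCE_REGION)
    s3_dst = boto3.client("s3", region_name=DEST_REGION)

    def ensure_destination_bucket(source_bucket: str) -> str:
        dest = f"{source_bucket}-bc-replica"               # hypothetical naming convention
        try:
            s3_dst.head_bucket(Bucket=dest)                # already exists: nothing to do
        except ClientError as err:
            if err.response["Error"]["Code"] != "404":
                raise
            s3_dst.create_bucket(
                Bucket=dest,
                CreateBucketConfiguration={"LocationConstraint": DEST_REGION},
                ObjectLockEnabledForBucket=True,
            )
        s3_dst.put_bucket_versioning(Bucket=dest,
                                     VersioningConfiguration={"Status": "Enabled"})
        s3_dst.put_bucket_encryption(
            Bucket=dest,
            ServerSideEncryptionConfiguration={
                "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
            },
        )
        return dest

    def enable_replication(source_bucket: str, dest_bucket: str, role_arn: str):
        s3_src.put_bucket_replication(
            Bucket=source_bucket,
            ReplicationConfiguration={
                "Role": role_arn,
                "Rules": [{
                    "ID": "bc-replication",
                    "Status": "Enabled",
                    "Priority": 1,
                    "Filter": {},
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
                }],
            },
        )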

GOVERNANCE AND COMPLIANCE ENGAGEMENTS

Implemented AWS best practices in multi-account management. (6x)

  • Enabled AWS Organizations and designed an OU structure consistent with AWS best practices, accounting for any client-specific requirements.

  • Set up the Control Tower Landing Zone and enrolled all existing accounts in Control Tower. Resolved any errors which occurred during enrollment.

  • Worked with client to evaluate AWS Conformance Packs for compliance standards (HIPAA, GDPR, GxP, NIST, PCI-DSS), selecting a subset of guardrails to apply.

  • Used Python scripts to inventory AWS resources across accounts and regions to predetermine compliance with proposed guardrails, then selectively applied guardrails to compliant accounts (a sketch of the inventory pattern follows this list).

  • Federated client’s 3rd party IdP (Active Directory in Azure, Okta) to AWS SSO (now IAM Identity Center).

  • Created a Terraform configuration to manage AWS SSO:

    • Created groups in AWS SSO and mapped users to groups.

    • Created SSO permission sets and mapped permission sets to groups and accounts.

    • Created and applied Service Control Policies to OUs.

  • Configured delegated administrator accounts for selected AWS services. For example, GuardDuty and Security Hub are usually delegated to the Audit account.

  • Applied tagging policies to simplify cost allocation and ownership tracking.

  • Documented naming conventions for accounts and resources. Provided extensive documentation on OU structure, guardrails, and compliance requirements.

  • Provided final audit of compliance standards to executive leadership.
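
The cross-account inventory scripts referenced above followed a pattern along these lines; the audit role name and the VPC flow log check are illustrative, not the actual guardrail set.

    # Sketch of the cross-account, cross-region inventory pattern
    # (role name and the sample check are illustrative).
    import boto3

    org = boto3.client("organizations")
    sts = boto3.client("sts")

    def session_for(account_id: str) -> boto3.Session:
        creds = sts.assume_role(
            RoleArn=f"arn:aws:iam::{account_id}:role/InventoryAudit",   # assumed role name
            RoleSessionName="compliance-inventory",
        )["Credentials"]
        return boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )

    def audit_vpc_flow_logs():
        findings = []
        for page in org.get_paginator("list_accounts").paginate():
            for account in page["Accounts"]:
                session = session_for(account["Id"])
                regions = session.client("ec2", region_name="us-east-1").describe_regions()
                for region in regions["Regions"]:
                    ec2 = session.client("ec2", region_name=region["RegionName"])
                    logged = {f["ResourceId"] for f in ec2.describe_flow_logs()["FlowLogs"]}
                    for vpc in ec2.describe_vpcs()["Vpcs"]:
                        findings.append({
                            "account": account["Id"],
                            "region": region["RegionName"],
                            "vpc": vpc["VpcId"],
                            "compliant": vpc["VpcId"] in logged,
                        })
        return findings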

Implemented AWS best practices in centralized logging. (5x)

  • Enabled organization CloudTrail logging to S3 in Log Archive account.

  • Created a Terraform configuration to do the following:

    • Create regional S3 buckets in Log Archive account for additional logs (ALB, CloudFront, etc.) with the appropriate bucket policies.

    • Create Kinesis Firehose resources for logs which require them: API Gateway, CloudWatch log forwarding, etc.

    • Create S3 events per S3 bucket to send messages to an SQS queue monitored by a SIEM. Each bucket policy includes permissions for the SIEM to retrieve all objects.

  • Refactored existing modules and configurations in Terraform to send logs to the appropriate S3 bucket in the Log Archive account, per region.

  • Worked with teams across the organization to configure existing infrastructure to start forwarding logs to the appropriate S3 bucket in the Log Archive account.

  • Provided LucidChart diagrams and documentation for all buckets and log delivery flows.
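
The per-bucket event wiring was provisioned through the Terraform configuration; purely for illustration, each bucket's notification is equivalent to a boto3 call like the following, with the queue ARN and bucket name as placeholders.

    # Equivalent of the Terraform-provisioned S3 event notification (placeholders only).
    import boto3

    s3 = boto3.client("s3")

    def notify_siem_queue(bucket: str, queue_arn: str):
        """Send an SQS message for every object delivered to the log bucket."""
        s3.put_bucket_notification_configuration(
            Bucket=bucket,
            NotificationConfiguration={
                "QueueConfigurations": [{
                    "Id": "siem-ingest",
                    "QueueArn": queue_arn,
                    "Events": ["s3:ObjectCreated:*"],
                }]
            },
        )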

Implemented automated (self-service) AWS account provisioning using Terraform with Jira integration. (2x)

  • Created a custom Issue Type in client's Jira environment. Added fields to capture account name, OU, region, VPC CIDR (when no AWS IPAM is available), and optional features such as additional routes or Active Directory support.

  • Created a custom workflow with multiple states and transitions, allowing an Issue to be approved and then to update the state of the Issue as the account progresses through creation, enrollment, and infrastructure provisioning.

  • Created automations in Jira to interact with AWS (through SNS) and start or restart the account provisioning.

  • Created several Lambda functions in Python which do the following:

    • Make an HTTP callback to a Jira webhook to update the state of the Issue.

    • Receive the account details from the Jira Issue and start the account creation and enrollment process.

    • Write the account details to Parameter Store for use by various steps in the provisioning workflow.

    • Provision an infrastructure automation IAM role in the newly provisioned account.

  • Created a containerized version of Terraform which can fetch Terraform configurations from GitHub and apply them to the new account.

  • Created and deployed a Terraform configuration to provision the Lambda functions and associated CICD pipelines, AWS Step Functions state machines, EventBridge resources for acting on enrollment events, ECS infrastructure, Service Catalog resources, CloudWatch log groups and alarms, and numerous IAM roles and other required resources.

  • Provided extensive documentation and LucidChart diagrams for all parts of the solution.
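
Two of the Lambda steps above, the Jira status callback and the Parameter Store write, might look roughly like this; the webhook URL, parameter path, and field names are placeholders.

    # Rough sketch of two provisioning Lambdas (URL, paths, and fields are placeholders).
    import json
    import os
    import urllib.request

    import boto3

    ssm = boto3.client("ssm")

    def update_jira_status(event, context):
        """Call back to the Jira automation webhook to transition the Issue."""
        payload = json.dumps({
            "issueKey": event["issue_key"],
            "status": event["new_status"],            # e.g. "Account Created"
        }).encode()
        req = urllib.request.Request(
            os.environ["JIRA_WEBHOOK_URL"],           # hypothetical environment variable
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return {"statusCode": resp.status}

    def store_account_request(event, context):
        """Persist the requested account details for later steps in the workflow."""
        details = event["account_request"]            # name, OU, region, CIDR, options
        ssm.put_parameter(
            Name=f"/account-provisioning/{details['account_name']}",
            Value=json.dumps(details),
            Type="String",
            Overwrite=True,
        )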

AI/DATA ENGINEERING ENGAGEMENTS

Created data pipelines to populate a security-oriented data lake enabling real-time analytics and alerts. (2x)

  • Ingested event data from CloudTrail and S3 and aggregated it into S3 buckets to preserve original data.

  • Ingestion triggered various Lambda functions which transformed the native structures into Parquet format and stored the transformed data in the analytics data lake.

  • Real-time analysis of the data in the data lake revealed patterns in user access to S3 objects and other AWS resources. Suspicious activity would trigger alerts to the Security team.

  • Wrote all Lambda functions used in the various pipelines in Python.

  • Created Terraform configurations to provision all infrastructure, including the Lambda functions and associated CI/CD.
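
A simplified version of the transform step in these pipelines is sketched below; the lake bucket name, partitioning scheme, and use of pandas/pyarrow in a Lambda layer are assumptions made for illustration.

    # Simplified transform Lambda: raw CloudTrail delivery -> Parquet in the data lake
    # (bucket name and partitioning are illustrative).
    import gzip
    import json

    import boto3
    import pandas as pd                    # assumed to be packaged in a Lambda layer

    s3 = boto3.client("s3")
    LAKE_BUCKET = "security-data-lake"     # hypothetical

    def handler(event, context):
        for record in event["Records"]:    # S3 event notification
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]

            raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            events = json.loads(gzip.decompress(raw))["Records"]

            df = pd.json_normalize(events)
            df["event_date"] = pd.to_datetime(df["eventTime"]).dt.date.astype(str)

            for event_date, part in df.groupby("event_date"):
                out_key = f"cloudtrail/event_date={event_date}/{key.split('/')[-1]}.parquet"
                part.to_parquet("/tmp/chunk.parquet", index=False)   # requires pyarrow
                s3.upload_file("/tmp/chunk.parquet", LAKE_BUCKET, out_key)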

Designed and deployed real-time financial data ingestion and transformation, model training, and inference.

  • Worked with quantitative analysts to identify and map all sources of relevant financial data.

  • Implemented the feature stores and selected features to be used in the predictive model. Focused on minimizing features to improve model performance.

  • Designed streaming infrastructure (Kinesis) to ingest data from the Bloomberg API and other sources. WebSocket subscriptions were used primarily to facilitate real-time ingestion over persistent, low-latency network connections. Updated the online feature store in real-time (sketched below).

  • Performed sentiment analysis (AWS Comprehend, various FMs) in real-time from textual data obtained from web scraping, PDFs, press releases, and news articles.

  • Worked with analysts to train the model and deployed training jobs in SageMaker running on NVIDIA A100 compute resources. Batch training was done with historical data, and online training (fine-tuning) was performed using online feature store data as embeddings. Continual online training time was reduced to approximately 60 seconds.

  • Provided an inference endpoint exposed as a WebSocket (socket.io) with continual prediction streams. Subscribed client applications were able to display up-to-date predictions of trends in real-time.

  • Designed and deployed all resources of this event-driven architecture. Created Terraform configurations to provision all infrastructure.
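
The real-time feature store update can be sketched as follows; the feature group, feature names, and Kinesis payload shape are placeholders rather than the production schema.

    # Sketch of the online feature store update from the Kinesis stream
    # (feature group, feature names, and payload shape are placeholders).
    import base64
    import json
    from datetime import datetime, timezone

    import boto3

    featurestore = boto3.client("sagemaker-featurestore-runtime")
    FEATURE_GROUP = "equity-ticks-online"       # hypothetical

    def handler(event, context):
        for record in event["Records"]:         # Kinesis event source mapping
            tick = json.loads(base64.b64decode(record["kinesis"]["data"]))
            featurestore.put_record(
                FeatureGroupName=FEATURE_GROUP,
                Record=[
                    {"FeatureName": "symbol", "ValueAsString": tick["symbol"]},
                    {"FeatureName": "price", "ValueAsString": str(tick["price"])},
                    {"FeatureName": "event_time",
                     "ValueAsString": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")},
                ],
            )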

Deployed an AWS Rekognition pipeline to flag offensive content.

  • Worked with the client’s engineering team to create several .NET Lambda functions which do the following:

    • Copy the file from a source S3 bucket to a scratch bucket.

    • Preprocess the file into multiple files, if needed: a multi-page PDF is processed into one image per page in the scratch bucket.

    • Make API calls to AWS Rekognition to determine whether an image is offensive or not.

    • Flag files which have been deemed offensive.

  • Created a Terraform configuration to provision the source and scratch S3 buckets, the Lambda functions, CICD pipelines for the Lambda functions, and a State Machine in AWS Step Functions to orchestrate the execution of each step in the workflow.

  • Provided a LucidChart diagram showing the infrastructure and the states of the State Machine.
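
The engagement's Lambda functions were written in .NET; purely for illustration, the moderation check at the core of the workflow corresponds to a Python call like the following, with the confidence threshold and tagging scheme as placeholders.

    # Illustrative moderation check (the engagement's Lambdas were .NET; this is the
    # equivalent Python call, with the threshold and tagging scheme as placeholders).
    import boto3

    rekognition = boto3.client("rekognition")
    s3 = boto3.client("s3")

    def is_offensive(bucket: str, key: str, min_confidence: float = 80.0) -> bool:
        response = rekognition.detect_moderation_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MinConfidence=min_confidence,
        )
        return bool(response["ModerationLabels"])

    def flag_if_offensive(bucket: str, key: str):
        if is_offensive(bucket, key):
            # tag the object so downstream steps and reviewers can find it
            s3.put_object_tagging(
                Bucket=bucket, Key=key,
                Tagging={"TagSet": [{"Key": "moderation", "Value": "flagged"}]},
            )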

CLOUDOPS/DEVOPS ENGAGEMENTS

Implemented cost-effective multi-account, multi-region networking. (3x)

  • Following AWS best practices, created a Shared Services account and environment-specific Networking accounts under the Infrastructure OU.

  • Reviewed and documented current use of IP space and proposed new CIDR ranges per region, account, and OU.

  • Created a Terraform configuration to create IPAM pools in the Shared Services account.

  • Created a Terraform configuration to provision infrastructure in the environment-specific Networking accounts. This configuration includes a Transit Gateway and flow logs, an egress VPC, and all associated routes, per region.

  • Configured peering between the regional Transit Gateways. Configured peering with other Transit Gateways as needed.

  • Created a baseline account configuration for all other accounts which creates a VPC and networking resources and attaches the VPC to the TGW in the environment-specific Networking account. This design leverages the egress VPC in the environment-specific Networking account and avoids provisioning excess NAT gateways, saving significant costs.

  • Provided detailed LucidChart diagrams illustrating the network design and traffic flows.

Containerized an existing web service and provisioned associated infrastructure in ECS. (>20x)

  • Worked with client’s engineering team to select a base image and create a Dockerfile for containerizing the web service.

  • Created a Terraform configuration to provision an ECS cluster, a service in ECS for each web service, task definitions, and the Application Load Balancer and associated target groups, listeners and rules, and SSL certificates. Included CloudWatch log groups and alarms for critical metrics.

  • Created a Terraform configuration to provision an ECR repository and a CICD pipeline for building the container image, pushing it to the ECR repository, and updating the associated ECS service.

  • Mentored the engineering team on using Docker and building future applications and services in Docker.

Deployed an existing load-balanced web service behind API Gateway and CloudFront. (3x)

  • Worked with client’s engineering team to identify web service endpoints and map them to routes for API Gateway.

  • Created a Terraform configuration to provision the API in API Gateway, including all routes and custom domains.

  • Created a Terraform configuration to provision passthrough CloudFront distributions for all domains using the API Gateway as origin. CloudFront is used in this case purely for reducing API latency and no caching is enabled.

  • Provided extensive documentation and diagrams for the final infrastructure.

Migrated an existing web service (lift & shift) to a new AWS Workloads account. (4x)

  • Worked with client’s engineering team to identify all infrastructure resources associated with the migrating service.

  • Created a Terraform configuration to provision all resources in the new Workloads account.

  • Created a Terraform configuration to provision a CICD pipeline for building and deploying the service in the new account.

  • Worked with the engineering team to migrate production to the new service and to decommission the old infrastructure.

Recreated an organization’s entire infrastructure using Terraform.

  • Reviewed all existing infrastructure and documented resources using LucidChart diagrams. Documented specific resource attributes in Confluence. Logically grouped infrastructure by application and service.

  • Over 900 resources were identified, including S3 buckets and policies, SQS queues and triggers, SNS topics and subscriptions, EC2 instances, load balancers, Lambda functions, ECS Fargate containers, DynamoDB tables, Amazon EMR workloads, Kinesis data streams, CloudWatch logs, and many others.

  • Led the CloudOps team in creating numerous Terraform configurations and modules to reproduce the existing infrastructure.

  • Managed the deployment of all configurations and decommissioned all old infrastructure post-migration.

Migrated from monolithic Terraform to modern Terraservices model.

  • Documented all application and service-specific resources. Identified common infrastructure which could be provisioned using Terraform modules.

  • Led the CloudOps team in creating Terraform modules for common infrastructure and creating and deploying a Terraform configuration for each application or service, leveraging the new Terraform modules.

  • New infrastructure (over 3,000 resources) was deployed side-by-side with existing infrastructure and migration to new infrastructure was achieved by changing DNS entries in Route 53 using Terraform.

  • Managed the decommissioning process of replaced infrastructure.

  • Mentored the CloudOps team on writing Terraform following the Terraservices model.

Refactored and centralized CICD pipelines from multiple 3rd parties (CircleCI, Jenkins, and GitLab) into a common, environment-specific CICD solution built on AWS services.

  • Created separate AWS accounts under the Deployments OU, per environment.

  • Created and deployed a Terraform configuration which provisioned required resources in each Deployments account, including IAM roles, cross-account policies, S3 buckets for artifacts, CloudWatch log groups and alarms, and others.

  • Created a set of Terraform modules to be leveraged by future configurations to provision CodePipeline, CodeDeploy, and CodeBuild resources along with associated IAM roles with cross-account permissions.

  • Refactored numerous build scripts from CircleCI, Jenkins, and GitLab into corresponding workflows using AWS resources.

  • Updated existing applications and services to use new CICD pipelines in the Deployments accounts.

  • Provided LucidChart diagrams and documentation for the new CICD infrastructure and mentored developers on the new CICD process.