# cyber professional (@permadeath.com)

Profile: https://sifa.id/p/permadeath.com
Headline: Infrastructure and security for AI and LLM research

## About

I’ve been a founding engineer, as well as infrastructure and security lead, for multiple AI and ML startups. I work directly with ML teams, AI researchers, and academic scientists to accelerate their development and get their work deployed to users.

My next role will focus on strategic technical work at a fast-moving and engineering-centric company. I’m particularly interested in working on developer velocity in distributed systems, AI agent governance and security, and scientific Python builds and tooling.

## Experience

- **Roboticist (MLOps, DevOps, and Build) at RAI Institute** (2024 – present)
  • Delivered platform engineering services at a hardware+software neolab with 10+ independent research teams, including sourcing GPU compute, supporting multi-arch and CUDA builds, writing microservices in Python, and owning observability and security.
  • Obtained executive and legal approval to streamline AI agent adoption, add new models and MCP servers, and clarify guidelines for safely scaling AI-written code to 50% of all code org-wide.
  • Authored migration strategy for a C++/Python/ROS2 robotics monorepo to achieve 10-100x faster builds and a 90% reduction in build failures. Embedded with newly created Build team for 6 months to recruit key hires, develop a Bazel MVP, and secure researcher adoption.
  • Provisioned and operated Kubernetes AI training clusters (35,000 CPUs and 500 GPUs across cloud providers and on-prem). Provided on-call support for researcher workloads such as distributed training in Ray, Metaflow pipelines, and self-hosted GitHub Actions CI runners.
  • Owned high-urgency DevOps projects including cross-company reinforcement learning collaboration, security incident response, and EU data privacy compliance.
  • Earned a HackerOne bounty for discovering an access control bypass in GitHub Actions.
- **Founding Engineer (DevOps, Security, and Infrastructure) at Tome** (2022 – 2023)
  • Built backend infrastructure for automated analysis of venture capital investment contracts, including job scheduling, sandboxing, data storage, and failure handling.
  • Established continuous delivery and observability standards across all TensorFlow/Keras inference serving, Python API services, and data pipelines.
  • Obtained SOC 2 for a seed-stage LLM legaltech startup with a from-scratch compliance program including technical controls, penetration testing, and evidence collection.
  • Closed "early adopter" deals with law firm security and risk teams as technical sales lead, enabling private model training on proprietary contract data.
- **Principal Infrastructure Engineer, Fidelity Labs at Fidelity Investments** (2020 – 2022)
  • Architected AWS SageMaker research environments, Snowflake data warehouse with PII deidentification, and Elasticsearch cluster with integrated sales conversion ML rankings.
  • Secured enterprise-wide approval for Terraform as an infrastructure-as-code tool and coached other teams on IaC adoption and best practices.
  • Increased production deploy frequency 10x with containerized CI/CD on Jenkins/Kubernetes.
- **Principal Infrastructure Engineer at Catchlight** (2020 – 2022)
- **Site Reliability Engineer at Quantopian** (2017 – 2019)
- **Senior Site Reliability Engineer at Quantopian Inc. (acquired by Robinhood)** (2019 – 2020)
  • Eliminated 90% of overnight PagerDuty incidents by migrating hedge fund trading from cron scheduling to Apache Airflow running on Kubernetes pods.
  • Refactored trading simulator from Python multi-tenant monolith to run as per-algorithm stateless containers on Kubernetes with isolated blast radius and 50% reduction in cost.
  • Designed on-demand cloud developer environments using Kubernetes + Helm + Buildkite.
- **Site Reliability Engineer at Harvard University** (2015 – 2017)
  • Wrote a Python framework for high-reliability, high-concurrency statistical processing of MRI brain scan data on HPC hardware (SLURM).
  • Architected an expanded data pipeline system to cope with the new demands of researchers collecting higher-density data such as phone logs and GPS coordinates.
  • Replaced an existing cron job system with improved scheduling options, including on-demand and backfill jobs.

## Skills

- Open Source Development
- CI
- Amazon Relational Database Service (RDS)
- Apache Airflow
- Large Language Models (LLM)
- Infrastructure as code (IaC)
- Bazel
- Kubernetes
- Software Development
- Amazon EC2
- Scalability
- Distributed Systems
- Data Engineering
- MLOps
- High Performance Computing (HPC)
- Infrastructure
- Datadog
- Python
- Site Reliability Engineering
- Writing
- Continuous Delivery
- Machine Learning
- Amazon Web Services (AWS)
- Amazon RDS
- Amazon S3
- Git
- Linux
- Open Source
- Open Source Software
- SQL
- Docker
- Elasticsearch
- Ansible
- Python (Programming Language)
- Bash
- Terraform
- Jenkins
- Amazon ECS
- Amazon Athena
- Conference Organization
- Continuous Integration and Continuous Delivery (CI/CD)
- DevOps
- Security
- Cloud Computing
- Programming
- Artificial Intelligence (AI)
- Conference Speaking
- Engineering Management
- Continuous Integration
- Technical Writing

## Languages

- Chinese (elementary)
- English (native)
- Spanish (limited working)