Playfriends

MetaverseGo

app.playfriends.gg
NestJS · Node.js · AWS CDK · ECS Fargate · Agora · Web3 · SQS · Firebase · Solana · EVM

Overview

Playfriends is a Web3 SocialFi platform where users join live voice and video lobbies hosted by creators. Each lobby supports a main host and up to eight co-hosts, with guests able to interact through real-time chat and virtual gifting. Guests can book co-hosts privately, subscribe to exclusive content, or purchase services like casual conversation or in-game companionship, all transacted in platform diamonds or on-chain tokens.

Built by MetaverseGo, Playfriends reached over 1M unique users with a 64% day-one retention rate and a revenue mix spanning lobby gifting, private bookings, gaming, and cosmetics.

My Role

As IT Manager at MetaverseGo, I was responsible for the core API layer, third-party integrations, and cloud infrastructure. This covered two distinct services: a NestJS core API handling auth, payments, blockchain, and user management, and a separate Node.js lobby service handling everything inside a live session: room state, real-time events, gifting, bookings, and the Agora audio/video layer. I also owned the full AWS infrastructure using CDK as the IaC layer.

Agora: Live Audio and Video

Every lobby runs on Agora RTC. The lobby service is responsible for issuing tokens, tracking presence, and reacting to Agora's webhook events in real time.

Role-Scoped Token Lifecycle

Tokens are issued per role with distinct privilege sets and lifespans. A host token grants full publish privileges (audio, video, data stream) and is valid for 8 hours. Co-host tokens carry the same publish capabilities but refresh every 2 hours. Audience tokens carry zero publish privileges and expire after 1 hour. The distinction matters: audience members can only receive streams, they cannot inject audio or video into the channel regardless of client-side behavior.

Each token generation produces both a short-lived access token (the Agora RTC token passed to the SDK) and an encrypted refresh token stored server-side. The refresh token carries the channel and role context, encrypted with a server-held secret, so the server can re-issue an access token without the user re-authenticating.
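The role-scoped lifecycle and encrypted refresh token can be sketched as below. This is a minimal stand-in, not the production service: the role policy mirrors the lifespans described above, while the AES-GCM wrapping of channel/role context uses Node's built-in crypto (the actual Agora RTC access token would be built separately with Agora's token builder SDK). All names here are illustrative.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

type LobbyRole = "host" | "cohost" | "audience";

// Role → lifespan and publish privileges, mirroring the policy described above.
const ROLE_POLICY: Record<LobbyRole, { ttlSeconds: number; canPublish: boolean }> = {
  host: { ttlSeconds: 8 * 3600, canPublish: true },
  cohost: { ttlSeconds: 2 * 3600, canPublish: true },
  audience: { ttlSeconds: 3600, canPublish: false },
};

const SECRET = randomBytes(32); // server-held key; in production, loaded from a secret store

// Encrypt channel + role context into an opaque refresh token (AES-256-GCM).
function issueRefreshToken(channel: string, uid: string, role: LobbyRole): string {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", SECRET, iv);
  const payload = JSON.stringify({
    channel,
    uid,
    role,
    exp: Date.now() + ROLE_POLICY[role].ttlSeconds * 1000,
  });
  const enc = Buffer.concat([cipher.update(payload, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), enc]).toString("base64url");
}

// Decrypt and validate; returns the context needed to re-issue an RTC token
// without the user re-authenticating.
function readRefreshToken(token: string): { channel: string; uid: string; role: LobbyRole } {
  const raw = Buffer.from(token, "base64url");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const decipher = createDecipheriv("aes-256-gcm", SECRET, iv);
  decipher.setAuthTag(tag);
  const payload = JSON.parse(
    Buffer.concat([decipher.update(raw.subarray(28)), decipher.final()]).toString("utf8"),
  );
  if (payload.exp < Date.now()) throw new Error("refresh token expired");
  return payload;
}
```

Because the token is authenticated (GCM), a tampered refresh token fails decryption outright rather than yielding a forged role.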

Webhook-Driven Presence and Grace Period

Rather than polling, lobby presence is driven entirely by Agora's server-to-server webhooks, verified by HMAC signature before processing. When a quit event arrives, the user is not immediately removed from Firebase. Instead, they are marked as leaving with the noticeId attached. After a 10-second grace period, the server checks whether the user is still in the leaving state with the same notice ID before removing them. This handles brief network drops without flashing the user as offline to everyone in the lobby.
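The grace-period check reduces to a small state machine. The sketch below uses an in-memory map as a stand-in for the Firebase presence paths; the key detail is that removal only happens if the user is still in the `leaving` state with the same `noticeId` when the timer fires.

```typescript
type Presence = { state: "online" | "leaving"; noticeId?: string };

// In-memory stand-in for the Firebase presence paths described above.
const presence = new Map<string, Presence>();

const GRACE_MS = 10_000;

// On an Agora "quit" webhook: mark the user as leaving, then remove them only
// if they are still leaving with the *same* noticeId once the grace period ends.
function onQuitEvent(uid: string, noticeId: string, graceMs = GRACE_MS): void {
  presence.set(uid, { state: "leaving", noticeId });
  setTimeout(() => {
    const p = presence.get(uid);
    if (p && p.state === "leaving" && p.noticeId === noticeId) {
      presence.delete(uid); // genuinely gone
    }
    // otherwise the user rejoined, or a newer quit event superseded this one
  }, graceMs);
}

// On a "join" webhook: the user came back before the grace period elapsed.
function onJoinEvent(uid: string): void {
  presence.set(uid, { state: "online" });
}
```

A brief network drop therefore produces a quit followed by a join inside the window, and no other client ever sees the user flicker offline.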

Cloud Recording

The lobby service integrates Agora's Cloud Recording API to capture sessions on demand. The flow acquires a recording resource, starts an individual-mode recording with a subscriber token, and pipes the output directly into a dedicated S3 bucket. Recordings are keyed under the lobby ID and are accessible post-session for moderation and content use.
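The acquire-then-start sequence can be illustrated as request builders. URL shapes and field names below follow Agora's Cloud Recording REST API as documented, but treat the specifics (vendor codes, region values, `APP_ID`, and the S3 fields) as assumptions to verify against the current reference before use.

```typescript
const APP_ID = "agora-app-id"; // hypothetical placeholder
const BASE = `https://api.agora.io/v1/apps/${APP_ID}/cloud_recording`;

// Step 1: acquire a recording resource for the channel.
function acquireRequest(lobbyId: string, recorderUid: string) {
  return {
    url: `${BASE}/acquire`,
    body: { cname: lobbyId, uid: recorderUid, clientRequest: {} },
  };
}

// Step 2: start an individual-mode recording that writes into S3.
function startRequest(
  lobbyId: string,
  recorderUid: string,
  resourceId: string,
  subscriberToken: string,
  s3: { bucket: string; accessKey: string; secretKey: string },
) {
  return {
    url: `${BASE}/resourceid/${resourceId}/mode/individual/start`,
    body: {
      cname: lobbyId,
      uid: recorderUid,
      clientRequest: {
        token: subscriberToken, // subscriber-privilege token: record-only, no publish
        storageConfig: {
          vendor: 1, // Amazon S3, per Agora's vendor codes (verify)
          region: 0, // numeric region code (verify against docs)
          bucket: s3.bucket,
          accessKey: s3.accessKey,
          secretKey: s3.secretKey,
          fileNamePrefix: ["recordings", lobbyId], // keyed under the lobby ID
        },
      },
    },
  };
}
```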

Blockchain and Web3 Payments

Playfriends supports crypto top-ups and withdrawals across multiple chains through a unified abstraction layer, with SQS-backed async processing to keep the payment flow non-blocking.

Multi-Chain Support

The blockchain layer supports EVM chains (Ethereum, Polygon, Abstract L2), Solana, and TON through a shared interface. Each runtime has its own provider adapter: Ethers.js for EVM, @solana/web3.js for Solana, and a custom TonApi wrapper for TON. The Web3Sharedservice resolves the correct adapter at runtime based on the network's execution environment, so the rest of the application doesn't need to branch on chain type.
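The dispatch pattern can be sketched as a shared interface plus a runtime registry. The adapters below are stubs (the real ones wrap Ethers.js, @solana/web3.js, and the TON client), and the network names are illustrative; the point is that callers never branch on chain type.

```typescript
// Shared interface every chain adapter implements.
interface ChainAdapter {
  getBalance(address: string): Promise<bigint>;
}

type Runtime = "evm" | "solana" | "ton";

// Each network maps to its execution environment.
const NETWORK_RUNTIME: Record<string, Runtime> = {
  ethereum: "evm",
  polygon: "evm",
  abstract: "evm",
  solana: "solana",
  ton: "ton",
};

// One adapter per runtime; stubbed here so the dispatch itself is visible.
const adapters: Record<Runtime, ChainAdapter> = {
  evm: { getBalance: async () => 0n },    // would delegate to Ethers.js
  solana: { getBalance: async () => 0n }, // would delegate to @solana/web3.js
  ton: { getBalance: async () => 0n },    // would delegate to the TON wrapper
};

// Callers pass a network name; the correct adapter is resolved at runtime.
function resolveAdapter(network: string): ChainAdapter {
  const runtime = NETWORK_RUNTIME[network.toLowerCase()];
  if (!runtime) throw new Error(`unsupported network: ${network}`);
  return adapters[runtime];
}
```

Adding a new EVM chain is then a one-line registry entry rather than a code change in every call site.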

Abstract Chain (L2)

The platform added support for the Abstract blockchain, a Layer 2 built on Ethereum using ZKsync's ZK Stack. Abstract enables faster, cheaper transactions while inheriting Ethereum's security. Users can top up using USDC, USDT, or ETH on the Abstract network via WalletConnect, with contract addresses managed through the service settings system in MongoDB rather than hardcoded in the application.

Custodial Wallets and $FRENS Token

Each user has a custodial wallet managed by the platform. The native $FRENS token gives holders access to exclusive gifts, discounts, and in-platform perks. Every transaction (virtual gifts, private bookings, gaming purchases) feeds into a buyback and burn mechanism that supports token holders. Fiat on-ramps are also supported through Stripe and Coinbase, with real-time rate conversion sourced from CoinMarketCap and DexScreener.

Infrastructure

The platform runs entirely on AWS, provisioned and managed through AWS CDK. Every environment (staging, pre-prod, prod) is reproducible from code, with no manual console changes in production.

CloudFront and AWS WAF

All traffic, API and frontend, is fronted by CloudFront. Beyond static asset caching at AWS edge locations globally, CloudFront serves as the first line of defense: it terminates TLS and absorbs connection-level noise before it ever reaches the origin. AWS WAF is attached directly to the CloudFront distribution as a web ACL, meaning every request is inspected at the edge before being forwarded to the ALB.

WAF rules cover AWS Managed Rule Groups (common threats, known bad inputs, IP reputation lists) alongside custom rate-based rules scoped to financial and transaction endpoints. Any request that violates a rule is blocked at the CloudFront layer, never touching the ALB or the ECS tasks behind it. The frontend is deployed on Amplify and also served through CloudFront, keeping the entire delivery surface behind the same WAF posture.
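A rate-based rule scoped to a route group looks roughly like the WAFv2 rule JSON below. The limit, path prefix, and metric name are illustrative assumptions, not the production values.

```json
{
  "Name": "rate-limit-financial-routes",
  "Priority": 10,
  "Action": { "Block": {} },
  "Statement": {
    "RateBasedStatement": {
      "Limit": 300,
      "AggregateKeyType": "IP",
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "FieldToMatch": { "UriPath": {} },
          "PositionalConstraint": "STARTS_WITH",
          "SearchString": "/api/payments",
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }]
        }
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "RateLimitFinancial"
  }
}
```

The scope-down statement is what lets financial routes carry a much lower threshold than the general API without a separate web ACL.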

ECS Fargate and Auto Scaling

Both services run as containerized workloads on ECS Fargate with no EC2 instances to manage. AWS handles the underlying compute, and tasks scale horizontally behind the ALB using target tracking scaling policies tied to CPU utilization. During traffic spikes, new Fargate tasks provision automatically within minutes; when load drops, they scale in just as cleanly. Each task is assigned a least-privilege IAM task role, keeping the blast radius tight if a service is ever compromised.

SQS FIFO Queues and Lambdas

Financially sensitive operations — gifting, Web3 wallet top-ups, referral reward credits, and on-chain transfer completions — are processed asynchronously through SQS FIFO queues. FIFO guarantees exactly-once processing within a deduplication window and strict ordering within a group, which is critical when crediting balances. Lambdas handle inbound webhooks from Stripe and Agora and run scheduled maintenance tasks, keeping that logic decoupled from the long-running ECS processes.
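A credit message for such a queue can be sketched as below. The queue URL and payload fields are placeholders; the essential choices are the group ID (ordering per user) and the deduplication ID derived from the transaction hash, which makes replays within the window no-ops.

```typescript
// SendMessage parameters for a FIFO-queued balance credit.
function creditMessageParams(userId: string, txHash: string, amount: number) {
  return {
    QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/credits.fifo", // placeholder
    MessageBody: JSON.stringify({ userId, txHash, amount }),
    MessageGroupId: userId,          // strict ordering within each user's credits
    MessageDeduplicationId: txHash,  // same tx hash → dropped within SQS's 5-minute dedup window
  };
}
```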

Networking and CI/CD

Both ECS services run in private VPC subnets with no direct inbound path from the internet. The ALB is the sole public entry point; outbound calls to external APIs route through a NAT Gateway for a fixed egress IP. Deployments run through CodePipeline and CodeDeploy — a branch merge builds the Docker image, pushes to ECR, and rolls out to ECS with zero downtime. Every environment is defined in CDK and reproducible from source.

Discord and Telegram

Both bots serve as identity bridges between external communities and the platform. The Telegram bot runs as a webhook-based Telegraf service and handles account linking via a time-limited OTP flow, plus push notifications for lobby activity. A custom in-memory rate limiter was built from scratch (60 req/min per user) since available npm packages for this were unmaintained. A Discord bot was designed for account verification via a !verify DM flow, linking a Discord ID to a Playfriends UID with duplicate-link protection on both sides.
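A limiter in the spirit of the custom one described above (60 requests per minute per user) can be written as a sliding window over timestamps. The window size and eviction strategy here are assumptions; a production version would also evict idle users to bound memory.

```typescript
// Minimal in-memory sliding-window rate limiter, keyed per user.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private limit = 60, private windowMs = 60_000) {}

  // Returns true if the request is allowed, false if the user is over the limit.
  allow(userId: string, now = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only hits inside the current window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(userId, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(userId, recent);
    return true;
  }
}
```

The `now` parameter is injectable purely to make the window behavior testable without real clock waits.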

Challenges and Solutions

Operating a live platform with over a million users, real-money transactions, and a public blockchain surface meant the hard problems arrived early and at scale. These are the four most significant incidents and what we did about them.

1. Sustained DDoS Attacks

At multiple points, the platform was hit by sustained volumetric attacks: floods of synthetic requests designed to saturate origin capacity and degrade service. The traffic patterns were coordinated and clearly not organic user behavior.

CloudFront absorbed connection-level floods at the edge before they reached the ALB. WAF rate-based rules throttled IPs generating abnormal request volumes, and we tightened thresholds on the most targeted endpoint groups. Geo-restriction rules were added for regions generating suspicious patterns, and the AWS IP Reputation managed rule group handled known malicious infrastructure automatically.

On the compute side, ECS auto-scaling meant the origin was not a fixed target. As synthetic load arrived, new Fargate tasks provisioned to absorb it, buying time for the WAF rules to identify and block offending traffic at the edge. The combination of edge-layer filtering and elastic compute kept the platform stable through each wave.

2. Brute-Force Attacks on Financial Endpoints

Malicious actors began systematically probing top-up, crediting, and blockchain transaction endpoints, attempting to exploit any gap in validation logic or replay already-confirmed transaction hashes for double-credit. The attack pattern was slow and distributed, designed to stay under naive rate limits.

The first fix was tightening WAF rate-based rules specifically scoped to financial and transaction route groups, with a significantly lower threshold than general API routes. At the application layer, every Web3 top-up flow was already going through SQS FIFO with a deduplication ID derived from the transaction hash, making replay attempts provably no-ops. We also added server-side idempotency enforcement on the credit endpoints: a transaction hash can only result in a credit once, enforced at the database write level with a unique index. Any duplicate submission returns the original result rather than crediting again.

Together, WAF drops the volume at the edge before it becomes compute cost, and the application-layer guarantees ensure that anything that does slip through cannot cause a double-credit regardless.
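The application-layer guarantee reduces to first-write-wins semantics on the transaction hash. In the sketch below, a Map stands in for a MongoDB collection with a unique index on `txHash`; names are illustrative, but the behavior matches the description above: a duplicate submission returns the original result instead of crediting again.

```typescript
type CreditResult = { txHash: string; amount: number; creditedAt: number };

// Stand-in for a collection with a unique index on txHash.
const credits = new Map<string, CreditResult>();

// Idempotent credit: only the first write for a given tx hash takes effect.
function creditOnce(txHash: string, amount: number): { result: CreditResult; duplicate: boolean } {
  const existing = credits.get(txHash);
  if (existing) return { result: existing, duplicate: true }; // replay → original result
  const result = { txHash, amount, creditedAt: Date.now() };
  credits.set(txHash, result);
  return { result, duplicate: false };
}
```

With a real database, the same guarantee comes from catching the duplicate-key error on insert and returning the previously stored document.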

3. MongoDB Performance Degradation and the M30 Detour

Under load, several API routes began timing out. MongoDB Atlas monitoring showed that a handful of queries were doing full collection scans on hot paths: lobby lookups, user wallet reads, and transaction history queries that were filtering on unindexed fields. The immediate fix was to upgrade the cluster from M10 to M30 to buy headroom while we diagnosed properly. This stopped the timeouts but added roughly $400/month to the bill.

The actual fix was systematic. We ran explain() on every slow query surfaced by Atlas Performance Advisor and added compound indexes aligned to the real query shapes: multi-field filters got multi-field indexes, sort fields were included in the index definition, and covered queries were introduced for high-frequency reads that only needed a subset of fields. Aggregation pipelines were restructured to place $match stages as early as possible so the pipeline operates on the smallest dataset at every stage. Connection pooling configuration was also tightened to prevent pool exhaustion under concurrent load.
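The shape of those fixes can be illustrated in mongosh. The collection and field names below are hypothetical stand-ins for the real query shapes surfaced by Performance Advisor.

```javascript
// Compound index matching a multi-field filter plus its sort field:
db.transactions.createIndex({ userId: 1, status: 1, createdAt: -1 });

// Covered query: the projection reads only indexed fields, so no documents are fetched.
db.transactions.find(
  { userId: "u1", status: "confirmed" },
  { _id: 0, userId: 1, status: 1, createdAt: 1 }
).sort({ createdAt: -1 });

// Aggregation with $match first, so every later stage sees the smallest possible set:
db.transactions.aggregate([
  { $match: { userId: "u1", createdAt: { $gte: ISODate("2024-01-01") } } },
  { $group: { _id: "$status", total: { $sum: "$amount" } } },
]);
```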

After validation in staging, we reverted the cluster back to M10. The queries that previously caused M30-level load ran cleanly on M10 with proper indexes. The month on M30 was a bridge, not a solution.

4. Firebase Cost: $3,000/month Down to $350/month

The lobby service uses Firebase Realtime Database as its real-time state layer — presence, chat, lobby events, and co-host slots all written to Firebase paths and pushed to connected clients via SDK listeners. This worked well at small scale. At production load, it became the most expensive problem on the platform: upwards of $3,000/month, driven almost entirely by download bandwidth charges. The root cause was a combination of over-broad listeners and unindexed query paths on both the server and client.

On the server side, lobby state writes had grown over time. Every lobby event was fanning out full lobby objects rather than writing targeted, minimal payloads to the specific paths that had actually changed. Listeners on the client were attached at high-level nodes (watching an entire lobby tree) when they only needed a single field. A listener on a high-cardinality node in Firebase downloads the entire subtree on every change event, and with hundreds of concurrent lobbies this added up to enormous bandwidth.

The fix was a joint effort between backend and frontend. On the backend, we decomposed writes to target only the specific paths that changed, rather than overwriting parent nodes. On the frontend, listeners were moved down to the lowest possible node in the tree, so a chat message arriving in a lobby did not trigger a re-download of the entire lobby state for every connected client. Firebase Security Rules were also audited and tightened, which as a side effect forced a cleanup of some overly broad read patterns that were being allowed silently. Missing .indexOn rules were added to the database configuration, eliminating the full-scan reads that Firebase was doing on unindexed query paths.
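The decomposed-write pattern amounts to building a multi-path update object instead of overwriting a parent node. The path layout below is illustrative; with firebase-admin, this object would be passed to `ref().update(...)` as a single atomic fan-out write, so only the changed paths are pushed to listeners.

```typescript
// Build a targeted multi-path update for a new chat message: touch only the
// message path and a small metadata field, never the whole lobby tree.
function chatMessageUpdate(lobbyId: string, msgId: string, msg: { from: string; text: string }) {
  return {
    [`lobbies/${lobbyId}/chat/${msgId}`]: msg,
    [`lobbies/${lobbyId}/meta/lastActivity`]: Date.now(),
  };
}
```

Paired with client listeners attached at `lobbies/{id}/chat` rather than `lobbies/{id}`, a message no longer re-downloads the full lobby state for every connected client.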

The result was a drop from ~$3,000/month to approximately $300-400/month, an 87% reduction, with no functional changes to the user experience.

Cost Impact

Across all four optimization tracks, overall infrastructure and third-party spend was reduced by approximately 65%. The numbers before and after:

AWS: $900–1,200/mo → ~$380/mo
MongoDB: $600–800/mo → ~$220/mo
Agora: $1,100–1,500/mo → ~$480/mo
Firebase: $2,000–3,000/mo → $300–400/mo