Fighting spam at Jemi

Jun 07, 2023

The spike

Recently, founders have noticed an uptick in spam and fraud on their platforms. We had a similar experience last year.

Late last November, around American Thanksgiving, traffic and signups on Jemi began to spike. Good news, right?

Well, only if the traffic is legitimate. A swift investigation told us it was not. We had experienced fraud and spam content before, but not to this degree.

Our team investigated the spike and found that something was certainly up. We hadn't launched publicly recently and couldn't tie the uptick in traffic to any legitimate source.

More investigation revealed that the uptick was pretty clearly spam content. The fact that the attack came at a time when our company was vulnerable pointed to the malicious intent behind it.

The discovery

We combed through the spam content that had infiltrated our site (there was a lot). We noted a few trends in the types of content:

  • Indonesian gambling sites
  • Pirated movies and torrents
  • World Cup streams
  • Phishing sites

While the amount of content was overwhelming, it taught us a valuable lesson. Up until that point in our journey with Jemi, we had focused heavily on optimizing the sign-up and onboarding flows to make it as easy as possible to create a website from scratch.

We also invested a lot into making our public websites performant and SEO-optimized for the modern web. However, this, combined with the fact that we offered a very generous free plan, made our platform attractive to bad actors who exploited it.

As a platform scales, it must account for both the positive and negative effects of scale. No platform is invulnerable to attack. When I was on Uber's product team, dealing with large-scale fraud within the platform was a major issue that had to be addressed.

Therefore, managing fraud often involves making the right trade-offs in the moment to move quickly, while also planning for more robust, long-term solutions.

For Jemi, it was clear that we didn't have the right safeguards in place.

How it hit us

The impact of the spam and fraud content was significant, especially for an early-stage startup without unlimited resources. One surprising observation was that the spam sites not only proliferated quickly — page views also spiked significantly. This meant that bad actors were not only able to create these sites quickly, they were able to circulate them quickly and broadly through various channels.

This significantly increased the pressure on our infrastructure. Scaling up to accommodate the increase in traffic wasn't necessarily an issue — our infrastructure was robust, modern, and able to handle the scale. The negative effects materialized in other ways.

A big hit to us was the cost. The newly created spam sites were receiving a lot of traffic, which caused our bandwidth costs to spike significantly. Since the majority of spam sites were created on our free plan, this abuse meant $ down the drain for us.

Our email service was heavily impacted as well. We rely heavily on email as a primary channel for customer communication, using it for everything from welcome emails to notifying users about new orders in their stores. As spam users created new accounts in droves, they often signed up with junk or throwaway email addresses that were invalid. As we continued to send emails to these invalid addresses, our email bounce rate continued to increase. Eventually, it became so severe that our email service, Postmark, had to stop sending our emails. This meant that our legitimate customers were being impacted by the influx of spam users.

On top of all this, the spam content was damaging to our brand image. We recognized the need to act swiftly.

Fighting back — on signup

Given the speed at which certain accounts were being created, we concluded that some of the malicious actors must have been bots. We decided to fight back.

A quick safeguard we implemented was an IP address-based rate limiter on account creation. If we detected multiple accounts being created back-to-back from the same IP address, at a rate too quick for any reasonable human, we blocked the account creation attempt.
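
As a rough sketch of the idea (illustrative, not our exact code), a sliding-window limiter can be as simple as the following. A production version would keep the counters in a shared store like Redis rather than in process memory, so the limit holds across servers:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # how far back we look
MAX_SIGNUPS_PER_WINDOW = 3  # max signups per IP in that window

# Recent signup timestamps, keyed by IP (in-memory for illustration only).
recent_signups = defaultdict(deque)

def allow_signup(ip_address: str) -> bool:
    now = time.time()
    attempts = recent_signups[ip_address]
    # Drop attempts that have fallen outside the window.
    while attempts and now - attempts[0] > WINDOW_SECONDS:
        attempts.popleft()
    if len(attempts) >= MAX_SIGNUPS_PER_WINDOW:
        return False  # too many back-to-back signups from this IP
    attempts.append(now)
    return True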

Another quick safeguard was implementing Google reCAPTCHA to detect bots. The integration itself was quick, and it was nice that the newer version of reCAPTCHA just operates in the background and doesn't ask users to decipher hard-to-read text or match images (which could have impeded the signup UX).
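
For reference, the server-side half of that integration looks roughly like this (the endpoint and response fields come from Google's reCAPTCHA docs; the score only exists in v3, and the threshold is yours to tune):

import requests

RECAPTCHA_SECRET = "your-secret-key"  # server-side key from the reCAPTCHA admin console

def verify_recaptcha(token: str, min_score: float = 0.5) -> bool:
    # Ask Google to verify the token generated by the client-side widget.
    result = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token},
        timeout=5,
    ).json()
    # v3 returns a 0.0-1.0 "score"; v2 checkbox only returns "success".
    return result.get("success", False) and result.get("score", 1.0) >= min_score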

A more creative solution was implementing a honeypot CAPTCHA. This approach uses a bit of CSS magic to detect and block bot signups. Essentially, an additional form field is added to the sign up page itself. This field doesn't collect actual user information and is meant to be left blank. The field is then hidden with a bit of CSS, like this:

<div style="display:none;" class="form-group">
  <input type="text" placeholder="ie website.com" name="url" class="form-control">
</div>

Bots will oftentimes detect all the form fields present on a given page and have logic to automatically fill them in. They can map out automated responses to name and email fields, for instance. We added a hidden field that asks for a website URL and didn't add anything to the metadata to indicate that it was a trick (just in case). Since the field is present in the HTML and bots can't detect that it's hidden through CSS, they will erroneously fill it in.

We can then detect whenever the field value is present and pretty reliably determine that the signup was completed by a non-human. That leads to a block.
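
The server-side half of the honeypot is about as simple as it gets. A minimal sketch, assuming the submitted form data arrives as a dict:

def is_bot_signup(form_data: dict) -> bool:
    # A real user never sees the hidden "url" field, so any value here
    # means an automated script filled in every field it found.
    return bool(form_data.get("url", "").strip())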

If you go to our sign up page and poke around a bit with the CSS, you can see the honeypot CAPTCHA in action.

Fighting back — detecting spam content

Detecting websites that contained spam content was a bit trickier. We decided to write a custom spam detection model, built mainly by observing trends in the spam websites we had seen.

The model would assign a given website a score based on the presence of certain problematic keywords (e.g. torrent or free download) and suspicious character patterns. If a website scored above a certain threshold, we could determine with a relatively high degree of accuracy that it was spam and block it from being publicly accessible.
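
A minimal sketch of that kind of scoring. The keywords, weights, and threshold here are illustrative, not our production values, and modeling the character signal as long runs of non-ASCII text is an assumption:

import re

# Illustrative keyword weights based on the trends above.
KEYWORD_WEIGHTS = {
    "torrent": 3,
    "free download": 3,
    "free stream": 2,
    "slot": 2,  # common in the gambling spam we saw
}
SPAM_THRESHOLD = 5

def spam_score(page_text: str) -> int:
    text = page_text.lower()
    score = sum(weight for keyword, weight in KEYWORD_WEIGHTS.items() if keyword in text)
    # Assumption: treat 10+ consecutive non-ASCII characters as a signal.
    if re.search(r"[^\x00-\x7f]{10,}", page_text):
        score += 3
    return score

def is_spam(page_text: str) -> bool:
    return spam_score(page_text) >= SPAM_THRESHOLD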

Since a large amount of spam content had already infiltrated our platform, we also had to figure out how to remove that content at scale. We ended up building custom tooling that could search for the problematic content and delete it programmatically, roughly in the shape sketched below. This prevented us from incurring any unnecessary additional cost from having the sites live.
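
A hedged sketch of that tooling, reusing is_spam from above; the site accessors are hypothetical stand-ins for our internal APIs:

def purge_spam_sites(site_ids, get_page_text, delete_site, dry_run=True):
    # site_ids: iterable of site identifiers; get_page_text and
    # delete_site are hypothetical accessors for the site store.
    flagged = [sid for sid in site_ids if is_spam(get_page_text(sid))]
    if not dry_run:  # audit with dry_run=True before deleting for real
        for sid in flagged:
            delete_site(sid)
    return flagged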

Wrapping up

This journey proved to be a stressful one. Dealing with an attack like this as a small team is not easy — without the support of a larger org or tribal knowledge to fall back on — but it's oftentimes the fastest way to learn. I was proud of the way we made it through together as a team.

In the future there are certainly improvements we can make, such as refining the spam detection model and adding new features like email verification, but for now it was nice to retaliate against these attackers with a parry of our own. If you ever want to chat about fighting spam, Jemi, or anything else, shoot me an email at jason@jemi.app.