6 months of working remotely at scrapinghub.com

ScrapingHub is a distributed company that builds tooling and a platform to extract data from the web. The company was incorporated in Ireland in 2010 and has been a distributed team from day one, with many staff in Uruguay and other parts of the world. Since then the company has grown to a team of 180 people located all around the world, and last year the company’s revenue was 12 million. I joined in March 2019, and it was my first time working remotely. Below I reflect on my past 6 months, the remote experience, and some of the things I’ve been working on.

What is ScrapingHub?

The mission is “to provide our customers the data they need to innovate and grow their businesses.” The founders of the company are the creators of Scrapy, a very popular open source and collaborative framework for extracting data from the web. ScrapingHub started as a platform for running Scrapy spiders. It has since grown to include three more products:

  • Crawlera: A specially designed proxy for web scraping to ensure you can crawl quickly and reliably
  • Splash: A headless browser to enable customers to extract data from JavaScript websites
  • AutoExtract: Delivers next-generation web scraping capabilities backed by an AI-enabled data extraction engine, enabling customers to crawl many websites without needing to write custom CSS and XPath selectors for each one.

What is your role?

I am a DevOps engineer on the AutoExtract product. I work closely with two backend developers to deliver an API that enables customers to access our AI-enabled data extraction engine. I’m dedicated to providing infrastructure services for the team; this includes things like compute, deployment mechanisms, monitoring, alerting, logging, and so on.

What is your background and what brought you to ScrapingHub?

I graduated in Computer Applications from Dublin City University 5 years ago. During my time in college, I had a great interest in infrastructure and automation. Upon graduating, I worked as a Cloud Engineer in a bank for a year before joining an ecommerce company under the title of Software Engineer.

At this company, I worked on a team that acquired data from the web. However, we never solved the problem neatly; our solution would never scale. This was due to the initial work of creating XPath and CSS selectors for each site, along with the ongoing maintenance and monitoring needed to catch when sites changed and those selectors broke. During my time in this role, I had heard of ScrapingHub as one of the leaders in the web scraping space. It was always a company I was curious to know more about, and it was unusual to hear of an Irish company that was distributed.

After this, I was employee number one at an Irish start-up, again with the title of Software Engineer. As you can imagine, in a start-up environment there is lots of work to do, and the role was a diverse one that covered many different areas.

My journey to ScrapingHub began with a call from a recruiter at Stelfox who told me about the role. He had my attention at the word ScrapingHub; I knew immediately this was a chance to experience remote working for the first time. During the conversation, my interest only grew. The recruiter described the AutoExtract product to me and I was curious to know more - they’d figured out a way to do web scraping without the need to write XPaths and CSS selectors, a problem I had previously faced.

What was the interview and acceptance process like?

As ScrapingHub is a distributed company, all interviews are done remotely; you do not meet a Shubber (someone who works for ScrapingHub) in person at any stage. For me, the interview process consisted of three interviews, each lasting no more than 30 minutes.

My first interview was with a ScrapingHub recruiter. I got the impression that the interview focused on whether or not you and ScrapingHub are a good fit for each other. It’s a time for them to ask you questions to see if you align with their company values, and for you to ask questions to see if it’s the right role for you.

My second interview was the hardest of the three. It was with a member of the infrastructure team and was a very technically focused and enjoyable experience. For me, the interview was a stream of questions covering different key areas of the role. The questions started easy and then increased in difficulty. For example, the containers section started with “What flag do you pass to docker to expose a port?” and ended on “Describe, to the best of your knowledge, how containers work at a kernel level.” When I failed to have a good answer for a question, the interviewer was nice about it and discussed it in more detail. I liked this, and I walked away from the interview having learned something.

The final interview was with the head of engineering. I believe this interview was a mix of both culture fit and technical ability. It took the form of a friendly conversation based on resume items and previous work along with a few technical questions thrown in along the way.

A couple of days later I heard back from the Stelfox recruiter to inform me that my interview had been successful. From that point on, it was standard procedure - receive a contract, provide personal details, etc.

Setting myself up for success

As I’m based in Ireland, ScrapingHub provided me with a laptop that was shipped to my address before I was due to start. I was excited about the job and the remote aspect of it. I believed having a good working area was going to be key to my success.

I spent a couple of days trawling second-hand sites and researching office furniture and equipment. At the time I was living in a small 1-bedroom apartment in Dublin, Ireland, and space was limited. I focused on what I felt were the key elements to making a reasonable space where I could work contentedly. I ended up with the following:

I’ve since made use of the remote flexibility and moved away from Dublin, Ireland to a bigger property in Cork, Ireland. In doing so I’ve improved my workspace by having a dedicated office with a standing desk and many, many, many, many cable management aids.

The first week

Three days before I was due to start I received an email containing my staff credentials. This introduced me to some of the tools used for management and communication:

  • JIRA - For managing work and tracking time.
  • Confluence - An internal knowledge base for the team to collaboratively maintain.
  • GSuite - Email, calendar, and meetings.
  • GitHub - Git repositories.
  • BambooHR - HR system.

Before officially starting I signed into Slack and received a warm welcome from all my future co-workers. I also signed into my GSuite account where my calendar revealed a hint of what the first few days were going to be like.

One of my major concerns early on was keeping to a routine. To help with this I decided to continue waking up at the same time as I did in previous jobs. I wanted to use the time I saved by having no commute to complete any tasks that involved leaving the house. For example, on day 1 I did the grocery shopping for the week.

With the groceries unpacked and put away, it was time to get to work. My first morning started at 9.30am with HR introducing me to my manager. He told me a little bit about the team, then followed up by inviting me to team-specific Slack channels and scheduling introductions with the team.

After this, at 10 am, was a general company HR onboarding session. New hires are brought on in batches so I wasn’t alone on these calls; for me, there were 3 other new hires on the call. Immediately after this, we had a 1:1 call with HR to discuss our benefits (pension, shares, holidays, health insurance, etc.).

Before finishing up for lunch we had an IT onboarding session. This meeting was to ensure we could access all the necessary systems and that we knew the different company policies - How do I change my password? What do I do if I forget my password? How do I encrypt my disk? What do I do in the event a device is stolen? And so on.

ScrapingHub is very focused on wellbeing, and for March they ran an event centred on it. As you can see from my calendar, the next meeting was “Stay sane while working remote”, with another, “Mindful Meditation”, later in the week; both of these events were March specials. As a newbie it was interesting to experience these; it gave me a glimpse of how ScrapingHub encourages camaraderie and enables people to get to know each other outside of work topics. Outside of March this space for general chit-chat is maintained: each week there’s a standing watercooler meeting for the different time zones where people get together and chat about absolutely anything. Additionally, each month we have “Shub talks” where people show something they’ve been working on or share some piece of knowledge - I’m pleased to say that within my 6 months at the company I have had the opportunity to speak at one.

Opensource has been a big part of ScrapingHub since day one, as demonstrated by the company launching Scrapy as an opensource project back in 2009. My final meeting of the day showed how opensource is still a very big part of ScrapingHub. Shubbers are encouraged to contribute to opensource as much as possible; we are given 5 hours per week to engage in anything opensource related. Today ScrapingHub has a handful of opensource projects - check out some of the more popular ones: Splash, ELI5, deep-deep, and SpiderMon. Additionally, ScrapingHub supports the opensource ecosystem by taking part in Google Summer of Code.

Day two continued with more meetings. The morning consisted of introductions to the team and a system overview. Following these, the newbie group had an introduction to the founders of the company. By the end of the day, I had been assigned a ticket in JIRA and was working away on it.

What have you been working on?

Up until now, infrastructure at ScrapingHub has run on dedicated servers rented from Hetzner. Teams package their applications up as Docker containers and use a ChatOps deployment mechanism to run them on Mesos. Additional services such as a database or Kafka require input from the infrastructure team.

Prior to my joining the company, a decision was made to move AutoExtract onto Google Cloud Platform with their managed Kubernetes offering. The idea was to enable us to quickly increase compute to meet customer demand.
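
To give a sense of what that looks like in practice, creating an autoscaling Kubernetes cluster on GKE is a single gcloud command. The cluster name, zone, and node counts below are purely illustrative and not the actual AutoExtract configuration:

$ gcloud container clusters create autoextract \
  --zone europe-west1-b \
  --num-nodes 3 \
  --enable-autoscaling --min-nodes 3 --max-nodes 10

With autoscaling enabled, GKE adds or removes nodes as demand changes, which is the “quickly increase compute” property we were after.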

My main task was to make this happen. Thankfully, due to the company’s existing adoption of Docker and Mesos, I had a bit of a head start on this challenge.

AutoExtract is made up of multiple components that communicate via Kafka. It is exposed to customers via an HTTP API. Breaking this down, I established what was required to get the system running:

  • A mechanism to deploy and destroy a single component - Helm fits this use case nicely. I created a Helm chart for each of the components; the chart is stored, versioned, and released with the component’s source code. I extended the pre-existing ChatOps deployment mechanism to support communicating with Helm (see the sketch after this list).
  • HTTP/HTTPS connectivity to pods - Ingress Nginx solved this problem nicely. I paired it with cert manager to automatically provision SSL certificates from letsencrypt and External DNS to automatically create DNS records.
  • Monitoring of the application and cluster - Deploying Prometheus provided a straightforward way to achieve this.
  • Authentication for internal applications - OAuth2 Proxy integrated nicely with ingress nginx to provide authentication against the company’s GSuite accounts.
  • Kafka - I made use of Confluent Cloud’s managed Kafka offering for this. It allowed us to get Kafka on a consumption pricing model and avoid an ongoing maintenance overhead.
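
As a rough sketch of what the ChatOps tooling drives under the hood, deploying a single component with Helm looks something like the following. The release name, chart path, and values are hypothetical, not the real AutoExtract charts:

$ helm upgrade --install autoextract-api ./charts/autoextract-api \
  --namespace autoextract \
  --set image.tag=v1.2.3

Because each chart lives alongside its component’s source code, a deployment is effectively “install this chart at this version”, which maps neatly onto a chat command.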

Today I’m happy to say that the AutoExtract product runs entirely on Google Cloud Platform.

What is your average week like?

Something that amazed me about working remotely is how much more productive I am. ScrapingHub gives me a large amount of autonomy over my work as well as my time. I am in control of how I structure my day, which allows for lots of flexibility - nothing is stopping me from taking an hour here and there throughout general working hours (9am - 5pm) and replacing them with hours late into the evening.

My average week consists of a total of 3 mandatory meetings. Every Monday my manager does a 1:1 meeting with all members of the team; these last on average 15 minutes. On Tuesday we have a backend team sync-up and on Thursday we have a full team sync-up; each of these meetings lasts 1 hour and takes a similar format to a daily standup, with an allowance for more discussion where necessary.

For the rest of the week, I’m free to work away on whatever task has been assigned to me. Everyone is always available via Slack, and when more in-depth communication is necessary, ad-hoc video calls are started.

Frequently asked questions

Do you ever find remote work isolating or miss human contact?

No, the team I’m on is very cohesive. Everyone is very supportive and has a clear vested interest in the success of the product, and thus in the task you may be working on.

For times when I wish to get out of the house, ScrapingHub has hotdesk spaces in WeWork in Dublin and Republic of Work in Cork. The Bank of Ireland Workbench spaces or public libraries can be useful spots to work from. For the most part, I only make use of them when I have plans to meet a friend for lunch.

In addition to this, it’s important to note that despite only being in the company for 6 months I was given the opportunity to meet all of my teammates in person along with the founders of the company and many members of the leadership team.

Do you find a cost saving in working from home?

Yes, I no longer spend any money on commuting or buying lunch. On average I’ve worked this out at about an €80 saving per week. In addition to this, I’ve recently moved away from Dublin and reduced my rent significantly.

Have you kept to your goal of wanting to keep a routine and doing something with the saved commuting time?

I would like to say yes, but sometimes extra sleep wins. I’m aiming to bring this back on track soon.

What has been your biggest personal challenge with switching to remote work?

Definitely communication, specifically video calls. I found seeing my own face on screen very odd for the first few days. These days I’ve become comfortable with that and just struggle with knowing the correct time to speak, or to interrupt to add to something while someone else is talking.

What do you most enjoy about working remote?

I really like the hassle-free element of it: I’m no longer setting alarm clocks, checking public transport times, wondering if I need to bring an umbrella today, and so on. Starting and finishing work is as simple as entering and exiting a room in my house.

How does knowledge sharing within your team occur?

For me, this is mainly done through Slack; the team are very supportive and willing to help as much as they can. I try to return this as much as I can, and believe I have mostly been successful given the team’s ability to adopt Kubernetes.

If you’ve got any other questions feel free to reach out.

Interested in going remote with ScrapingHub?

I’ve enjoyed my time at ScrapingHub so far and I’m sure you would too. If you enjoyed reading about my experience above and would like to try it out yourself, take a look at the jobs page or reach out to a ScrapingHub recruiter.

Exporting Confluent Cloud Metrics to Prometheus

At Kafka Summit this year, Confluent announced consumption-based billing for their Kafka Cloud offering, making it one of the cheapest and easiest ways to get a Kafka cluster. However, because the Kafka cluster is multi-tenanted, it comes with some restrictions: ZooKeeper is not exposed and the __consumer_offsets topic is restricted. This means popular tools like Kafka Manager and Prometheus Kafka Consumer Group Exporter won’t work.

kafka_exporter is a nice alternative, as it uses the Kafka Admin Client to access the metrics. However, due to the authentication process required by Confluent Cloud, it doesn’t work as is.

By forking kafka_exporter and upgrading the Kafka client one can get a successful output.

You can try out my build as follows:

$ docker run -p 9308:9308 -it imduffy15/kafka_exporter \
--kafka.server=host.region.gcp.confluent.cloud:9092 \
--sasl.enabled \
--sasl.username=username \
--sasl.password="password" \
--sasl.handshake \
--tls.insecure-skip-tls-verify \
--tls.enabled

On querying http://localhost:9308/metrics, all of the metrics documented by kafka_exporter will be available. Prometheus can scrape this endpoint and alert and graph on the data.
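
For completeness, a minimal Prometheus setup that scrapes the exporter might look like the following, again run via Docker. The host.docker.internal target assumes Docker for Mac/Windows; adjust it to wherever the exporter is reachable:

$ cat > prometheus.yml <<'EOF'
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: ['host.docker.internal:9308']
EOF
$ docker run -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus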

Local development with virtual hosts and HTTPS

When doing development locally it might be necessary to access the application(s) using a virtual host (vhost) and/or HTTPS. This post describes an approach to achieving this on OSX using Docker, which avoids creating a large mess on your computer.

Domain Name System (DNS)

The Domain Name System (DNS) is often referred to as the phonebook of the internet. People access information online through domain names, like ‘amazon.com’ or ‘rte.ie’. DNS is responsible for translating a domain name to an IP address.

When an HTTP resource is accessed by domain name, the client sends a Host header containing that domain name to the HTTP server. The HTTP server uses this header to route the request to the correct vhost.

In order to have functioning vhosts, DNS is required. Dnsmasq is a popular DNS server; it can be configured to resolve a set of specified domains to a specified IP address. Using Docker, a dnsmasq server can easily be created:

$ docker run -it -p 53:53/udp --cap-add=NET_ADMIN andyshinn/dnsmasq:latest --log-queries --log-facility=- --address=/dev.ianduffy.ie/127.0.0.1

The provided command line arguments instruct dnsmasq to resolve all requests for dev.ianduffy.ie and *.dev.ianduffy.ie to 127.0.0.1. Using dig, a DNS lookup utility, this configuration can be validated:

$ dig @127.0.0.1 dev.ianduffy.ie +short
127.0.0.1

Additionally, all subdomains of dev.ianduffy.ie also resolve:

$ dig @127.0.0.1 random-string.dev.ianduffy.ie +short
127.0.0.1

While this works, OSX is not configured to use this DNS server for its lookups, so attempting to resolve dev.ianduffy.ie in the browser will fail. It’s possible to specify nameservers for a specific domain by creating a file within /etc/resolver with a filename that matches the domain.

For example, if /etc/resolver/dev.ianduffy.ie is created with the following contents:

nameserver 127.0.0.1

All queries to dev.ianduffy.ie will now be resolved by doing a lookup against the DNS server at 127.0.0.1.
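
A quick way to create that file from the terminal (sudo is required, as /etc/resolver is root-owned and may not exist yet) is:

$ sudo mkdir -p /etc/resolver
$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolver/dev.ianduffy.ie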

HTTP

An HTTP server will be required for routing the requests based on a vhost and supplying HTTPS. Nginx is a good fit for this, even more so as Jason Wilder from Microsoft has created a container image, nginx-proxy, that exposes the nginx reverse proxy functionality via environment variables.

nginx-proxy can be started using the following:

$ docker run -it -p 80:80 -v /var/run/docker.sock:/tmp/docker.sock:ro jwilder/nginx-proxy

nginx-proxy will look for VIRTUAL_HOST environment variables on other docker containers and route to them accordingly. To demonstrate this, a container running httpbin, which provides data for debugging HTTP requests, can be created with a VIRTUAL_HOST environment variable specified.

$ docker run -e VIRTUAL_HOST=httpbin.dev.ianduffy.ie kennethreitz/httpbin

This service can now be accessed via httpbin.dev.ianduffy.ie. Alternatively, if you do not have the dnsmasq service from earlier running, the service can be accessed by passing a “host” header with the value “httpbin.dev.ianduffy.ie”. This can be tested with an HTTP client like curl:

$ curl http://httpbin.dev.ianduffy.ie/headers
{
  "headers": {
    "Accept": "*/*",
    "Connection": "close",
    "Host": "httpbin.dev.ianduffy.ie",
    "User-Agent": "curl/7.54.0"
  }
}
$ curl -H "host: httpbin.dev.ianduffy.ie" http://127.0.0.1/headers
{
  "headers": {
    "Accept": "*/*",
    "Connection": "close",
    "Host": "httpbin.dev.ianduffy.ie",
    "User-Agent": "curl/7.54.0"
  }
}

HTTPS

In some scenarios, HTTPS might be required. mkcert provides locally trusted SSL certificates and automatically installs them into the OSX, Linux, and Windows system stores, along with Firefox, Chrome, and Java.

With mkcert installed, certificates can be generated with the following command:

$ mkdir certs
$ mkcert -cert-file certs/dev.ianduffy.ie.crt -key-file certs/dev.ianduffy.ie.key -install dev.ianduffy.ie *.dev.ianduffy.ie

By mounting these certificates as a volume on the container running nginx-proxy, HTTPS will be enabled.

$ docker run -it -p 80:80 -p 443:443 -v $(pwd)/certs:/etc/nginx/certs -v /var/run/docker.sock:/tmp/docker.sock:ro jwilder/nginx-proxy

Executing curl against https://httpbin.dev.ianduffy.ie will now succeed. Additionally, the -v flag can be specified to tell curl to be verbose, and it will display information about the SSL certificate.

$ curl -v https://httpbin.dev.ianduffy.ie/headers

Placing a vhost in front of a local application

Most developers will want to run their application on the host and continue with their standard development workflow. To accommodate this, a container that forwards traffic to a host port can be created.

socat is perfect for this use case; it enables us to specify a source IP address and port and a destination IP address and port. In the example below, all traffic for test.dev.ianduffy.ie will be forwarded to the docker host on port 9000.

$ docker run -it --expose 80 -e VIRTUAL_HOST=test.dev.ianduffy.ie alpine/socat tcp-listen:80,fork,reuseaddr tcp-connect:host.docker.internal:9000

Bring it all together with docker-compose

All of the containers described above can be brought together in a single docker-compose file. This makes it possible to bring the whole system up with a single command.

nginx-proxy:
  container_name: nginx-proxy
  image: jwilder/nginx-proxy
  ports:
    - 80:80
    - 443:443
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock:ro
    - ./certs:/etc/nginx/certs:ro
dnsmasq:
  container_name: dnsmasq
  image: andyshinn/dnsmasq:latest
  command: --log-queries --log-facility=- --address=/dev.ianduffy.ie/127.0.0.1
  ports:
    - 53:53
    - 53:53/udp
  cap_add:
    - NET_ADMIN
socat:
  image: alpine/socat
  command: tcp-listen:80,fork,reuseaddr tcp-connect:host.docker.internal:9000
  environment:
    - VIRTUAL_HOST=dev.ianduffy.ie
    - VIRTUAL_PORT=80
  expose:
    - 80

This enables vhosts and HTTPS for applications running on the host without creating too much of a mess, as it’s all contained within Docker.

Managing access to multiple AWS Accounts with OpenID

Many organisations look towards a multi-account strategy with Amazon Web Services (AWS) to provide administrative isolation between workloads, limited visibility and discoverability of workloads, isolation to minimise blast radius, management of AWS limits, and cost categorisation. However, this comes at a large complexity cost, specifically around Identity and Access Management (IAM).

Starting off with a single AWS account, and using a handful of IAM users and groups for access management, is usually the norm. As an organisation grows they start to see a need for separate staging, production, and developer tooling accounts. Managing access to these can quickly become a mess. Do you create a unique IAM user in each account and provide your employees with the unique sign-on URL? Do you create a single IAM user for each employee and use AssumeRole to generate credentials and to enable jumping between accounts? How do employees use the AWS Application Programming Interface (API) or the Command Line Interface (CLI); are long-lived access keys generated? How is an employee’s access revoked should they leave the organisation?

User per account approach

All users in a single account, using STS AssumeRole to access other accounts

Design

Reuse employees’ existing identities

In most organisations, an employee will already have an identity, normally used for accessing e-mail. These identities are normally stored in Active Directory (AD), Google Suite (GSuite) or Office 365. In an ideal world, these identities could be reused and would grant access to AWS. This means employees would only need to remember one set of credentials and their access could be revoked from a single place.

Expose an OpenID compatible interface for authentication

OpenID provides applications with a way to verify a user’s identity using JSON Web Tokens (JWT). Additionally, it provides profile information about the end user such as first name, last name, email address, group membership, etc. This profile information can be used to store the AWS accounts and AWS roles the user has access to.

By placing an OpenID compatible interface on top of an organisation’s identity store, users can easily generate JWTs which can later be used by services to authenticate them.

Trading JWTs for AWS API Credentials

In order to trade JWTs for AWS API credentials, a service can be created that runs on AWS with a role that has access to AssumeRole. This service would be responsible for validating a user’s JWT, ensuring the JWT contains the requested role, and executing STS AssumeRole to generate the AWS API credentials.

Additionally, the service would also generate an Amazon Federated Sign-On URL which would enable users to access the AWS Web Console using their JWT.
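
Under the hood, the credential exchange is just a call to STS. As a rough sketch, the equivalent AWS CLI invocation (with a placeholder role ARN and session name) looks like this; the service performs the same call programmatically after validating the JWT:

$ aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/developer \
  --role-session-name demo@example.com

The response contains a temporary access key, secret key, and session token, which is what gets handed back to the user.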

Example Implementation

Provided below is an example implementation of the above design. One user with username “demo” and password “demo” exists. Please do not use this demo in a production environment without https.

To follow along, clone or download the code at https://github.com/imduffy15/aws-credentials-issuer.

OpenID Provider

Keycloak provides an IAM solution along with OpenID Connect (OIDC) and Security Assertion Markup Language (SAML) interfaces. Additionally, it supports federation to Active Directory (AD) and Lightweight Directory Access Protocol (LDAP) servers. Keycloak enables organisations to centrally manage employees’ access to many different services.

The provided code contains a docker-compose file; executing docker-compose up will bring up a Keycloak server with its administrative interface accessible at http://localhost:8080/auth/admin/ using username “admin” and password “password”.

On the “clients” screen, a client named “aws-credentials-issuer” is present; users will use this client to generate their JWT tokens. This client is pre-configured to work with the Authorization Code Grant, for command line interfaces to generate tokens, and the Implicit Grant, for a frontend application.

Under the “aws-credentials-issuer” client, additional roles can be added; these roles must exist on AWS and must have a trust relationship with the account that will be running the “aws-credentials-issuer”.

Additionally, these roles must be placed into the user’s JWT tokens; this is pre-configured under “mappers”.

Finally, the role must be assigned to a user. This can be done by navigating to users -> demo -> role mappings and moving the wanted role from “available roles” to “assigned roles” for the client “aws-credentials-issuer”.

AWS Credentials Issuer Service

Backend

The provided code supplies a lambda function which takes care of validating the user’s JWT token and exchanging it, using AssumeRole, for AWS credentials.

This code can be deployed to an AWS account by using the serverless framework and the supplied definition. The definition will create the following:

With the AWS CLI configured with credentials for the account that the service will run in, execute sls deploy; this will deploy the lambda functions and return URLs for executing them.

Frontend

The provided code supplies a frontend which will provide users with a graphical experience for accessing the AWS Web Console or generating AWS API Credentials.

The frontend can be deployed to an S3 bucket using the serverless framework. Before deploying it, some variables must be modified: in the serverless definition (serverless.yml), replace “ianduffy-aws-credentials-issuer” with a desired S3 bucket name, and modify ui/.env to contain your Keycloak and backend URLs as highlighted above. The deployment can be executed with sls client deploy.

On completion, a URL in the format of http://<bucket-name>.s3-website.<region>.amazonaws.com will be returned. This needs to be supplied to Keycloak as a redirect URI for the “aws-credentials-issuer” client.

Usage

Browser

By navigating to the URL of the S3 bucket a user can get access to the AWS Web Console or get API credentials which they can use to manually configure an application.

Command line

To interact with the “aws-credentials-issuer” the user must have a valid JWT. This can be obtained by executing the Authorization Code Grant against Keycloak.

token-cli can be used to execute the Authorization Code Grant and generate a JWT token. It can be downloaded from the project’s releases page; alternatively, on OSX it can be installed with homebrew: brew install imduffy15/tap/token-cli.

Once token-cli is installed it must be configured to run against Keycloak; this can be done as follows:

Finally, a token can be generated with token-cli token get aws-credentials-issuer -p 9000. On first run the user’s browser will be opened and they will be required to log in; on subsequent runs the token will be cached or refreshed automatically.

This token can be used against the “aws-credentials-issuer” to get AWS API credentials:

curl https://<API-GATEWAY>/dev/api/credentials?role=<ROLE-ARN> \
-H "Authorization: bearer $(token-cli token get aws-credentials-issuer)"

Alternatively, an Amazon Federated Sign-On URL can also be generated:

curl https://<API-GATEWAY>/dev/api/login?role=<ROLE-ARN> \
-H "Authorization: bearer $(token-cli token get aws-credentials-issuer)"

Scala and AWS managed ElasticSearch

AWS offers a managed ElasticSearch service. It exposes an HTTP endpoint for interacting with ElasticSearch and requires authentication via AWS Identity and Access Management (IAM).

Elastic4s offers a neat DSL and Scala client for ElasticSearch. This post details how to use it with AWS’s managed ElasticSearch service.

Creating a request signer

Using the aws-signing-request-interceptor library, it’s easy to create an HttpRequestInterceptor which can later be added to the HttpClient used by Elastic4s for making calls to ElasticSearch:

private def createAwsSigner(config: Config): AWSSigner = {
  import com.gilt.gfc.guava.GuavaConversions._

  // Resolve credentials from the environment, AWS profile, or instance role.
  val awsCredentialsProvider = new DefaultAWSCredentialsProviderChain
  val service = config.getString("service")
  val region = config.getString("region")
  // Request signatures are timestamped; AWS expects the timestamp in UTC.
  val clock: Supplier[LocalDateTime] = () => LocalDateTime.now(ZoneId.of("UTC"))
  new AWSSigner(awsCredentialsProvider, region, service, clock)
}

Creating an HTTP Client and intercepting the requests

The ElasticSearch RestClientBuilder allows for registering a callback to customise the HttpAsyncClientBuilder, enabling the interceptor that signs the requests to be registered.

The callback can be created by implementing the HttpClientConfigCallback interface as follows:

private val esConfig = config.getConfig("elasticsearch")

private class AWSSignerInteceptor extends HttpClientConfigCallback {
  override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder = {
    // Add the signing interceptor last so the signature is computed over the fully built request.
    httpClientBuilder.addInterceptorLast(new AWSSigningRequestInterceptor(createAwsSigner(esConfig)))
  }
}

Finally, an Elastic4s client can be created with the interceptor registered:

private def createEsHttpClient(config: Config): HttpClient = {
  val hosts = ElasticsearchClientUri(config.getString("uri")).hosts.map {
    case (host, port) =>
      new HttpHost(host, port, "http")
  }

  log.info(s"Creating HTTP client on ${hosts.mkString(",")}")

  val client = RestClient.builder(hosts: _*)
    .setHttpClientConfigCallback(new AWSSignerInteceptor)
    .build()
  HttpClient.fromRestClient(client)
}

Full Example on GitHub

Azure bug bounty Root to storage account administrator

In my previous blog post, “Azure bug bounty Pwning Red Hat Enterprise Linux”, I detailed how it was possible to get administrative access to the Red Hat Update Infrastructure consumed by Red Hat Enterprise Linux virtual machines booted from the Microsoft Azure Marketplace image. In theory, if exploited, one could have gained root access to all virtual machines consuming the repositories by releasing an updated version of a common package and waiting for virtual machines to execute yum update.

As an attacker, this would have granted access to every piece of data on the compromised virtual machines. Sadly, the attack vector is actually much more widespread than this. Due to a poor implementation within the mandatory Microsoft Azure Linux Agent (WaLinuxAgent), one is able to obtain the administrator API keys to the storage account used by the virtual machine for debug log shipping. At the time of research, this storage account defaulted to one shared by multiple virtual machines.

At the time of research, the Red Hat Enterprise Linux image available on the Microsoft Azure Marketplace came with WaLinuxAgent 2.0.16. When a virtual machine was created with the “Linux diagnostic extension” enabled, the API key for access to the specified storage account was written to /var/lib/waagent/Microsoft.OSTCExtensions.LinuxDiagnostic-2.3.9007/xmlCfg.xml.

Once acquired, one can simply use the Azure Xplat-CLI to interact with the storage account:

export AZURE_STORAGE_ACCOUNT="storage_account_name_as_per_xmlcfg"
export AZURE_STORAGE_ACCESS_KEY="storage_account_access_key_as_per_xmlcfg"
azure storage container list # acquire some container name
azure storage blob list # provide the container name
# Copy, download, upload, delete any blobs available across any containers you can access.

If the storage account was used by multiple virtual machines, there was the potential to download their virtual hard disks.

Azure bug bounty Pwning Red Hat Enterprise Linux

TL;DR: I acquired administrator-level access to all of the Microsoft Azure managed Red Hat Update Infrastructure, which supplies the packages for all Red Hat Enterprise Linux instances booted from the Azure Marketplace.

I was tasked with creating a machine image of Red Hat Enterprise Linux that was compliant with the Security Technical Implementation Guide defined by the Department of Defense.

This machine image was to be used on both Amazon Web Services and Microsoft Azure, both of which offer marketplace images with a metered billing pricing model [1][2]. Ideally, I wanted my custom image to be billed under the same mechanism; as such, the virtual machines would be able to consume software updates from a local Red Hat Enterprise Linux repository owned and managed by the cloud provider.

Both Amazon Web Services and Microsoft Azure utilise a deployment of Red Hat Update Infrastructure for supplying this functionality.

This setup requires two main parts:

Red Hat Update Appliance

There is only one Red Hat Update Appliance per Red Hat Update Infrastructure installation; however, both Amazon Web Services and Microsoft Azure create one per region.

The Red Hat Update Appliance is responsible for:

The Red Hat Update Appliance does not need to be exposed to the repository clients.

Content Delivery server

The content delivery server(s) provide the yum repositories that clients connect to for updated packages.

Achieving metered billing

Both Amazon Web Services and Microsoft Azure use SSL certificates for authentication against the repositories.

However, these are the same SSL certificates for every instance.

On Amazon Web Services, having the SSL certificates is not enough; you must have booted your instance from an AMI that has an associated billing code. It is this billing code that ensures you pay the extra premium for running Red Hat Enterprise Linux.

On Azure, it remains unclear how they track billing. At the time of research, it was possible to copy the SSL certificates from one instance to another and successfully authenticate. Additionally, if you duplicated a Red Hat Enterprise Linux virtual hard disk and created a new instance from it, all billing association seemed to be lost but repository access was still available.

Where Azure Failed

On Azure, to set up repository connectivity, they provide an RPM with the necessary configuration; in the older version of their agent, the agent itself was responsible for this task [3]. The installation script it references comes from the following archive. If you expand this archive, you will find the client configuration for each region.

By browsing the metadata of the RPMs we can discover some interesting information:

$ rpm -qip RHEL6-2.0-1.noarch.rpm
Name        : RHEL6                        Relocations: (not relocatable)
Version     : 2.0                               Vendor: (none)
Release     : 1                             Build Date: Sun 14 Feb 2016 06:40:54 AM UTC
Install Date: (not installed)               Build Host: westeurope-rhua.cloudapp.net
Group       : Applications/Internet         Source RPM: RHEL6-2.0-1.src.rpm
Size        : 20833                            License: GPLv2
Signature   : (none)
URL         : http://redhat.com
Summary     : Custom configuration for a cloud client instance
Description :
Configurations for a client to connect to the RHUI infrastructure

As you can see, the build host enables us to discover all of the Red Hat Update Appliances:

$ host westeurope-rhua.cloudapp.net
westeurope-rhua.cloudapp.net has address 104.40.209.83

$ host eastus2-rhua.cloudapp.net
eastus2-rhua.cloudapp.net has address 13.68.20.161

$ host southcentralus-rhua.cloudapp.net
southcentralus-rhua.cloudapp.net has address 23.101.178.51

$ host southeastasia-rhua.cloudapp.net
southeastasia-rhua.cloudapp.net has address 137.116.129.134

At the time of research, all of the servers were exposing their REST APIs over HTTPS.

The URL to the archive containing these RPMs was discovered in a package labelled PrepareRHUI, available on any Red Hat Enterprise Linux box running on Microsoft Azure.

$ yumdownloader PrepareRHUI
$ rpm -qip PrepareRHUI-1.0.0-1.noarch.rpm
Name        : PrepareRHUI                  Relocations: (not relocatable)
Version     : 1.0.0                             Vendor: Microsoft Corporation
Release     : 1                             Build Date: Mon 16 Nov 2015 06:13:21 AM UTC
Install Date: (not installed)               Build Host: rhui-monitor.cloudapp.net
Group       : Unspecified                   Source RPM: PrepareRHUI-1.0.0-1.src.rpm
Size        : 770                              License: GPL
Signature   : (none)
Packager    : Microsoft Corporation <xiazhang@microsoft.com>
Summary     : Prepare RHUI installation for Redhat client
Description :
PrepareRHUI is used to prepare RHUI installation for before making a Redhat image.

The build host, rhui-monitor.cloudapp.net, is interesting; at the time of research, a port scan revealed an application running on port 8080.

Microsoft Azure RHUI Monitoring tool

Despite the application requiring username and password authentication, it was possible to execute a run of their “backend log collector” on a specified content delivery server. When the collector service completed, the application supplied URLs to archives containing multiple logs and configuration files from the servers.

Included within these archives was an SSL certificate that would grant full administrative access to the Red Hat Update Appliances [4].

Pulp admin keys for Microsoft Azure's RHUI

At the time of research all Red Hat Enterprise Linux virtual machines booted from the Azure Marketplace image had the following additional repository configured:

[rhui-PA]
name=Packages for Azure
mirrorlist=https://eastus2-cds1.cloudapp.net/pulp/mirror/PA
enabled=1
gpgcheck=0
sslverify=1
sslcacert=/etc/pki/rhui/ca.crt

Given that gpgcheck is disabled, anyone with full administrative access to the Red Hat Update Appliance REST API could have uploaded packages that would be acquired by client virtual machines on their next yum update.

The issue was reported in accordance with the Microsoft Online Services Bug Bounty terms. Microsoft agreed it was a vulnerability in their systems. Immediate action was taken to prevent public access to rhui-monitor.cloudapp.net. Additionally, they eventually prevented public access to the Red Hat Update Appliances, and they claim to have rotated all secrets.

[1] https://azure.microsoft.com/en-in/pricing/details/virtual-machines/red-hat/

[2] https://aws.amazon.com/partners/redhat/

[3] https://github.com/Azure/azure-linux-extensions/blob/master/Common/WALinuxAgent-2.0.16/waagent#L2891

[4] https://fedorahosted.org/pulp/wiki/Certificates

Hello World

resource "null_resource" "hello_world" {
    provisioner "local-exec" {
        command = "echo 'Hello World'"
    }
}
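
To run the snippet above, the standard Terraform workflow is all that’s needed; the “Hello World” appears in the output of the local-exec provisioner during apply:

$ terraform init
$ terraform apply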

Interested in automation and the HashiCorp suite of tools? If so, you’ll love this blog. Through different posts we will explore lots of different automation tasks, utilising both public and private cloud with the HashiCorp toolset.

Thanks, Ian.