Quantifying data exchange volumes using AWS VPC flow logs

Bharat Kalluri / 2023-10-07

How much data is flowing between our MongoDB cluster and our main server?

This simple-looking question can be surprisingly hard to answer without the right tools. Let's assume the database is MongoDB and it is connected to a server running in a VPC. As MongoDB suggests, we'll use a VPC peering connection to connect to the database.

This means we need to find the database IP and our server's load balancer IP, then work out how much data flows between the two. Since both sit inside the VPC, we need to trace network traffic inside the VPC.

AWS lets us publish logs describing the traffic flowing inside the VPC (VPC flow logs). This is a very powerful data source for understanding how much data is flowing where. AWS charges for data transfer, and those costs can blow up if, for example, a high-TPM server keeps pulling a lot of data from the database.

For starters, publish the VPC flow logs to Athena so that we can query them using normal SQL. Here is a guide from AWS to do that.

Querying VPC logs

-- bytes / 1024^3 = gigabytes (dividing by 1024 only twice would give megabytes)
select sum(bytes) / power(1024, 3) as gigabytes_transferred, srcaddr, dstaddr
from vpcflowlogs.vpc_flow_logs_parquet
where year = '2023'
	and month = '09'
	and day = '15'
	and hour = '15'
group by srcaddr, dstaddr
order by gigabytes_transferred desc
limit 15;

Running this query in Athena against the newly created table shows how much data was transferred from each source address to each destination address during the selected hour, with the 15 largest pairs first. Visualizing this is a fairly simple exercise in graphing the result.
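To make the aggregation concrete, here is a minimal Python sketch of the same idea on a few in-memory records. The function name `top_talkers`, the sample addresses and the byte counts are all made up for illustration; only the field names `srcaddr`, `dstaddr` and `bytes` come from the query. It sums bytes per (source, destination) pair, converts to gigabytes (bytes / 1024^3) and keeps the largest pairs.

```python
from collections import defaultdict

BYTES_PER_GB = 1024 ** 3  # same conversion as bytes / 1024 / 1024 / 1024


def top_talkers(records, limit=15):
    """Sum bytes per (srcaddr, dstaddr) pair and return the largest pairs first."""
    totals = defaultdict(int)
    for rec in records:
        totals[(rec["srcaddr"], rec["dstaddr"])] += rec["bytes"]
    # Sort pairs by total bytes, descending, like "order by ... desc limit N"
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(src, dst, total / BYTES_PER_GB) for (src, dst), total in ranked[:limit]]


# Hypothetical flow records, invented for this example
sample_records = [
    {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.9", "bytes": 3 * BYTES_PER_GB},
    {"srcaddr": "10.0.2.9", "dstaddr": "10.0.1.5", "bytes": BYTES_PER_GB // 2},
    {"srcaddr": "10.0.1.5", "dstaddr": "10.0.2.9", "bytes": BYTES_PER_GB},
]

for src, dst, gigabytes in top_talkers(sample_records):
    print(f"{src} -> {dst}: {gigabytes:.2f} GB")
```

Note that each direction is a separate row: traffic from A to B and from B to A are counted independently, just as in the query result.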

Here is a simple Python script that visualizes the 15 rows returned by the query (saved as a CSV file) as a network graph, using the pyvis library.

import pandas as pd
from pyvis.network import Network

df = pd.read_csv("~/Downloads/flowlogs.csv")

# Create Network instance
net = Network(height='800px', width='100%', bgcolor='#222222', font_color='white')

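# Add one node per address and one edge per (source, destination) row
for _, row in df.iterrows():
    gigabytes = float(row['gigabytes_transferred'])
    net.add_node(row['srcaddr'])
    net.add_node(row['dstaddr'])
    # title is the hover tooltip (a string); value scales the edge width (a number)
    net.add_edge(row['srcaddr'], row['dstaddr'], title=f"{gigabytes:.2f} GB", value=gigabytes)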

net.toggle_physics(True)
net.show_buttons(filter_=['nodes', 'edges', 'physics'])
net.show('vpcflowlogs.html', notebook=False)

This writes an HTML file and some JS files to the same directory. Open the HTML file in a browser to explore the graph.

Now, you can pick an IP address and trace down which component of the infrastructure it belongs to and draw a picture of how much information is flowing where.
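As a rough sketch of that tracing step, the snippet below totals what a single address sent and received, and what two addresses exchanged in both directions. It assumes rows shaped like the CSV used above (`gigabytes_transferred`, `srcaddr`, `dstaddr`); the helper names `summarize_ip` and `traffic_between` and the sample values are hypothetical.

```python
import csv
import io


def summarize_ip(rows, ip):
    """Return (sent, received) gigabyte totals for one IP address."""
    sent = sum(float(r["gigabytes_transferred"]) for r in rows if r["srcaddr"] == ip)
    received = sum(float(r["gigabytes_transferred"]) for r in rows if r["dstaddr"] == ip)
    return sent, received


def traffic_between(rows, ip_a, ip_b):
    """Total gigabytes exchanged between two addresses, counting both directions."""
    pair = {ip_a, ip_b}
    return sum(
        float(r["gigabytes_transferred"])
        for r in rows
        if {r["srcaddr"], r["dstaddr"]} == pair
    )


# Made-up data standing in for the exported query result
SAMPLE_CSV = """gigabytes_transferred,srcaddr,dstaddr
4.0,10.0.1.5,10.0.2.9
0.5,10.0.2.9,10.0.1.5
1.25,10.0.1.5,10.0.3.7
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sent, received = summarize_ip(rows, "10.0.1.5")
exchanged = traffic_between(rows, "10.0.1.5", "10.0.2.9")
print(f"10.0.1.5 sent {sent} GB and received {received} GB")
print(f"10.0.1.5 <-> 10.0.2.9 exchanged {exchanged} GB")
```

Once the database and load balancer IPs are known, something like `traffic_between` gives a direct answer to the opening question for the hour covered by the query.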

PS: finding this is great, but it's not immediately helpful for understanding where data is flowing, since the IP addresses don't carry any self-explanatory tags. For this, we'll need to look up each IP and find the name of the resource it belongs to. We'll do this in another post.