For fun
Salut world wide web! In this post, I intend to provide a step-by-step tutorial on how to schedule your own crawlers using AWS, without having to pay third parties lots of money just because they’ve made the process easy.
The purpose of this approach is to collect random data from various sources that interest you. For example, in this post, I’m sharing my template for Centris.ca, a Quebec website for house rentals and sales. I’ve been collecting data from this website since September 2023.
Of course, it will depend on what you want to collect from the internet and how often you intend to do it, but overall, this approach will solve 90% of your problems.
This is purely for personal interests—whether it’s for conducting a real estate market analysis to buy your own house, finding great deals on cars, or anything else your creativity inspires.
Eventually, you could grow this idea into a business, and if you do, feel free to contact me—I can help you.
The Idea #
There are basically two services that increase the cost of web scraping: databases and proxies. While proxies are sometimes inevitable, the following approach focuses on avoiding the need for expensive databases.
The main idea here is to use S3 for data storage, ECS to run the tasks (in this case, the crawlers) on a schedule, and ECR to store the image containing the code for the task.
A key point is to store all collected data as JSON Lines and to use a small SQLite database to control how frequently you re-collect information from the website.
In the context of real estate, if you check out centris.ca, you will see it is a marketplace for houses and apartments.
The crawler I am sharing visits this website once a week and collects all the data available. However, to avoid overloading the site, it only re-collects data on a property if it has not been collected in the last 60 days. This means that every 60 days, if a property hasn’t been sold or rented, the crawler will re-collect that information, creating a track record with a 60-day delay.
Tools #
- To get your crawler up and running, you’ll need an AWS Account. We’ll primarily use three AWS services: ECR, ECS, and S3, as I explained earlier.
- On your computer, you’ll need to install the AWS CLI and Docker.
- You can also install Python and run the crawler on your own machine, but that is up to you; just take care not to get your IP blocked by the website (a few polite-crawling settings are sketched right after this list).
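If you do run it locally, Scrapy's built-in throttling is the easiest way to stay polite. Below is a minimal sketch for settings.py; the exact values are just an assumption, tune them to your taste.

# settings.py: conservative crawling so the site is not hammered (values are illustrative)
ROBOTSTXT_OBEY = True                   # respect robots.txt
DOWNLOAD_DELAY = 2                      # wait ~2 seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1      # one request at a time per domain
AUTOTHROTTLE_ENABLED = True             # let Scrapy back off when the site slows down
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0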
Mind Map #
How I see the process
- Everything starts with the Scrapy framework.
- After creating an image of it, we store it in ECR.
- We create a cluster (ECS), define a task for each spider, and schedule accordingly.
- We create an S3 bucket to store the crawled data.
- The process begins by retrieving the SQLite database from S3 and then collecting the data.
- After finishing, we store the updated SQLite database back to S3 and save the data (a minimal sketch of this loop follows right after this list).
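Here is that loop as a minimal sketch, assuming boto3 for the S3 calls, the BUCKET and spider environment variables used later in the post, and a <spider>/<spider>.db key layout; the helper names and paths in the actual repository may differ.

# run sketch: one crawl run, pull the DB, crawl, push the DB back (key layout is an assumption)
import os
import subprocess

import boto3
from botocore.exceptions import ClientError

BUCKET = os.environ["BUCKET"]                       # e.g. aw-src
SPIDER = os.environ.get("spider", "centris_ca")
DB_KEY = f"{SPIDER}/{SPIDER}.db"                    # assumed key layout in the bucket
DB_PATH = os.path.expanduser("~") + f"/data/{SPIDER}.db"

s3 = boto3.client("s3")
os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)

# 1. Retrieve the SQLite "already seen" database from S3 (it may not exist on the first run).
try:
    s3.download_file(BUCKET, DB_KEY, DB_PATH)
except ClientError:
    pass  # no database yet; the middleware will create one

# 2. Run the spider; the pipeline and middleware below handle the jsonl output and the dedup logic.
subprocess.run(["scrapy", "crawl", SPIDER], check=True)

# 3. Store the updated database back so the next scheduled run sees it.
s3.upload_file(DB_PATH, BUCKET, DB_KEY)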
Scrapy Details #
Some important details of the project.
There are two parts of the code I would like to highlight:
src/pipelines/jsonWriteAWS.py
-> which is responsible for storing the processed data and sending it to S3
# BUCKET, upload_blob and destymd come from the project's helper modules.
import json
import os

from itemadapter import ItemAdapter


class JsonWriterAWS:
    def __init__(self):
        pass

    def open_spider(self, spider):
        # Write items to ~/data/<spider>.jsonl while the spider runs.
        hd = os.path.expanduser("~") + "/data"
        if not os.path.exists(hd):
            os.makedirs(hd)
        self.datapath = hd + "/" + spider.name + ".jsonl"
        self.file = open(self.datapath, "w")

    def close_spider(self, spider):
        # Once the spider finishes, push the file to S3 under a yyyy/mm/dd key.
        self.file.close()
        upload_blob(
            BUCKET,
            self.datapath,
            destymd(spider.name, "jsonl")
        )

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines).
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
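For this pipeline to run, it has to be enabled in the Scrapy settings. The module path below is an assumption based on the file layout, and the priority is arbitrary; adjust both to match your project.

# settings.py: enable the JSON Lines / S3 pipeline
ITEM_PIPELINES = {
    "src.pipelines.jsonWriteAWS.JsonWriterAWS": 300,
}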
src/middlewares/DeltaFetchAWS.py
-> which is responsible for checking whether the database already contains the link and, once the spider is done, saving everything collected back to S3.
# Requires: sqlite3, datetime.datetime and scrapy.http.Response; `now` is a project helper.
def process_request(self, request, spider):
    # The spider puts the listing id in request.meta (see the spider sketch below).
    item_id = request.meta.get('id')
    if item_id:
        result = None
        if spider.delta_days > 0:
            self.cursor.execute("SELECT * FROM scrapy WHERE id=?", (item_id,))
            result = self.cursor.fetchone()
        if result:
            delta = now(True, 0) - datetime.strptime(result[1], "%Y-%m-%d %H:%M:%S")
            if delta.days > spider.delta_days:
                # Older than delta_days: forget the record and let the request go through again.
                print("The record is outdated. Creating a new request for it.")
                self.cursor.execute("DELETE FROM scrapy WHERE id=?", (item_id,))
                self.conn.commit()
                self.cursor.execute(
                    "INSERT INTO scrapy (id, url) VALUES (?, ?)", (item_id, request.url)
                )
                self.conn.commit()
            else:
                # Seen recently: short-circuit with a fake response so the page is not fetched again.
                spider.logger.info(f"ID {item_id} exists in database. Ignoring request.")
                return Response(url=request.url, status=200, body=b"Fake")
        else:
            # First time we see this id: record it and let the request proceed.
            try:
                self.cursor.execute(
                    "INSERT INTO scrapy (id, url) VALUES (?, ?)", (item_id, request.url)
                )
                self.conn.commit()
            except sqlite3.IntegrityError:
                print("The record already exists in the database. Skipping the insert.")
Building the Environment #
Create an S3 Bucket #
Here, there is no secret: you just click Create bucket, give it a name, and voilà.
Once you start running your spiders, a folder will be created using the spider’s name. Inside this folder, there will be a .db file to manage the requests and a structure of yyyy/mm/dd.jsonl files to store the data.
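By the way, upload_blob and destymd in the pipeline earlier are small project helpers. If you want to write your own, a minimal boto3 version matching this bucket layout could look like the following; the implementation is my assumption, not the repository's exact code.

# S3 helpers sketch (assumed implementation)
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def destymd(spider_name: str, ext: str) -> str:
    # Build a <spider>/yyyy/mm/dd.<ext> key, matching the bucket structure described above.
    today = datetime.now(timezone.utc)
    return f"{spider_name}/{today:%Y/%m/%d}.{ext}"

def upload_blob(bucket: str, local_path: str, key: str) -> None:
    # Upload a local file to s3://<bucket>/<key>.
    s3.upload_file(local_path, bucket, key)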
Create an ECR Repository #
Same idea: just create a private repository to store your images, and voilà.
.env #
To start the process, you must add a .env file to your project with the following credentials:
BUCKET=aw-src # in my case
ECR=awsrc # in my case
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=
AWS_ACCOUNT_ID=
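When running locally, those variables still have to reach the environment. One common way to do that (an assumption on my part; the repository may load them differently) is python-dotenv; boto3 then picks up the AWS credentials on its own.

# load the .env before anything touches S3 (pip install python-dotenv)
import os

from dotenv import load_dotenv

load_dotenv()                    # reads the .env file at the project root

BUCKET = os.environ["BUCKET"]    # same name the pipeline uses
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_DEFAULT_REGION are read by boto3 from the environment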
Deploy the image #
If everything worked well, you just need to run the script bash deploy.sh, and the image will be built and uploaded to your ECR repository.
Create an ECS Cluster #
Create your cluster to run the tasks, followed by the task definition.
While you are creating the task definition, there is only one really important part: the environment variables you pass in so the container knows which crawler to execute.
The rest of the task definition is up to you.
Overall, my settings are these:
{
  "family": "centris_ca",
  "containerDefinitions": [
    {
      "name": "awsrc",
      "image": "xxxxxxxxxx.dkr.ecr.xxxxxxx.amazonaws.com/awsrc:latest",
      "cpu": 0,
      "portMappings": [
        {
          "name": "awsrc-80-tcp",
          "containerPort": 80,
          "hostPort": 80,
          "protocol": "tcp",
          "appProtocol": "http"
        }
      ],
      "essential": true,
      "environment": [
        {
          "name": "spider",
          "value": "centris_ca"
        }
      ],
      "environmentFiles": [],
      "mountPoints": [],
      "volumesFrom": [],
      "ulimits": [],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/centris_ca",
          "awslogs-create-group": "true",
          "awslogs-region": "us-east-2",
          "awslogs-stream-prefix": "ecs"
        },
        "secretOptions": []
      },
      "systemControls": []
    }
  ],
  "executionRoleArn": "arn:aws:iam::xxxxxxxxxxxxxxxxx:role/ecsTaskExecutionRole",
  "networkMode": "awsvpc",
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "cpu": "512",
  "memory": "1024",
  "runtimePlatform": {
    "cpuArchitecture": "X86_64",
    "operatingSystemFamily": "LINUX"
  }
}
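The only part of that JSON the crawler itself cares about is the spider environment variable: the container's entrypoint reads it to decide which spider to launch. Here is a minimal sketch of such an entrypoint; the actual script baked into the image may differ.

# entrypoint sketch: pick the spider from the "spider" env var set in the task definition
import os
import subprocess
import sys

spider = os.environ.get("spider")
if not spider:
    sys.exit("No 'spider' environment variable set in the task definition.")

# Run the crawl exactly as you would locally; pipelines and middlewares handle S3 and SQLite.
subprocess.run(["scrapy", "crawl", spider], check=True)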
After that, you just need to go to ECS > Clusters, scroll down to Scheduled tasks, create a new one for your spider, and done.
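If you prefer scripting over the console, the same weekly schedule can be created with boto3 through EventBridge. This is an alternative I am sketching, not what the setup above uses; every ARN, subnet, and security group below is a placeholder.

# schedule sketch: weekly ECS scheduled task via EventBridge (all ARNs/ids are placeholders)
import boto3

events = boto3.client("events")

events.put_rule(
    Name="centris-ca-weekly",
    ScheduleExpression="rate(7 days)",   # once a week, as described above
    State="ENABLED",
)

events.put_targets(
    Rule="centris-ca-weekly",
    Targets=[{
        "Id": "centris-ca-task",
        "Arn": "arn:aws:ecs:us-east-2:123456789012:cluster/my-cluster",         # ECS cluster ARN
        "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",              # role allowed to launch the task
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-2:123456789012:task-definition/centris_ca",
            "TaskCount": 1,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-xxxxxxxx"],
                    "SecurityGroups": ["sg-xxxxxxxx"],
                    "AssignPublicIp": "ENABLED",
                }
            },
        },
    }],
)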
Final Thoughts #
I hope this gives you a clear understanding of the architecture behind it.
There are plenty of AWS tutorials available online to help you along the way.
If you still encounter any issues or get stuck at any point, feel free to message me on LinkedIn or email me at .
I’ve provided the pieces, now it’s up to you to build the puzzle.
Bye for now.