Dec 12, 2018 10:37:32

Techno in Asia, day 4: My first scraper and how I learn on the go

by @flowen | 1086 words


Writing my first scraper! Very exciting, very challenging and working through lots of annoyances today 🤤


I can work fast. I can work super fast. I can work so fast that I can work myself into a burnout. Especially when I'm trying to learn new things. I expect myself to learn as fast as I work, but programming sometimes needs to be treated carefully. Sometimes it's simply complicated.

So now I'm taking things slower. Deliberately slower. Staying calm and not allowing frustrations to get to me. Mindfulness, yada yada.


Learn on the go

Before attempting something like this, you should understand the basics of programming, and having a grasp of the underlying concepts really helps. Apps can be programmed in a million ways, so understanding the concepts others use in a framework is super important if you want to pick something up quickly.

So when I learn new things, I first find some tutorials and use the one I assume is best (judging partly by comments on whether the code actually works). I skim all the blabla, install the tools, plugin or framework, copy their code and start playing around. Sometimes I look back at what was actually written when I don't understand something. Once I get a feel for it, I write out small tasks in a note.

For something like Gatsby, I wanted to understand it a bit better. For example, I started wondering how Gatsby transforms its data sources into static files. This helped me understand the concepts behind Gatsby, which can help me be more creative with solutions.


My first scraper

So I want to scrape data and insert it into my database. I chose Strapi as an interface, so I wouldn't have to write my own API. I could also insert straight into MongoDB, but I'm just not aware of all the conventions used. Conventions are surely a good thing, but learning them can also cost a lot of time. I just hope for the best and will see if Strapi works out for me. Loopback looks interesting as well.

On day 1 I installed Strapi on my local machine and my Ubuntu server. On day 3 I looked up in their docs how to authorise and post data. My first task: create a small Node.js app to post data. First challenge: the example for posting data doesn't work when you also have to authorise. In the WIP React Telegram group, Sander came to the rescue. There are different ways to use axios to connect to the API: I couldn't get my auth headers through with axios.post(), but axios({ method: 'POST' }) with an explicit headers option worked. Alright, now I know how to post data into Strapi.
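
A minimal sketch of what that looks like, assuming a local Strapi on its default port with an "events" content type (the endpoint and the JWT are placeholders, not my real setup):

const axios = require('axios');

const API_URL = 'http://localhost:1337'; // Strapi's default port
const JWT = '<your-jwt-token>';          // obtained via Strapi's auth endpoint

// POST one event, with the Authorization header Strapi expects
async function postEvent(event) {
  const res = await axios({
    method: 'POST',
    url: API_URL + '/events',
    headers: { Authorization: 'Bearer ' + JWT },
    data: event,
  });
  return res.data;
}

postEvent({ title: 'Some techno night', city: 'Tokyo' })
  .then(saved => console.log('saved with id', saved.id))
  .catch(err => console.error(err.message));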


Next task: scrape some pages and insert this data into Strapi.

ResidentAdvisor is my go-to source for techno events. I collect the URLs that concern me and quickly notice some differences:
https://www.residentadvisor.net/events/vietnam

but Japan, China and India are divided into regions:
https://www.residentadvisor.net/events/jp/tokyo

If I filter by month, it's added to the URL:
https://www.residentadvisor.net/events/jp/tokyo/month/2018-12-12

First I collect all the different country pages by hand. That took me 5-10 minutes. I created an array with countries. For a country divided into regions, I made it an object with an array of all the regions. Why? I don't know, I'm just playing around :)

['vietnam', { japan: ['jp/chubu', 'jp/tokyo'] }]
(I should probably omit the 'jp/' but who cares)
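
Flattening that mixed array into a list of URLs later could look something like this (a quick sketch, assuming the structure above):

// Flatten the countries array (strings + region objects) into URL paths
const countries = ['vietnam', { japan: ['jp/chubu', 'jp/tokyo'] }];

const paths = countries.reduce((acc, entry) => {
  if (typeof entry === 'string') return acc.concat(entry);
  return acc.concat(...Object.values(entry)); // region object: spread its arrays
}, []);

const urls = paths.map(p => 'https://www.residentadvisor.net/events/' + p);
// ['https://www.residentadvisor.net/events/vietnam',
//  'https://www.residentadvisor.net/events/jp/chubu', ...]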

The country pages contain a list of events with a link to the actual event page, which is what I want. Such as:

https://www.residentadvisor.net/events/1186622

I use request to get the HTML for these pages (see this tutorial) and nitpick the data I need with Cheerio. Cheerio is basically jQuery for Node.js (there is no browser in Node, it runs server-side, remember?), so I can easily create selectors for the data I need. I end up with some results and insert them into an object, ready to post to the Strapi API. I tested this for a single URL. Now all I need to do is loop through all the URLs of the different countries. But how?
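
Roughly what that looks like (a sketch: the link pattern is an assumption based on the event URLs above, I haven't nailed down RA's exact markup):

const request = require('request');
const cheerio = require('cheerio');

const url = 'https://www.residentadvisor.net/events/vietnam';

request(url, (err, res, html) => {
  if (err) return console.error(err);
  const $ = cheerio.load(html); // jQuery-style API over the fetched HTML
  const eventIds = [];
  // Assumption: event pages are linked as /events/<numeric id>
  $('a[href^="/events/"]').each((i, el) => {
    const match = $(el).attr('href').match(/^\/events\/(\d+)/);
    if (match) eventIds.push(match[1]);
  });
  console.log([...new Set(eventIds)]); // dedupe, e.g. ['1186622', ...]
});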

I'm not sure, I'm still playing around as this is all new territory. Right now my idea is this:

I collect all the event IDs of a country/region page and write them out to a country.json (e.g. vietnam.json). I'm not sure if I want a separate region.json (e.g. japan-tokyo.json) or whether to just put all the events in the country file as well.

I use fs (Node's file system module) to write out the files as JSON. I've never done this before, but I simply found some code online and customised it. Understanding the concept of streams in Node helps me understand it.
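
For files this small a plain write is probably enough (streams matter more once the data doesn't fit in memory). A minimal sketch:

const fs = require('fs');

// Write the collected event IDs for one country to <country>.json
function writeCountryFile(country, eventIds) {
  fs.writeFileSync(country + '.json', JSON.stringify(eventIds, null, 2));
}

writeCountryFile('vietnam', ['1186622']); // id taken from the example above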

I end up with all these country/region .json files. Next I'll write a script that imports these files, gathers all the data with Cheerio and inserts it into the Strapi API.

Challenge: make sure I don't insert duplicate data.

Another solution: Gatsby has a JSON transformer (gatsby-transformer-json) that transforms .json files into queryable nodes. That way I could bypass Strapi and MongoDB completely 😂
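
If I went that route, the config would be something like this (a sketch, assuming the .json files live in a ./data folder):

// gatsby-config.js
module.exports = {
  plugins: [
    {
      resolve: 'gatsby-source-filesystem',
      options: { name: 'data', path: './data' }, // folder with the country .json files
    },
    'gatsby-transformer-json', // turns each .json file into queryable GraphQL nodes
  ],
};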

But I don't want to source only RA, I also want other sources. So instead I should figure out how to post to the Strapi API directly, including checks for whether the data already exists.

You see, while I'm writing this, I'm thinking of ideas and evaluating them. This is what writing does for me :)

Another idea is to import the countries and regions into Strapi. The scraper would first query the API to see which countries/regions it needs to scrape. I can probably also check this way whether data already exists.
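
That duplicate check could be as simple as querying before posting (a sketch; the sourceId field is an assumption, it would store the RA event id on my content type):

const axios = require('axios');

const API_URL = 'http://localhost:1337';
const JWT = '<your-jwt-token>';

// Ask Strapi whether an event with this RA id is already stored
async function existsInStrapi(sourceId) {
  const res = await axios({
    method: 'GET',
    url: API_URL + '/events?sourceId=' + sourceId,
    headers: { Authorization: 'Bearer ' + JWT },
  });
  return res.data.length > 0; // Strapi returns an array of matches
}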

Current challenges:

- I have no clue yet how I want to import the images into Strapi. I guess it goes something like this (see the code sketch after this list):

1. download the image (headless browser? or a file stream with Node.js, piped to Strapi?)
2. upload the image to Strapi using their API (/upload)
3. receive the response, grab the image ID and post it together with the other data into Strapi

- I need to make sure I don't get banned, so I need to find a way to rotate proxies / user agents / IPs, etc. scraperapi.com offers 1000 free requests per month, but that probably won't cut it.
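
For the image flow, my current guess in code (a sketch, assuming Strapi's /upload accepts multipart form-data under a "files" field; the flyer URL is a placeholder):

const request = require('request');

const JWT = '<your-jwt-token>';

// Steps 1 + 2: stream the image straight from its URL into Strapi's
// /upload endpoint -- request streams compose, so no temp file is needed
request.post({
  url: 'http://localhost:1337/upload',
  headers: { Authorization: 'Bearer ' + JWT },
  formData: {
    files: request('https://example.com/flyer.jpg'), // placeholder image URL
  },
}, (err, res, body) => {
  if (err) return console.error(err);
  // Step 3: grab the uploaded file's id and post it with the event data
  const [file] = JSON.parse(body);
  console.log('uploaded, image id:', file.id);
});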


That's it so far for my scraping adventure. Conclusion: scraping a page is easy (using Cheerio and request), but database structure is hard.


PS @jasonleow (not sure if tags work here), is this of any use to you?








    @flowen wow thanks for taking the effort to write them down. Appreciate it! 😃 Hmmm unfortunately you're so much further ahead that I don't really understand most of it! But I'll be sure to bookmark this so that i can get back to it after I brushed up on the basics.

    Jason Leow | Dec 12, 2018 13:38:10

      @jasonleow hahah damn. Ok, I see. I think the first part is the most important part. Understand fundamentals and concepts once you know the tool. The tool here is programming like what are variables, arrays, objects and what are the things you can use to play with these variables such as conditionals (if then else) and loops. Once you master that you move on and understand concepts of different models, etc. Anyways, if you are getting to that point, hit me up, I'll help ya out :)

      lowen flowen | Dec 12, 2018 17:59:04

      @flowen thanks so much for offering to help 😃👍

      Jason Leow | Dec 13, 2018 03:46:45

    @jasonleow I don't think mentions work in articles, here ya go

    lowen flowen | Dec 12, 2018 10:37:54