Hi scout, thanks for visiting this node of the Holy Internet Graph. Before the combination “Internet Accident” confuses you, let me decode its meaning in the context of this post.
Encounter with something on an edge node of the Holy Internet.
So this post is all about my attempt to answer a question I saw on Quora.
And I think that my answer is 98.76% correct, respecting the No-Zero-Redundancy law.
I saw this question while jumping between my Chrome tabs when I was struggling with old drafts of my blog. I clicked the link to give it a separate tab; even questions have self-respect.
I noticed it because the question also had GitHub among its topics.
So, to answer this, all I had to do was find a repository that uses the highest number of programming languages.
Wait, the question also says a “good known project”. Now that’s a plot twist.
Now, to look up the language statistics of a repository, you need the names of the repository and its owner. Then you can check the repo’s language statistics through the GitHub API.
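For the curious, here is a minimal sketch of such a lookup in Python. It assumes the public REST endpoint `/repos/{owner}/{repo}/languages`, which returns a JSON object mapping each language to its bytes of code. The helper names (`languages_url`, `fetch_languages`, `count_languages`) are mine, picked for illustration only.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def languages_url(owner, repo):
    """Build the REST URL that lists a repository's languages."""
    return f"{API_ROOT}/repos/{owner}/{repo}/languages"


def fetch_languages(owner, repo):
    """Return a dict of language name -> bytes of code (needs network access)."""
    request = urllib.request.Request(
        languages_url(owner, repo),
        headers={"User-Agent": "language-stats-sketch"},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_languages(stats):
    """Number of distinct languages in one languages response."""
    return len(stats)
```

Calling `count_languages(fetch_languages("twbs", "bootstrap"))` would then give the number of languages GitHub detects in that repository.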
As far as I know, there is no way to do a reverse lookup, where you just provide a number of programming languages and get back all the associated repositories.
So I thought, let’s give up on it and enjoy the coffee; I had one on my desk.
I closed the question tab and started enjoying the coffee, but then I opened it again because I felt there might be a chance if I used the GitHub Archive project.
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
This change of plan was not enough either, as GitHub Archive gives only the primary programming language of a project. Now it was time to shift to coffee full time, but nope! I had another idea.
The idea was to collect the good and known projects on GitHub and rank them by the number of programming languages in them. This sounds like brute force, but it was the only way left.
I headed to the GitHub API for this, but found that it gives you only the first 1,000 search results.
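To make that cap concrete, here is a small sketch that lists every search page you can reach before hitting it. It assumes the repository search endpoint with a `stars:>=N` qualifier and a maximum page size of 100; `search_page_urls` is a hypothetical helper, not something from my code.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"
MAX_RESULTS = 1000  # cap on search results mentioned above
PER_PAGE = 100      # assumed largest page size the search API accepts


def search_page_urls(min_stars):
    """Yield every page URL reachable before the 1,000-result cap."""
    pages = MAX_RESULTS // PER_PAGE
    for page in range(1, pages + 1):
        query = urlencode({
            "q": f"stars:>={min_stars}",
            "sort": "stars",
            "order": "desc",
            "per_page": PER_PAGE,
            "page": page,
        })
        yield f"{SEARCH_URL}?{query}"
```

Ten pages of 100 results each is all you get, which is far short of every repository with 500 or more stars.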
So I finally moved to GitHub Archive and downloaded a .csv file of all the repositories with 500 or more stars. Now, I think it covers most of your good and known projects. Happy now?
I opened the .csv file in Vim and found that some URLs were missing the owner’s name.
I removed all the corrupted lines with a single command.
If you’re thinking that removing these lines means we won’t consider those repositories, take a deep breath and think again: we will consider them, so keep reading.
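As a rough Python sketch of that cleanup (not the command I actually ran), the rows can be split into clean and corrupted ones, keeping the corrupted ones aside for later. It assumes URLs of the form `https://github.com/owner/repo` and a column called `repository_url`; both are assumptions made for illustration.

```python
import csv
import io
from urllib.parse import urlparse


def owner_and_repo(url):
    """Return (owner, repo) from a repository URL, or None if a part is missing."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def split_rows(csv_text, url_column="repository_url"):
    """Split CSV rows into (clean, corrupted) by whether the URL has an owner."""
    clean, corrupted = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        target = clean if owner_and_repo(row[url_column]) else corrupted
        target.append(row)
    return clean, corrupted
```

Keeping the corrupted rows in their own list, instead of throwing them away, is what lets them be handled separately afterwards.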
Then I wrote a Python module to interact with the GitHub API and named it postman. The postman uses an authentication token from GitHub so that it does not run into the API rate limit.
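A minimal sketch of what a collector like postman might look like is below. This is not the actual module: `auth_headers` and `collect_language_counts` are hypothetical names, and the `fetch` callable is passed in so the loop can be tried without touching the network. The token is sent in an `Authorization: token ...` header.

```python
def auth_headers(token):
    """Headers for an authenticated GitHub API request."""
    return {
        "Authorization": f"token {token}",
        "User-Agent": "postman-sketch",  # hypothetical client name
    }


def collect_language_counts(repos, fetch, log):
    """Ask `fetch` for each (owner, repo) and count the languages returned.

    `fetch(owner, repo)` must return a (status_code, payload_dict) pair.
    Every status is appended to `log`; only 200 responses are counted.
    """
    counts = {}
    for owner, repo in repos:
        status, payload = fetch(owner, repo)
        name = f"{owner}/{repo}"
        log.append((name, status))
        if status == 200:
            counts[name] = len(payload)
    return counts
```

Logging the status of every request, not just the successful ones, makes it possible to go back later and see which repositories did not answer with a 200.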
I don’t know how long postman took to finish its task, as I was attending my classes then. But when I came back from class in the evening, I saw that postman was done. Wow, an obedient postman.
Right now, if you search GitHub for repositories with 500 or more stars, you will see around 7,200 results. So we have been doing this right so far, as GitHub Archive doesn’t have all the data.
Once all the data is available, it’s time to rock ‘n’ roll. RStudio is always there to help us.
Plotting the stars and languages distribution over all the repositories, we get this graph.
You can see that most of our good and known projects are in the bottom-left corner. Almost 90% of our repositories have fewer than 20,000 stars and fewer than 20 languages.
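A share like that can be computed with a few lines of Python. The sketch below is only an illustration (the analysis itself was done in RStudio), and `share_below` is a hypothetical helper that takes a list of `(stars, languages)` pairs.

```python
def share_below(repos, max_stars, max_languages):
    """Fraction of repos with fewer stars and fewer languages than the limits."""
    if not repos:
        return 0.0
    inside = sum(
        1
        for stars, languages in repos
        if stars < max_stars and languages < max_languages
    )
    return inside / len(repos)
```

With the collected data, `share_below(repos, 20000, 20)` would give the fraction of repositories sitting in that bottom-left corner.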
There is only one repository with more than 40,000 stars: twbs/bootstrap.
OK, let’s come back to the question. I think ordering the repositories by their number of programming languages will work.
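That ordering is a one-liner in most languages. Here is a Python sketch, assuming a dict that maps `owner/repo` to its language count; `top_by_languages` is a hypothetical name, and ties are broken alphabetically just to keep the result stable.

```python
def top_by_languages(counts, n=25):
    """Return the n (repo, language_count) pairs with the most languages."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

The default of 25 simply matches the size of the ranking shown in this post.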
Top 25 projects by number of programming languages
So, is that all? Not at all! We have a lot to talk about, my friend.
Let’s talk about something interesting I found in the data.
Subject to the detection of language by GitHub.
Along with the required dataset, I was also collecting log data. So, there is something more interesting.
Now this is weird, as status 403 denotes the Forbidden state. But underlying secrets are much more awesome here.
The code for this report is on GitHub, pravj/Post-mortem.
If you feel that you can do more after going through this post-mortem report, you are welcome to show your medical skills.