Daniel's Weblog
Posts Tags Colophon
About

Tags / Pre2025

Scraping React with Python

Apr 14, 2018


The Campus Labs system Lehigh is using for student engagement is called “LINC” and it sucks. There, I said it. It’s a bloated, confusing React app that hinders club activities on campus. I say that having been a club president trying to administer a LINC page and now I need to contact all of the clubs to get their info for the yearbook.

I was able to get a CSV file from an administrator with the email addresses of all the club presidents, but not with their respective clubs.

I need to email all of the clubs to get them to submit photos and info to the yearbook, and my generic call to action mail-merge isn’t having great results.

So, let’s get that information ourselves.

First we need a list of all the clubs

That’s relatively easy. Just gotta scroll down through https://lehigh.campuslabs.com/engage/organizations repeatedly clicking the “load more” button at the bottom to generate a list of all the currently active clubs.

Then we can download the HTML.

Get the clubs URLs out of the HTML

This is what a single club’s entry looks like (There are around 350 clubs at Lehigh), have they never heard of classes?

<div>
  <a href="/engage/organization/marching97" style="text-decoration: none;">
<div tabindex="-1" style="color: rgba(0, 0, 0, 0.870588); transition: all 450ms cubic-bezier(0.23, 1, 0.32, 1) 0ms; box-sizing: border-box; font-family: &quot;Source Sans Pro&quot;, sans-serif; box-shadow: rgba(0, 0, 0, 0.117647) 0px 1px 6px, rgba(0, 0, 0, 0.117647) 0px 1px 4px; border-top-left-radius: 2px; border-top-right-radius: 2px; border-bottom-right-radius: 2px; border-bottom-left-radius: 2px; z-index: 1; height: 100%; border: 10px; display: block; cursor: pointer; text-decoration: none; margin: 0px 0px 20px; padding: 0px; outline: none; font-size: 16px; font-weight: inherit; position: relative; line-height: 16px; background-image: none; background-position: initial initial; background-repeat: initial initial;">
  <div style="padding-bottom: 0px;">
<div>
  <div style="margin-left: 0px; padding: 15px 30px 11px 108px; position: relative; background-color: rgb(255, 255, 255); height: 82px; overflow: hidden;">
<img alt="" size="75" src="https://images.collegiatelink.net/clink/images/c0489b78-ab1d-440e-a2e6-9f5d7bd331810db2b1fd-31d2-489f-acec-9202a9e0d090.png?preset=small-sq" style="color: rgb(255, 255, 255); background-color: rgb(188, 188, 188); display: inline-flex; align-items: center; justify-content: center; font-size: 37.5px; border-top-left-radius: 50%; border-top-right-radius: 50%; border-bottom-right-radius: 50%; border-bottom-left-radius: 50%; height: 75px; width: 75px; position: absolute; top: 9px; left: 13px; margin: 8px; background-size: 55px;">
<div style="font-size: 18px; font-weight: 600; color: rgb(73, 73, 73); padding-left: 5px; text-overflow: ellipsis; margin-top: 5px;"> Marching 97 </div>
<p class="DescriptionExcerpt" style="font-size: 14px; line-height: 21px; height: 45px; margin: 15px 0px 0px; color: rgba(0, 0, 0, 0.541176); overflow: hidden; text-overflow: ellipsis; display: -webkit-box; -webkit-line-clamp: 2; -webkit-box-orient: vertical; padding-left: 5px;">Since 1906, the Marching 97 has been a part of Lehigh with its signature leg-liftery, sing-singery, and a unique brand of spirit called psyche! The Marching 97 performs a new student-written show during halftime at every performance.</p>
  </div>
</div>
  </div>
</div>
  </a>
</div>

What we need is each club’s unique URL, which is embedded somewhere inside this hellscape of React output.

/engage/organization/marching97  

A little regex can help us do that:

\/engage\/organization\/[^"]*

I like building Regex on regexr.com.

Turn that in to a full URL

Now we’ve got all of the paths from AlphaOmegaEpsilon to zoellnerartscenter stored in All_Orgs.txt. Time to turn those in to full URLs with a quick perl script.

perl -e 'while (<>) {print "https://lehigh.campuslabs.com$_"}' < All_Orgs.txt > All_Orgs_full.txt

Set up our Python Enviroment

In 4 easy steps:

  1. mkdir scraper
  2. virtualenv venv
  3. pip install beautifulsoup4
  4. source venv/bin/activate

Determine what needs to be scraped

The anatomony of each club page looks like this:

<!DOCTYPE html>
	<head>
		<script>Google Analytics</script>
		<script>Various other things</script>
		<link rel="stylesheet" href="some_css_file.css">
	</head>
	<body>
		<div id="react-app"></div>
		<script>window.initialAppState={LOTS OF GOOD STUFF HERE}</script>
	</body>
</html>

This is a React app, so when we download each page it doesn’t look like pretty HTML, instead all of the info is stored in a JSON array called, “initialAppState” which React then renders into the actual content in your browser.

It’s full of good stuff, lets tear apart the page for the a cappella group A Whole Step Up:

(I’ve removed some items for the sake of suicintness)

{
	"institution": {
		"id": 3928,
		"name": "LINC- Lehigh Involvement Connection",
		"coverPhoto": "edf6fa7f-9f57-492d-9f72-227bc6a877206c03f3f6-70e2-4fdc-b1e1-11ad2ada0a88.png",
		"src": null,
		"accentColor": "#11b875",
		"config": {
			"institution": {
				"name": "Lehigh University",
				"campusLabsHostUrl": "https://lehigh.campuslabs.com/engage/"
			},
			"communityTimeZone": "America/New_York",
			"communityDisplayName": "LINC- Lehigh Involvement Connection",
			"communityDirectoryEnabled": false,
			"analytics": {
				"enabled": true,
				"property": "UA-309883-56"
			}
		}
	},
	"preFetchedData": {
		"event": null,
		"organization": {
			"id": 26298,
			"communityId": 118,
			"institutionId": 3928,
			"name": "A Whole Step Up",
			"shortName": "A Whole Step Up",
			"nameSortKey": "A",
			"websiteKey": "wholestep",
			"email": "Redacted",
			"description": "<p>A Whole Step Up is Lehigh's best, and only, all-male a cappella group. Whole Step is dedicated to delivering high quality, comedic music into the hearts, minds, and souls of its listeners.</p>",
			"summary": "A Whole Step Up is Lehigh's best, and only, all-male a cappella group. Whole-Step is dedicated to delivering high quality, comedic music into the hearts, minds, and souls of its listeners.",
			"status": "Active",
			"comment": null,
			"showJoin": true,
			"statusChangeDateTime": null,
			"startDate": "1969-12-31T00:00:00+00:00",
			"endDate": null,
			"parentId": 26237,
			"wallId": 32888,
			"discussionId": 32888,
			"groupTypeId": null,
			"organizationTypeId": 941,
			"cssConfigurationId": 3782,
			"deleted": false,
			"enableGoogleCalendar": false,
			"modifiedOn": "0001-01-01T00:00:00+00:00",
			"socialMedia": {
				"externalWebsite": "",
				"flickrUrl": "",
				"googleCalendarUrl": "",
				"googlePlusUrl": "",
				"instagramUrl": "",
				"linkedInUrl": "",
				"pinterestUrl": "",
				"tumblrUrl": "",
				"vimeoUrl": "",
				"youtubeUrl": "",
				"facebookUrl": "https://www.facebook.com/awholestepup",
				"twitterUrl": null,
				"twitterUserName": null
			},
			"profilePicture": "d56fb8db-2533-41c4-b7c6-7e56b9b5b8c1f471c68d-8e54-4654-9236-31310cb6246d.jpg",
			"organizationType": {
				"id": 941,
				"name": "Student Senate Recognized Organization"
			},
			"primaryContact": {
				"id": "Redacted",
				"firstName": "Garrett",
				"preferredFirstName": null,
				"lastName": "Redacted",
				"primaryEmailAddress": "Redacted",
				"profileImageFilePath": null,
				"institutionId": 3928,
				"privacy": "Unselected"
			},
			"isBranch": false,
			"contactInfo": [{
				Redacted
			}]
		},
		"article": null
	},
	"imageServerBaseUrl": "https://images.collegiatelink.net/clink/images/",
	"serverSideRender": false,
	"baseUrl": "https://lehigh.campuslabs.com",
	"serverSideContextRoot": "/engage",
	"cdnBaseUrl": "https://static.campuslabsengage.com/discovery/2018.4.13.4"
}

Retrieve the Info with Python

Since we’ve got the all of the URLs stored in a file, I think it makes the most sense to iterate through the file with a bash script and pass in the URL to scrape as a command line argument to a python script.

Usually with BeautifulSoup we could just specify what element we wanted by class or id, but unfortunately there’s nothing identifying this beyond the script tag.

I’m going to assume that all of these pages are laid out identically and that the JSON is always located in the 5th script tag on the page.

response = simple_get(url)
html = BeautifulSoup(response, 'html.parser')

# If we got a successful response....
if response is not None:
	# There are multiple <script> tags containing things like Google Analytics
	# the React client library, and some other stuff, but we only care about the
	# one with the initialAppState JSON array in it
	script_sections = html.find_all('script')
	print(script_sections[4].text)

Next we save that element:

	json_raw = script_sections[4].text

So the variable json_raw now looks like this:

window.initialAppState = {...};

Now we’ve got to slice off everything on either end of the curly brackets, and parse the json into a format python understands.

	json_raw = json_raw[25:-1]
	club_info = json.loads(json_raw)

And we’ve got our JSON array!

Write the relevant info to a CSV file

Now that we’ve identified the relevant fields all we have to do is pull everything we want out of the variable holding the JSON!

    fields = [
        club_info['preFetchedData']['organization']['primaryContact']['primaryEmailAddress'],
        club_info['preFetchedData']['organization']['primaryContact']['firstName'],
        club_info['preFetchedData']['organization']['socialMedia'].get('facebookUrl', ''),

     ]

    with open('clubs.csv', 'a') as csvfile:
        club_writer = csv.writer(csvfile)
        club_writer.writerow(fields)

Not all of the clubs have all of the various social media fields in their JSON array, so using .get returns nothing instead.

Make a bash script to do this for all the clubs!

  1. get_clubs.sh
#!/bin/bash
source venv/bin/activate

while read p;
do 
    python scraper.py ${p}
done < All_Orgs_full.txt
  1. chmod +x get_clubs.sh

Full code and results:

Looks like it worked! :-)

github.com/djbeadle/lehigh-clubs-2018

Bird Cam!

Apr 10, 2018


A mourning dove nested above our doorway recently! So we did the obivious thing and pointed a webcam at it.

You can check that out at lehigh.party:8081. HTTP only, I’m afraid.

It’s powered by a Raspberry Pi B 2 with a Raspberry Pi Camera V2 (8mp), connected to the internet by a wireless dongle. The software is MotionPiEye. Be patient, it’s doing it’s best.

April 1st / Easter Scavenger Hunt

Apr 1, 2018


In the grand tradition of Marching 97 April Fools Day events, Veronica & I put together a scavanger hunt around Lehigh’s campus. The first clue went live at 2:00 pm on lehigh.party:9000:

Clue #1:

The staircase where the day began on Nov. 17th, 2017

Some people were exicted:

And some were less so:

I would like to ensure any members of the Dean of Student’s Office that this event was entirely voluntary, and no hazing was involved.

Leaderboard

First Place:

Made it to the end at 2:28 pm, winning a disgusting chocolate bunny. They attribute their winning time to splitting their group. One person ran across campus to get the final clue, and then texted a picture of it to the others who went straight to the end point.

img

Marta
Vinny
Tim
Mike

Second Place:

Arrived at 2:41 pm, apparently they got delayed when Evan stopped to talk to an incoming student.

second place Evan C
Erin
Brianna
Henry

Third Place:

2:59 pm
Grace
Rebecca

Fourth Place:

4:00 pm. Apparently they drove from clue to clue.
Anil
Joe
Ryan B


final

Hugo Blog Setup

Mar 31, 2018


Hugo is installed on webserver, blog exists in git repository in /home/djbeadle/blog.

The git repository is synchronized via GitHub, for easy editing a copy exists on my laptop.

To author a new blog post on laptop:

  1. Open MacDown
  2. Write!
  3. Save to /Users/Djbeadle/Sites/blog/content/post/[NEW-POST].md
  4. Commit [NEW-POST]
  5. ./update_blog.sh

update_blog.sh logs in to the webserver using SSH keys, updates the repository, clears the public directory and then updates it with the new content.

#!/bin/bash
git -C /Users/Djbeadle/Sites/blog push
ssh website "
    git -C /home/djbeadle/blog pull 
    cd /home/djbeadle/blog
    rm -rf /var/www/blog/*
    rm -rf /home/djbeadle/blog/public/
    hugo
    cp -r /home/djbeadle/blog/public/* /var/www/blog/
"
Dev Enviroment

Mar 31, 2018


Mostly for my own future reference:

13" Retina MacBook Pro 2014
2.8 GHz Intel Core i5
16 GB RAM, 256 GB

Hardware Tinkering Stuff:

CoolTerm: Simple graphical terminal emulator

SerialTools: Available on the Mac App store. Nifty little terminal emulator.

Etcher: Burn SD cards and USB drives with great ease.

Fritzing: For diagrams

Arduino: Both vanilla and with the ESP8266 Arduino Core

Raspberry Pi:

PiBakery: Easy way to configure SD card images

ApplePiBaker: Easy GUI to backup Raspberry Pi SD cards

Communications

Airmail: Gmail-centric email client. Great with lots of accounts.

Textual: For the occasional IRC chat.

Thunderbird: solely for mail merges. Best way to get people to respond? An email with their name on the very first line.

Keybase: Because I like the idea of privacy. How often do I actually use it? Rarely.

Browser

Firefox: Love the browser, love Mozilla. With these settings performance on Mac is much improved.

Media

Spotify: I will be very dissapointed when my half-price student subscription runs out.

Fog: Unofficial Electron based client for the Overcast podcast app.

Boom 2: System wide equalizer for macOS

File Management

Compare Folders: Does exactly what the name implies.

Disk Inventory X: I always seem to run out of disk space.

FileZilla: For all of your file transferring needs.

ImageOptim: Everything going on a webpage goes through this first.

Text Editors

MacDown: Wrote this with MacDown!

Sublime Text & Visual Studio Code: Haven’t settled on one or the other.

Dev

Macports

Homebrew

Other:

Workspaces: “Remembers your stuff. Launches it for you”

Amphetamine: Keeps your Mac awake.

TinkerTool: Great for modifying harmless macOS settings that aren’t exposed via Control Panel.

  • Finder > Show hidden and system files
  • Dock > Disable animation when hiding or showing the dock
  • Dock > Disable delay when showing hidden dock

OpenEMU: Still haven’t beat FireRed yet.

Better Touch Tool: Gestures. All the gestures.

Deliveries: And it’s corresponding iOS app

VMware Fusion: Expensive, but useful.

Web Server Settings

Mar 31, 2018


Some config settings, nothing crazy but it took me a while to figure out.

Ubuntu 17.10
Artful Aardvark

Exposed Services:

Grafana: https://danielbeadle.net/stats

Virtual Radar Server (VRS): https://danielbeadle.net/radar

Grafana:

The important details of /etc/grafana/grafana.ini:

[server]
protocol = http
http_port = 3000
domain = danielbeadle.net
root_url = http://danielbeadle.net/stats

Apache:

/etc/apache2/sites-enabled/danielbeadle.net.conf:

TODO*: This doesn’t properly redirect HTTP requests to HTTPS.*

<VirtualHost [DOMAIN_NAME]:80>
    ServerName [DOMAIN_NAME]
    DocumentRoot [WEBSITE_PATH]
</VirtualHost>

<IfModule mod_ssl.c>
<VirtualHost [DOMAIN_NAME]:443>
    ServerName [DOMAIN_NAME]

    ServerAdmin [EMAIL]
    DocumentRoot [WEBSITE_PATH]

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined

    Alias /blog [BLOG_PATH]
    <Directory [BLOG_PATH]>
        Options MultiViews Indexes Includes FollowSymLinks ExecCGI
        AllowOverride All
        Require all granted
        allow from all
    </Directory>
    
    # Reverse Proxy to Grafana:
    ProxyPass /stats http://localhost:3000
    
    # Reverse Proxy to Virtual Radar Server
    ProxyPass /radar http://danielbeadle.net:8080/VirtualRadar

RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}$1 [R=301,L]

# Letsencrypt files for HTTPS:
SSLCertificateFile [path to fullchain.pem]
SSLCertificateKeyFile [path to privkey.pem]
Include [path to options-ssl-apache.conf]
</VirtualHost>
</IfModule>

# vim: syntax=apache ts=4 sw=4 sts=4 sr noet
Lehigh Links

Mar 10, 2018


General Reference:

Roomfinder:

A map of all the building and rooms on campus with a search tool. https://gisweb2.cc.lehigh.edu/roomfinder/default.aspx

Computer Lab usage

Want to see if there’s any free space in a computer lab? http://cf.lehigh.edu/pubsite/

https://lts.lehigh.edu/services/explanation/print-email-lts-printers

Listserv

https://lists.lehigh.edu/mailman/listinfo/

Public Google Calendars:

Lehigh calendars that you can add to your own calendars! http://lts.lehigh.edu/services/explanation/lehigh-pubilc-calendars-google-apps-help

Media:

Flickr Pages

Lehigh University:

https://www.flickr.com/photos/lehighuniversity

Lehigh Engineering:

https://www.flickr.com/photos/lehighengineering

Special Collections:

https://www.flickr.com/photos/143267113@N08

Baker Institute:

https://www.flickr.com/photos/lehighbakerinstitute

Tumblr:

Scenes from South Mountain:

http://lehighu.tumblr.com

Twitter:

Twitter list of Lehigh University Colleges, People, Orgs, etc:

https://twitter.com/djbeadle/lists/lehigh-orgs-and-people

News:

Greek Life Blog:

http://lehighgreeks.blogspot.com

Interesting Network Stuff:

Lehigh Grafana

https://grafana.cc.lehigh.edu

The Server of Moria

http://moria.cse.lehigh.edu

Lehigh Network Information Page

More info about Lehigh’s network setup than you ever wanted to know, including some good photos of network hardware in the various buildings over the years. http://www.lehigh.edu/~insna/

Lehigh Network Statistics

Want to know the total traffic going in and out of Lehigh’s network? The number of users on the VPN? The average age of Cisco switches? Must be on the Lehigh network to access. https://rover.cc.lehigh.edu/network/

Wireless Network Statistics by Building

https://ss.cc.lehigh.edu/woc/

Other:

Info on Phones

If you dial “0” on the dorm phones during business hours and ask the operator “What phone number am I calling from?” they’ll tell you. https://rover.cc.lehigh.edu/extensions.txt

Lost Forest Map:

https://www.google.com/maps/d/u/0/viewer?mid=15I-WbirRcVLeIWGlnR2eXQRlr7E&ll=40.59645280422191%2C-75.37894257355651&z=17