Improve Your Upwork Job Search with RSS and Data Scraping

TLDR

Freelancers who use Upwork have an advantage if they apply to jobs soon after they are posted.

Upwork offers an RSS feed that can be parsed for job information sent in the jobs broadcast.

Feedparser is a Python module that can be used to extract some of the key data from the XML in that RSS feed.

Some of the data in the feed is more deeply embedded and so must be extracted and cleaned before use.

Combining the extracted data into a pandas DataFrame makes it possible to filter the data and save it in a more useful format.

At the end of this article, I’ll provide a Google Colab link for the interactive version.

The Coding Challenge

Upwork is seen to be a good platform for potential freelance jobs.

But there can be some challenges in getting to the jobs quickly enough. Early applications are frequently the ones accepted.

The job search interface is also not very well suited to the filtering and listing of jobs you are looking for.

And…
Freelancers need to actively search the jobs page!

This tutorial and video look at a way to accelerate that search and filter the jobs on preferred criteria.

Learning Objectives

By the end of this tutorial, you will have:

Defined the project requirements
Explored aspects of Data Scraping
Explored RSS feeds and XML
Followed a potentially useful and repeatable workflow
Built a useful tool
Developed some useful Python skills

Approach

As freelancers, it can be helpful to approach every task as a formal project.
It is good practice and you never know when a project might become something more valuable.

A client may want something similar or it may become a product you can sell.
So, it is a good discipline and will save time in the long run to approach such projects professionally.

A Useful Workflow

This pattern of development has proven helpful for me:

Set the project requirements
Follow a sound process for data scraping
- Investigate the data source
- Acquire the data
- Extract the data you want
- Clean the data
- Filter the data
- Output the data
- Use the data and confirm the information is valid
Document the project
Deliver to the client

And we’ll follow this process now.

Requirements Setting

Just like we do for our clients, we should have specific requirements.

I use the MoSCoW approach to setting my requirements.

This identifies what the project:

Must Do
Should Do
Could Do
Won’t Do

This sets out clearly what will be delivered and, equally important, what will not be delivered.

Our requirements:

MUST:

Provide data from Upwork relevant to the Freelancer
Present the information in a readable format

SHOULD:

Allow filtering and manipulation of the data as needed by the user
Allow for rapid refresh

COULD:

Run from the command line with arguments
Be automated

WON’T:

Have a graphical interface

We are limited on time. So we will focus on the data-scraping aspect of the task. And we will only complete the Must and Should requirements.

Getting the Data

Investigation

On the Upwork ‘search page’, entering a search term gives you a large number of potential jobs. But we only want some of these, preferably the latest, and we want them filtered to our needs. So we need something different from what is presented here.

The RSS symbol on the search page identifies the Really Simple Syndication (RSS) feed, and we shall be using that feed for our data.

If we click the link and select RSS, a new page opens with the job feed structured in the Extensible Markup Language (XML).

This is similar to HTML and is a markup language that is readable by the computer and by people (people who can look past the tags and format).

Apparently this dense text is ‘person’ readable!

Some of it seems ok but most is hard to read.

Let’s make this more readable.

Time is money for the Freelancer.
So let’s copy this data and use a web tool, an XML formatter, to explore the RSS XML data.

Here we can see that the XML forms a tree, with 10 individual elements, one for each job in this feed.

If we look into the first few items, we can see the information about each job.

Observations

It looks like the feed has 10 elements and each element has the Title, Link to the Job and a Description. The description appears to be a fragment of HTML that contains some of the information that we need.
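
If you prefer to check these observations in code rather than in a web formatter, here is a minimal sketch using Python's built-in xml.etree.ElementTree module (not feedparser, which we use later). The two-item feed below is a hand-written stand-in with made-up titles and links, not real Upwork data.

```python
import xml.etree.ElementTree as ET

# A hand-written, two-item stand-in for the job feed (not real data).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample job feed</title>
    <item>
      <title>First job - Upwork</title>
      <link>https://example.com/jobs/1</link>
      <description><![CDATA[Do a thing<br /><b>Budget</b>: $100]]></description>
    </item>
    <item>
      <title>Second job - Upwork</title>
      <link>https://example.com/jobs/2</link>
      <description><![CDATA[Do another thing<br /><b>Budget</b>: $50]]></description>
    </item>
  </channel>
</rss>"""

# Parse the XML text into a tree and collect every <item> under <channel>.
root = ET.fromstring(SAMPLE_RSS)
items = root.findall("./channel/item")

print(len(items), "items in the feed")
for item in items:
    # List the child tags of each item alongside its title.
    print([child.tag for child in item], item.findtext("title"))
```

Each item carries the same three children (title, link, description), which is the pattern we saw in the formatted XML.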

So let’s scrape that data next!

Acquisition

There are Python packages that we can use to scrape data from such XML feeds.

And it looks like we will be able to extract data from the ‘description’ field too. It appears to be a long string object and we have Python methods for strings.

For the RSS feed we will use ‘feedparser’.

Note: For simple data-scraping tasks, I like to use a Jupyter notebook. The notebook is useful because it holds the data in memory so it can be explored while we change the code.

This means we don’t need to capture the feed many times.

There is no reason you can’t use VS Code or PyCharm or any other editor.

Again as a freelancer, time is money, so use the tools you are familiar with.

Looking at the feedparser documentation, we can learn that the data is acquired and then parsed by a single call:

import feedparser
data = feedparser.parse(UPWORK_RSS_FEED_URL)

and then we can extract our 3 elements of data using:

import feedparser
data = feedparser.parse(UPWORK_RSS_FEED_URL)

item_title = data.entries[0].title
item_title
item_link = data.entries[0].link
item_link
description  = data.entries[0].description
description

Note: This data will change frequently as the RSS feed is updated with the latest jobs. This is why the RSS feed is so useful to us Finxters.

Extraction

We have already extracted the Title and the Link of the job just from the feedparser entries data.

item_title = data.entries[0].title
item_link = data.entries[0].link

Now we need to extract the individual elements of information from the ‘description‘ string.

Let’s take a closer look at one of the ‘description‘ strings.

description: We are looking for a skilled developer who can create a mobile application and web application for a fitness app. The main feature of the app will be the integration of AI technology to detect the user&#039;s body, diet, and workout plan. The successful candidate will be responsible for designing and developing the app, ensuring it is user-friendly and has a modern, sleek design. The app should be able to track user progress and provide personalized recommendations based on the user&#039;s inputs and body data. Key skills required for this project include: <br /><br />
- Mobile app development <br />
- Web app development <br />
- AI integration <br />
- UX/UI design <br />
- Data analysis and interpretation<br /><br /><b>Hourly Range</b>: $8.00-$10.00
<br /><b>Posted On</b>: December 02, 2023 17:57 UTC<br /><b>Category</b>: Mobile App Development<br /><b>Skills</b>:iOS, Android, Smartphone, Python, Mobile App Development
<br /><b>Skills</b>: iOS, Android, Smartphone, Python, Mobile App Development <br /><b>Country</b>: United States
<br /><a href="https://www.upwork.com/jobs/Fitness-App-Development-with-Functionality_%7E01494dc445d89c9f7f?source=rss">click to apply</a>

Here we see several lines of text with HTML tags such as ‘<br />‘ and character codes such as &#039; (an apostrophe).

Then we see a selection of headings inside HTML bold tags.

So the general theme for the description block is:

description – HTML code of variable lengths and with some HTML character codes and tags
Hourly Range – b_tags and some text
Posted On – b_tags and some text
Category – b_tags and some text
Skills – b_tags and 1 or more skills with commas and spaces in between
Skills – a repeated line of skills
Country – b_tags and some text
Link – a repeat of the link

Knowing this data structure, we can now use Python to extract the information we need.
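
To see that structure in practice, here is a minimal sketch that splits a shortened, made-up description on the ‘<b>‘ tag and maps each bold heading to the raw text that follows it. The function name split_description and the sample text are illustrative only; they are not part of the feed or of the final script.

```python
def split_description(description):
    """Split a description on '<b>' and map each bold label to its raw text."""
    parts = description.split('<b>')
    # Everything before the first bold tag is the free-text job description.
    fields = {'Description': parts[0]}
    for part in parts[1:]:
        # Each remaining part looks like 'Label</b>: value...'
        label, _, value = part.partition('</b>:')
        # setdefault keeps the first occurrence, so the repeated Skills line is ignored.
        fields.setdefault(label, value)
    return fields


# A shortened, made-up description in the same shape as the feed's.
sample = ('Build a fitness app.<br /><br /><b>Hourly Range</b>: $8.00-$10.00\n'
          '<br /><b>Posted On</b>: December 02, 2023 17:57 UTC<br />'
          '<b>Category</b>: Mobile App Development<br />'
          '<b>Skills</b>:iOS, Android\n'
          '<br /><b>Skills</b>: iOS, Android <br /><b>Country</b>: United States\n'
          '<br /><a href="https://example.com/job">click to apply</a>')

fields = split_description(sample)
for label, value in fields.items():
    print(repr(label), '->', repr(value))
```

The values still contain tags and stray whitespace, which is why each field needs its own cleaning step.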

Let’s write some code!

First we need to import some packages.

feedparser for the RSS feed.
pandas for our data storage and filtering.
ssl to bypass the SSL certificate check when the feed is fetched.

import feedparser
import pandas as pd
import ssl

Now we need a function to create and return an empty and prepared DataFrame in pandas.

As we’ve discussed, we need to store each job’s:

Title
Link
Description
Posted on
Category
Skills List
Price Type (Hourly Range or Budget)
Price (the budget or the maximum hourly rate)
Country the job originates in

def make_dataframe():

    jobs_df = pd.DataFrame(columns=[
        'Title', 
        'Link',
        'Description', 
        'Posted',
        'Category',
        'Skills',
        'Price Type', 
        'Price', 
        'Country' ])

    return jobs_df

We need a function that steps through the feed and extracts our information.

Firstly, we set blank default values so that if there are gaps in the information, we still have data to place in the DataFrame. Failure to do this would raise an error.

Then we ‘Parse the Feed’.

Title and Job link we can get directly from the feed entry.

But for the ‘description‘, we need to use the ‘str.split‘ method and split the string into a list of elements using the opening ‘bold‘ tag (‘<b>‘) as the separator.

This gives us:

description[0] is the first item in the list and is the main description field. We just need to strip this of HTML tags, and here we use the ‘clean_string‘ function.
‘Posted On’ and ‘Category’ also get cleaned with ‘clean_string‘.

Notice that we slice off only the part we need to send to be ‘cleaned’, e.g. clean_string(b_tag[15:]).

‘Hourly Range’ / ‘Budget’ get special treatment in the ‘clean_price’ function, where we return a float for the money value and a string for ‘Budget’ or ‘Hourly Range’.
‘Skills’ needs to be split into a list (for searching) and also cleaned.
‘Country’ also needs some special treatment.

Once cleaned, the data is assigned to a dictionary, which is turned into a one-row DataFrame and later added to the master DataFrame.

def get_data(entry):  # entry is a job item from the RSS feed

    # Some data ends up Null so set those values just in case
    item_posted = ''
    item_cat = ''
    item_price_type = ''
    item_price = 0.0
    item_skills = []
    item_country = ''

    # Set from parsing the feed
    item_title = entry.title
    item_link = entry.link
    description  = entry.description
    description = description.split('<b>')
    item_desc = clean_string(description[0])
    for b_tag in description[1:]:
        if "Hourly Range" in b_tag or "Budget" in b_tag:
            item_price_type, item_price = clean_price(b_tag)
        elif "Posted On" in b_tag:
            item_posted = clean_string(b_tag[15:])
        elif "Category" in b_tag:
            item_cat = clean_string(b_tag[14:])
        elif "Skills" in b_tag and not item_skills :
            item_skills = clean_skills(b_tag[11:])
        elif "Country" in b_tag:
            item_country = clean_country(b_tag[10:])

    # Build a one-row DataFrame for this job and return it

    new_job = {
    'Title': item_title, 
    'Link': item_link, 
    'Description': item_desc,
    'Posted': item_posted,
    'Category': item_cat, 
    'Skills': item_skills, 
    'Price Type': item_price_type, 
    'Price': item_price, 
    'Country': item_country}

    new_job_df = pd.DataFrame([new_job])

    return new_job_df

The ‘clean_string‘ function uses the ‘replace‘ method and takes each substring that isn’t required and either removes it or replaces it with the correct value.

Note: This is not the most Pythonic approach, but it has been written with clarity for beginners in mind. How would you make it more Pythonic?

def clean_string(string):

    string = string.replace('<br />','')
    string = string.replace('</b>','')
    string = string.replace('&nbsp;','')
    string = string.replace('&#039;', "'")
    string = string.replace('&rsquo;', "'")
    string = string.replace('&ldquo;', '"')
    string = string.replace('&rdquo;', '"')
    string = string.replace('&quot;', '"')  # match the whole entity, including the '&'
    string = string.strip()

    return string
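
One possible answer to that question, offered as a sketch rather than a drop-in replacement: the standard library's html.unescape decodes every HTML character reference in one call, and a small regular expression can strip the tags. The name clean_string_alt is made up for this example. Note that it behaves slightly differently from ‘clean_string‘: it removes every tag, not only ‘<br />‘ and ‘</b>‘, and it turns a non-breaking space into a normal space instead of deleting it.

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')  # any HTML tag, e.g. <br /> or </b>


def clean_string_alt(text):
    """Strip HTML tags, then decode character references such as &#039;."""
    # Remove tags first so that escaped text like &lt;b&gt; is not mistaken for a tag.
    text = TAG_PATTERN.sub('', text)
    # Decode named and numeric references (&quot;, &#039;, &amp;, ...).
    text = html.unescape(text)
    # &nbsp; decodes to a non-breaking space; swap it for a normal space.
    return text.replace('\xa0', ' ').strip()


print(clean_string_alt('the user&#039;s &quot;body&quot; data<br /><br />'))
```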

The ‘clean_price‘ function splits the identifier (‘Hourly Range’ or ‘Budget’) into a new string.

It then extracts the number (also a string) and returns it as a float along with the identifier. For an hourly range, the higher figure is the one returned.

def clean_price(item_Bud_HR):

    price_split = item_Bud_HR.split(':')
    item_price_type = clean_string(price_split[0]) 
    item_price = price_split[1]  # Get and clean the value
    item_price = item_price.replace('$', '')
    item_price = item_price.replace(',', '')  # tolerate thousands separators such as 1,000
    item_price = item_price.replace('<br />', '')
    item_price = item_price.strip()
    if '-' in item_price:
        # An 'Hourly Range' looks like 8.00-10.00; keep the number on the right of '-'
        item_price = item_price.split('-')[1]
    item_price = float(item_price)

    return item_price_type, item_price
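
As an alternative sketch of the same idea, a regular expression can pull out every dollar amount and keep the largest one. The name parse_price is made up for this example, and the handling of thousands separators (‘$1,500‘) and of a missing amount are assumptions about what the feed might contain, not something confirmed by the sample above.

```python
import re

# A dollar sign followed by digits, optional thousands commas and optional decimals.
AMOUNT_PATTERN = re.compile(r'\$([\d,]+(?:\.\d+)?)')


def parse_price(b_tag):
    """Return (price type, highest dollar amount) from a 'Label</b>: $x-$y' string."""
    label, _, rest = b_tag.partition('</b>:')
    amounts = [float(match.replace(',', '')) for match in AMOUNT_PATTERN.findall(rest)]
    # Fall back to 0.0 when no amount is present, matching the default used in get_data.
    return label.strip(), (max(amounts) if amounts else 0.0)


print(parse_price('Hourly Range</b>: $8.00-$10.00\n<br />'))
print(parse_price('Budget</b>: $1,500\n<br />'))
```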

The ‘clean_country‘ function splits the string on '\n'. It then takes the first element, drops any leftover tag characters in front of the colon, cleans off the white space and returns the Country name.

def clean_country(item_country):

    item_country = item_country.split('\n')
    item_country = clean_string(item_country[0])
    # Keep only the text after the last ':' so leftover tag characters are dropped
    item_country = item_country.split(':')[-1].strip()

    return item_country

The ‘clean_skills‘ function is a little more complex.

We create a new empty list, ‘item_skills_list‘.

We then clean the string by removing HTML tags.

We split the string on the ',' character and step through the list that is created, cleaning each string and then appending it to the list before it is returned.

def clean_skills(item_skills):

    item_skills_list =[]
    item_skills = item_skills.replace('<br />','')
    item_skills = item_skills.split(',')
    for skill in item_skills:
        item_skills_list.append(skill.strip())

    return item_skills_list

Once a new job DataFrame is created for each job, it is ‘concatenated’ to the master DataFrame for later filtering.

def join_dataframes(new_item_df,jobs_df):

    jobs_df = pd.concat([jobs_df, new_item_df], ignore_index=True)

    return jobs_df

The DataFrame jobs_df now holds all of the RSS feed jobs and their associated data. We can now filter it as required.

The filters I have presented here (commented out) offer examples for your own filters.

Strips out any duplicates based on the ‘Posted’ time.
Looks for budgets and hourly figures above 20.0 dollars
Looks for selected countries (United States and India)

What would you want to filter for?

def filter_output(jobs_df):

    # FILTER THE DATA USING pandas

    # 1. Strip out non-unique values - 'Posted' is the pseudo-primary key
    # Uncomment if needed
    #jobs_df = jobs_df.drop_duplicates(subset=['Posted'], ignore_index=True)

    # 2. Only keep jobs with a Price (budget or hourly rate) greater than $20
    # Uncomment if needed
    #jobs_df = jobs_df[jobs_df['Price'] > 20]

    # 3. Only keep jobs from specific countries
    # Uncomment below if needed
    #options = ['United States', 'India']
    #jobs_df = jobs_df[jobs_df['Country'].isin(options)]

    return jobs_df
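
Two more filters you might try, sketched here on a tiny made-up DataFrame: keeping only jobs that list a given skill, and sorting by the ‘Posted’ time. The helper names filter_by_skill and sort_newest_first and the extra ‘PostedAt‘ column are illustrative and not part of the script above. The date format string is an assumption based on the ‘December 02, 2023 17:57 UTC’ style seen in the feed.

```python
import pandas as pd

# Made-up rows using the same column names as jobs_df.
jobs_df = pd.DataFrame([
    {'Title': 'Job A', 'Posted': 'December 02, 2023 19:28 UTC', 'Skills': ['C++'], 'Price': 250.0},
    {'Title': 'Job B', 'Posted': 'December 02, 2023 20:13 UTC', 'Skills': ['Python', 'Ubuntu'], 'Price': 10.0},
    {'Title': 'Job C', 'Posted': 'December 02, 2023 20:01 UTC', 'Skills': ['Python', 'PyTorch'], 'Price': 300.0},
])


def filter_by_skill(df, skill):
    """Keep rows whose Skills list contains the given skill."""
    mask = df['Skills'].apply(lambda skills: skill in skills)
    return df[mask]


def sort_newest_first(df):
    """Parse the Posted strings into datetimes and sort with the latest job on top."""
    posted = pd.to_datetime(df['Posted'], format='%B %d, %Y %H:%M UTC')
    return df.assign(PostedAt=posted).sort_values('PostedAt', ascending=False, ignore_index=True)


python_jobs = sort_newest_first(filter_by_skill(jobs_df, 'Python'))
print(python_jobs[['Title', 'PostedAt']])
```

Because ‘Skills’ holds a Python list in each row, the membership test has to be applied row by row rather than with ‘isin‘.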

Here we have the main() function that takes the Upwork RSS URL and feeds it to each function in turn.

# MAIN
def main():

    # Why SSL?
    # Python verifies HTTPS certificates by default in the standard library.
    # This bypasses that check for the moment.
    if hasattr(ssl, '_create_unverified_context'):
        ssl._create_default_https_context = ssl._create_unverified_context


    url="https://www.upwork.com/ab/feed/jobs/rss?q=Python&sort=recency&paging=0%3B10&api_params=1&securityToken=6b9f07dc2632b4ac772d5daa37626af471b7d2526826c56a0c16aad6580245646f4e13804c72bd1ed3755f3bd552f5ba1d3f67a021987f714a1ff340ba7659dc&userUid=1215586676591329280&orgUid=1215586676603912193"

    #Get the Feed Data
    data = feedparser.parse(url)

    # Make the master dataframe
    jobs_df = make_dataframe()

    #Get the data for each item and add it to the DataFrame
    for entry in data.entries:
        new_job_df = get_data(entry)
        # join the new dataframe to the list
        jobs_df = join_dataframes(new_job_df, jobs_df)

Now that we have the jobs from the RSS feed in a DataFrame, we can filter using the pandas methods.

    jobs_df = filter_output(jobs_df)

	Title	Link	Description	Posted	Category	Skills	Price Type	Price	Country
0	AWS Python Consultant – Upwork	https://www.upwork.com/jobs/AWS-Python-Consult…	Fluent English speaking Python developer with …	December 02, 2023 20:13 UTC	DevOps Engineering	[Ubuntu, Amazon Web Services, Python, AWS Lamb…	Hourly Range	10.0	United Kingdom
1	1min Time Frame Forex Scalper – Upwork	https://www.upwork.com/jobs/1min-Time-Frame-Fo…	If you scalp the forex market on the m1 time f…	December 02, 2023 20:13 UTC	Deep Learning	[Forex Trading]	Hourly Range	40.0	United Kingdom
2	AI – driven crypto charting project – Upwork	https://www.upwork.com/jobs/driven-crypto-char…	Scope of work\nThedevelopment of a crypto char…	December 02, 2023 20:03 UTC	Machine Learning	[Artificial Intelligence, Machine Learning, Bl…	Hourly Range	40.0	Nigeria
3	Gelato Smart Contract Integration Upgrade – Up…	https://www.upwork.com/jobs/Gelato-Smart-Contr…	I’m looking for a Solidity developer with Foun…	December 02, 2023 20:01 UTC	Emerging Tech	[Solidity, Blockchain, TypeScript, Ethereum]	Hourly Range	40.0	United States
4	Publish Open-source AI Agent to Web UI (Flutte…	https://www.upwork.com/jobs/Publish-Open-sourc…	The goal of this project is to create a web UI…	December 02, 2023 20:01 UTC	Full Stack Development	[AI Agent Development, AI App Development, Flu…	Budget	100.0	Canada
5	Price check automation – Upwork	https://www.upwork.com/jobs/Price-check-automa…	Would like one of the experts to build me a bo…	December 02, 2023 19:57 UTC	Scripting & Automation	[Automation, Data Scraping, Data Mining, Data …	Hourly Range	100.0	Saudi Arabia
6	Need for Good Hackers to Assist in Scamming Si…	https://www.upwork.com/jobs/Need-for-Good-Hack…	We are looking for good hackers who can assist…	December 02, 2023 19:38 UTC	Information Security	[Data Entry, Python]	Hourly Range	45.0	United States
7	Microservices Architecture Help – Upwork	https://www.upwork.com/jobs/Microservices-Arch…	### The Data Synchronization Dilemma\n—\…	December 02, 2023 19:35 UTC	Back-End Development	[Python, Microservice, Software Architecture &…	Hourly Range	40.0	India
8	Build two AVL trees for project – Upwork	https://www.upwork.com/jobs/Build-two-AVL-tree…	I need an avl tree to hold a string node (key)…	December 02, 2023 19:28 UTC	Full Stack Development	[C++]	Budget	250.0	United States
9	ROMP texture on 3D SMPL mesh using Pytorch (No…	https://www.upwork.com/jobs/ROMP-texture-SMPL-…	(WARNING to SCAMMER)\nStarting from an existin…	December 02, 2023 19:23 UTC	AR/VR Design	[Python, PyTorch, Augmented Reality, Linux, Ub…	Budget	300.0	Germany

Once we have filtered the data to meet our needs, we use the pandas ‘to_excel‘ method to save the DataFrame to an Excel file (this needs an Excel writer package such as openpyxl to be installed).

The code also prints out the top 3 entries to demonstrate the data has been captured.

    print(jobs_df.head(3))
    # Export to excel
    jobs_df.to_excel('jobs.xlsx', index=False)

                                          Title  \
0                AWS Python Consultant - Upwork   
1        1min Time Frame Forex Scalper - Upwork   
2  AI - driven crypto charting project - Upwork   

                                                Link  \
0  https://www.upwork.com/jobs/AWS-Python-Consult...   
1  https://www.upwork.com/jobs/1min-Time-Frame-Fo...   
2  https://www.upwork.com/jobs/driven-crypto-char...   

                                         Description  \
0  Fluent English speaking Python developer with ...   
1  If you scalp the forex market on the m1 time f...   
2  Scope of work\nThedevelopment of a crypto char...   

                        Posted            Category  \
0  December 02, 2023 20:13 UTC  DevOps Engineering   
1  December 02, 2023 20:13 UTC       Deep Learning   
2  December 02, 2023 20:03 UTC    Machine Learning   

                                              Skills    Price Type  Price  \
0  [Ubuntu, Amazon Web Services, Python, AWS Lamb...  Hourly Range   10.0   
1                                    [Forex Trading]  Hourly Range   40.0   
2  [Artificial Intelligence, Machine Learning, Bl...  Hourly Range   40.0   

          Country  
0  United Kingdom  
1  United Kingdom  
2         Nigeria

That completes our exploration of the code.

I hope you have found some insight and value here.

Let us review.

Learning Objectives

This tutorial and video looked at how to read the RSS feed from Upwork, to accelerate your search and allow you to filter the jobs on your preferred criteria.

We have covered:

Defining your project requirements
Data Scraping
RSS feeds and XML (briefly)
A potentially useful workflow
The building of a useful tool
Some useful Python skills

Next Steps

This code is very flexible and so here are some options you might want to consider if you are extending its utility:

You may want to run this code on a timer to give you frequent updates.
You may also want to load the previous jobs scraped into the jobs_df DataFrame so that you can append new jobs.
You may also want to have a ‘list of urls’ for different searches that you step through in order to cover lots of searches.
If your searches are very specific, you might want to have the script email you when a job is posted.
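
As a starting point for the ‘list of urls’ idea, here is a minimal sketch that builds one feed URL per search term. The base address and the q, sort and paging parameters are taken from the URL used in main(). The helper name build_feed_url is made up, and reading paging=0;10 as "start at 0, return 10" is my assumption. The account-specific parameters (securityToken, userUid, orgUid) are left out here, so you would likely need to add the ones from your own feed link.

```python
from urllib.parse import urlencode

FEED_BASE = 'https://www.upwork.com/ab/feed/jobs/rss'


def build_feed_url(query, count=10):
    """Build a feed URL for one search term, newest jobs first."""
    params = {
        'q': query,
        'sort': 'recency',
        'paging': f'0;{count}',  # assumed meaning: offset;number of jobs
    }
    # urlencode escapes spaces and ';' so the URL is safe to request.
    return f'{FEED_BASE}?{urlencode(params)}'


searches = ['Python', 'data scraping']
feed_urls = [build_feed_url(term) for term in searches]
for url in feed_urls:
    print(url)
```

Each URL in feed_urls could then be passed to feedparser.parse in a loop, with the resulting jobs joined into the same master DataFrame.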

What will you do?

Resources:

https://jsonformatter.org/xml-formatter
https://ascii.cl/htmlcodes.htm

You can also check out this guide on Google Colab using this link.
