
Social Media Web Scraping Tool for Autohome

Problem Statement

When I first joined the digital marketing department, I noticed that my colleagues and agents were spending an excessive amount of time obtaining social media data. At times they even paid platforms substantial fees for data, although we needed only a portion of what was openly available. I identified three key issues:

  1. Manual copy-pasting of social media data was too slow and prone to errors.
  2. Individuals used their own methods of data analysis, resulting in inconsistent and non-standardized data over time.
  3. The high cost of data acquisition was affecting our ability to plan marketing strategies.

Believing that technology could alleviate these issues, I planned to use Python web scraping to gather and analyze data.

Methodology

The project was divided into two main stages, Data Collection and Data Processing, each with its own process and challenges.

Data Collection

The data collection process used Python to extract the relevant data from the Autohome web forum. A script was developed using the requests and json modules, along with the csv and urllib libraries.

Steps

  • Configured HTTP headers and parameters for the request to adhere to the structural requirements of the target webpage.
  • Sent HTTP GET requests to the target URLs, incorporating the above parameters to retrieve the required data.
  • Parsed the JSON response from each successful GET request to extract the desired data points.
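The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original script: the endpoint `BASE_URL`, the query parameter names, and the JSON keys (`result`, `list`, `userId`, and so on) are all hypothetical placeholders, since the actual Autohome request format is not given here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, headers, and field names (assumptions for illustration).
BASE_URL = "https://example.com/forum/api/threads"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; forum-research-script)",
    "Accept": "application/json",
}


def build_url(forum_id, page, page_size=50):
    """Return the request URL for one page of a forum's thread list."""
    query = urlencode({"forumId": forum_id, "page": page, "pageSize": page_size})
    return f"{BASE_URL}?{query}"


def parse_threads(payload):
    """Extract the data points of interest from a decoded JSON payload."""
    rows = []
    for item in payload.get("result", {}).get("list", []):
        rows.append({
            "user_id": item.get("userId"),
            "title": item.get("title"),
            "publish_date": item.get("publishDate"),
            "thread_type": item.get("type"),
            "reply_count": item.get("replyCount", 0),
            "url": item.get("url"),
        })
    return rows


def fetch_threads(forum_id, page):
    """Send one GET request and return the parsed rows (empty list on failure)."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.get(build_url(forum_id, page), headers=HEADERS, timeout=10)
    if response.status_code != 200:
        return []
    return parse_threads(response.json())


# Offline example: parse a made-up payload with the assumed structure.
sample = json.loads(
    '{"result": {"list": [{"userId": 42, "title": "First drive", '
    '"publishDate": "2023-01-05", "type": 1, "replyCount": 7, '
    '"url": "https://example.com/thread/1"}]}}'
)
rows = parse_threads(sample)
```

Keeping the parsing in its own function means it can be checked against saved responses without sending any requests.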

Using this methodology, we were able to collect a variety of data, including user IDs, thread titles, publishing dates, thread types, reply counts, and direct URLs to threads.

https://res.cloudinary.com/dn5fmt3xj/image/upload/f_auto,q_auto/v1/OB_Assets/ob/MyPhotos/qrhxjgodhcgl28htejvd
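One plausible way to persist these fields with the csv module is shown below. The column names and the sample rows are assumptions chosen to mirror the fields listed above; the original script's layout may differ.

```python
import csv
import io

# Column names are assumed; they mirror the fields listed in the text.
FIELDNAMES = ["user_id", "title", "publish_date", "thread_type", "reply_count", "url"]


def write_rows(rows, fileobj):
    """Write scraped rows to an open text file object as CSV with a header row."""
    # extrasaction="ignore" drops any keys that are not in FIELDNAMES.
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows for illustration only.
rows = [
    {"user_id": 42, "title": "First drive", "publish_date": "2023-01-05",
     "thread_type": 1, "reply_count": 7, "url": "https://example.com/thread/1"},
    {"user_id": 77, "title": "Range, charging", "publish_date": "2023-01-09",
     "thread_type": 0, "reply_count": 2, "url": "https://example.com/thread/2",
     "debug_note": "not a column, so it is ignored"},
]

buffer = io.StringIO()  # stands in for a real file in this example
write_rows(rows, buffer)
csv_text = buffer.getvalue()
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8-sig")` follows the csv module's guidance on `newline` and helps spreadsheet software detect the encoding of non-ASCII thread titles.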

Data Processing

After successful data collection, the next step was data processing and analysis. This was accomplished using Python’s pandas library due to its ease of use and powerful data manipulation capabilities.

Steps

  • Loaded the scraped data into a pandas dataframe for ease of manipulation.
  • Conducted data cleaning tasks, which included converting date columns to datetime values and replacing numeric identifiers with their respective labels.
  • Created subsets of data based on target months using pandas’ boolean indexing feature.
  • Applied the groupby functionality in pandas to aggregate data on a monthly basis, calculating metrics such as post count, reply count, and premium post count for each month for each vehicle model.
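The processing steps above might look like the following in pandas. This is a sketch under stated assumptions: the column names, the `TYPE_LABELS` mapping, the date range, and the sample rows are invented for illustration and are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical mapping from numeric thread-type codes to labels.
TYPE_LABELS = {0: "normal", 1: "premium"}

# Made-up rows standing in for the scraped data.
df = pd.DataFrame({
    "model": ["Model A", "Model A", "Model B", "Model A", "Model B"],
    "publish_date": ["2023-01-05", "2023-01-20", "2023-01-11", "2023-02-02", "2023-03-15"],
    "thread_type": [0, 1, 0, 1, 0],
    "reply_count": [3, 10, 0, 7, 2],
})

# Cleaning: parse dates and replace numeric identifiers with labels.
df["publish_date"] = pd.to_datetime(df["publish_date"])
df["thread_type"] = df["thread_type"].map(TYPE_LABELS)

# Boolean indexing: keep only threads published in the target months.
in_range = (df["publish_date"] >= pd.Timestamp("2023-01-01")) & (
    df["publish_date"] < pd.Timestamp("2023-03-01")
)
subset = df[in_range].copy()  # copy so new columns do not touch the original frame

subset["month"] = subset["publish_date"].dt.to_period("M")
subset["is_premium"] = subset["thread_type"] == "premium"

# Aggregate per model and month.
summary = (
    subset.groupby(["model", "month"])
    .agg(
        post_count=("reply_count", "count"),      # number of threads with a reply count
        reply_count=("reply_count", "sum"),       # total replies
        premium_post_count=("is_premium", "sum"), # True counts as 1
    )
    .reset_index()
)
```

Named aggregation keeps the output column names explicit, which makes the monthly summary easier to read after export.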

https://res.cloudinary.com/dn5fmt3xj/image/upload/f_auto,q_auto/v1/OB_Assets/ob/MyPhotos/jusacichgyu4h95fguj9

The final result was a monthly summary of activity and engagement trends on Autohome’s web forum for each target vehicle model. This gave us insight into forum users’ behavior and the popularity of various vehicle models over time. The summary was then exported to a CSV file for further use or analysis.

Combining web scraping with data analysis in this way provided a reliable approach for gathering and interpreting online forum data. The same approach applies to other fields, including market research, customer sentiment analysis, and brand monitoring.

Results

The result of this project was a successful extraction, processing, and analysis of valuable data from Autohome’s web forum. The data gathered provided critical insights relating to user engagement, popularity of different car models, and overall forum activity over several months.

The project culminated in a visual report that makes the data easier to consume and interpret.

https://res.cloudinary.com/dn5fmt3xj/image/upload/f_auto,q_auto/v1/OB_Assets/ob/MyPhotos/xllswrhgnzalpqb3renr

Pictured above, the report presents several charts built from the extracted data. These graphs make trends, patterns, and standout features easy to see, and offer an efficient way to understand user behavior and car model popularity on Autohome’s web forum.

Summary of Key Findings:

  • Engagement Analysis: Total post count and reply count were useful for gauging the level of engagement on the web forum for specific car models. High post and reply counts signified increased interest in, and discussion of, a particular car model.
  • Popularity Index: Premium or ‘featured’ post count, a distinctive feature in the forum, acted as an indicator to measure a model’s popularity. An increase in the number of such posts over time suggested a growing popularity and approval of that model among forum users.
  • Trend Analysis: Comparing these metrics on a month-on-month basis brought forth the emerging popularity trends of various car models. This could highlight surges in popularity, possibly due to new releases, model upgrades, or successful marketing campaigns.
  • Detailed Insights: Finally, the linking of individual forum threads helped in conducting a more in-depth analysis when required. For topics of heightened interest, the actual thread could be visited for qualitative analysis or sentiment analysis which could provide answers beyond the quantitative metrics.
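As an illustration of the month-on-month comparison described under Trend Analysis, the sketch below computes the change in post count per model. The figures, column names, and the 50% surge threshold are made up for the example and are not results from the project.

```python
import pandas as pd

# Made-up monthly summary; values are illustrative only.
summary = pd.DataFrame({
    "model": ["Model A", "Model A", "Model A", "Model B", "Model B", "Model B"],
    "month": ["2023-01", "2023-02", "2023-03", "2023-01", "2023-02", "2023-03"],
    "post_count": [10, 15, 12, 4, 4, 8],
})

# Sort so that each model's months are in chronological order.
summary = summary.sort_values(["model", "month"]).reset_index(drop=True)

# Absolute and relative change versus the previous month, computed per model.
summary["post_diff"] = summary.groupby("model")["post_count"].diff()
summary["post_change"] = summary.groupby("model")["post_count"].pct_change()

# Flag months where posts grew by at least 50% (threshold chosen arbitrarily).
SURGE_THRESHOLD = 0.5
surges = summary[summary["post_change"] >= SURGE_THRESHOLD]
```

The first month of each model has no previous month to compare with, so its change is missing and it is never flagged as a surge.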

The data obtained through the Autohome web scraping project was invaluable in generating actionable insights. This approach to data collection and analysis proved effective and could be applied in fields beyond automotive research, including business intelligence, brand monitoring, and market research. Similar techniques were used in subsequent projects, such as the TikTok scraping project, with comparable efficiency and success.