Techie-Talk: The Inner Workings of Data Technology

Eleanor Barlow

Interview with Dean Wronowski: Freelance Full-Stack Software Developer


Good morning and thank you for joining me today. So, to kick-start our Techie-Talk, can you please explain to us what it is you do at Pansensic?


Dean Wronowski: “Right… my job is essentially to program. At Pansensic, I have developed a system that allows colleagues within the company to launch a scrape, which means to harvest or extract comment data.

“There are millions of comments on thousands of websites. Say, for instance, you have been tasked with scraping URLs relating to a certain product for a client: you can put all those URLs into the scraper UI and then specify whether the content needs to be translated, using either the Google or Microsoft translation service. Once you have adjusted the settings to what you want, the scraper will go and grab all the available data. This could include the product title, descriptions, images and comments.
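To make that workflow concrete, here is a minimal sketch in PHP of what launching a scrape like that could look like. The `Scraper` class, option keys and method names are purely hypothetical illustrations; the interview does not describe Pansensic’s actual API.

```php
<?php
// Hypothetical illustration only: the class, option keys and method names
// below are assumptions, not Pansensic's actual scraper.

class Scraper
{
    private array $settings;
    private array $queue = [];

    public function __construct(array $settings)
    {
        $this->settings = $settings;
    }

    public function queue(string $url): void
    {
        $this->queue[] = $url;
    }

    public function run(): void
    {
        foreach ($this->queue as $url) {
            // A real implementation would fetch each page, pull out the
            // product title, descriptions, images and comments, and
            // translate them if the settings ask for it.
            echo "Scraping {$url} (translator: {$this->settings['translator']})\n";
        }
    }
}

$settings = [
    'translate'  => true,       // translate foreign-language comments
    'translator' => 'google',   // 'google' or 'microsoft'
    'fields'     => ['title', 'description', 'images', 'comments'],
];

$scraper = new Scraper($settings);

foreach (['https://example.com/product/123/reviews',
          'https://example.com/product/456/reviews'] as $url) {
    $scraper->queue($url);      // one job per URL supplied in the UI
}

$scraper->run();                // launch the scrape and harvest the data
```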

“After the data has been scraped, the system runs a number of algorithms, such as natural language processing, to process the text, as well as machine learning algorithms, such as classification and image recognition, to identify the objects needed for analysis.
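As a purely illustrative example of the kind of processing involved (the interview does not detail Pansensic’s actual algorithms), a very simple classification step in PHP might tag each comment based on keyword counts:

```php
<?php
// Toy illustration of a classification step, not Pansensic's real NLP.
// It labels a comment 'positive' or 'negative' from simple keyword counts.

function classifyComment(string $comment): string
{
    $positive = ['love', 'great', 'excellent', 'recommend'];
    $negative = ['hate', 'poor', 'broken', 'refund'];

    $text  = strtolower($comment);
    $score = 0;

    foreach ($positive as $word) {
        $score += substr_count($text, $word);
    }
    foreach ($negative as $word) {
        $score -= substr_count($text, $word);
    }

    return $score >= 0 ? 'positive' : 'negative';
}

echo classifyComment('I really love this product');  // prints "positive"
```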

“These comments then get stored in a database. At the same time, you can start multiple scrapes that go through other sites and do the same thing, downloading information into separate databases. So, at the end of the day, you can have several scrapes running at once and several databases full of products and comments.”
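The storage step could, in outline, look something like the sketch below, which assumes one MySQL database per project and a `comments` table; the schema, database name and credentials are invented for illustration.

```php
<?php
// Sketch of the storage step, assuming a per-project MySQL database and a
// 'comments' table; table names, columns and credentials are made up.

$scrapedComments = [
    ['url' => 'https://example.com/product/123', 'text' => 'Great shoe, very comfortable.'],
    ['url' => 'https://example.com/product/123', 'text' => 'Runs a little small.'],
];

$projectDb = new PDO(
    'mysql:host=localhost;dbname=project_123;charset=utf8mb4',
    'scraper_user',
    'secret'
);

$insert = $projectDb->prepare(
    'INSERT INTO comments (product_url, comment_text, scraped_at)
     VALUES (:url, :text, NOW())'
);

foreach ($scrapedComments as $comment) {
    $insert->execute([
        ':url'  => $comment['url'],
        ':text' => $comment['text'],
    ]);
}
```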


Right, what happens next?

“Once all the scraping is done for each project, we use a database mapper, which I have also developed. It is written in the same programming language (PHP) and allows Pansensic colleagues to transfer all the comments that have been scraped and have them auto-mapped into the correct Pansensic database. Pansensic can then get down to the analysis part… which I am not at liberty to divulge.
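In spirit, the auto-mapping might resemble the following PHP sketch, which copies comments from a project database into a central one while renaming columns as it goes. Every database, table and column name here is an assumption, not Pansensic’s real schema.

```php
<?php
// Hypothetical sketch of the auto-mapping idea: copy scraped comments from a
// project database into a central database, renaming columns along the way.
// All database, table and column names are assumptions for illustration.

$source = new PDO('mysql:host=localhost;dbname=project_123', 'user', 'pass');
$target = new PDO('mysql:host=localhost;dbname=pansensic_core', 'user', 'pass');

// Mapping from the scraper's column names to the core schema's column names.
$fieldMap = [
    'comment_text' => 'body',
    'product_url'  => 'source_url',
    'scraped_at'   => 'collected_at',
];

$rows   = $source->query('SELECT comment_text, product_url, scraped_at FROM comments');
$insert = $target->prepare(
    'INSERT INTO comments (body, source_url, collected_at) VALUES (?, ?, ?)'
);

foreach ($rows as $row) {
    $mapped = [];
    foreach ($fieldMap as $from => $to) {
        $mapped[] = $row[$from];   // pick source columns in the target order
    }
    $insert->execute($mapped);
}
```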

“Webpages, websites and the tags within each webpage are, however, constantly changing. When we set up a scrape, we are trying to extract certain tags in order to get certain data, but because these tags keep changing we constantly have to update the scrape and look for different tags. When a website changes, or starts using new technologies, we have to change our scraper to match, so I have to keep adapting to these parameters.
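One common way to cope with changing markup (offered here as an illustration rather than a description of Pansensic’s scraper) is to try a list of XPath selectors in order, so that a site redesign only requires adding or reordering a selector rather than rewriting the scraper:

```php
<?php
// Illustration of selector fallback: try each XPath expression in turn until
// one matches. The selectors themselves are examples, not a real site's tags.

function extractComments(string $html): array
{
    $selectors = [
        '//div[@class="review-text"]',   // current markup
        '//p[@class="comment-body"]',    // older markup, kept as a fallback
    ];

    $doc = new DOMDocument();
    @$doc->loadHTML($html);              // suppress warnings from messy HTML
    $xpath = new DOMXPath($doc);

    foreach ($selectors as $selector) {
        $nodes = $xpath->query($selector);
        if ($nodes->length > 0) {
            $comments = [];
            foreach ($nodes as $node) {
                $comments[] = trim($node->textContent);
            }
            return $comments;
        }
    }

    return [];   // the markup has changed again: time to add a new selector
}
```

Keeping the selectors in data rather than hard-coding them means a markup change becomes a small, quick update.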

“I have also started developing a new system, called Pan Organiser, to make managing the projects alongside the scrapers easier. It uses React (which is built by Facebook), Redux, Node.js, Sass and webpack; all of these technologies are quite new. We also use WebSockets, so as soon as Pansensic colleagues go into Pan Organiser and add a project, or change the date of a project, any of our clients logged in to the external system can see the updates in real time.

“As the scrapers go around the internet downloading information, they automatically upload the stats (the number of comments and products). These are transferred automatically to Pan Organiser, so external clients can see whether any comments have been downloaded for each project.”
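A plausible shape for that hand-off (the endpoint URL and payload fields below are assumptions, not Pansensic’s real API) is a simple HTTP POST of the counts to Pan Organiser after each scrape, which the organiser can then push out to clients over its WebSocket connection:

```php
<?php
// Sketch of the stats hand-off: after each scrape, post the counts to
// Pan Organiser. The endpoint and payload fields are illustrative only.

$stats = [
    'project_id' => 123,
    'products'   => 42,
    'comments'   => 1870,
];

$ch = curl_init('https://panorganiser.example.com/api/projects/123/stats');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => json_encode($stats),
    CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
    CURLOPT_RETURNTRANSFER => true,
]);

$response = curl_exec($ch);
curl_close($ch);
```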

How do you see the system you are building progressing in the future?

“Each time we update our work it becomes faster and easier for both us and our clients; it is always evolving. With the new system, for example, we went down the route of version control, using a service called Bitbucket, so that every time changes are made to Pan Organiser they are recorded and tracked. I can select any files that have changed since they were last updated and commit them with a message describing what has changed. Once I push that commit to Bitbucket, because it is under version control, you can see what changes I have made, where I made them, at what time and how long they took. Others (clients and Pansensic directors) can then go into a chosen file and see exactly what has been changed, added or deleted, and so on.

“On top of the technologies listed previously, we are also using a tool called Vagrant, which means that when you download or clone a repository from Bitbucket, where all the files are kept, you can bring that repository onto another developer’s machine with all the files and all the set-up files in one place. So if another developer were to launch the project, all the frameworks would install automatically, as would all the libraries and code. Once everything is installed, it is initialised and the database is set up, ready for the system. The system then starts up, and the developer can log in directly and start developing without having to work out how to transfer all the files. They would be up and running, ready to develop alongside the team at Pansensic.

“And that’s why we are going down the route of Bitbucket, repositories and version control. It makes it easier to expand and means we can all work together as a team.”


You briefly touched upon it in response to my first question, and as it is an ever-expanding and much-discussed topic at the moment, I have to ask: do you use ML (machine learning) and/or AI (artificial intelligence) in your technologies at Pansensic?

“We are starting to. Nowadays, when you go to sites such as Instagram, it is all about imagery, whereas in the past it was all about the comments. Of course, comments are still abundant, but images are rapidly increasing, especially on social media. On sites such as Instagram, people tend not to leave lots of comments or lengthy text that we can analyse, so as a result we are starting to go down the route of actually downloading the images and then using image recognition to work out what is in a picture.

“Say someone uploads a picture of a green Nike shoe, for example, and leaves a review saying ‘I really love this product’. In the past we would have analysed the text, but on its own it relates to nothing without the image alongside it. Whereas if we download the picture as well, and combine it with machine learning to work out what is in it, the comment becomes much more useful. The machine can identify that the picture contains a shoe, that it is green, and that it carries a brand logo, so it can work out that Nike is in the image. Now that we have a green Nike shoe, combined with the comment, we know that this person loves this particular shoe, which is very useful.”
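As a toy illustration of that combination (the label set and record structure are made up; in a real system the labels would come from an image-recognition model or service), merging the detected objects with the comment might look like this:

```php
<?php
// Illustration of combining image-recognition output with a comment.
// The labels are hard-coded here; a real pipeline would get them from an
// image-recognition model. The record structure is an assumption.

$imageLabels = ['shoe', 'green', 'nike'];       // objects detected in the picture
$comment     = 'I really love this product';    // text left by the reviewer

$record = [
    'comment' => $comment,
    'objects' => $imageLabels,
    'product' => in_array('nike', $imageLabels, true) ? 'Nike shoe' : 'unknown',
    'colour'  => in_array('green', $imageLabels, true) ? 'green' : 'unknown',
];

// Now the vague "this product" in the comment is tied to a specific item.
echo json_encode($record, JSON_PRETTY_PRINT);
```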


Thank you, Dean, for enlightening us as we travel through our Techie-Talks! And we look forward to our next interview with you as we take a closer look into just how amazing Artificial Intelligence and Machine Learning can be!

“My pleasure, see you soon!”


(This interview has been lightly edited for clarity.)
