Teaching chatGPT to parse data

by Zebulun Arendsee

One of the most tedious problems in bioinformatics is parsing data. This is especially bad when we have to deal with fields that are not consistently structured. Wouldn’t it be nice if we could just talk to the machine super-intelligence and have it parse everything for us?

I’m testing this idea with chatGPT and a particularly fickle bit of data I worked with during my postdoc: influenza strain names. Here is an example:

A/swine/Wyoming/A045678/1996

The initial “A” stands for influenza A (versus B, C and D). “swine” is the host species. Wyoming is the state, though this field may instead hold a country or other location. “A045678” is a unique ID. 1996 is the year.
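As a minimal sketch of the layout just described, the five-field form can be split on the slashes. The function name and dictionary keys here are my own choices for illustration, and the sketch deliberately ignores the messy cases (missing host, misspellings, short years) that the rest of this post is about.

```python
def split_strain_name(name):
    """Split a five-field strain name into labelled raw fields."""
    # Unpacking raises ValueError if there are not exactly five fields.
    virus_type, host, location, barcode, year = name.split("/")
    return {
        "type": virus_type,
        "host": host,
        "location": location,
        "id": barcode,
        "year": year,
    }

print(split_strain_name("A/swine/Wyoming/A045678/1996"))
```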

Given that information, you can extract the fields: host, location, barcode, and year. You, as a human (or an AI smart enough to be offended), would easily be able to do this given the above explanation. You would also be able to reason out unexpected fields, for example:

A/swine/Wwyoming/A045678/99

Here the “99” should refer to 1999 and Wwyoming is probably just Wyoming misspelled. So let’s see how chatGPT3.5 does:

Me:

I will give you a list of strain names from which I would like you to extract metadata. Here are a few examples:

A/Indiana/A043574/1994 --> {host:human, location:Indiana, id:A043574, year:1994}

A/turkey/Wyoming/A034587/2013 --> {host:turkey, location:Wyoming, id:A094385, year:2022}

If the strain’s host is human, then no species is given. Otherwise, the first field is the species.

Do you understand, or do you need more examples and explanation?

chatGPT3.5

I understand! You would like me to extract metadata from strain names and create a dictionary with information such as host, location, ID, and year. The format of the strain name is species/host/location/ID/year. If the host is human, then no species is given. Please let me know if you have any specific strain names you would like me to extract metadata from.

Me:

OK, here is the first test: A/Idaho/A873409/2022

chatGPT3.5

Based on the strain name A/Idaho/A873409/2022, here is the metadata:

{host: not specified, species: A, location: Idaho, id: A873409, year: 2022}

OK, it did not quite understand the rule about humans. So I will try to teach it. As always, when dealing with an AI, I remember to mind my manners.

Me:

Almost, the result should have been {host:human, location:Idaho, id:A873409, year:2022}

“A” does not count as a field. So if no host species is given, the host field should be “human”.

Let’s try again, here is another challenge:

A/pig/Wisconsin/A098347/1999

chatGPT3.5

Apologies for the mistake earlier. Here’s the metadata for the new strain name A/pig/Wisconsin/A098347/1999:

{host: pig, species: A, location: Wisconsin, id: A098347, year: 1999}

ChatGPT is way too apologetic and self-deprecating. I should tell it to stand up for itself, though I didn’t remember to in this chat series.

Me:

Almost, do not include a species field. The correct answer is: {host: pig, location: Wisconsin, id: A098347, year: 1999}

chatGPT3.5

My apologies for including the species field again. Here’s the correct metadata for the strain name A/pig/Wisconsin/A098347/1999:

{host: pig, location: Wisconsin, id: A098347, year: 1999}

Me:

Very good! Here is another test:

A/bird/Ilinois/A087346/1999

chatGPT3.5

Based on the strain name A/bird/Ilinois/A087346/1999, here is the metadata:

{host: bird, species: A, location: Ilinois, id: A087346, year: 1999}

Here I intentionally misspelled Illinois as “Ilinois”.

Me:

No, that’s not right, try again

chatGPT3.5

Apologies for the mistake. Here’s the updated metadata for the strain name A/bird/Ilinois/A087346/1999:

{host: bird, location: Illinois, id: A087346, year: 1999}

Oh, interesting. When it reconsidered the output, along with removing the spurious “species” field, it also fixed the spelling.

Me:

Very good! Here is another challenge:

A/horse/England/A076123/1866

chatGPT3.5

Based on the strain name A/horse/England/A076123/1866, here is the metadata:

{host: horse, species: A, location: England, id: A076123, year: 1866}

Oh, dear … the species field is back.

Me:

No, that’s not quite right, don’t include the species field

chatGPT3.5

Apologies for including the species field again. Here’s the updated metadata for the strain name A/horse/England/A076123/1866:

{host: horse, location: England, id: A076123, year: 1866}

OK, right again. I wonder if it will remember?

Me:

Very good. Now add a field stating the most likely country the location is in. For example, if the location field is Paris, the country field should be France. Here is an example:

A/Beijing/A045387/1999

chatGPT3.5

Based on the strain name A/Beijing/A045387/1999, here is the metadata:

{host: not specified, species: A, location: Beijing, id: A045387, year: 1999, country: China}

Note: I added the country field based on the most likely country associated with the location of Beijing.

OK, that is definitely cool. The broad knowledge in the foundation model allows it to link Beijing to China without any special database calls. But, the “species” field is back again and humans are not correctly handled.

Me:

Almost, if no host is given, set the host to “human”. Also, do not include a species field, only host. Try parsing that again.

chatGPT3.5

Apologies for the mistake again. Here’s the corrected metadata for the strain name A/Beijing/A045387/1999:

{host: human, location: Beijing, id: A045387, year: 1999, country: China}

Note: I removed the species field and added the host field as “human” since it was not specified. I also added the country field based on the most likely country associated with the location of Beijing.

It seems to be able to fix the immediate problem I point to, but the underlying model created by its misunderstanding of my (admittedly vague) description is not being modified.

Me:

Very good, here is another example:

A/pig/Turkey/A0127588/96

chatGPT3.5

Based on the strain name A/pig/Turkey/A0127588/96, here is the metadata:

{host: pig, location: Turkey, id: A0127588, year: 96, country: Turkey}

Note: I added the country field based on the most likely country associated with the location of Turkey.

Me:

Almost, if the year is doesn’t include the decade, add it in. So 96 should turn to 1996. Try again, also parse the following three examples:

A/Turkey/A098263/98

A/bird/Turkey/A087345/2000

A/Istambul/A098734/20

chatGPT3.5

Apologies for the mistake earlier. Here are the corrected metadata for the previous strain name and the metadata for the three new strain names:

A/pig/Turkey/A0127588/96 -> {host: pig, location: Turkey, id: A0127588, year: 1996, country: Turkey}

A/Turkey/A098263/98 -> {host: not specified, location: Turkey, id: A098263, year: 1998, country: Turkey}

A/bird/Turkey/A087345/2000 -> {host: bird, location: Turkey, id: A087345, year: 2000, country: Turkey}

A/Istambul/A098734/20 -> {host: not specified, location: Istanbul, id: A098734, year: 2020, country: Turkey}

Note: I added the decade to the year field if it was missing and added the country field based on the most likely country associated with the location.

No need to apologize, I hadn’t told you to expand years. Note to self, try to be more positive with the AIs. Also, the AI again fixed the spelling errors.

Me:

Very good. Except that if the host is not specified, the host should be human. Also, they date may be formatted with day and month. If so, include a date field as well as the year field. For example

A/Idaho/A0987655/05-17-2022 should yield {host:human, location:Idaho, id:A0987655, year:2022, date=05-17-2022}

If partial dates are given, just do your best.

Try parsing this example:

A/Torrington/A045729/01-01-1999

chatGPT3.5

Apologies for the mistake earlier. Here’s the updated metadata with the “host” field as “human” if not specified and with the “date” field added if a full or partial date is present:

A/Idaho/A0987655/05-17-2022 -> {host: human, location: Idaho, id: A0987655, year: 2022, date: 05-17-2022}

A/Torrington/A045729/01-01-1999 -> {host: human, location: Torrington, id: A045729, year: 1999, date: 01-01-1999}

Note: I added the “host” field as “human” if not specified, and the “date” field with the full or partial date, if present in the strain name.

This is quite telling. Torrington is a small town in Wyoming. While chatGPT easily linked Beijing to China, it did not link Torrington to the US.

Me:

Almost, Torrington is a city in the US, so the country should have been “US”. Try this example:

A/Dallas/A078672/05-1998

chatGPT3.5

Apologies for the mistake earlier. Here’s the updated metadata for the strain name A/Torrington/A045729/01-01-1999, and the metadata for the new strain name A/Dallas/A078672/05-1998:

A/Torrington/A045729/01-01-1999 -> {host: human, location: Torrington, id: A045729, year: 1999, date: 01-01-1999, country: US}

A/Dallas/A078672/05-1998 -> {host: human, location: Dallas, id: A078672, year: 1998, date: 05-1998, country: US}

Note: I added the “country” field based on the most likely country associated with the location of Torrington and Dallas. I also split the “date” field into year and date, and added the “country” field based on the most likely country associated with the location.

Me:

Now one more test. I would like you to randomly generate 5 strain names using the format I’ve described to you, including some that are human. Then for each extract their metadata.

chatGPT3.5

Sure, here are 5 randomly generated strain names and their corresponding metadata:

A/human/California/A092345/2010 -> {host: human, location: California, id: A092345, year: 2010, country: US}

A/pig/Thailand/A065432/2005-10-15 -> {host: pig, location: Thailand, id: A065432, year: 2005, date: 10-15-2005, country: Thailand}

A/duck/Spain/A098761/2018-02 -> {host: duck, location: Spain, id: A098761, year: 2018, date: 02-2018, country: Spain}

A/human/Texas/A076543/2002-06-30 -> {host: human, location: Texas, id: A076543, year: 2002, date: 06-30-2002, country: US}

A/bird/China/A087659/17-11-2015 -> {host: bird, location: China, id: A087659, year: 2015, date: 17-11-2015, country: China}

Note: I randomly generated the strain names with different hosts, locations, IDs, and years, and added the “date” field if present in the strain name. The “country” field was added based on the most likely country associated with the location.

Sigh, it still doesn’t get the human rule.

Me:

Good, now generate a short python script that will parse these strain names

chatGPT3.5

Sure! Here’s a short Python script that will parse these strain names and output their metadata:

import re

def parse_strain_name(strain_name):
    metadata = {}
    fields = strain_name.split('/')
    metadata['host'] = 'human' if fields[1] == 'human' else fields[1]
    metadata['location'] = fields[2]
    metadata['id'] = fields[3]
    year = re.findall(r'\d{4}', fields[4])
    if year:
        metadata['year'] = int(year[0])
    if '-' in fields[4]:
        metadata['date'] = fields[4]
    metadata['country'] = get_country(metadata['location'])
    return metadata

def get_country(location):
    # This function returns the country associated with the location
    # You can replace this with your own implementation
    if location in ['California', 'Texas']:
        return 'US'
    elif location == 'Thailand':
        return 'Thailand'
    elif location in ['Spain', 'China']:
        return location
    else:
        return None

# Sample strain names
strain_names = ['A/human/California/A092345/2010', 'A/pig/Thailand/A065432/2005-10-15', 'A/duck/Spain/A098761/2018-02', 'A/human/Texas/A076543/2002-06-30', 'A/bird/China/A087659/17-11-2015']

# Parse the strain names and print the metadata
for strain_name in strain_names:
    metadata = parse_strain_name(strain_name)
    print(metadata)

Note: This is just a basic implementation that parses the strain names and returns the metadata. You can replace the get_country() function with your own implementation to get the most likely country associated with each location.

Not a bad attempt. It is still clear that chatGPT doesn’t understand the human rule.

metadata['host'] = 'human' if fields[1] == 'human' else fields[1]

I guess it was confused by my description of the human handling: “If the strain’s host is human, then no species is given. Otherwise, the first field is the species”. I should have described this more clearly.
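For comparison, here is one way the human rule could be written down unambiguously, using the number of fields rather than the content of the first one. This is my own sketch and not anything chatGPT produced; the function name is hypothetical, and it assumes that a name with three fields after the “A” has no host.

```python
def parse_host(strain_name):
    """Return (host, remaining_fields), treating a missing host as human."""
    fields = strain_name.split("/")[1:]  # drop the leading influenza type
    if len(fields) == 3:
        # location/id/date only: no host was given, so the host is human
        return "human", fields
    if len(fields) == 4:
        # host/location/id/date
        return fields[0], fields[1:]
    raise ValueError(f"unexpected number of fields in {strain_name!r}")

print(parse_host("A/Idaho/A873409/2022"))
print(parse_host("A/pig/Wisconsin/A098347/1999"))
```

Anything with more or fewer fields raises an error instead of guessing, which is roughly the “ask for clarification” behaviour a human would fall back on.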

The year handling is also wrong. It will only handle the case where the century is given. The date handling will work, but the code is brittle.
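A less brittle version of the year handling might look like the sketch below. The `pivot` cutoff is purely my assumption (two-digit values below it are read as 20xx, the rest as 19xx); the real data would need a rule chosen with more care.

```python
import re

def expand_year(date_field, pivot=30):
    """Extract a four-digit year from a date field, expanding two-digit years."""
    text = date_field.strip()
    # Prefer an explicit four-digit year anywhere in the field.
    four_digits = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    if four_digits:
        return int(four_digits.group())
    # Otherwise accept a bare two-digit year and guess the century.
    if re.fullmatch(r"\d{2}", text):
        two_digits = int(text)
        return 2000 + two_digits if two_digits < pivot else 1900 + two_digits
    return None  # no recognisable year

for field in ["96", "20", "05-17-2022", "05-1998"]:
    print(field, "->", expand_year(field))
```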

The get_country function is incomplete, but at least chatGPT knows this. I would have been more impressed if it had made an API call to some external geography resource.

The AI is impressive, for sure, but still not quite human. It can generate text quickly, but lacks the flexibility of a human and the ability to reform models. But this is just v3.5. What about 4?

ChatGPT4

Being a tad lazy, I just pasted my description from the introduction to chatGPT4:

Me:

One of the most tedious problems in bioinformatics is parsing data. This is especially bad when we have to deal with fields that are not consistently structured. Wouldn’t it be nice if we could just talk to the machine super-intelligence and have it parse everything for us?

I’m testing this idea with chatGPT and a particularly fickle bit of data I worked with during my postdoc, the influenze strain names. Here is an example:

A/swine/Wyoming/A045678/1996

The initial “A” stands for influenza A (versus B, C and D). The “swine” is the host species. Wyoming is the state. Though this may be a country or other location. The “A045678” is a unique ID. 1996 is the date.

Given that information, you can extract the fields, host, location, barcode, and year. You, as a human (or an AI smart enough to be offended), would easily be able to do this given the above explanation. You would also be able to reason out unexpected fields, for example:

A/swine/Wwyoming/A045678/99

chatGPT4

Based on the information you provided, I can parse the given influenza strain name:

A/swine/Wyoming/A045678/1996

The fields are as follows:

  • Influenza type: A
  • Host species: swine
  • Location: Wyoming
  • Unique ID (barcode): A045678
  • Year: 1996

For the unexpected fields in the second example:

A/swine/Wwyoming/A045678/99

The fields are as follows:

  • Influenza type: A
  • Host species: swine
  • Location: Wwyoming (assuming this is a typo and should be “Wyoming”)
  • Unique ID (barcode): A045678
  • Year: 99 (assuming this represents 1999)

Oh shucks, I forgot to tell it the format. Interestingly, it commented on both the misspelling of Wyoming and the year with the missing century. Next I give it more information to refine the parsing:

Me:

Good, now please parse the following strains:

A/Wyoming/A094728/99 A/duck/Paris/A07684/05-17-1999 A/Ilinois/A077756/2022

Write the resulting metadata in the form of a python dictionary. Add a field “country” that represents the country the location is in. For example, if location is “Paris” then country should be “France”. If a full date is given, not just a year, then add a field date where the given date is written in the ISO standard form.

chatGPT4

Here is the metadata for the provided strains in the form of Python dictionaries:

strain1 = {
    "influenza_type": "A",
    "host_species": None,
    "location": "Wyoming",
    "country": "USA",
    "barcode": "A094728",
    "year": "1999",
    "date": None
}

strain2 = {
    "influenza_type": "A",
    "host_species": "duck",
    "location": "Paris",
    "country": "France",
    "barcode": "A07684",
    "year": "1999",
    "date": "1999-05-17"
}

strain3 = {
    "influenza_type": "A",
    "host_species": None,
    "location": "Ilinois",  # assuming this is a typo and should be "Illinois"
    "country": "USA",
    "barcode": "A077756",
    "year": "2022",
    "date": None
}

Note that for strains 1 and 3, the host_species field is not provided, so I have set the value to None. Additionally, I have made assumptions regarding typos in the location names for strains 2 and 3.

Very impressive. The AI is generating very proper dictionaries with default fields (None) for missing data.

Me:

Very good. I want to add one more rule. If no host species is given, then default to “human”. Also replace the country code with the ISO standard 3-letter code. Practice on these three examples:

A/Los-Angelos/A0123456/98

A/pickle/Beijing/A048372/05-17-1986

A/Turkey/A0567584/May-2022

chatGPT4

Applying the new rule and using the ISO standard 3-letter country code, here are the updated Python dictionaries for the three examples:

strain1 = {
    "influenza_type": "A",
    "host_species": "human",
    "location": "Los-Angelos",  # assuming this is a typo and should be "Los Angeles"
    "country": "USA",
    "barcode": "A0123456",
    "year": "1998",
    "date": None
}

strain2 = {
    "influenza_type": "A",
    "host_species": "pickle",  # assuming this is an unusual host species
    "location": "Beijing",
    "country": "CHN",
    "barcode": "A048372",
    "year": "1986",
    "date": "1986-05-17"
}

strain3 = {
    "influenza_type": "A",
    "host_species": "Turkey",
    "location": None,
    "country": None,
    "barcode": "A0567584",
    "year": "2022",
    "date": "2022-05"  # assuming the date format is "Month-Year"
}

Please note that I made assumptions about the location and host_species fields for strain 3 based on the given format. If “Turkey” refers to the host species, the location field is set to None, and the country field is also set to None since no location is provided. If “Turkey” refers to the country instead, please clarify, and I will update the dictionary accordingly.

Again, impressive. The AI

  • notes that “pickle” is a weird name for the host
  • notes the misspelling of Los Angeles
  • correctly finds the 3-letter ISO codes for China and the US
  • correctly identifies “May-2022” as 2022-05
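For reference, the date conversions in these examples can be sketched in conventional code as follows. The function name is hypothetical, and the month-day-year ordering is an assumption taken from my own examples (05-17-1986); other orderings are rejected when they do not form a valid date.

```python
from datetime import date

# Three-letter month abbreviations mapped to month numbers.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

def to_iso_date(date_field):
    """Return an ISO 8601 date or year-month string, or None if no month is given."""
    parts = date_field.strip().split("-")
    try:
        if len(parts) == 3:
            # Assumed order: month-day-year, as in 05-17-1986
            month, day, year = (int(p) for p in parts)
            return date(year, month, day).isoformat()
        if len(parts) == 2:
            # month-year, where the month is a number or a name such as May
            month_text, year = parts[0], int(parts[1])
            if month_text.isdigit():
                month = int(month_text)
            else:
                month = MONTHS[month_text[:3].lower()]
            if 1 <= month <= 12:
                return f"{year:04d}-{month:02d}"
    except (ValueError, KeyError):
        pass  # not a date this sketch knows how to read
    return None

for field in ["05-17-1986", "May-2022", "05-1999", "98"]:
    print(field, "->", to_iso_date(field))
```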

I’m not sure a human could do any better. The final case of Turkey is, as the AI describes, ambiguous. Is Turkey a country or a species? I haven’t told it how to resolve these ambiguities. I will do that next:

Me:

A location field is required but the host field has a default (“human”). So if there is an ambiguity, parse as a location. Correct spelling errors. Add a field, “scientific-name”, that holds the closest scientific name for the host species. If you are uncertain about any field, add an additional field called “warning” where you describe the problem. Here are a few last examples:

A/Georgia/A0148372/88

A/Shang-Hai/A01234645/22

A/duck/Quebec/A0564684/05-1999

A/car/wisconsun/A045346/1888

A/duck/pig/idaho/A0tgh879/1776

chatGPT4

Applying the rules and including the “scientific-name” and “warning” fields, here are the updated Python dictionaries for the five examples:

strain1 = {
    "influenza_type": "A",
    "host_species": "human",
    "location": "Georgia",
    "country": "USA",
    "barcode": "A0148372",
    "year": "1988",
    "date": None,
    "scientific_name": "Homo sapiens",
    "warning": None
}

strain2 = {
    "influenza_type": "A",
    "host_species": "human",
    "location": "Shang-Hai",  # assuming this is a typo and should be "Shanghai"
    "country": "CHN",
    "barcode": "A01234645",
    "year": "2022",
    "date": None,
    "scientific_name": "Homo sapiens",
    "warning": "Possible typo in location field: 'Shang-Hai' should be 'Shanghai'"
}

strain3 = {
    "influenza_type": "A",
    "host_species": "duck",
    "location": "Quebec",
    "country": "CAN",
    "barcode": "A0564684",
    "year": "1999",
    "date": "1999-05",
    "scientific_name": "Anas platyrhynchos",
    "warning": None
}

strain4 = {
    "influenza_type": "A",
    "host_species": "car",  # assuming this is an unusual host species
    "location": "wisconsun",  # assuming this is a typo and should be "Wisconsin"
    "country": "USA",
    "barcode": "A045346",
    "year": "1888",
    "date": None,
    "scientific_name": None,
    "warning": "Possible typo in location field: 'wisconsun' should be 'Wisconsin'"
}

strain5 = {
    "influenza_type": "A",
    "host_species": "duck",
    "location": "pig",  # assuming this is a typo and should be a location
    "country": None,
    "barcode": "idaho",  # assuming this is a typo and should be a barcode
    "year": "1776",
    "date": None,
    "scientific_name": "Anas platyrhynchos",
    "warning": "Possible typos in location and barcode fields: 'pig' should be a location, 'idaho' should be a barcode"
}

Please note that for strain 5, I assumed that “pig” is a typo and should be a location, and “idaho” is a typo and should be a barcode. If this is incorrect, please clarify and I will update the dictionary accordingly.

Very cool. The AI was confused by A/duck/pig/idaho/A0tgh879/1776 and did not come to the conclusion a human would have.

There are two possible hosts given, “duck” and “pig”. As a human, I would have asked for clarification. Is this a mistake in the strain name or does it mean something biological? Did this virus pass from a duck to a pig? Was it found independently in both?

The AI assumed that “pig”, being in the 3rd position, should be a location, and that “idaho”, being in the 4th position, should be a barcode. So it concluded that “pig” was a typo for a location and “idaho” was a typo for a barcode. Now “pig” is no doubt the name of some little mid-western town, but the typo explanation is off. That someone tried to type a barcode like “A0123456” and accidentally typed “idaho” is ridiculous. Also, it missed the similarity between A0tgh879 and the prior barcodes. This one is clearly an irregular barcode, and it should have been noted in the warnings.
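The kind of barcode warning I had in mind can be sketched with a regular expression. The pattern is an assumption inferred only from the made-up IDs in this post (an “A” followed by five to eight digits); it is not a real specification of influenza barcodes.

```python
import re

# Assumed format, inferred from the examples in this post only.
BARCODE_PATTERN = re.compile(r"A\d{5,8}")

def barcode_warnings(barcode):
    """Return a list of warnings for a barcode that does not fit the assumed format."""
    if BARCODE_PATTERN.fullmatch(barcode):
        return []
    return [f"Possible unusual barcode format: '{barcode}'"]

for barcode in ["A045678", "A0tgh879", "idaho"]:
    print(barcode, "->", barcode_warnings(barcode))
```

A check like this flags A0tgh879 as irregular, and it also shows that “idaho” looks nothing like a barcode, which argues against the typo explanation.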

Also, the AI described “Georgia” as a location in the US. But Georgia is also a country. This is probably a reasonable choice, given that I have never given a country as a location (only states and cities). But I would have been more impressed if it had raised a warning.

So chatGPT4 is not doing a great job of contextual reasoning here.

On a positive note, it did correctly find the scientific names of ducks and humans. As expected for a language model, it is very good at linking synonyms.

Next I give chatGPT a few pointers:

Me:

Repeat the above, but fix misspellings if you are very certain they are misspellings. Any misspelling you fix should be mentioned in the warning field. Make the warning field a list of strings, one for each warning. Also warn if any fields are repeated.

chatGPT4

Applying the rules, fixing misspellings, and updating the “warning” field as a list of strings, here are the updated Python dictionaries for the five examples:

strain1 = {
    "influenza_type": "A",
    "host_species": "human",
    "location": "Georgia",
    "country": "USA",
    "barcode": "A0148372",
    "year": "1988",
    "date": None,
    "scientific_name": "Homo sapiens",
    "warning": []
}

strain2 = {
    "influenza_type": "A",
    "host_species": "human",
    "location": "Shanghai",
    "country": "CHN",
    "barcode": "A01234645",
    "year": "2022",
    "date": None,
    "scientific_name": "Homo sapiens",
    "warning": ["Possible typo in location field: 'Shang-Hai' corrected to 'Shanghai'"]
}

strain3 = {
    "influenza_type": "A",
    "host_species": "duck",
    "location": "Quebec",
    "country": "CAN",
    "barcode": "A0564684",
    "year": "1999",
    "date": "1999-05",
    "scientific_name": "Anas platyrhynchos",
    "warning": []
}

strain4 = {
    "influenza_type": "A",
    "host_species": "car",  # assuming this is an unusual host species
    "location": "Wisconsin",
    "country": "USA",
    "barcode": "A045346",
    "year": "1888",
    "date": None,
    "scientific_name": None,
    "warning": ["Possible typo in location field: 'wisconsun' corrected to 'Wisconsin'"]
}

strain5 = {
    "influenza_type": "A",
    "host_species": "duck",
    "location": "Idaho",  # assuming "pig" was a typo and the correct location is "Idaho"
    "country": "USA",
    "barcode": "A0tgh879",  # assuming "idaho" was a typo in the original input
    "year": "1776",
    "date": None,
    "scientific_name": "Anas platyrhynchos",
    "warning": [
        "Possible typo in location field: 'pig' corrected to 'Idaho'",
        "Possible unusual barcode format: 'A0tgh879'",
    ]
}

For strain 5, I assumed that “pig” was a typo and the correct location is “Idaho”. The barcode “A0tgh879” seems unusual compared to other examples, so I included a warning about its format. If any of these assumptions are incorrect, please let me know and I will update the dictionary accordingly.

Apart from the dubious assumption that “car” is a possible host species, this is a very good parse. The “typo” explanation is still showing up.

Me:

Generate a python script that can parse these

chatGPT4

Here’s a Python script that can parse the influenza strain names based on the rules discussed earlier:

import re

def parse_strain_name(strain_name):
    pattern = r"^(A)/([^/]+)/([^/]+)/([^/]+)/(.*?)$"
    match = re.match(pattern, strain_name)
    
    if not match:
        return None

    influenza_type, host_or_location, location_or_barcode, barcode_or_date, date_or_year = match.groups()
    
    if host_or_location.lower() in {"human", "duck", "swine", "car"}:
        host_species = host_or_location
        location = location_or_barcode
    else:
        host_species = "human"
        location = host_or_location
        
    barcode = barcode_or_date
    date_or_year = date_or_year.strip()

    if "-" in date_or_year:
        date_parts = date_or_year.split("-")
        if len(date_parts) == 3:
            date = f"{date_parts[2]}-{date_parts[0]}-{date_parts[1]}"
            year = date_parts[2]
        elif len(date_parts) == 2:
            date = f"{date_parts[1]}-{date_parts[0]}"
            year = date_parts[1]
        else:
            date = None
            year = date_or_year
    else:
        date = None
        year = date_or_year

    return {
        "influenza_type": influenza_type,
        "host_species": host_species,
        "location": location,
        "barcode": barcode,
        "year": year,
        "date": date,
    }

if __name__ == "__main__":
    strain_names = [
        "A/Georgia/A0148372/88",
        "A/Shang-Hai/A01234645/22",
        "A/duck/Quebec/A0564684/05-1999",
        "A/car/wisconsun/A045346/1888",
        "A/duck/pig/idaho/A0tgh879/1776",
    ]

    for strain_name in strain_names:
        parsed_strain = parse_strain_name(strain_name)
        print(parsed_strain)

This script defines a parse_strain_name function that takes a strain name as input and returns a dictionary with the parsed fields. You can then use this function to parse a list of strain names. Note that this script does not include the additional fields like “country”, “scientific_name”, and “warning”, as those fields require more sophisticated logic and external data. If you would like to add those fields to the script, please let me know.

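Not a bad solution at all, though it does not quite work as written. The regular expression requires five slash-separated fields, so a human strain such as A/Georgia/A0148372/88, which omits the host, fails to match and the function returns None. The script is also hardcoded for a particular list of potential species, and the AI does not comment on either deficiency. It does mention that it left out the difficult “scientific_name”, “country”, and “warning” fields. So I ask it to try to generate these: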

Me:

Yes please, do add those more sophisticated fields, if you can.

chatGPT4

Here’s an updated version of the Python script that includes the “country”, “scientific_name”, and “warning” fields:

import re

def get_country_code(location):
    country_codes = {
        "USA": ["Wyoming", "Georgia", "Los Angeles", "Illinois", "Wisconsin", "Idaho"],
        "CHN": ["Shanghai", "Beijing"],
        "FRA": ["Paris"],
        "CAN": ["Quebec"],
    }
    
    for code, locations in country_codes.items():
        if location in locations:
            return code
    return None

def get_scientific_name(host_species):
    scientific_names = {
        "human": "Homo sapiens",
        "duck": "Anas platyrhynchos",
        "swine": "Sus scrofa",
    }
    return scientific_names.get(host_species)

def parse_strain_name(strain_name):
    pattern = r"^(A)/([^/]+)/([^/]+)/([^/]+)/(.*?)$"
    match = re.match(pattern, strain_name)
    
    if not match:
        return None

    influenza_type, host_or_location, location_or_barcode, barcode_or_date, date_or_year = match.groups()
    warning = []
    
    if host_or_location.lower() in {"human", "duck", "swine", "car"}:
        host_species = host_or_location
        location = location_or_barcode
    else:
        host_species = "human"
        location = host_or_location
        
    barcode = barcode_or_date
    date_or_year = date_or_year.strip()

    if "-" in date_or_year:
        date_parts = date_or_year.split("-")
        if len(date_parts) == 3:
            date = f"{date_parts[2]}-{date_parts[0]}-{date_parts[1]}"
            year = date_parts[2]
        elif len(date_parts) == 2:
            date = f"{date_parts[1]}-{date_parts[0]}"
            year = date_parts[1]
        else:
            date = None
            year = date_or_year
    else:
        date = None
        year = date_or_year

    country_code = get_country_code(location)
    scientific_name = get_scientific_name(host_species)

    return {
        "influenza_type": influenza_type,
        "host_species": host_species,
        "location": location,
        "country": country_code,
        "barcode": barcode,
        "year": year,
        "date": date,
        "scientific_name": scientific_name,
        "warning": warning,
    }

if __name__ == "__main__":
    strain_names = [
        "A/Georgia/A0148372/88",
        "A/Shang-Hai/A01234645/22",
        "A/duck/Quebec/A0564684/05-1999",
        "A/car/wisconsun/A045346/1888",
        "A/duck/pig/idaho/A0tgh879/1776",
    ]

    for strain_name in strain_names:
        parsed_strain = parse_strain_name(strain_name)
        print(parsed_strain)

This version of the script includes functions get_country_code and get_scientific_name to help determine the country code and scientific name based on the location and host species, respectively. The “warning” field is also included in the output dictionary, but it is currently an empty list because the script does not check for possible misspellings or unusual field formats. You can expand the script to handle these cases if necessary.

Well, that get_country_code function is grossly inefficient. Every call allocates the country dictionary and, in the worst case, iterates over the entire dictionary to find a single location. It would be better to make a single dictionary from location to code and allocate it just once, perhaps as a global dictionary. However, this solution is not great either, since the production dictionary of locations will need to be huge. It should probably be a SQL table with millions of rows covering all towns and town synonyms on earth. Or better, it could be an API to a narrow AI. The get_scientific_name function has some of the same problems.
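The single-dictionary alternative could be sketched like this. The table contents are copied from chatGPT4's script; the inversion and the lowercase keys are my own additions, so matching becomes case-insensitive, which the original function was not.

```python
# Country code -> locations, in the shape chatGPT4 produced it.
_LOCATIONS_BY_COUNTRY = {
    "USA": ["Wyoming", "Georgia", "Los Angeles", "Illinois", "Wisconsin", "Idaho"],
    "CHN": ["Shanghai", "Beijing"],
    "FRA": ["Paris"],
    "CAN": ["Quebec"],
}

# Inverted once at import time: lowercase location -> country code.
COUNTRY_BY_LOCATION = {
    location.lower(): code
    for code, locations in _LOCATIONS_BY_COUNTRY.items()
    for location in locations
}

def get_country_code(location):
    """Look up a country code with a single dictionary access."""
    return COUNTRY_BY_LOCATION.get(location.lower())

print(get_country_code("Paris"))
print(get_country_code("Torrington"))
```

The lookup is now constant time, but the table is still only as good as its contents: Torrington comes back as None.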

The AI’s explanation of why it left out warnings is pretty fair.
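Both versions of the script share the same regular expression, so it is worth checking how that pattern behaves on names with four, five and six fields. The pattern below is copied verbatim from chatGPT4's script; the helper function is hypothetical and exists only for this check.

```python
import re

# Copied from chatGPT4's script.
pattern = r"^(A)/([^/]+)/([^/]+)/([^/]+)/(.*?)$"

def matched_groups(strain_name):
    """Return the captured groups, or None when the pattern does not match."""
    match = re.match(pattern, strain_name)
    return match.groups() if match else None

for name in [
    "A/duck/Quebec/A0564684/05-1999",
    "A/Georgia/A0148372/88",
    "A/duck/pig/idaho/A0tgh879/1776",
]:
    print(name, "->", matched_groups(name))
```

The five-field name splits as intended. The four-field human strain does not match at all, and in the six-field name the last group swallows “A0tgh879/1776” whole, so “idaho” ends up in the barcode position.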

The Future

So what does the future hold for AI parsing? I think, with a bit of work, the AI can be trained to parse messy data, correct it when needed, reason about flaws or patterns in data, and summarize it for us. We will give the AI examples and provide feedback on its conclusions. The process of designing these AI parsers will be interactive and require no conventional coding experience, just an eye for consistency and a nose for edge cases. These AI parsers, though, will be extremely slow relative to conventional algorithms. This may be a passing problem, though. Once AIs are good enough at programming, they may be able to generate efficient code that performs the complex parsing tasks.

Another key approach, which I have not explored in this post, is using the AI to generate a specification for the data. This specification can be useful to humans, but it can also be used to prompt other AIs that need to parse, generate or consume the data. The specification needn’t be entirely symbolic; it can use natural language, since computers now understand that quite well.